Impala 1.2.3 UDF problem

2017-11-12

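Recent Impala releases support UDFs, so I deployed a 1.2.3 cluster in the test environment.
Running a test UDF produced the following error:
java.lang.IllegalArgumentException (thrown to indicate that a method has been passed an illegal or inappropriate argument)
This turned out to be a known bug:
https://issues.cloudera.org/browse/IMPALA-791
Impala 1.2.3 currently doesn't support String as the input and return types. You'll instead have to use Text or BytesWritable.
In other words, in Impala 1.2.3 neither the arguments nor the return value of a UDF can be a Java String. Use org.apache.hadoop.io.Text in place of String.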

API documentation for Text:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/Text.html
The key points:
Constructor:
Text(String string) Construct from a string.
Methods:
String toString() Convert the text back to a string.
void set(String string) Set to contain the contents of a string.
void set(Text other) Copy a text.
void clear() Clear the string to empty.

Trying out the Text class in Eclipse:

 
        package com.hive.myudf;

        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        import org.apache.hadoop.io.Text;

        public class TextTest {
            private static Text schemal = new Text("http://");
            private static Text t = new Text(
                "GET /vips-mobile/router.do?api_key=04e0dd9c76902b1bfc5c7b3bb4b1db92&app_version=1.8.7 HTTP/1.0");
            // Splits a request line into method, path (with query) and protocol.
            private static final Pattern p = Pattern.compile("(.+?) +(.+?) (.+)");

            public static void main(String[] args) {
                // Text has to be converted to String before it can be matched.
                Matcher m = p.matcher(t.toString());
                if (m.matches()) {
                    String tt = schemal.toString() + "test.test.com" + m.group(2);
                    System.out.println(tt);
                    //return m.group(2);
                } else {
                    System.out.println("not match");
                    //return null;
                }
                schemal.clear();
                t.clear();
            }
        }

The test UDF:

 
        package com.hive.myudf;

        import java.net.URL;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        import org.apache.hadoop.hive.ql.exec.UDF;
        import org.apache.hadoop.io.Text;
        import org.apache.log4j.Logger;

        public class UDFNginxParseUrl extends UDF {
            private static final Logger LOG = Logger.getLogger(UDFNginxParseUrl.class);
            // Splits a request line into method, path (with query) and protocol.
            private static final Pattern p1 = Pattern.compile("(.+?) +(.+?) (.+)");

            private Text schemal = new Text("http://");
            private URL url = null;
            private String rt;

            public UDFNginxParseUrl() {
            }

            // Arguments and return value are Text, not String (see IMPALA-791).
            // The literal return values "a", "b" and "rt" are debugging markers.
            public Text evaluate(Text host1, Text urlStr, Text partToExtract) {
                LOG.debug("3args|args1:" + host1 + ",args2:" + urlStr + ",args3:" + partToExtract);
                System.out.println("3 args");
                System.out.println("args1:" + host1 + ",args2:" + urlStr + ",args3:" + partToExtract);
                if (host1 == null || urlStr == null || partToExtract == null) {
                    //return null;
                    return new Text("a");
                }
                url = null; // do not reuse the URL parsed for a previous row
                Matcher m1 = p1.matcher(urlStr.toString());
                if (m1.matches()) {
                    LOG.debug("into match");
                    String realUrl = schemal.toString() + host1.toString() + m1.group(2);
                    Text realUrl1 = new Text(realUrl);
                    System.out.println("URL is " + realUrl1);
                    LOG.debug("realurl:" + realUrl1.toString());
                    try {
                        LOG.debug("into try");
                        url = new URL(realUrl1.toString());
                    } catch (Exception e) {
                        //return null;
                        LOG.debug("into exception");
                        return new Text("b");
                    }
                }
                // Compare as String: Text.equals(String) always returns false.
                if (url != null && "HOST".equals(partToExtract.toString())) {
                    rt = url.getHost();
                    LOG.debug("get host" + rt);
                }
                //return new Text(rt);
                LOG.debug("get what");
                return new Text("rt");
            }
        }
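
To check the parsing logic without a cluster or the Hadoop jars, the same steps can be tried on plain String values. The following is only a sketch: the class name RequestLineDemo and the method extractHost are made up for illustration and are not part of the UDF above, and the request line is a shortened version of the one used in TextTest.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original UDF: same parsing steps on plain Strings.
final class RequestLineDemo {
    // Group 1 = method, group 2 = path with query, group 3 = protocol.
    private static final Pattern REQUEST_LINE = Pattern.compile("(.+?) +(.+?) (.+)");

    // Returns the host of "http://" + host + path, or null when the input does not parse.
    static String extractHost(String host, String requestLine) {
        if (host == null || requestLine == null) {
            return null;
        }
        Matcher m = REQUEST_LINE.matcher(requestLine);
        if (!m.matches()) {
            return null;
        }
        try {
            return new URL("http://" + host + m.group(2)).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String line = "GET /vips-mobile/router.do?app_version=1.8.7 HTTP/1.0";
        System.out.println(extractHost("test.test.com", line));
        System.out.println(extractHost("test.test.com", "garbage"));
    }
}
```

Once this behaves as expected, wrapping the arguments and the result in Text is all that the Impala 1.2.3 version needs on top of it.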

A few things to note:
1. A function is tied to a database.
2. The jar file is stored in HDFS.
3. Functions are cached by the catalog.
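
The first two notes show up directly in the statement that registers the UDF. The sketch below only assembles the statement text, following the documented form CREATE FUNCTION db.name(args) RETURNS type LOCATION 'jar path in HDFS' SYMBOL='class'. The database name udf_test, the function name nginx_parse_url and the jar path are assumptions made for illustration; only the class name comes from the code above. Text arguments are declared as STRING on the SQL side.

```java
// Hypothetical helper: builds the statement text only, it does not connect to Impala.
final class CreateFunctionSketch {
    static String createFunction(String db, String name, String argTypes,
                                 String returnType, String hdfsJarPath, String className) {
        // The function is created inside a database, and the jar is referenced by its HDFS path.
        return "CREATE FUNCTION " + db + "." + name + "(" + argTypes + ")"
                + " RETURNS " + returnType
                + " LOCATION '" + hdfsJarPath + "'"
                + " SYMBOL='" + className + "'";
    }

    public static void main(String[] args) {
        // Database name, function name and jar path are made up for this example.
        System.out.println(createFunction(
                "udf_test", "nginx_parse_url", "STRING, STRING, STRING", "STRING",
                "/user/impala/udfs/myudf.jar", "com.hive.myudf.UDFNginxParseUrl"));
    }
}
```

Because the function belongs to the database it was created in, queries issued from another database have to call it with the database prefix.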

This article is reposted from 菜菜光's 51CTO blog. Original link: http://blog.51cto.com/caiguangguang/1359312. To repost it, please contact the original author.


Source: https://yq.aliyun.com/articles/434663
