Hive Dynamic Partitioning
Hive provides a dynamic partitioning feature: it infers partition names from the values of the partition columns at the end of the SELECT list and creates the partitions automatically.
1. Configuration Parameters
1.1 Main Parameters
The following parameters control Hive dynamic partitioning:
set hive.exec.dynamic.partition=true;           -- Enables dynamic partitioning; the default is false.
set hive.exec.dynamic.partition.mode=nonstrict; -- Allows all partition columns to be dynamic; the default mode, strict, requires at least one static partition column.
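For contrast, here is a minimal sketch of the difference between the two modes, using the user_3g tables created in section 2 below (the exact error text may vary by Hive version):

-- Under the default strict mode, a fully dynamic insert is rejected with a
-- SemanticException stating that strict mode requires at least one static
-- partition column:
--   insert into table user_3g_area partition(city) select id, name, city from user_3g;
-- Supplying a static partition value still works in strict mode:
insert into table user_3g_area partition(city='shu')
select id, name from user_3g where city = 'shu';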
1.2 Tuning Parameters
Tuning parameters related to dynamic partitioning:
set hive.exec.max.dynamic.partitions.pernode=100; -- Maximum number of dynamic partitions each mapper or reducer may create; the default is 100, and it is usually safe to raise it, e.g. to 1000. Exceeding the limit raises an error.
set hive.exec.max.dynamic.partitions=1000;        -- Maximum number of dynamic partitions a single statement may create (1000 is the default); exceeding it raises an error.
set hive.exec.max.created.files=10000;            -- Maximum total number of files that may be created (10000 is the default); exceeding it raises an error.
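As a hedged sketch, a session preparing for a high-cardinality dynamic insert typically raises these limits up front; the values below are illustrative, not recommendations:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000;  -- illustrative
set hive.exec.max.dynamic.partitions=5000;          -- illustrative
set hive.exec.max.created.files=100000;             -- illustrative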
2. Basic Dynamic Partitioning
2.1 Prepare the Data
The following steps create the source table and show its contents.
hive> create table user_3g (id int, name string, city string) row format delimited fields terminated by '\t';
OK
Time taken: 0.259 seconds
hive> load data local inpath '/root/book/3guser.txt' overwrite into table user_3g;
Loading data to table test.user_3g
Table test.user_3g stats: [numFiles=1, numRows=0, totalSize=127, rawDataSize=0]
OK
Time taken: 1.379 seconds
hive> select * from user_3g;
OK
1	刘备	shu
2	关羽	shu
3	张飞	shu
4	曹操	wei
5	曹丕	wei
6	孙权	wu
7	孙策	wu
8	孙尚香	wu
9	夏侯惇	wei
Time taken: 0.354 seconds, Fetched: 9 row(s)
2.2 Create the Partitioned Tables
Statement to create the managed (internal) partitioned table:
hive> create table user_3g_area(id int, name string) partitioned by(city string) row format delimited fields terminated by '\t';
OK
Time taken: 0.081 seconds
Statements to create the external partitioned table follow.
I already have a /test directory here; if you don't, add the -p flag to hadoop fs -mkdir so that parent directories are created.
# To create the external table, first create a data directory
[root@hadoop ~]# hadoop fs -mkdir /test/user_xy_area_out/
# Then create the table
hive> create external table user_3g_area_out (id int, name string) partitioned by (city string) row format delimited fields terminated by '\t' location '/test/user_xy_area_out';
OK
Time taken: 0.336 seconds
2.3 Set the Dynamic Partition Parameters
Only the main dynamic partition parameters are set here; configure the tuning parameters as needed.
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
2.4 Load the Data
Loading data into the managed partitioned table:
hive> insert into table user_3g_area partition(city) select id, name, city from user_3g;
Query ID = root_20200614162430_25d9fa2d-7811-484a-bb52-efa9dae1fa72
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592121918794_0001, Tracking URL = http://hadoop:8088/proxy/application_1592121918794_0001/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job -kill job_1592121918794_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-14 16:24:44,429 Stage-1 map = 0%, reduce = 0%
2020-06-14 16:24:52,638 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.19 sec
MapReduce Total cumulative CPU time: 1 seconds 190 msec
Ended Job = job_1592121918794_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_3g_area/.hive-staging_hive_2020-06-14_16-24-30_674_3273172977143207137-1/-ext-10000
Loading data to table test.user_3g_area partition (city=null)
Time taken for load dynamic partitions : 451
Loading partition {city=wu}
Loading partition {city=shu}
Loading partition {city=wei}
Time taken for adding to write entity : 7
Partition test.user_3g_area{city=shu} stats: [numFiles=1, numRows=3, totalSize=27, rawDataSize=24]
Partition test.user_3g_area{city=wei} stats: [numFiles=1, numRows=3, totalSize=30, rawDataSize=27]
Partition test.user_3g_area{city=wu} stats: [numFiles=1, numRows=3, totalSize=30, rawDataSize=27]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.19 sec   HDFS Read: 3768 HDFS Write: 258 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 190 msec
OK
Time taken: 25.395 seconds
Loading data into the external partitioned table:
hive> insert into table user_3g_area_out partition(city) select id,name,city from user_3g;
Query ID = root_20200615232438_f6a9d195-c416-4d5e-bd31-b96fb9c0133e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0001, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0001/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job -kill job_1592152196890_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-15 23:24:53,191 Stage-1 map = 0%, reduce = 0%
2020-06-15 23:25:02,148 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.19 sec
MapReduce Total cumulative CPU time: 1 seconds 190 msec
Ended Job = job_1592152196890_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/test/user_xy_area_out/.hive-staging_hive_2020-06-15_23-24-38_208_8331054339743666004-1/-ext-10000
Loading data to table test.user_3g_area_out partition (city=null)
Time taken for load dynamic partitions : 466
Loading partition {city=wu}
Loading partition {city=wei}
Loading partition {city=shu}
Time taken for adding to write entity : 8
Partition test.user_3g_area_out{city=shu} stats: [numFiles=1, numRows=3, totalSize=27, rawDataSize=24]
Partition test.user_3g_area_out{city=wei} stats: [numFiles=1, numRows=3, totalSize=30, rawDataSize=27]
Partition test.user_3g_area_out{city=wu} stats: [numFiles=1, numRows=3, totalSize=30, rawDataSize=27]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.19 sec   HDFS Read: 3718 HDFS Write: 270 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 190 msec
OK
Time taken: 26.51 seconds
2.5 Inspect the Partitions
Partitions of the managed table:
hive> show partitions user_3g_area;
OK
city=shu
city=wei
city=wu
Time taken: 0.6 seconds, Fetched: 3 row(s)
Partitions of the external table:
# View the partitions
hive> show partitions user_3g_area_out;
OK
city=shu
city=wei
city=wu
Time taken: 0.138 seconds, Fetched: 3 row(s)
# View the data directory
[root@hadoop ~]# hadoop fs -ls /test/user_xy_area_out/
20/06/15 23:29:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
drwxr-xr-x   - root supergroup          0 2020-06-15 23:25 /test/user_xy_area_out/city=shu
drwxr-xr-x   - root supergroup          0 2020-06-15 23:25 /test/user_xy_area_out/city=wei
drwxr-xr-x   - root supergroup          0 2020-06-15 23:25 /test/user_xy_area_out/city=wu
As the demonstration above shows, the dynamic partitioning procedure is almost identical for managed and external tables; the only difference is where the data is stored.
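One quick way to confirm this is describe formatted, whose Location field shows where each table keeps its data; the managed table sits under the warehouse directory seen in the job log above, while the external table uses the directory given at creation:

# Managed table: Location should be hdfs://hadoop:9000/user/hive/warehouse/test.db/user_3g_area
hive> describe formatted user_3g_area;
# External table: Location should be hdfs://hadoop:9000/test/user_xy_area_out
hive> describe formatted user_3g_area_out;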
This is called the basic case because the partition values here are ASCII; next we implement dynamic partitions whose values contain Chinese characters.
Note: When loading data with insert ... select, the number of selected columns must exactly match the number of target columns; selecting more or fewer raises an error. A type mismatch, however, does not raise an error: the mismatched values are filled with NULL. When loading with load data, no such check is performed: extra fields are discarded, missing fields are filled with NULL, and type mismatches are likewise filled with NULL.
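A minimal sketch of the insert ... select rules, reusing the tables above (the second statement is expected to run but write NULL for the mismatched column):

-- Fails: only two columns selected for three target columns (id, name, plus the city partition):
--   insert into table user_3g_area partition(city) select id, name from user_3g;
-- Runs: the counts match, but name (string) cannot be converted to id (int),
-- so the id column is filled with NULL:
insert into table user_3g_area partition(city) select name, id, city from user_3g;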
3. Dynamic Partitioning with Chinese Partition Values
The steps for dynamic partitioning with Chinese partition values are the same as for ASCII; the differences lie in character-set support in Hive and MySQL, and in the resulting garbled-text (mojibake) problems.
3.1 MySQL
MySQL versions before 5.7 cannot index the long character columns used by the Hive metastore, so on those versions the metastore database defaults to the latin1 character set, and metadata cannot contain Chinese.
If you forcibly change the character set of the metastore database, or of certain table columns in it, on a pre-5.7 MySQL, you get the following error:
Specified key was too long; max key length is 767 bytes
So the character set cannot simply be changed because of this key-length limit. An alternative is to shorten the affected columns first and then change the character set; that method is not covered in detail here, but see the sketch below.
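For reference only, a hedged sketch of that workaround on one affected metastore table. PARTITIONS.PART_NAME is a real metastore column; shortening it to varchar(255) is an assumption chosen so that 255 characters at 3 bytes each stays within the 767-byte key limit, and it risks truncating long partition names. Back up the metastore before trying anything like this:

-- Shorten the indexed column first, then convert the character set (sketch only)
alter table PARTITIONS modify PART_NAME varchar(255);
alter table PARTITIONS convert to character set utf8;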
To store Chinese in Hive's metadata, first make sure the MySQL server is version 5.7 or later, because MySQL 5.7 raised the maximum supported index key length enough to cover Hive's needs.
If a pre-5.7 MySQL is already installed and its data must be preserved, upgrade the database to 5.7; for the procedure, see the companion post on upgrading MySQL 5.6.29 to MySQL 5.7.29.
For a fresh environment, simply install 5.7 or later; for installing Hive, see the HIVE introduction post.
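Before changing anything, it is worth confirming the server version and the current default from the MySQL client:

mysql> select version();                            -- should report 5.7.x or later
mysql> show variables like 'character_set_server';  -- current server default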
The following commands change the character sets used by the Hive metastore database (here named hive):
# Change the database encoding
alter database hive default character set utf8;
# Change each table's encoding
alter table BUCKETING_COLS default character set utf8;
alter table CDS default character set utf8;
alter table COLUMNS_V2 default character set utf8;
alter table DATABASE_PARAMS default character set utf8;
alter table DBS default character set utf8;
alter table FUNCS default character set utf8;
alter table FUNC_RU default character set utf8;
alter table GLOBAL_PRIVS default character set utf8;
alter table PARTITIONS default character set utf8;
alter table PARTITION_KEYS default character set utf8;
alter table PARTITION_KEY_VALS default character set utf8;
alter table PARTITION_PARAMS default character set utf8;
alter table PART_COL_STATS default character set utf8;
alter table ROLES default character set utf8;
alter table SDS default character set utf8;
alter table SD_PARAMS default character set utf8;
alter table SEQUENCE_TABLE default character set utf8;
alter table SERDES default character set utf8;
alter table SERDE_PARAMS default character set utf8;
alter table SKEWED_COL_NAMES default character set utf8;
alter table SKEWED_COL_VALUE_LOC_MAP default character set utf8;
alter table SKEWED_STRING_LIST default character set utf8;
alter table SKEWED_STRING_LIST_VALUES default character set utf8;
alter table SKEWED_VALUES default character set utf8;
alter table SORT_COLS default character set utf8;
alter table TABLE_PARAMS default character set utf8;
alter table TAB_COL_STATS default character set utf8;
alter table TBLS default character set utf8;
alter table VERSION default character set utf8;
# Convert the column encodings
alter table BUCKETING_COLS convert to character set utf8;
alter table CDS convert to character set utf8;
alter table COLUMNS_V2 convert to character set utf8;
alter table DATABASE_PARAMS convert to character set utf8;
alter table DBS convert to character set utf8;
alter table FUNCS convert to character set utf8;
alter table FUNC_RU convert to character set utf8;
alter table GLOBAL_PRIVS convert to character set utf8;
alter table PARTITIONS convert to character set utf8;
alter table PARTITION_KEYS convert to character set utf8;
alter table PARTITION_KEY_VALS convert to character set utf8;
alter table PARTITION_PARAMS convert to character set utf8;
alter table PART_COL_STATS convert to character set utf8;
alter table ROLES convert to character set utf8;
alter table SDS convert to character set utf8;
alter table SD_PARAMS convert to character set utf8;
alter table SEQUENCE_TABLE convert to character set utf8;
alter table SERDES convert to character set utf8;
alter table SERDE_PARAMS convert to character set utf8;
alter table SKEWED_COL_NAMES convert to character set utf8;
alter table SKEWED_COL_VALUE_LOC_MAP convert to character set utf8;
alter table SKEWED_STRING_LIST convert to character set utf8;
alter table SKEWED_STRING_LIST_VALUES convert to character set utf8;
alter table SKEWED_VALUES convert to character set utf8;
alter table SORT_COLS convert to character set utf8;
alter table TABLE_PARAMS convert to character set utf8;
alter table TAB_COL_STATS convert to character set utf8;
alter table TBLS convert to character set utf8;
alter table VERSION convert to character set utf8;
# Change the session/server variables; run these if the database defaults are not UTF8.
SET character_set_client = utf8;
SET character_set_connection = utf8;
SET character_set_database = utf8;
SET character_set_results = utf8;
SET character_set_server = utf8;
# The collation variables take a collation name, not a bare character set
SET collation_connection = utf8_general_ci;
SET collation_database = utf8_general_ci;
SET collation_server = utf8_general_ci;
SET NAMES 'utf8';
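After running the statements above, the result can be verified with standard MySQL commands:

mysql> use hive;
mysql> show create database hive;              -- should now report utf8
mysql> show variables like 'character_set_%';
mysql> show create table PARTITIONS\G          -- per-table and per-column charsets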
3.2 Implementation
With all of the above in place, proceed with the steps below. (The partitioned table user_xy_area is created the same way as user_3g_area in section 2.2.)
# Prepare the data
hive> create table user_xy(id int, name string, city string) row format delimited fields terminated by '\t';
OK
Time taken: 0.102 seconds
hive> load data local inpath '/root/book/user.txt' overwrite into table user_xy;
Loading data to table test.user_xy
Table test.user_xy stats: [numFiles=1, numRows=0, totalSize=232, rawDataSize=0]
OK
Time taken: 0.33 seconds
hive> select * from user_xy;
OK
1	孙悟空	花果山
2	猪八戒	高老庄
3	沙和尚	流沙河
4	唐玄奘	西安
5	小白龙	清潭涧
6	唐玄宗	西安
7	观世音	南海
8	玉皇大帝	凌霄宝殿
9	黄风怪	小雷音寺
10	如来佛祖	雷音寺
Time taken: 0.094 seconds, Fetched: 10 row(s)
# Set the parameters
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
# Load the data via dynamic partitioning
hive> insert into table user_xy_area partition(city) select id,name,city from user_xy;
Query ID = root_20200614163041_ff9e15f6-ce41-4be2-bb86-ea569cab795e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592121918794_0003, Tracking URL = http://hadoop:8088/proxy/application_1592121918794_0003/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job -kill job_1592121918794_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-14 16:30:49,649 Stage-1 map = 0%, reduce = 0%
2020-06-14 16:30:57,202 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.26 sec
MapReduce Total cumulative CPU time: 1 seconds 260 msec
Ended Job = job_1592121918794_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_xy_area/.hive-staging_hive_2020-06-14_16-30-41_545_7128885006398250032-1/-ext-10000
Loading data to table test.user_xy_area partition (city=null)
Time taken for load dynamic partitions : 793
Loading partition {city=花果山}
Loading partition {city=凌霄宝殿}
Loading partition {city=雷音寺}
Loading partition {city=南海}
Loading partition {city=清潭涧}
Loading partition {city=高老庄}
Loading partition {city=流沙河}
Loading partition {city=西安}
Loading partition {city=小雷音寺}
Time taken for adding to write entity : 1
Partition test.user_xy_area{city=凌霄宝殿} stats: [numFiles=1, numRows=1, totalSize=15, rawDataSize=14]
Partition test.user_xy_area{city=南海} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area{city=小雷音寺} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area{city=流沙河} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area{city=清潭涧} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area{city=花果山} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area{city=西安} stats: [numFiles=1, numRows=2, totalSize=24, rawDataSize=22]
Partition test.user_xy_area{city=雷音寺} stats: [numFiles=1, numRows=1, totalSize=16, rawDataSize=15]
Partition test.user_xy_area{city=高老庄} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.26 sec   HDFS Read: 4053 HDFS Write: 632 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 260 msec
OK
Time taken: 19.53 seconds
# Check the data: it looks fine
hive> select * from user_xy_area;
OK
8	玉皇大帝	凌霄宝殿
7	观世音	南海
9	黄风怪	小雷音寺
3	沙和尚	流沙河
5	小白龙	清潭涧
1	孙悟空	花果山
4	唐玄奘	西安
6	唐玄宗	西安
10	如来佛祖	雷音寺
2	猪八戒	高老庄
Time taken: 0.242 seconds, Fetched: 10 row(s)
3.3 Garbled Partition Names
View the partitions:
hive> show partitions user_xy_area;
OK
city=� ��
city=Ww
city=���
city=A��
city=m�
city=��q
city�
city=���
city=� �
Time taken: 0.135 seconds, Fetched: 9 row(s)
Garbled. Maddening, right? One pitfall after another!
So let's modify the configuration file hive-site.xml:
[root@hadoop conf]# vim hive-site.xml
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop:3306/hive?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
    </property>
</configuration>
What was added to the configuration is this fragment:
&amp;useUnicode=true&amp;characterEncoding=UTF-8
Mind the separators: inside an XML file, the & character must be escaped as &amp;.
If it is not escaped, the following error appears:
[root@hadoop ~]# hive
[Fatal Error] hive-site.xml:21:105: The reference to entity "characterEncoding" must end with the ';' delimiter.
With everything done and the services restarted, the result is:
hive> show partitions user_xy_area;
OK
city=� ��
city=Ww
city=���
city=A��
city=m�
city=��q
city�
city=���
city=� �
Time taken: 0.135 seconds, Fetched: 9 row(s)
Still garbled. I have not managed to solve this one yet; if anyone has a solution, please share it. Thanks!