Hive query error: java.io.IOException:org.apache.parquet.io.ParquetDecodingExce...

2018-05-20

My original post: https://dongkelun.com/2018/05/20/hiveQueryException/

Preface

This post resolves the Hive query exception named in the title. The full exception message is:

Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file hdfs://192.168.44.128:8888/user/hive/warehouse/test.db/test/part-00000-9596e4bd-f511-4f76-9030-33e426d0369c-c000.snappy.parquet

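The exception appeared after I used Spark SQL to read a table from Oracle (I have not checked whether MySQL has the same problem; you can test that yourself), wrote the data to a Hive table, and then ran a query from the Hive command line. Let's first walk through how to reproduce it.

1. Create the database and tables

1.1 Create the Hive test database

Run the following statement in Hive: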

create database test;

1.2 Create the Oracle test table

CREATE TABLE TEST
(   "ID" VARCHAR2(100), 
    "NUM" NUMBER(10,2)
)

1.3 Insert a record into the Oracle table

INSERT INTO TEST (ID, NUM) VALUES('1', 1);

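2. Spark SQL code

Running the following code imports the test table created in Oracle into Hive; the Hive table is created automatically. For details on connecting Spark to Hive and to relational databases, see my two other posts: "Connecting Spark to Hive (spark-shell and Eclipse)" and "Connecting Spark SQL to MySQL".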

package com.dkl.leanring.spark.sql

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode

object Oracle2HiveTest {
  def main(args: Array[String]): Unit = {

    // Initialize Spark
    val spark = SparkSession
      .builder()
      .appName("Oracle2HiveTest")
      .master("local")
      //      .config("spark.sql.parquet.writeLegacyFormat", true)
      .enableHiveSupport()
      .getOrCreate()

    // Name of the test table created above
    val tableName = "test"

    // Read the table from the Oracle database over JDBC
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@192.168.44.128:1521:orcl")
      .option("dbtable", tableName)
      .option("user", "bigdata")
      .option("password", "bigdata")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .load()
    // Import Spark's sql function for convenience
    import spark.sql
    // Switch to the test database
    sql("use test")
    // Save the data in df to a Hive table (the table is created automatically)
    df.write.mode(SaveMode.Overwrite).saveAsTable(tableName)
    // Stop Spark
    spark.stop()
  }
}

3. Query in Hive

hive

use test;
select * from test;

At this point the query fails with the exception described in the title.

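4. Solution

Uncomment the .config("spark.sql.parquet.writeLegacyFormat", true) line in the Spark code from section 2 and run the job again; this resolves the exception. The setting defaults to false. When it is set to true, Spark writes Parquet data using the same convention as Hive.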
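When the session is created elsewhere and the builder cannot be edited, the same property can be set on an existing session before writing. The following is a minimal sketch under that assumption; the object name `LegacyParquetWrite` and the helper `saveForHive` are hypothetical and do not appear in the original code.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object LegacyParquetWrite {
  // Hypothetical helper for illustration: enables the legacy Parquet
  // convention on the given session, then overwrites the target table.
  def saveForHive(spark: SparkSession, df: DataFrame, tableName: String): Unit = {
    // Session-level SQL setting; it applies to Parquet files written
    // after this call, not to files that already exist.
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
    df.write.mode(SaveMode.Overwrite).saveAsTable(tableName)
  }
}
```

In the job from section 2, this would replace the `df.write` line with `LegacyParquetWrite.saveForHive(spark, df, tableName)`. The property can also be passed at submit time with `--conf spark.sql.parquet.writeLegacyFormat=true`. Either way, files written earlier with the default setting have to be rewritten.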

5. Cause

The root cause is that Hive and Spark use different Parquet conventions. See the last answer at https://stackoverflow.com/questions/37829334/parquet-io-parquetdecodingexception-can-not-read-value-at-0-in-block-1-in-file (the page may load slowly). The answer is quoted in its original English below.

Root Cause:
This issue is caused because of different parquet conventions used in Hive and Spark. In Hive, the decimal datatype is represented as fixed bytes (INT 32). In Spark 1.4 or later the default convention is to use the Standard Parquet representation for decimal data type. As per the Standard Parquet representation based on the precision of the column datatype, the underlying representation changes.
e.g. DECIMAL can be used to annotate the following types:
int32: for 1 <= precision <= 9
int64: for 1 <= precision <= 18; precision < 10 will produce a warning

Hence this issue happens only with the usage of datatypes which have different representations in the different Parquet conventions. If the datatype is DECIMAL (10,3), both the conventions represent it as INT32, hence we won't face an issue. If you are not aware of the internal representation of the datatypes it is safe to use the same convention used for writing while reading. With Hive, you do not have the flexibility to choose the Parquet convention. But with Spark, you do.

Solution: The convention used by Spark to write Parquet data is configurable. This is determined by the property spark.sql.parquet.writeLegacyFormat The default value is false. If set to "true", Spark will use the same convention as Hive for writing the Parquet data. This will help to solve the issue.
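To make the precision ranges concrete, here is a small sketch of how the physical type of a decimal column depends on its precision under each convention. It reflects my understanding of Spark's Parquet schema conversion and is not Spark's actual code: the object `ParquetDecimalLayout` and its members are hypothetical names, and the assumption is that the legacy convention always uses a fixed-length byte array while the standard one uses INT32 up to precision 9, INT64 up to precision 18, and a fixed-length byte array above that.

```scala
// Hypothetical model for illustration; not part of Spark's or Parquet's API.
object ParquetDecimalLayout {
  sealed trait PhysicalType
  case object Int32 extends PhysicalType
  case object Int64 extends PhysicalType
  case object FixedLenByteArray extends PhysicalType

  // Physical type assumed for a decimal column of the given precision.
  def physicalType(precision: Int, writeLegacyFormat: Boolean): PhysicalType = {
    require(precision >= 1 && precision <= 38, "precision must be in 1..38")
    if (writeLegacyFormat) FixedLenByteArray // legacy (Hive-style) convention
    else if (precision <= 9) Int32           // standard convention, small decimals
    else if (precision <= 18) Int64          // standard convention, medium decimals
    else FixedLenByteArray                   // too wide for a 64-bit integer
  }

  // True when the two conventions store the column differently.
  def layoutsDiffer(precision: Int): Boolean =
    physicalType(precision, writeLegacyFormat = false) !=
      physicalType(precision, writeLegacyFormat = true)
}
```

Under these assumptions `layoutsDiffer(10)` is true, which matches the failure seen with NUMBER(10,2) in this post, while a precision above 18 is stored the same way under both conventions.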

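6. Note

The precision (10,2) in NUMBER(10,2) in the CREATE TABLE statement from 1.2 must be specified. If the column is declared as plain NUMBER, the exception does not occur. Whether other precisions trigger the problem is left for you to test.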

Source link: https://yq.aliyun.com/articles/676202
