Hive Transactions and Lock Management

2016-02-27

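Excerpted from the Hive technical documentation, Lock Manager section: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-LockManager

A few parameters need to be changed so that Hive supports transactions and concurrency.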

Lock Manager

A new lock manager, the DbLockManager, has also been added to Hive. This lock manager stores all lock information in the metastore. In addition, all transactions are stored in the metastore. This means that transactions and locks are durable in the face of server failure. To avoid clients dying and leaving transactions or locks dangling, a heartbeat is sent from lock holders and transaction initiators to the metastore on a regular basis. If a heartbeat is not received in the configured amount of time, the lock or transaction will be aborted.
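The heartbeat rule described above can be pictured with a small sketch. This is not Hive's implementation; the function name and the dictionary of holders are hypothetical, and the 300-second default simply mirrors the documented default of hive.txn.timeout.

```python
def expired_holders(last_heartbeat, now, timeout=300.0):
    """Return the IDs whose last heartbeat is older than `timeout` seconds.

    `last_heartbeat` maps a (hypothetical) lock or transaction ID to the
    time, in seconds, at which its most recent heartbeat arrived.
    """
    return sorted(
        holder for holder, seen in last_heartbeat.items()
        if now - seen > timeout
    )


# txn-1 last reported 400 s ago, txn-2 only 150 s ago.
heartbeats = {"txn-1": 0.0, "txn-2": 250.0}
print(expired_holders(heartbeats, now=400.0))
```

In this toy model only txn-1 would be aborted, because its silence exceeds the timeout while txn-2 is still within it.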

As of Hive 1.3.0, the length of time that the DbLockManager will continue to try to acquire locks can be controlled via hive.lock.numretries and hive.lock.sleep.between.retries. When the DbLockManager cannot acquire a lock (due to the existence of a competing lock), it will back off and try again after a certain time period. In order to support short-running queries without overwhelming the metastore, the DbLockManager doubles the wait time after each retry. The initial back-off time is 100ms and is capped by hive.lock.sleep.between.retries. hive.lock.numretries is the total number of times it will retry a given lock request. Thus the total time that the call to acquire locks will block (given default values of 100 retries and 60s sleep time) is (100ms + 200ms + 400ms + ... + 51200ms + 60s + 60s + ... + 60s) = 91m:42s:300ms. The ten doubling waits from 100ms to 51200ms add up to 102.3s, and the remaining 90 waits of 60s add 5400s.
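The back-off arithmetic can be checked with a few lines of Python. This is a sketch of the schedule as described in the text, not of Hive's code; the function names are made up for illustration, and the inputs assume hive.lock.numretries is 100 and hive.lock.sleep.between.retries is 60s, which are the values that reproduce the quoted total.

```python
def total_lock_wait_ms(num_retries=100, max_sleep_ms=60_000, initial_ms=100):
    """Sum the waits of a doubling back-off that is capped at max_sleep_ms."""
    total = 0
    sleep = initial_ms
    for _ in range(num_retries):
        total += sleep
        # Double the wait for the next retry, but never exceed the cap.
        sleep = min(sleep * 2, max_sleep_ms)
    return total


def format_ms(ms):
    """Render milliseconds in the m:s:ms style used in the text."""
    minutes, rest = divmod(ms, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{minutes}m:{seconds}s:{millis}ms"


print(format_ms(total_lock_wait_ms()))  # 91m:42s:300ms
```

Lowering num_retries to 10 gives 102300 ms, which is only the doubling phase of the schedule.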

More details on locks used by this Lock Manager.

Configuration

These configuration parameters must be set appropriately to turn on transaction support in Hive:

hive.support.concurrency – true
hive.enforce.bucketing – true (Not required as of Hive 2.0)
hive.exec.dynamic.partition.mode – nonstrict
hive.txn.manager – org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on – true (for exactly one instance of the Thrift metastore service)
hive.compactor.worker.threads – a positive number on at least one instance of the Thrift metastore service
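As a rough illustration, the settings listed above could be rendered into the property layout that hive-site.xml uses. The helper name render_hive_site is hypothetical, and the worker thread count of 1 is an arbitrary choice standing in for "a positive number".

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Values taken from the list above.
TXN_SETTINGS = {
    "hive.support.concurrency": "true",
    "hive.enforce.bucketing": "true",  # not required as of Hive 2.0
    "hive.exec.dynamic.partition.mode": "nonstrict",
    "hive.txn.manager": "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager",
    "hive.compactor.initiator.on": "true",  # on exactly one metastore instance
    "hive.compactor.worker.threads": "1",   # assumption: any positive number works
}


def render_hive_site(settings):
    """Build a <configuration> document with one <property> per setting."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    raw = ET.tostring(root, encoding="unicode")
    return minidom.parseString(raw).toprettyxml(indent="  ")


print(render_hive_site(TXN_SETTINGS))
```

The two compactor settings belong on the metastore side, so in a multi-instance deployment they would not be copied unchanged to every metastore.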

The following sections list all of the configuration parameters that affect Hive transactions and compaction. Also see Limitations above and Table Properties below.

New Configuration Parameters for Transactions

A number of new configuration parameters have been added to the system to support transactions.

Configuration key	Values	Location	Notes
hive.txn.manager	Default: org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager; value required for transactions: org.apache.hadoop.hive.ql.lockmgr.DbTxnManager	Client/ HiveServer2	DummyTxnManager replicates pre-Hive-0.13 behavior and provides no transactions.
hive.txn.timeout	Default: 300	Client/ HiveServer2/ Metastore	Time after which transactions are declared aborted if the client has not sent a heartbeat, in seconds. It's critical that this property has the same value for all components/services.⁵
hive.timedout.txn.reaper.start	Default: 100s	Metastore	Time delay of first reaper (the process which aborts timed-out transactions) run after the metastore starts (as of Hive 1.3.0).
hive.timedout.txn.reaper.interval	Default: 180s	Metastore	Time interval describing how often the reaper (the process which aborts timed-out transactions) runs (as of Hive 1.3.0).
hive.txn.max.open.batch	Default: 1000	Client	Maximum number of transactions that can be fetched in one call to open_txns().¹
hive.compactor.initiator.on	Default: false; value required for transactions: true (for exactly one instance of the Thrift metastore service)	Metastore	Whether to run the initiator and cleaner threads on this metastore instance. It's critical that this is enabled on exactly one metastore service instance (not enforced yet).
hive.compactor.worker.threads	Default: 0; value required for transactions: > 0 on at least one instance of the Thrift metastore service	Metastore	How many compactor worker threads to run on this metastore instance.²
hive.compactor.worker.timeout	Default: 86400	Metastore	Time in seconds after which a compaction job will be declared failed and the compaction re-queued.
hive.compactor.cleaner.run.interval	Default: 5000	Metastore	Time in milliseconds between runs of the cleaner thread. (Hive 0.14.0 and later.)
hive.compactor.check.interval	Default: 300	Metastore	Time in seconds between checks to see if any tables or partitions need to be compacted.³
hive.compactor.delta.num.threshold	Default: 10	Metastore	Number of delta directories in a table or partition that will trigger a minor compaction.
hive.compactor.delta.pct.threshold	Default: 0.1	Metastore	Percentage (fractional) size of the delta files relative to the base that will trigger a major compaction. 1 = 100%, so the default 0.1 = 10%.
hive.compactor.abortedtxn.threshold	Default: 1000	Metastore	Number of aborted transactions involving a given table or partition that will trigger a major compaction.
hive.compactor.max.num.delta	Default: 500	Metastore	Maximum number of delta files that the compactor will attempt to handle in a single job (as of Hive 1.3.0).⁴
hive.compactor.job.queue	Default: "" (empty string)	Metastore	Used to specify name of Hadoop queue to which Compaction jobs will be submitted. Set to empty string to let Hadoop choose the queue (as of Hive 1.3.0).

¹hive.txn.max.open.batch controls how many transactions streaming agents such as Flume or Storm open simultaneously. The streaming agent then writes that number of entries into a single file (per Flume agent or Storm bolt). Thus increasing this value decreases the number of delta files created by streaming agents. But it also increases the number of open transactions that Hive has to track at any given time, which may negatively affect read performance.

²Worker threads spawn MapReduce jobs to do compactions. They do not do the compactions themselves. Increasing the number of worker threads will decrease the time it takes tables or partitions to be compacted once they are determined to need compaction. It will also increase the background load on the Hadoop cluster as more MapReduce jobs will be running in the background.

³Decreasing this value will reduce the time it takes for compaction to be started for a table or partition that requires compaction. However, checking if compaction is needed requires several calls to the NameNode for each table or partition that has had a transaction done on it since the last major compaction. So decreasing this value will increase the load on the NameNode.

⁴If the compactor detects a very high number of delta files, it will first run several partial minor compactions (currently sequentially) and then perform the compaction actually requested.

⁵If the value is not the same across components, active transactions may be determined to be "timed out" and consequently aborted. This will result in errors like "No such transaction...", "No such lock ..."
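How the three compaction thresholds in the table interact can be sketched as a simple decision function. This is a simplified model built only from the parameter descriptions above; the order of the checks and the use of strict comparisons are assumptions, and the function name is hypothetical.

```python
def compaction_needed(num_deltas, delta_bytes, base_bytes, aborted_txns,
                      delta_num_threshold=10,
                      delta_pct_threshold=0.1,
                      abortedtxn_threshold=1000):
    """Return "major", "minor", or None for one table or partition.

    Defaults mirror hive.compactor.delta.num.threshold,
    hive.compactor.delta.pct.threshold and
    hive.compactor.abortedtxn.threshold.
    """
    # Too many aborted transactions: request a major compaction.
    if aborted_txns > abortedtxn_threshold:
        return "major"
    # Delta files large relative to the base: request a major compaction.
    if base_bytes > 0 and delta_bytes / base_bytes > delta_pct_threshold:
        return "major"
    # Many delta directories: request a minor compaction.
    if num_deltas > delta_num_threshold:
        return "minor"
    return None


print(compaction_needed(num_deltas=12, delta_bytes=5, base_bytes=100, aborted_txns=0))
```

With the defaults, twelve small deltas against a much larger base would lead to a minor compaction in this model, while deltas exceeding 10% of the base size would lead to a major one.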

Configuration Values to Set for INSERT, UPDATE, DELETE

In addition to the new parameters listed above, some existing parameters need to be set to support INSERT ... VALUES, UPDATE, and DELETE.

Configuration key	Must be set to
hive.support.concurrency	true (default is false)
hive.enforce.bucketing	true (default is false) (Not required as of Hive 2.0)
hive.exec.dynamic.partition.mode	nonstrict (default is strict)

Configuration Values to Set for Compaction

If the data in your system is not owned by the Hive user (i.e., the user that the Hive metastore runs as), then Hive will need permission to run as the user who owns the data in order to perform compactions. If you have already set up HiveServer2 to impersonate users, then the only additional work is to ensure that Hive has the right to impersonate users from the host running the Hive metastore. This is done by adding the hostname to hadoop.proxyuser.hive.hosts in Hadoop's core-site.xml file. If you have not already set up impersonation, then you will need to configure Hive to act as a proxy user. This requires you to set up keytabs for the user running the Hive metastore and add hadoop.proxyuser.hive.hosts and hadoop.proxyuser.hive.groups to Hadoop's core-site.xml file. See the Hadoop documentation on secure mode for your version of Hadoop (e.g., for Hadoop 2.5.1 it is at Hadoop in Secure Mode).

Table Properties

If a table is to be used in ACID writes (insert, update, delete), then the table property "transactional=true" must be set on that table, starting with Hive 0.14.0. Also, hive.txn.manager must be set to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager, either in hive-site.xml or at the beginning of the session before any query is run. Without those, inserts will be done in the old style; updates and deletes will be prohibited. However, this does not apply to Hive 0.13.0.

If a table owner does not wish the system to automatically determine when to compact, then the table property "NO_AUTO_COMPACTION" can be set. This will prevent all automatic compactions. Manual compactions can still be done with Alter Table/Partition Compact statements.

Table properties are set with the TBLPROPERTIES clause when a table is created or altered, as described in the Create Table and Alter Table Properties sections of Hive Data Definition Language. Currently the "transactional" and "NO_AUTO_COMPACTION" table properties are case-sensitive, although that will change in a future release with HIVE-8308.
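A small helper can make the case-sensitive property names harder to mistype. The sketch below only builds a CREATE TABLE string; the function name, table name, and columns are hypothetical. The CLUSTERED BY and STORED AS ORC clauses are assumptions that reflect the ACID table requirements of these Hive versions, which are not spelled out in this excerpt.

```python
def acid_table_ddl(table, columns, bucket_column, num_buckets,
                   auto_compaction=True):
    """Build a HiveQL CREATE TABLE statement for a transactional table."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    # Property names are case-sensitive: lowercase "transactional",
    # uppercase "NO_AUTO_COMPACTION".
    props = ["'transactional'='true'"]
    if not auto_compaction:
        props.append("'NO_AUTO_COMPACTION'='true'")
    return (
        f"CREATE TABLE {table} ({cols})\n"
        f"CLUSTERED BY ({bucket_column}) INTO {num_buckets} BUCKETS\n"
        f"STORED AS ORC\n"
        f"TBLPROPERTIES ({', '.join(props)})"
    )


print(acid_table_ddl("events", [("id", "INT"), ("msg", "STRING")],
                     bucket_column="id", num_buckets=4,
                     auto_compaction=False))
```

The generated statement would still need hive.txn.manager set to DbTxnManager in the session or in hive-site.xml before ACID writes are accepted.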

Original article: https://yq.aliyun.com/articles/7152