The environment is as follows:
Linux version: Ubuntu 14.04 LTS
JDK version: jdk1.7.0_67
Hadoop version: hadoop-2.0.0-cdh4.1.0.tar.gz
Impala version: impala_1.4.0-1.impala1.4.0.p0.7~precise-impala1.4.0_all.deb
Hadoop CDH download: http://archive.cloudera.com/cdh4/cdh/4/
Ubuntu Impala download:
http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala/pool/contrib/i/impala/
Recommendation: do not use too new a Hadoop release; CDH4 is the safe choice. I previously tried apache-hadoop 2.7 and hadoop 2.6-cdh5, and Impala failed at startup in both cases because of a protobuf incompatibility, an error I spent several days chasing down.
For convenience, this tutorial is carried out under the root account, i.e. root is the working user.
1. Install Hadoop
# apt-get install openssh-server
Set up passwordless SSH login:
# ssh-keygen -t rsa -P ""
# cat .ssh/id_rsa.pub >> .ssh/authorized_keys
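To confirm the key-based login works, ssh into localhost; it should open a shell without asking for a password (accept the host-key prompt on the first connection):
# ssh localhost
# exit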
Download jdk-7u67-linux-x64.tar.gz, unpack it, and configure the environment variables:
# tar -vzxf jdk-7u67-linux-x64.tar.gz
# mkdir /usr/java
# mv jdk1.7.0_67 /usr/java/
# vi /etc/profile
export JAVA_HOME=/usr/java/jdk1.7.0_67
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH
export CLASSPATH=$CLASSPATH:.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
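The profile is only sourced further down (# source /etc/profile); once it has been, it is worth confirming that the JDK is picked up. Assuming the jdk1.7.0_67 install above, the first line of output should read:
# java -version
java version "1.7.0_67"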
# tar -xvzf hadoop-2.0.0-cdh4.1.0.tar.gz
# mv hadoop-2.0.0-cdh4.1.0 /usr/local/
# vi /etc/profile
export HADOOP_HOME=/usr/local/hadoop-2.0.0-cdh4.1.0
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_LIB=$HADOOP_HOME/lib
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
# source /etc/profile
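With the profile loaded, a quick sanity check that the Hadoop binaries are on the PATH; the first line of output should report the CDH build:
# hadoop version
Hadoop 2.0.0-cdh4.1.0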
2. Configure Hadoop (pseudo-distributed)
# cd /usr/local/hadoop-2.0.0-cdh4.1.0
# cd etc/hadoop
# vi hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.7.0_67
# vi core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/root/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
# vi hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/root/hadoop/tmp/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/root/hadoop/tmp/dfs/data</value>
  </property>
</configuration>
# cd ~
# mkdir -p hadoop/tmp/dfs/name
# mkdir hadoop/tmp/dfs/data
Note: make sure your user owns the hadoop-2.0.0-cdh4.1.0 directory, the namenode directory, and the datanode directory (see the sketch below).
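Running everything as root, as this tutorial does, the directories created above are already root-owned. If Hadoop runs as a dedicated user instead (the user name hadoop and the data path /home/hadoop/hadoop below are hypothetical), ownership would be fixed roughly like this:
# chown -R hadoop:hadoop /usr/local/hadoop-2.0.0-cdh4.1.0
# chown -R hadoop:hadoop /home/hadoop/hadoop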
3. Start Hadoop
Format the namenode:
# hadoop namenode -format
# start-all.sh (this script lives in $HADOOP_HOME/sbin)
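jps from the JDK lists the running Java daemons and is the quickest way to confirm the startup worked. After start-all.sh you would expect roughly the following (PIDs will differ, and the YARN daemons appear only if YARN is configured):
# jps
2815 NameNode
2945 DataNode
3128 SecondaryNameNode
3301 ResourceManager
3415 NodeManager
3680 Jps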
Test:
# hadoop fs -ls /              # list the HDFS root directory
# hadoop fs -mkdir /user       # create the directory /user in HDFS
# hadoop fs -put a.out /user   # upload the local file a.out to /user in HDFS
# hadoop fs -get /user/a.out . # download a.out back to the local machine
Stop Hadoop:
# stop-all.sh
4. Install Impala
Add the Cloudera package sources:
# vi /etc/apt/sources.list.d/cloudera.list
deb [arch=amd64] http://archive.cloudera.com/cm5/ubuntu/trusty/amd64/cm trusty-cm5 contrib
deb-src http://archive.cloudera.com/cm5/ubuntu/trusty/amd64/cm trusty-cm5 contrib
deb [arch=amd64] http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala precise-impala1 contrib
deb-src http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala precise-impala1 contrib
# apt-get update
# apt-get install bigtop-utils
Downloading Impala through apt-get is very slow; you can instead fetch the packages directly from
http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala/pool/contrib/i/impala/
and install them by hand.
# dpkg -i impala_1.4.0-1.impala1.4.0.p0.7~precise-impala1.4.0_all.deb
# dpkg -i impala-server_1.4.0-1.impala1.4.0.p0.7~precise-impala1.4.0_all.deb
# dpkg -i impala-state-store_1.4.0-1.impala1.4.0.p0.7~precise-impala1.4.0_all.deb
# dpkg -i impala-catalog_1.4.0-1.impala1.4.0.p0.7~precise-impala1.4.0_all.deb
# apt-get install python-setuptools
If this errors out, fix it according to the message (apt-get -f install resolves broken dependencies).
# dpkg -i impala-shell_1.4.0-1.impala1.4.0.p0.7~precise-impala1.4.0_all.deb
Impala is now installed.
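To double-check that all five packages registered correctly, list them through dpkg:
# dpkg -l | grep impala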
5. Configure Impala
# vi /etc/hosts
127.0.0.1 localhost
Copy core-site.xml and hdfs-site.xml from $HADOOP_HOME/etc/hadoop to /etc/impala/conf:
# cd /usr/local/hadoop-2.0.0-cdh4.1.0/etc/hadoop/
# cp core-site.xml hdfs-site.xml /etc/impala/conf
# cd /etc/impala/conf
# vi hdfs-site.xml
Add the following:
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
<property>
  <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.use.legacy.blockreader.local</name>
  <value>true</value>
</property>
<property>
  <name>dfs.datanode.data.dir.perm</name>
  <value>750</value>
</property>
<property>
  <name>dfs.block.local-path-access.user</name>
  <value>impala</value>
</property>
<property>
  <name>dfs.client.file-block-storage-locations.timeout</name>
  <value>3000</value>
</property>
# mkdir /var/run/hadoop-hdfs
Note: make sure /var/run/hadoop-hdfs is owned by the user the services run as (see the sketch below).
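In this root-only walkthrough the directory is already owned by root. On a system with dedicated service accounts (the hdfs user and hadoop group below are common package defaults, but treat them as assumptions for this setup), it would be something like:
# chown hdfs:hadoop /var/run/hadoop-hdfs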
6. Start Impala
# service impala-state-store start
# service impala-catalog start
# service impala-server start
Check that the daemons are running:
# ps -ef | grep impala
If anything failed, check the logs for details.
Start impala-shell:
# impala-shell -i localhost --quiet
[localhost:21000] > select version();
...
[localhost:21000] > select current_database();
...
For impala-shell usage, see
http://www.cloudera.com/documentation/enterprise/latest/topics/impala_tutorial.html#tutorial
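As a quick end-to-end smoke test (the table name t1 and its columns are made up for this example), a minimal session might look like:
[localhost:21000] > create table t1 (id int, name string);
[localhost:21000] > insert into t1 values (1, 'hello');
[localhost:21000] > select * from t1;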
7. Troubleshooting Impala errors from the logs
Impala's logs are located in /var/log/impala.
Impala startup error 1:
Failed on local exception:
com.google.protobuf.InvalidProtocolBufferException: Message missing required fields: callId, status; Host Details : local host is: "database32/127.0.1.1"; destination host is: "localhost":9000;
Cause:
Hadoop 2.6 is built against protobuf 2.5, while this Impala build uses protobuf 2.4.
Fix:
Downgrade Hadoop to a version that matches Impala. Since Impala here is installed from binary packages and cannot be recompiled, the only option is to make Hadoop's version line up with Impala's. I used hadoop-2.0.0-cdh4.1.0 with impala_1.4.0.
Impala startup error 2:
dfs.client.read.shortcircuit is not enabled because - dfs.client.use.legacy.blockreader.local is not enabled
Cause:
hdfs-site.xml is misconfigured.
Fix:
Set the dfs.datanode.hdfs-blocks-metadata.enabled property to true.
Impala startup error 3:
Impalad services did not start correctly, exiting. Error: Couldn't open transport for 127.0.0.1:24000(connect() failed: Connection refused)
Cause:
impala-state-store and impala-catalog were not started first.
Fix:
# service impala-state-store start
# service impala-catalog start
# service impala-server start
Like Hive, Impala can work directly with data in HDFS and HBase. The difference is that Hive and other frameworks built on MapReduce are suited to long-running batch jobs, such as batch extract-transform-load (ETL) work, whereas Impala is aimed at real-time, interactive queries.
For the environment setup and installation of each node in the Hadoop cluster, see the article 使用yum安装CDH Hadoop集群 (Installing a CDH Hadoop cluster with yum).
1. Environment
- CentOS 6.4 x86_64
- CDH 5.0.1
- jdk1.6.0_31
The cluster is planned as three nodes; each node's IP, hostname, and deployed components are assigned as follows:
192.168.56.121 cdh1 NameNode, Hive, ResourceManager, HBase, impala
192.168.56.122 cdh2 DataNode, SecondaryNameNode, NodeManager, HBase, impala
192.168.56.123 cdh3 DataNode, HBase, NodeManager, impala
2. Install
In CDH 5.0.1 the bundled impala version is 1.4.0. Download the repo file into /etc/yum.repos.d/; a sketch of a typical repo file follows below.
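The original post does not include the repo file itself; for CentOS 6 x86_64, a cloudera-cdh5.repo typically looks roughly like the following (the file name, baseurl, and gpgkey are assumptions based on the usual archive.cloudera.com layout, so verify them before use):
[cloudera-cdh5]
name=Cloudera's Distribution for Apache Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5/
gpgkey=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1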
Then all the impala components can be installed with:
$ sudo yum install impala impala-server impala-state-store impala-catalog impala-shell -y
In practice, though, you usually install only the services each node needs:
- impala-state-store and impala-catalog on the node that hosts the Hive metastore
- impala-server and impala-shell on the DataNode nodes
3. Configure
3.1 Edit the configuration files
Check the installation paths:
$ find / -name impala
/var/run/impala
/var/lib/alternatives/impala
/var/log/impala
/usr/lib/impala
/etc/alternatives/impala
/etc/default/impala
/etc/impala
The impalad configuration directory is set by the IMPALA_CONF_DIR environment variable and defaults to /usr/lib/impala/conf; Impala's default settings live in /etc/default/impala. In that file, modify IMPALA_CATALOG_SERVICE_HOST and IMPALA_STATE_STORE_HOST:
IMPALA_CATALOG_SERVICE_HOST=cdh1
IMPALA_STATE_STORE_HOST=cdh1
IMPALA_STATE_STORE_PORT=24000
IMPALA_BACKEND_PORT=22000
IMPALA_LOG_DIR=/var/log/impala
IMPALA_CATALOG_ARGS=" -log_dir=${IMPALA_LOG_DIR} "
IMPALA_STATE_STORE_ARGS=" -log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT}"
IMPALA_SERVER_ARGS=" \
-log_dir=${IMPALA_LOG_DIR} \
-catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore \
-state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT}"
ENABLE_CORE_DUMPS=false
# LIBHDFS_OPTS=-Djava.library.path=/usr/lib/impala/lib
# MYSQL_CONNECTOR_JAR=/usr/share/java/mysql-connector-java.jar
# IMPALA_BIN=/usr/lib/impala/sbin
# IMPALA_HOME=/usr/lib/impala
# HIVE_HOME=/usr/lib/hive
# HBASE_HOME=/usr/lib/hbase
# IMPALA_CONF_DIR=/etc/impala/conf
# HADOOP_CONF_DIR=/etc/impala/conf
# HIVE_CONF_DIR=/etc/impala/conf
# HBASE_CONF_DIR=/etc/impala/conf
To cap the memory Impala may use, append -mem_limit=70% to the IMPALA_SERVER_ARGS value above.
To bound the number of requests per queue, append -default_pool_max_requests=-1 to IMPALA_SERVER_ARGS; this parameter sets the maximum number of requests per queue, and -1 means no limit. A combined example follows below.
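Putting the two flags together, IMPALA_SERVER_ARGS from the file above would end up as (values unchanged, flags appended at the end):
IMPALA_SERVER_ARGS=" \
-log_dir=${IMPALA_LOG_DIR} \
-catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore \
-state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT} \
-mem_limit=70% \
-default_pool_max_requests=-1"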
On node cdh1, copy hive-site.xml, core-site.xml, and hdfs-site.xml into /usr/lib/impala/conf, then add the following to hdfs-site.xml:
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
<property>
  <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
  <value>true</value>
</property>
Sync the files above to the other nodes.
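One way to do that, using scp and the hostnames from the cluster plan above:
$ scp /usr/lib/impala/conf/*.xml cdh2:/usr/lib/impala/conf/
$ scp /usr/lib/impala/conf/*.xml cdh3:/usr/lib/impala/conf/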
3.2 Create the socket path
Create /var/run/hadoop-hdfs on every node:
$ mkdir -p /var/run/hadoop-hdfs
Link the Postgres JDBC jar into Impala's lib directory:
$ ln -s /usr/share/java/postgresql-jdbc.jar /usr/lib/impala/lib/
3.3 User requirements
The Impala installation creates a user and group named impala; do not delete them.
If you want Impala to work together with YARN and Llama, add the impala user to the hdfs group.
When Impala runs DROP TABLE it moves the table's files to the HDFS trash, so you need an HDFS directory /user/impala that the impala user can write to. Likewise, Impala reads data under the Hive warehouse, so the impala user must also be added to the hive group.
Impala cannot run as root, because the root user is not allowed to perform direct reads.
Create the impala user's HDFS home directory and set its ownership:
sudo -u hdfs hadoop fs -mkdir /user/impala
sudo -u hdfs hadoop fs -chown impala /user/impala
Check which groups the impala user belongs to:
$ groups impala
impala : impala hadoop hdfs hive
As the output shows, the impala user belongs to the impala, hadoop, hdfs, and hive groups.
4. Start the services
On node cdh1:
$ service impala-state-store start
$ service impala-catalog start
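The state store and catalog run on cdh1 only; the query daemons live on the DataNode nodes. Assuming the layout from the cluster plan above, also start the server on cdh2 and cdh3:
$ service impala-server start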
If impalad starts normally you can inspect /tmp/impalad.INFO; if something goes wrong, check /tmp/impalad.ERROR to locate the error.
5. Using the shell
Start the Impala shell with impala-shell, connect to cdh1, and refresh the metadata:
$ impala-shell
[Not connected] >connect cdh1
[cdh1:21000] >invalidate metadata
[cdh1:21000] >connect cdh2
[cdh2:21000] >select * from t
After creating a table in Hive, run an INVALIDATE METADATA statement the first time you start impala-shell so that Impala picks up the newly created table (as of Impala 1.2 you only need to run INVALIDATE METADATA on one node, rather than on every Impala node).
You can also pass additional options; to see which are available:
# impala-shell -h
Usage: impala_shell.py [options]
Options:
-h, --help show this help message and exit
-i IMPALAD, --impalad=IMPALAD
<host:port> of impalad to connect to
-q QUERY, --query=QUERY
Execute a query without the shell
-f QUERY_FILE, --query_file=QUERY_FILE
Execute the queries in the query file, delimited by ;
-k, --kerberos Connect to a kerberized impalad
-o OUTPUT_FILE, --output_file=OUTPUT_FILE
If set, query results will be written to the given file.
Results from multiple semicolon-terminated queries
will be appended to the same file
-B, --delimited Output rows in delimited mode
--print_header Print column names in delimited mode, true by default
when pretty-printed.
--output_delimiter=OUTPUT_DELIMITER
Field delimiter to use for output in delimited mode
-s KERBEROS_SERVICE_NAME, --kerberos_service_name=KERBEROS_SERVICE_NAME
Service name of a kerberized impalad, default is
'impala'
-V, --verbose Enable verbose output
-p, --show_profiles Always display query profiles after execution
--quiet Disable verbose output
-v, --version Print version information
-c, --ignore_query_failure
Continue on query failure
-r, --refresh_after_connect
Refresh Impala catalog after connecting
-d DEFAULT_DB, --database=DEFAULT_DB
Issue a use database command on startup.
For example, you can have the metadata refreshed automatically when connecting:
$ impala-shell -r
Starting Impala Shell in unsecure mode
Connected to 192.168.56.121:21000
Server version: impalad version 1.1.1 RELEASE (build 83d5868f005966883a918a819a449f636a5b3d5f)
Invalidating Metadata
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v1.1.1 (83d5868) built on Fri Aug 23 17:28:05 PDT 2013)
Query: invalidate metadata
Query finished, fetching results ...
Returned 0 row(s) in 5.13s
[192.168.56.121:21000] >
Exporting data with impala:
$ impala-shell -i '192.168.56.121:21000' -r -q "select * from test" -B --output_delimiter="\t" -o result.txt