Reference:
http://hadoop.apache.org/common/docs/r0.15.2/streaming.html

Note: Streaming currently does not support Linux pipes (i.e. pipelines such as `cat | wc -l`) inside a single mapper/reducer command, but that does not stop us from using Perl or Python one-liners!!
The original wording from the FAQ is:
Can I use UNIX pipes? For example, will -mapper "cut -f1 | sed s/foo/bar/g" work?
Currently this does not work and gives an "java.io.IOException: Broken pipe" error. This is probably a bug that needs to be investigated.
But if you are a die-hard Linux shell pipe fan, see the following:
$> perl -e 'open( my $fh, "grep -v null tt |sed -n 1,5p |"); while ( <$fh> ) { print; } '
# I have not gotten this to pass my tests, though!!
Environment: hadoop-0.18.3
$> find . -type f -name "*streaming*.jar"
./contrib/streaming/hadoop-0.18.3-streaming.jar
Test data (tab-separated):
-bash-3.00$ head tt
null	false	3702	208100
6005100	false	70	13220
6005127	false	24	4640
6005160	false	25	4820
6005161	false	20	3620
6005164	false	14	1280
6005165	false	37	7080
6005168	false	104	20140
6005169	false	35	6680
6005240	false	169	32140
......
Run:
c1="perl -ne 'if(/.*\t(.*)/){\$sum+=\$1;}END{print \"\$sum\";}'"
# Note: inside the double quotes, $ must be written as \$ and " as \"
echo $c1; # prints: perl -ne 'if(/.*\t(.*)/){$sum+=$1;}END{print "$sum";}'
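The escaping rule is easy to check on its own: inside double quotes the shell expands `$` unless it is escaped, and `\"` embeds a literal double quote. A minimal illustration (variable name `c` is this demo's own):

```shell
#!/bin/sh
# Inside double quotes, \$ keeps the dollar sign literal so that Perl,
# not the shell, later interprets $sum; \" embeds a double quote.
c="print \"\$sum\";"
echo "$c"
# prints: print "$sum";
```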
hadoop jar hadoop-0.18.3-streaming.jar \
    -input file:///data/hadoop/lky/jar/tt \
    -mapper "/bin/cat" \
    -reducer "$c1" \
    -output file:///tmp/lky/streamingx8
Result:
cat /tmp/lky/streamingx8/*
1166480
Local run for comparison:
perl -ne 'if(/.*\t(.*)/){$sum+=$1;}END{print "$sum";}' < tt
1166480
The results match!!!!
Built-in help for the command:
-bash-3.00$ hadoop jar hadoop-0.18.3-streaming.jar -info
09/09/25 14:50:12 ERROR streaming.StreamJob: Missing required option -input
Usage: $HADOOP_HOME/bin/hadoop [--config dir] jar \
          $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input    <path>                DFS input file(s) for the Map step
  -output   <path>                DFS output directory for the Reduce step
  -mapper   <cmd|JavaClassName>   The streaming command to run
  -combiner <JavaClassName>       Combiner has to be a Java class
  -reducer  <cmd|JavaClassName>   The streaming command to run
  -file     <file>                File/dir to be shipped in the Job jar file
  -dfs      <h:p>|local           Optional. Override DFS configuration
  -jt       <h:p>|local           Optional. Override JobTracker configuration
  -additionalconfspec specfile    Optional.
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
  -outputformat TextOutputFormat(default)|JavaClassName Optional.
  -partitioner JavaClassName      Optional.
  -numReduceTasks <num>           Optional.
  -inputreader <spec>             Optional.
  -jobconf  <n>=<v>               Optional. Add or override a JobConf property
  -cmdenv   <n>=<v>               Optional. Pass env.var to streaming commands
  -mapdebug <path>                Optional. To run this script when a map task fails
  -reducedebug <path>             Optional. To run this script when a reduce task fails
  -cacheFile fileNameURI
  -cacheArchive fileNameURI
  -verbose
This article was reposted from 刘凯毅's blog on 博客园; original post: "hadoop streaming( hadoop + perl )小试". Please contact the original author before republishing.