This article uses a shell script to automate the installation and configuration of Hadoop. The operating system is CentOS, the Hadoop version is 1.x, and the JDK version is 1.7; other versions have not been tested and may have unknown bugs.
Hadoop Installation Script
Hadoop installation takes three steps: first install the JDK, then install Hadoop, and finally configure passwordless SSH login (not strictly required). [1]
#!/bin/bash
# Usage:   Hadoop auto-install script
# History:
#   20140425  annhe  basic functionality

# Hadoop version
HADOOP_VERSION=1.2.1
# JDK version; Oracle offers no direct download link, so provide the rpm yourself and set the version here
JDK_VERSION=7u51
# Hadoop download mirror, defaults to BIT (mirror.bit.edu.cn)
MIRRORS=mirror.bit.edu.cn
# Machine architecture (e.g. x86_64), taken from uname -a
OS=`uname -a |awk '{print $13}'`

# Check if user is root
if [ $(id -u) != "0" ]; then
    printf "Error: You must be root to run this script!\n"
    exit 1
fi

# Check that the system is CentOS
cat /etc/issue|grep CentOS && r=0 || r=1
if [ $r -eq 1 ]; then
    echo "This script can only run on CentOS!"
    exit 1
fi

# Packages
HADOOP_FILE=hadoop-$HADOOP_VERSION-1.$OS.rpm
if [ "$OS"x = "x86_64"x ]; then
    JDK_FILE=jdk-$JDK_VERSION-linux-x64.rpm
else
    JDK_FILE=jdk-$JDK_VERSION-linux-i586.rpm
fi

function Install () {
    # Remove any previously installed versions
    rpm -qa |grep hadoop
    rpm -e hadoop
    rpm -qa | grep jdk
    rpm -e jdk
    # Restore the /etc/profile backup
    mv /etc/profile.bak /etc/profile

    # Prepare the packages
    if [ ! -f $HADOOP_FILE ]; then
        wget "http://$MIRRORS/apache/hadoop/common/stable1/$HADOOP_FILE" && r=0 || r=1
        [ $r -eq 1 ] && { echo "download error, please check your mirrors or check your network....exit"; exit 1; }
    fi
    [ ! -f $JDK_FILE ] && { echo "$JDK_FILE not found! Please download yourself....exit"; exit 1; }

    # Install
    rpm -ivh $JDK_FILE && r=0 || r=1
    if [ $r -eq 1 ]; then
        echo "$JDK_FILE install failed, please verify your rpm file....exit"
        exit 1
    fi
    rpm -ivh $HADOOP_FILE && r=0 || r=1
    if [ $r -eq 1 ]; then
        echo "$HADOOP_FILE install failed, please verify your rpm file....exit"
        exit 1
    fi

    # Back up /etc/profile
    cp /etc/profile /etc/profile.bak
    # Configure the Java environment variables
    cat >> /etc/profile <<eof
#set java environment
JAVA_HOME=/usr/java/default
CLASSPATH=.:\$JAVA_HOME/lib
PATH=\$JAVA_HOME/bin:\$PATH
export JAVA_HOME CLASSPATH PATH
eof
    source /etc/profile
    # Make the Hadoop scripts executable
    chmod u+x /usr/sbin/*.sh
    # Raise the HADOOP_CLIENT_OPTS heap limit
    sed -i "s/export\ HADOOP_CLIENT_OPTS=\"-Xmx128m \$HADOOP_CLIENT_OPTS\"/export\ HADOOP_CLIENT_OPTS=\"-Xmx512m \$HADOOP_CLIENT_OPTS\"/g" /etc/hadoop/hadoop-env.sh
}

# Configure passwordless ssh login
function SSHlogin () {
    # Do not regenerate an existing private key
    [ ! -f ~/.ssh/id_dsa ] && ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    cat ~/.ssh/authorized_keys |grep "`cat ~/.ssh/id_dsa.pub`" && r=0 || r=1
    # Append the public key only if it is not already authorized
    [ $r -eq 1 ] && cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
    chmod 644 ~/.ssh/authorized_keys
}

Install 2>&1 | tee -a hadoop_install.log
SSHlogin 2>&1 | tee -a hadoop_install.log
# A reboot is needed after changing HADOOP_CLIENT_OPTS
shutdown -r now
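A minimal sketch of how the script might be invoked, assuming it is saved as hadoop_install.sh (the filename is hypothetical) and the Oracle JDK rpm named by JDK_FILE has already been placed in the same directory:

chmod +x hadoop_install.sh
ls jdk-7u51-linux-x64.rpm     # the JDK rpm must be provided manually; the script only downloads Hadoop
./hadoop_install.sh           # installs the JDK and Hadoop, configures ssh, then reboots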
Running the Bundled Example on a Single Node
By default, Hadoop is configured to run in non-distributed (standalone) mode as a single Java process, which is very helpful for debugging. Create some test text, creating the input directory first if needed as sketched below.
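A minimal sketch of preparing the input directory, assuming the working directory is /root/hadoop as in the prompts below:

mkdir -p /root/hadoop/input    # directory the sample text files are written into
cd /root/hadoop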
[root@linux hadoop]# echo "hello world" >input/hello.txt
[root@linux hadoop]# echo "hello hadoop" >input/hadoop.txt
Run WordCount
[root@linux hadoop]# hadoop jar /usr/share/hadoop/hadoop-examples-1.2.1.jar wordcount input output
14/04/26 02:56:23 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/04/26 02:56:23 INFO input.FileInputFormat: Total input paths to process : 2
14/04/26 02:56:24 WARN snappy.LoadSnappy: Snappy native library not loaded
14/04/26 02:56:24 INFO mapred.JobClient: Running job: job_local275273933_0001
14/04/26 02:56:24 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/26 02:56:24 INFO mapred.LocalJobRunner: Starting task: attempt_local275273933_0001_m_000000_0
14/04/26 02:56:25 INFO util.ProcessTree: setsid exited with exit code 0
14/04/26 02:56:25 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@7e86fe3a
14/04/26 02:56:25 INFO mapred.MapTask: Processing split: file:/root/hadoop/input/hadoop.txt:0+13
14/04/26 02:56:25 INFO mapred.MapTask: io.sort.mb = 100
14/04/26 02:56:25 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/26 02:56:25 INFO mapred.MapTask: record buffer = 262144/327680
14/04/26 02:56:25 INFO mapred.MapTask: Starting flush of map output
14/04/26 02:56:25 INFO mapred.MapTask: Finished spill 0
14/04/26 02:56:25 INFO mapred.Task: Task:attempt_local275273933_0001_m_000000_0 is done. And is in the process of commiting
14/04/26 02:56:25 INFO mapred.LocalJobRunner:
14/04/26 02:56:25 INFO mapred.Task: Task 'attempt_local275273933_0001_m_000000_0' done.
14/04/26 02:56:25 INFO mapred.LocalJobRunner: Finishing task: attempt_local275273933_0001_m_000000_0
14/04/26 02:56:25 INFO mapred.LocalJobRunner: Starting task: attempt_local275273933_0001_m_000001_0
14/04/26 02:56:25 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@16ed889d
14/04/26 02:56:25 INFO mapred.MapTask: Processing split: file:/root/hadoop/input/hello.txt:0+12
14/04/26 02:56:25 INFO mapred.MapTask: io.sort.mb = 100
14/04/26 02:56:25 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/26 02:56:25 INFO mapred.MapTask: record buffer = 262144/327680
14/04/26 02:56:25 INFO mapred.MapTask: Starting flush of map output
14/04/26 02:56:25 INFO mapred.MapTask: Finished spill 0
14/04/26 02:56:25 INFO mapred.Task: Task:attempt_local275273933_0001_m_000001_0 is done. And is in the process of commiting
14/04/26 02:56:25 INFO mapred.LocalJobRunner:
14/04/26 02:56:25 INFO mapred.Task: Task 'attempt_local275273933_0001_m_000001_0' done.
14/04/26 02:56:25 INFO mapred.LocalJobRunner: Finishing task: attempt_local275273933_0001_m_000001_0
14/04/26 02:56:25 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/26 02:56:25 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@42701c57
14/04/26 02:56:25 INFO mapred.LocalJobRunner:
14/04/26 02:56:25 INFO mapred.Merger: Merging 2 sorted segments
14/04/26 02:56:25 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 53 bytes
14/04/26 02:56:25 INFO mapred.LocalJobRunner:
14/04/26 02:56:25 INFO mapred.Task: Task:attempt_local275273933_0001_r_000000_0 is done. And is in the process of commiting
14/04/26 02:56:25 INFO mapred.LocalJobRunner:
14/04/26 02:56:25 INFO mapred.Task: Task attempt_local275273933_0001_r_000000_0 is allowed to commit now
14/04/26 02:56:25 INFO output.FileOutputCommitter: Saved output of task 'attempt_local275273933_0001_r_000000_0' to output
14/04/26 02:56:25 INFO mapred.LocalJobRunner: reduce > reduce
14/04/26 02:56:25 INFO mapred.Task: Task 'attempt_local275273933_0001_r_000000_0' done.
14/04/26 02:56:25 INFO mapred.JobClient: map 100% reduce 100%
14/04/26 02:56:25 INFO mapred.JobClient: Job complete: job_local275273933_0001
14/04/26 02:56:25 INFO mapred.JobClient: Counters: 20
14/04/26 02:56:25 INFO mapred.JobClient:   File Output Format Counters
14/04/26 02:56:25 INFO mapred.JobClient:     Bytes Written=37
14/04/26 02:56:25 INFO mapred.JobClient:   FileSystemCounters
14/04/26 02:56:25 INFO mapred.JobClient:     FILE_BYTES_READ=429526
14/04/26 02:56:25 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=586463
14/04/26 02:56:25 INFO mapred.JobClient:   File Input Format Counters
14/04/26 02:56:25 INFO mapred.JobClient:     Bytes Read=25
14/04/26 02:56:25 INFO mapred.JobClient:   Map-Reduce Framework
14/04/26 02:56:25 INFO mapred.JobClient:     Reduce input groups=3
14/04/26 02:56:25 INFO mapred.JobClient:     Map output materialized bytes=61
14/04/26 02:56:25 INFO mapred.JobClient:     Combine output records=4
14/04/26 02:56:25 INFO mapred.JobClient:     Map input records=2
14/04/26 02:56:25 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/04/26 02:56:25 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
14/04/26 02:56:25 INFO mapred.JobClient:     Reduce output records=3
14/04/26 02:56:25 INFO mapred.JobClient:     Spilled Records=8
14/04/26 02:56:25 INFO mapred.JobClient:     Map output bytes=41
14/04/26 02:56:25 INFO mapred.JobClient:     CPU time spent (ms)=0
14/04/26 02:56:25 INFO mapred.JobClient:     Total committed heap usage (bytes)=480915456
14/04/26 02:56:25 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
14/04/26 02:56:25 INFO mapred.JobClient:     Combine input records=4
14/04/26 02:56:25 INFO mapred.JobClient:     Map output records=4
14/04/26 02:56:25 INFO mapred.JobClient:     SPLIT_RAW_BYTES=197
14/04/26 02:56:25 INFO mapred.JobClient:     Reduce input records=
Result
[root@linux hadoop]# cat output/*
hadoop  1
hello   2
world   1
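To re-run the example, the output directory has to be removed first, since Hadoop refuses to write into an output directory that already exists; for instance:

rm -rf output    # clear the previous result before running the job again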
Running Your Own WordCount
package net.annhe.wordcount;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class WordCount extends Configured implements Tool {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int ret = ToolRunner.run(new WordCount(), args);
        System.exit(ret);
    }
}
Compile
javac -classpath /usr/share/hadoop/hadoop-core-1.2.1.jar -d . WordCount.java
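With -d ., javac lays the class files out in a directory tree matching the package name; a quick check (the expected file names follow from the source above):

ls net/annhe/wordcount/
# WordCount.class  WordCount$Map.class  WordCount$Reduce.class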
Package
jar -cvf wordcount.jar net/
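The jar's contents can be verified before running it; a hedged check that the class ended up under its package path:

jar -tf wordcount.jar | grep WordCount
# net/annhe/wordcount/WordCount.class should appear in the listing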
Run
hadoop jar wordcount.jar net.annhe.wordcount.WordCount input/ out
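Because the class runs through ToolRunner, generic Hadoop options can be passed before the positional arguments; a usage sketch (the -D value is only an example):

hadoop jar wordcount.jar net.annhe.wordcount.WordCount -D mapred.reduce.tasks=1 input/ out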
Result
[root@linux hadoop]# cat out/*
hadoop  1
hello   2
world   1
Problems Encountered
1. Out of memory
The virtual machine had only 180 MB of memory, and running the example program failed with:
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
Solution:
Increase the virtual machine's memory and edit /etc/hadoop/hadoop-env.sh, changing:
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"   # raised to 512m
By default the JVM is started with a maximum heap of 128 MB, so running some of Hadoop's bundled examples triggers an out-of-memory error. The heap size can be raised here, but there is no need to change it if you never hit the limit. [2]
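The same change can be applied non-interactively; the install script above does it with sed:

sed -i "s/export\ HADOOP_CLIENT_OPTS=\"-Xmx128m \$HADOOP_CLIENT_OPTS\"/export\ HADOOP_CLIENT_OPTS=\"-Xmx512m \$HADOOP_CLIENT_OPTS\"/g" /etc/hadoop/hadoop-env.sh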
2. Referencing a class that has a package name
A class that belongs to a package must be invoked by its fully qualified name, following the package hierarchy, e.g. net.annhe.wordcount.WordCount above. [3]
3. Compiling a class that has a package name
It must be compiled into its package directory by adding the -d option.
Java class files belong inside their package directory. For example, with package abc; public class ls {...}, abc is the package of class ls, so compilation must create the corresponding abc directory. That is what javac's -d option does: it generates the package directory for the class file. The class above would therefore be compiled with javac -d . ls.java (note the spaces between javac, -d, ., and ls.java). [4]
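A minimal sketch of that layout, assuming ls.java declares package abc and contains a main method:

javac -d . ls.java    # creates ./abc/ls.class
java abc.ls           # run it by its fully qualified name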
References
[1]. 陆嘉桓. Hadoop实战 (Hadoop in Action), 2nd edition. 机械工业出版社 (China Machine Press).
[2]. OSChina blog: http://my.oschina.net/mynote/blog/93340
[3]. CSDN blog: http://blog.csdn.net/xw13106209/article/details/6861855
[4]. Baidu Zhidao: http://zhidao.baidu.com/link?url=ND1BWmyGb_5a05Jntd9vGZNWGtmJmcKF1V6dhVNM1eFNuHL6kbQyVrEWtCUmy7KYP5F66R2BumCifCnPQnYdD_
Note: single-node (standalone) mode does not appear to require passwordless SSH login.