Parameters added or tuned in Hadoop:
I. hadoop-env.sh
1. Hadoop heap size (default 1000 MB).
# The maximum amount of heap to use, in MB. Default is 1000.
# export HADOOP_HEAPSIZE=2000
2. Change the pid file path. Pid files live in /tmp by default, and /tmp is periodically cleaned out by the system.
# The directory where pid files are stored. /tmp by default.
# export HADOOP_PID_DIR=/var/hadoop/pids
II. core-site.xml
1. hadoop.tmp.dir is the base directory the Hadoop filesystem relies on (default /tmp); configure it explicitly whenever possible.
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
2. Buffer size available when reading and writing SequenceFiles; a larger buffer reduces the number of I/O operations. Default is 4096 (bytes); a value between 65536 and 131072 is recommended.
<property>
  <name>io.file.buffer.size</name>
  <value>4096</value>
  <description>The size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations.</description>
</property>
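As a sketch only, a core-site.xml fragment that applies both suggestions above might look like the following; the /data/hadoop/tmp path is a hypothetical example, and 65536 is simply the lower end of the recommended range:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/tmp</value>
  <description>Manually configured base directory (example path; must exist and be writable).</description>
</property>
<property>
  <name>io.file.buffer.size</name>
  <value>65536</value>
  <description>SequenceFile I/O buffer raised from the 4096-byte default (example value).</description>
</property>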
III. hdfs-site.xml
1. Each HDFS block is 67108864 bytes (64 MB) by default. If you are sure the files being stored and read are large, this can be raised to 134217728 (128 MB).
dfs.block.size 67108864 The default block size for new files.
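A minimal hdfs-site.xml entry for the 128 MB block size mentioned above:
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <description>128 MB block size for new files, raised from the 64 MB default.</description>
</property>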
2. Hadoop enters safe mode at startup, during which no data can be written. It only leaves safe mode once the configured fraction of blocks (default 0.999f) satisfies the minimal replication requirement defined by dfs.replication.min.
If dfs.replication.min is set high, or there are many data nodes, this can take quite a while.
<property>
  <name>dfs.safemode.threshold.pct</name>
  <value>0.999f</value>
  <description>Specifies the percentage of blocks that should satisfy the minimal replication requirement defined by dfs.replication.min. Values less than or equal to 0 mean not to start in safe mode. Values greater than 1 will make safe mode permanent.</description>
</property>
3. Storage paths for the namenode and datanode; by default they are kept under the ${hadoop.tmp.dir}/dfs/ directory.
dfs.name.dir ${hadoop.tmp.dir}/dfs/name Directory in NameNode's local filesystem to store HDFS's metadata.
dfs.data.dir ${hadoop.tmp.dir}/dfs/data Directory in a DataNode's local filesystem to store HDFS's file blocks.
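A sketch of hdfs-site.xml entries that move both directories off ${hadoop.tmp.dir}; the /data/hadoop/dfs/... paths are purely illustrative and would have to exist on each node:
<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/dfs/name</value>
  <description>Local directory where the NameNode stores HDFS metadata (example path).</description>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/hadoop/dfs/data</value>
  <description>Local directory where each DataNode stores HDFS block files (example path).</description>
</property>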
IV. mapred-site.xml
1. Size of the buffer that caches intermediate map output (default 100 MB). The intermediate results produced by a map task are not written directly to disk: Hadoop caches part of them in an in-memory buffer and does some pre-sorting there to optimize overall map performance.
While the map runs it keeps writing results into this buffer, but the buffer cannot necessarily hold the entire map output. Once the output exceeds a certain threshold, the map has to flush the buffered data to disk; in MapReduce this is called a spill. Raising io.sort.mb lowers the number of spills, so the map task performs fewer disk operations, which ultimately improves map performance.
io.sort.mb 100 The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.
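A sketch of a mapred-site.xml entry that enlarges the spill buffer; 200 MB is only an example value, and it has to fit comfortably inside the task JVM heap set via mapred.child.java.opts:
<property>
  <name>io.sort.mb</name>
  <value>200</value>
  <description>Map-side sort/spill buffer, raised from the 100 MB default (example value).</description>
</property>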
2. Upper limit on the number of files merged at once when sorting files (default 10).
Once a map task has run to completion, it will have produced one or more spill files, and before it can exit normally it must merge these spills into a single file.
The io.sort.factor parameter tunes this merge behavior: it is the maximum number of parallel streams that can write into the merged file. Raising io.sort.factor helps reduce the number of merge passes, which in turn reduces how often the map reads from and writes to disk and ultimately improves map performance.
io.sort.factor 10 The number of streams to merge at once while sorting files. This determines the number of open file handles.
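For illustration, a mapred-site.xml entry that raises the merge factor; 100 is an example value chosen here, not a recommendation from the text, and a higher value means more open file handles per merge:
<property>
  <name>io.sort.factor</name>
  <value>100</value>
  <description>Number of spill streams merged at once, raised from the default of 10 (example value).</description>
</property>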
3. Number of handler (server) threads in the JobTracker (default 10).
mapred.job.tracker.handler.count 10 The number of server threads for the JobTracker. This should be roughly 4% of the number of tasktracker nodes.
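Following the "roughly 4% of the number of tasktracker nodes" guideline in the description, a hypothetical 500-node cluster would get about 20 handler threads:
<property>
  <name>mapred.job.tracker.handler.count</name>
  <value>20</value>
  <description>JobTracker server threads, sized at roughly 4% of an assumed 500 tasktracker nodes.</description>
</property>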
4. Default number of map and reduce tasks per job (map default 2, reduce default 1).
mapred.map.tasks 2 The default number of map tasks per job. Ignored when mapred.job.tracker is "local".
mapred.reduce.tasks 1 The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when mapred.job.tracker is "local".
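A sketch that follows the 99%-of-reduce-capacity guideline above for a hypothetical cluster with 20 reduce slots (10 tasktrackers with 2 reduce slots each), giving 19 reduces:
<property>
  <name>mapred.reduce.tasks</name>
  <value>19</value>
  <description>Roughly 99% of an assumed 20-slot reduce capacity, so a lost node still allows a single wave.</description>
</property>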
5. Maximum number of map/reduce tasks a single tasktracker node will run concurrently (map default 2, reduce default 2).
mapred.tasktracker.map.tasks.maximum 2 The maximum number of map tasks that will be run simultaneously by a task tracker.
mapred.tasktracker.reduce.tasks.maximum 2 The maximum number of reduce tasks that will be run simultaneously by a task tracker.
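Purely as an illustration, a node with more CPU cores could be given more slots; the 4 map / 2 reduce split below is an assumed example, not a recommendation from the text:
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
  <description>Concurrent map slots on this tasktracker (example value).</description>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
  <description>Concurrent reduce slots on this tasktracker (example value).</description>
</property>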
6. Limit on the number of completed jobs per user kept in the JobTracker's memory (default 100).
This is the maximum number of completed jobs held before they are handed off to the job history. Since we usually only care about the running queue, it is worth considering lowering this value to reduce memory usage. From version 0.21.0 onward this parameter no longer needs to be set, because 0.21 reworked the completeuserjobs logic to flush these jobs to disk as soon as possible instead of keeping them in memory for long.
mapred.jobtracker.completeuserjobs.maximum 100 The maximum number of complete jobs per user to keep around before delegating them to the job history.
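On versions before 0.21.0, lowering the limit might look like this; 10 is an arbitrary example value:
<property>
  <name>mapred.jobtracker.completeuserjobs.maximum</name>
  <value>10</value>
  <description>Fewer completed jobs kept in JobTracker memory per user (example value, pre-0.21 only).</description>
</property>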
7. Number of parallel copiers a reduce uses when fetching map output (default 5).
By default each reduce has only 5 parallel download threads copying data from the maps, so even if 100 or more maps for a job finish within a given window, the reduce can still only fetch data from 5 maps at a time. This parameter is therefore worth raising for jobs with many maps that complete quickly, so the reduces can pull their share of the data sooner.
mapred.reduce.parallel.copies 5 The default number of parallel transfers run by reduce during the copy(shuffle) phase.
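A mapred-site.xml sketch that raises the shuffle parallelism; 20 is an illustrative value, not one taken from the text:
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>20</value>
  <description>Parallel map-output fetch threads per reduce during the shuffle (example value).</description>
</property>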
8. Whether to compress the map's intermediate output (default false).
When this parameter is set to true, the map compresses its intermediate results before writing them to disk, and the data is decompressed when it is read back. The effect is that less intermediate data is written to disk, at the cost of some CPU for compressing and decompressing. This approach therefore suits jobs whose intermediate results are very large and whose bottleneck is disk I/O rather than CPU; put plainly, it trades CPU for I/O.
mapred.compress.map.output false Should the outputs of the maps be compressed before being sent across the network. Uses SequenceFile compression.
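Turning it on is a one-line change in mapred-site.xml:
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
  <description>Compress intermediate map output to trade CPU for disk and network I/O.</description>
</property>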
9. Heap size for the child JVMs the tasktracker launches (default 200 MB).
mapred.child.java.opts -Xmx200m -verbose:gc -Xloggc:/tmp/@taskid@.gc Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc The configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child processes.
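Written out as a property, the 1 GB example given in the description above looks like this:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc</value>
  <description>1 GB task heap with verbose GC logging written to a per-task file, as in the example from the description.</description>
</property>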