How to Optimize a Hadoop Cluster for High Performance?
A Hadoop cluster is the central part of the Hadoop framework: a group of systems linked together over a LAN to store and process big data sets. The cluster is built from many commodity machines that communicate with a higher-end system acting as the master.
Installing a Hadoop cluster in production is like stepping onto a battlefield: a Hadoop admin has to tune the cluster setup to achieve high performance. Out of the box, Hadoop ships with default settings that assume modest hardware, so admins must be familiar with the different hardware configurations their cluster runs on.
There is no single performance tuning technique that fits every Hadoop workload. The right tools and tricks depend on the size of the data being moved and on the type of Hadoop jobs that run in production.
The biggest selling point of Apache Hadoop as a big data processing framework is how cost-effectively data centers can be configured to process large volumes of structured and unstructured data. However, the main obstacle to getting high performance out of a Hadoop cluster is its commodity hardware stack. A Hadoop admin therefore has to make the best use of the cluster's capacity to get the best performance from that hardware.
To learn more, go through the complete Hadoop admin tutorials in OnlineItGuru's hadoop admin online course.
Hadoop cluster performance tuning
Let us discuss in detail some of the most effective performance tuning techniques for Hadoop clusters built on commodity hardware, aimed at increasing cluster performance while reducing operational cost.
Hadoop cluster memory
The first step towards high performance for a Hadoop job is tuning the memory configuration parameters, which requires observing memory usage on the servers. Hadoop offers several options around memory, CPU, and network that help optimize cluster performance. In addition, each Hadoop MapReduce job collects statistics such as the number of input records read, the number of records spilled for further processing, reducer records, swap memory usage, and so on.
Hadoop jobs are typically not CPU-bound, so the prime concern is optimizing memory usage. The rule of thumb while tuning memory is to ensure that tasks never start swapping. The memory available to a task is adjusted by modifying mapred.child.java.opts in the mapred-site.xml file.
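For example, here is a minimal sketch assuming the classic mapred.child.java.opts property named above (newer Hadoop releases split it into mapreduce.map.java.opts and mapreduce.reduce.java.opts); the 2 GB heap value is purely illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MemoryTuningExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Give each child task JVM a 2 GB heap; size it so that all concurrent
            // tasks fit in physical RAM and the node never swaps.
            conf.set("mapred.child.java.opts", "-Xmx2048m");
            Job job = Job.getInstance(conf, "memory-tuned-job");
            // ... mapper, reducer and input/output paths would be configured here ...
        }
    }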
IO Performance Improvement
There are a few key factors to consider while optimizing MapReduce IO performance; following them ensures that the Hadoop cluster setup is well tuned.
The Linux operating system keeps metadata for every file, such as checksums, last access time, creation time, and the owning user. To get better IO performance, access-time tracking should be disabled on the disks that back HDFS. HDFS follows a write-once-read-many model, so applications read data from the file system far more often than they update it.
The mount points used by the DataNodes for HDFS and MapReduce storage should therefore be mounted with the noatime option. This disables access-time tracking, so the file metadata is not updated every time data is read, which gives a noticeable IO performance improvement.
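For instance (the device name and mount point are purely illustrative), an /etc/fstab entry for a DataNode data disk with access-time tracking disabled could look like:

    /dev/sdb1  /hadoop/hdfs/data1  ext4  defaults,noatime  0 0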
It is also important not to use LVM or RAID on DataNode systems, as they reduce performance.
Minimize Disk Spill
Disk IO is often the major bottleneck to speeding up performance. Two ways to minimize disk spilling are:
● Give the mapper about 70% of its heap memory for the spill buffer.
● Compress the mapper output.
Ideally a task should not spill more than once; every additional spill forces the intermediate data to be read and written again.
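A hedged sketch of the spill-related settings (the property names are the Hadoop 2.x ones, mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent; the values are only illustrative):

    import org.apache.hadoop.conf.Configuration;

    public class SpillTuningExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // In-memory sort buffer used by each map task, in MB.
            // Per the rule of thumb above, size it to roughly 70% of the map heap.
            conf.set("mapreduce.map.java.opts", "-Xmx1024m");
            conf.setInt("mapreduce.task.io.sort.mb", 700);
            // Let the buffer fill further before a spill to disk is triggered.
            conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
        }
    }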
LZO Compression
When the map output is large, the intermediate data can be shrunk with compression techniques such as LZO, BZip2, or Snappy. Map output is not compressed by default; to enable it, the property mapreduce.map.output.compress is set to true, and a companion setting selects the compression codec to use (LZO, Snappy, etc.).
Any MapReduce job that emits a large map output benefits from compressing the intermediate data with LZO; with LZO compression, each 1 GB of output data can save up to 3 GB of disk space. If a map task still has to write a large amount of data to disk, increasing the size of its in-memory buffer helps as well.
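A hedged sketch of enabling map-output compression (SnappyCodec ships with Hadoop; the LZO codec class mentioned in the comment comes from the separately installed hadoop-lzo package, so treat that class name as an assumption):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class MapOutputCompressionExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Turn on compression of the intermediate map output.
            conf.setBoolean("mapreduce.map.output.compress", true);
            // Pick a codec. SnappyCodec is bundled with Hadoop; for LZO the
            // hadoop-lzo package provides com.hadoop.compression.lzo.LzoCodec.
            conf.setClass("mapreduce.map.output.compress.codec",
                          SnappyCodec.class, CompressionCodec.class);
        }
    }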
Tuning the Number of Mapper/Reducer Tasks
Each map or reduce task should generally take around 40 seconds to complete. A big job, as submitted, often does not use all the slots available within the Hadoop cluster, so it is important to tune the number of map or reduce tasks using the techniques below:
● If a MapReduce job has more than 1 TB of input, reduce the number of map tasks by increasing the block size of the input dataset to 512 MB or more. The block size of existing files can also be changed via the dfs.block.size setting by re-writing the data with the new block size; once that copy exists, the original data can be deleted (a configuration sketch follows this list).
● If a MapReduce job on the cluster launches a large number of map tasks and each task completes in just a few seconds, reducing the number of maps launched, without otherwise impacting the cluster setup, will help optimize its performance.
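A minimal sketch of the first technique, assuming the current dfs.blocksize property name (the text's dfs.block.size is the older, deprecated name) and illustrative sizes:

    import org.apache.hadoop.conf.Configuration;

    public class BlockSizeTuningExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Write new files with a 512 MB block size, so a 1 TB input produces
            // roughly 2,000 map tasks instead of roughly 8,000 with 128 MB blocks.
            conf.setLong("dfs.blocksize", 512L * 1024 * 1024);
            // Existing files keep their old block size; re-copy them under this
            // configuration (e.g. with distcp), then delete the originals.
        }
    }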
Using Skewed Joins
Using standard joins in transform logic written with Pig or Hive can be fatal for the performance of the resulting MapReduce jobs, because the data being processed may be skewed, for example with 80% of the data going to a single reducer. When one key carries a very large share of the data, the reducer handling that key lags behind with most of the processing. This is where a skewed join helps: it computes a histogram to identify the dominant keys and then splits the data for those keys across different reducers to achieve optimal performance.
Writing a Combiner
Depending on the workload running on the Hadoop cluster, writing a combiner helps reduce the amount of data transferred between the map and reduce phases. It is useful in addition to data compression, and it also proves beneficial in improving overall cluster performance.
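As an illustrative sketch (the word-count style sum reducer and the class names here are assumptions, not from the text), an associative and commutative reducer can be reused as a combiner so the map side pre-aggregates records before the shuffle:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CombinerExample {
        // A sum reducer is associative and commutative, so it can double as a combiner.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws java.io.IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combiner-example");
            job.setJarByClass(CombinerExample.class);
            // Running the same class on the map side pre-aggregates records
            // and shrinks the data shuffled to the reducers.
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            // ... mapper, input/output formats and paths would be set here ...
        }
    }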
Speculative Execution
The performance of a MapReduce job suffers severely when some tasks take much longer than the rest to finish. Speculative execution is a general approach to this problem: slow-running tasks are backed up by duplicate attempts on other machines.
Setting the configuration parameters mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution to true enables speculative execution. This helps reduce job execution time when task progress becomes slow, for example due to a lack of memory.
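A minimal sketch, assuming the classic MRv1-era property names given above (newer Hadoop releases expose the shorter names mapreduce.map.speculative and mapreduce.reduce.speculative):

    import org.apache.hadoop.conf.Configuration;

    public class SpeculativeExecutionExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Classic property names; newer releases use
            // mapreduce.map.speculative / mapreduce.reduce.speculative.
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
        }
    }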
There are many performance optimization tips and tricks for a Hadoop cluster, and we have discussed some of the most important ones above. The Hadoop community also keeps publishing tips that help achieve optimal performance. Because Hadoop scales horizontally, admins can keep adding instances to the cluster to push performance higher, which is why many users prefer to run their own Hadoop cluster rather than share one. The tips mentioned above should help in achieving the best performance.
To get more practical results guided by experts, join the hadoop admin online training at Online IT Guru. This learning can help you strengthen your Hadoop skills and move into a better career.