Hadoop Running a MapReduce JobBig DataDiogo RibeiroBlockedUnblockFollowFollowingJun 9Photo by Capturing the human heart.
on UnsplashMapReduce PreparationBefore we jump into the programming our MapReduce, we may need to talk about the preparation steps that are commonly taken.
Because MapReduce is usually operating on huge data, we need to consider those steps before we actually do the MapReduce.
The underlying structure of the HDFS filesystem is very different from our normal file systems.
The block sizes are quite a bit larger, and the actual block size for our clusters dependent on the cluster configuration as shown in the picture below: 64, 128, or 256 MB.
So, we may need to have blocks with customized partitioned.
Another consideration is where we’re going to retrieve our data from in order to perform the MapReduce operations or the parallel processing on it.
Though we’ll work with the core Hadoop filesystem, we may execute MapReduce algorithms against information stored on different locations such as native filesystem, cloud storage such as Amazon S3 buckets, or Windows Azure blobs.
Another consideration is the output of the MapReduce job results are immutable.
So, our output is a one-time output, and when new output is generated, we have a new file name for it.
The last consideration in preparing for MapReduce is about the logic that we’ll be writing, and it should fit our situation that we’re trying to address.
We’ll be writing the logic in some programming language, library, or tools to map our data to, and then reduce it, and then we have some output.
Note also that we’ll be working with key-value pairs, so regardless of the format of the data coming in, we want to output key-value pairs.
Hadoop shell commandsBefore performing MapReduce jobs, we should be familiar with some of the Hadoop shell commands.
Running a MapReduce JobNow it’s time to run our first Hadoop MapReduce job.
We will use one of the examples that come with Hadoop package.
hduser@laptop:~$ cd /usr/local/hadoophduser@laptop:/usr/local/hadoop$ lsbin include libexec logs README.
txt shareetc lib LICENSE.
txt sbinThe following sample “pi” program calculates the value of pi using a quasi-Monte Carlo method with MapReduce.
This is a straightforward example we can use to verify that we can run MapReduce jobs.
hduser@laptop:/usr/local/hadoop$ hadoop jar .
jar pi 2 5Number of Maps = 2Samples per Map = 5Wrote input for Map #0Wrote input for Map #1Starting Job16/11/18 10:03:53 INFO Configuration.
id is deprecated.
Instead, use dfs.
session-id16/11/18 10:03:53 INFO jvm.
JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=16/11/18 10:03:54 INFO input.
FileInputFormat: Total input paths to process : 216/11/18 10:03:54 INFO mapreduce.
JobSubmitter: number of splits:2.
File Input Format Counters Bytes Read=236 File Output Format Counters Bytes Written=97Job Finished in 9.
25 secondsEstimated value of Pi is 3.
60000000000000000000hduser@laptop:/usr/local/hadoop$We can vary both the number of map tasks we want to create as well as the number of samples to compute:hduser@laptop:/usr/local/hadoop$ hadoop jar .
jar pi 16 1000Number of Maps = 16Samples per Map = 1000.
File Input Format Counters Bytes Read=1888 File Output Format Counters Bytes Written=97Job Finished in 9.
817 secondsEstimated value of Pi is 3.
14250000000000000000Hadoop FileSystem (HDFS)Files are stored in the Hadoop Distributed File System (HDFS).
Suppose we’re going to store a file called data.
txt in HDFS.
This file is 160 megabytes.
When a file is loaded into HDFS, it’s split into chunks which are called blocks.
The default size of each block is 64 megabytes.
Each block is given a unique name, which is blk, an underscore, and a large number.
In our case, the first block is 64 megabytes.
The second block is 64 megabytes.
The third block is the remaining 32 megabytes, to make up our 160-megabyte file.
As the file is uploaded to HDFS, each block will get stored on one node in the cluster.
There’s a Daemon running on each of the machines in the cluster, and it is called the DataNode.
Now, we need to know which blocks make up the original file.
And that’s handled by a separate machine, running the Daemon called the NameNode.
The information stored on the NameNode is known as the Metadata.
HDFS CommandsWhile Hadoop is running, let’s create hdfsTest.
txt in our home directory (/home/hduser) in the local disk:hduser@laptop:~$ echo "hdfs test" > hdfsTest.
txtThen, we want to create the Home Directory in HDFS :hduser@laptop:~$ hdfs dfs -mkdir -p /user/hduserNote that, in the command, the dfs is defined as “runs a filesystem command on the file systems supported in Hadoop” in its man page.
We can copy file hdfsTest.
txt from local disk to the user’s directory in HDFS:hduser@laptop:~$ hdfs dfs -copyFromLocal ~/hdfsTest.
txtWe could have used put instead of copyFromLocal:hduser@laptop:~$ hdfs dfs -put ~/hdfsTest.
txtGet a directory listing of the user’s home directory in HDFS:hduser@laptop:~$ hdfs dfs -lsFound 1 items-rw-r–r– 1 hduser supergroup 10 2016-11-18 11:29 hdfsTest.
txtIf we want to display the contents of the HDFS file /user/hduser/hdfsTest.
txt:hduser@laptop:~$ hdfs dfs -cat /user/hduser/hdfsTest.
txtCopy that file to the local disk from HDFS, named as hdfsTest2.
txt :hduser@laptop:~$ hdfs dfs -copyToLocal /user/hduser/hdfsTest.
txtTo delete the file from Hadoop HDFS:hduser@laptop:~$ hdfs dfs -rm hdfsTest.
txt16/11/18 11:37:36 INFO fs.
TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
txtCheck if it’s been really deleted from Hadoop HDFS:hduser@laptop:~$ hdfs dfs -lshduser@laptop:~$Hadoop Setup for DevelopmentThroughout my tutorials on Hadoop Echo Systems, I used:Hadoop Binaries — Local (Linux), Cloudera’s Demo VM, and AWS for Cloud.
Data Storage — Local (HDFS Pseudo-distributed, single-node) and Cloud.
MapReduce — Both Local and Cloud.
Ways to MapReduceJava is the most common language to use, but other languages can be used:.