Running a MapReduce Word Count Application in Docker Using Python SDK
A demonstration of the working principles behind MapReduce
Pei Seng Tan
Jun 1
Motivation
If you read through the financial reports of publicly listed companies in Malaysia, you will find a section that discusses the Sustainable Development Goals (SDGs).
There are 17 SDGs in total, and you may need a tool to extract this data from the financial reports and carry out some analysis.
One simple application is to count the occurrences of certain keywords related to SDGs such as “green,” “health,” “education,” “equality,” etc.
With a traditional approach, which does not involve the concept of distributed processing, the job of word counting might not be easy or quick.
In addition, it would be very computationally expensive.
Intention
The purpose of this project is to develop a simple word count application that demonstrates the working principle of MapReduce, involving multiple Docker Containers as the clients, to meet the requirements of distributed processing, using the Python SDK for Docker.
What Is Docker?
Docker is a tool that allows developers to create, deploy, and run applications easily by using containers.
Containers package up an application with all the needed parts, such as libraries and other dependencies, and represent all of these as a single package.
With Docker, the application can run on any other machine, regardless of any customized settings that machine might have that differ from the machine used for writing and testing the code.
Docker is a bit like a virtual machine but it does not create a whole virtual operating system.
It allows applications to use the same kernel as the system they are running on.
This provides a significant performance boost and reduces the size of the application.
It is open source as well.
What Is MapReduce?
MapReduce is a programming model designed for processing and generating big data sets with a parallel, distributed algorithm across one or more clusters.
It consists of two main phases, which are Map and Reduce.
The Map function takes a set of data and breaks them down into key-value pairs.
In this project, the inputs of the MapReduce will be the texts inside the section of SDGs in the companies’ financial reports.
The Reduce function then takes the outputs from the Map function as its inputs and reduces the key-value pairs into unique keys, with values computed according to the algorithm defined in the Reduce function.
In this project, all the key-value pairs will be added up to find the total occurrences for the specific words.
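The two phases described above can be illustrated with a minimal, single-process sketch. This is not the project's code: the function names `map_words` and `reduce_pairs` and the sample sentence are hypothetical, chosen only to show how key-value pairs are emitted and then summed per key.

```python
import re
from collections import defaultdict


def map_words(text):
    """Map phase: emit a (word, 1) pair for every word in the text."""
    return [(word, 1) for word in re.findall(r"[a-z]+", text.lower())]


def reduce_pairs(pairs):
    """Reduce phase: sum the values of all pairs that share the same key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Hypothetical sample text, standing in for an SDG section of a report.
sample = "Green energy and health. Health and education."
counts = reduce_pairs(map_words(sample))
print(counts)
```

In a distributed setting, the Map step would run in several containers at once and only the Reduce step would need to see all the pairs.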
Data Set Preparation
Download the latest financial reports of at least 40 different Malaysian listed companies into a temporary folder.
The reports I downloaded come from companies including Asia Media, Maybank, Stone Master, and 7–11.
(Optional) Record the starting and ending page numbers of the targeted information in the financial reports discussing the SDGs, as shown in the figure below.
Figure: The starting and ending page numbers of the SDGs section for all the financial reports are recorded.
Convert them into text file (.txt) format to ease further processing.
Two free online tools are recommended: “Zamzar Online File Conversion” and “I Love PDF.”
Referring to the figure above, the first output of the converter will contain only the words present on pages 24 to 39 of the first financial report.
The converter removes the content on all other pages of that report.
a) “Zamzar Online File Conversion”: https://www. com/convert/pdf-to-txt
b) “I Love PDF”: https://www.
Save all the converted text files in the folder named “data,” as shown in the figure below.
Figure: A total of 40 converted text files are stored in the folder named “data.”
Create a blank folder named “out” in the same directory as the folder named “data.”
Develop Python Code for MapReduce in a Container
Assume that one of the Docker Containers has received the files to be processed from the host machine, which distributes the tasks to numerous containers.
Firstly, define the needed modules, libraries, and directories.
The data directory is where a container reads its input files, whereas the output directory is where the container stores the processed outputs.
Get the current directory and split the string of file names sent from the host machine.
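The setup steps above might look like the following sketch. It rests on assumptions the article does not confirm: the mount points `/data` and `/out`, the helper name `parse_file_names`, and the idea that the host passes the file names as one comma-separated command-line argument.

```python
import os
import sys

# Hypothetical mount points inside the container (assumed, not from the article).
DATA_DIR = "/data"
OUTPUT_DIR = "/out"


def parse_file_names(argument):
    """Split a comma-separated string of file names into a clean list."""
    return [name.strip() for name in argument.split(",") if name.strip()]


if __name__ == "__main__":
    current_dir = os.getcwd()
    # The host is assumed to pass the file names as the first argument.
    file_names = parse_file_names(sys.argv[1]) if len(sys.argv) > 1 else []
    print(current_dir, file_names)
```

Keeping the parsing in a small function makes it easy to check on the host machine before the image is built.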
Define a function to convert the content of a text file into the key-value pairs.
These key-value pairs are then reduced and stored in dictionary format.
The developed function will be looped for all the text files assigned by the host machine for processing.
The results are saved in JSON format in the folder named “out” in the container.
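A minimal sketch of the counting and saving steps described above is given here. It is one possible approach rather than the project's actual code: the function names `count_words` and `process_files`, the word pattern, and the choice of one JSON file per input file are all assumptions.

```python
import json
import os
import re
from collections import Counter


def count_words(text):
    """Map the text to lowercase words, then reduce them to {word: occurrences}."""
    return dict(Counter(re.findall(r"[a-z]+", text.lower())))


def process_files(file_names, data_dir, output_dir):
    """Count the words in each assigned file and save one JSON file per input."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for name in file_names:
        in_path = os.path.join(data_dir, name)
        with open(in_path, encoding="utf-8", errors="ignore") as handle:
            counts = count_words(handle.read())
        # Assumed naming scheme: report.txt -> report.json
        out_path = os.path.join(output_dir, os.path.splitext(name)[0] + ".json")
        with open(out_path, "w", encoding="utf-8") as handle:
            json.dump(counts, handle)
        written.append(out_path)
    return written
```

Writing one result file per report keeps the containers independent of each other, so the host only has to merge the JSON files afterwards.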
Save the code with the filename “docker_analyze.py”.
Create an Image’s Dockerfile
In order to build a Docker Container successfully, you need a Docker Image.
To use a programming metaphor, if an image is a class, then a container is an instance of a class — a runtime object.
A Docker Image is an inert, immutable file that’s essentially a snapshot of a container.
An Image is created with the build command, and a container will be produced when started with run.
A Docker image is built up from a series of layers.
Each layer represents an instruction in the image’s Dockerfile.
The commands inside the Dockerfile and their brief explanations are stated below.
After defining these commands, save the file with the name “Dockerfile” without any file extension.
Create a layer from the ubuntu:16.
Make sure the operating system is up to date.
Install Python3 libraries.
Copy the file “docker_analyze.py” from the host machine to the Docker Container.
Run the file “docker_analyze.py” when the Docker Container is successfully created.
Figure: Prepare the commands inside the Dockerfile with a simple code editor like Notepad++.
Figure: Save the file with the name “Dockerfile” without any file extension.
Create a Docker Image and Try to Run a Docker Container
Make sure that the Docker Daemon on the host machine is running.
Open a command prompt or “git bash” window and type the command below to build the image.
docker build --tag analysis_test .
Figure: A Docker Image is built successfully.
Verify that the Docker Image built in the previous step exists by typing the command below in the command prompt, which lists all available Docker Images.
docker image ls
Figure: The Docker Image with tag name “analysis_test” is built.
(Optional) You can type the command below to run a Docker Container by using Docker Image as a base.
However, the run will fail this time because you haven’t defined the input for the Docker Container, which is a string containing the names of all the financial report files to be processed by this particular container.
docker run analysis_test
Figure: The Docker Container fails to run because its input is missing.
Create a Function to Establish the Communication Path between Docker Daemon and Host Machine
Open a new file named “docker_parallelize.”
Import required modules and libraries.
Pre-initialize some variables.
As shown, the “CONTAINER_NAME” is defined as “analysis_test,” which is the name used for tagging the Docker Image previously.
# Docker Configuration
# Docker Image
CONTAINER_NAME = 'analysis_test'
Container_count = 1
The variables named “DATA_DIRECTORY” and “OUTPUT_DIRECTORY” are used to establish the connection between the host machine and the Docker Containers, so that the containers can read files from, and write files to, the host machine directory.
# Volume Bind between Docker and Local OS
DATA_DIRECTORY = 'c:/Users/pei-seng.tan/Desktop/Docker/docker-map-reduce-example/data'
OUTPUT_DIRECTORY = 'c:/Users/pei-seng.tan/Desktop/Docker/docker-map-reduce-example/out'
Only the search results for the following SDG-related words will be displayed to the user.
# Some targeted words of Sustainable Development Goal
UN_word_list = ["poverty", "hunger", "health", "education", "equality", "sanitation", "clean", "growth", "innovation", "sustainable", "production", "climate", "water", "life", "peace", "partnerships", "resources"]
Create a function that performs the distributed processing by dividing the files to be processed among the Docker Containers.
The number of files to be processed for each Docker Container depends on the number of input files and the number of containers set by the user.
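The distribution step described above could be sketched as follows. The helper names `split_files` and `launch_containers`, the round-robin assignment, and the container-side mount points `/data` and `/out` are assumptions, not the project's confirmed design. The sketch also assumes the image's entrypoint passes the command string to the analysis script as its argument. `docker.from_env()` and `client.containers.run()` are calls from the Docker SDK for Python, and they need a running Docker Daemon.

```python
def split_files(file_names, container_count):
    """Divide the file names into up to container_count groups of near-equal size."""
    groups = [[] for _ in range(container_count)]
    for index, name in enumerate(file_names):
        groups[index % container_count].append(name)  # round-robin assignment
    return [group for group in groups if group]


def launch_containers(file_names, container_count, image, data_dir, output_dir):
    """Start one detached container per group of files."""
    import docker  # imported here so split_files works without the SDK installed

    client = docker.from_env()
    volumes = {
        data_dir: {"bind": "/data", "mode": "ro"},   # assumed input mount point
        output_dir: {"bind": "/out", "mode": "rw"},  # assumed output mount point
    }
    containers = []
    for group in split_files(file_names, container_count):
        # The file names are passed as one comma-separated string (assumption).
        container = client.containers.run(
            image, command=",".join(group), volumes=volumes, detach=True
        )
        containers.append(container)
    return containers
```

With six files and two containers, `split_files` yields two groups of three files, so each container receives an equal share of the work.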
Develop Python Code for Combining and Displaying the Processed Result Done by Docker Containers and Integrating with Simple User Interface
Demonstration
Run the application by typing the command below.
The user interface will pop up, as shown in the figure below.
Figure: User Interface
Select the number of containers (the maximum is four, which is set in the code).
In this demo, I set two.
Figure: Select the number of containers.
Select the number of files to be processed.
In this demo, I select six.
Figure: Select six financial reports to process.
Click the button named “Run Word Count.”
Figure: Click the button “Run Word Count.”
The result will be displayed at the bottom part of the user interface.
Figure: The user interface shows the total word count among the selected six companies’ financial reports.
The words “sustainable,” “health,” and “growth” rank as the top three by total word count.
By scrolling through the log information displayed in the center of the user interface, you can get more information about the search results.
a) The containers’ IDs and the files assigned to each container.
Figure: Container IDs and Files assigned to each container.
b) The total word count for each company.
Figure: The total word count for each company.
c) The overall completion time is eight seconds.
After investigation, the delay was found to be caused by combining all the content on the host machine.
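The combining step on the host machine, mentioned above as the source of the delay, might look like the sketch below. It assumes each container has written one JSON file of word counts per report into the output folder. The function name `combine_results` is hypothetical, and this is not the project's actual implementation.

```python
import json
import os
from collections import Counter


def combine_results(output_dir, target_words):
    """Sum the per-file JSON word counts and keep only the target words."""
    totals = Counter()
    for name in sorted(os.listdir(output_dir)):
        if name.endswith(".json"):
            with open(os.path.join(output_dir, name), encoding="utf-8") as handle:
                totals.update(json.load(handle))  # adds counts key by key
    return {word: totals.get(word, 0) for word in target_words}
```

Because this merge runs sequentially on the host, its cost grows with the number of result files, which is consistent with the delay observed in the demonstration.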
GitHub Source Code: PeiSeng/Docker-MapReduce-Word_Count-Python_SDK
Appreciation
This project was contributed by Tan Pei Seng, Yeap Soon Kent, Elvis Yoon Yu Jing and Cheng Xiyang for CDS 504 Enabling Technologies & Infrastructure for Big Data, one of the core subjects under Master of Data Science and Analytics in Universiti Sains Malaysia.