A Journey Into Big Data with Apache Spark: Part 1

Simply tweak the docker run command to add the --name, --hostname and -p options as per below, and run it:

docker run --rm -it --name spark-master --hostname spark-master -p 7077:7077 -p 8080:8080 $MYNAME/spark:latest /bin/sh

Run docker ps and you should see the container running, with output similar to this (I've removed some output to make it fit in the code block):

CONTAINER ID  PORTS                                  NAMES
3dfc3a95f7f4  ..:7077->7077/tcp, ..:8080->8080/tcp   spark-master

In the container, re-run the command to start the Spark Master. Once it's up, you should be able to browse to http://localhost:8080 and see the WebUI for the cluster, as per the screenshot below.

[Screenshot: Spark Master WebUI]

Adding Worker Nodes

As I mentioned, I'm using Docker for Mac, which makes DNS painful and accessing the container by IP nigh on impossible without running a local VPN server or something similar to work around the issue; that's beyond the scope of this post. Luckily, Docker has its own networking capability (the specifics of which are out of scope of this post, too), which we'll use to create a network for the local cluster to sit within.

Creating a network is pretty simple and is done by running the following command:

docker network create spark_network

We don't need to specify any particular options, as the defaults are fine for our use case.

Now we need to recreate our Master to attach it to the new network. Run docker stop spark-master and docker rm spark-master to remove the current instance of the running Master. To recreate the Master on the new network, we can simply add the --network option to docker run, as per the below:

docker run --rm -it --name spark-master --hostname spark-master -p 7077:7077 -p 8080:8080 --network spark_network $MYNAME/spark:latest /bin/sh

This is really no different to the first time we ran the Spark Master, except it uses a newly defined network that we can attach Workers to, to make the cluster work.

Now that the Master is up and running, let's add a Worker node to it. This is where the magic of Docker really shines through. To create a Worker and add it to the cluster, we can simply launch a new instance of the same Docker image and run the command to start the Worker. We'll need to give the Worker a new name and map a different host port for its WebUI (7077 and 8080 are already taken by the Master), but other than that the command remains largely the same:

docker run --rm -it --name spark-worker --hostname spark-worker -p 8081:8080 --network spark_network $MYNAME/spark:latest /bin/sh

And to start the Spark Worker on the container, we simply run:

/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8080 spark://spark-master:7077

When it's started and connected to the Master, you should see the last line of the output being:

INFO Worker:54 - Successfully registered with master spark://spark-master:7077

And the Master will output the following line:

INFO Master:54 - Registering worker 172.21.0.2:37013 with 4 cores, 1024.0 MB RAM

Congratulations! You've set up a Spark cluster using Docker!

But Does it Work?

To check it works, we can load the Master WebUI, where we should see the Worker node listed under the "Workers" section, but this only really confirms the log output from attaching the Worker to the Master.

[Screenshot: Spark Master WebUI with Worker]

To be a true test, we need to actually run some Spark code across the cluster. Let's run a new instance of the Docker image so we can run one of the examples provided when we installed Spark. Again, we can reuse the existing Docker image and simply launch a new instance to use as the driver (the thing that submits the application to the cluster).
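A minimal sketch of that final step might look like the following, assuming Spark is installed under /spark inside the image (as in the earlier commands) and that the standard spark-examples jar ships with the build; the exact jar name will depend on the Spark and Scala versions you installed.

```sh
# Launch a throwaway container on the spark_network to act as the driver.
docker run --rm -it --network spark_network $MYNAME/spark:latest /bin/sh

# Inside that container, submit the bundled SparkPi example to the cluster.
# The jar path is an assumption based on a standard Spark layout under /spark;
# the glob should pick up whichever spark-examples jar your version includes.
/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --class org.apache.spark.examples.SparkPi \
  /spark/examples/jars/spark-examples_*.jar 100
```

If the cluster is wired up correctly, the application should appear under "Running Applications" in the Master WebUI while it executes, and the driver output should finish with a rough estimate of Pi.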
