1.1 Billion Taxi Rides with Spark 2.2 & 3 Raspberry Pi 3 Model Bs
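The iotop snapshots below show per-thread disk activity on the nodes while the queries were running. The exact invocation wasn't recorded, but the cumulative DISK READ figures suggest accumulated mode was in use — something along these lines (an assumption, not the logged command):

$ sudo iotop -o -a   # -o: only show threads actually doing I/O, -a: accumulated totals instead of rates

With -a, the per-thread DISK READ column shows total bytes since iotop started (hence figures like 433.00 M), while the header lines still report current per-second throughput.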

Total DISK READ :       5.87 M/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:      12.22 M/s | Actual DISK WRITE:    1551.84 K/s
  TID  PRIO  USER     DISK READ>  DISK WRITE  SWAPIN     IO    COMMAND
 5590 be/4  pi        433.00 M    100.00 K   12.19 %  0.00 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~r.SparkSQLCLIDriver --num-executors 3 spark-internal
 5757 be/4  pi         32.13 M      4.00 K    1.07 %  0.00 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~r.SparkSQLCLIDriver --num-executors 3 spark-internal
 5594 be/4  pi         26.69 M     12.00 K    1.59 %  0.01 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~r.SparkSQLCLIDriver --num-executors 3 spark-internal
 5761 be/4  pi         21.22 M      0.00 B    1.20 %  0.00 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~r.SparkSQLCLIDriver --num-executors 3 spark-internal
 5755 be/4  pi         19.81 M      0.00 B    1.03 %  0.00 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~r.SparkSQLCLIDriver --num-executors 3 spark-internal
 5679 be/4  pi         18.87 M      0.00 B    0.69 %  0.01 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~r.SparkSQLCLIDriver --num-executors 3 spark-internal
 5708 be/4  pi         16.49 M      0.00 B    0.44 %  0.01 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~r.SparkSQLCLIDriver --num-executors 3 spark-internal
 5677 be/4  pi         16.01 M      0.00 B    0.67 %  0.04 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~r.SparkSQLCLIDriver --num-executors 3 spark-internal

Total DISK READ :      18.41 M/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:      11.65 M/s | Actual DISK WRITE:     713.97 K/s
  TID  PRIO  USER     DISK READ>  DISK WRITE  SWAPIN     IO    COMMAND
  606 be/4  root        5.05 G    524.00 K    0.00 % 26.04 %  mount.exfat /dev/sda1 /mnt/usb -o rw,nonempty
25774 be/4  root      148.02 M     96.00 K    3.17 %  0.77 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~-0000 --worker-url spark://Worker@192.168.0.22:43827
 1231 be/4  root        6.48 M      0.00 B    0.00 % 24.02 %  java -Dproc_datanode -Xmx1000m -Djava.library.path=~RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
25833 be/4  root        5.93 M      0.00 B    0.04 %  0.01 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~-0000 --worker-url spark://Worker@192.168.0.22:43827
25614 be/4  root        5.63 M     24.00 K    0.21 %  0.00 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~ploy.worker.Worker --webui-port 8081 spark://r1:7077
25779 be/4  root        5.25 M     24.00 K    4.06 %  0.61 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~-0000 --worker-url spark://Worker@192.168.0.22:43827
25618 be/4  root        4.75 M      0.00 B    1.71 %  0.02 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~ploy.worker.Worker --webui-port 8081 spark://r1:7077

Total DISK READ :      31.17 M/s | Total DISK WRITE :      27.10 K/s
Actual DISK READ:      25.37 M/s | Actual DISK WRITE:     304.85 K/s
  TID  PRIO  USER     DISK READ>  DISK WRITE  SWAPIN     IO    COMMAND
  563 be/4  root        5.80 G    564.00 K    0.00 % 17.56 %  mount.exfat /dev/sda1 /mnt/usb -o rw,nonempty
25069 be/4  root      200.09 M    132.00 K    4.73 %  2.72 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~-0000 --worker-url spark://Worker@192.168.0.25:44851
25238 be/4  root        7.37 M    396.00 K    0.21 %  0.88 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~-0000 --worker-url spark://Worker@192.168.0.25:44851
  980 be/4  root        6.93 M      0.00 B    0.00 % 31.54 %  java -Dproc_datanode -Xmx1000m -Djava.library.path=~RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
24914 be/4  root        6.26 M      4.00 K    1.95 %  0.15 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~ploy.worker.Worker --webui-port 8081 spark://r1:7077
25078 be/4  root        5.35 M     24.00 K    2.69 %  1.48 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~-0000 --worker-url spark://Worker@192.168.0.25:44851
24908 be/4  root        5.00 M     24.00 K    1.67 %  0.07 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~ploy.worker.Worker --webui-port 8081 spark://r1:7077
25127 be/4  root        4.52 M      0.00 B    0.11 %  0.01 %  java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~-0000 --worker-url spark://Worker@192.168.0.25:44851

Closing Thoughts

I spend 1-2 weeks a month living out of a suitcase, so carrying a few motherboards and power supplies around isn't practical. When I started this project I had three Raspberry Pis shipped overnight to an Amazon drop-off point next to one of my clients' offices. I was able to flash the Micro SD cards off a MacBook Pro I take with me when working abroad. Within an hour of unpacking everything, I had the devices connected to a Wi-Fi hotspot running off my Samsung Galaxy S8 phone in my hotel room. This all felt like a very convenient and portable way to both explore these small devices and revisit Hadoop on a minimalist hardware setup.

I'll admit that the slow network connectivity and slow I/O had me making many diversions along this journey. I'm not sure most people interested in learning about Hadoop will have the patience to deal with these limitations. If you want a more practical learning experience, I'd suggest trying out Amazon EMR, as so much is already set up for you when you launch a cluster (see the sketch at the end of this post).

If you do want to use your own hardware at home, I'd suggest anything from the "Modest" tier upward on Logical Increments as a shopping list of parts. You should use multiple computers, as distributing workloads and horizontal scalability are Hadoop's main selling points. Make sure to find a motherboard with built-in HDMI so you can save money on graphics cards.
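For reference, here is roughly what the EMR route looks like from the AWS CLI. This is a minimal sketch rather than a command from this post: it assumes the AWS CLI is installed and configured with credentials and a default region, that emr-5.9.0 (the release carrying Spark 2.2 at the time of writing) is the target, and the key pair name is a placeholder.

$ aws emr create-cluster \
      --name "Spark cluster" \
      --release-label emr-5.9.0 \
      --applications Name=Spark \
      --instance-type m3.xlarge \
      --instance-count 3 \
      --use-default-roles \
      --ec2-attributes KeyName=my-key-pair   # placeholder EC2 key pair name

Once the cluster reports as ready you can SSH to the master node and run spark-sql or spark-submit straight away; HDFS, YARN and Spark are already installed and wired together, which is exactly the setup work this post spent most of its time on.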
