A Billion Taxi Rides: AWS S3 versus HDFS

$ hive CREATE EXTERNAL TABLE trips_orc_s3 ( trip_id INT, vendor_id STRING, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag STRING, rate_code_id SMALLINT, pickup_longitude DOUBLE, pickup_latitude DOUBLE, dropoff_longitude DOUBLE, dropoff_latitude DOUBLE, passenger_count SMALLINT, trip_distance DOUBLE, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, ehail_fee DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, payment_type STRING, trip_type SMALLINT, pickup STRING, dropoff STRING, cab_type STRING, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel STRING, pickup_borocode SMALLINT, pickup_boroname STRING, pickup_ct2010 STRING, pickup_boroct2010 STRING, pickup_cdeligibil STRING, pickup_ntacode STRING, pickup_ntaname STRING, pickup_puma STRING, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel STRING, dropoff_borocode SMALLINT, dropoff_boroname STRING, dropoff_ct2010 STRING, dropoff_boroct2010 STRING, dropoff_cdeligibil STRING, dropoff_ntacode STRING, dropoff_ntaname STRING, dropoff_puma STRING ) STORED AS orc LOCATION 's3://<s3_bucket>/orc/'; CREATE EXTERNAL TABLE trips_orc_hdfs ( trip_id INT, vendor_id STRING, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag STRING, rate_code_id SMALLINT, pickup_longitude DOUBLE, pickup_latitude DOUBLE, dropoff_longitude DOUBLE, dropoff_latitude DOUBLE, passenger_count SMALLINT, trip_distance DOUBLE, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, ehail_fee DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, payment_type STRING, trip_type SMALLINT, pickup STRING, dropoff STRING, cab_type STRING, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel STRING, pickup_borocode SMALLINT, pickup_boroname STRING, pickup_ct2010 STRING, pickup_boroct2010 STRING, pickup_cdeligibil STRING, pickup_ntacode STRING, pickup_ntaname STRING, pickup_puma STRING, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel STRING, dropoff_borocode SMALLINT, dropoff_boroname STRING, dropoff_ct2010 STRING, dropoff_boroct2010 STRING, dropoff_cdeligibil STRING, dropoff_ntacode STRING, dropoff_ntaname STRING, dropoff_puma STRING ) STORED AS orc LOCATION '/orc/'; Benchmarking S3 Ill be using Presto 1.4.0 to benchmark both the S3 and the HDFS-based data..The following shows the queries times I saw for the S3-based data..$ presto-cli –catalog hive –schema default The following completed in 1 minute and 1 second..SELECT cab_type, count(*) FROM trips_orc_s3 GROUP BY cab_type; Query 20160414_110943_00006_hzqcy, FINISHED, 4 nodes Splits: 749 total, 749 done (100.00%) 1:01 [1.11B rows, 48.8GB] [18.3M rows/s, 821MB/s] The following completed in 58 seconds..SELECT passenger_count, avg(total_amount) FROM trips_orc_s3 GROUP BY passenger_count; Query 20160414_111100_00007_hzqcy, FINISHED, 4 nodes Splits: 749 total, 749 done (100.00%) 0:58 [1.11B rows, 48.8GB] [19.1M rows/s, 859MB/s] The following completed in 1 minute and 43 seconds..SELECT passenger_count, year(pickup_datetime), count(*) FROM trips_orc_s3 GROUP BY passenger_count, year(pickup_datetime); Query 20160414_111213_00008_hzqcy, FINISHED, 4 nodes Splits: 749 total, 749 done (100.00%) 1:43 [1.11B rows, 48.8GB] [10.8M rows/s, 484MB/s] The following completed in 1 minute and 41 seconds..SELECT passenger_count, year(pickup_datetime) trip_year, round(trip_distance), count(*) trips FROM trips_orc_s3 GROUP BY passenger_count, year(pickup_datetime), round(trip_distance) ORDER BY trip_year, trips desc; Query 20160414_111407_00009_hzqcy, FINISHED, 4 nodes Splits: 749 total, 749 done (100.00%) 1:41 [1.11B rows, 48.8GB] [11.1M rows/s, 497MB/s] Benchmarking HDFS The following shows the query times I saw for the HDFS-based data..The following completed in 35 seconds (1.75x faster than S3)..SELECT cab_type, count(*) FROM trips_orc_hdfs GROUP BY cab_type; Query 20160414_110514_00002_hzqcy, FINISHED, 4 nodes Splits: 793 total, 793 done (100.00%) 0:35 [1.11B rows, 48.8GB] [31.9M rows/s, 1.4GB/s] The following completed in 39 seconds (1.5x faster than S3)..SELECT passenger_count, avg(total_amount) FROM trips_orc_hdfs GROUP BY passenger_count; Query 20160414_110607_00003_hzqcy, FINISHED, 4 nodes Splits: 793 total, 793 done (100.00%) 0:39 [1.11B rows, 48.8GB] [28.2M rows/s, 1.24GB/s] The following completed in 1 minute and 4 seconds (1.6x faster than S3)..SELECT passenger_count, year(pickup_datetime), count(*) FROM trips_orc_hdfs GROUP BY passenger_count, year(pickup_datetime); Query 20160414_110658_00004_hzqcy, FINISHED, 4 nodes Splits: 793 total, 793 done (100.00%) 1:04 [1.11B rows, 48.8GB] [17.5M rows/s, 786MB/s] The following completed in 1 minute and 21 seconds (1.25x faster than S3)..SELECT passenger_count, year(pickup_datetime) trip_year, round(trip_distance), count(*) trips FROM trips_orc_hdfs GROUP BY passenger_count, year(pickup_datetime), round(trip_distance) ORDER BY trip_year, trips desc; Query 20160414_110810_00005_hzqcy, FINISHED, 4 nodes Splits: 793 total, 793 done (100.00%) 1:21 [1.11B rows, 48.8GB] [13.7M rows/s, 617MB/s] Though the speed improvements using HDFS are considerable, S3 did perform pretty well.. More details

Leave a Reply