1.1 Billion Taxi Rides with MapD & 4 Nvidia Titan Xs

$ vi create_trips_table.sql CREATE TABLE trips ( trip_id INTEGER, vendor_id VARCHAR(3) ENCODING DICT, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag VARCHAR(1) ENCODING DICT, rate_code_id SMALLINT, pickup_longitude DECIMAL(14,2), pickup_latitude DECIMAL(14,2), dropoff_longitude DECIMAL(14,2), dropoff_latitude DECIMAL(14,2), passenger_count SMALLINT, trip_distance DECIMAL(14,2), fare_amount DECIMAL(14,2), extra DECIMAL(14,2), mta_tax DECIMAL(14,2), tip_amount DECIMAL(14,2), tolls_amount DECIMAL(14,2), ehail_fee DECIMAL(14,2), improvement_surcharge DECIMAL(14,2), total_amount DECIMAL(14,2), payment_type VARCHAR(3) ENCODING DICT, trip_type SMALLINT, pickup VARCHAR(50) ENCODING DICT, dropoff VARCHAR(50) ENCODING DICT, cab_type VARCHAR(6) ENCODING DICT, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel VARCHAR(10) ENCODING DICT, pickup_borocode SMALLINT, pickup_boroname VARCHAR(13) ENCODING DICT, pickup_ct2010 VARCHAR(6) ENCODING DICT, pickup_boroct2010 VARCHAR(7) ENCODING DICT, pickup_cdeligibil VARCHAR(1) ENCODING DICT, pickup_ntacode VARCHAR(4) ENCODING DICT, pickup_ntaname VARCHAR(56) ENCODING DICT, pickup_puma VARCHAR(4) ENCODING DICT, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel VARCHAR(10) ENCODING DICT, dropoff_borocode SMALLINT, dropoff_boroname VARCHAR(13) ENCODING DICT, dropoff_ct2010 VARCHAR(6) ENCODING DICT, dropoff_boroct2010 VARCHAR(7) ENCODING DICT, dropoff_cdeligibil VARCHAR(1) ENCODING DICT, dropoff_ntacode VARCHAR(4) ENCODING DICT, dropoff_ntaname VARCHAR(56) ENCODING DICT, dropoff_puma VARCHAR(4) ENCODING DICT ) WITH (FRAGMENT_SIZE=100000000); Ill create two environment variables with my credentials for MapD..$ read MAPD_USERNAME $ read MAPD_PASSWORD $ export MAPD_USERNAME $ export MAPD_PASSWORD The following will create the table schema using the mapdql cli tool..$ mapdql mapd -u $MAPD_USERNAME -p $MAPD_PASSWORD < create_trips_table.sql Ill then check that the table has been created: $ echo " " | mapdql mapd -u $MAPD_USERNAME -p $MAPD_PASSWORD User mapd connected to database mapd trips User mapd disconnected from database mapd MapD doesnt support loading CSV data from GZIP files at this time so Ill decompress the CSV files before loading them..$ gunzip trips_x*.csv.gz With the table and files in place Ill load the 500 GB of CSV data into MapD..$ for filename in *.csv; do echo "COPY trips FROM '/raidStorage/mark/$filename' WITH (header='false');" | mapdql mapd -u $MAPD_USERNAME -p $MAPD_PASSWORD done The above completed in 86 minutes and 24 seconds..Benchmarking MapD The times quoted below are the lowest query times seen during a series of runs..$ mapdql mapd -u $MAPD_USERNAME -p $MAPD_PASSWORD iming on The following completed in 0.036 seconds..SELECT cab_type, count(*) FROM trips GROUP BY 1; The following completed in 0.131 seconds..SELECT passenger_count, avg(total_amount) FROM trips GROUP BY 1; The following completed in 0.439 seconds..SELECT passenger_count, extract(year from pickup_datetime), count(*) FROM trips GROUP BY 1, 2; The following completed in 0.964 seconds..SELECT passenger_count, extract(year from pickup_datetime), cast(trip_distance as int), count(*) FROM trips GROUP BY 1, 2, 3 ORDER BY 2, 4 desc; Its fantastic to see that Ive been able to use a machine that costs 1/10th of the one used in the 8 x Tesla K80s benchmark but still have queries running within 33% of the previous performances witnessed..Despite the initial outlay for the hardware I think this is a very cost-effective setup.. More details

Leave a Reply