A Billion Taxi Rides in Redshift

CREATE TABLE trips (
    trip_id                 INTEGER NOT NULL DISTKEY ENCODE LZO,
    vendor_id               VARCHAR(3) ENCODE LZO,

    -- Sort keys shouldn't be encoded (compressed)
    pickup_datetime         TIMESTAMP NOT NULL,

    dropoff_datetime        TIMESTAMP NOT NULL ENCODE LZO,
    store_and_fwd_flag      VARCHAR(1) ENCODE RUNLENGTH,
    rate_code_id            SMALLINT NOT NULL ENCODE LZO,
    pickup_longitude        DECIMAL(18,14) ENCODE MOSTLY8,
    pickup_latitude         DECIMAL(18,14) ENCODE MOSTLY8,
    dropoff_longitude       DECIMAL(18,14) ENCODE MOSTLY8,
    dropoff_latitude        DECIMAL(18,14) ENCODE MOSTLY8,
    passenger_count         SMALLINT NOT NULL DEFAULT '0' ENCODE LZO,
    trip_distance           DECIMAL(6,3) DEFAULT '0.0' ENCODE MOSTLY8,
    fare_amount             DECIMAL(6,2) DEFAULT '0.0' ENCODE MOSTLY8,
    extra                   DECIMAL(6,2) DEFAULT '0.0' ENCODE MOSTLY8,
    mta_tax                 DECIMAL(6,2) DEFAULT '0.0' ENCODE MOSTLY8,
    tip_amount              DECIMAL(6,2) DEFAULT '0.0' ENCODE MOSTLY8,
    tolls_amount            DECIMAL(6,2) DEFAULT '0.0' ENCODE MOSTLY8,
    ehail_fee               DECIMAL(6,2) DEFAULT '0.0' ENCODE MOSTLY8,
    improvement_surcharge   DECIMAL(6,2) DEFAULT '0.0' ENCODE MOSTLY8,
    total_amount            DECIMAL(6,2) DEFAULT '0.0' ENCODE MOSTLY8,
    payment_type            VARCHAR(3) ENCODE RUNLENGTH,
    trip_type               SMALLINT ENCODE LZO,
    pickup                  VARCHAR(50) ENCODE LZO,
    dropoff                 VARCHAR(50) ENCODE LZO,
    cab_type                VARCHAR(6) NOT NULL ENCODE LZO,
    precipitation           SMALLINT DEFAULT '0' ENCODE LZO,
    snow_depth              SMALLINT DEFAULT '0' ENCODE LZO,
    snowfall                SMALLINT DEFAULT '0' ENCODE LZO,
    max_temperature         SMALLINT DEFAULT '0' ENCODE LZO,
    min_temperature         SMALLINT DEFAULT '0' ENCODE LZO,
    average_wind_speed      SMALLINT DEFAULT '0' ENCODE LZO,
    pickup_nyct2010_gid     SMALLINT ENCODE LZO,
    pickup_ctlabel          VARCHAR(10) ENCODE LZO,
    pickup_borocode         SMALLINT ENCODE LZO,
    pickup_boroname         VARCHAR(13) ENCODE LZO,
    pickup_ct2010           VARCHAR(6) ENCODE LZO,
    pickup_boroct2010       VARCHAR(7) ENCODE LZO,
    pickup_cdeligibil       VARCHAR(1) ENCODE RUNLENGTH,
    pickup_ntacode          VARCHAR(4) ENCODE LZO,
    pickup_ntaname          VARCHAR(56) ENCODE LZO,
    pickup_puma             VARCHAR(4) ENCODE LZO,
    dropoff_nyct2010_gid    SMALLINT ENCODE LZO,
    dropoff_ctlabel         VARCHAR(10) ENCODE LZO,
    dropoff_borocode        SMALLINT ENCODE LZO,
    dropoff_boroname        VARCHAR(13) ENCODE LZO,
    dropoff_ct2010          VARCHAR(6) ENCODE LZO,
    dropoff_boroct2010      VARCHAR(7) ENCODE LZO,
    dropoff_cdeligibil      VARCHAR(1) ENCODE RUNLENGTH,
    dropoff_ntacode         VARCHAR(4) ENCODE LZO,
    dropoff_ntaname         VARCHAR(56) ENCODE LZO,
    dropoff_puma            VARCHAR(4) ENCODE LZO,
    primary key(trip_id)
) sortkey(pickup_datetime);

Loading Records into Redshift

For demonstration purposes I'll be running a single load into Redshift. Because I'll need 95% of the drive capacity during the import process, I'll just be importing the first 10 gzip files. If you wish to import all the gzip files in parallel you will need a larger cluster type, as discussed above.

I've amended my trips.manifest file with the following contents and uploaded it to S3.

{
    "entries": [
        {"url": "s3://trips_metadata_example/trips_xaa.csv.gz", "mandatory": true},
        {"url": "s3://trips_metadata_example/trips_xab.csv.gz", "mandatory": true},
        {"url": "s3://trips_metadata_example/trips_xac.csv.gz", "mandatory": true},
        {"url": "s3://trips_metadata_example/trips_xad.csv.gz", "mandatory": true},
        {"url": "s3://trips_metadata_example/trips_xae.csv.gz", "mandatory": true},
        {"url": "s3://trips_metadata_example/trips_xaf.csv.gz", "mandatory": true},
        {"url": "s3://trips_metadata_example/trips_xag.csv.gz", "mandatory": true},
        {"url": "s3://trips_metadata_example/trips_xah.csv.gz", "mandatory": true},
        {"url": "s3://trips_metadata_example/trips_xai.csv.gz", "mandatory": true},
        {"url": "s3://trips_metadata_example/trips_xaj.csv.gz", "mandatory": true}
    ]
}

$ s3cmd put trips.manifest s3://trips_metadata_example/
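Before kicking off the load it can be worth sanity-checking that the DDL above came out as intended. The query below is a minimal sketch of such a check, not part of the original walkthrough: Redshift's pg_table_def system view reports each column's type, compression encoding, distribution key flag and sort key position, assuming the trips table lives in a schema on your search path (the default public schema qualifies).

-- Confirm the encodings, distkey and sortkey declared in the DDL above.
SELECT "column", type, encoding, distkey, sortkey
FROM pg_table_def
WHERE tablename = 'trips';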
I'll now execute the COPY command, which will begin loading in the data from the 10 gzip files. Be sure to add in your access and secret keys on the CREDENTIALS line in the SQL command below.

$ PGPASSWORD=$MASTER_PASSWORD psql -h trips-data.cttuaolixpsz.us-east-1.redshift.amazonaws.com -p 5439 -U $MASTER_USERNAME trips

COPY trips
FROM 's3://trips_metadata_example/trips.manifest'
CREDENTIALS 'aws_access_key_id=…;aws_secret_access_key=…'
DELIMITER ','
EMPTYASNULL
ESCAPE
GZIP
MANIFEST
MAXERROR 10
REMOVEQUOTES
TRIMBLANKS
TRUNCATECOLUMNS;

Metrics from the Import

The 200M records loaded in 1 hour, 4 minutes and 25 seconds. During the import there were two distinct phases that manifested themselves in the CloudWatch screens. During the first phase the disk was, for the most part, just being written to: I saw 343 write IOPS being consumed, with 43 MB/s being written to disk. In the second phase the write IOPS shot up to 1,275 and write throughput hit 167 MB/s. Read IOPS, which had been non-existent in the first phase, hit 1,461 with a read throughput of 191 MB/s.
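With MAXERROR set to 10, the COPY is allowed to skip a handful of malformed rows without failing, so after a load of this size I'd confirm what actually landed. The two queries below are a minimal sketch of that check; stl_load_errors is Redshift's system table of rows rejected during loads.

-- The row count should come back at roughly 200M for the first 10 gzip files.
SELECT COUNT(*) FROM trips;

-- Any rows the COPY rejected (up to the MAXERROR limit) are logged here.
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;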
