1.1 Billion Taxi Rides on kdb+/q & 4 Xeon Phi CPUs

td:{` sv`:trips,(`$string x),`trips`}
i:-1
wr:{td[(`$n),(i+:1),y] set update `p#passenger_count from delete year from select from x where year=y}
rd:.Q.fc[{flip c!(t;",")0:x}]
sr:{`year`passenger_count xasc update year:pickup_datetime.year,first each cab_type from x}

The above defines four functions: td returns the table handle for a given segment/partition; wr takes an in-memory partition and writes it to disk, applying a parted (`p#) index to the passenger_count column in the process; rd parses CSV data in parallel into an in-memory table; and sr sorts an in-memory table.

\t .Q.fpn[{wr[q] each exec distinct year from q:update `p#year from sr rd x};`$":fifo",n;550000000]

The above both times and triggers the process of reading data from the FIFO, parsing it into an in-memory table, sorting it and writing it out to an on-disk table.

Each of the four servers has a copy of load.q alongside a load.sh script. When executed, load.sh takes the last digit of the host name (a value between 1 and 4) and uses it to decide which fourth of the 56 gzip files will be loaded onto that system. This means the dataset is spread amongst the four machines, with none of them holding more than 25% of the dataset. It is possible to share the data across all nodes, but I chose not to take that approach for this benchmark. The xargs command is used to load the 14 files specific to each server concurrently. Once the data has been loaded, a partitions manifest file is created and stored in trips/p/par.txt.

$ cat load.sh
#!/bin/sh
id=`hostname | grep -o '.$'`
ls csv/* \
    | awk -v i=$id 'NR > 14 * (i - 1) && NR <= 14 * i' \
    | xargs -P 14 -L 1 q load.q -q -s 8
mkdir -p trips/p
ls -d trips/??/* | sed 's/trips/../' > trips/p/par.txt

The load.sh script was executed concurrently across all four servers. Loading the CSV data into kdb+'s internal format took about 30 minutes altogether. This is one of the fastest load times I've seen for this dataset. As compression isn't being used for
this benchmark, ~125 GB of disk capacity is being used on each machine (excluding source files).
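The awk slice in load.sh can be sanity-checked in isolation: for host digit i, it keeps lines 14 * (i - 1) + 1 through 14 * i of the file listing. A minimal sketch, using placeholder names in place of the real gzip files:

```shell
# Stand-ins for the 56 gzip files (file names here are hypothetical).
printf 'csv/trips_%02d.csv.gz\n' $(seq 1 56) > files.txt

# Host with id=3 keeps rows where NR > 14*(3-1) and NR <= 14*3,
# i.e. files 29 through 42.
id=3
awk -v i=$id 'NR > 14 * (i - 1) && NR <= 14 * i' files.txt
```

Each of the four hosts therefore sees a disjoint, contiguous 14-file block of the listing, which is why no machine ends up holding more than a quarter of the dataset.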

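The last two lines of load.sh build the par.txt partition map: the sed call rewrites each segment path so it is relative to the trips/p directory. A small sketch with a hypothetical directory layout (two year partitions under a single segment):

```shell
# Hypothetical segment layout: two year partitions under segment 01.
mkdir -p demo/trips/01/2009 demo/trips/01/2010 demo/trips/p
cd demo

# Same commands as in load.sh: list the partition directories and
# rewrite the leading "trips" component to "..", relative to trips/p/.
ls -d trips/??/* | sed 's/trips/../' > trips/p/par.txt
cat trips/p/par.txt
# ../01/2009
# ../01/2010
```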