FoundationDB (FDB) is an ACID-compliant, multi-model, distributed database.
The software started out life in 2013 as a closed-source offering with a free download available.
In March of 2015, Apple acquired the company behind FDB and in 2018, they open sourced the software under an Apache 2.
VMWares WaveFront runs an FDB cluster with at least a petabyte of capacity, Snowflake runs FDB for their metadata storage for their Cloud database service and Apple uses FDB for their CloudKit backend.
FDB uses the concept of Layers to add functionality.
There are layers for MongoDB API-compatible document storage, record-oriented storage and SQL support among others.
FDB is optimised for SSDs to the point that you need to decide between HDD and SSD-specific configurations when setting up a database.
The clustering support allows for scaling both up and down with data being automatically rebalanced.
FDB utilises SQLite for its underlying storage engine.
FDB itself is written in Flow, a programming language the engineers behind FDB developed.
The language adds actor-based concurrency as well as new keywords and control-flow primitives to C++11.
As of this writing FDB is comprised of 100K lines of Flow / C++11 code and a further 83K lines of C code.
In this post Ill take a look at setting up a FoundationDB cluster and running a simple leaderboard example using Python.
The leaderboard code used in this post originated in this forum post.
A FoundationDB Cluster, Up & Running Ive put together a cluster of three m5d.
xlarge on AWS EC2.
These instance types come with 4 vCPUs, 15 GB of RAM, 150 GB of NVMe SSD storage and up to 10 GBit/s of networking connectivity.
The three instances cost of $0.
75 / hour to run.
On all three instances Ill first format the NVMe partition using the XFS file system.
This file system was first created by Silicon Graphics in 1993 and has excellent performance characteristics when run on SSDs.
$ sudo mkfs -t xfs /dev/nvme1n1 $ sudo mkdir -p /var/lib/foundationdb/data $ sudo mount /dev/nvme1n1 /var/lib/foundationdb/data Ill then install some prerequisites for the Python code in this post.
$ sudo apt update $ sudo apt install python-dev python-pip virtualenv On the first server Ill create a virtual environment and install the FoundationDB and Pandas python packages.
$ virtualenv ~/.
fdb $ source ~/.
fdb/bin/activate $ pip install foundationdb pandas FoundationDBs server package depends on the client package being installed beforehand so Ill download and install that first.
The following was run on all three instances.
$ wget -c https://www.
deb $ sudo dpkg -i foundationdb-clients_6.
deb The following will install the server package and was run on all three instances as well.
$ wget -c https://www.
deb $ sudo dpkg -i foundationdb-server_6.
deb Ill run a command to configure the first server to switch binding from the local network interface to the private network instead.
This way itll be reachable by the other two servers without being available to the wider internet.
$ sudo /usr/lib/foundationdb/make_public.
cluster is now using address 172.
218 Ill then take the contents from /etc/foundationdb/fdb.
cluster on the first server and place them in the same file on the other two servers.
With the cluster configuration synced between all three machines Ill restart FDB on each of the systems.
$ sudo service foundationdb restart Ill then configure FDB for SSD storage, triple replication and set all three instances up as coordinators.
$ fdbcli configure triple ssd coordinators auto This is the resulting status after those changes.
status details Using cluster file `/etc/foundationdb/fdb.
Configuration: Redundancy mode – triple Storage engine – ssd-2 Coordinators – 3 Cluster: FoundationDB processes – 3 Machines – 3 Memory availability – 15.
1 GB per process on machine with least available Fault Tolerance – 0 machines (1 without data loss) Server time – 02/18/19 08:53:13 Data: Replication health – Healthy (Rebalancing) Moving data – 0.
000 GB Sum of key-value sizes – 0 MB Disk space used – 0 MB Operating space: Storage server – 142.
4 GB free on most full server Log server – 142.
4 GB free on most full server Workload: Read rate – 14 Hz Write rate – 0 Hz Transactions started – 9 Hz Transactions committed – 1 Hz Conflict rate – 0 Hz Backup and DR: Running backups – 0 Running DRs – 0 Process performance details: 172.
4:4500 ( 1% cpu; 0% machine; 0.
000 Gbps; 0% disk IO; 0.
3 GB / 15.
1 GB RAM ) 172.
137:4500 ( 1% cpu; 0% machine; 0.
000 Gbps; 0% disk IO; 0.
4 GB / 15.
2 GB RAM ) 172.
218:4500 ( 1% cpu; 0% machine; 0.
026 Gbps; 0% disk IO; 0.
4 GB / 15.
1 GB RAM ) Coordination servers: 172.
4:4500 (reachable) 172.
137:4500 (reachable) 172.
218:4500 (reachable) A FoundationDB & Python Leaderboard Ill use Pandas to read CSV data in chunks out of a GZIP-compressed CSV file.
This file contains 20 million NYC taxi trips conducted in 2009.
Ill feed the originating neighbourhood(s) and final total taxi fare into FDB.
$ python from datetime import datetime import fdb from fdb.
tuple import pack, unpack import pandas as pd fdb.
transactional def get_score(tr, user): user_key = pack(('scores', user)) score = tr.
get(user_key) if score == None: score = 0 tr.
set(user_key, pack((score,))) tr.
set(pack(('leaderboard', score, user)), b'') else: score = unpack(score) return score @fdb.
transactional def add(tr, user, increment=1): score = get_score(tr, user) total = score + increment user_key = pack(('scores', user)) tr.
set(user_key, pack((total,))) tr.
clear(pack(('leaderboard', score, user))) tr.
set(pack(('leaderboard', total, user)), b'') return total cols = ['trip_id', 'vendor_id', 'pickup_datetime', 'dropoff_datetime', 'store_and_fwd_flag', 'rate_code_id', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'trip_distance', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'ehail_fee', 'improvement_surcharge', 'total_amount', 'payment_type', 'trip_type', 'pickup', 'dropoff', 'cab_type', 'precipitation', 'snow_depth', 'snowfall', 'max_temperature', 'min_temperature', 'average_wind_speed', 'pickup_nyct2010_gid', 'pickup_ctlabel', 'pickup_borocode', 'pickup_boroname', 'pickup_ct2010', 'pickup_boroct2010', 'pickup_cdeligibil', 'pickup_ntacode', 'pickup_ntaname', 'pickup_puma', 'dropoff_nyct2010_gid', 'dropoff_ctlabel', 'dropoff_borocode', 'dropoff_boroname', 'dropoff_ct2010', 'dropoff_boroct2010', 'dropoff_cdeligibil', 'dropoff_ntacode', 'dropoff_ntaname', 'dropoff_puma'] db = fdb.
open() counter, start = 0, datetime.
utcnow() for chunk in pd.
gz', header=None, chunksize=10000, names=cols, usecols=['total_amount', 'pickup_ntaname']): for x in range(0, len(chunk)): add(db, chunk.
total_amount) counter = counter + 1 print (counter * 10000) / (datetime.
utcnow() – start).
total_seconds() The above imported at a rate of 495 records per second.
While the import was taking place I was able to begin querying the leaderboard.
from operator import itemgetter import fdb from fdb.
tuple import pack, unpack fdb.
transactional def top(tr, count=3): out = dict() iterator = tr.
get_range_startswith(pack(('leaderboard',)), reverse=True) for key, _ in iterator: _, score, user = unpack(key) if score in out.
append(user) elif len(out.
keys()) == count: break else: out[score] = [user] return dict(sorted(out.
items(), key=itemgetter(0), reverse=True)) top(db) This is the top three pick up points by total cab fare after a few minutes of importing the CSV file.
25000000016: ['Hudson Yards-Chelsea-Flatiron-Union Square'], 47637.
469999999936: ['SoHo-TriBeCa-Civic Center-Little Italy'], 134147.
24000000008: ['Midtown-Midtown South']}.. More details