millions of people set a timer to remind themselves to watch the Super Bowl, so all the timers fire close to kickoff time.
DynamoDB TTL as a scheduling mechanism

From a high level, this approach looks like this:

- A scheduled_items DynamoDB table, which holds all the tasks that are scheduled for execution.
- A scheduler function that writes each scheduled task into the scheduled_items table, with the TTL set to the scheduled execution time.
- An execute-on-schedule function that subscribes to the DynamoDB Stream for scheduled_items and reacts to REMOVE events, which correspond to items having been deleted from the table.
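A minimal sketch of the two functions in Python with boto3 — the scheduled_items table name comes from the article, but the `task_id` key, `payload` attribute, and `execute_at` TTL attribute are assumptions; the table's TTL must be enabled on that same attribute, and the stream must be configured with the OLD_IMAGE view type:

```python
import time


def schedule_task(task_id, payload, delay_seconds):
    """Scheduler function: write a task into scheduled_items with the TTL
    attribute set to the desired execution time (epoch seconds)."""
    import boto3  # imported here so the stream handler below has no AWS dependency

    execute_at = int(time.time()) + delay_seconds
    boto3.client("dynamodb").put_item(
        TableName="scheduled_items",
        Item={
            "task_id": {"S": task_id},             # assumed partition key
            "payload": {"S": payload},             # assumed task payload
            "execute_at": {"N": str(execute_at)},  # assumed TTL attribute
        },
    )
    return execute_at


def execute_on_schedule(event, context):
    """Lambda handler for the scheduled_items DynamoDB Stream.
    Only REMOVE events correspond to expired (deleted) items."""
    executed = []
    for record in event["Records"]:
        if record["eventName"] != "REMOVE":
            continue  # ignore INSERT / MODIFY events
        item = record["dynamodb"]["OldImage"]  # needs the OLD_IMAGE stream view
        executed.append(item["task_id"]["S"])
        # ... run the actual task here, using item["payload"] ...
    return executed
```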
Scalability (number of open tasks)

Since the number of open tasks translates directly to the number of items in the scheduled_items table, this approach can scale to millions of open tasks.
DynamoDB can also handle high write throughput (thousands of TPS), so this approach can be applied to scenarios where thousands of items are scheduled per second.
Scalability (hotspots)

When many items are deleted at the same time, they are simply queued in the DynamoDB Stream.
AWS also auto-scales the number of shards in the stream, so as throughput increases, the number of shards grows accordingly. However, events within a shard are processed in sequence, so it can take some time for your function to process an event, depending on:

- its position in the stream, and
- how long it takes to process each event.
So, while this approach can scale to support many tasks all expiring at the same time, it cannot guarantee that tasks are executed on time.
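A back-of-envelope illustration of why sequential processing hurts — the batch size and processing time below are hypothetical numbers, not measurements:

```python
import math


def worst_case_delay(position_in_shard, batch_size, seconds_per_batch):
    """Records in a DynamoDB Stream shard are handed to Lambda in order, in
    batches; a record cannot be processed until every batch ahead of it in
    the same shard has finished."""
    batches_ahead = math.ceil(position_in_shard / batch_size)
    return batches_ahead * seconds_per_batch


# e.g. 10,000 expired items land in one shard, batches of 100, 2s per batch:
# the last record waits behind 100 batches, i.e. 200 seconds
```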
Precision

This is the big question mark over this approach. According to the official documentation, expired items are typically deleted within 48 hours of expiration. That is a huge margin of error!

As an experiment, I set up a Step Functions state machine to:

- add a configurable number of items to the scheduled_items table, with TTLs expiring between 1 and 10 mins
- track the time each task is scheduled for and when it's actually picked up by the execute-on-schedule function
- wait for all the items to be deleted

I performed several runs of tests.
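To track the lag, the execute-on-schedule function can compare the item's TTL timestamp against the time the REMOVE event is processed. A sketch, assuming the TTL attribute is named `execute_at` as in the earlier examples:

```python
import time


def measure_delay_seconds(record, now=None):
    """How far behind schedule a REMOVE event is: the time of processing
    minus the TTL timestamp stored on the expired item."""
    now = time.time() if now is None else now
    scheduled = int(record["dynamodb"]["OldImage"]["execute_at"]["N"])
    return now - scheduled
```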
The results were consistent regardless of the number of items in the table: on average, a task is executed over 11 mins AFTER its scheduled time (these numbers are from US-EAST-1).

I repeated the experiments in several other AWS regions. I don't know why there is such a marked difference between US-EAST-1 and the other regions. One explanation is that the TTL process requires a bit of time to kick in after a table is created. Since I was initially developing against the US-EAST-1 region, its TTL process had been "warmed" compared to the other regions.
Conclusions

Based on the results of my experiment, it appears that using DynamoDB TTL as a scheduling mechanism cannot guarantee reasonable precision. On the one hand, the approach scales very well. On the other, scheduled tasks are executed at least several minutes behind schedule, which renders it unsuitable for many use cases.