It perhaps also shows that the full picture of #NoDeployFriday contains nuances of arguments that don’t translate too well to Twitter.
Is it correct that we should all be practicing continuous delivery, else we’re “doing it wrong”?

One aspect is the psychology involved in the decision.
The aversion to Friday deployments stems from a fear of making mistakes due to the time of the week (tiredness, rushing) and also the potential that those mistakes might cause harm while most staff are getting two days of rest.
After all, the Friday commit that contains a potential production issue could end up bothering a whole host of people over the weekend: the on-call engineer, the other engineers that get contacted to solve the problem, and perhaps even specialist infrastructure engineers who mend corrupted data caused by the change.
If it blows up badly, then others in the business may need to be involved for client communications and damage limitation.
Taking the stance of the idealist, we could reason that in a perfect world with perfect code, perfect test coverage and perfect QA, no change should ever go live that causes a problem.
But we are humans, and humans will always make mistakes.
There’s always going to be some bizarre edge case that doesn’t get picked up during development.
That’s just life.
So #NoDeployFriday makes sense, at least theoretically.
However, it’s a blunt instrument.
I would argue that we should consider changes on a case-by-case basis: our default stance should be to deploy them whenever, even on Fridays, while being able to isolate the few that should wait until Monday instead.
There are some considerations that we can work with.
I’ve grouped them into the following categories:

- Understanding the blast radius of a change
- The maturity of the deployment pipeline
- The ability to automatically detect errors
- The time it takes to fix problems

Let’s have a look at these in turn.

Understanding the blast radius

Something vital is always missed when differences of opinion butt heads online about Friday deploys: the nature of the change itself.
No two changes to a codebase are equal.
Some commits make small changes to the UI and nothing else.
Some refactor hundreds of classes with no changes in the functionality of the program.
Some alter database schemas and make breaking changes to how a real-time data ingest works.
Some may restart one instance whereas others may trigger a rolling restart of a global fleet of different services.
Engineers should be able to look at their code and have a good idea of the blast radius of their change.
How much of the code and application estate is affected? What could fail if this new code fails? Is it just a button click that will throw an error, or will all new writes get dropped on the floor? Is the change in one isolated service, or have many services and dependencies changed in lockstep?

I can’t see why anyone would be averse to shipping changes with small blast radii and straightforward deployment at any time of the week, yet I would expect major changes to a platform, especially storage infrastructure-related ones, to be considered more carefully, perhaps being done at the time when the fewest users are online.
Even better, such large-scale changes should run in parallel in production so that they can be tested and measured with real system load without anyone ever knowing.
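
As a rough illustration of running a change in parallel, here is a minimal Python sketch of a "shadow" call. The function name `shadow_call` and its parameters are hypothetical, not taken from any particular library. The existing code path still serves the user, while the candidate runs alongside it purely so that results and latency can be compared.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")

log = logging.getLogger("shadow")


def shadow_call(current: Callable[[T], R], candidate: Callable[[T], R], request: T) -> R:
    """Serve the request with `current`; run `candidate` only for comparison."""
    start = time.perf_counter()
    result = current(request)
    current_ms = (time.perf_counter() - start) * 1000

    try:
        start = time.perf_counter()
        shadow_result = candidate(request)
        candidate_ms = (time.perf_counter() - start) * 1000
        if shadow_result != result:
            log.warning("shadow mismatch for %r: %r != %r", request, result, shadow_result)
        log.info("latency current=%.2fms candidate=%.2fms", current_ms, candidate_ms)
    except Exception:
        # A failure in the candidate must never reach the user.
        log.exception("shadow candidate failed for %r", request)

    # The user only ever sees the result of the existing code path.
    return result
```

This sketch assumes the candidate has no side effects of its own, for example because it writes to a separate store. In a real system you would probably also run it off the request path so that it cannot add latency for users.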
Good local decisions are key here.
Does each engineer understand the blast radius of their changes in the production environment and not just on their development environment? If not, why not? Could there be better documentation, training and visibility into how code changes impact production?

Tiny blast radius? Ship it on Friday.
Gigantic blast radius? Wait until Monday.
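
To make the questions above concrete, here is a deliberately crude Python sketch of a blast radius checklist. The `Change` fields and the rules in `blast_radius` are my own illustrative assumptions. A real team would pick criteria that reflect its own platform, and would still apply judgement on top.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Hypothetical description of a change, filled in by its author."""
    services_touched: int       # how many services must deploy in lockstep
    alters_schema: bool         # database or storage schema changes
    touches_write_path: bool    # could writes be dropped if this fails?
    needs_fleet_restart: bool   # rolling restart of many instances


def blast_radius(change: Change) -> str:
    """Return 'small' or 'large' using simple, illustrative rules."""
    if change.alters_schema or change.touches_write_path or change.needs_fleet_restart:
        return "large"
    if change.services_touched > 1:
        return "large"
    return "small"


def ok_to_ship_on_friday(change: Change) -> bool:
    return blast_radius(change) == "small"


ui_tweak = Change(services_touched=1, alters_schema=False,
                  touches_write_path=False, needs_fleet_restart=False)
print(ok_to_ship_on_friday(ui_tweak))  # True
```

The value is less in the code than in forcing the questions to be answered explicitly before the merge button is pressed.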

The maturity of the deployment pipeline

One way of reducing risk is by continually investing in the deployment pipeline.
If getting the latest version of the application live still involves specialist knowledge of which scripts to run and which files to copy where, then it’s time to automate, automate, automate.
The quality of tools in this area has improved greatly over the last few years.
We’ve been using Jenkins Pipeline and Concourse a lot, which allow the build, test and deploy pipeline to be defined as code.
The process of fully automating your deployment is interesting.
It lets you step back and try to abstract what should be going on from the moment that a pull request is raised through to applications being pushed out into production.
Defining these steps in code, such as in the tools mentioned previously, also lets you generalize your step definitions and reuse them across all of your applications.
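
Jenkins Pipeline and Concourse each have their own syntax, so rather than reproduce either, here is a tool-agnostic Python sketch of the underlying idea: steps defined once as code and reused across applications. All names here (`build`, `run_tests`, `deploy`, `run_pipeline`) are hypothetical stand-ins, and the steps only record what they would do instead of shelling out to real tools.

```python
from typing import Callable, Dict, List

# Shared state passed from step to step (hypothetical shape).
Context = Dict[str, str]
Step = Callable[[Context], None]


def build(ctx: Context) -> None:
    # Stand-in for compiling and packaging; a real step would shell out.
    ctx["artifact"] = f"{ctx['app']}-{ctx['commit']}.tar.gz"


def run_tests(ctx: Context) -> None:
    ctx["tests"] = "passed"


def deploy(ctx: Context) -> None:
    if ctx.get("tests") != "passed":
        raise RuntimeError("refusing to deploy an untested artifact")
    ctx["deployed"] = ctx["artifact"]


# One definition, reused by every application.
STANDARD_PIPELINE: List[Step] = [build, run_tests, deploy]


def run_pipeline(app: str, commit: str, steps: List[Step] = STANDARD_PIPELINE) -> Context:
    ctx: Context = {"app": app, "commit": commit}
    for step in steps:
        step(ctx)  # any exception stops the pipeline at the failing step
    return ctx
```

Calling `run_pipeline("billing", "abc123")` and `run_pipeline("search", "def456")` exercises exactly the same steps, which is the point: an improvement to one step benefits every application that uses it.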
It also does wonders at highlighting some of the wild or lazy decisions you’ve made in the past and have been putting up with since.
For every engineer that has read the previous two paragraphs and reacted with “But of course! We’ve been doing that for years!”, I can guarantee you that there are nine others picturing their application infrastructure and grimacing at the amount of work that it would take to move their system to a modern deployment pipeline.
Such a pipeline takes advantage of tools that not only perform continuous integration, but also allow continuous deployment by publishing artifacts and letting engineers press a button to deploy them into production (or deploying them automatically, if you’re feeling brave).
Investing in the deployment pipeline needs buy-in, and it needs proper staffing: it’s definitely not a side-project.
Having a team dedicated to improving internal tooling can work well here.
If they don’t already know the pressing issues — and they probably will — they can gather information on the biggest frustrations around the deployment process, then prioritize them and work with teams on fixing them.
Slowly but surely, things will improve: code will move to production faster and with fewer problems.
More people will be able to learn best practice and make improvements themselves.
And as things improve, practices begin to spread, and that new project will get done the right way from the start, rather than copying old bad habits ad infinitum.
The journey between a pull request being merged and the commits going live should be automated to the point that you don’t need to think about it.
Not only does this help isolate real problems in QA, since the changed code is the only variable, it also makes the job of writing code much more fun.
The power to deploy to production becomes decentralized, increasing individual autonomy and responsibility, which in turn breeds more considered decisions about when and how to roll out new code.
Solid deployment pipeline? Deploy on Friday.
Copying scripts around manually? Wait until Monday.

The ability to detect errors

Deployment to production doesn’t stop once the code has gone live.
If something goes wrong, we need to know, and preferably we should be told rather than needing to hunt out this information ourselves.
This involves automatically scanning the application logs for errors, explicitly tracking key metrics (such as messages processed per second, or error rates), and running an alerting system that tells engineers when there are critical issues or when particular metrics are trending in the wrong direction.
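
As a small illustration of being told about a problem instead of hunting for it, here is a Python sketch of an error-rate check over a sliding window of recent requests. The class name `ErrorRateMonitor`, the window size and the threshold are all assumptions made for the example. In practice this job usually belongs to a dedicated monitoring and alerting system, not to application code.

```python
from collections import deque
from typing import Deque, Optional


class ErrorRateMonitor:
    """Tracks the most recent outcomes and flags a high error rate."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        self.window = window
        self.threshold = threshold
        self.outcomes: Deque[bool] = deque(maxlen=window)

    def record(self, is_error: bool) -> Optional[str]:
        """Record one request outcome; return an alert message if warranted."""
        self.outcomes.append(is_error)
        if len(self.outcomes) < self.window:
            return None  # not enough data yet to judge
        rate = sum(self.outcomes) / len(self.outcomes)
        if rate > self.threshold:
            return f"ALERT: error rate {rate:.1%} exceeds {self.threshold:.1%}"
        return None
```

Whatever `record` returns would be handed to the alerting channel of your choice. The useful property is that nobody has to be watching a dashboard for the alert to fire.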
Production is always a different beast to development, and engineers should be able to view the health of the parts of the system they care about, and also be able to compose dashboards that allow them to view trends over time.
This visibility should allow questions to be answered about each subsequent change: has it made the system faster, or slower? Are we seeing more or fewer timeouts?
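
A question like "faster, or slower?" can be answered mechanically once the metrics exist. The Python sketch below compares median latency before and after a deploy. The function name `compare_latency` and the 10% tolerance are hypothetical choices, and a real comparison would need enough samples to be meaningful.

```python
from statistics import median
from typing import Sequence


def compare_latency(before_ms: Sequence[float], after_ms: Sequence[float],
                    tolerance: float = 0.10) -> str:
    """Classify a deploy as 'faster', 'slower' or 'unchanged' by median latency."""
    base = median(before_ms)
    new = median(after_ms)
    change = (new - base) / base  # relative change; positive means slower
    if change > tolerance:
        return "slower"
    if change < -tolerance:
        return "faster"
    return "unchanged"
```

The same shape of comparison works for timeouts or error counts: pick the metric, pick a tolerance, and compare the period before the change with the period after it.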