Written by Matt Lewis, Chief Architect at DVLA
In the first post we looked back at the start of our cloud adoption journey. Although we had proven the value of adopting cloud, our approach was not sustainable in the long term, and we knew we had to make changes as we matured.
Centralised Platform Team
One of the first steps was to establish a centralised cloud platform team. We had started by embedding cloud engineers in multi-disciplinary squads. However, without defined standards, we saw a proliferation of technology and duplication of effort across squads, with many trying to solve the same problems at the same time.
The new platform team was responsible for creating a ‘paved road’. This involved providing a certified platform for engineering squads to deploy their workloads onto. If squads utilised this platform and followed agreed standards, they would get key features out of the box such as centralised log shipping, monitoring, metrics, observability and alerting. These capabilities were provided by modern cloud-native tooling that was embedded as part of the platform.
To take advantage of the platform, we needed new engineering standards and patterns. We quickly discovered that without a consistent structured log format and standard headers, it was impossible to trace requests across multiple distributed services, so we defined these standards and created client libraries to implement them.
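The idea can be sketched as follows. This is a minimal illustration, not DVLA's actual library: the header name, field names and service name are all assumptions. Each service logs single-line JSON and propagates a correlation id from a standard request header, so one request can be traced across every service it touches.

```python
import json
import logging
import time
import uuid

# Illustrative header name — the real standard header is defined by the client libraries.
CORRELATION_HEADER = "X-Correlation-Id"

class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON line, carrying the correlation id."""
    def format(self, record):
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "correlationId": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })

def incoming_correlation_id(headers):
    """Reuse the caller's correlation id if present; mint one at the edge if not."""
    return headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
```

Because every service emits the same fields, the centralised log shipping and tracing tools can join records on `correlationId` without per-service parsing rules.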
Defining standards is one thing; they also need enforcing. To deliver at pace and ensure consistency, we moved away from manual gates where possible and embraced automation. This meant we could guarantee that the correct tags had been applied, audit was enabled and data was encrypted.
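An automated gate of this kind can be as simple as a compliance check run in the pipeline before anything is deployed. The sketch below is hypothetical — the required tag set and resource fields are illustrative, not DVLA's actual policy — but it shows the shape: every failure is reported, and any failure blocks the deployment.

```python
# Illustrative tag set — real required tags would come from the platform's standards.
REQUIRED_TAGS = {"service", "environment", "cost-centre"}

def compliance_failures(resource):
    """Return a list of policy failures for one resource definition; empty means compliant."""
    failures = []
    missing = REQUIRED_TAGS - set(resource.get("tags", {}))
    if missing:
        failures.append(f"missing tags: {sorted(missing)}")
    if not resource.get("encrypted", False):
        failures.append("encryption at rest not enabled")
    if not resource.get("audit_logging", False):
        failures.append("audit logging not enabled")
    return failures
```

Running this on every commit replaces a manual review gate with a check that is fast, consistent and impossible to skip.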
Security is critical for us, and we continually challenge ourselves to look at how we can prevent insecure code being deployed into production. We standardised on a CI/CD pipeline, and ensured this was the only way of deploying code into production.
All code resides in our git repositories. To deploy new code to production, a software engineer creates a short-lived feature branch and commits a change. On each commit, a series of quick checks runs, such as linting the code for conformance. To start the process of merging this change into the master branch, a new Pull Request is created. This automatically runs all integration and acceptance tests, and now also carries out vulnerability scanning using modern tooling.
If any test fails, or vulnerabilities are found, the Pull Request is marked as failing and cannot be merged. Finally, if successful, the Pull Request must be peer reviewed and approved by a number of approver groups, ensuring many eyes look at the code before it can be merged and promoted through environments.
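The merge decision described above reduces to a small set of gates that must all be green. The function below is a hypothetical sketch of that logic (the group names in the usage are invented), not the actual implementation of our source-control tooling:

```python
def pull_request_mergeable(checks_passed, vulnerabilities, approvals, required_groups):
    """A PR may merge only when automated checks pass, no vulnerabilities are
    found, and every required approver group has signed off."""
    if not checks_passed:
        return False          # failing tests block the merge
    if vulnerabilities:
        return False          # any reported vulnerability blocks the merge
    # Every required approver group must appear in the set of approvals.
    return required_groups <= set(approvals)
```

The key property is that no single path bypasses review: a passing build without approvals, or approvals without a passing build, both leave the Pull Request unmergeable.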
We found that adopting cloud in an unconstrained manner had led to spiralling costs. When we migrated our original exemplar services onto a different public cloud, we carried out a cost optimisation exercise which resulted in run costs being reduced by over 70%. This involved:
- tagging resources and automatically scheduling them, so that test environments were not left running
- placing all resources in auto scaling groups to take advantage of the elasticity of the cloud
- right-sizing instances where possible by reducing CPU, memory and/or storage where they had been over-provisioned
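The first of those steps — tagging resources and scheduling them — can be sketched as a decision function. The schedule tag format here (`"08:00-18:00"`, meaning keep the resource running in that window) is an assumption for illustration; a scheduler would evaluate it against each tagged, non-production resource and stop anything outside its window.

```python
from datetime import datetime, time

def should_be_running(schedule_tag, now):
    """Decide whether a scheduled (non-production) resource should be up at `now`.

    `schedule_tag` is an illustrative "HH:MM-HH:MM" running window; resources
    without a schedule tag are left alone.
    """
    if schedule_tag is None:
        return True
    start_s, end_s = schedule_tag.split("-")
    start = time.fromisoformat(start_s)
    end = time.fromisoformat(end_s)
    return start <= now.time() < end
```

Stopping test environments overnight and at weekends in this way removes roughly two thirds of their running hours, which is where much of a reduction like the 70% above can come from.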
Cost optimisation is now a regular exercise. A more recent example is the progression of our container platform from on-demand instances to reserved instances, and now to spot instances.
Growing In-House Capability
None of this would have been possible without our people. For our first steps into cloud we had partnered with external suppliers, but knew we wanted to be in control of our destiny, which meant both upskilling and attracting new talent.
We took out enterprise subscriptions with top-quality online training providers so that everyone could access the learning they needed on demand. We also ensured that training was built into our resourcing model, so it became an integral part of people's work.
We wanted to encourage innovation in a safe environment and allow engineers to learn about new services or build working prototypes. To do this, we provided all of our squads with their own individual developer accounts. These are not linked to any other accounts or environments, and so have a limited blast radius. We also encourage everyone to validate their capability level by providing vouchers for certification exams.
Finally, to help provide more structure and guidance in this area, we have created career frameworks with clear descriptions of roles, how to progress in specific careers, which training courses are most relevant, and which skills are most in demand.
We want to be a learning organisation, and we know that the skills we will need in the future may be different to those we need now. We run regular events that allow our technical people to have fun, but also to learn new skills we believe are important. For example, in the past 18 months we have organised:
- the Scalextric Challenge — we fitted sensors to Scalextric cars and combined having fun racing with building imaginative and innovative software using the live streaming data, anticipating the future of connected and autonomous vehicles
- a hackathon with various public and private sector organisations, which involved finding new and engaging ways to use bot technology within a public service
- a Deep Racer Day — an opportunity to learn the concepts of machine learning whilst building and training reinforcement learning models
We have now reached a stage of maturity where we can rapidly deliver new secure services such as changing address on a vehicle log book in a fraction of the time it took at the start of the journey. But we don’t want to stop here. We conclude our series by looking at what is coming up next, with the launch of our new Cloud Academy, enhanced cloud capabilities and a move to self-service.
Originally published at https://digileaders.com on July 30, 2020.