We love hearing what DevOps unicorns like Netflix do, learning from their experiences, and helping our customers transform into DevOps businesses. Here are two stories from two of our favourite partners:
Netflix Builds Its Open Sourced Cloud Technologies on the CloudBees Platform
More than 36 million Netflix members worldwide view streamed content and access Netflix features delivered via cloud technology that the company has been developing since 2009. Netflix operates on a cloud platform based on Amazon Web Services (AWS). Over the years, Netflix engineers have developed numerous cloud tools and technologies, which the company has now shared with the development community as open source software.
While many companies may be leery of freely sharing core technology, Netflix sees several compelling advantages to the move. First, the company aims to establish its solutions as standards that are widely used throughout the industry. Second, community contributions that boost performance or add support for other cloud services will improve the standard for everyone, advancing the platform’s robustness and quality. Third, the move will burnish the company’s reputation as a technology leader, which was boosted in 2012 by winning an Emmy for Technical Achievement. Lastly, it will help Netflix attract, engage and retain expert engineers, because it can draw directly from a wide pool of proven contributors.
To help maximize these advantages, Netflix chose the CloudBees Platform as a Service (PaaS) solution to support the public builds of its NetflixOSS open source projects. “Open sourcing our cloud-based architecture projects is a way to reduce risk, improve our service and contribute to the broader cloud-based community,” says Gareth Bowles, senior tools engineer at Netflix. “The CloudBees platform is helping us achieve these objectives by enabling Netflix engineers and external contributors to rapidly build and test changes, instantly see the quality of the changes being made and continue to realize the benefits of continuous integration with Jenkins.”
Challenge: Shorten Release Cycles, Deliver Innovation More Quickly
Netflix needed a way to transition dozens of internal projects to open source projects as efficiently and cost-effectively as possible. “We wanted to make our builds public, in a way that enabled everyone—including the Netflix engineers that would continue to work on the projects and the new external contributors—to see the effect of changes as soon as possible. We wanted the community to be able to assess the quality of each project at any point in time,” says Bowles. “And because we were already very heavy on-premise Jenkins users, we wanted a Jenkins-based cloud service for continuous integration.”
At the same time, the company was looking to keep costs and maintenance requirements down. “We are sharing our code with the community at no cost, and we needed a low cost way to do it that would not require extensive resources and ongoing maintenance,” says Bowles.
"The CloudBees Platform is helping us achieve our goals to open source our code by making it easier for developers to contribute, give rapid feedback on pull requests, provide the current status of projects and support low cost public builds. This spurs an increased pace of innovation that benefits not only Netflix and our customers, but the entire community."
Solution: Enable Shorter Software Delivery Cycles with CI and CD
Netflix transitioned its open source projects to the CloudBees DEV@cloud development platform.
Netflix engineers set up a few prototype builds with Jenkins, the CloudBees Platform and several key Jenkins plugins. They had everything up and running within two to three hours.
“The move was very easy and we had great support from the CloudBees technical team,” says Bowles.
The team then began using the Jenkins Job DSL plugin to create new projects programmatically, updating the location of the GitHub repository that each project used. Once a project is open to the community, developers—both internal and external—can instantly see its current build status, which is displayed using the Jenkins Embeddable Build Status plugin.
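To give a feel for what "creating projects programmatically" means, here is a minimal Python sketch that stamps out one build job description per repository from a single template. It is an illustration only: it is not Job DSL syntax (which is Groovy-based) and it does not talk to Jenkins. The `BuildJob` class, the `make_jobs` helper and the `-master` job naming are hypothetical, and the repository names are simply the two projects mentioned in this story.

```python
from dataclasses import dataclass
from typing import List

# Example inputs; the organisation and repository names come from the
# projects named in this post, everything else is an assumption.
GITHUB_ORG = "Netflix"
REPOSITORIES = ["asgard", "SimianArmy"]


@dataclass(frozen=True)
class BuildJob:
    """A plain description of one CI job (not a real Jenkins object)."""
    name: str
    repo_url: str
    build_pull_requests: bool


def make_jobs(org: str, repos: List[str]) -> List[BuildJob]:
    """Create one job description per repository from a single template."""
    return [
        BuildJob(
            name=f"{repo}-master",  # hypothetical naming convention
            repo_url=f"https://github.com/{org}/{repo}.git",
            build_pull_requests=True,
        )
        for repo in repos
    ]


jobs = make_jobs(GITHUB_ORG, REPOSITORIES)
for job in jobs:
    print(job.name, job.repo_url)
```

The point of the template approach is that adding a project means adding one name to a list, rather than clicking through a job configuration by hand.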
Among contributors, the ability to build pull requests before they are merged into the main branch has been a much-appreciated feature. Made possible by the GitHub Pull Request Builder plugin, this capability provides virtually instant feedback when a pull request is made.
“This feature acts as an extra reviewer for the code changes that developers submit,” says Bowles. “Our developers have been really impressed with the ability to get instant feedback on their changes.”
The CloudBees Platform is also playing a central role in the Netflix Cloud Prize contest, in which participants fork one of the Netflix open source repositories on GitHub, implement an improvement or new functionality and submit their changes for judging to win a prize in one of 10 categories. Netflix will set up new builds on the CloudBees Platform for entries that use a compatible build structure.
Ultimately all Netflix open source projects will be built on the CloudBees Platform. Netflix has already migrated most of them, including Asgard (a web interface for application deployments and cloud management on AWS) and SimianArmy, which includes the Chaos Monkey resiliency tool that helps applications tolerate random instance failures.
- Immediate feedback on builds. “With the CloudBees platform, Jenkins triggers a build when changes are submitted, so developers get feedback on their submissions within seconds, instead of having to wait 15 minutes or longer. Plus, everyone can see the current build status immediately,” says Bowles. “That benefits the entire developer community and spurs more contributions, because it reduces friction and makes the entire process smoother.”
- Minimal maintenance overhead. “Once we’ve created a build job, we rarely need to change the configuration so minimal maintenance is required,” says Bowles. “We don’t need a full-time employee to manage all of these projects; we just check in a couple times a day. It’s nice to know that CloudBees is handling the entire infrastructure for us, behind the scenes.”
- Cost-effective pricing. “We are grateful for the CloudBees Platform because it has enabled us to keep costs to a minimum as we open source our code,” says Bowles. “Since we are sharing our technology with the community for no cost, we appreciate that CloudBees has provided an affordable option for open source projects and that we can use the CloudBees infrastructure to publicly build those projects.”
Netflix Takes on the Cloud
AppDynamics was working hand in hand with Netflix to help manage the performance and availability of its highly distributed cloud application. Below are some of the key questions that Adrian Cockcroft (the chief cloud architect for Netflix) had to address during this move.
Why did Netflix migrate from a physical data center environment to a cloud environment?
The #1 reason Adrian states is “business agility”: the ability to quickly build and release new products (e.g., iPhone/iPad movie streaming) without having to dramatically ramp up expensive capacity in their physical data center. Some new services are capacity intensive, and the ability to provision hundreds or thousands of cloud nodes has sped their time-to-market with new movies and new products.
Netflix is also experiencing tremendous business growth, with 40% year-over-year member growth. Thus, they also need more capacity to serve this higher demand. Adrian said that some of the demand spikes were hard to predict; thus, the need for elastic capacity.
The #2 reason he states is to avoid “undifferentiated heavy lifting.” By using cloud capacity, they no longer have to do the things in the data center that don’t differentiate Netflix from its competitors. They can focus all of their time and passion on innovation and differentiation.
Note: he doesn’t cite cost savings as the #1 or #2 reason.
What is different about managing applications in a physical data center vs a cloud environment?
Quick answer: Everything. Adrian made a pretty bold statement: “Datacenter oriented tools don’t work” in the cloud environment.
“More things to manage” by a factor of 10: Whereas the physical data center may have had 40-50 megaservers in the past, the cloud environment is made up of thousands of commodity, low-cost servers.
Thus, an individual server means less. Managing application performance and availability by the health of servers (CPU utilization, memory utilization) is no longer a reliable proxy for application health.
Dynamic vs Static: The same set of megaservers no longer serves traffic each and every day. Cloud servers are easily replaced, and hundreds of instances can be added or dropped in a minute. Thus, any concept of management that relies on a static set of servers, connections, agents, etc. is severely outdated. Management solutions can no longer expect that their agents will persist on the same machines for months or years. The lifespan of a node may be 5 days or less.
Reinventing the Agile Release Process: When new capabilities are ready to be released, you no longer need to update or patch the existing servers. You now have the option to put the new release binaries on hundreds of new cloud instances, send traffic to them, verify that they are performing well, and then take down the hundreds of nodes running the old release. “Dark Launch” feedback mechanisms just got even better.
Relationships change: Amazon becomes their IT Operations/Infrastructure department and the relationship of App Dev & Architecture for the new cloud apps is with Amazon.
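To make the release flow described under “Reinventing the Agile Release Process” a little more concrete, here is a minimal Python sketch of the decision at its heart: bring up new nodes, check them, and only then retire the old ones. This is a simplified illustration, not how Netflix or Amazon implement it. The `roll_out` function, the node names and the all-or-nothing health rule are assumptions made for the example.

```python
from typing import Callable, List


def roll_out(
    old_nodes: List[str],
    new_nodes: List[str],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Return the nodes that should serve traffic after the rollout.

    Traffic moves to the new nodes only if every one of them passes the
    health check (an assumed rule); otherwise the old nodes keep serving.
    """
    if new_nodes and all(is_healthy(node) for node in new_nodes):
        return list(new_nodes)  # the old release can now be taken down
    return list(old_nodes)      # keep the old release, discard the new nodes


# Hypothetical node names for the old and new release.
old = [f"v1-{i}" for i in range(3)]
new = [f"v2-{i}" for i in range(3)]

print(roll_out(old, new, is_healthy=lambda node: True))
# prints ['v2-0', 'v2-1', 'v2-2']
```

Because the old nodes are left untouched until the new ones have proven themselves, backing out of a bad release is just a matter of not switching the traffic.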
How do APM solutions need to be architected to work in the Amazon Cloud?
Suffice it to say that a lot has to change. Adrian deserves the credit for dozens of features that have gone into AppDynamics 2.x and 3.0 releases. I won’t do a full sales pitch in this blog – but let me highlight two pretty obvious situations that must be handled elegantly in this highly distributed and dynamic environment:
1) The APM solution must be able to monitor thousands of cloud nodes from a single management server to provide end-to-end transaction performance metrics and tracing. If the APM solution can only scale to 200 nodes per management server (200:1), you will need multiple consoles and you won’t have a single pane of glass.
2) The APM solution must be able to handle hundreds of nodes being provisioned and de-provisioned. The performance monitoring, metrics, transaction tracing, service dependency modelling, and deep diagnostics all need to work in this extremely dynamic environment. Legacy APM solutions that don’t dynamically adapt to infrastructure changes will become useless quickly.
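Both situations can be sketched in a few lines of Python. The first function does the arithmetic behind point 1: how many consoles a 200:1 limit would force on you. The small class illustrates point 2 with one possible approach, a registry that adds nodes when they report in and forgets them when they go quiet. Neither is taken from AppDynamics or any other product. The 3,000-node figure, the heartbeat design, the timeout value and all of the names are assumptions made for illustration.

```python
import math
from typing import Dict, List


def consoles_needed(node_count: int, nodes_per_console: int) -> int:
    """Number of management consoles required at a given scaling ratio."""
    return math.ceil(node_count / nodes_per_console)


class NodeRegistry:
    """Tracks whichever nodes are alive right now (assumed heartbeat design)."""

    def __init__(self, timeout_seconds: float) -> None:
        self._timeout = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, node_id: str, now: float) -> None:
        # A node that has never been seen before registers itself here.
        self._last_seen[node_id] = now

    def active_nodes(self, now: float) -> List[str]:
        # Forget nodes that have been silent longer than the timeout.
        self._last_seen = {
            node: seen
            for node, seen in self._last_seen.items()
            if now - seen <= self._timeout
        }
        return sorted(self._last_seen)


# 3,000 nodes is a hypothetical fleet size; 200 is the ratio from point 1.
print(consoles_needed(3000, 200))  # prints 15

registry = NodeRegistry(timeout_seconds=60.0)
registry.heartbeat("node-a", now=0.0)
registry.heartbeat("node-b", now=50.0)
print(registry.active_nodes(now=100.0))  # prints ['node-b']
```

Nothing in the registry has to be configured in advance: a node exists for monitoring purposes from its first heartbeat and disappears on its own once it is de-provisioned.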