“Incremental change may be good theory, but in practice you have to have a big enough stick to hit everybody with to make everything move at once”. So shares Adrian Cockcroft, who helped lead Netflix’s migration from datacenter to the cloud — and from monolithic to microservices architecture — when their streaming business (the “stick”!) was exploding.
So how did they — and how can other companies — make such big, bet-the-company kind of moves, without getting mired in fanatical internal debates? Does organizational structure need to change, especially when moving from a project-based to a product-based approach? What happens to security? And finally, what happens to the role of CIOs; what can/should they do?
Most interestingly: How will the entire industry be affected as companies not only adopt, but essentially offer, microservices or narrow cloud APIs? How do the trends of microservices, containers, devops, cloud, as-a-service/on-demand, serverless — all moves towards more and more ephemerality — change the future of computing and even work? Cockcroft (who is now a technology fellow at Battery Ventures) joins this episode of the CFI Podcast, in conversation with Frank Chen and Martin Casado (and Sonal Chokshi) to discuss these shifts and more.
Discussion of how Netflix moved to a microservices architecture [1:26]
Security advantages of microservices [8:13], and the general trend toward this architecture in the marketplace [14:21]
How development teams and businesses stand to benefit from this shift [18:34]
Sonal: Hi, everyone. Welcome to the “CFI Podcast.” I’m Sonal. Today’s podcast episode is all about microservices. And I’ve been super eager to focus only on this topic on the podcast, since we mention it a lot in passing, and I’m really excited because we finally get to do that. Our special guest for this topic is Adrian Cockcroft, who helped lead Netflix’s migration to a large-scale, highly available public-cloud architecture a few years ago — making Netflix one of the originators and early adopters of microservices. And Adrian is widely credited for helping pioneer microservices at web-scale.
Also joining in the conversation are CFI partners Martin Casado and Frank Chen, who will be moderating the discussion. And in this episode, we cover everything from what [are] microservices, to the evolution of the architecture, to how it changes the shape of organizations, to operations, to changing the role of CIOs. And finally — and this is actually what really excites me the most about this topic — is what new opportunities come up when you have these extremely ephemeral systems that are, you know, just like ghosts in the machine — from containers to servers on-demand, to serverless and what’s happening there, and some really interesting trends on that edge. The conversation begins, however, with the story of how Netflix got into microservices.
Frank: Take us back to the days when Netflix had decided they were gonna move to Amazon and commit to a microservices architecture. Let’s pick up the story there. So, what’s it like inside?
Adrian: We started off basically running away from a monolith. We had over 100 people every two weeks trying to get all the code they’d written in the last two weeks jammed into one codebase, get it through QA, and get that out into production. And that was just getting more and more painful, and we basically decided we had to break it into pieces. We wanted each piece to be the work of one developer, basically, controlling what they had deployed independently of everybody else. And at the same time, we weren’t looking at moving to cloud.
Frank: Did you make both big moves at once? In other words, monolith to microservices, and then private data center to Amazon?
Adrian: Everything together. And sometimes you find incremental change a good theory, but in practice, you have to have a big enough stick to hit everybody with to make everything move at once. And the big stick was, we didn’t have enough data center capacity to support streaming. We were running the DVD business in the data center, on a system that was growing at a respectable rate. But the streaming business was exploding at a much, much higher rate. And because of that, we knew we would have to either build lots of big data centers, or get onto something else. So, the bet was, “Okay, we need to go on cloud. Then what’s the right architecture for doing that? What’s the right organization for doing that?” The developer group is getting bigger and getting less productive, and we wanted to unlock the innovation. So, we were simultaneously trying to get better developer productivity, better time to value — which is one of the key things we’re trying to optimize for, generally. And then there was a whole bunch of other cloud transitions bundled in.
Frank: As you went from the monolithic application to microservices, what did that entail? What’s that mean? What is a microservices architecture?
Adrian: Well, originally, I called it fine-grained SOA — service-oriented architecture. And a lot of people have negative reactions to SOA, because they were out there trying to do it 10, 15 years ago.
Frank: That’s right. So it’s all the same ideas over and over again with new dressing.
Adrian: Yeah. It’s a question like, “Why now, and why didn’t it work then?” And if you look at it, what we were doing was — on relatively slow CPUs compared to what we have today, on relatively slow networks — we were processing big fat lumps of XML and passing them around. And we were really only able to break the application into a few large chunks, because the overhead of all of the message parsing was too high. If you come to today, you know, you can break it into maybe a hundredth of the size and 100 times as many chunks, because the overhead of the communication is now very low. We’ve got binary protocols. We’re not trying to, sort of, make everything conform to the big SOAP XML messaging schemes. So it became possible to build a fine-grained SOA architecture, and that ended up being called microservices by, I think, Fred George, who was the first to use the word. But it got written up by Martin Fowler, and then everyone said, “Okay, we’ll go with that.”
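To make that overhead contrast concrete, here is a toy sketch in Python (the message fields and the SOAP envelope are invented for illustration, not anything Netflix actually used): the same three-field request serialized as a SOAP-style XML document versus a compact binary packing.

```python
import struct

# Hypothetical "rate a title" request: user_id, title_id, rating (1-5).
user_id, title_id, rating = 123456, 70143836, 5

# SOAP-era style: a fat XML envelope carrying three small integers.
xml_msg = f"""<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <RateTitle>
      <UserId>{user_id}</UserId>
      <TitleId>{title_id}</TitleId>
      <Rating>{rating}</Rating>
    </RateTitle>
  </soap:Body>
</soap:Envelope>"""

# Binary-protocol style: the same three fields packed into fixed-width integers.
binary_msg = struct.pack("!QQB", user_id, title_id, rating)

print(len(xml_msg.encode()), "bytes as XML")  # a few hundred bytes, plus parsing cost
print(len(binary_msg), "bytes packed")        # 17 bytes, trivially decoded
```

Multiplied across every call between fine-grained services, that difference is what makes breaking the application into many more, much smaller chunks practical.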
Frank: Yeah. So, big-bang moves. This was a bet-the-company set of technology decisions. Looking back at it, what are some of the lessons learned?
Adrian: I think one of the ways to approach this is to basically create, kind of, a pathfinder or a pioneer team. There was a lot of controversy inside. Half of the company thought this was stupid, a few of us thought we could make it work, and other people were a bit more gung-ho. So, we got the people that thought they could make it work into a room and had a one-day project, where we all built a thing in the cloud to see if it would work — built out of the kind of technologies we’d need to use to build this. That team then, sort of, knocked down a bunch of the straw man arguments that everyone else was holding up against us. You know, a lot of the time, it is just straw man arguments, but you have to actually go and build something to find out what the real arguments are. Then you discover things you didn’t even know about, which are hard. You run into the real blockers, as opposed to the imaginary ones. So, I think the trick is to get a small team, go very deep, discover what you can, and run a whole bunch of these little projects where you’re trying to learn as much as possible with the smallest possible input.
Frank: You had this cultural aha, which is, “Let’s get the people who are gung-ho about this, and let’s let them go deep, knock down the straw man arguments.” Sort of zoom up to the 30,000-foot view and sort of describe the organization at Netflix before and after. What did it look like before and after, from a skillset point of view, from an organizational design point of view?
Adrian: This is actually one of the big things that makes a difference. Some organizations are set up already to do microservice-based architectures, and others have to go through a reorg. At Netflix it emerged naturally out of the way we were structured at the time. We were already structured as small cells that own things and have a lot of responsibility. Each team had a very clear idea of what it was building and how it related to the other teams. But it was assembled as a monolith, at the end of the day. So, breaking it apart was a fairly natural thing for us to do. What you see with traditional enterprise siloed organizations is that they’re actually having to do a reorg, and set up teams that are responsible for services, and it’s somewhat unnatural for the way they’re currently set up. But I’m seeing an increasing number of people go through that transition. And sometimes you can see it as replacing project-based work with product-based work. So, every team becomes basically a product team for their microservice, and you have the product management aspects and the operational aspects within that team.
Frank: And did you find that the people who were used to working on the monolith could be retrained, or did you have to have a new crew come in?
Adrian: The culture at Netflix is interesting. Most of us had been around before. A lot of us had worked on SOA. You know, we’re gray-haired people that had been around — there are a few people that worked at Xerox PARC in the 1980s, and you could go and have arguments with them about object-oriented programming. We had some younger people, but it was a lot of very experienced people taking all the stuff they’d learned and synthesizing it together. It was a very collaborative experience. And we came up with things that made sense based on this series of transitions we were going through. The other transition was from a single centralized database — we had this enormous Oracle machine, with a really complicated schema — to a distributed NoSQL database, in the end based on lots of different Cassandra clusters. And that was the third transition, and probably the hardest one: getting all of the SQL code and transactional stuff out of the system. Breaking apart the databases is probably the hardest thing to do — and then splitting chunks of code off is also difficult if you’re trying to pick apart a monolith. And it turns out, if you don’t break apart your database backend, and you just create lots of services that talk to it, you’ve actually created what’s called a distributed monolith, which has all the same fragility of the monolith, and you can’t update things independently, because you’re tied together by the database.
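As a rough sketch of the distinction Adrian is drawing (the service and table names are hypothetical, not Netflix's real design): in a distributed monolith, every service reaches into one shared schema, whereas with a database per service, each team owns its own store and other teams can only go through its API.

```python
# Anti-pattern: many "services" all reaching into one shared schema.
# Any schema change here forces coordinated releases across every service.
shared_schema = {"subscribers": {}, "viewing_history": {}, "billing": {}}

# Database-per-service: each service owns its store and exposes data only via its API.
class ViewingHistoryService:
    def __init__(self):
        self._store = {}  # stands in for this service's own Cassandra cluster

    def record_view(self, user_id, title_id):
        self._store.setdefault(user_id, []).append(title_id)

    def recent_titles(self, user_id, limit=10):
        return self._store.get(user_id, [])[-limit:]

class BillingService:
    def __init__(self):
        self._store = {}  # separate cluster, separate schema, separate release cadence

    def charge(self, user_id, amount_cents):
        self._store[user_id] = self._store.get(user_id, 0) + amount_cents

# Other teams call record_view() or charge() over the network; because nobody
# can couple to the internal schema, each team can change its storage layout
# and deploy independently.
```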
Frank: You can’t just take the Oracle database and break it up into little pieces. You have to think about it differently. Now, the same thing is true for the rest of the architecture as you migrate to microservices.
Martin: Yeah. So, I think what excites me about microservices, in general, is that it moves all of infrastructure up to an application layer. So, if you think about what you normally do in infrastructure, you’ve got these basic abstractions — like compute and network and storage — which are pretty low level, and they’re semantics-free, right; you don’t have structured data. One of the huge advantages of going up to a microservice architecture is you can do infrastructure insertion. Things like, for example, security — things like, you know, even debugging — basic operations, and management. And you can do it in a way that has the deep context and semantics of the application.
The point here is that not only are you going away from the monolith, which is really important, and I think it’s great, but also, like, you’ve got more semantics than you’ve ever had before. I mean, this is actually meaningful stuff when you’re dealing not with IP headers, for example, not with blocks, but with actual, like, structured data. And I think that we can actually reimagine a lot of these tools in ways that we’ve never thought of them before, because we’ve never had the ability to have this type of semantics in these toolchains. We’re seeing this burgeoning area of microservices where you almost have, like, a function per company coming up, and now, I believe that all of the old stuff that we had in the internet, whether it’s naming, or service discovery, or routing, or whatever, we’ve got an opportunity to bring this up to, kind of, a much deeper, richer level, which is really cool.
Frank: Right. So, we’re moving to the marketplace or the bazaar, away from the cathedral — any individual function can be provided by either an internal or external provider. It could be a cloud service. But then the challenge is, now it’s up to every organization to coordinate, right. And so what are some lessons that you guys have learned along the way about picking best-of-breed and then making sure they work with each other, getting the version control to work?
Adrian: When you’ve got a monolithic app, everything is in there. If it gets broken into, an attacker has access to everything. Its connection to the database lets it basically say anything to the database. When you break things into microservices, you’ve got the ability to have some parts of your system be low security risk and other parts be high security risk. You can innovate really, really quickly in areas of, sort of, personalization and user experience. And then you maybe have a much more tightly controlled thing for, say, the signup flow and where you’re storing personal information.
Frank: So, the great news is you have a lot more agility. The price that you pay is you’re doing a lot more coordination. With a monolith, it’s easy. You put all your eggs in one basket, and then, from a security point of view, for instance, you basically just pile a bunch of appliances in front of it. Easy, right? Because it was a monolith. You knew exactly where it was. Now that the perimeter is distributed across many machines, you have to be a lot more mindful of where the attack surface has gone and which security service you need to put in front of that part of the microservices architecture.
Adrian: So, you no longer have the privilege escalation of “because there is a little bit of PCI compliance needed in one tiny corner of this monolith, the entire monolith is now subject to PCI compliance and SOC 2 compliance,” and all these things. By splitting it up into pieces, you can have most of your app be extremely agile and very innovative, and then have the bits that need to be safe be extremely safe. And then if you look at the attack surface, you’re basically keeping very tight control over what can do what. And when you connect them to the databases, you’ve got very single-purpose connections into the database that are doing one thing, and you can start to control at the access level there as well.
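One illustrative way to picture that separation (the service names, grants, and zones here are hypothetical, not a real compliance framework): each microservice gets credentials scoped to a single purpose, so only the payment path ever sits inside PCI scope.

```python
# Per-service access policy: the fast-moving services have no path to card data,
# and only the locked-down signup/payments service is in PCI scope.
service_policies = {
    "personalization": {
        "zone": "fast-moving",                  # can deploy many times a day
        "db_grants": ["ratings.read", "ratings.write"],
        "pci_scope": False,
    },
    "signup-payments": {
        "zone": "locked-down",                  # tightly controlled and audited
        "db_grants": ["payment_tokens.write"],  # single-purpose, write-only connection
        "pci_scope": True,
    },
}

def can_access(service, grant):
    """A compromised personalization instance simply has no grant for card data."""
    return grant in service_policies[service]["db_grants"]

assert can_access("signup-payments", "payment_tokens.write")
assert not can_access("personalization", "payment_tokens.write")
```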
What used to be policy-controlled by the operations people — what they felt was a safe sandbox for the developers — is now really being driven from the other end. So this idea of developer-driven infrastructure is something that is turning things around. And a lot of what I’m seeing is that big banks, and people like that, have their existing policy frameworks and rules, and they’re trying to apply them in the new world, and it looks the same, so they’re happy because they’re compliant. But they don’t actually have the real policy separation that they think they have, because it’s all totally reprogrammable — you just have the illusion that you’re still conforming to the policy.
A lot of these things were Ops-controlled. The Ops team would control the data center, and then the networks in the data center, and now it’s all developer-defined — software constructs controlled by your, you know, cloud APIs. If you’re updating it 10 times a day, there isn’t time to have 10 meetings a day with operations to do the handoff. So, what we’ve been seeing is people just running it themselves. The only person that knows the exact state of the system is the developer that just updated it. That sounds scary until you realize that each of them is controlling a very small piece of the system, and the aggregate behavior of the system turns out to be really robust and reliable — partly because if you put a developer on call, they write really reliable code, and they don’t release code on Friday afternoons, because they want a quiet weekend. You know, they learn a bunch of practices about what it’s like to be on call and how not to break things.
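A minimal sketch of what developer-defined infrastructure can look like, assuming AWS and the boto3 SDK (the AMI ID, instance type, and tags are placeholders): the developer provisions capacity with an API call rather than filing a ticket and waiting for a meeting.

```python
import boto3

# Assumes AWS credentials are configured; all identifiers below are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder image baked by the team's build
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [
            {"Key": "service", "Value": "recommendations"},
            {"Key": "owner", "Value": "team-personalization"},
        ],
    }],
)
print("Launched", response["Instances"][0]["InstanceId"])
```

The policy constraints Adrian mentions live in the account's permissions, so the call either succeeds immediately or is refused, with no handoff meeting in between.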
Frank: So we went from an infrequent, in-person change review board to vet the changes, to continuous change and, “Hey, let’s coordinate over Slack.”
Adrian: Pretty much. Yeah, you have to tell people what you’re doing, but you don’t have to typically ask for permission and go and have meetings and things like that. This is part of unlocking the innovation. And the people that are most interested in these are large teams of people trying to build complex products, typically enterprises, and they are worried about getting disrupted by the latest Bay Area startup or whatever. There’s an existential threat here, that if you’re doing quarterly releases and your competitor is doing daily releases and continuous delivery, you’re gonna fall so far behind in the user experience that you’re just gonna suffer, right. So, that’s the big driver that is making people say, “Well, how do you get there?” There’s a whole bunch of things tied together. You’re bringing in cloud, DevOps is a whole other area, and microservices as an architecture — all these things tied together — and some cultural change as well in the organization of the company. The companies that are doing well at that are really starting to accelerate off into the distance.
Martin: It’s also worth teasing apart two trends. One of these trends is, you know, a single company, instead of building a monolithic product, wants to build a microservices product, and gets all the efficiencies of doing that as far as the development process, the operations process, everything else. But there’s kind of a broader industry trend where companies’ products are basically microservices, right? There are companies out there where, basically, the only way to access the product is through a fairly narrow API. I mean, you know, there are so many of these now that there are other startups that will just basically stitch them together, and they can build full applications without writing much code. So, I think that, in addition to a single company getting a lot of advantages, the entire industry is gonna get a lot of advantages and see a lot of innovation as a result.
Frank: Yeah. If you had said five years ago that there would be multiple independent public companies whose entire business was offering an API, you would have been laughed out of the room, right? And now, look at us. Twilio, and Stripe, and on and on.
Martin: I like to do the mental exercise of, kind of, where this is all going, and I still love Chris Dixon’s quote of, you know, “Every Unix command becomes a company.” It’s like grep becomes Google, whatever. Like, I think, you know, we may be having an analog here, which is every function becomes a company, right? It’s, like, even more granular than a command line tool. Every single function, or a logical function, becomes an independent company. And I do think there are implications on things like ownership and dependability, and stuff like that, that we haven’t grappled [with] yet as an industry. It’s a very exciting direction.
Adrian: Yeah. You’re able to build something now that pulls in things from APIs and pulls in some containers, and you just have your little piece of code in the middle that stitches it together and builds a completely new service from that. So, it’s just much easier to get things built. It’s more efficient for the big companies, but it has democratized all the way down to the point where pretty much anybody with a laptop can go build something interesting. And compared to 5 or 10 years ago, you’re doing things that would have been just totally impossible to put together at that point. There’s much more room for innovation.
It also makes it harder to compete, in some ways, because now it’s hard to build, you know, a billion-dollar software company on top of these things, because they keep changing underneath you, and they’re cheap to build. So, you’ve got lots of disruption coming. And, you know, GitHub and open source are other big players in here that are just making it much lower cost to get things done. So, what you’re seeing now is Twitter, and Facebook, and Netflix, and Google, and LinkedIn producing the stuff that you actually want to use, which has already been tested at volume — and then it’s actually much harder to build a proprietary software company, because you’re competing with these big end users, and you’ve got this thing you’ve just built, and it’s flaky and doesn’t quite work right.
Martin: We’ve talked about this, but it seems like closed-source shippable software is on its way out, or dead. And there are a number of reasons for this. One of them is just that the enterprise buyer likes open-source software, but another one is that it’s a real burden on the company to ship software, right — especially if that software is a distributed system. I mean, like, you often don’t have skilled operators, and every environment is different, right. So, you’ve got these heterogeneous deployment environments. You end up with, like, the mother of all cache-consistency problems, where you’ve got a bunch of versions out there, and across a lot of products you’ve got to maintain a bunch of versions, etc. It’s hard.
Frank: The QA matrix from hell, right? Oracle’s version multiplied by the flavors of Unix multiplied by whatever Windows versions you’re supporting, right?
Martin: Yeah, that’s right, that’s right.
Frank: Your poor QA manager.
Martin: Yeah, that’s right. And then distributed systems, generally — I mean, a real trick if you’re running your own operation is that you have skilled administrators that know how to manage a cluster. There are very, very few companies — I can think of maybe one — that have actually managed to ship a distributed system that was manageable by a non-skilled operator. It’s a very, very difficult problem. And so a great thing about offering something as a service is, like, okay, you don’t have any of these problems. And so, basically, your post-sales operations budget is way lower. It’s much easier to start a company now, but at the same time, there are questions about, “Okay, so what are the sizes these companies are gonna end up being? I mean, how big is the market for a single function?” I think it’s still to be seen, like, how big these companies are gonna become.
Frank: Yeah. Big challenge from an investor’s point of view, which is, if the essential argument is, “there will be no more cathedrals, it’s all bazaars from here on out,” it’s a little harder to make money, right, because the biggest…
Adrian: You’re investing in a food truck, and that’s as big as it’s gonna get.
Frank: So, put yourselves in the shoes of the enterprise CIO. The pace of change is accelerating, right. The ink just dried on her team getting VMware certified. And now, we’re on to containers, and then people are talking about serverless and functions as a service with, sort of, Lambda architecture. So, talk a little bit about what’s coming, and then the ability of an average organization to sort of absorb these changes.
Adrian: Containers came along, really over the last two years, and it’s one of the fastest takeovers of enterprise computing we’ve ever seen. It’s quite remarkable how quickly they were able to colonize the enterprise space. It solved a real problem.
Frank: What role did containers play in moving away from the monoliths to the microservices architecture?
Adrian: What happens with containers is that all that stuff is packaged into a bundle which has all the right versions of everything inside it, and you can download it and run it. It also abstracts you away from the particular version of what you’re running on. There are now containers for Windows as well, but originally this was a Linux-based concept. You have the same container format whether you want to run in-house or on a public cloud. It doesn’t really matter. That container can run on VMware, or KVM on OpenStack, or on Amazon, or Google, or Azure, or wherever, right. You’ve just abstracted yourself up one level. It gives you that kind of portability. If you think about it — machines used to sit at the same IP address for years. People would actually know a machine’s IP address by heart if they wanted to do something to it, right?
Martin: Well, I remember that. Yeah.
Adrian: And then VMs came along, and the VMs are more transient — you know, this thing would come and go, maybe on the order of weeks or something, a biweekly update of your VM. And then, with containers, it’s perfectly reasonable to have a container that runs for less than a minute. You can create an entire test environment, set it up, run your tests automatically, strip the thing down again — and the size of the things has gotten much smaller. If you take it to its logical conclusion, you’d basically fire up, effectively, a container to run a single request, have it sit around for about half a second, and then have it go away again. And that’s really the underlying technology behind AWS Lambda. It’s a server on demand that just isn’t there most of the time. And this is the bleeding edge right now. We have to figure out how to track these, sort of, ghostly flickering images that are coming into existence for short periods of time. How do you track what’s going on? You end up with end-to-end tracing as the only way you can monitor things, rather than it being a special case like it is now.
So, there’s a bunch of interesting problems here, but what’s really been happening is just this trend to more and more ephemerality — these extremely ephemeral systems — and then the charging. You used to pay for three years’ worth of machine, and then it became, well, you can rent a VM by the hour. And then containers, you know, are lighter weight, and now you’re paying by the hundred milliseconds, right? It’s perfectly reasonable to run for half a second, which means that the setup time to create that half-second’s worth of machine needs to be radically less than half a second. And the time taken to bill for it needs to be less than half a second. If you remember the story of SMS — the SMS record is, you know, 140 characters, but the billing record is much bigger than that. It’s more like a kilobyte. So, if you actually take a telco and rip out all of the billing steps for their SMS traffic, it would cost a tenth of the amount to run if they didn’t bill for it. So, you’ve got this effect where the overhead of doing the thing is actually vastly more than the thing you’re trying to do. So it’s a really interesting challenge — how to create monitoring and billing and scheduling systems that work so quickly that you can afford to bill things in such small time increments.
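For a sense of how little “server” is left at that point, here is a minimal AWS Lambda handler in Python; the event fields are hypothetical, and at the time of this conversation Lambda billed in 100-millisecond increments.

```python
# The handler is all you deploy: compute exists only for the duration of one
# invocation, and you are billed only for the time it actually runs.
def handler(event, context):
    user_id = event.get("user_id")          # hypothetical request field
    greeting = f"hello, {user_id}"
    return {"statusCode": 200, "body": greeting}
```

Which is exactly why the monitoring, billing, and scheduling machinery around it has to be cheaper and faster than the half-second of work it is accounting for.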
Frank: The portfolio company 21 is, sort of, right in the thick of this, right — how do you stand up an ad hoc agreement between one API and another and, like, have the billing all work? And you know, Bitcoin might play a part in that.
Martin: Also, to your question, going back to the CIO — I mean, it seems to me, in general, with disruptive technologies, the disruption happens first and then all the day-2 Ops happen second, whatever that is. And, I think, in this case, you know, the disruption is around delaminating the app and breaking it apart. I do think that CIOs should not despair and Ops teams should not despair, because what happens very quickly, in the vacuum left by this sprint on new technologies, is that whole ecosystems and whole industries arise around them to provide visibility, to provide security, to provide Ops — and we’re seeing that now. And so, I mean, I think that it’s quite possible to decouple the disruption — which is this velocity around development — from the basic operations. And that tooling is definitely going to happen as well. Understanding that ecosystem, understanding the players, is very important if you wanna stay on top of this kind of big change.
Frank: Leaning forward into the change, assuming the tooling will meet you halfway, right.
Martin: Exactly right.
Frank: And then you get the benefit — the big benefit from the CIO’s point of view, in my opinion, is that you don’t have this loop where the business user asks for something, it takes you 15 months to build it, only to discover that’s not what the business user really wanted, because the requirements were poorly specified. These days, right, no problem — I’ve got a change for you, we’ll put it live this afternoon. Right? So, the rapid experimentation that happens in startup land can now migrate into the big organizations, and you don’t have to get your requirements perfectly specified at the beginning of a waterfall process anymore. Let’s run the experiments.
Adrian: It’s actually even better than that. What the CIOs are providing now is a set of APIs for the development teams that are part of the business to automatically provision whatever they want, with certain policy constraints around what they can and can’t do. But fundamentally, you’re providing APIs. Operations has moved from being a ticket-driven organization to being an API. They are now no longer a cost center. That is a very profound move, and I’m seeing a lot of these CIOs buying into that. They want to be part of the product. They want to be part of how you support the business — you provide APIs so that the business can just get things done, at a rate where you’re not slowing them down.
Martin: We’re actually seeing the creation of a new buying center in the industry — vertical platform engineering, a vertical DevOps team, whatever. This is, like, budget allocated. It’s actually viewed as a profit center. It’s product-aligned, but it’s core infrastructure and operations. And these are very technical buyers, so it’s not the traditional enterprise go-to-market.
Adrian: This is also moving across industries. We’ve seen, obviously, media and entertainment and, to some extent, retail were early movers — mostly because the threat of Amazon itself caused retailers to step up to, sort of, reengineering. We’re now seeing fintech — you know, Wall Street — really paying attention. Some people are way down the road, some people are just starting. Manufacturing, that whole industry, is just starting to think about this. There’s definitely a, sort of, industry-by-industry domino effect as people are figuring this out.
Frank: So, we’re a decade on or so into this revolution, right. Many strands. What excites you now?
Martin: For me, what’s really exciting about this — and I’ve said this before — is that if we have the ability to reimagine all of infrastructure, you can now reimagine tooling, and reimagine security, and reimagine operations and management. We get to reimagine it with more semantics and context than we’ve ever had, you know. So, what does it mean to have a firewall in a world where everything is microservices? What does it mean when operations management and debugging — things that were traditionally boxes stuck on perimeters — also become functions? And actually managing your infrastructure is almost like looking at a debugger, a context debugger. It’s like you have a symbol table with you. It’s, like, this whole thing is in one large IDE, and you can do that for your operations. I think it’s gonna push the state of the art on how we even think about Ops in entirely new areas. I’m really excited about that change.
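As a rough sketch of what a firewall could look like once it speaks the application's language (the service and operation names are made up): policy is written over services and operations instead of IP addresses and ports, and enforced in the RPC path rather than on a perimeter box.

```python
# Allow-list expressed in application terms: which service may call which operation.
ALLOWED_CALLS = {
    ("checkout", "payments.charge"),
    ("homepage", "recommendations.top_n"),
}

def authorize(caller_service, operation):
    """Checked wherever the call is made, e.g. in a shared RPC client or sidecar."""
    return (caller_service, operation) in ALLOWED_CALLS

assert authorize("checkout", "payments.charge")
assert not authorize("homepage", "payments.charge")
```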
Adrian: I think that the whole serverless area is the bleeding edge right now. The monitoring tools industry is, right now, being disrupted pretty heavily by serverless. There are only one or two tools that have really come into existence in the last year or two that have, effectively, a way of processing stuff that is this ephemeral and dynamic. So, there are some interesting products coming out. It’s just a better way of living. If you’re a developer and you’re working in a waterfall, siloed organization, it’s kind of soul-destroying for a lot of people, right?
Frank: Indeed.
Adrian: And when you get ownership of a product — you know, each distributed team gets its own product ownership, and they get to define the interface and manage it and run it — yeah, you might be on call, but you’re in much more control of your destiny, and it’s much more rewarding, and it’s more productive. And the ability to get more stuff done as a developer is just rewarding anyway, right. It’s a better way of working for people.
Frank: Well, that’s great. Well, thank you, Adrian. Thank you, Martin. We’ll see a lot more unfold as the architecture shifts.
The CFI Podcast discusses the most important ideas within technology with the people building it. Each episode aims to put listeners ahead of the curve, covering topics like AI, energy, genomics, space, and more.