CFI Podcast

Reining in Complexity — Data Science & Future of AI/ML Businesses

Peter Wang and Martin Casado

Posted August 21, 2020

There is no spoon. Or rather, “There is no such thing as ‘data’, there’s just frozen models”, argues Peter Wang, the co-founder and CEO of Anaconda — who also created the PyData conferences and grew the early data science community there, while on the frontlines of trying to make Python useful for business analytics. He views both models and data as fluid, more like metaphysics than typical data management… Or perhaps it’s that when it comes to data, those with a physics background just better appreciate the mind-bending complexity and challenges of reining in the natural world, and therefore get the unique challenges of AI/ML development, observes CFI general partner Martin Casado — whose first job after college involved computational physics simulation and high-performance computing in Python at Lawrence Livermore National Laboratory. (Wang, meanwhile, graduated in physics.)

But this is not just a philosophical question — the answer has real implications for the margins, organizational structures, and building of AI/ML businesses. Especially as we’re in a tricky time of transition, where customers don’t even know what they’re asking for, yet are looking for AI/ML help or know it’s the future. So what does this all mean for the software value chain; for open source collaboration and commodification; and for the future of software businesses? After all, it’s not written in stone that “All information systems must be deconstructed into hardware, and software, and data” and that “software must have these margins”… Will there be a new type of company?

Show Notes

Discussion of various data management tools [1:44] and whether new tools are needed [5:52]

Software vs. hardware [10:00] and a discussion of what data is [13:04]

Managing the inherent complexities in data [14:22] and the backgrounds of the hosts [16:47]

Different company types that are trying to rein in data complexity [22:00], and a vision of a new company built on AI/ML workflows [32:17]

Advice for companies in the AI/ML space [38:37]

Transcript

Hi, everyone. Welcome to the “CFI Podcast.” I’m Sonal. For this week’s episode, we have one of our hallway-style conversations. And this one is literally like eavesdropping in on a debate and discussion that actually started as a Twitter thread debate and discussion — all around the question of whether and how data and AI/ML (machine learning) companies are different than software companies, and what that means for the future of software businesses. Our guest even questions our view of software eating the world — or rather, asks what happens when software is everywhere? What comes next?

Our guest is Peter Wang, the co-founder and CEO of Anaconda, who also leads its Open Source and Community Innovation group — as well as created the PyData community and conferences, and has devoted a lot of time and energy to growing the data science community there. And he’s in conversation with CFI general partner Martin Casado, who’s written a lot about the evolution of software businesses, the new age of data, and especially AI/ML economics. You can find those pieces at CFI.com/mleconomics.

The two dive into a number of themes throughout this conversation, ranging from open source and crowdsource innovation, and the messy ways that innovation really plays out — to what it means when you move from hardware to software to data and AI/ML — abstracting something that is not just complicated, but actually complex. And then, they touch briefly on what it means practically in building a new type of company, as well as the evolving role of data scientists. But the conversation begins with their shared vantage points in coming from physics, which is relevant here since these new kinds of businesses and products involve a process of experimenting, much as with physics.

The best tools to run AI/ML

Martin: Both you and I come from the physics, computational physics background, and we’ve both, kind of, been pushed into this data, AI/ML, data science world — and I don’t know if that is coincidence, or if we have an affinity for that. Before we get into that, though, there’s kind of a competing view of the world, which basically says, “SQL can do everything.” And it’s funny, we spent a lot of time actually looking at the data science, or the data, landscape, and it feels like there’s two kinds of worlds. There’s, like, the data warehouse maximalists, who are like — we’ll stick all data in the data warehouse, and then we’re gonna do SQL. And then we’re gonna have some extensions to SQL, like you see popping up in, like, BigQuery, or whatever, and that can do everything that needs to be done. And oh, by the way — if someone’s using Python and R, all they’re really doing is basic regressions. And so we can just make that a simple extension, and we’re done.

And then there’s the other view of the world, which I like to call the Hadoop refugees, which is like — actually, we do hardcore computation, and we need R and Python because the stuff we do is very sophisticated. I mean, I know you’re squarely on one side of those. But I wonder, like, do you think there’s a convergence that happens? Do these stay two worlds? Does one become irrelevant? Like, what happens there?

Peter: Just because you oppose extremism doesn’t make you an extremist, right? I would say data warehouse maximalists are extremists.

Martin: <laughter> Fair enough. Yeah.

Peter: And I see a heterogeneous world. It’s the old yarn about, I guess, I don’t know — there’s so many variants of this. But Alan Perlis, a great computer scientist, has some really great quotes — some irreverent remarks about these kinds of things. But I would say, to the idea that everything can be expressed in SQL, it’s like — which SQL? With how many extensions? Because at the end of the day — with how many extensions upon extensions, and Multicorn on your Postgres actually running a Python kernel? Yeah, I guess you’re doing SQL, but you’re running a Python script, you know, so that’s not really — it doesn’t count.

And frankly, a lot of stuff runs Access and VBA in this world. VBA isn’t SQL. I think if you choose to look at the world through a particular lens, you can choose to count everything else as residuals and rounding errors, but if you take off those lenses, you see a much more diverse landscape. And I think that’s where, for me, I see the space for SQL, and I understand the reasons why — it has evolved into a particular kind of animal. Like the shark is still the best predatory fish in the ocean, but it’s not the biggest predator in the world.

And I think there’s something about that, that if you’re in the ocean, you’re gonna basically [be] shark-like if you’re gonna eat a lot of fish. So if you’re in that business data analytics world, especially because a lot of business data looks like fish — it’s evolved to look like food for the sharks. So that’s kind of the way it is. But what Hadoop opened up back in 2012 — I called it the Hadoop battering ram. I said, “Listen, we’re not gonna win the Hadoop game. We’ll let the Hadoop vendors go and fight against the Teradatas, and the Oracles, and all the classical data warehouse guys. Let them do that thing. Once it battered down the door, we’re gonna come flooding in with all sorts of heterogeneous approaches to data science, data analytics — things that are hard to ask in SQL.”

And moreover, there’s a term I use, which I don’t hear used very often. Now, obviously, you’ve heard the term shadow IT, which is used quite a bit, but there’s a shadow data management — that’s a far, far more insidious and dangerous problem. When I was at a large investment bank, they had a million-dollar Oracle database sitting somewhere, and it was too slow to actually run the analytics they needed. So they had this instance of an Oracle database — it cost a million bucks — and the only query they ran was a big full table dump into a CSV. And then they took that CSV, and they did everything else with it. And it was Python scripts. It was some random Java crap. It was a bunch of other stuff. So if you’re a data manager — if you’re, like, in the data management practice, you say, “Wow, we just have another big old million-dollar instance stood up. Our data management techniques are great.” It’s a, what do you call it, a Potemkin village, I guess, right? But then when you actually go, and you ask the developers, “Hey, where’s the source data for this stuff? Where’s prod data coming from?” Like, “Oh, yeah, this file share backslash-backslash something or the other, or you know, that file.” I’m like, “That file? What about that database?” “Don’t touch the database. It’s too brittle.” Right?

So there’s this kind of stuff going on, and everybody listening to this knows what I’m talking about. That shadow data management is absolutely a pernicious problem, and data science is just eating it alive. Because to ask the question you want to ask, you have to integrate datasets together. Master data management is about siloization, normalization, and all this kind of stuff.
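
To make the pattern concrete, here is a minimal sketch of the workflow Peter is describing: the only sanctioned query is a full-table dump to CSV, and everything else happens downstream in scripts. The database, table, and column names are illustrative stand-ins (sqlite3 and pandas stand in for the Oracle-plus-scripts setup), not details from the conversation.

import sqlite3
import pandas as pd

# The one query anyone dares to run against the "official" database:
# dump the whole table to a flat file.
conn = sqlite3.connect("warehouse.db")           # stand-in for the million-dollar instance
trades = pd.read_sql_query("SELECT * FROM trades", conn)
trades.to_csv("trades_dump.csv", index=False)
conn.close()

# Shadow data management: all the real analytics run against the CSV,
# invisible to whoever owns the data management practice.
df = pd.read_csv("trades_dump.csv", parse_dates=["trade_date"])
daily_notional = df.groupby(df["trade_date"].dt.date)["notional"].sum()
print(daily_notional.head())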

Martin: You’ve hit the segue, too. I just think it is so germane to what we’re here to talk about, which is — there are clearly problem domains which SQL is totally fine for, right?

Peter: Yep.

Martin: And then you get the problem domains that it’s just not good for. I mean, like, any sort of hardcore statistics — it’s just not very good for that. And the point of us being on this podcast is actually to talk about, okay, like, “Listen, we’re seeing kind of new types of companies and new types of workloads, and they’re around kind of processing data.” And I totally hear you that this shadow data management is a real issue. And you can make an argument that the reason it exists is not because people are stupid, or they don’t know how to do good workflows. It’s that, literally, we don’t have the tooling to deal with data in the right way.

One macro question that I have that I would love to hash out with you is, are we seeing a fundamental shift in workload that requires a fundamentally new set of tools and a fundamentally new type of company? Or is this just more of a transition where we can kind of put into service the old tools? And I just want to be a little bit more specific, which is — in the past, you had your toolkit of systems approaches — you had a software system, and you’d kind of pull one out and apply it to the problem, and SQL is one of them. And we kind of understood how those software systems behaved. And we kind of understood how the company was supposed to run and behave.

You know, as an investor looking at a lot of data companies, they just don’t look the same. The types of tools they use, the type of operational practice they use. And the one that you pointed out was a great one, which is — now data becomes such a primitive that you want to actually apply, like, software techniques to it, in a way, but we don’t have the tools to do that. And then we’ve written posts about [how] margin structures look a lot different, the way you build your company [is] different. And so, just — do you think this mess is because data scientists don’t have formal CS training? Or do you think this is an entirely different problem domain, and we should actually look at what the future looks like for that, and development tools, etc.? This is like the heart of what we’re talking about.

Peter: This is absolutely the heart. And I will try to start from the top, which is this concept that every baby or every child is born — and the parents raising it think their child is normal, right? They think of, like, their child as the normal thing. So you have developers coming online in the late 2000s, let’s say, and they think this is the world. Even me as a professional starting in ’99, right, it’s like, “Well, this is just what there is.” The more you start researching history and looking back, you’re like, you know what? We’re just building in this industry — we just layer. It’s frozen accident, on top of frozen accident, on top of frozen accident. Very, very few times do people make principled, intentional, revolutionary shifts, right?

Martin: Totally. Yeah.

Peter: You basically Band-Aid a substrate. Okay? So starting from the top, what I would say is that, there is no law — there was nothing carved in stone that Moses brought down from the mountain that said, “All information systems must be deconstructed into hardware and software and data.” There’s no such thing. It was information systems, full stop. The fact that we had different cost structures for innovation in hardware, versus software, versus networking, and so forth — that has led to different rates of innovation, different places, things like that. And so when a business steps in and says, “Okay. What’s on the shelf that I can use to accelerate my business processes?” Then it makes sense, because this thing, that thing, the other. Like, when you buy a car, you buy the car, and then you put CDs in the car. You don’t go buy a car with a CD pre-spec, right?

Martin: Is there an exception for technical innovation in certain areas? So for example, like — we now know how to build systems that extract very useful information out of data pretty simply. That didn’t really work in the late ’90s. Like, I remember the whole first neural network wave, like, genetic programming.

Peter: Oh, yeah. Right. Right. Yeah. Yep.

Martin: That whole wave of the late ’90s — I did a number of projects on that that didn’t really work. They actually work now. So you could also argue that the technical landscape [has] changed. It hasn’t just been a macroeconomic issue for the company.

Peter: Yeah. I mean, ornithopters work if you can flap hard enough, right? It doesn’t necessarily mean it’s the right architecture. <laughter> And it depends on the density of air. Ornithopters might work great [on] Mars, but not on Earth. Right? Propellers work better on Earth. Right? Well, with internal combustion engines, and etc., etc. But the point is that, yes, you’re right. I guess my point could be said thusly. There is a multi-dimensional optimization surface we should be thinking about, not just the optimization surface of software, or a data architecture, or data management, and things like that. I mean, yes, someone did software-defined networking, and you know that better than anybody.

Software vs. hardware

Martin: But here’s what’s interesting to me, which is if you build a hardware company, the tools you use, the money that you need to raise, the innovation pace, is defined one way. And if you do a software company, it’s actually defined quite differently. Although you still use, like, a lot of the same practices, it’s still engineering. You can still modularize. It’s not clear to me that as soon as you move to data, you’re in the same domain.

Software, to me, feels like an engineering problem that you can modularize — you can build interfaces, you’re building it from the ground up, you control all the primitives. Data feels like science. It’s like you’re trying to rein in the complexity of the physical world. Right? It’s one thing to, like, build a house — building a very complex building is very hard, and we had to do all this design practice and the other, but we got the skyscraper. That’s very different than understanding the cosmos, because the cosmos is so complex, and you don’t understand what it is, and you don’t have a blueprint. And data companies are defining the cosmos more than building the skyscrapers. Does that make sense?

Peter: You hit it on the head. I’ll just back up and comment on one thing relative to the hardware and software. Hardware is frozen software to some extent, but the pace of — oh, how to put it? Because hardware is expensive and slow, and has been, at least historically — the industry has a much more robust view towards standards. Now here’s the thing — because you have standards, now you have a binary, bullshit-proof, “Does it work, or does it not work?” kind of thing. Okay. That then reflects and changes, then, kind of what you need to do.

What [standards] do is make mistakes in hardware expensive, because there is an inter-subjective reality beyond any particular vendor about what is a mistake. In software, because it moves so fast, it’s moving too fast to build hard specs and say, “Did you meet this performance spec you said you were gonna hit?” No one cares about that. Software is just so fast and loose. It’s like jazz. I mean, so — because it moves fast, and you can’t put that kind of thing in place, the price of making a mistake in software is almost completely subsumed or lost. And so it’s cheap to make mistakes in software because the cost is invisible.

Martin: 100%. However, the actual engineering practices aren’t that different, as far as, like — I mean, you’re absolutely right, like, formal verification is much more important in hardware, but it still feels like engineering to me. You know exactly where you’re going. You have a roadmap. You build an engineering team around that. Data is different.

Peter: Data is different.

Martin: You don’t have a roadmap. Like, it is the universe that you’re trying to, like, figure out from the inside out.

Peter: In fact, this is the exact critique. You’re absolutely right. When you talk about what you do in software and hardware companies, you are trying to manage complexity, for the most part. You get something, but the thing that always screws you — I figure, every kind of engineering is trying to achieve some kind of lift while fighting some kind of drag. Right? And in the case of software or hardware engineering, usually, it’s achieving performance or something like that, or some scale of computation, while minimizing complexity — and having manageable errors and things like that. Okay. So that’s those things. But it’s very goal-oriented.

Martin: Yeah. Building to a goal. It’s one thing to say, like, “I’m gonna build this complex system, which you can basically describe and do mock-ups for — you know the destination.” That’s very different than saying, “Extract insight out of this.”

Peter: That’s right. The great John Tukey said, “There’s two kinds of data analysis. There’s confirmatory — kind of, reporting mode — and there’s exploratory.” And the thing you’re talking about, the reason why data smells — and data practices smell like science, is because there is no such thing as data. All data is just frozen models. Right?

Martin: Totally. 100%.

Peter: Every single data set comes from a sensor, even a picture. Everyone thinks, “Oh, well, I took a picture.” Right? That’s just raw data. No, it’s not. There’s a Bayer matrix. There’s a log transform. There’s a gamma correction. And, fundamentally, there’s an exposure time, which is a temporal sampling domain. So there’s all of these things. There is no such thing as data. There’s just frozen models. And where businesses get screwed up is when they treat data management as, sort of, this goal-oriented siloization — it’s a static artifact, and it is artifact management. It’s almost like a — sort of ad hoc library process. And that’s not the same as the kind of data thinking — or the way when you think about data in an ML/AI sort of world. Because in that world, we see that models and data are both fluid. It’s much more from a meta — not to get too metaphysical, but it’s more of a process-oriented metaphysics. It’s much more temporal-oriented than the static views that current data management practice has. And that’s why I think the SQL database extremists are not going to win this particular round.
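
A toy sketch of that point, not a real camera pipeline and with invented numbers: even a “raw” photo is already a stack of modeling choices, with an exposure window, a Bayer mosaic that samples one color per pixel, and a gamma curve applied before anyone calls the result data.

import numpy as np

rng = np.random.default_rng(0)
scene = rng.uniform(0.0, 1.0, size=(4, 4, 3))    # "true" RGB radiance (made up)

exposed = np.clip(scene * 0.8, 0.0, 1.0)         # exposure time: a temporal sampling choice

# Bayer mosaic (RGGB): keep only one color channel per pixel -- a spatial model.
bayer = np.zeros(exposed.shape[:2])
bayer[0::2, 0::2] = exposed[0::2, 0::2, 0]       # red sites
bayer[0::2, 1::2] = exposed[0::2, 1::2, 1]       # green sites
bayer[1::2, 0::2] = exposed[1::2, 0::2, 1]       # green sites
bayer[1::2, 1::2] = exposed[1::2, 1::2, 2]       # blue sites

raw_file = bayer ** (1 / 2.2)                    # gamma correction: a perceptual encoding choice
print(raw_file)                                  # this is what gets stored as "raw data"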

Martin: So, I’m a systems guy. Right? Like, I did my Ph.D. in computer systems. And in systems, we have five tricks. It’s like virtualization, caching, you know — like, we literally have five or six tricks that we throw at every single problem. And you can build amazingly complex systems with these things. Like, you know, we understand distribution, we understand consensus. And so while a piece of software like Google is very complex, it actually can be reduced into subproblems that we know answers to. So I would say, like, the relative complexity — the relative entropy of a software system — is finite. It’s not clear to me, if you’re trying to use data to run a system, that the entropy is as finite.

Peter: Well, yeah.

Martin: Meaning you don’t control nature. I mean, what do we use data for? We use data for pricing. We use data for fraud detection. We use data for calculating wait times. Okay. So what are the inputs from these things? These things — it’s like people’s behavior. Like there’s so much entropy in all of us. It’s like the weather. It’s like this…

Peter: It’s hugely lossy, right?

Martin: Well, it’s these classically chaotic, high-entropy systems. And so one of my theses is — and I’ll just have to test this on you — that building a software system is a relatively low-entropy exercise, because you’re dealing with primitives that you understand and you’re engineering it. Whereas actually trying to deal with data, you’re reining in so much entropy, and you’re trying to extract it. That ultimately is why we end up with different companies, because it’s just much, much harder to, like, deal with that much complexity.

Peter: Yeah. Well, that makes a lot of sense. And the Cynefin framework talks about the difference between complex and complicated and chaotic. Right?

Martin: Yes. Yes. Yes. Sure.

Peter: Right. And so, complicated versus complex — I think the pithiest way to say this is: complicated means that you can take it apart, understand the bits, and put it back together again. Complex means that you cannot do that. Right? So a fine Swiss watch is complicated. A cockroach is complex. And so I think when you talk about computer systems — because I’m not a systems guy like you are — one of the best things that I’ve heard about it is — what is the quote? Everyone thinks distributed computing is about space, but really it’s about time. What is the time horizon in which we can define a unit of atomicity? What is the time to coherence? Right, etc., etc. And so it’s always a space-time trade-off.

And I’m sorry I’m making this sound so much like the physics world, but I see it that way because it’s a natural reflex for me. In fact, I wanted to major in computer science, but my dad — who was a physicist — said, “Look, son, if you become a computer programmer, if you go into computer science, you’re gonna become a programmer, and you’re just gonna build tools. If you’re a scientist, though, you’re gonna be the one using those tools to make an impact.”

So I majored in physics. But then, as soon as I got out of physics, it was ’99. And I’m like, “All my friends are getting, like — they’re getting starting bonuses, and they’re getting jobs, and they’re worse programmers than me.” And so I ended up joining a computer graphics startup. And that’s when I started using Python — in ’99. I realized that I could script a bunch of C++ much better with Python than with the broken template support in Visual Studio. It was God-awful.

Martin: I came to networking by way of computational physics. Actually, when I was a computational physicist — I was a computer scientist doing computational simulation at Lawrence Livermore National Lab. That was my first job after undergrad. I was a huge Numeric user, because that was the only way to do high-performance computing in Python — and from what I understand, that became Anaconda. I would love it if you would kind of give the history of that project.

History of programming tools

Peter: So in ’99 — it was Jim Hugunin, and I think there are some others that I might be forgetting, who can be credited with working on some of the early matrix stuff. Jim Hugunin worked on Numeric, and they realized that the operator overloading in Python would allow you to do something that looked a bit like MATLAB. You know, like, it’s okay — it looks like the code you’d write in MATLAB. And it’s like, “Hey, this hack kind of works.” And also, Python’s C-level extensibility meant that they could build a tight little C library that would be fast. So you’re writing in the scripting language, with a little syntax that looked like MATLAB, but it ran at basically C speed, which is really important.
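
A small illustration, using today’s NumPy (the descendant of Numeric), of the hack Peter describes: operator overloading lets a whole-array expression read roughly like MATLAB, while the loops underneath run in compiled C rather than in the Python interpreter.

import numpy as np

a = np.linspace(0.0, 1.0, 1_000_000)

# Whole-array expression: no visible Python loop, executed by the C library.
b = np.sqrt(a) + 3.0 * a ** 2

# The equivalent interpreter-level loop does the same arithmetic far more
# slowly, which is why a C-backed array type mattered so much.
slow = [x ** 0.5 + 3.0 * x * x for x in a[:5]]
print(b[:5])
print(slow)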

So then, it turns out, though, that the Space Telescope Science Institute folks — the ones who run the Hubble telescope — had some other ideas about what they wanted to do with this library, and Numeric wasn’t quite flexible enough, or some other stuff. So they created an alternative matrix library called NumArray. And NumArray had, like, fancy indexing. NumArray had a few other things. And so in the ecosystem in the early 2000s — when I got my first paid job doing Python, [it] was 2004, and I was doing consulting on Python, and SciPy, and all that stuff — there was still a split between NumArray and Numeric. In fact, most of the libraries that were trying to build on top of this stuff built a compatibility layer called Numerix, which would then flexibly import sub-symbols from these different libraries depending on what you were trying to do — it was terrible.

Martin: The wild and wooly days of early Python.

Peter: You know, it’s a mess. Crowdsource innovation is always a mess, but the result is still nice, because what happens is you end up getting somebody like Travis Oliphant — who comes along in 2005 and says, “This is a mess, and this is slowing down innovation, because everyone has to do the work twice. We’ve got to make it work with NumArray and with Numeric, and we can’t make forward progress.” So he put a year of his life into it — just coding and designing — and he made a really nice thing, and he called it NumPy.

And he came out with it in, like, end of 2005, 2006 timeframe. And then the world rejoiced. And I was like, “Oh my God, this is great. This is the unification we needed.” You know, at the SciPy conference in Pasadena the following year, we gave him an award. Anyway, that’s what happened in the mid-2000s. And then, many years later, then in [the] 2010 timeframe, he actually joined the company I was at, Enthought, and then we had many happy days there, doing a lot of scientific computing, consulting. Which is fun for science nerds, but a niche area. Right?

But then we started getting contracts and consulting inquiries from hedge funds, and from banks, and investment banks, and things like that. And by the end of the 2000s, I’m walking the floor of, like, JPMorgan, Bank of America, and they have thousands of people relying on SciPy and NumPy to run advanced models. You had coders sitting next to traders, like on the energy desk, and you’re like, “This guy is asking me really deep questions about SciPy. He’s really trying to do stuff with this.” So I had this insight that Python is ready to go into the mainstream, like, business analytics space. And it’s not just MATLAB that it could be taking market share from, but maybe SAS. At the same time, big data was starting to crest, or peak — and I realized that people wanna do more than just ask SQL questions of their big data. And in fact, when I went to the first Strata, in 2011, all of the vendors on the show floor were selling many different flavors of Hadoop — SQL integrations, faster Hadoop, etc., etc.

But then, when you go to the tutorials, every single data science tutorial was teaching Python and R, but there’s no Python vendor. And also, Python is kind of janky for some of the stuff. It doesn’t play with Java very well. Python and R were both second-class citizens in the Hadoop world. So I said, “You know, I think there’s something here.” And that’s why I started the company. We started as Continuum Analytics in 2012. And it was Python for business data analytics, Python for data science. That’s what led to that. Anyway, that was a long, sort of, exposition. But to your question about the history of all of this — how this came [about]. I think that when you talk about software systems, it’s actually very interesting. We build software systems thinking they’re merely Lego bricks — that we make them relatively homogeneous and well-structured: the studs are spaced this way, they’re this big and this tall — and then we can stack them together, and boom, now you have a bigger Lego.

But in reality, when you look at any real software language in modern software systems, there’s complexity to it — more than the complication. And that’s where your worst bugs lie. You know, like, you have some npm module that pulls in some other crap, and that interferes with some other crap, and it tries to install this other thing on your system — and now you have complexity beyond the complication. So I think the practice of software is bedeviled by the fact that it actually is playing, at this point, with so much complication that it basically appears complex to our human minds.

Tactics for dealing with data complexity

Martin: Barbara Liskov has my favorite Turing Award acceptance speech ever, and if you haven’t heard it, you have to hear it. It’s basically about modularity in computer science — how you can take big problems and make them small problems. Like, with engineering modularity, you can rein in complexity. So you have a complicated system, but I think you can actually manage the complexity. I’ll give you an example on the data side where that’s not the case. There are natural systems that are self-similar. By self-similar, I mean that they retain the same stochastic properties no matter what zoom level you look at them at.

So, unlike a software system — where if you’ve reduced it down to a method, you’ve got, you know, a fairly simple abstraction — there are some natural systems, like, say, coastlines, where it doesn’t matter at what level you look at them, they still are, like, super complex. So one thesis is, like — yes, software systems can be complex, but they’re more complicated, in that you can modularize and focus on things. That’s not necessarily the case with data. Data is as complex as the natural world. Again, like, you don’t have control over the weather, and the weather is self-similar. No matter what zoom level you look at it, it still maintains the same stochastic properties. Data is like that. You don’t necessarily have the tools to reduce the complexity to something that is merely complicated, like you do with software.
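
A rough sketch of the self-similarity idea, using a synthetic random walk as a stand-in for a weather-like process: once rescaled, the increments keep essentially the same spread at every zoom level, so no amount of zooming reduces the process to something simple.

import numpy as np

rng = np.random.default_rng(1)
walk = np.cumsum(rng.normal(size=2 ** 16))       # a toy "natural" time series

for step in (1, 4, 16, 64):                      # progressively coarser zoom levels
    coarse = walk[::step]
    increments = np.diff(coarse)
    rescaled = increments / np.sqrt(step)        # Brownian rescaling
    print(step, round(float(rescaled.std()), 3)) # spread stays roughly constant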

Peter: Right. So the question then, in the data practice world — let’s just keep it at that level, which I think is a great place to be talking about it — is, at which point do you stop? What is your optimization criterion? Right, because all engineering is a trade-off. So for the amount of effort you want to put in, how well do you need to understand that coastline? If you’re trying to target a guided missile into a window of a building, you don’t need to map the coastline down to a millimeter, right? So on and so forth. So I think that when you get to data, you recognize that, ultimately — if you actually want to get all the value out of it — you’ve got to loop it around into the overall OODA loop of your business — the observe, orient, decide, act loop — and actually take action with it, and correct, and zoom into the appropriate level.

Real-world implications for businesses

Martin: I think this is kind of what this all boils down to. So now the question is, let’s say that you’re building a company where, instead of the goal of the company being building a modular software system, [it’s] reining in the complexity of data — which we’re seeing more and more companies do. What does it mean to deal with that much complexity? So what you just mentioned is, well, okay, maybe you look at, like, the different zoom levels — or maybe you’ve got, like, a full feedback system, or whatever. But before we even get to how you do this, I would like to either agree or disagree that the companies trying to rein in that complexity are different.

Peter: I completely agree with that. The companies that actually understand even the problem they need to solve — they have a better chance of solving the problem. Because it’s actually very much like cloud computing. It used to be — how do I build the software on the basis of the computational resources I have access to? Well, once you have the ability to access essentially limitless computation, you’ve got to ask, “Well, what is it I would need to build? What do I really wanna do, right?” So I think with data, it’s a similar thing, where you say, “Well, you can put in for any <inaudible>. You can put in more money and get more texture, more resolution on your predictions.”

Martin: Exactly.

Peter: Where do you stop?

Martin: Exactly. Exactly. Right.

Peter: And stop is, like — I can only convince this CEO to hire three data scientists? So that’s where we stop? It’s what three data scientists can do? I think that’s how a lot of people are winging it right now. But the interesting thing with the hedge funds — when you look at them — is that they understand this. Like, some people say, “You know what, we’re not gonna work at the microstructure level. We’re just not gonna do that, because there’s a few big players that play the high-frequency stuff. We’re gonna leave that out. We’re gonna do kind of longer-term stuff and bigger strategies — some, you know, longer-term strategies.” So they self-select into zones where they believe they have the observational capacity and can connect that to execution capacity.

Again, it’s about the OODA loop. They believe they can run a coherent loop. Data is important in all of that, but more importantly — is keeping track of the model, because it’s not just processing data anymore. At some point, it’s also going to be modifying the systems that are then producing that data. Right? It’s a loop. And the most effective companies — it has to be that the data processing is part of both the inference and the execution step. Right?

And the thing that was the most shocking to me, honestly, in the last 10 years I’ve been doing this: so many businesses — big businesses, at the heart of a lot of really important parts of the business — the models are very old. They’re very stale. They iterate very slowly. And it’s a massively human-intensive task, with VPs and PowerPoints and everything else, to get revs on models. And then you go to the, like, hedge funds, and it’s like, “No, we hire engineers.” They come in, and they code MATLAB, and they’re trading a hundred grand the first week. Right? That’s different. That’s a very different view of the OODA loop.

And, you know, I think in our Twitter exchange, this is where I said — all companies are gonna have to look like hedge funds. Because in a world where you can have essentially unbounded observational capabilities — you can be a logistics startup, and you can basically get data as good as FedEx or anybody else doing logistics. You can do whatever. There’s a great leveling with regard to the sensory capabilities. There’s a great leveling with regard to cloud computing capabilities. You don’t need to go hire 100 sysadmins just to go and rack a bunch of servers. You can just turn on some things.

So with that being said, you can now have extremely low-footprint, fast-moving companies that are just there to run the OODA loop, and to have extremely explicit, intentional sense-making around the modeling. And for them, data is sort of like — it’s the difference between the way a fish sees water, versus somebody holding water in a ladle. Right? You don’t even think about the data because you’re just swimming in it. Right? Obviously, you understand data.

Martin: Yeah. So this is like the silly VC observation. The silly VC observation is: if you look at a software company that doesn’t have to deal with the complexity of data, they tend to have relatively high margins, say 70% to 80%. And the reason is because they’re building skyscrapers, and then they sell those skyscrapers, and the team needed to build a skyscraper is relatively fixed — and then you can sell as many of those as you want. That’s kind of the software model.

When we look at companies that are reining in the complexity of data — and that’s how they extract value — the more people you put on reining in that data, the better your results are. And so now your incentive [is] to, like, have more and more people work on that data over time. So I think the structure of a hedge fund is — if we hire more people to work on the data, we can potentially make more money. Just because they’re actually reining in the complexity of that data. But in the software world, all of that complexity is basically going into the margins — yet, depending on who the buyer is, you can’t increase the top line in the same way.

So let’s say I’m gonna sell five copies of my software, right? Now, if I sell five copies of my software, people are buying the software. They’re not buying the results of the data. Like, maybe they’ll like my software better because it’s more accurate or less accurate, but the number of people working on the data doesn’t directly drive the amount of software that gets built. And so now you have this existential margin issue, which is — you want to increase the number of people working on the data. Labeling it, cleaning it — because you can always get some improvement.

Peter: Right. Here’s the question. If we think about — in the software space, you have software vendors and buyers. And the theory of a software vendor, again — going back over our history — there used to just be computer companies. And then Bill Gates was like, “Hey, stop pirating my crap. Pay for my software, because software is a thing. It’s not just your long-haired hippies copying each other’s Unix code. Like, software is a thing, right? You need to pay me for it.” The Letter to Hobbyists, 1970-something, whatever it was. But he did that at the beginning of the PC era. And the PC era basically said, “Well, here’s a set of standards.” Here’s x86 — the x86 ISA. Here’s EISA, and the bus, and your peripherals, and networking, and all this other crap. And so you have a set of standards in the space — oh, actually, this recent blog post that I think you — I don’t know if you wrote it, but you promoted it. The narrow waist of TCP/IP and the…

Martin: Oh, yeah, that was me — me and Ali. That’s old networking guys looking at crypto.

Peter: The point is, you know, a lot of these things rhyme with each other. When you have standards, what they do is they reduce the cost of innovation, and they increase the innovation surface. The PC era was such a gigantic leveler that it allowed the era of software to thrive. But again, Moses didn’t have a third tablet that said, “There must be a software-hardware divide,” and that software must always have these kinds of margins. We’re now entering into an era where people are considering the entire stack of what an information system is. And so, when you look at that, there’s no reason at all why — if I’m an end-user, customer, buyer — why should X percentage of my alpha, or my margin, or surplus, if you wanna talk about capital and all that stuff — why should that percentage of my surplus, broadly across all these companies, accrue to just one software vendor? Because if I insource the technology — in-house, with the FTEs — all of the residual value stays within the boundaries of my firm.

And this is what a hedge fund does. In fact, when I go and try to sell to hedge funds, they don’t generally buy software. They use our open source. They like to get consulting services and ask questions. They’re very high-end users of our open source stuff. But they basically say, “Why should I share anything?” Like, they’ll buy a database, they’ll buy some things that they perceive to be truly infrastructure and truly commodity. Anything above that, if there’s a chance of it contributing deeply in a generative way — not a decomposable way, but in a generative way to their alpha — they’re gonna keep it in-house. It’s proprietary.

I was at a dinner with a CTO of a hedge fund. And he’s like, “Tell me why I should care about open-source.” I’m like, because they had [an] internal, like, crappy version of pandas — and I was trying to give him the story of like, “Look, if you just use pandas, you would basically leverage all of the — you basically have cost amortization of innovation for you,” right, “and it’s not differentiating value for you to have your own little tabular data structure.” People think that open-source is winning, or has won. I think the fact that open source is commoditizing all this stuff means that software itself — the value chain is collapsing. And so, right now, open-source is a movement. I think, unfortunately, it’s confused. There’s sort of this Stallmanesque religious aspect to it almost. And then there’s something deeply beautiful about crowdsource innovation, and legit community collaborative innovation, that’s really important. And we’re almost losing that because everyone’s like, “Oh, but open-source has won now.”

I think that’s a mystery of the situation. And it’s a thing I keep tweeting about, because I’m saddened by the loss of that thread of the principle. Why do we do open-source? Why do we do crowdsource innovation? So anyway, it’s that conversation. I think software companies do look different because they have thrived in an era of relative — the substrate they’ve sat on is pretty flat. And now we’re entering a space where performance matters a great deal, where the information systems are integrated again. Software is only one component of a whole integrated information system. And because of that, now it’s no longer, like — I can sell just one piece of software across 1000 companies and just harvest all of this margin.

Companies built on AI/ML workflows

Martin: So here’s my mental model on these things. Let’s imagine that you have two companies, Company A and Company B. Company A — they’re building a system, and all the properties of that system are gonna be defined as software. And so they’ve got a roadmap, and then they build the software over a period of time. That’s Company A. Company B — let’s say, actually, they’re gonna use just off-the-shelf, kind of, AI/ML workflows, and they’re not actually really writing software. It’s all about getting the models to be predictive. And so the entire company is around cleaning data, labeling data, training the models. Right? They’re very, very different, because the complexity of the second one is just far, far greater. And I would say the defensibility of the second one is far, far greater, just because of the nature of data. And so it feels to me there’s almost, like, an emergence of a new type of company.

Peter: Absolutely. Yeah.

Martin: Where the organization, the margins, the go to market — everything is being dictated by the fact that they’re processing data, rather than writing software primarily. I think we’re all still trying to understand what that second class of company looks like.

Peter: Yeah. One of my pitches is that by harnessing the power of open-source to commoditize, to do the disruption on a lot of classical data processing systems, we would basically be one of the last great software companies, and be one of the first great AI companies. The margin doesn’t come from how well you do the software bit. And so, I think that’s the big news. I mean, maybe I have a bit of a controversial view on this. But I think that the era of software being the dominant part of the stack — I know, you know, Marc Andreessen likes to say, “Software is eating the world.” It is eating the world. But it’s a ruminant at this point, right? It’s not the most efficient digester of the value.

And so, look, you benefit from chlorophyll, even though you’re not a plant — you just eat a lot of plants. <laughter> But I think — I mean, if we’re gonna, kind of, go to the complex systems thinking, right? In the era of data abundance, the people who can build models, refine models, and execute on them fastest are the ones that are going to win. They’re the chaos agents in the ecosystem. So, look, we still live in a world of plants. But there’s a beautiful infographic I saw the other day, which is how much biomass is on the earth. Most of it is plants. And then you’ve got, like, this little bit that is animals. And there’s, like, this little bit that is mammals, and this little bit that is humans. I think that in the world order to come, there’s still gonna be, of course, hardware and software companies, so on and so forth. But I think the margins — where you really wanna look for the growth — are gonna be with those people who are moving like animals, and not just claiming a spot: “I’m gonna go here, grow my leaves.” You can still catch some sunlight, but your optionality — I mean, you know, business is war — your optionality is reduced. And the companies that can move fastest among these different places, those are the animals, and they’re going to be running faster OODA loops.

Martin: I would love to talk about how this impacts the actual business. I’m not sure there’s a huge change in go-to-market, except for the fact that there’s two types of these kinds of AI/ML companies. There’s the infrastructure companies, which basically build the tools to use AI/ML. And that looks like a standard software infrastructure company — like, it’d be, like, a data company or something like that, depending on your vantage point. And then there’s those that use data science and AI/ML to tackle problems in the real world. And those are kind of interesting, because you end up not building a software company, but more of a farming company or an agricultural company. And so you’re not selling to core IT, right? So they just tend to look very different than typical software companies, because they’re selling to a different constituency.

Peter: They’re not software problems. The software is a means to an end, not the end unto itself.

Martin: And this is particularly germane to AI/ML, because it allows us to solve problems that software typically hasn’t been good at solving in the past. Like, it allows us to solve vision problems better than we’ve been able to before. Audio processing problems better than we’ve been doing before. It’s kind of like the best way to interoperate with the physical world. And so now we’re off, like, building these companies that solve these kinds of real-world problems. And you just have different-looking companies to do that, because, again, you’re selling to the person that inspects the HVAC system. You’re selling to the person that is the farmer. You’re selling to the person that manages the forest.

I think one thing at the very high level that, like, anybody creating a company in this space needs to think through is the following: if you’re building just the infrastructure, just the tooling and the nuts and bolts, you look like a software company, and somebody else deals with the actual AI/ML application. And that’s fine. But let’s say that you yourself are ingesting the data, cleaning the data, labeling the data — there’s a lot of variable cost to do that. Like, every customer may have a new data set. And what happens is this impacts the margins of your business — like, it looks like you have lower margins, because, for every customer, you’ve got all of this work to do. And so I think you need to make a decision early on about whether you want to be the one that’s doing that work, because that’s something you can actually offload to the customer.

So let’s say you go to a new customer and say, “Listen, we’re gonna take all of your data, we’re gonna clean your data, we’re gonna create your models, and we’re gonna solve your problems.” And in that case, you internalize all of that. And as far as your organization, you need to know that this is basically a services arm. Another option is you can say, “Customer, we’re gonna give you all these tools, but you’re gonna have to bring in your own data, you’re gonna have to hire people to label it, you’re gonna have to learn to tune your models. And we’ll help you with all of that, but you’re the one that’s gonna go ahead and sink that cost.” And so you have to think very deeply of how you structure your company relative to the variable headcount — like, the headcount that has to grow per customer, because that seems to be the big difference that we see for these AI/ML companies, and the typical software company.

Peter: Yeah. I think it’s hard to do one of these companies right now because we are in a transitional time. A lot of the customers don’t even know what they’re asking for, and they’re kind of looking for that help. And even now, people recognize it as a growth area, and where the future’s headed, so they wanna spend some money on it. But, absolutely right, the amount of work you have to do per customer starts looking a lot like a services play. And there’s a reason why, with a lot of companies, when you really look inside — the skeleton — like, why I think I called it the skeleton buried in the ARR. You see a lot…

Martin: <laughter> Totally.

Advice for companies in the AI/ML space

Peter: Eric von Hippel has a great book around democratizing innovation. And he says, “Even when we have a space in which a product is possible, products usually only cover 60% to 70% of the end-user need.” The end-user still has to do the rest. And he’s not talking about software. He’s talking about people, like, you know, welding things onto the side of their tractor. He’s talking about, in general, the customer has this thing they need to do. When it comes to the AI/ML application areas, it’s a lot more than just 30%, and it has to be customized per customer site.

So I think for businesses right now, in this transition, it’s super hard not to end up looking — if you’re doing a good job for your customers, it’s hard not to look like you’re doing a services play. Now, that being said, there are, I think, viable strategies through this. Which is that you can specialize in an area and domain and say, “Look, we’re gonna come in and work on your data set. But we have our own reference model we’ve built.”

Martin: That’s exactly right. That’s exactly right.

Peter: And now we can benchmark you against that. We can bring some of our own magic juice into this. So now there’s a thing that is generalizable, or product-izable, across customers — maybe it’s only for that sector — but the thing that’s generalizable is not just the software, and it’s actually more defensible than the software.

Martin: I just wanna very quickly put a fine point on this. There’s two things that you brought up that are very important to realize. The first one is, we are in a transition. So customers don’t even know what it means to, like, label data and clean data. Maybe in five years, you can go to a customer and say, “We’ve got all the tooling for you, but you’re responsible for managing the data,” and therefore you offload the cost. It’s just that today, you don’t have enough education in the market to do that. They don’t have data scientists, etc., etc. And so I think, in order to get the market through that transition, the startups have to do that. Like, you have to build out, basically, that services arm. The second point you made is, I think, the critical one — there actually is some commonality within verticals. And so you can reduce that margin pressure by sharing as much as possible, but it does require customers to share data, or at least share models. And that’s sometimes a tough conversation with the customers.

Peter: Well, it’s not just sharing models. I mean, there are deeper and interesting, more leveraged plays to be made. For instance, you go into a sector, and you realize, “Oh, all of these people are doing their own craptacular things. These are their limited budgets, and their data sets are broken this way — but holy crap, there’s this other vendor over here with this data set. I can go and negotiate an exclusivity with that vendor. And now I’m the only one that can bring that kind of model lift into this particular sector. So there’s a lot of that 1800s-style, like, homesteading to be done in this space. So I think it’s more than just the “Let me average <inaudible> Central Limit Theorem everybody in this industry.” There’s some really cool things to be done.

Martin: So the first thing companies need to figure out is what type of a company they are. Many are very confused about that. You need to know: are you a software company building tooling, or are you a company where the majority of the complexity is around data? And by the way, many companies start as software companies and end up as data companies, and then they’ve structured things incorrectly. So let’s say that you’ve come to the answer to that, and you’ve figured out you’re a data company. Once that happens, you need to understand that often, for companies that are extracting value from data, there’s a lot of complexity per customer in order to do that. And you need to structure your company the correct way — which is, like, just realize it may be hard to scale, just realize you’re gonna have different processes around the actual data. Or come up with a strategy to offload that to the customer.

Now, the reality is, because the market is so immature, it’s unlikely the customer is gonna be able to do a lot of that, but it’s something that you can, over time, train the market to do, and make that transition. But I think this is the big sticking point with many <inaudible>. They think they’re software companies. They end up being data companies. They didn’t build the organizations to deal with that complexity. It’s coming out in the margins. Everybody is kind of confused. And so I think just a little bit of self-awareness and a little bit of planning go a really long way in this space.

Peter: But it requires a very different — many West Coast firms have the thesis that to do a really great tech startup, you need at least a tech founder somewhere in there, because they kind of see where things are going. For a really good AI startup, you need to have machine learning people at that leadership level because they know what it means. They know why a single data set can be a billion dollars, or swing a billion-dollar deal. The difference between a software engineer and, like, a data scientist is that — software, you generally know what the inputs are, or the types of inputs, and your goal is to construct a system that, given these inputs, produces these sets of outputs. So you have very nice, clean definitions around correctness, for the most part.

With data science, there’s unfortunately not that. You can have a piece of code, and for some sets of values, it’s correct. For other sets of values, it still produces a result, but those results are wrong. And a function’s correctness is dependent on values. This is the key thing that differentiates all of data science and machine learning from classical software engineering. Classical software engineering, it’s like, we’ve got our test data set, we’ve got our prod data set. It works in test, it’s gonna work in prod, right? That’s not how data science and machine learning work at all. In data science and machine learning, the correctness of a function is value-dependent, and also performance-dependent — and the performance is also value-dependent.
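
A minimal, made-up example of that value-dependence: the same function is correct for one batch of values and silently wrong for another, with no type error or exception to flag it.

import numpy as np

def zscore(column):
    # Standardize a feature column -- correct whenever the column actually varies.
    return (column - column.mean()) / column.std()

test_batch = np.array([1.0, 2.0, 3.0, 4.0])      # healthy spread: sensible output
prod_batch = np.array([5.0, 5.0, 5.0, 5.0])      # constant column: std is zero

print(zscore(test_batch))                        # standardized values, as intended
print(zscore(prod_batch))                        # 0/0 -> NaNs, quietly wrong, no exception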

So now you have this intertwined synthesis of a data, and a modeling, and a computation problem that cannot be decomposed into orthogonal vectors, right? That’s the difficulty of this. What I think is that in 5, 10 years’ time, every company that is actually still in existence and doing well has to, essentially, have synthesized — brought a synthesis in — of their data capacity, their data modeling capacity, the model build, and computation — the hardest thing is appropriate computation — in an economical fashion to suit their needs.

So the word I like to use for this is cybernetics. I mean, we are right now in between the software era and the cybernetic era, and I think we will get to a cybernetic future. And cybernetic, by the way — you know, it comes from the same word as Kubernetes, right? It means governor. It means a theory of action and control. So businesses have to see computation really moving its way up. Data modeling process has to move all the way up to the very tippy-top of the business. That synthesis will happen, it will have to happen. And that’s what the selection pressure is in the business world. I don’t know exactly the path we’ll take to get there. In the transitional time, businesses who want to basically get in ahead of the curve, they’ve got to have very clear thinking at the leadership level. And they must have a very clear understanding with their investors about what they’re gonna look like as they chase the marlin, because it’s gonna take a little while.

So I think that’s the trick right now, is that you’ve got to find founding teams or leadership teams that have a solid understanding of software — of what software is and isn’t, of where the value is in the software activity. And of where the value is in the data and data modeling activities. In a time of fog, you’ve got to have very, very clear-headed thinking about that sort of thing. But ultimately that synthesis must be what comes.

Martin: Thank you.

Peter: Thank you so much.