Ben Sigelman - The Future Of Observability & Why It's Not Just Telemetry (Observability Series - Part 2)
Ben Sigelman: And then he was very blunt. And he was like, "Listen, I'm almost positive you're an engineer. This is a science department. If an engineer comes to a science department, you will be disappointed, you will eventually leave, and that's going to be bad for you and bad for me. So if you want to come here and be a scientist, by all means, join my lab. Otherwise, I think you should just cancel this whole endeavor."
Ben Newton: Welcome to the Masters of Data podcast, the podcast that brings the human to data. And I'm your host, Ben Newton. Welcome, everybody, to another episode of the Masters of Data podcast. We've brought somebody on that I've known for a while now with a new series that we're going to be doing around thinking about observability in that area. One of the first people that came to my mind was Ben Sigelman. He's the co-founder and CEO of LightStep. Welcome, Ben. It's good to have you on.
Ben Sigelman: It's a real pleasure to be here. Thanks for inviting me.
Ben Newton: Absolutely, absolutely. And you and I were talking earlier, like we do on every podcast, I really like to kind of humanize the guests and also bring that element of their background into it. And I think in particular with you, I've heard a little bit, and I've talked a little bit with you before about your background, but I'd really love to hear, number one, how you got into computer science and into this whole area, but in particular, how you got to founding LightStep and why you did that and what problem you were solving. So tell me a little bit, what's your story? How'd you arrive at where you're at?
Ben Sigelman: Yeah, that's a good question. I mean, like most of these things, there's a lot of accidents and serendipity and everything in terms of how anyone gets to where they are. In school, in college, I mean, I was convinced I would not do computer science. I had never taken any classes in that in high school. I was one of those kids who was just curious about computers, and so I taught myself a little bit of programming and everything, although in retrospect what I was doing was horrific, but I didn't have the good sense to realize that at the time. But I did just kind of on a whim take one computer science class my first semester of school. And I just absolutely, totally loved it, loved it. And so the next semester, I'm like okay, I'll take one more. Maybe that was just beginner's luck. The professor in that one was kind of revered, and I thought maybe I just liked the professor. The second class, I actually did not like the professor. He was frankly terrible, and I didn't even go to class, but I loved the work. I just spent all my time on the assignments. And apparently I wanted to be a computer science major. So I went through with that intention. I started school in fall of 1999. And in that year, the people around me who were older in the computer lab were all getting 10 job offers a day or something. [crosstalk 00:02:51]. And then somewhere along the line during my education, the bottom fell out in 2001, 2002, and it was a very different environment, actually. And I thought I was going to go into academia. There's a long story, which I won't tell, but the short version is I ended up, very fortunately and with a lot of luck involved, interviewing at Google in my senior year and got an offer there. And I guess there are two stories I tell about how I got to where I am now. One was that Google is an amazing place, and I admired the people I was working with very much on a personal and technical level, but I had no idea about business whatsoever at that point. But in college, all of my summers I actually spent doing music stuff. And although I had had jobs with W-2s and stuff, they were things like serving ice cream. I'd never worked in an office in my entire life. And I remember the first day I was there, I was talking to my tech lead, who actually is still at Google. He's a distinguished engineer there now. I literally asked him if I could go to the bathroom. I had no idea what it meant to actually work in an office. So it was a really funny job for me in many ways. And I was actually pretty unhappy about my work. They had an event they did just once, actually, where they put you into basically an 11-dimensional vector space, based on how long you've been out of school, what languages you've programmed in, where you physically sat in the office complex, your reporting structure, that sort of stuff. And they matched you up with the person who was literally the furthest from you in this 11-dimensional vector space. And they just set up a half-hour meeting with no agenda. And the person who I was matched up with, she was, I guess, at the time probably in her forties. She was a very distinguished research scientist, and she was incredibly smart. She asked me what I was working on. I was like, "It's not that interesting. Let's not talk about that. What are you working on?" And she listed off about five projects that she was kind of dabbling in at the time. There was a blob storage thing. There was a global identity service for Google, a bunch of stuff.
And then there was this one thing, a distributed tracing service, a prototype that she was working on for a few other people, but she didn't really have time to finish it, and it was going to be hard to deploy. And so that was that. And I was just fascinated by that one, absolutely fascinated. And so I was just sort of like, all right, that sounds definitely more interesting and probably a lot more useful than what I'm doing right now. And so I just started working on this Dapper thing, which was this distributed tracing prototype. And it was very skeletal, but it showed promise. And I just loved it. And it turned out to actually be pretty useful. And I was young and dumb enough to just be willing to tolerate a long stretch, probably a year plus, of just toil to get this thing into production, which involved a bunch of things that have nothing to do with tracing. It was a lot of, what do you need to do at Google to get something with root access on every single production machine at all of Google? And it turns out you have to go through a lot of toil for that, which makes sense. That's actually really risky. But I spent most of my time doing that kind of stuff, not distributed tracing. Once we got it into production, though, it was actually incredibly useful, and I was able to build a team around that. And that's when I became interested in this overall space. We didn't call it observability then. We didn't call it microservices then. But it was totally the same stuff. That's when I got interested in the subject. Then the other story I was going to tell about how I ended up where I am: this was in 2007, when I was looking at leaving Google and doing what I'd intended to do in college, which was to become an academic. So I was looking into PhD programs, and I was looking at computational neuroscience programs. And I applied to a few of them, and I was admitted to several of them. And at that point, they fly you out, and they try to put you through a dog and pony show. And the spirit of it is to try and convince you to go to these programs since they've admitted you and all that. So I was used to being kind of courted and coddled at these things. I got into three schools. I went to the first two. It was kind of like that. And I went to the third one, and it was also like that until the very last conversation I had the entire visit, which was with the guy who was going to be my advisor. He sat me down for half an hour and gave me the most influential career discussion I've ever had in my life. And I've repeated it to other people because I think it was such great advice. He basically said, "There are three types of people like you and me: mathematicians, scientists, and engineers." And he said, "Mathematicians are interested in understanding things that are true or false. They can do their work in isolation. It's really difficult. It's the most intellectually challenging work of all. And frankly, four or five people in the world probably actually understand what they're doing. It's that advanced. You, Ben, are not smart enough to be a mathematician. Neither am I. God bless them for their tools." And I was totally, totally in agreement about that. I love math, but I took it pass-fail by the end of college. Then he said, "There are scientists. Scientists are primarily interested in furthering knowledge, and they like answering challenging questions because they're interesting. And they're asking interesting questions, debating them, going to conferences, discussing their ideas.
But it's about the advancement of knowledge and asking questions that are interesting because they're interested. And then there are engineers. And engineers like building things that are useful and that solve a problem that's important. And if it doesn't work, they want to understand why it broke, how to make it more resilient or bigger or faster, and that's engineering." And then he was very blunt, and he was like, "Listen, I'm almost positive you're an engineer. This is a science department. If an engineer comes to a science department, you will be disappointed, you will eventually leave, and that's going to be bad for you and bad for me. So if you want to come here and be a scientist, by all means, join my lab. Otherwise, I think you should just cancel this whole endeavor." And I did. I just walked out of there. I'm like, I'm done. I'm done with this entire thing. And since then, I really haven't looked back. I'm just interested in building things that are useful, and that's it, period. And that is definitely what motivates me professionally. And that gets to starting LightStep. I'd actually just come off the heels of trying to build a consumer product that was kind of like social media for introverts. That's not how I pitched it, but that's basically what it was. And I started a company around that. It was a total, unmitigated failure as a product. I mean, it didn't do any harm, and the people it attracted did like the product, but they were all depressed introverts, and they wouldn't talk about it. So it turns out that doesn't work. So I was feeling a lot of pain around building something that wasn't that useful, and I just wanted to do something where I felt like I knew I could build something that was actually valuable. And so I was thinking through my career and things I had done, and I just felt, looking at the market, that there was a lot of pain people were about to experience based on the architecture that they were pursuing, which was basically microservices. And the best way I knew to address that was to build something that allowed them to gain more confidence and understanding of their own system. And I mean, I guess that's what observability actually is. And that led to the founding of LightStep. But our vision for the company since we started five years ago really hasn't changed at all in terms of the core mission, which is what I just described to you. And it really comes out of a personal desire of mine to do something that's useful and impactful. And this is the best way I know how to do that.
Ben Newton: Yeah, I know. This is all really interesting. It's funny. I think I realized, even more than when you and I were talking before, that you and I have very similar experiences, because I just wish that someone had sat me down and had that talk when I went into grad school. I literally had that same progression: oh, I don't want to be a theoretical physicist, I'll be an experimental physicist. Oh wait, I really like to build useful stuff. No one explained that to me. Then I went into computer science, but that was after three years of wasted time. So I really like that way of putting it together. There's no blame. It's just like, look, people are built different ways. I think that's a really cool way of talking about it. One thing I'd be really interested in, stepping back a little bit, I mean, why did you need to do it at Google? And why is this a problem that people need to solve? What was the glaring issue that meant that you had to build something and people were willing to invest that amount of time to build it at Google? What were you trying to do?
Ben Sigelman: I mean, Google's a funny place. And I don't want to imply that what was going on at Google is necessarily the same as what's going on at other companies. I think a lot of their problems were unique to the way that they had built their system and their culture, and the time it was in the industry in general. A lot of that stuff was before GitHub was even incorporated. There was just nothing to use. So you had to build everything yourself. And the problem that they were having, which I think is also endemic in the industry right now, was that a very small percentage of time from an engineering standpoint was being spent on the production of new functionality. I mean, that was basically the problem. And then the other problem is that when things were going awry, the only reason that they ever were able to fix it was because they had a couple... Maybe not a couple, but let's say less than 5%, maybe less than 2% of the population there from an engineering standpoint knew where all the bodies were buried and knew how to read the tea leaves, where they could run some arcane tool that had arcane output and see something and say, oh, that means that this other thing on the other side of the system is having a problem. And that is death for an engineering culture. I mean, the second those people decide to leave the company, you're really in trouble. I mean, I guess if they'd been able to train everyone to know everything that those people had known, the diagnostic things could've been partially addressed, but it's just literally not feasible. I don't know how to do that. I mean, that's the hard problem that we've never been able to solve from a managerial standpoint. And then the velocity stuff was just really problematic. And I think that they invested in a lot of things to solve that. I don't want to claim that observability is literally the only way to do that. It's in the necessary, not sufficient category. They employ more people at Google to work on their source control system than the vendor that they bought it from. I mean, there were more people working on Perforce at Google than Perforce had employees at one point. They had hundreds and hundreds of people that maintain this very elaborate, highly-optimized build and CI/CD kind of infrastructure, and so they were investing a lot in that. I mean, nowadays I think you can see whole companies that have popped up that were teams at Google to do various things in the life cycle. But Dapper was initially developed as a point solution, mainly for latency issues, but it ended up being quite useful once we fleshed it out for a number of other things. Some of them are obviously in observability, like root cause analysis, MTTR, that kind of stuff. But the core technology also ultimately moved into the storage arm of Google, because they found that they could use the context propagation in Dapper to help understand where the workloads were coming from in their large multitenant storage systems. So if they wanted to have one regional instance of, let's say, Bigtable, which was their key value store, they could have one multitenant system for the 1,000-plus Google SKUs that were out there. And they would all be rate limited to very specific amounts of write and read traffic. So not just the storage, but the actual IO that was specific to that storage system. I mean, it basically meant they didn't have to over-provision at all.
And that actually has nothing to do with latency, or only tangentially with the latency analysis where we started the Dapper project. But in my mind, if we want to sort of move the clock forward, I think that a lot of the advantage of building things like distributed tracing into the core of your software is that you can take advantage of global context in many other ways. So it's not just for observability. And certainly my vision for LightStep long term is to try and do things like that too, but having good hygiene about observability, and telemetry especially, allows you to open doors down the road in resource provisioning, security, et cetera. So I think for Google, what we started with was latency, but we ended up with something much broader than that with the distributed tracing technology. Getting to your point about observability, though, it actually drives me a little crazy that people think about observability as distributed tracing or logs or metrics or some combination of those. Those are telemetry only. That's all they are and nothing more. And for observability, building or buying distributed tracing doesn't really solve anything on its own. The thing that you should be focused on is almost certainly one of the following: improving steady-state performance, so just making things more reliable over time in a greenfield kind of way, month over month; getting mean time to resolution down, so basically incident response; and/or shipping software faster, just improving velocity. Those are the three things that you can get out of observability. And if your thought process doesn't start with those objectives, no matter what you end up building or buying, it's unlikely that you're going to end up with a good result. It needs to be oriented towards those outcomes as a business, or you're just going to end up with a bunch of technology and a bunch of telemetry. And certainly siloing tracing data from other forms of telemetry is a losing strategy. And even if you buy from one vendor, if those are in separate tabs or something like that, you're also siloing your product experience and your benefits. So I think it all has to come back to those use cases. And from a product strategy standpoint, I think that's the right way to build observability and the right way to buy it as well.
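To make the context-propagation idea above a bit more concrete, here is a minimal sketch in Python using the OpenTelemetry baggage API. It is an illustration only, not a description of Dapper's actual internals: the "tenant" key, the service name, and the token-bucket limiter are hypothetical, but the shape is the same, a caller tags its requests with an origin label and a shared storage layer reads that propagated label to account for and rate-limit each tenant's traffic separately.

# Hypothetical sketch: tracing-style context propagation reused for per-tenant
# resource accounting (not Google's actual Dapper mechanism).
import time
from collections import defaultdict

from opentelemetry import baggage, context


class PerTenantRateLimiter:
    # A tiny token-bucket limiter keyed by tenant name, for illustration only.
    def __init__(self, tokens_per_second: float, burst: float):
        self.rate = tokens_per_second
        self.burst = burst
        self.tokens = defaultdict(lambda: burst)
        self.last_refill = defaultdict(time.monotonic)

    def allow(self, tenant: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill[tenant]
        self.last_refill[tenant] = now
        self.tokens[tenant] = min(self.burst, self.tokens[tenant] + elapsed * self.rate)
        if self.tokens[tenant] >= 1.0:
            self.tokens[tenant] -= 1.0
            return True
        return False


limiter = PerTenantRateLimiter(tokens_per_second=100.0, burst=200.0)


def storage_write(key: bytes, value: bytes) -> bool:
    # The shared storage layer reads the propagated "tenant" label from the
    # current context; it never needs a hard-coded list of callers.
    tenant = str(baggage.get_baggage("tenant") or "unknown")
    if not limiter.allow(tenant):
        return False  # shed load for this tenant only
    # ... perform the actual write here ...
    return True


def handle_request():
    # The calling service attaches its identity once; instrumented RPC or HTTP
    # clients would forward it across processes via the W3C baggage header.
    token = context.attach(baggage.set_baggage("tenant", "ads-frontend"))
    try:
        storage_write(b"row:123", b"payload")
    finally:
        context.detach(token)

In a real deployment, instrumented clients carry the baggage entry across process boundaries automatically, which is what makes this kind of accounting work in a deep, multi-service system.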
Ben Newton: No, no. I think it makes a lot of sense, because I'd agree with you. There is a tendency in our industry in general, and I don't think we're unique in this way, to just throw new words out and be like, well, if I use this new word, then suddenly I'm doing things differently, while not really getting at the core of it. And I know when I first heard the term, coming from a science background, I'm like, that doesn't sound right, because when we would talk about that in physics, we meant a very specific thing. But I think the way you describe it makes a lot of sense because it also seems... There are a couple of other ways I've heard of describing it, like it's a way of doing things that enables you to achieve these goals. It's not just about getting a bunch of data and throwing it against a wall and hoping something sticks. It's about, hey, what are you actually trying to achieve? What's important? And that's something I've always admired, at least what I've understood about the Google culture. And they're not the only ones who are doing this, but this kind of focus on reliability and making that a top line goal, it's like, look, this is not just about data for data's sake. We're trying to provide a highly reliable service that is going to make the company successful and make our customers successful. And that to me is the core of what observability is. And it sounds like you're kind of in that same camp there. It's not just about the data. It's about achieving your goals. I mean, does that make sense to you?
Ben Sigelman: That definitely makes sense in terms of the goals. I would say from a Google standpoint... I hope I'm not... Yeah, I've been gone for long enough that I can say this kind of stuff, but I think the way they structured things, it did have a lot of benefits for reliability, but it was actually really problematic in that their SRE organization was completely parallel to the engineering and product organization, which is to say, from a reporting standpoint it was totally parallel. And inasmuch as Google famously uses OKRs... I still have a lot of PTSD about that. It's funny, actually, at LightStep, internally of course we have to manage our progress as well. And we've invented this thing that is basically exactly the same thing as OKRs at this point, but I just can't bear to call it that because I have so much PTSD around it. Anyway, the SRE organization had OKRs, as you would imagine, around reliability, which is totally appropriate, right?
Ben Newton: Right.
Ben Sigelman: It's like that makes sense. The development and product engineering organizations had OKRs around what you would expect: product uptake for ads, revenue, that sort of thing. And these things are fundamentally in tension, which is fine, and nothing wrong with that. But it gets difficult... The most reliable thing you can do is never change your software, right?
Ben Newton: Right, right.
Ben Sigelman: That's by far the most reliable software. It doesn't change. And of course that will eventually lead to the demise of your product philosophy. So there is a natural tension there. And because of the way the organizations were structured, it was very difficult to trade those things off without it turning into this kind of massive political battle, because you had, at some level... I mean, of course it didn't really turn out this way, but theoretically, in order to resolve that tension, if you were to respect the org chart, you would have to basically go to the CEO of Google or something, which is ridiculous. So that didn't work so well in some ways. And I think Google did have a really strong reliability culture, but almost at the expense of movement at times. And I'm a strong believer in SRE and certainly a strong believer in measuring these outcomes. I think that is best done if both the product outcomes and the reliability outcomes are shared by the group of people who are building the software and the people operating the software. In some cases, it's the same group. Sometimes it's not. But those should be contracts that everyone agrees to. And I think in the best case, which did happen sometimes, and I saw this myself, which was nice at Google, you had SRE teams and dev teams that actually really did collaborate well. But when things started to go sideways and it would get oppositional, usually reliability won. And again, I think that was made substantially worse by the fact that, generally speaking, the people operating the software and building the software had very low confidence in how it actually worked, which again goes back, I think, to observability. Google did invent a number of technologies, some of which have been cited as foundational for observability, but our observability was really pretty poor in my mind. I mean, Dapper is by no means the design basis for anything we're doing at LightStep. And that's intentional. I mean, I have huge regrets about the way that we built that system and the way we built the technology and the architecture of it. And so most of the time, I think developers of complex stuff at Google really were feeling quite uncertain about how their systems were behaving, and that only fed the fire of this almost intractable tension between velocity and reliability, which as I understand it has actually gotten worse, not better, in the last couple of years at Google. Although I've been gone since 2012, I think they had a summer of outages at one point that led to a number of even larger bureaucratic hurdles you had to get over just to push changes into the world. And that's going to hamper their ability to execute. So these things can turn into, I think, company-level risks if they're left unattended.
Ben Newton: It's really interesting to hear you describe it, because just being in the industry, I've heard a lot of these things. But what really strikes me about that is, number one, how much people are intertwined with technology choices and how you actually do things. So you can never pull people out of it. You don't really want to, because that's part of what makes it innovative. And how similar it is to the issues that companies were having in the early 2000s that were not as advanced as Google. That whole thing that "DevOps" was supposed to solve was bringing these two organizations together. And maybe there's this natural human tendency: when you have those kinds of conflicting goals, how you resolve them is really crucial. Because there is a tendency nowadays to kind of look back on the stuff that came out of the early 2000s, the dot-com era, and the way we built applications back then when I started out, and kind of poo-poo it, but the reality was a lot of this stuff developed naturally, and there was a tendency to be okay with it. And I remember we would say the same thing, because I ended up, strangely enough, being a computer programmer on the operations side, and we would say the most stable time for our application on these projects I worked on was over Christmas, because all the developers went home. It was that natural tension, but it's something that kind of naturally happens. And I guess one thing to ask you too, Ben, as part of this whole narrative: you got a chance to be part of something really interesting at Google. You took that knowledge and those learnings, and rather than just repeating some of what you might consider the same mistakes, you went on this journey with LightStep. Where do you kind of see the state of things now? Because you and I have known each other for a few years now, and I've seen you guys develop and I've seen the industry develop, and it's changed a lot over the last few years. And it feels like some of this stuff is really starting to take hold. But from your perspective, particularly just being on the inside, where do you see things now? What's the state of the world, particularly where you're at, in observability and distributed tracing and this whole world?
Ben Sigelman: Yeah, good question. The number of significant, well-known, established enterprises that have actually started building what I've been calling deep systems in production is just totally different than a couple of years ago. And by deep systems... I actually don't love the word microservices, because first of all, it just describes a single service. The thing that's interesting is the system, and it's the number of layers in the system between the top of the stack and the bottom of the stack. And once that gets to be more than four or five or so, it's really difficult to understand exactly how the lower layers of the stack and the upper layers of the stack are interrelated. And multitenancy becomes the norm, either for your storage systems or even for the communication bus, like Kafka. And that means there are a lot of interference effects between different parts of your system. And getting to the bottom of that is just really difficult. And it's also incredibly widespread. It's a problem right now. I mean, I see this everywhere I go. So the pain is really cropping up across the board, I think, for anyone who's building a system like this. And of course, they're doing it to improve velocity, and I think that part's actually working, but it's not without some peril in terms of just being able to understand how the system's behaving for any of the cases we were talking about earlier: releases, latency, MTTR, et cetera. So I've seen that really come to the fore. A couple of years ago, we were restricted to talking to the kind of internet darlings, like the Twilios and GitHubs of the world, that were pushing the envelope on this. But now it's much, much more widespread. So I think from a market standpoint, that's changed. From a solution standpoint, I think that a lot of vendors have realized that there's a need to consolidate to a certain extent, and that from a product standpoint, it's necessary to provide a single tool that can handle some of these use cases end to end. I think the way that's being done in many cases is a little cynical, if I'm being honest. I definitely see companies that, either through acquisition or through just rapid development, are creating tabbed experiences in their product where you have several fundamentally different approaches to observability that you can pay for altogether. But they're really pretty distinct from a workflow standpoint. I think that's a non-solution. I mean, I think from a Darwinian standpoint, it will be selected out anyway, so I'm not worried about it, but it creates a lot of noise, and there is a lot of noise in the marketplace. And then I also see a lot of confusion about how to approach observability. And for that, and this doesn't benefit LightStep in particular, to be clear, I would certainly urge people to clearly separate the collection of data, and the data itself, from the solution. The data is telemetry data, and I think the OpenTelemetry project, which I helped to create... So I am biased there... but we actually found out yesterday that it is the second most active project of the 50-something in the CNCF, second only to Kubernetes. It's an incredibly, incredibly vital project. Actually, Sumo has contributed a lot to it lately. It's a great project. There are so many vendors involved that you can tell it's not going to send you in one direction or another.
It's a real honest-to-goodness partnership between a lot of folks that compete but share a common interest in getting this stuff to be more readily available. So the OpenTelemetry project, I think, is a really safe bet on the future of data collection, basically, and making that turnkey and easy. And it also comes with a lot of automatic instrumentation these days, so you don't need to go and change your code, at least not manually, which is really advantageous. So I think separating the telemetry gathering from the rest of it is the first thing to do, because people are confusing those. And it's a bad idea, I think, to rely on a vendor because you want telemetry. That should be an open source effort, and I would definitely push people towards OpenTelemetry for that. And then on the vendor side, again, I would just try to come back to: spend the time to figure out what problem you're trying to solve. I think we're starting to hear people talk about that, but I have to say there are a lot of conversations we have with folks where we reframe the conversation in terms of, say, software deployment and show how observability can apply to that particular problem. And it's like a light goes on in their head. It's like, oh, that's how this could look if observability actually knew when my deployments were and actually had features that were designed to explain those. In order to make this easier for people, we've built this thing called LightStep Sandbox, which you don't need to pay for. I think you can just go straight into it, and you don't need to talk to anyone. There are scenarios where we say there's been a bad deployment; figure out what happened. And then you kind of walk through it yourself, but you can also go off the trail and just use the product if you want. It helps you understand how to approach workflows with observability. And I think that's been really useful for people to educate themselves about how observability can apply to these specific problems. So I think that things like that are really useful, to kind of stop just reading blog posts or listening to people like me blabbing on about it, and to actually kind of feel it. I think that's a really useful way to understand how this stuff can fit together. So I think we've seen some lights go on there, but I feel like it's going to be another couple of years probably before people start by talking about deployments and MTTR rather than starting with distributed tracing or how to get metrics and logs to work together or something like that. Not that that's not part of the solution, but it's just not where it should start in my mind.
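For readers who want to see what that turnkey collection looks like in practice, here is a minimal, manually instrumented OpenTelemetry example in Python. The service and span names are made up, and a console exporter stands in for whatever backend you would actually send data to; the automatic instrumentation Sigelman mentions can attach spans to common frameworks without hand-written code like this.

# Minimal OpenTelemetry tracing setup in Python (illustrative names throughout).
# ConsoleSpanExporter keeps the example self-contained; a real deployment would
# export to a collector or an observability backend instead.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name


def charge_card(amount_cents: int) -> None:
    # Each unit of work becomes a span; attributes carry the details you need
    # when correlating, say, a bad deployment with a latency regression.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("payment.amount_cents", amount_cents)
        # ... call the payment provider here ...


charge_card(1299)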
Ben Newton: Yeah, yeah. No, it makes sense. And I think there's a natural tendency with any new technological shift: it's a lot easier to talk about individual technologies or simple concepts than about the really hard stuff, which is actually putting it to use and tying it to how you're actually going to change things. And one question that comes to mind, and maybe it's a good way to kind of put a bow on all this: like you said, it's changing rapidly. A lot has changed over the last couple of years, and definitely with the OpenTelemetry project kind of driving some consolidation, where do you think we're going to be in this area, like, five years from now?
Ben Sigelman: I do think that open source will dominate the actual instrumentation and telemetry layer. I'm very confident about that. There's way too much momentum behind that. Even the people who might publicly say they don't want that are privately trying to make it happen, because they're spending a lot of engineering hours maintaining proprietary agents, and it's just not that efficient for them. So I'm very confident about that. That's a when, not if, kind of thing. In terms of the solution space, my very sincere hope is that we see products that are focused on the workflows and not on the telemetry verticals, which I've said a number of times in this conversation. But in addition to that, I would like to see the pricing be more value-based than it has been. I think a lot of vendors are expressing their pricing, not just primarily but often exclusively, in terms of the scale of the actual telemetry data, and the telemetry data can expand in such unpredictable ways. My opinion is that the customer should basically have control over some kind of lever on how much telemetry data they actually want to store, and the observability system needs to degrade gracefully within that. So actually, much like the internet itself during all this COVID nonsense, which has performed really beautifully, I think. It's not that it hasn't degraded. It has degraded. Netflix has started to degrade to 720p instead of whatever it was before. That's exactly what it should be doing. It degrades gracefully within the constraint. That's what I'd like to see observability do too. So the customer should be able to say, "I want to spend X dollars a month on telemetry." Observability, whether it involves pre-aggregation or sampling or both, needs to fit within that data budget, and no one should feel like they're paying a margin on top of that. And then what you should be paying for are the benefits of observability. And what I see right now is people are buying tools that start small, because it is based on data volume. The data volume balloons, they've developed a dependency on a particular product or tool, and they end up in a negative-ROI place. I've seen that with a number of the largest vendors in this space. You look at their business, it looks great. You talk to their customers, they're irate. I mean, they're to the point of saying, "I am offended by this." And that's not a good place for vendors to be in. I think the pricing units, not just the prices themselves, and the way it's all modeled, are actually pretty bad for customers. It's not just about pricing. Pricing is an important part of how products are developed, and I think it comes down to the architecture and the COGS that these vendors need to account for. But I see that needing to change in order to make the observability value proposition a clearly positive one for the many enterprises that need this stuff. So that's another area I see. And the last thing I'll say is that, just like I was saying how Dapper's largest economic value was actually in storage and resource accounting, I see applications being built on top of the kernel of observability way outside of the realm that we're talking about now, certainly in security, which is already starting to happen, but also in resource provisioning and hopefully all the way back into the software development life cycle, changing the way that the software actually works, not just the way that we understand it.
So I think closing the loop, and having feedback loops where the software is at some level able to manage itself based on the telemetry and some real-time analysis, that's the future of how this stuff will actually impact the industry, but that's probably 5-10 years out.
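One way to read the "data budget" idea concretely: the customer fixes a spend, and the telemetry pipeline degrades gracefully to stay inside it. The sketch below is a hypothetical adaptive head-sampler, not LightStep's or any other vendor's actual mechanism; it simply lowers its sampling probability as the observed span volume climbs past a configured target.

# Hypothetical adaptive sampler illustrating "degrade gracefully within a data
# budget": it keeps span volume near a target rate by lowering its sampling
# probability whenever the observed rate climbs past that target.
import random
import time


class BudgetedSampler:
    def __init__(self, target_spans_per_sec: float,
                 window_size: int = 1000, min_probability: float = 0.01):
        self.target = target_spans_per_sec
        self.window_size = window_size
        self.min_p = min_probability
        self.probability = 1.0
        self.window_start = time.monotonic()
        self.seen_in_window = 0

    def should_sample(self) -> bool:
        self.seen_in_window += 1
        if self.seen_in_window >= self.window_size:
            elapsed = max(time.monotonic() - self.window_start, 1e-9)
            observed_rate = self.seen_in_window / elapsed
            # Keep (probability * observed_rate) at or under the target rate.
            desired = self.target / observed_rate
            self.probability = max(self.min_p, min(1.0, desired))
            self.window_start = time.monotonic()
            self.seen_in_window = 0
        return random.random() < self.probability


# Example: a sudden burst of 10,000 spans; the sampler backs off rather than
# letting telemetry volume (and cost) balloon past the configured budget.
sampler = BudgetedSampler(target_spans_per_sec=500.0)
kept = sum(sampler.should_sample() for _ in range(10_000))
print(f"kept {kept} of 10000 spans, sampling probability now {sampler.probability:.2f}")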
Ben Newton: Yeah. Oh, I think it's fascinating. I think it's one of the fun parts about being part of this right now, that everything's changing so fast, and there are a lot of things kind of coming together. And I think you're spot on with what you're saying. Well, Ben, this has been a lot of fun. I think your background is really interesting, and I think what you're doing now is really pushing the conversation forward. And I appreciate you coming on. This was a lot of fun. And thanks, everybody, as always, for listening. And as always, we'd love for you to rate and review us on iTunes so other people can find us. Look for the next episode in your feed. Thanks, everybody.
Speaker 3: Masters of Data is brought to you by Sumo Logic. Sumo Logic is a cloud-native machine data analytics platform delivering real-time continuous intelligence as a service to build, run, and secure modern applications. Sumo Logic empowers the people who power modern business. For more information, go to sumologic.com. For more on Masters of Data, go to mastersofdata.com and subscribe. And spread the word by rating us on iTunes or your favorite podcast app.
DESCRIPTION
Welcome back to the Masters of Data podcast! This week’s episode is the second installment of a special three-part series on observability in data. Tune in each week to hear about how the world of observability is transforming into a major player in the data realm.
In today’s episode, we talk to a very special guest who is using observability to build something useful and impactful. The guest of this episode is Ben Sigelman, CEO and co-founder of Lightstep. Lightstep was born of Sigelman’s personal desire to build something useful and impactful. He saw an opportunity to accelerate the industry’s transformation while improving the developer and end-user experience, and he took it. Using observability, he built Lightstep to help people gain more confidence in and understanding of their own systems.
As an ex-Googler and co-creator of Dapper, Sigelman witnessed the birth of microservices at Google. He learned a great deal from his experiences, and Lightstep is in many ways a reaction to and a generational improvement beyond those approaches. Sigelman’s fascination lies in deep systems and how they break, but he is also passionate about separating the telemetry from the rest of observability. There is a lot of noise in the marketplace and confusion about how to approach observability, but Sigelman is confident that in the next 5-10 years, applications could change the way the software actually works, not just the way we understand it.
Ben Sigelman and Ben Newton sit down to discuss the state of observability and distributed tracing right now and how rapidly this area is changing. Sigelman starts the conversation by giving some context to his background and talking about how he got into the software industry. After graduating from college, Sigelman had two “aha” moments that led him to engineering, and eventually to co-creating Dapper and co-founding Lightstep. While working at Google, Ben was matched with a distinguished research scientist during an event that paired employees based on an 11-dimensional vector space. In the meeting, Sigelman asked the research scientist all about her work and became fascinated by a distributed tracing service prototype that she was working on for a few other people. She told Sigelman that she didn't have time to finish it, and that it was going to be hard to deploy. At the time, the project sounded more interesting than his other work, so Sigelman picked it up and started working on what we now know as Dapper. It was very skeletal and took over a year of toil to get into production, but it turned out to be pretty useful. Once it was in production and he was building a team around it, Sigelman became really interested in the overall space. At the time, it wasn’t called microservices or observability, but that’s what it was.
Sigelman also told the story of why he decided to become a software engineer instead of getting his Ph.D. in computational neuroscience. During a meeting with his future Ph.D. advisor, he had the most influential career discussion of his life. The advisor told him there are three types of people like him: mathematicians, scientists, and engineers. Mathematicians are interested in understanding things that are true or false. Scientists are interested in furthering knowledge and enjoy answering challenging questions. Engineers are interested in building things that are useful and that solve a problem that’s important. The advisor ended the conversation by saying, “I'm almost positive you're an engineer. This is a science department. If an engineer comes to a science department, you will be disappointed, you will eventually leave, and that's going to be bad for you and bad for me. So if you want to come here and be a scientist, by all means, join my lab. Otherwise, I think you should just cancel this whole endeavor.” At that moment, Sigelman walked away from that idea and never looked back. He realized that he’s just interested in building things that are useful, and that is what motivates him professionally.
After talking about his past, Sigelman then delves into why and how he started LightStep. At the time, he was coming off the heels of trying to build a consumer product that was basically social media for introverts. It was a total unmitigated failure, and he was feeling a lot of pain around building something that wasn't that useful. Sigelman simply wanted to build something that was actually valuable, and he saw an opportunity in the market. There was a lot of pain people were about to experience based on the architecture they were pursuing, microservices, and he addressed this by building something that allowed them to gain more confidence in and understanding of their own systems: observability. And so, Lightstep was born.
But aside from Sigelman’s pathway into Lightstep, Sigelman and Newton also discuss the glaring issue that needed to be solved and what Sigelman was trying to do in his work at Google. Sigelman talks about how nowadays there are whole companies that have popped up that were once just teams at Google. According to Sigelman, Dapper was initially developed as a point solution mainly for latency issues but ended up being really useful once it was fleshed out for a number of other things. The core technology also ultimately moved into the storage arm of Google, because they found that they could use the context propagation to help understand where the workloads were coming from in their large multitenant storage systems. So Google started with latency but ended up with something much broader than that with the distributed tracing technology. Sigelman believes that to move the clock forward, distributed tracing must be built into the core of your software. That way, developers can take advantage of global context in many other ways besides just observability. Sigelman discusses why he thinks it’s important to have good hygiene about observability and telemetry, so that doors can be opened down the road.
Sigelman also touches on the differences between telemetry and observability, and why it drives him crazy that people think about observability as distributed tracing, logs, or metrics. All of those elements are telemetry only, and nothing more. Right now, there is a lot of noise in the marketplace and a great deal of confusion about how to approach observability. Sigelman talks about how to separate the telemetry gathering from the rest of observability, and he explains why building or buying distributed tracing doesn't really solve anything on its own. The focus of observability should be one of these three things: improving steady-state performance to make things more reliable over time, getting mean time to resolution down (incident response), or shipping software faster and improving velocity. It needs to be oriented towards those outcomes as a business, or there won’t be a good result.
Sigelman also talks about how the mistakes and regrets he has with Dapper helped him in creating Lightstep. In his opinion, Google had a really strong reliability culture, but sometimes at the expense of movement. Sigelman is a strong believer in SRE and in measuring these outcomes. It’s best if both the product outcomes and reliability outcomes are shared by the group of people building the software and the people operating the software. At Google, the SRE teams and dev teams often collaborated really well together, but when things started to go sideways and it would get oppositional, reliability usually won. This is where Ben starts talking about observability, because everything got worse when the people operating the software and building the software had very low confidence in how it actually worked. Despite Google inventing technologies that laid the foundation for observability, Sigelman talks about why Google’s own observability was pretty poor. He says that Dapper is not the design basis for anything they do at LightStep, and that's intentional. Sigelman has huge regrets about the way the system, the technology, and the architecture of Dapper were built. According to Sigelman, the developers’ uncertainty about their own systems “fed the fire of this almost intractable tension between velocity and reliability.”
Something often left out of the conversation surrounding observability is the issue of pricing it. Sigelman spends some time discussing why he thinks the pricing should be more value-based. He thinks many vendors often price exclusively in terms of the scale of the actual telemetry data, which isn’t great because the telemetry data can expand in such unpredictable ways. In his opinion, the customer should basically have control over how much telemetry data they actually want to store. No one should feel like they're paying a margin on top of the data budget; they should be paying for the benefits of observability. Sigelman talks about how the pricing units and models, not just the prices themselves, are actually pretty bad for customers. Pricing is an important part of how products are developed, and Sigelman discusses why it needs to change in order to make the observability value proposition a clearly positive one for the many enterprises that need it.
After delving into the complexities and current state of observability, Sigelman finishes off the conversation by talking about the future of observability. He discusses how the number of significant, well-known, established enterprises that have started building deep systems in production is completely different than a couple of years ago, and how the pain and problems associated with microservices are very widespread. In his opinion, it has been really useful for people to educate themselves about how observability can apply to these specific problems. Despite everything rapidly changing in the industry, Sigelman talks about his confidence in open source dominating the actual instrumentation and telemetry layer. He also touches on applications being built on top of observability that are way outside of the realm discussed in this podcast. Sigelman thinks that closing the loop and having feedback loops that let the software manage itself, based on the telemetry and some real-time analysis, is the future of this industry. Even if it’s 5-10 years out, he hopes that observability can change the way that the software actually works, not just the way that we understand it.
To learn more about Ben Sigelman or Lightstep, check out the resources down below. And to hear more about observability, tune in to the last part of this three-part series next week.