Charity Majors - Revelations in Observability (Observability Series - Part 1)

This is a podcast episode titled, Charity Majors - Revelations in Observability (Observability Series - Part 1). The summary for this episode is: Observability doesn’t have three pillars, and it is not a monitoring tool. While the concept of observability is often misunderstood, what it can do for ops is revolutionary and transformative. Observability is achieved when a system is understandable — which is difficult with many things failing at once in a complex system. Many large technology companies use systems to debug code and understand how it runs in production, but those tools are not available to engineers outside of those companies. That’s why Charity Majors — CTO and co-founder of Honeycomb — started the company to tackle the issues with logging, monitoring, and metrics. Despite the confusion surrounding observability, it can help us ask the right questions of our systems in a way that is predictable, fast, and scalable over time. Utilizing observability to fully understand complicated production systems makes those systems more resilient to errors. Listen to Charity and Ben discuss the innovations and revelations of observability, and learn more about this transformational tool in data. Say goodbye to spending all your time on debugging, and get ahead of those issues now. This week’s episode is the first of a special three-part series on observability in data. Tune in each week to hear from a major player in the data realm about how the world of observability is transforming.
The point of this transformation is not for everyone to suffer, but instead to make it better
00:12 MIN
Getting visibility into the log is a problem
00:16 MIN
We could follow the trail of breadcrumbs
00:11 MIN
It may seem simple, but it is revolutionary
00:52 MIN
Run books and predicting the problem
00:29 MIN
The blending of two disciplines
00:26 MIN
People are disconnected from the impact they are making with their work
00:14 MIN
Leaders don't see that Ops can be a grounding force for a team
00:11 MIN
Stop erecting glass castles, and build playgrounds
00:12 MIN
How well can you understand the inside by observing from the outside?
00:12 MIN
Be resilient to as many errors as possible
00:22 MIN
You don't want variability, you want muscle memory when you're writing code
00:38 MIN
A human gate adds variability
00:29 MIN
Weird is irreducible; normal should feel visceral
00:30 MIN
Make it as flexible as possible, so they can ask any question
00:16 MIN
Undercutting your own revenue
00:14 MIN
The "three pillars of observability" myth
00:10 MIN
If you're standing still in computers, you're losing ground
00:21 MIN
Mission is to create systems that allow everyone to be in the top 20%
00:34 MIN
These are socio-technical systems
00:27 MIN

The goal of this transformation is not to make everyone a masochist. Ops has a well-deserved reputation for masochism and we need to get rid of that, right? The point is not everyone should suffer, the point is this is how we make it better.

Ben Newton: Welcome to The Masters of Data Podcast, the podcast that brings the human to data, and I'm your host, Ben Newton. Welcome, everybody, to another episode of The Masters of Data Podcast. I'm excited to have a guest on here that I've been wanting to have on the podcast for a very long time, and I think many of you will recognize her name. I think we're in for a fun ride and a fun discussion with Charity Majors. She's the CTO and co-founder of Honeycomb, and she has a long history of making funny commentary on complicated subjects, particularly observability, which we're going to talk about today, and I'm really excited to have her here. Welcome to the podcast.

Charity Majors: Thanks so much for having me.

Ben Newton: Absolutely. Charity, particularly with somebody... One of the things I like about following you, I think you bring levity and fun to a complicated area, and you're just fun to follow.

Charity Majors: Well, thanks.

Ben Newton: I would actually personally love to know your story, and I think the listeners would love that too. We always start with that. Tell us a little bit more. How did you end up where you are? What led you to the technology area to be a software engineer, and end up at Honeycomb?

Charity Majors: Well, I come from rural... I come from the backwoods of Idaho. I was homeschooled, and I ran away and went to college when I was 15.

Ben Newton: Oh, wow.

Charity Majors: I had a piano performance scholarship. I got there and I started looking around and realized everyone was poor, and I'd been poor all my life. I did not want to continue being poor, so I switched my keyboards, and I went for computers instead. I have always just been a tinkerer, more than someone who's formally trained, because all the training that we need is out there on the internet, and we can just find it now. I've made a career out of being the first Ops person who joins a startup, where I meet a bunch of software engineers, and they've got a thing, and they think it's ready to be a real thing, and they've got users, and the worlds are colliding between beautiful theory and messy reality, and that's really where I like to sit. I was doing that for Parse, a mobile backend as a service, a few years ago. We were on a roller coaster, we got acquired by Facebook, and around the time we got acquired by Facebook, I was coming to the horrifying conclusion that we had built a system that was basically undebuggable by some of the best engineers in the world, doing all of the "right things." Figuring that out is what led me to start Honeycomb afterwards. Because when I was leaving Facebook, I just inaudible start and went, "Oh, shit, I don't know how to engineer anymore without this stuff that we've built here." It's become so core not just to how I fix it when the site is down, but to how I see; it's like my five senses for production. The idea of going back to just building in the dark was just unthinkable.

Ben Newton: I like the way you describe it. One thing I want to go back to, I actually hadn't made that connection there, I was going back and reading a little more of your bio that you started out in music. It's funny how many people I've interviewed on here that come from music background. I can definitely resonate because I was a music minor and I still love to play and play with friends and just do it for fun. But I remember one of my friends asked me, was like, " Why didn't you become a professional musician?" I ran into that in college, and I was like, "Because I like to feed my family."

Charity Majors: Yeah. No, there's something not very romantic about earning $12,000 a year as a 30-year-old.

Ben Newton: Exactly, exactly. But it is interesting how the-

Charity Majors: Well, the patterns, they're so mathematical. When you're dealing with these highly abstract systems, I think it tickles the same places in your brain.

Ben Newton: Yeah, no, absolutely. Some of the smartest people in math I ever met were actually musicians. I don't think half the time they even realized it. And I can definitely resonate. My first job out of school was with a company called LoudCloud, Ben Horowitz and Marc Andreessen's company. I remember in the early days of what was Ops back then, going in... I did a lot of consulting with people who were willing to either live in the dark or just be surrounded by a bunch of red buttons that nobody could explain, like red alerts.

Charity Majors: You're like cave creatures.

Ben Newton: Yeah, exactly. I was like, "How can you live like that?" That was the first time that I really ran into monitoring. I was like, "I can't live like this. How do you guys deal with it?" You lived the real life. You saw that problem, and then you started Honeycomb, I guess, as a way to formalize and build a product that did what you wanted, that you wish you had had? Is that how you thought about it?

Charity Majors: Yeah. In the dog days, of course, when it was down, it created this really interesting set of problems. We were doing microservices before there were microservices. We were doing a lot of things before there was a formal word for them. We were striking out on our own in a lot of ways. One of those ways was that we were using a shared pool of workers for the API, and we were using shared database backends. Around the time we got acquired, we had 60,000 mobile apps all sharing the same pools of hardware. Well, we didn't pick a threaded language, so that was a problem. The API could go down so fast, before anyone could be alarmed, before anyone could be alerted, because it would all fill up with requests that were in flight waiting to be served by one of these shared pools of hardware. The database would have a slow query running on it, everyone would get backed up, and the API would go down. A few times a week I'd be there, and Disney would be like, "Parse is down." I'd be like, "Parse is not down. Behold my wall full of dashboards, they are all green." Because the other category of problems was that a user would be having a terrible experience, but it would never show up in our top-level graphs, because they're all aggregates, right? They're all aggregates, and mobile traffic isn't huge. Maybe Disney's app is doing four requests per second, and I'm doing 100,000 requests per second on the backend; it's never even going to show up, even if they're 100% down. If they're down because of hardware starvation... As we got more and more databases behind this API, there were more and more and more points of failure. If any one of these backends was slow, everyone went down. It was literally impossible to figure out who, because with logging tools... Logs are great, but you can basically only find what you're looking for, what you remembered to log in the first place and what you know to search for. With monitoring and metrics, you can basically only find something that's going to show up in your top 10 list. If it's below that, you're screwed. Below that, you're literally just spraying and praying and poring over code and log lines. Because the thing is, your log might fill up with requests from one user or one query or one whatever, and that might not be the problem. It might just be all the requests that are backed up behind the problem. Getting visibility into what is actually happening at any moment was a really hard problem. I tried every tool on the market, and none of them was helping. There was this one tool at Facebook called Scuba, which is this butt-ugly, aggressively user-hostile tool that did one thing really well: it let you slice and dice in near real time on dimensions of high cardinality. By that, I mean, imagine you have a collection of 100 million users. Cardinality is just the number of unique members of a set, so the highest possible cardinality will be social security number, or request ID, anything that's unique. Last name and first name are very high cardinality; gender is low cardinality. Species equals human would be the lowest of all, right? All of the tools out there, all of the monitoring tools that use metrics and tags, can only support low-cardinality dimensions. You get over 100 members of the set, and suddenly they're like, "Whoa, whoa, whoa, you're blowing out the keyspace. You're going to have to back out of that."
Being able to just break down... By the time we left Parse, we had a million apps. Being able to break down by one in a million apps, and then by any combination of query, backend, API key, whatever, was transformational. Suddenly, instead of having to sit here and use our human brains to hypothesize about what was happening, which was impossible, we were able to just put one foot in front of the other and follow the trail of breadcrumbs. I could break down by app ID and go, "Yep, they're all timing out. Well, is it all endpoints?" So I could break down by endpoint. Are they all timing out? Break down by request ID, or break down by the return code. Yeah, it is. Okay, is it all the endpoints? Are they all slow? Are they all timing out? Oh, no, it's just some of them. Which ones are slow? Oh, it's just the read endpoints. Are they all slow? No, it's just the ones that talk to MongoDB. Is it all of them? No, it's just the ones in this AZ or this replica set. Is it all of the writes? Oh, no, it's just this one query that's slow. Now, I know the answer. I don't have to know anything about what's going to be waiting for me at the other end of this tunnel. I just ask a question, look at the answer, and formulate another question. Which, by the way, sounds simple, but is revolutionary, because in operations we have been in this mode forever where we carry the world around in our heads, and we have all this scar tissue based on all the past outages that we've experienced. We get really good at flipping through that Rolodex in our head and going, "Oh, yeah, I know what this pattern looks like. This means it's Redis." Or, "This pattern, I remember this from a year and a half ago when we ran out of file descriptors." Just jumping from possible answer to possible answer, which worked really well in a world of monoliths and single databases, and works really, really, really poorly in a world where every time you get paged it's something brand new.
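
To make the "slice and dice on high-cardinality dimensions" idea concrete, here is a minimal Python sketch. This is not Scuba's or Honeycomb's actual API; the event fields and the breakdown() helper are hypothetical, and the point is only the iterative question-asking loop described above.

```python
from collections import Counter

# Hypothetical wide events, one per request, with arbitrary
# high-cardinality fields (app_id, endpoint, backend, ...).
events = [
    {"app_id": "app_734812", "endpoint": "/classes/Post", "backend": "mongo-rs3",
     "status": 504, "duration_ms": 30000},
    {"app_id": "app_734812", "endpoint": "/classes/Post", "backend": "mongo-rs3",
     "status": 504, "duration_ms": 29850},
    {"app_id": "app_002916", "endpoint": "/login", "backend": "mongo-rs1",
     "status": 200, "duration_ms": 42},
]

def breakdown(events, dimension, only=None):
    """Count events grouped by any dimension, optionally filtered first."""
    if only:
        events = [e for e in events if all(e.get(k) == v for k, v in only.items())]
    return Counter(e[dimension] for e in events)

# Follow the breadcrumbs: which apps are timing out?
print(breakdown(events, "app_id", only={"status": 504}))
# Of those, which backends are they hitting?
print(breakdown(events, "backend", only={"status": 504, "app_id": "app_734812"}))
```

Each answer suggests the next question; because any field can be a grouping key, the breakdown never has to be predicted in advance.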

Ben Newton: I remember when I was going back and rereading some of the stuff you had written, I really like the way you had described them. I've run into this in the conversation I had in my own experience and in talking to people that do this every day. I like the way you describe that, and I think that a lot of people don't appreciate is this whole investigative process. If you can go back, what was that famous-

Charity Majors: Like in the science, and computer science.

Ben Newton: Yeah. Also, I know you would recognize the tweet, I forget who wrote it, but there was a famous tweet, it's like, we converted to microservices so that every outage could be like a murder mystery.

Charity Majors: Yes. So true.

Ben Newton: But there's something about that whole iterative hypothesis-testing scenario that I think a lot of people who haven't lived this find hard to understand. It's not this, oh yeah, this is probably it. Well, if I already knew the answer, then I should have fixed it already. But with a lot of these microservices, you don't know the answer. You don't know what you're looking at.

Charity Majors: Exactly. The whole point of having a fairly mature system is that you've automated those out of existence. You're not getting paged in the night going, "Oh, this again, let me check my run book." There are no run books in this world because, if you can predict the problem, you've fixed it. You have to, or you will be drowning, because there's this other long tail of problems that are only ever going to happen once and never happen again. Yet, you need to be able to diagnose them very quickly. It's interesting because the more scientific model of debugging, a lot of people do it in their code, when they're writing code. They've learned all these tricks and techniques for doing that, for bisecting the problem, for adding instrumentation in the right place to point out the problem. It's interesting now because I think what we're seeing is, because we've blown up the monolith, we can no longer trace things the way we used to be able to trace with strace or whatever. Now it hops the network, so a whole category of our tools broke. What we're seeing now is the blending of the two disciplines, because now you have to have more operational tools to trace your code, because it's going to hop across components.

Ben Newton: That's a really interesting way to describe that, because I guess it would also be tied to the fact that through this whole transition from monoliths to microservices, you also have this cultural change going on, from waterfall to agile to DevOps. Engineers have to take responsibility for the stuff they write.

Charity Majors: Very, very much. You don't have a prayer of debugging it in a short amount of time if you didn't write it and maintain it. There's just no way you could hand that to an Ops team and go, "Figure it out." Similarly, you weren't going to do a good job of writing these systems unless you're pretty heavily embedded in the operation of them, because those operational issues are so central to microservices. If you're making that someone else's problem, that's half your job. You need that feedback loop, intensely. I don't look at this as a new thing. I look at this as a return to our origins. In the beginning, there weren't these specialized niches, where I just write the code, or I just run the code. There was just owning code. People would yell at you if your shit didn't work, and so you would fix it. It was this very tight, virtuous feedback loop where you got the feedback you needed in order to do the right thing. Specialization and scale and blah, blah, blah, we've broken it up into all of these specializations, but by doing that, we also lost a lot of what motivates us. I look at so many people who are so disconnected from their jobs; they just clock in, they write code, they clock out, they don't really give a shit. And I can't help but feel like that's connected to the fact that they're so disconnected from their users and the impact of what they're building. I feel like there's this whole virtuous cycle going on. There are people who will try to abuse it. I've gotten so many DMs from people who are panicked, like, "My manager says Charity Majors says I need to be put on call, and I don't want to, because I get woken up five times a night and I don't want to be miserable." I'm like, the goal of this transformation is not to make everyone a masochist. Ops has a well-deserved reputation for masochism, and we need to get rid of that. The point is not that everyone should suffer; the point is, this is how we make it better.

Ben Newton: Yeah, no, I totally understand what you're saying, and I think that was a really interesting transition, particularly in the early 2000s when I was coming up: you started getting these people, the people that only knew how to run systems, and the people that only knew how to write code. I tell this story sometimes, but one that always struck me is... Back then I was doing government consulting, and we worked on this particular government contract where we were building a file-sharing system, and it actually turns out generals like to use PowerPoints. That's how they share their battle plans. Basically this development team wrote a new version of it, and they used a database that had never been used for the purpose and all this new stuff, which is fine to experiment with, but they put it out in production. I was sitting in what would now be called a DevOps seat, and it went down, I shit you not, for 30 days.

Charity Majors: Oh, God.

Ben Newton: They were camped out in the conference room and the military brass coming down their throats. It was just an awful experience. Ever since then, I'm like, they could have avoided that by being more involved and feeling the pain earlier, but because they disconnected themselves from the production pain, it came all at once.

Charity Majors: Yeah. There's no such thing as putting it off forever. You can have it early and in small, controllable doses, or you can have it all at once and have a real nightmare. Ops has always sat close to the users. Our motivations are very closely aligned with theirs. That's why I have always found that Ops teams have the tightest unit cohesion. They're the teams that have each other's back; there's a bit of a foxhole mentality, and it's very bonding. I've had so many software engineers over the years wistfully ask if they could join my team, just because they really envied that dynamic. There's that gallows humor and that grizzled, I've-seen-it-all sort of thing. I feel like too often leaders see this as entirely a negative, like Ops is just a cost center. They don't see it for the grounding force that it really could be for a team.

Ben Newton: I never heard anybody say it that way. I think you're really onto something there, because sometimes that unit cohesion could get in the way, because there were the Windows guys and the Linux guys, and the database people-

Charity Majors: It absolutely can. Tribalism is a powerful force to be managed very consciously. But tribalism is just an extreme version of what bonds us as human beings.

Ben Newton: Yeah, exactly. Having that cohesion is what really makes it work.

Charity Majors: We should love to work for each other, fundamentally, we do. The more that you can reinforce that, the happier people tend to be. That can be taken too far too. I've seen so many people who really should have left their jobs years ago. They're too good for that place. But between our fear of interviews and our need to be there for each other, people who just stay stagnant and you only get one career, and you're in charge of your career. I think that a dose of that is good. I think that it's really good... The lecture that I can give to Ops people now is, it's time to... We need to stop erecting these glass castles where we keep them safe by making sure that nobody can touch them, and we need to look at them like it's a playground that we need to build. You accept the fact that your kid might get a bloody nose in the playground once in a while, and it's fine. We build guardrails, we make sure that the slides aren't too high and everything, but you don't want them to die. We really do need to be inviting developers in and having a spirit of enablement in the department.

Ben Newton: Yeah, no, I think you're absolutely right. This is one of the reasons why, when I was a product manager, I used to invite the engineers writing the code to sit in on interviews with users, because it sometimes was painful. But the thing is, then they understood. Yeah, they understood.

Charity Majors: For sure. At Parse, we had a rotation up till the end where developers would do a day of support, where they would just deal with support tickets. That is the other side of on call and operations work, it's the user interface stuff. Yes, I think that all developers should be very grounded in inaudible

Ben Newton: Yeah, no, absolutely. Well, making a transition to talking about this subject that your name, possibly more than anybody else's, is connected with: observability. Out of this whole, I don't know, what we've been talking about here, why there's this pain. Tell me a little bit more about... You've had a front seat to the birth of the term and the misuse of the term. Tell me a little bit more about why it came about.

Charity Majors: That was actually my fault, kind of. It might have been inevitable, but Honeycomb began in January 2016, and I spent the first six months of that year just wrestling with, not how to build it. We knew how to build it, which is not trivial, because we had to build our own storage engine and everything, but how to talk about it. Because everybody was telling us the market is saturated. Datadog's about to IPO. There's no space, there's nothing left to be done. It's mature. We had had this experience, and I just didn't believe that, but I didn't know how to talk about it. Every term in data is so overloaded. We knew that it wasn't a monitoring tool, because monitoring is mostly about curating thresholds. You define some semi-arbitrary thresholds. You're like, between this and this, it's fine. Let's just check over and over and over, make sure it's still fine. We knew that that's not what we were building, but every tool, every demo looked the same, and it was just infuriating. Then there was a day in July, or June or whatever, when I Googled it, and you can find my tweets. I actually went back and found my tweets from that day. I looked up what observability means, because the only heritage to that point was that Twitter's team had called themselves the observability team, and they were basically a monitoring team, but they used the term, so it was out there. I Googled it, and I realized it had this really rich backstory. I didn't actually know; all the math majors were like, "Well, duh." It has this really rich heritage in mechanical engineering, where observability is the mathematical dual of controllability.

Ben Newton: Yeah, I was a physics major, and when I saw it, I was like, " Oh, yeah, of course. I've never heard it used like this."

Charity Majors: Yeah, I had never... When I read the definition, where it's about how well you can understand what's going on inside the system just by observing it from the outside, I just had light bulbs going off. I was like, oh my god, this is exactly what we're trying to do. The tooling that I had used at Facebook, and that I was trying to build, was to let you understand what the working state of the system was, even if it was in a completely new state that you'd never seen before, so you had no context, you couldn't pattern match against anything that had happened before. You could understand it by persisting that state all the way through the execution of however many services, et cetera. It was just like, oh my God, this is what we're trying to build. I started talking about it a lot. Then unfortunately, after a year and a half of my carpet bombing the world with talks about observability, all the other vendors in the adjacent logging spaces, monitoring spaces, metrics spaces, and APM spaces were like, yo, we do observability too. I'm like, "No. Well, you don't, you don't." Because if you accept the definition that it's about the unknown unknowns, and all this other stuff, where they have happily lifted all of the marketing language, they're like, yeah, we do unknown unknowns. If you accept that definition, then there are a lot of things that proceed from that, technically speaking. You have to be able to handle high cardinality and high dimensionality. Your source of truth has to be these arbitrarily wide structured data blobs, because if they're not arbitrarily wide, then that means you predefined in advance what the schema was, which just means you're telling your future self what they can expect to see, which, again, is not the point. I believe, and I hope, that under the hood all these other companies are trying very quickly to build the technical stuff to catch up with our marketing language, but it's really, really maddening for me. In the meantime, I would say that Lightstep and Honeycomb are the only observability providers out there.
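
For a sense of what an "arbitrarily wide structured data blob" might look like in practice, here is a hypothetical sketch; none of these field names come from the conversation, and the point is only that new dimensions can be appended without a predefined schema.

```python
# One event per unit of work. There is no fixed schema: any field that
# might someday be useful for debugging can be added at write time,
# including very high-cardinality ones like user_id or request_id.
event = {
    "timestamp": "2020-02-11T17:03:22Z",
    "service": "api",
    "endpoint": "/classes/Post",
    "request_id": "req_9f1c2d4e",   # unique per request: highest cardinality
    "user_id": "u_58120394",
    "app_version": "2.41.7",
    "db_replica": "rs3-az2",
    "duration_ms": 184,
    "status": 200,
}

# Adding another dimension later is just another key: no schema migration,
# no pre-aggregation, no cardinality limit imposed by the data model.
event["feature_flag.new_ranker"] = True
```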

Ben Newton: I think that connection you make with the cardinality, because I think it's another term that is a math and physics major, is like yeah, cardinality. But most people don't use that term.

Charity Majors: When I was in charge of marketing for Honeycomb, I was like, cardinality is going to be a winner for us. Let's market based on this, and we got five great customers from that campaign.

Ben Newton: Well, yeah, it was... The other people that know what the word means. I think it's one of those words that when you connect it to reality, and-

Charity Majors: You understand, it clicks.

Ben Newton: Yeah.

Charity Majors: You're like, " I want to see why this matters." Because all of the interesting debugging information is going to be high cardinality.

Ben Newton: Yeah. I think the way you described cardinality in terms of when we had monoliths, and where I started out too, you had the three-tier architectures. I remember being super proud of the day that I took this architecture I was working on, and I put it in a Visio diagram, and we had this giant printer. I don't know who made it, but it was ridiculous. It was like six feet across or something. We printed it out and put it up on the wall like, yes. It actually was reasonably accurate for months. Whereas you get to these microservices, and they're organic. They're like living things.

Charity Majors: It's flipping in and out of existence. It's dynamic, it's responding to what's happening on the ground, and you just have to... There's this whole shift that you have to make from trying to prevent mistakes, trying to prevent errors, to just embracing them and going, this is constant. This is the fact of our reality, and, you know what, errors are amazing, because we learn something every time something fails. Our goal is not to prevent them, it's to be resilient to as many errors as possible.

Ben Newton: Yeah, I think one guy I worked with used to call that limiting the blast radius. That makes a lot of sense. I think that when you go back to how you talked about the investigation process, and you can tell me if this makes sense to you, but that whole idea... I actually described it to somebody the other day as the tip of an iceberg: you have a couple of signals to say something might be wrong. You're going to the emergency room, and they're looking at the size of your pupils, and they're looking at your blood pressure, and they're like, "Well, you're bleeding all over yourself." These are pretty simple indicators that even a non-professional can see and interpret. But once you delve in and the surgeon comes in, or the physician, they're opening you up. Their area of possible investigation, and the things that they could touch or feel or measure, just explodes in complexity. Like you said, the systems in the early 2000s were all about that initial question. They never expanded beyond that.

Charity Majors: Right. We have the benefit of instrumentation. The surgeon can never actually tell your foot, "Let me know if you get too cold," which is something that I deeply wish I had for my own body. I don't know about you, but we need to do that. This is the final step, I feel like. When you're writing code, you need to be thinking about your future self and all the poor saps who are going to be debugging this shit long after you've left. How are you going to know if it's working or not? What instrumentation should you add? What might someday be the missing link? Just getting into this habit of training ourselves to instrument for the future while we're writing code. After you merge your code, it should go live in prod within minutes, automatically. There should not be any human involved. When you add a human gate, you add variability. What you want is for this to become muscle memory. You merge your code, you go and look at it in production, through the lens of the instrumentation that you just wrote, while it's all fresh in your head. You know exactly what you were trying to do, you know exactly what the entities are, you know exactly what the outliers are and what to look for, and that's going to decay quickly. You're going to move on to something else. So, right now is when you need to look at it and ask yourself: is it doing what I expected it to do, and does anything else look weird? It's irreducible from that. Weird is irreducible, because, if you're in your systems every day, you get this very visceral feel for what normal is. If you're only looking at it when you know something's wrong, you're not going to have that feeling, so you have to be looking at it every day and building that muscle of just, does it look weird? Okay, I'm going to go investigate. If you do that, I swear to God, 80%, 90% of all problems will never even get to the point where users can see them, because you'll see it right then and there. If you don't see it right then and there, you move on to something else, the world moves on, and the bugs that you shipped become part of the new normal. Everyone absorbs them as normal. Then it'll take months for users to tease them out. That's just really inefficient. That's how we get in a state where we're shipping code that we don't understand, to systems we've never understood, and just crossing our fingers and curating our monitoring thresholds going, "Well, it seems okay." That just does not let you move quickly with confidence.
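
As a hedged illustration of "instrument for the future while you're writing the code," here is a minimal Python sketch. The emit() helper, the handler, and all field names are hypothetical, not a specific vendor's SDK; the idea is simply that each request produces one wide event, written alongside the business logic.

```python
import json
import sys
import time

def emit(event):
    # Hypothetical sink: in real life this would ship to your event pipeline.
    json.dump(event, sys.stdout)
    sys.stdout.write("\n")

def handle_checkout(request):
    event = {"name": "checkout", "user_id": request["user_id"]}
    start = time.monotonic()
    try:
        # ... the actual work would happen here ...
        event["cart_size"] = len(request["items"])       # might be the missing link someday
        event["payment_provider"] = request["provider"]
        event["status"] = "ok"
    except Exception as exc:
        event["status"] = "error"
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = (time.monotonic() - start) * 1000
        emit(event)   # one wide event per request, emitted as the code runs

handle_checkout({"user_id": "u_42", "items": ["a", "b"], "provider": "stripe"})
```

The fields are chosen while the intent is still fresh, which is exactly when you know what the entities and outliers are.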

Ben Newton: Well, one thing I definitely would be really interested in hearing your thoughts on... Because I remember, and I've definitely seen you write about some of this, that early on when I started, typically the kind of stuff you monitored was either stuff that somebody at a vendor somewhere said, "Oh, you should monitor this," that came out of the system and there was nothing you could do about it, or it was very hand-curated. Then it feels like there was a period of time, maybe in the 2010, 2012 timeframe, where it was like, monitor everything. Just throw as much stuff as possible into the system. Nobody knew what to do with it. Then you have this reverse thing, and I know you have opinions on this, on the three pillars, of trying to collapse it all into these. But it seemed to me, and I know you've mentioned this before, that it misses the whole point, which is, you need to know what to do with the data. What questions are you trying to answer? A lot of times there's just this tendency of, let's just check a box and get a whole bunch of data out there, and it'll be fine. But it really won't, you know?

Charity Majors: No. The more you emit, the harder it can be to find the actual root causes. You don't want to be in a situation where you're trying to predict what questions you're going to need to ask, because you can't. This is where, and I feel like this is a really subtle thing, the data model matters. You want to incentivize engineers: anytime you see something that might someday be useful, toss it in. If your source of truth is an arbitrarily wide structured data blob, adding more dimensions to that data blob is almost free. It's just a few more bytes of memory. Versus, if your model is the metrics model, the cost of storage and everything literally goes up linearly. You cannot ask new questions of metrics; you can only ask the questions you stored the data to answer in the beginning. I feel like the metrics model, which is what Datadog and Prometheus are built on, has just reached its end of the road. At write time, you are throwing away all of the context that links all of the metrics together, and you can't get back to it. You can take an arbitrarily wide structured data blob, and from it you can infer or derive the metrics, and the unstructured logs, and the traces, but you can't go backwards. You can't actually go backwards. That's the truth. I feel like what we have right now is a bunch of companies that were built on the wrong data model, and they're trying as hard as they can to get to the right one, I assume, under the hood. Because it's not about... You want to incentivize people to capture all the data, and then you want to make it as flexible as possible so they can ask any question of it, any combination, any permutation. You want to make it very, very flexible, so that they don't have to predict what questions they're going to ask.
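
To illustrate the one-way direction Charity describes, here is a small hypothetical sketch: metrics can be derived on demand from wide events, but the pre-aggregated numbers cannot be turned back into events. The data and field names are made up for illustration.

```python
from statistics import median

# Wide events are the source of truth...
events = [
    {"endpoint": "/login",  "status": 200, "duration_ms": 41,  "user_id": "u_1"},
    {"endpoint": "/login",  "status": 500, "duration_ms": 903, "user_id": "u_2"},
    {"endpoint": "/search", "status": 200, "duration_ms": 120, "user_id": "u_3"},
]

# ...from which any metric is just a lossy aggregate computed on demand:
error_rate = sum(e["status"] >= 500 for e in events) / len(events)
p50_latency_ms = median(e["duration_ms"] for e in events)
print(error_rate, p50_latency_ms)

# The reverse is impossible: given only these pre-aggregated numbers,
# there is no way to recover which user, endpoint, or request was affected.
metrics = {"error_rate": error_rate, "p50_latency_ms": p50_latency_ms}
```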

Ben Newton: Yeah. No, I think you described that really well. I think that's... What's interesting, and having been in the industry for a while, I think what you described is changing the model. That's the perennial problem of software products because it happened, you move from mainframe-

Charity Majors: Migrations are a bitch.

Ben Newton: Yeah, exactly. You moved from mainframes to mini-computers and so on and so forth. Always, every time there's a change, for the previous vendors, the vendors that latched onto that model, for whatever reason, rightly or wrongly, it's very, very hard to change your model after the fact, because you've got revenue depending on it.

Charity Majors: Yeah, you really do. You've got customers, you've got people... It's even harder in computing. Generations of technologies have a way of being cheaper than the ones that came before. If you have a bunch of revenue that's locked up in the old way of doing things, you'll be undercutting your own business if you bring out a cheaper way of doing things, which can be very, very hard to swallow. Tracing is a really interesting thing here too. You mentioned three pillars, and I've got to shit on them for a second. The only reason there are three pillars is because these companies have three products to sell you — monitoring, logs, and traces — and that is the only reason. There is no logical reason. In fact, it is more expensive to have three different products, but it's also bad for the user, because if you've got three products or four products or whatever, you've got a person sitting in the middle, copy-pasting things around. You look at your monitoring dashboards, you see a spike, you're like, "What is that?" Well, now you jump over into your logging thing and you try to correlate the timestamps in those services or whatever. Then you find an example of the problem, and you copy-paste that ID over to your tracing solution. You store this data three times for no good reason, and you're in the middle trying to make sense of it. Well, that's not better. Tracing should just be a visualization mode: you can view the data by time, or you can view it by some other dimension, but that doesn't really matter. It should be two sides of a single coin, just flipping back and forth. Because what that allows you to do is go from the very high level, like, we might lose or violate it, all the way down to the raw events. Diffing what happened in this version ID versus that version ID for these dimensions. Then you can just go, "Oh, I'm going to trace this. Oh, there in my trace, I see the problem. Now, I want to zoom back out and see who else is impacted by this." Just that toggling dynamically back and forth becomes impossible if you have to jump between product inaudible
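
A hedged sketch of "tracing should just be a visualization mode": if the same wide events carry trace relationships, they can be rendered as a waterfall or aggregated, with no copy-pasting between separate products. The field names (trace_id, span_id, parent_id) are illustrative, not a specific tracing spec.

```python
from collections import defaultdict

# The same events, annotated with trace relationships.
events = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "POST /checkout",
     "duration_ms": 310, "service": "api"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "charge_card",
     "duration_ms": 240, "service": "payments"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "name": "db.write_order",
     "duration_ms": 45, "service": "orders"},
]

# View 1: render one trace as a waterfall, ordered by parent/child.
def print_trace(events, trace_id, parent=None, depth=0):
    for e in events:
        if e["trace_id"] == trace_id and e["parent_id"] == parent:
            print("  " * depth + f'{e["name"]} ({e["duration_ms"]} ms)')
            print_trace(events, trace_id, e["span_id"], depth + 1)

print_trace(events, "t1")

# View 2: aggregate the very same rows, e.g. total time per service.
per_service = defaultdict(int)
for e in events:
    per_service[e["service"]] += e["duration_ms"]
print(dict(per_service))
```

Two views, one store: zoom into a single trace, then zoom back out to ask who else is affected.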

Ben Newton: Yeah. I think you're really right there. There is a tendency to try to take a new concept and boil it down to previous concepts. That's one of the reasons why marketing often gets such a bad rap because there's this, oh, but we've got to map this to the old model, or we can take what we already have on the track and-

Charity Majors: It has to be done with understanding to have credibility. Unless you pay enough; if you have enough dollars behind it, then it will. That's what I feel has happened with the three pillars. It's just like, well, all of the big vendors have endorsed this, therefore it is true, which is sort of unfortunate.

Ben Newton: Yeah, well, therein lies the history of the software industry.

Charity Majors: Right?

Ben Newton: Particularly on that, we've gotten into this transition where monitoring has gone to... We heard the "monitoring is dead" talks at inaudible and the whole thing, and now we're in this observability phase. It's been misused and whatever. Where do you feel like this is going? You sit in a perch where I feel like you have a really good view of what's really going on. Where do you see this going?

Charity Majors: It's being driven by pain, the pain of the engineers who are trying to build these ever... The services that are just exploding in complexity. If you look at the DORA report, the DevOps Research and Assessment report, year over year, it's really fascinating, because you can see from 2018 to 2019 the bottom 50% of teams, in terms of their efficiency, actually lost ground, and the top 20% is achieving escape velocity. This speaks to the fact that if you're standing still in computers, you're losing ground, because entropy is always coming for you. There's always more complexity coming, there are more users, things are degrading, and you just can't stand still. The teams that have embraced the new way of doing things, there's pain involved, but it's short-term pain that leads you to a better place. It's not just observability, it's observability and feature flags and progressive deployments, and all of this tooling that reinforces each other, which brings developers into the day-to-day operations of their code, and which makes Ops people enablers. We're systems enablers, to help people own... Ops is never going away. There is an expertise there that is deep and real, but we have to stop being blockers to progress and we have to start pushing and enabling the progress. Often, people look at that top 20% and they go, "Well, I'm not one of those engineers. This isn't for me. This is for the Googles and the Facebooks of the world." I would strongly push back against that. Your ability to ship code fast and safely, with confidence, comes maybe 10% to 20% from the knowledge of algorithms and data structures in your head. It's not about you, it's about the system around you, it's about the team around you. If a high-performing engineer joins a low-performing team, within three to six months they will drop to that team's level of being able to ship. It's all about the system. Those of us who have gotten over the hump of being a senior engineer in our careers, those of us who have some authority and credibility, whether we have a technical role or a people role, our mandate and our mission has to be creating the systems that will let everyone be in the top 20%. Creating the systems that will ship your code safely, fast, quickly, cleanly, that give you the tools to inspect at a very low granularity when it's in production and say, "Is it doing what I expected? Is it doing anything else that looks weird?" And then driving home those cultural principles that will create the expectation that everyone looks at their code and owns it in production.

Ben Newton: Yeah. I really like the way you explain that, Charity. Because I think in particular... I was literally just talking to somebody about this. I feel like if there's one lesson I've learned in the last 20 years in this industry, it's that everybody always thinks it's about technology. But nine times out of 10 it's mostly about people, and about culture.

Charity Majors: It's about both. It's about the intersection. These are socio-technical systems, and you're not going to solve them by just looking at the people or just looking at the technology or just looking at the tools. It's got to be cohesive... It's got to be all three, because there is no recipe book, there is no roadmap. Your system is unique, and it is complex, and it has its own needs. Sometimes you will need to break the rules or the consensus, but only you know where.

Ben Newton: Yeah, exactly. If you don't have the right combination of both, you can't succeed.

Charity Majors: Yep.

Ben Newton: Well as always, Charity, I'm honored to have you on the podcast and you're-

Charity Majors: Thanks so much for letting me rant.

Ben Newton: Absolutely, you're even more engaging in person. I love the perspectives you put out there and I'm glad we were able to bring you on. Thanks for coming on.

Charity Majors: Thanks.

Ben Newton: Thanks again everybody for listening and we'll definitely put some links in the show notes about some of the stuff that Charity was talking about and the Honeycomb and some of the things that she works on. As always, thank you for listening. Take care.

Outro: Masters of Data is brought to you by Sumo Logic. Sumo Logic is a cloud-native machine data analytics platform delivering real-time continuous intelligence as a service to build, run, and secure modern applications. Sumo Logic empowers the people who power modern business. For more information, go to sumologic.com. For more on Masters of Data, go to mastersofdata.com and subscribe, and spread the word by rating us on iTunes or your favorite podcast app.

DESCRIPTION

Welcome back to the Masters of Data podcast! This week’s episode is the first of a special three-part series on observability in data. Tune in each week to hear from a major player in the data realm about how the world of observability is transforming.


In today’s episode, we talk to a very special guest who is trying to change the way we view and use observability in data. The guest of this episode is Charity Majors — CTO and co-founder of Honeycomb. Honeycomb is a company that allows individual engineers, teams, and organizations to understand their production systems better and get ahead of issues before users see them.


Charity and her team utilize observability to find issues before they are seen by users. By understanding the system, they can save time and money on debugging bad code. Charity and Ben sit down to discuss just how complicated the term, “observability” really is and how observability can make everything better for operations.


Charity starts the conversation by giving some context to her background and how she found herself in the world of technology as a software engineer. She delves into her rural upbringing and musical background, explaining that she switched keyboards, from piano to computers. Charity talks about finding training on the internet, and says she “made a career out of being the first ops person who joins a start-up.” Charity was working at Parse around the time it got acquired by Facebook, and she realized that the system they had built was undebuggable even by some of the best engineers in the world. This discovery then led her to start Honeycomb because “the idea of going back to building in the dark was unthinkable.”


But aside from Charity’s pathway into Honeycomb, Ben and Charity also discuss the issues with logs, metrics, and monitoring. Charity talks about the difficulty of getting visibility into what is actually happening when the API goes down. In operations, people are in a mode where they “carry the world around” in their heads, and Charity explains that they all have “scar tissue” based on all the past outages they’ve experienced. Jumping from possible answer to possible answer worked really well in a world of monoliths and single databases. However, she argues that this mindset and this process work poorly “in a world where every time you get paged is something brand new.” She talks about working on microservices before there was a term for them, and the issue of not knowing what you are looking at: “there are no run books in this world because, if you can predict the problem, you've fixed it.” She says that there is a long tail of problems that are only ever going to happen once, yet you need to be able to diagnose those problems quickly. Charity also touches on the blending of the two disciplines, because nowadays operational tools are needed to trace code.


Something often left out of the conversation surrounding operations and data is the humans behind the work. Charity spends some time discussing why she thinks people are disconnected from their jobs and how she stays motivated. Many people simply clock in, write code, and clock out, and don’t realize the impact of what they’re building. They are disconnected from the users, and they don’t want to be miserable in their jobs. In fact, she says, “ops has a well-deserved reputation for masochism, and we need to get rid of that. The point is, not everyone should suffer, the point is, this is how we make it better.” She even discusses the tight-knit cohesion of an ops team, and how those teams are often undervalued and seen as “just a cost center.” The motivations of ops align very closely with those of the users, and Charity states that operations can be a grounding force for a team.


After delving into the complexities and nature of operations, Charity begins to discuss the confusion around observability. She had a front seat to the birth and misuse of the term, and she talks about the role she played in observability’s beginnings. Everyone was telling her the market was saturated, and every term in data is so overloaded, so she spent the first six months thinking about how to talk about it. Charity knew how to build it, and knew it wasn’t a monitoring tool, but didn’t realize it was observability until she googled the term: “when I read the definition, where it's about how well can you understand what's going on inside the system just by observing it from the outside? I just had light bulbs going off. I was like, oh my god, this is exactly what we're trying to do.” That’s when Charity started talking about it a lot, and then all the other vendors decided they did observability too, which created mass confusion surrounding the term. Observability is all about the unknown unknowns, and Charity is passionate about debunking the myths surrounding the term. One of those myths is that observability has three pillars: logging, metrics, and tracing. Charity talks about how the three pillars only exist because these companies have three products to sell you, and that having three products is more expensive and bad for the user.


Simply put, observability is achieved when a system is understandable, however that is very difficult to do when many things are failing at once in a complex system. Charity talks about why there should not be any human involved in the process, “when you add a human gate, then you add variability, and what you want is for this to become muscle memory.” She talks about knowing exactly what you were trying to do, knowing exactly what the entities are, knowing exactly what the outliers are, and knowing exactly what to look for because, "if you're in your systems every day, you get this very visceral feel for what normal is."


Despite the issues surrounding observability, Charity is determined to make it a revolutionary tool for operations teams everywhere and the data world at large. It is important in technology to constantly be moving forward and developing new ideas and systems. Charity talks about how embracing the new ways of doing things is painful, but ultimately leads you to a better place. She discusses the idea that there are always new users, things degrading, and new complexities coming, so “if you’re standing still in computers, you are losing ground.”

Charity finishes off the conversation by talking about the intersection of technology, people, and culture. At their core, all these systems being discussed in today’s episode are socio-technical systems, and Charity argues that you can't solve them by just looking at the people or just looking at the technology or just looking at the tools. It has to be cohesive, and “sometimes you have to break the rules."


To learn more about Charity Majors or Honeycomb, check out the resources down below. And to hear more about observability, tune in to the second part of this three-part series next week.