Graph Data and Why It Matters (Guests: Denise Gosnell and Matthias Broecheler)
Ben Newton: To be honest Matthias, I was really trusting everything you said until you got Structural Relational Learning wrong. Then you lost me.
Matthias: Yeah. My PhD Advisor would be so upset with me right now. That's why I had to correct that real quick. Otherwise, woo!
Ben Newton: Yeah. We'll make sure to warn people before they listen. Welcome to the Masters of Data Podcast. The podcast that brings the human to data. And I'm your host Ben Newton. Welcome everybody to another episode of the Masters of Data podcast. I'm excited to bring on a couple of new special guests today, who are going to talk to us about something that you might not know as much about: graph data and graph databases. But they're also some pretty fun people. So I think we're going to have a good time. So first, we have Denise Gosnell. She is the Chief Data Officer, otherwise known as the Goddess of Data at DataStax. It's good to have you here, Denise.
Denise: Thanks Ben. Glad to be here. Appreciate that colorful intro.
Ben Newton: [inaudible 00:01:13], we will start it off on a good note. And we've also got Matthias Broecheler here. Hopefully I didn't butcher that too badly. I got the little guttural thing going on there. He's the Chief Technologist over at DataStax, the Creative Officer, the man of many talents. Welcome onto the podcast.
Matthias: Thank you Ben. And your six years in Europe definitely show. Nice work on the last name there, and I apologize for that.
Ben Newton: No, no, no, no, not at all. But we're bringing you guys here today where we're... It's kind of an interesting time. You guys are sheltering at home just like me. So, it's good to connect on any topic. Much less a cool one like graph data and graph databases. So, maybe we just want to start off. And, even being in this space, I still feel somewhat ignorant on these topics. Maybe give us a little primer. I mean, why would you want a graph database? What is graph data and why should I care?
Denise: Yeah. I mean other than we're all isolated right now, which is a fun graph term anyways. That's so cool. For kind of understanding what graph data is and why we care, the best way to think about it is that, graph data prioritizes relationships between data over the actual things or the entities themselves. And that's also just kind of how you think Ben. So I think when we were even getting ready for this, we were kind of walking through how we knew each other. I had a colleague, who had a brother, who connected us together. And so we're able to kind of map together, the relationships that brought us together on this podcast. And it's the way humans think. And the fact that humans think in relationships, is exactly why businesses and other people are wanting to use graph data to represent the data within their own companies. And set up graph databases to be able to better model it more efficiently.
Ben Newton: Yeah. And so, what are the kind of use cases that you guys are... Well, I mean even back to me, maybe taking a step back so, what is graph data ? And kind of like, you talked a little bit about relationships. What does that actually really mean ? What are you actually putting in there ?
Matthias: Yeah, good question. I mean we have one massive use case happening in front of our eyes right now, with a pandemic. Social networks are a very prominent use case of graph data. And if you want to for instance, model the spread of a disease, let's say... I mean, I'm just pulling this out of my head. Let's say coronavirus is spreading in a social network, hypothetically speaking.
Ben Newton: Good example.
Matthias: Right? I'm just so proud of myself for coming up with that example right now. And in that case, you don't randomly contract a virus. It's not like it just happens upon you. So if you look at an individual, and you look at why they get sick, it's usually because they contracted it from somebody that they were in close contact with, right? And that's what the CDC is telling us. Please stop doing that for the very reason that that transmits the virus. Now if you wanted to model that and understand how it spreads in a society, you need to understand how that society is connected. Like who knows whom, who hangs out with whom, who goes to work with whom, who is in the same room with somebody else. And that is fundamentally a graph, right? The vertices or nodes in the graph are people, and the edges, the relationships connecting them, are encounters that they've had. And then you can superimpose that graph onto the social network of who knows whom, or who works with whom. So you have very different types of edges. We call those edge labels. So you have a label between Denise and myself. We're coworkers, right? So that connects us. And then, between me and my partner, we have a cohabitating edge. So, we know that... We live together. And those edges carry different connotations, different semantics in how we're related. And in the case of a pandemic model for instance, you would then model the spread differently according to what the edge labels are. So social networks are one very prominent example. One of the earliest use cases of graph data and graph data modeling... Sociologists have been using this for many years to think about social dynamics, social networks, institutions in that way. And obviously epidemiologists are using graph theory in that regard as well to understand how viruses spread, and why they spread so much quicker in densely populated areas, in strong social networks, than in sparsely populated rural areas. That is basically a direct outcome of modeling a virus spread on social networks.
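The model Matthias describes, people as vertices and labeled edges for how they are connected, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `LabeledGraph`, the label names, and the transmission weights are hypothetical and are not taken from any epidemiological model or from any DataStax product.

```python
from collections import defaultdict

# Hypothetical per-contact transmission weights keyed by edge label.
# The numbers are illustrative only, not epidemiological estimates.
TRANSMISSION_WEIGHT = {"cohabitates_with": 0.5, "works_with": 0.1}


class LabeledGraph:
    """Undirected graph whose edges carry a label such as 'works_with'."""

    def __init__(self):
        self._adjacency = defaultdict(list)

    def add_edge(self, a, b, label):
        # Store the edge in both directions because contact is symmetric.
        self._adjacency[a].append((b, label))
        self._adjacency[b].append((a, label))

    def neighbors(self, person, label=None):
        """People connected to `person`, optionally filtered by edge label."""
        return [n for n, l in self._adjacency[person] if label is None or l == label]

    def exposure_risk(self, person, infected):
        """Chance that at least one infected contact transmits,
        treating each contact as independent (a simplifying assumption)."""
        p_escape = 1.0
        for neighbor, label in self._adjacency[person]:
            if neighbor in infected:
                p_escape *= 1.0 - TRANSMISSION_WEIGHT[label]
        return 1.0 - p_escape


graph = LabeledGraph()
graph.add_edge("Matthias", "Denise", "works_with")
graph.add_edge("Matthias", "Partner", "cohabitates_with")

coworkers = graph.neighbors("Matthias", label="works_with")
risk = graph.exposure_risk("Matthias", infected={"Denise", "Partner"})
```

The point of the sketch is that the edge label changes the arithmetic: the same two infected contacts contribute very differently depending on whether the edge is a coworker edge or a cohabitating edge.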
Ben Newton: Now. And when you actually talk about like saying an epidemiologist... Oh that's a big word. When someone's actually using this to understand virus transmission, do you find that... Are those people actually thinking like that? I mean, how are they actually interacting with this data? I mean, is an MD-PhD actually interacting with graph data and doing this, or is this something that's like a couple layers beyond them?
Denise: No. Especially in regards to this pandemic, I think this is the first question that anyone is being asked when they end up being tested positive for the coronavirus. They're immediately walking through their history of the past 14-21 days. Who they interacted with, what places they visited. And they're asking people to record very interesting scenarios of the potential spread that they could have ignited because they ended up being positive. There's some very detailed reports that you can get coming out of South Korea, where they followed the first 31 patients... I think that's the right number. But they had drawn a specific relationship graph from Patient Zero, who they infected, et cetera. And then I think it was Patient 31 if I'm remembering, that did not comply with self isolation or following their rules. And so, they ended up creating this massive community of additionally infected individuals. Community, because they went to church, they went to another hospital, et cetera. And so yes. The graph of this infection is very much at the forefront of researchers' minds. And it's the data that they collect immediately, when they get a new positive test.
Ben Newton: And I think that I've always found this to be an interesting part about, talking about this. I mean we literally all live graph data every day. Because, at least as far as math goes, it's a very realistic depiction of our world. I guess is what I'm trying to get at. But in some sense is that... When you look at those researchers, are they actually implementing it the way? Are there actually tools for those researchers to use that are mapping that all out? Or is this just something they're kind of thinking during their writing? Putting in an Excel spreadsheet. I'm assuming that. I hope not [crosstalk 00:08:00].
Matthias: I think there's a bit of both happening. I think there's a research area called Network Science. People that are trained to think in terms of graphs and networks. And those people are very familiar with the tools and technologies. But I think you're also right that there aren't very many tools out there that would allow people without a very strong technical background to implement this kind of tracing in the real world, right? We still have to build mostly custom systems. And you'd be surprised, but yes, Microsoft Excel is often used for this kind of stuff, right? People have built extensions to Microsoft Excel in order to make it map network data. Which is obviously putting together two things that don't really belong together. But because you have that need, and people are familiar with Microsoft Excel, they're... People have created network science tools that work with Microsoft Excel. And one of the things that we're trying to get the world to use, are tools that are purposefully built for graphs, right? That allow you to represent data in a graph way. So that you can more easily answer queries like, "For a particular individual, who have they been in contact with?" And then map that out to hops. And hops is a way of saying, like, extending the neighborhood of people. And doing that with pen and paper is very, very hard. Right? Like if I gave you a list of who contacted whom, and then I tell you, "For an individual, give me everybody that they contacted, and then everybody that those people contacted," you'd be doing a lot of work on paper to figure out who those people are. Because you're going back and forth tracing these individuals. And unfortunately, yes. Some people are still doing that. And I think we're now sort of at the border where the technology is becoming available, that we no longer have to do this pen and paper style or spreadsheet style. And we can use dedicated tools. But they're still very young.
They're still very much in their infancy. And there's still a lot of education that needs to happen before people feel comfortable, thinking in graph terms but also, implementing it in graph terms. Right. Like you said, a lot of mathematicians have been thinking in graph terms for many years. But, there's a disconnect between that, and the people that build systems that are engineers, that are doctors, that are scientists. Those people still don't necessarily know the tools. Are not familiar with them or not educated on them and how to use them.
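The "hops" query Matthias describes, everybody an individual contacted and then everybody those people contacted, is a bounded breadth-first search. The sketch below assumes the input is a plain list of contact pairs; the function name `within_hops` and the sample names are hypothetical.

```python
from collections import deque


def within_hops(contacts, start, max_hops):
    """Return {person: hop_count} for everyone reachable from `start`
    in at most `max_hops` contact steps (breadth-first search)."""
    # Build an undirected adjacency map from the list of contact pairs.
    adjacency = {}
    for a, b in contacts:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if hops[person] == max_hops:
            continue  # do not expand beyond the requested neighborhood
        for other in adjacency.get(person, ()):
            if other not in hops:  # first visit is the shortest hop count
                hops[other] = hops[person] + 1
                queue.append(other)

    del hops[start]  # the starting person is not their own contact
    return hops


# Hypothetical contact list: who was in contact with whom.
contacts = [("Ana", "Ben"), ("Ben", "Cy"), ("Cy", "Dee"), ("Ana", "Eli")]
reached = within_hops(contacts, "Ana", max_hops=2)
```

With two hops from Ana, the search reaches Ben and Eli directly and Cy through Ben, while Dee sits three hops away and is left out. This is the back-and-forth tracing that is tedious on paper or in a spreadsheet.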
Ben Newton: Are you guys basically finding that the interest in graph theory and in graph databases has actually risen right now? Or is it... Are you seeing increased interaction with what's going on?
Denise: Yeah, absolutely. And the era of using tools and graph tools to map out your graph data is here. It's been trending upwards recently. And actually Matthias you shared with me some very interesting statistics this week about the popularity of searching for graph databases. Yes ?
Matthias: Yeah. And when you look at... I mean, when you do some keyword analysis, Google Trends analysis, you can see that they're actually coming up and one of the most popular... We are looking specifically at graph database technology. And there's been quite a surge in the trends over the last couple of months, but also the last couple of years, where people are understanding more and more how deeply interconnected everything is. And I think a pandemic is a very, very vivid way of demonstrating that to the world, right? The fact that a disease can spread globally, I think pretty much every country is affected at this point, right, is a very vivid demonstration of how interconnected we are as a human species. The same thing applies to our supply chain networks. And yes, we have also found that out, kind of the hard way, right? With this pandemic. Seeing how closing a factory in China immediately affects what you can order on Amazon a couple of weeks later. And understanding those relationships... There's multiple relationships in between, obviously, you getting your package on your front door, and something being produced in a factory in China. But those are complex supply chain networks that we rely on as a human species to get the goods that we need to get. And that's just a couple of examples. If you look at financial transaction networks, if you look at just the social networks that we have come to appreciate with Twitter, Facebook, et cetera. Networks are all around us. And I think this time right now is a very vivid demonstration of how important they are. But I think people have been understanding this for the last couple of years, and are looking for tools to make it easier for them to work with them, analyze them, and use them to their advantage.
Ben Newton: One question I would have on that... When I've had discussions about, Machine Learning and Artificial Intelligence and, kind of the things in that area, there's kind of a transition going on in that space now where the technology to some sense has been out there for a while. But it's transitioning from something that researchers do, to something that's actually getting integrated into real business processes. And actually, I even interviewed a couple of guys a couple of weeks ago that, were talking about exactly that. In terms of graph theory and graph database implementations, where are they in terms of that journey ? Are we kind of moving from that research phase into something to where it's going to be more widely deployed or is that already happening ? And this is just a matter of it growing out in different verticals and different use cases ?
Denise: Yeah. That's a really great question. And Matthias and I have been working either at DataStax or together within the broader graph community for about a decade now. And the way that we see that, is that we kind of have two main forks in where graph use is being more widely adopted by companies and people around the world. On one side, we have the common templates of patterns and how people are using graph data, and then on the other side, we see an uptake in how people are wanting to use graphs to more deeply research new problems. And for the first side... Matthias and I spent the past year and a half, two years, working together to extract the common templates and the common patterns of what people are doing with graph data in production applications. And that's what we wrote about. You kind of have already heard Matthias talking about one of them. That the most popular way that people are trying to understand and use graph data in a production application, is to go through that neighborhood approach. So for this person, tell me everything related to them within my data. And then I want to know two layers out or two neighborhoods out. Tell me everything related to that. And that idea of doing neighborhood expansion is the most popular pattern that has emerged over the past 10 years as the common template where companies are getting started. And there's other ways that we have found companies that are wanting to use graph data. A really common one is what we did at the start. When we were discussing how we know each other in this podcast. What's the path between me to you? We talk with each other all the time in that sense. Like help me get connected with how I know you. You use LinkedIn that way. Anytime you search for someone on LinkedIn, you see that badge that says you're their second or third connection. And then you kind of drill in and you want to figure out who you know in common.
It's the path, it's understanding your connected relationships that get you from one person to another. And then the other really most popular way, that we all use probably every day. And even more so now is, there's movie recommendations that you get on your favorite streaming platform. Probably Netflix. But the idea of kind of looking at the analysis of what movies your friends watched, and using that to generate a recommendation back to you, is just at the center of how we use our digital content. Every one of those recommendation panes, are using different algorithms called Collaborative Filtering. That serve up new recommendations. So, to your question then, we see common templates that companies have emerged through creating over the past 10 years. That's one way the graph technology has been more widely adopted in starting to solidify within industry uses. And then the other way, is just more in ways that people are finding creative solutions to different complex problems.
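The two other patterns Denise names, finding the path that connects two people and recommending what your friends watched, can both be sketched on a small adjacency map. The data below is hypothetical, and `recommend` is a simple friend-count heuristic in the spirit of collaborative filtering, not the algorithm any streaming service is known to use.

```python
from collections import Counter, deque


def shortest_path(knows, source, target):
    """Return one shortest chain of people from source to target, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        if person == target:
            # Walk the parent links back to the source, then reverse.
            path = []
            while person is not None:
                path.append(person)
                person = parents[person]
            return path[::-1]
        for other in knows.get(person, ()):
            if other not in parents:
                parents[other] = person
                queue.append(other)
    return None


def recommend(knows, watched, person):
    """Rank movies the person has not seen by how many direct friends watched them."""
    seen = watched.get(person, set())
    counts = Counter()
    for friend in knows.get(person, ()):
        for movie in watched.get(friend, set()):
            if movie not in seen:
                counts[movie] += 1
    # Most-watched first; ties broken alphabetically for a stable order.
    return [m for m, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


# Hypothetical "who knows whom" data, echoing the colleague-and-brother chain.
knows = {
    "Denise": ["Colleague"],
    "Colleague": ["Denise", "Brother"],
    "Brother": ["Colleague", "Ben"],
    "Ben": ["Brother"],
}
# Hypothetical viewing history with placeholder titles.
watched = {
    "Colleague": {"Movie A"},
    "Denise": {"Movie A", "Movie B", "Movie C"},
    "Brother": {"Movie B"},
}

path = shortest_path(knows, "Denise", "Ben")
suggestions = recommend(knows, watched, "Colleague")
```

The length of `path` minus one is the "second or third connection" badge: here Ben is Denise's third-degree connection. The recommendation is a two-step traversal, from a person to their friends and from those friends to movies.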
Ben Newton: Well, now you're talking about Netflix recommendations. I'm more engaged. So, it's... No. But in all seriousness, one thing you remind me of Denise is that... I don't know if you guys remember. But in the early days of Netflix... I don't know why I did this. This is probably an unfortunate admission. But I would spend significant amounts of time, let's say that, going through and rating movies. Because you used to be able to do that. And they would give you this giant page or something. And I think I rated hundreds of movies. And then they made this transition at Netflix where they went from ratings to basically a thumbs up and whatever you really watch. And I remember when I read some things about it, it was because they made it... And I'm sure I'm oversimplifying it, but they made a transition from what you say you want, to what you actually do. Because I remember early in the days when I was illegally downloading music, to be clear. And I would rate-
Matthias: Did you just admit to that ?
Ben Newton: No, it's illegal. But I would rate the music and I would rate all this music really high. And I was like, well because it's supposed to be good. But I'm like, "I never listen to it." So, there was this kind of thing about analyzing what you actually do, versus what you say you want to do. Right. So, with that long introduction, it sounds like... Is that partly what graph theory and graph data can actually help with? Because it sounds like that's how you would actually map out what I'm talking about. About what you actually do and what your friends do. That's how you would map that out with graph data as opposed to these kind of traditional rating diagrams and stuff like that, or if that's the right way to say it.
Matthias: Yeah. I think one of the benefits of graph is that you can incorporate very many different types of data into one graph. We see a lot of companies, a lot of individuals apply graph in those cases where you have very different sources. Like the ratings that you were talking about. Like you would rate a movie four out of five stars, let's say. But then, there's also an edge between you and the movie, and you watched the movie, right? And those edges are timestamped. So you might have multiple edges between you and watching the movie. And that is another data point that is being considered by the algorithm, that says, "Well he only rated this movie two out of five, because he felt like his friends were judging him if he liked The Princess Bride or something." Right? But you're watching it-
Ben Newton: That's an amazing movie to be clear. Just to be clear. And you can go [crosstalk 00:18:41].
Matthias: [crosstalk] you're watching it every Friday. So obviously there's a disconnect between your rating and how often you watch it. And we can incorporate all that data into the graph. So we get a fuller picture. But then we can also look at how often are you stopping the movie? When are you watching it, and how often? Are you recommending it? What other modes of interaction do you have with that movie, that can all form one big connected graph of data, that can then be used in order to inform the recommendation system. And that's one of the areas that graph is really powerful in. It allows you to evolve your data model, the kind of data that you're capturing, very quickly to accommodate all these various signals that you're getting. And then to experiment with those signals and see which ones work, which ones don't work. And then build the model to be the best that it can be.
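The rating-versus-behavior gap Matthias describes falls out naturally when a user and a movie can share several edges of different kinds. The sketch below stores each edge as a tuple with a label and a property dictionary; the edge labels, property names, thresholds, and sample data are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Each edge: (user, label, movie, properties). One user-movie pair may have
# many edges: a single "rated" edge plus one timestamped "watched" edge per viewing.
edges = [
    ("viewer", "rated", "The Princess Bride", {"stars": 2}),
    ("viewer", "watched", "The Princess Bride", {"on": date(2020, 4, 3)}),
    ("viewer", "watched", "The Princess Bride", {"on": date(2020, 4, 10)}),
    ("viewer", "watched", "The Princess Bride", {"on": date(2020, 4, 17)}),
    ("viewer", "rated", "Serious Drama", {"stars": 5}),
]


def summarize(edges, user):
    """Collapse a user's edges into {movie: {'stars': ..., 'watch_count': ...}}."""
    summary = defaultdict(lambda: {"stars": None, "watch_count": 0})
    for source, label, movie, props in edges:
        if source != user:
            continue
        if label == "rated":
            summary[movie]["stars"] = props["stars"]
        elif label == "watched":
            summary[movie]["watch_count"] += 1
    return dict(summary)


def rating_behavior_gaps(summary, max_stars=2, min_watches=3):
    """Movies rated low but watched often: stated taste disagrees with behavior."""
    return sorted(
        movie
        for movie, stats in summary.items()
        if stats["stars"] is not None
        and stats["stars"] <= max_stars
        and stats["watch_count"] >= min_watches
    )


summary = summarize(edges, "viewer")
gaps = rating_behavior_gaps(summary)
```

Adding a new signal, such as a "paused" or "recommended" edge, only means appending tuples with a new label. That is the quick data-model evolution the graph approach is meant to allow.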
Ben Newton: Well that's really fascinating. I hadn't thought about it that way. Back to what I was saying with AI and things like that, is there kind of an intersection there ? Because it seems like what you're describing is a more... It's something more like what you would expect when an Artificial Intelligence algorithm is actually making connections between things and understanding the relationships between different ideas. Is there kind of starting to be an overlap there between those two disciplines.
Denise: I definitely see a massive overlap between those two disciplines there. There's a long list of really fascinating and creative ways to use the connectedness in your data to generate new recommendations, or I guess new clusters, if you want to kind of go into more of a machine learning, AI perspective. Because when you look at the traditional approaches that have made it into production applications with machine learning, we're looking at very flat sets of features. Where we've got 10 things that are really important and you run them through. Let's be honest, y = mx + b. I mean, at the end of the day, almost everything in production that's machine learning, it's just linear prediction. But that's fine. But what's really interesting is, when you're able to add one more dimension to that, and that would be the connections between your features, the connections between your pieces of data, you can do more creative things. And actually Ben, you probably use one of these almost every day. When you go to Google, and you do a basic search. Behind the scenes, one of the ways that all of those search results are returned according to the relevancy, is something called PageRank. Where the links that exist from one page to another help determine what's the most popular content that is both authentic, real, not spammy, and things like that. That's in the background, just one way that Google is determining your search results. It's all those links. This page links to that page, which links to this page, et cetera. And it's an unsupervised approach. PageRank is kind of the ML and AI part of what we're talking about. But there are dozens of other really fascinating, creative algorithms that you can dive into in that space to start to see how connected data provides more unique, interesting ways to look at ML algorithms than if you were looking at the flat feature tables.
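The link-following idea behind PageRank can be illustrated with a short power iteration. This is a textbook-style sketch on a hypothetical four-page web, not a description of how Google's production ranking works; the damping value of 0.85 is a commonly used default and the fixed iteration count is an assumption made for simplicity.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link structure: a, b, and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
```

No labels or training examples are involved, which is why Denise calls it an unsupervised approach: page `c` ends up on top purely because three pages link to it, and `a` outranks `b` and `d` because it is linked from the highly ranked `c`.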
Ben Newton: Yeah. That makes a lot of sense. Because, kind of what I'm hearing you guys saying is that, I think that's always one of the big problems in data in general. Is like the format of the data itself. It's like you can have really amazing data, but if it's not put in a format that's consumable and it's not put in a... And you don't have a way of representing those things, you add all this extra friction into finding out what you need to find out. And so, in some sense that's how those things always really go together. So it makes a lot of sense that, really using graph data would actually unlock a lot of possibilities for you [ crosstalk 00:22:30].
Matthias: It's that old joke, right? Where it's like, when somebody tells you they want AI and machine learning, what they really mean is they want clean data and a regression model. I think that's what they actually mean, in terms of translating that. And I think there's a lot of truth to that. We see that oftentimes you can go a long way if your data is clean, is solid, is well structured, and you put a nice regression model on it. You can get very, very highly predictive models. That said, there's obviously been a huge rise in unstructured machine learning and AI. With TensorFlow and such, we have been able to make phenomenal leaps in terms of being able to do object recognition in pictures, to do object classification, to do text analysis. All those unstructured sources of data are now much more analyzable by machines. I think we're going to see a similar push, and this is looking into the future. We're now seeing how these deep learning models can be applied to structured sources of data. And I think it's going to be really, really interesting to see the two converge. Where you have these deeply structured sources of data, and then you have these very, very, very deep machine learning models that you can superimpose one on the other. And there's actually an area within machine learning called Relational Structural Learning, that explores exactly that intersection. And it's actually Structural Relational Learning. I got it the wrong way. But that's been an area that people have been doing research in for decades now. And it's finally coming to the point where we're seeing the huge benefits that we can deliver, if we look at structured data in combination with machine learning. Whereas, I think in the past we've kind of thought about them as sort of separate things, right? You use these deep learning models on mostly unstructured data, and then you use more standard, like vector based, models on structured data.
And I think the combination of the two, can be really, really powerful. And it's something that we'll hopefully see in the next couple of years.
Ben Newton: [ inaudible 00:24:29], that sounds fascinating. To be honest, Matthias, I was really trusting everything you said until you got Structural Relational Learning wrong. Then you lost me.
Matthias: Yeah. My PhD Advisor would be so upset with me right now. That's why I had to correct that real quick. Otherwise, woo!
Ben Newton: Yeah. We'll make sure to warn people before they listen.
Ben Newton: But no, that actually is really, really, really fascinating. I think this is obviously an area where you could just... There's layers upon layers of things going on here. So I guess you, take it up one level, as we kind of put a bow on this. You talked a little bit about this Matthias, but from the both you, I mean, what do you see is the future here ? Where are things going ? You've written this great book. So, now I'm guessing there's nothing else to do or is there ? I mean, what's next ?
Matthias: Yeah. Let's go home people. We're done here.
Denise: I mean, oh gosh, that's a hard one to zoom back out, and really look at. And I know Matthias is really, really great for providing a little bit longer, forward thinking. I'll stay a little bit more medium term thinking with my response, Ben, to that. So, personally... Again, this is a lot of what we wrote about and I mentioned it already, but I think that, in the next few years, more short term, what we're really going to see come together is a wider adoption of the common ways that people want to use graph data. And I say that because there's a very established way that you're going to use SQL technology, or any type of relational database. There's really common templatable patterns on how people deploy that technology and use it to structure and organize their data, and then to query it and show a user what to do with that shape of data. And when I say that shape of data, I'm thinking more like rows and columns. Like people know what to do with that stuff. I think in the next 2-3 years, we're going to see a much larger adoption of what people learn to do with relationships in data. And we have some great examples of how we interact with that already. It's what we do already with LinkedIn. It's what we do all day on Netflix. We've got those more innovative applications already using it. So in the next few years, I'm seeing that more apps are going to be using paths, using connections in data, as a way to contextualize and personalize how you use your app. But Matthias, I'd be really curious what you see beyond that.
Matthias: I think you're hitting the nail on the head here by saying, one of the first things we need to do in this area is make it more consumable by developers, by engineers. Right. Now, Ben, to your earlier question, right now, it is very much a sort of a scientific thing. Lots of scientists are using it. In particular data scientists are familiar with graph technology, and are applying it. But it hasn't really reached the mainstream, much like AI now has, right? But that's not because everybody knows how to build seven layer deep neural networks and all that sort of stuff. But because it was abstracted to the point where I can now say, "Hey, I have this corpus of data. I'm just going to shove it at this API, and it's going to do some feature learning and give me back a model that I can deploy, and I don't have to know all the intricate details of the system." And we're not quite there yet with graph. And you can see this in the book that we wrote. There's still a lot of things you need to know in order to not shoot yourself in the foot. A lot of things you need to know in order to be successful and we need to reduce that barrier. We need to make it easier for people to say, "Okay, I get it. I understand LinkedIn. I understand how viruses are spread, and I understand how that kind of spreading behavior can be..." You can model how ideas propagate exactly the same way as how a virus propagates. So if you want to bring a new product to market, you should understand how that works. Great. But we don't have the simple tools yet, where an engineer can say, "Okay, I'm just going to grab this thing off the shelf, and I'm going to model how our new product propagates in a social network." That does not exist yet. They would have to embark on a journey to learn this technology. And that's something we need to do as the first step. It's to lower the barrier. And then I think we're going to see phenomenal applications of this technology.
And it's going to enable us to do things that are beyond our wildest dreams. To the point where at some point we can be able to really understand how supply chain networks work. Really understand how financial systems work. Which we currently don't really. I mean look at the 2008 financial crisis. Nobody knew what was happening as our financial system crumbled around us. And we were scrambling to understand what the dependencies were and which bank had what on its balance sheet. We need to get to a situation... If we want to understand and model human behavior with a certain level of resilience, we need to get to that level of understanding. And the only way we can do that is to enable a lot of people to use these kinds of tools successfully.
Ben Newton: Yeah. Well that sounds good. Good job. You came back.
Matthias: Let's do it.
Ben Newton: Let's do it. Okay. Well, I trust in you again. Thank you Matthias. But this is... As I said, it would be... It's been fun to interview you guys. You make Graph Theory fun.
Matthias: [crosstalk] Thank you.
Ben Newton: But I think this is a really interesting area. And like I said, it's one of those fascinating, Mathematical Computer Science areas that we live every day, but we don't necessarily talk about it that way. So it's always been fascinating to me. So I appreciate you guys taking the time to come on.
Denise: Yeah. And thank you for having us.
Matthias: Yeah. Thanks so much for having us.
Ben Newton: Absolutely. Absolutely. And stay safe. And we'll stay in touch. And thanks everybody for listening to another episode of Masters of Data. Take care.
Speaker 4: Masters of Data is brought to you by Sumo Logic. Sumo Logic is a cloud native machine data analytics platform, delivering real time continuous intelligence as a service, to build, run and secure modern applications. Sumo Logic empowers the people who power modern business. For more information, go to sumologic.com. For more on Masters of Data, go to mastersofdata.com and subscribe. And spread the word by rating us on iTunes or your favorite podcast app.