Digital Twins and Real-time Analytics (Guest: Simon Crosby)

This is a podcast episode titled, Digital Twins and Real-time Analytics (Guest: Simon Crosby). The summary for this episode is: The rate of data growth and complexity is only accelerating. How do you decide what data is important? What data do you store for later? Simon Crosby, CTO at Swim.ai, discusses how real-time streaming analytics provides a better way to think about massive volumes of data from sources like traffic infrastructure and manufacturing. He discusses how you can use this data to build a model of your world - digital twins - and use that model to make better decisions.
How do digital twins work?
01:04 MIN
How Swim creates and uses real-time analytics
01:08 MIN
Computing data on the fly
02:32 MIN
Real-world application of digital twins
01:31 MIN
The evolution of digital twins to data models
01:50 MIN
Growing trend of streaming data processing
01:26 MIN

Simon: So these models are constructive. Data built a model, and the model trains itself. It trains itself on its own data and that's totally awesome.

Ben: Welcome to the Masters of Data Podcast, the podcast that brings the human to data. And I'm your host Ben Newton. Welcome everybody to another episode of the Masters of Data podcast. We are still recording these in our shelter at home, so I think considering we're doing this stuff over the internet, maybe it really doesn't matter, but it's fun to be able to keep doing this and bringing these episodes to you no matter what's happening outside our door. As part of that, I'm really excited to have our guest today, Simon Crosby. He's the CTO of Swim.ai. Welcome Simon. It's nice to have you on.

Simon: Hey, thanks for having me. It's good to be with you.

Ben: Absolutely. I think we're going to be talking about some, just like we've done in the last few episodes, we're going to be talking about some really interesting things about data, but in particular we're going to be talking about in the context of what's going on right now. But before we get into that at all, Simon, tell us a little bit about yourself. What's your background? So how'd you end up at Swim and what's your story?

Simon: Sure. So I started as an academic teaching computer science at Cambridge University. Then I started a company called XenSource, which was a hypervisor company and helped with the cloud. Then I was CTO at a company called Bromium, which took virtualization even further. And then Swim.ai. So I'm a computer scientist and I like doing things which are at the bleeding edge of tech.

Ben: Yeah, it sounds like it from some of those companies. Tell us a little bit more about Swim.ai, where you are right now. I mean, why did the company come about? What kind of problems are you guys trying to solve for your customers?

Simon: Nope. Everyone already thinks that it's all about data and in fact it isn't. When we compute, we compute on state, we don't compute on data. And so there's this huge disconnect in the industry right now, which is that as we move towards wanting to process in real time on data emitted from large numbers of things, mobile devices, users, whatever it happens to be, we tend to stick all this data on the hard disc someplace and then compute later, based on this rather silly assumption that that's a finite dataset and that we're done, right? What's on the disc is enough. And that's just patent nonsense really. You know, what we have to do is get out of this notion of store then analyze into something much more akin to continually process data, continue to analyze and react. Then, if you really care, store certain original data. But let's get real. Discs are a million times slower than CPUs. So if you want to deal with vast amounts of data, you have to get rid of this notion of storing and then analyzing. So Swim is fundamentally about that.

Ben: And you know, what's kind of the fundamental problem that led you to that? Because, you know, in particular I've been really interested in the past about what happened with big data and why that maybe wasn't successful and where there were a lot of unsuccessful implementations. Was it [crosstalk 00:03:55]

Simon: What's your view on that by the way? Why was it unsuccessful?

Ben: Well that's why I find it really interesting to see what you're saying, because I remember when I first started talking to companies about this, they would talk about it and they would say, Hey, we've got this big data implementation. And I would ask them okay, so how are you processing? How do you get results? And it's like, Oh well we get results in a couple of days. I'm like, how is that useful? I find that really interesting how you phrased that because probably what I'm poking at is that there's a sense that in a lot of what the companies are having to do today and the kind of problems they're trying to solve, that kind of timescale of making decisions and getting results just isn't good enough anymore. Yeah.

Simon: You're absolutely right. But also there's this very interesting account. We're very early in this stage, but think about it this way. In learning and prediction, is it better to assemble a vast amount of data and then try and have a model that you prebuilt? Or do you have a view of the world which is very similar to you as a human, which is that you learn and predict based on what you do every day? So if I gave you a blueberry muffin, you would know whether you like blueberry muffins based on all the experiments you yourself have run in the past. Okay? So a digital twin of you ought to know that you like blueberry muffins or don't. But you don't phone your mom and say, Hey mom, do I like blueberry muffins? Right? So the key thing here is that if we move to a model which is continually informed by events from the edge, then digital twins, which are representative of edge things, stupid things or just things that can just tell us their current status, digital twins of those ought to be able to observe and learn on the fly. Which is totally trippy. The cool thing is that if you get that right, then you get out of this horrible problem of having to build models beforehand and train them and then push them to the edge. Okay? And there are massive challenges with that. So you get into this notion that things learn based on what they can see and their own data and the things around them, and they form theories that get pretty good in many cases.

Ben: Is that at all connected, because the way you described that sounds to me like a lot of the things that discussions I've had around artificial intelligence and machine learning, because they try to kind of add it to the model. So is that partly driven by that kind of thinking that you want to build these models which you then can use to make predictions as you go forward?

Simon: Absolutely. And I'll give you an example of that. I can start now, but we can go into detail. We're deployed in probably 20 or so U.S. cities where we do real time traffic prediction. Now we're dealing with vast amounts of data. So for example, Las Vegas, 64 terabytes a day, you couldn't afford the hard disc, right? And each of 3000 odd digital twins of intersections in our world are learning from their own data and their surroundings and predicting continuously their future state for two minutes ahead. And those predictions get sold by an API, streaming API, to vendors like Uber, Lyft and FedEx and whatever else. Okay. And so the ability to continually process data and continually predict, given what you're seeing right now is absolutely vital to this next generation of applications in which you always have to have an answer. Okay? Right now you have to have an answer, and the answer from the last batch run just won't cut it.

Ben: So let me ask you a question in that regard then, because it seems like partly what you're balancing there is the need to have near instantaneous interaction with these digital twins or these models. But in some sense by doing that streaming, you're also deciding to give up the ability to go back and ask different questions of the data. So if, for whatever reason, the way you processed that data turned out to be... you wish you had done it in a different way, you're going to give that up and potentially say, okay, we're going to maybe adjust it as we go along because it's more important to have quick answers than it is to have perfect answers. Does that sound right?

Simon: So I'll give you a funny story. I mean there was an engineer I dealt with at a large manufacturing company who had 40 large compressors, right? For every degree of rotation of every shaft, they get 78 data points.

Ben: Wow.

Simon: Okay. These things go 365 days a year at 2000 RPM. Okay. And these guys thought they had to keep all their data just in case they produced the wrong analytic. Okay. Good luck with that. So [they started] a MongoDB project, and the last time I talked to them, they were still buying hard discs. So in many cases when we deal with the real world, knowing the past state isn't that relevant. There are some cases when it's really important, in which case, yes, you want to store the relevant things. Often what I see is that the relevant things are not raw data. Let me just be clear. Let's go back to traffic. The stuff we get is ghastly. Voltage changes between relays on ancient bits of traffic infrastructure. What you really want to know is whether the thing was red or yellow or green, right? Which is tiny. So there are often much more efficient ways of storing state and time, or even insights into state and time. Like, on average, people waited this long in the month of January in Palo Alto. Which is fine. That's a source of information that is of durable value to you. That's what you store. The key thing is that often we don't need to store all the raw data because that can just be a huge amount of useless stuff. People don't know when to throw it away and it's extremely expensive. It's either expensive in terms of effort to store locally and back it up and everything else. Or you've got to get it over a wire into a cloud and then once it's there, you're there forever. That's, by the way, part of the cloud guys' goal. Just give me the data and then I have you forever.

Ben: Right. The way you described that too, Simon, is interesting because I think that kind of balance that you're talking about versus storing it just in case or I don't know, it's almost like a FOMO thing, fear of missing out. It's like I want to keep the data just in case, but as you're saying, particularly in some of these real world applications, the data just gets so massive that to do that with everything is just not practical. So you're balancing those things off. So I guess the question when you're thinking about this is basically, are you informing what you keep and how you build these models based on what you want to accomplish? Right? So when you described that traffic example, which is a great example, it's that, okay, well, if what I really want to know is how long people wait at red lights and some traffic measurement of that, and if I need to change the sequence at certain times of the day to update that, or I want to update navigational maps or places like Uber, that's going to inform what you keep. So in some sense you're choosing what to keep based on what you're trying to accomplish as opposed to what's potentially possible. Does that [crosstalk]

Simon: Yes. Or there's a much more efficient way to store a digested version of the data. So for example, if you just look at the state of the infrastructure in Las Vegas, you go from 60 terabytes to a gigabyte a day, which is fine.

Ben: Yeah.

Simon: Okay. And that's just the current state of every light as it changes, right? And every car going over every loop. That's all very compact and can be efficiently represented. Not that original raw data, which is just ghastly.
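The reduction Simon describes, from a firehose of raw readings to "the current state of every light as it changes", can be sketched in a few lines. This is only an illustration, not Swim's code: the sample format `(timestamp, light_id, color)` and the function name `state_changes` are assumptions made for the example, and real inputs would be relay voltages rather than ready-made colors.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical sample format: (timestamp_seconds, light_id, color).
Sample = Tuple[float, str, str]


def state_changes(samples: Iterable[Sample]) -> Iterator[Sample]:
    """Yield a sample only when a light's color differs from its last known color."""
    last_color: Dict[str, str] = {}  # light_id -> most recent color seen
    for ts, light_id, color in samples:
        if last_color.get(light_id) != color:
            last_color[light_id] = color
            yield (ts, light_id, color)


# Eight raw readings for two lights, most of them repeats of the previous state.
raw = [
    (0.0, "A", "red"), (0.1, "A", "red"), (0.2, "A", "red"),
    (0.3, "A", "green"), (0.4, "A", "green"), (0.5, "A", "yellow"),
    (0.0, "B", "green"), (0.6, "B", "green"),
]
compact = list(state_changes(raw))
print(len(raw), "raw samples ->", len(compact), "state changes")
```

Only the transitions are kept, so the stored volume scales with how often a light changes rather than with how often it is sampled.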

Ben: No, that makes a lot of sense. Well, I guess getting to potentially other examples because this is really fascinating and I'm sure there's lots of different ways to approach this. So you've talked about traffic, talked about [inaudible 00:13:22] manufacturing processes. What other places do you see where this kind of approach is really bearing fruit?

Simon: Yeah. So I want to give you a couple of big examples which are not analytics based. Okay. So we're deployed in cities where these digital twins of intersections are predicting for themselves what their futures look like two minutes ahead. And then publishing, streaming those predictions. It's the predictions which are of value, and they're low rate, and they can be pushed directly into Uber's [inaudible] or whatever. And what you can see there is a digestion of the state, or the evolving state, of fixed assets into predictions. But here's another cool one. We do a bunch of smart [inaudible] stuff in Dubai, and let me give an example. If a truck with bad cornering or braking behavior enters into a geofence of a hundred yards around an inspector, tell the inspector so you can pull the driver over. Now, then what you have is a notion of mobile things. And so you can't necessarily be on the network where the data is coming from, right? So how does the data from the GPS tracker in the truck arrive? It arrives over a thing called the internet. So you can't assume that you're on the network and you get to reduce the data in flight, right? So you're getting raw data somewhere over the internet. Where are you going to run this thing? In the cloud, right? Because otherwise where? You don't know where. And so there is an aspect for mobile things where you can't necessarily be on the same network. But there is another interesting aspect here, which is that these applications which involve streaming data are necessarily extraordinarily granular and local. They always end up local, people want local insights. What is going to happen right here, right now? Okay. It's not like computing the averages or distributions and stuff, the things that the analytics stacks do on your data, okay, which you can do easily in batch mode. It's like, what's about to happen around me? So what we're finding is that the applications are extremely granular and what we end up doing is building graphs on the fly. 
So you've heard of graph databases and you probably dug into that extensively. What we ended up doing is building graphs of digital twins on the fly. So this truck is near. So it falls into a geofence, and the geofence is on a digital twin. It falls in a geofence of this inspector, link the truck and the inspector so the inspector can see the details of the truck. Its number plate and so on, its license plate. Let the application then drive the inspector to go inspect the truck. So the graphs that we build are in memory, in real time and they're built from data. Okay. So data is continually happening. These digital twins are receiving their own data in parallel. They're all concurrent, effectively objects, and they link to other objects in their neighborhood based on constraints that you specify. So this contextual linking is a bit like pub/sub, but what it allows is that two things linked in this graph can see each other's state in real time, in memory, and compute on that. Okay. And they're all computing in parallel. Okay. So every digital twin of everything is concurrently processing its own data and forming links and computing on all of those links and coming up with observations and theories and whatever else. Okay. So what we're doing is building a big concurrent digital twin set from real world data and then digital twins go off and behave effectively, form links and compute inferences as a result of that. And then tell you.
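As a rough illustration of the contextual linking described here, the sketch below links a truck twin and an inspector twin whenever the truck is inside the inspector's hundred-yard geofence, and unlinks them when it leaves. Everything in it is hypothetical: the `Twin` and `Geofence` classes, the flat x/y grid in meters (real GPS data would need a geodesic distance), and the single-threaded update loop, which stands in for twins that would actually run concurrently.

```python
import math
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Twin:
    """In-memory stand-in for a real-world thing (truck, inspector, ...)."""
    twin_id: str
    x: float = 0.0  # position in meters on a flat local grid (an assumption)
    y: float = 0.0
    links: Set[str] = field(default_factory=set)  # ids of currently linked twins


@dataclass
class Geofence:
    """Circular fence centered on its owner; 100 yards is about 91.44 meters."""
    owner: Twin
    radius_m: float = 91.44

    def update(self, other: Twin) -> bool:
        """Link or unlink `other` depending on whether it is inside the fence."""
        distance = math.hypot(other.x - self.owner.x, other.y - self.owner.y)
        inside = distance <= self.radius_m
        if inside:
            self.owner.links.add(other.twin_id)
            other.links.add(self.owner.twin_id)
        else:
            self.owner.links.discard(other.twin_id)
            other.links.discard(self.owner.twin_id)
        return inside


inspector = Twin("inspector-1", x=0.0, y=0.0)
truck = Twin("truck-42", x=500.0, y=0.0)
fence = Geofence(owner=inspector)

far = fence.update(truck)        # truck is 500 m away: no link
truck.x, truck.y = 50.0, 20.0    # a new GPS fix arrives
near = fence.update(truck)       # about 54 m away: inside the fence, so link
print(far, near, inspector.links)
```

Once linked, the inspector twin can look up the truck twin by id and read its state (plate, driving behavior), which is the point of building the graph from data rather than declaring it up front.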

Ben: When you're saying with the digital twins here, to make sure I understand too... so the digital twins, are they... I'm trying to think about the right way to ask this. Is it only the observed or also the observer? So you're talking about two potential types of actors here. You've got the inspector and you've got the people dealing with the trucks. So the digital twins are the trucks, but are you also mapping out the people that are observing or-

Simon: Yeah, so for everything in the environment, so we have some application use case, right? So the truck guys say, Hey do this for us. Actually, a partner of ours does this. So there is a digital twin of the inspector, which includes where they are, the GPS location from their mobile device and so on. So we know where they are. We probably know a bunch of things about that person and so on. Then there's a digital twin for every truck. Then there's a map and we know where they are. Then there's a digital twin for the geofence, and the geofence is continually computing and linking to everything that falls within its geofence. So for the truck we know its GPS coordinates as well as its license plate, a bunch of other stuff. And so the key thing I want to get across here is that there could be hundreds of these geofences operating. They're all computing in parallel in absolute concurrency, right? And so whenever a truck enters into any one of them or whatever, every one of those is reacting on the fly, right? And so they're continually computing based on data that they observe. So a truck continually publishes its location and the inspector too, and the geofence then says, aha, you're within me and you're within me. As the inspector and the truck, and notifies the inspector. Literally to go and talk to the truck. The key point here is that all of this is concurrent. So this is a model of the real world. These digital twins are effectively modeling the way we want the real world to be, the way we want to predict about the real world and analyze about the real world. Now there is a real challenge here, which is how do you create these models. Let me give you two worlds. If you believe Google, Microsoft, and Amazon, you have enough money to hire a data scientist who's going to go off and build you a big model of the world and train it using data in the cloud and then maybe push it to the edge. Okay. Let me tell you why that doesn't work. First, you don't have enough money. 
You don't have enough money to hire that person. Second, you've got to get the data to the cloud. Third, they've got to build a model and train it, and there are all sorts of problems about whether or not the data is effective. Underfitting, overfitting, and so on. Then you've got to push it to the edge and manage it through its life cycle. So let me simplify that and say it's complete bullshit. It's just never going to happen. Okay? It's just absolutely impossible. But what's totally possible is to let a model build itself. So here's what we do. Let me go back to my traffic example. Happy to go anywhere else. Data shows up for a thing. For every thing, that is, every source that sent data, if there is not a digital twin of it already in memory, create one. Let it run concurrently and then from then on it will acquire its own raw data and statefully evolve. So it's a stateful concurrent object in memory. Okay? And so all these digital twins acquire their own raw data and statefully evolve concurrently. So at all points in time, these digital twins are a mirror of the real world and they aim to be absolutely in real time. But that's not enough. What you really want is a graph, which is how they're related. Okay? So intersections own about 80 sensors maybe. Okay. And intersections are near other intersections. Okay. And cars going over loops are sending events into an intersection saying, I'm here. And so this graph, which is the interrelatedness of these digital things in the real world, is reflected in real time in memory. And goodness me, it just grows. It builds itself. So the same code that runs traffic.swim.ai, which is a real time view of downtown Palo Alto, will also run and build the model on the fly of Houston and Las Vegas and Jackson Hole and wherever else. Okay? So these models are constructive. Data built a model and the model trains itself. It trains itself on its own data and that's totally awesome.
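The "create a twin when data first shows up, then let it evolve statefully" pattern can be sketched as a small registry. This is a simplified, single-threaded illustration and not Swim's implementation: the class `IntersectionTwin`, the `route` function, the source ids, and the choice of an exponentially weighted average as the self-updating "model" are all assumptions for the example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class IntersectionTwin:
    """Stateful in-memory object that updates a tiny forecast from its own data."""
    twin_id: str
    events_seen: int = 0
    predicted_count: float = 0.0  # running forecast of the next reading
    alpha: float = 0.5            # smoothing factor, chosen arbitrarily here

    def on_event(self, car_count: float) -> None:
        """Fold one new observation into the twin's state."""
        if self.events_seen == 0:
            self.predicted_count = car_count
        else:
            # Exponentially weighted update: move part of the way to the new value.
            self.predicted_count += self.alpha * (car_count - self.predicted_count)
        self.events_seen += 1


twins: Dict[str, IntersectionTwin] = {}


def route(source_id: str, car_count: float) -> IntersectionTwin:
    """Deliver an event, creating the twin first if this source is new."""
    twin = twins.get(source_id)
    if twin is None:
        twin = twins[source_id] = IntersectionTwin(source_id)
    twin.on_event(car_count)
    return twin


# No twins are declared up front; the set of twins is built by the data itself.
for source, count in [("palo-alto/1", 10), ("palo-alto/1", 20), ("vegas/7", 4)]:
    route(source, count)
print(sorted(twins), twins["palo-alto/1"].predicted_count)
```

The same code serves any city because nothing city-specific is configured in advance: a new source id simply produces a new twin. A production system would run each twin concurrently and use a far richer predictor than this moving average.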

Ben: Yeah, it's really cool. Now, if I think about the situation that we're in, I can't help but think that this sounds awfully similar to, I actually had a discussion with another couple of guests talking about graph models and we ended up talking about basically infection models and even social distancing types of models. It sounds like this would be applicable to that. Is that right?

Simon: Yes. In fact, with the thing that Google and Apple came up with, we had come up with exactly the same idea, but from a monitoring perspective, okay? It turns out that effectively Google and Apple do it on your phone. They build a digital twin of you and this digital twin remembers other things that come near to you. It does so privately and so on. And privacy is key. And making sure that it's not available to ad tech or to law enforcement unless there's a warrant or whatever else. Okay. That's key, and I think Apple and Google did a good job there, but we came up with essentially the same idea. Of course the carriers, the big mobile operators, could have done it and they didn't.

Ben: So is this something that you guys have actually... you came up with the idea, but is it actually something that's being done right now, this idea of building [crosstalk 00:25:16]?

Simon: Not that I'm aware of. And by the way, I think the APIs that Google and Apple have made available would be fun. We're going to go and play against those APIs. Their goal is to publish appropriately privatized information for each device to ensure that people will go and build apps. That'd be fun. So we're going to go to it.

Ben: No, it does sound really cool. Now you mentioned this briefly about the privacy thing and I wonder how much you have spent time thinking about this and having to deal with this because in some sense what I'm hearing you say is okay the phone companies, Apple, Google, whoever is basically building a digital twin of me and now I'm thinking that really kind of puts a point on the whole idea of owning my own information. Do I own my digital twin? How does that work out?

Simon: Right. Actually I've thought extensively about the privacy of this. There is the extreme case, which is how do you ensure there is no backdoor into a device. Okay. That nobody can ever get into the device unless there is a legal warrant, in which case you don't have a choice. So I worked on that one. I even published some stuff on it using mechanisms to effectively shard a key. So it would be something like this for every user. You back up the state of the device continually. It's encrypted in the cloud so no bad guys can get hold of it. And the key that could decrypt it is agile but is sharded amongst multiple disparate entities with different interests. So you can say that the key to decrypt it would be split between Apple and Google and Microsoft and Verizon and god knows who else. Okay? There is a process which would allow them to not combine the key ever, but nonetheless to compute on the encrypted data, and that process is an evolving branch of mathematics. You know, if you split the decryption key across distrustful parties, that's a good thing. And then you let them compute on the data if a warrant is sent digitally to each one of them, without actually combining the key into a single entity. So privacy is crucial. And I think Google and Apple have actually taken a good step here.
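To make the idea of sharding a key across parties concrete, here is a minimal sketch of n-of-n XOR secret splitting. It is a deliberately simpler technique than what Simon describes: it shows only that a key can be split so that no single party (or any incomplete group) learns anything about it, and it does require bringing all shares together to recover the key. The "compute without ever combining the key" step he mentions is a separate topic (secure multi-party computation) and is not implemented here. The function names and party list are hypothetical.

```python
import secrets
from functools import reduce
from typing import List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_key(key: bytes, parties: int) -> List[bytes]:
    """Split `key` into `parties` shares; all of them are needed to rebuild it."""
    if parties < 2:
        raise ValueError("need at least two parties")
    # The first n-1 shares are pure randomness ...
    shares = [secrets.token_bytes(len(key)) for _ in range(parties - 1)]
    # ... and the last share is the key XORed with all of them.
    last = reduce(xor_bytes, shares, key)
    return shares + [last]


def combine(shares: List[bytes]) -> bytes:
    """XOR all shares together to recover the original key."""
    return reduce(xor_bytes, shares)


device_key = secrets.token_bytes(32)  # stand-in for a device backup key
holders = ["party-A", "party-B", "party-C", "party-D"]  # hypothetical custodians
shares = split_key(device_key, len(holders))
recovered = combine(shares)
print(recovered == device_key)
```

Because every share except the last is uniformly random, any subset that is missing even one share is statistically independent of the key, which is what makes handing shares to parties with different interests useful.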

Ben: Oh that's good to hear. Well, yeah, I mean, and this is truly fascinating because with everything you're saying, it's easy to visualize things that are literally happening at this very moment and how different countries are dealing with this and how seriously they take that. I can think of what's happening in certain countries where people are being tracked once they're infected and being fined and things like that. And there's some advocacy for that. But the privacy implications are pretty enormous. So, I guess, to kind of put a bow on all this, I mean I think this is a super fascinating area obviously. And what you guys are doing is very, very interesting. Where do you think this area is going? Where are some of the things that you're seeing?

Simon: The big trend is towards streaming data processing. That is compute first, analyze and react. And then if you have to store the data that's fine, but compute and analyze and then store if you want to. If you store, you probably store something which is of durable value. Not necessarily just raw data. So there's a big trend towards that. It started out as IoT, which is a non-market. Then it became edge computing, which is another non-market. There are chips and boxes and stuff. Right? But it's really about applications and state and the ability to compute on large amounts of data. And why is it really important? It's important because 20 billion new devices show up on the internet every year. And so with that number of things arriving, this idea that a database would know the current state of the world is actually silly. I mean, why would the database know? It's easier to know by virtue of talking to the digital twin, which is continually evolving in lockstep with the thing. So streaming data processing becomes vital. There's no way you could ever store all those bits. So an ability to know and to compute on the fly is absolutely essential.
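One way to picture "compute first, then store something of durable value" is a running summary that is updated per event and never needs the raw events again. The sketch below uses Welford's online algorithm for mean and variance; the class name `RunningStats` and the wait-time figures are made up for illustration, echoing the earlier "on average, people waited this long" example.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Welford's online mean/variance: constant memory, one pass over the stream."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        """Fold one observation into the summary; the observation can then be dropped."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of everything seen so far."""
        return self.m2 / self.count if self.count else 0.0


# Hypothetical wait times in seconds arriving one at a time from an intersection.
january_waits = RunningStats()
for wait_seconds in (30.0, 45.0, 60.0):
    january_waits.update(wait_seconds)

# Three numbers are all that needs to be stored for this month.
print(january_waits.count, january_waits.mean, january_waits.variance)
```

The durable record here is the triple (count, mean, m2), whose size does not grow with the number of events processed.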

Ben: When you put it to those numbers with the new devices that really does put it in contrast and it's kind of at the core of the whole idea of what you do with all this growing data. So I think that's fascinating. I think what you guys are doing Simon is really interesting and I'm personally interested to see how all this works out and how the area grows because I think there's a lot of interesting applications.

Simon: Let me just mention that Swim is open source. So, if you go to and play, [inaudible] user mode extensions to the Java virtual machine, or else native compile using Graal. So it's simple stuff and you can get going fast to develop really high value applications.

Ben: That's fascinating. We'll put a link to that in the show notes as well. But Simon, I appreciate your time. I think this is a fascinating area and I think some of the stuff we talked about is very, very relevant. And I wish you guys luck and thank you for coming on.

Simon: Thanks Ben, that was fun.

Ben: Thanks everybody again for listening to another episode of Masters of Data and as always rate and review us on your favorite podcast location so that other people can find us, and thank you for listening.

Speaker 3: Masters of Data is brought to you by Sumo Logic. Sumo Logic is a cloud native machine data analytics platform, delivering real time continuous intelligence as a service to build, run, and secure modern applications. Sumo Logic empowers the people who power modern business. For more information, go to sumologic.com. For more on Masters of Data, go to mastersofdata.com and subscribe and spread the word by rating us on iTunes or your favorite podcast app.


In this episode of Masters of Data, we sat down with Simon Crosby, CTO at Swim.ai, and discussed the benefits of using digital twins and real-time streaming analytics. The use of digital twins provides a more efficient and money-saving way to create data models that are continually evolving and learning from their own data. With this, the way that data is handled changes. It eliminates the process of creating a model beforehand, training it, and then plugging stored data into the model. Instead, the system continually processes, analyzes, and reacts to the data. In today's world, where answers are expected right then and there, real-time analytics provides not only a more efficient method of computing on data, but also faster updates and results.

To learn more about using Swim's applications, visit

And to learn more from the team, visit the blog at