Big Data has transformed the world big time. It led many companies to strongly focus on data analytics, trying to collect and control gigantic amounts of data. After years in the rat race, several of them are slowly realizing that the continuous striving for having more data than others is maybe not the most meaningful business objective for everyone.
In fact, the data collection craze is steadily killing innovation. In this talk, Ruben Verborgh discusses post-Big Data thinking, in which data is controlled again by people, outlining the goals and ambitions of the Solid project.
Ruben is a professor of Semantic Web technology at IDLab, Ghent University, and a research affiliate at the Decentralized Information Group of CSAIL at MIT. He acts as a technology advocate for Inrupt & the Solid ecosystem of apps, and has co-authored 2 books and numerous publications. He aims to build a more intelligent generation of clients for a decentralized Web at the intersection of Linked Data and hypermedia-driven Web APIs.
— Transcript —
Connected Data London – Welcome to the Connected Data London podcast, brought to you by the Connected Data London team. Connected Data London is the leading conference for those who use the relationships, meaning and context in data to achieve great things. We have been connecting data, people and ideas since 2016. We focus on knowledge graphs, linked data and semantic technology, graph databases and machine learning, technology use cases and educational resources. Big data has transformed the world big time. It led many companies to strongly focus on data analytics, trying to collect and control gigantic amounts of data.
After years in the rat race, several of them are slowly realizing that the continuous striving for having more data than others is maybe not the most meaningful business objective for everyone. In fact, the data collection craze is steadily killing innovation. In this talk, Ruben Verborgh discusses post-Big Data thinking, in which data is controlled again by people, outlining the goals and ambitions of the Solid project.
George Anadiotis – OK, so this is the last part of the day, and I’m happy to see you all made it in one piece. Well, this is going to be super interesting.
I’m really happy that Ruben Verborgh is here with us today. He’s, well, one of those people who wear many hats. He’s a researcher at the university, also an affiliated researcher at MIT, and also a technology advocate for Inrupt. And I guess this last bit may be the most relevant today, in a way, because he’s going to be talking about decentralization, Solid, Linked Data and how all that stuff somehow comes together. So, on to you…
Ruben Verborgh – Big Data, we all know it and I generally dislike it because I think it’s killing innovation.
The good news is that connected data can help us bring back the innovation that we’re missing today. I’ll talk about Solid, and Solid as a way of redefining the relationship between people, companies, services and data. What we’re going to do is move data closer to people, because that’s good for many reasons: not just for privacy and control, but also for innovation, for doing meaningful things with data. So let’s have a look at how this works. I’ll be talking about three things today.
I’ll start with this redefining of the relationship between apps, data and people. Then I’ll go to the Solid ecosystem, explaining what the parts are, how it works and how you can use it. And finally, I’ll zoom in on the technical side: how exactly connected data helps realize this vision. So first, let’s talk a bit about this relation. Now, all of the examples that I’ll be giving will be in the context of social media.
However, I’m not selling you a social media platform. It is just one example that we all know, because we all use it or we all know someone who uses it. But as I give the examples about social media, try to translate them to completely different kinds of domains, because this is a data story. This is a picture that you might be familiar with: the walled gardens that we have in social media. But if we’re honest, we have those walled gardens everywhere there is vendor lock-in, which is, as you know, in quite a lot of places.
What is the essence of the walled gardens? Well, it means that there are barriers in between. The idea behind the open web is that we can just interact with anyone we want, with any piece of information we want. However, the walled gardens are the result of a wave of innovation that goes against that spirit. Within the walled gardens, you cannot easily share information among different systems. Simply said, you cannot share your Facebook photos with your LinkedIn colleagues. I mean, you can, but you have to either move the data or move the people.
So you have lots of copies of your data, or you have the same people on multiple networks, and so on. So it gets really complicated. And if you really think about it, this doesn’t make sense. Just to give you one example: have you ever tried changing your profile picture? If you do that, you have to change it in every single place where you want to use it. And sometimes, years later, you still find your old profile picture.
This is a result of data being close to the application and not being close to the person. So this is what we want to change. Now, Solid is not a platform. It’s not saying that we should all do this differently. I mean, it kind of is, but it is foremost about giving choice: not telling people that they need more control, more privacy or more innovation, but telling them that if they want it, they should have the choice.
The Solid ecosystem is about enabling people to use the apps that they want to use while still storing the data wherever they want to store it. And this is not true today because, yes, you can pick the applications that you want, but you can only interact with those people who are using the same applications. If they’re not, well, then you’re out of luck. So we’re going to decouple this. We’re going to put data separately from applications so people can freely choose the apps they want.
And we want people to control the data and then share it with the apps and people they choose. It sounds very simple, and conceptually it is, but it is a concept with really strong consequences. The main concept behind it is something we call a data pod. It has different names, like data vault and so on. The idea is that every single one of us gets their own space, their own place where they can store their data.
That’s it. So you’re not storing it with LinkedIn or Facebook or any other; you’re storing it in your place. Conceptually, this is where it starts from: a very simple idea with huge consequences. For instance, this is a view that you’re very familiar with, and a view like this one consists of thousands of pieces of data. Today, all of those pieces come from a single place: the social media platform. They will come from Facebook, let’s say.
However, if every one of us gets their own data pod, then this view suddenly needs to get data from different places. For instance, if this is my social media post, then this profile picture is going to be stored in my data pod, and the text is also going to be stored in my data pod. If you comment on my post, the comment is yours, so it is going to be stored in your data pod. And even a piece of data as small as a like: if you like my post, or if you like one of the comments, this like is yours and is going to be stored in your data pod. And then it gets really crazy, because that view can be built up of pieces of data that are coming from hundreds or thousands of different data pods.
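The idea of one view being assembled from many pods can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the pod URLs, the `fetch` and `build_view` helpers and the in-memory dictionaries stand in for real pods and real HTTP requests, and none of it is taken from the Solid specifications.

```python
# Hypothetical in-memory stand-ins for three data pods; each pod maps
# resource URLs to the piece of data stored under that URL.
PODS = {
    "https://ruben.example/": {
        "https://ruben.example/profile/picture": "ruben.jpg",
        "https://ruben.example/posts/1": "Big Data is killing innovation.",
    },
    "https://alice.example/": {
        "https://alice.example/comments/7": "Great point!",
    },
    "https://bob.example/": {
        "https://bob.example/likes/3": "like",
    },
}


def fetch(url):
    """Return (pod root, content) by looking in the pod that owns the URL."""
    for root, resources in PODS.items():
        if url.startswith(root):
            return root, resources[url]
    raise KeyError(url)


def build_view(post):
    """Assemble one post view from pieces that live in different pods."""
    view, pods_used = {}, set()
    for part, urls in post.items():
        contents = []
        for url in urls:
            root, content = fetch(url)
            pods_used.add(root)
            contents.append(content)
        view[part] = contents
    return view, pods_used


# The post only lists links; each piece of data stays with its owner.
POST = {
    "picture": ["https://ruben.example/profile/picture"],
    "text": ["https://ruben.example/posts/1"],
    "comments": ["https://alice.example/comments/7"],
    "likes": ["https://bob.example/likes/3"],
}

view, pods_used = build_view(POST)
```

In this sketch the single post view touches three pods: the author’s pod for the picture and text, and one pod each for the comment and the like.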
So we’re no longer talking about big data here. We’re talking about a huge number of very small data sets, and they are connected. So it’s a totally different way of thinking. What kind of consequences does this have? Well, on the left-hand side, you have today’s landscape. Today, data and apps are tightly coupled with each other. They’re inseparable. So on Facebook, I have my contact list, my pictures and my agenda. LinkedIn has my contacts, and Doodle has my agenda as well.
There’s no synchronization. So if I accept an event on Facebook, then Doodle does not know about it. If I have a new colleague on LinkedIn, Facebook wouldn’t know about it. Maybe that’s a good thing. But the thing is, even if we want to, we still can’t do that. And this means that we have this constant duplication as well. We always have to keep things in sync, because the data resides with the applications, and this is what creates those data silos.
And again, it’s not just social media. Basically every app that we use will try to hold on to our data, because that’s just how they work. On the right-hand side, you see the consequence of what happens when we all store our own data in our data pods. Those rectangles that you see are the same apps, except the data has been pushed out. So the data is now in my personal data pod. And this means that I can start with one application, work on data, and continue with the next.
There’s no need for synchronization, because my data is only in one place: my place. And this makes a lot of sense, because why would we want to maintain all this? I mean, this, if you think about it, is how it should have been all along. Technically, we’ve been ready. It also means that we can make different choices. On the left-hand side, if we want to interact, we need to use the same apps, the same systems.
On the right-hand side, I can use one app, you can use another app, and it still works, because the integration happens not through the application, but through the data pods. And this is where it gets really interesting, because the left-hand side is a world in which there is only one app. The app is built for the average user, like the Facebook app, for instance. It’s built for the average user, who doesn’t exist. On the right-hand side, because you have a choice of apps, you can have more specific apps.
Everyone can use their own thing, basically, and they can still collaborate. So this is a very fundamental change. This, of course, is great for people, because people get choice, and you don’t have that choice today. However, it is also great for companies and for innovation. Let me explain that a bit more. Again, the left-hand side is today. This is what we have: a single market for centralized apps.
So you get a competition that’s not based on quality of service, but on data ownership. The competition there is who has the most data, which is great if you’re one of those having the most data. However, it kills innovation. Let me give you two examples. On the one hand, think about Facebook. Facebook is a major company. How do they innovate? Well, let me ask you: who can name one innovation to your personal news feed in the past ten years, which is an eternity in technology, that Facebook has done for you?
You might not have noticed, but they have done one major innovation: instead of just having likes, you now have five different emotions. But that’s about it. Facebook doesn’t innovate. They don’t have to, because they’re in the data collection business. That’s the kind of game that they’re playing. Fair enough. However, it kills innovation. Even worse, second example: suppose that someone has a revolutionary idea. They have a new idea on how to present information for social feeds. They build a new concept, but they cannot enter the market because they don’t have the data.
So not only are the big companies not innovating on that front, those who want to innovate can’t, because they depend on data for doing so. On the right-hand side, we get the new world that we would have if people kept data closer to themselves, because then you get a competition on two levels. On the one hand, there is a competition between those offering data as a service: places that give you a data vault where you can store your own data. On the other hand, you get a separate competition between apps.
You get a separate competition between apps. And this is where it gets interesting because again, the left hand side is where you have the one for the average user. This means that grandmothers and grandchildren are literally using the same app because it gives the best experience. No, because the data is on there. The right hand side is a market where grandmother and grandchild have their own apps, but considered direct because data has been this couple from from the left hand side is competition.
But there’s one winner. The right hand side is a competition as to competitions, but many more winners because everyone can pick the deals that are best for them. And the deals picked for data and for apps are different. So this is where the innovation will happen. This was a concept on how things work, so the separation between apps and data. Now let’s have a look on how this concretely translates into technology. We’ll talk about the sort of ecosystem.
So Solid is not a company, not a platform. It is foremost a set of standards that make this work together. If you compare it to anything, compare it to the web. The web is also an open platform with standards. You can choose your browser, and you can choose whatever software you use to make websites, and they all work together. You can view a website on mobile, on desktop, on laptops, and so on and so forth.
And it still works. So this is what we’re talking about, and this is not a coincidence. Solid was started by Tim Berners-Lee, who invented the World Wide Web. And basically, what he wants to do is bring back the original principles: bring back this openness, this permissionless innovation, where we don’t depend on companies harvesting as much data as possible, because that kills innovation. So Solid is more than just software. It’s an ecosystem. It’s a movement. It’s a community.
It’s an ecosystem in the sense that it’s based on standards. So this is the key thing: the contracts that anyone can implement. It’s a movement in the sense that we want to build apps in different ways. We’ll talk more about that later, because there’s a really different approach to building apps. And it’s a community in the sense that there are different people with different backgrounds, different companies and different organizations who want to contribute to Solid. And the important thing is that anyone can build or host software for Solid, because it’s not a platform.
If you implement the standards, well, then you can participate. For instance, you can have your own server at home or in your workplace. You don’t have to connect to anything else. You can rent server space, or you can use one of the free servers. Basically, having a Solid data pod is like having a website: you can self-host, or someone else can host it for you. That can also be a small or a large company. It is your choice.
So I’m not telling people you should have your own server. I’m telling people you need choice. If you want your own server, you can. If you don’t want one, that’s fine. I don’t care. What is a Solid server? The server acts as the data pod that will store and guard your data. It is by all means a regular web server. So we’re using the existing web stack. We are just adding a couple of layers on top: there’s access control and there’s Linked Data.
The server itself is application-agnostic, which is really important. It means that you can build any application, and that all application-specific logic resides in clients. It is like a website again: just like a website can be opened with any browser, a Solid pod can be viewed with any app. What’s in the data pod? Well, like a website: you name it. Profiles, pictures, comments, likes, whatever you want to put in it, you can put in it.
It is like a website: your imagination is your only limit. What are clients? Clients are browser apps or native apps that read data from or write data to your pod. You give apps permission, and you can be very specific about what they get to access. You can say: these documents you can see, these documents you cannot see. You can give friends permission, friends can give you permission, and so on: the things you already know. But this time, you control whatever happens, and it’s up to apps to deliver a unified experience.
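As a rough illustration of per-document permissions for apps and friends, here is a minimal Python sketch. The `ACL` table, the agent URLs and the `is_allowed` helper are hypothetical and deliberately simplified; they are not the access control format that Solid servers actually use.

```python
# Hypothetical permission table: resource URL -> agent (app or person) -> allowed modes.
ACL = {
    "https://ruben.example/photos/holiday.jpg": {
        "https://photos-app.example/": {"read"},
        "https://alice.example/profile#me": {"read"},
    },
    "https://ruben.example/notes/diary.txt": {
        "https://notes-app.example/": {"read", "write"},
    },
}


def is_allowed(agent, resource, mode):
    """Check whether an agent was granted a given mode on a given resource."""
    return mode in ACL.get(resource, {}).get(agent, set())


# The photos app may read the photo, but it was never given access to the diary.
can_read_photo = is_allowed(
    "https://photos-app.example/", "https://ruben.example/photos/holiday.jpg", "read"
)
can_read_diary = is_allowed(
    "https://photos-app.example/", "https://ruben.example/notes/diary.txt", "read"
)
```

The point of the sketch is only that the pod owner decides, per document, which app or person gets which kind of access; anything not granted is denied.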
So the views that we saw in the beginning: if you use an app, you shouldn’t notice a difference whether it’s Solid or not, except that in reality, the data will come from many different places instead of from one single back end. What kind of apps can you build? Again, this is the same as the web: if you can envision it, you can build it. It’s not a totally different platform.
You’re just building on top of the web. The web is the platform. However, the difference is that if you start building, you don’t need data. Right now, if you’re a startup, the first goal you have is: how are we going to get people’s data? Because without data, you can’t build apps. Here, you don’t have to worry. People come with their own data. They will give you permission to see specific parts, and then you can start showing it and interacting with it, and so on.
So it’s all based on features. It’s not based on data harvesting. Now, this is nice in theory, but does it work in practice? Yes, it does. However, it’s not made for end users yet. This is in the prototyping and development stage. We’re trying out the technology to see how it works, so you can use it if you’re brave. It’s very rough around the edges. That said, there are several implementations of Solid servers.
You can already store your data in a data pod; my data is stored in one. There’s some free storage available, or you can deploy your own, as you wish. As for apps, there’s a couple of them: there are data browsers, contacts and photos apps, and so on. And if you don’t find what you want, you can build it yourself, because that’s the good thing: again, you don’t need data to build something. There are also several libraries that help you build your own Solid apps, so you can try it out if you want.
But warning: it’s very rough around the edges. Now, how come? Well, it started as a research project at MIT, and there we designed the initial protocols and standards, and some of the prototypes as well. Those are what we have today. However, in computer science research, the problem is that once you’ve solved the technical challenges, you have to move on to the next challenge. But that way, you don’t get it to the end user.
So Tim Berners-Lee is currently behind a startup called Inrupt. What he really wants to do is show the world that you can have a healthy company based on data without doing data harvesting. So he’s trying to lead by example, showing it’s possible to do nice things on the web without data harvesting. Inrupt is contributing to the open source ecosystem with implementations, helping with the standards and so on. And they also have a couple of products down the line.
However, it’s very important to know that Solid and Inrupt are very different things. Inrupt is trying to be one company in a bigger ecosystem. Or, to say it differently: if Inrupt is the only company, then they have lost, because it is all about competition. So what we’re trying to do is open up the market to different players, because the more people start doing this, the more beneficial it is for everyone, really. So think of Inrupt as one of the players that is trying to open up the ecosystem for everyone.
So I’ve been talking about this new way of thinking about apps, data and people, and I’ve shown you what the ecosystem looks like. Now, let’s have a closer look at a deeper level: how does it work on a lower level? Because there must be some magic technologies behind this. Of course, there’s never magic, but there’s a couple of interesting things to be done there, because the main difference is that decentralized apps will have many back ends, and those back ends need to work with many apps. On the left-hand side, you have today’s situation again, with one back end and one app. You know how it works.
And this is really easy, because you know exactly what the contract is. And if you don’t like the contract, you can change it, because often the back-end and front-end teams work closely together. So this is about as easy as it gets. On the right-hand side, we have the world that we’re preparing for, in which data pods interact with different apps and apps interact with different data pods. One app has to read data that has been written by another app, so you have to have this handoff.
So what we get is a really complex way of interaction, basically, because you cannot predict anymore what the data pod will look like. And this means that we will have to completely change the way that we think about building apps. This is what it looks like right now: you have application-specific logic that goes to a specific server interface, and you hard-code interactions. So every client has hard-coded HTTP interaction code to go to a specific server, which means that if you have servers that do kind of the same thing, but not quite, they don’t work together.
They are incompatible. This does not work for us, for two reasons. First of all, you cannot make assumptions about what the server looks like, because everyone can do their own thing. And second, you don’t even know which server you need to go to, because data can be distributed across one, ten, a hundred or a thousand different servers. So you cannot hard-code, because you simply don’t know where the data will be. So we have to evolve towards a different way of building apps.
And the way that I think about it is a query-based contract. So instead of hard-coding what you want in your application-specific logic, you express it in a query: this is the shape of the data that I’m looking for. That’s all you do. And the interesting thing is that such a query captures what you want without being specific to a certain back end or a certain network. This also means that you need a query language with universal semantics: your query will need to have a meaning that is the same across all data pods in the entire world.
So things like GraphQL won’t work, because one GraphQL query has different meanings depending on who you’re asking. So we’re thinking about this global space where a query has a meaning, and the same meaning for everyone. So this is how you build apps: only using queries. You don’t use HTTP requests, because those will be different every time. And then it’s up to a query engine on the client side to translate those queries into concrete requests to interfaces on different data pods.
Now, there’s one problem with this picture, and that is that it depends on a really strong query engine that will do all the magic, of course. But the good news is that this engine is reusable across different applications. This is all about making sure that the application-specific code does not have to change: the contracts we build against are queries. And if interfaces change, that’s all fine, because it’s only the translation step that has to change. The application itself does not need to adapt.
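A minimal Python sketch may help to picture a query-based contract. It assumes two hypothetical pods that expose the same kind of data through different interfaces; the `QueryEngine` class, the adapter functions and the pod URLs are invented for illustration and do not correspond to an existing library. The only real identifier is the FOAF `name` property.

```python
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def pod_a_interface():
    """Hypothetical pod A: exposes data as (subject, predicate, object) triples."""
    return [("https://alice.example/profile#me", FOAF_NAME, "Alice")]


def pod_b_interface():
    """Hypothetical pod B: exposes the same kind of data as a dict keyed by subject."""
    return {"https://bob.example/profile#me": {FOAF_NAME: "Bob"}}


def triples_from_a():
    """Translation step for pod A: its interface already yields triples."""
    yield from pod_a_interface()


def triples_from_b():
    """Translation step for pod B: flatten the dictionary into triples."""
    for subject, properties in pod_b_interface().items():
        for predicate, obj in properties.items():
            yield (subject, predicate, obj)


class QueryEngine:
    """Reusable engine: applications only state what they want, not where it lives."""

    def __init__(self, adapters):
        self.adapters = adapters

    def query(self, predicate):
        """Return sorted (subject, object) pairs for a predicate across all pods."""
        results = []
        for adapter in self.adapters:
            for s, p, o in adapter():
                if p == predicate:
                    results.append((s, o))
        return sorted(results)


# Application-specific code: one query, no knowledge of either pod's interface.
engine = QueryEngine([triples_from_a, triples_from_b])
names = engine.query(FOAF_NAME)
```

If pod B changed its interface, only `triples_from_b` would need to change in this sketch; the application line `engine.query(FOAF_NAME)` would stay the same.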
Now, of course, this comes with huge interoperability challenges, and we solve them with Linked Data. Our first challenge is: if you store your own data, how do we connect it to other people’s data? Second, how can apps share data without having to first agree on everything? And finally, how do we integrate data from multiple data pods? Well, the answer to all of these things, for me, is RDF. The first question: how do we connect it?
Well, using links. For instance, this is me liking Connected Data London 2019. And guess what? There was no like button on the website, but I liked it anyway. This is what I can do if I’m in control of my data: I just made a link to Connected Data London and I liked it. This is my like. So this is how it works: connecting two things. Then, about how apps interoperate: this is a complex topic that needs more than just two minutes to explain.
But the bottom line is this: we’re basing it on shapes. So instead of training applications to know exactly what to expect, we train them on specific shapes, so applications know how to recognize a like, but they can also ignore all the things that they don’t understand. So it’s based on shape matching in the query engine. And finally, how do we integrate data from many different sources? Well, this is easy because it’s RDF: you can just merge the graphs.
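Shape matching and graph merging can be sketched in a few lines of Python, under the assumption that a graph is simply a set of (subject, predicate, object) triples. The pod URLs, the `mood` property and the `match_shape` helper are hypothetical; the `rdf:type` URI and the ActivityStreams vocabulary terms are real, but whether the talk’s example used them is not stated in the transcript.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
AS = "https://www.w3.org/ns/activitystreams#"

# Two hypothetical pods, each holding a small graph of triples.
graph_ruben = {
    ("https://ruben.example/likes/1", RDF_TYPE, AS + "Like"),
    ("https://ruben.example/likes/1", AS + "actor", "https://ruben.example/profile#me"),
    ("https://ruben.example/likes/1", AS + "object", "https://conference.example/2019"),
    # A property this app does not understand and can safely ignore.
    ("https://ruben.example/likes/1", "https://vocab.example/mood", "happy"),
}
graph_alice = {
    ("https://alice.example/likes/9", RDF_TYPE, AS + "Like"),
    ("https://alice.example/likes/9", AS + "actor", "https://alice.example/profile#me"),
    ("https://alice.example/likes/9", AS + "object", "https://conference.example/2019"),
    ("https://alice.example/notes/2", RDF_TYPE, AS + "Note"),
}

# Because both graphs use globally meaningful identifiers, merging is a set union.
merged = graph_ruben | graph_alice

# The shape of a like: a required type, plus an actor and an object with any value.
LIKE_SHAPE = {RDF_TYPE: AS + "Like", AS + "actor": None, AS + "object": None}


def match_shape(graph, shape):
    """Return the shape's properties for every subject that matches the shape."""
    by_subject = {}
    for s, p, o in graph:
        by_subject.setdefault(s, {})[p] = o
    matches = []
    for subject in sorted(by_subject):
        props = by_subject[subject]
        if all(
            p in props and (expected is None or props[p] == expected)
            for p, expected in shape.items()
        ):
            matches.append({p: props[p] for p in shape})
    return matches


likes = match_shape(merged, LIKE_SHAPE)
```

The note in the second pod does not match the shape and is skipped, and the unknown `mood` property is ignored rather than treated as an error.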
And this is the reason why I’m insisting on universal semantics: we’ll need to combine data from random places in the world, and they need to work together. That just means that you need consistent semantics and consistent meaning across all those possible data pods. This also means that we need to do something for developers, because this completely changes the way that they’re used to developing things. So I think this is the most crucial success factor: developers are the ones that build what people see. And don’t forget about UX and so on; the apps are what is being sold to people.
We should not make the mistake, as Solid or as Inrupt, of saying: look, we’re going to bring it all the way to the end user. What we should be doing is thinking in terms of enabling others. So we want to enable developers to build things on top of it. So rather than doing it ourselves, we think about how we can make it better for them. And the interesting thing is that if we help developers build better apps, we also help ourselves.
And this is something I noticed, as I’ll explain a bit later. And this to me is a crucial point, because you’ll hear lots of things about RDF being complex. And a very convincing argument, in fact, was made this morning. However, Dan Brickley and Libby Miller had a beautiful quote, saying that people think RDF is a pain because it is complicated, but the truth is even worse: RDF is painfully simplistic, but it allows you to work with real-world data and problems that are horribly complicated.
And this is a horribly complicated problem, because data is spread all over the world. How can you fix this? By giving people tools.
For instance, I myself have been working on simple React-based tools that give developers access to this whole world. So this is just a component-based way of building, and it doesn’t look too special, does it? It looks quite regular. However, let’s have a close look at these things. There are micro-expressions in there, and they’re queries. And those are queries over not just one data source: they’re queries over the full web.
I call this language LDflex, for Linked Data and flexible. And it works like this: you can do things like data.user.name, data.user.friends, and so on and so forth. But this one is interesting, because the first names of those friends or their profile pictures are not coming from the user’s data space. They’re coming from their own data spaces. So this is a full federated query going out to different places to get things like their first names. And you can even just have a random URL in there.
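The path-expression idea can be mimicked in Python to show what happens behind an expression such as `user.friends.firstName`. This is only a sketch of the concept: LDflex itself is a JavaScript library with its own API, and the `Path` class, the `DOCUMENTS` dictionary and the profile URLs below are assumptions made for illustration.

```python
# Hypothetical documents living in three different pods; the dictionary
# lookup stands in for fetching each profile over HTTP.
DOCUMENTS = {
    "https://ruben.example/profile#me": {
        "name": "Ruben",
        "friends": [
            "https://alice.example/profile#me",
            "https://bob.example/profile#me",
        ],
    },
    "https://alice.example/profile#me": {"firstName": "Alice"},
    "https://bob.example/profile#me": {"firstName": "Bob"},
}


class Path:
    """Builds a path with attribute access and resolves it by following links."""

    def __init__(self, start, steps=()):
        self._start = start
        self._steps = steps

    def __getattr__(self, name):
        # Only called for attributes that do not exist: treat them as path steps.
        if name.startswith("_"):
            raise AttributeError(name)
        return Path(self._start, self._steps + (name,))

    def resolve(self):
        """Follow every step, hopping to whichever pod holds the next document."""
        current = [self._start]
        for step in self._steps:
            next_values = []
            for url in current:
                value = DOCUMENTS[url][step]
                next_values.extend(value if isinstance(value, list) else [value])
            current = next_values
        return current


user = Path("https://ruben.example/profile#me")
own_name = user.name.resolve()
friend_names = user.friends.firstName.resolve()
```

The expression `user.friends.firstName` looks like a local property access, but resolving it reads the friends list from one pod and each first name from the friend’s own pod.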
So that’s the level of abstraction that we’re talking about. So today I’ve talked to you about a new relation between apps, data and people. I’ve explained how the ecosystem can work, and also what the underlying technologies are that connect all of this together. However, as a researcher, I have an important statement to make, and that is that in my Semantic Web community, we often think that problems are solved. We focus on the hard research problems, like federated querying, and think that all the rest will be solved by the engineers.
But the state of the art in the Semantic Web community for federated querying is basically: let’s take some servers, connect them over Ethernet, and then do federated querying. However, if you take that to the public web, with hundreds of sources on global connections and with different data models, our nice theories don’t hold. So the 80 percent that we think we’re solving is actually quite small. Bringing all of those theories into practice requires a lot of research and a lot of engineering to make it work at the scale of the web.
Thank you.
George Anadiotis – Thank you. So, I’m sure there’s tons of questions.
Audience – I was wondering what sort of applications you think should be written in this way.
Ruben Verborgh – All applications, in a sense, but foremost applications working with personal data, because with personal data we have major problems. It is expensive right now: you have GDPR, you have social pressure to do it right, and also business factors. But there’s a couple of application scenarios that are very interesting for this in particular.
E-government: governments are really asking for things like that, especially when things in Europe get so complex. E-health is also a very prominent example, because different hospitals have records and so on, so if you can just put them with the patients, that makes sense. So expect those to be very prominent examples.
Audience – Those would be good examples. What sort of application would you not do this for?
Ruben Verborgh – I think examples where there’s no real personal data involved, I would say. So more point-to-point things and so on might be less interesting. But we need to think of concrete examples.
Audience – How difficult do you think it is to achieve universal agreement on semantics, and to what extent is that going to inhibit innovation if people just can’t agree on what something means?
Ruben Verborgh – Extremely difficult, to the extent of the impossible. However, the thing is, we don’t need full agreement. We can use smaller, decentralized agreements. There’s a very small set of things that we all need to agree on. But the more niche it gets, the smaller the set of people who actually need to agree.
And the good thing is, because we have the RDF model, it is also possible, if different initiatives start separately, to align them after the fact. So the whole complexity of the stack is justified, because you don’t want this central agreement: it is so slow. It’s much easier to rely on decentralized agreement.
Audience – This is more of a political question, but do you have a vision of bringing this technology to make the world of the Googles and the Facebooks change? Or do we just say that this is impossible, and we have to create a parallel track, because there’s no way that these people are going to get it?
Ruben Verborgh – We have to do it all at the same time. However, be aware that it might seem like we’re the enemy of the big companies, but know that even big companies are struggling. Take Facebook: their data is the biggest, but even Facebook is desperate somehow to acquire more data. So I’m quite sure that they know that their model today is finite, and they’re desperately looking for alternatives.
Take, for example, Google. They have lost the social media market, and now they might like to be a provider of data vaults in this ecosystem. And then maybe that breaks the monopoly of Facebook. So even for big companies, it makes sense to think about these things. That’s very important.
Audience – So Solid is very attractive from the point of view of users controlling their data, but often there are good performance-related reasons to centralize data. How do you speak to those concerns?
Ruben Verborgh – Yeah, so first of all, very good point, because if I explain this to people, some of them say: oh, this is nice, now you have new research to work on. But actually, setting the rules in this way is so meaningful that we believe it justifies looking for new solutions. So we’ll have to look at those things, just because the rules are meaningful. However, decentralization does not mean that there can be no centralization or performance optimization.
So, of course, to make it scale, you need caching, you need indexing and so on. The difference is that the source of truth is the individual data pods, and the caches build on top of them. If the caches go down, you can still find the answer. It’s just going to be painfully slow, which is very different from today, because if centralized systems go down, then everything is gone. So yes, we will need massive caching and so on, but those are only optimizations, which is a good thing.
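The distinction between a cache as an optimization and a pod as the source of truth can be sketched as follows. The `POD` dictionary, the `read` helper and the `slow_fetches` log are hypothetical names used only for this illustration.

```python
# Hypothetical pod: the source of truth for this piece of data.
POD = {"https://ruben.example/posts/1": "Big Data is killing innovation."}

cache = {}         # optimization only: may be empty or go down at any time
slow_fetches = []  # records every time we had to go back to the pod itself


def read(url):
    """Serve from the cache when possible, otherwise fall back to the pod."""
    if url in cache:
        return cache[url]
    slow_fetches.append(url)  # the slow path: contact the pod directly
    value = POD[url]
    cache[url] = value
    return value


first = read("https://ruben.example/posts/1")   # slow: fetched from the pod
second = read("https://ruben.example/posts/1")  # fast: served from the cache
cache.clear()                                   # simulate the cache going down
third = read("https://ruben.example/posts/1")   # slow again, but still answered
```

Losing the cache in this sketch costs time but no data, because the answer can always be rebuilt from the pod.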
George Anadiotis – This is going to be the last one, but don’t go away, because there’s a direct continuation.
Audience – Have you done any thinking about what the incentives might be for companies to get into this? It is a bit more limiting in the kind of revenue generation models they might have.
Ruben Verborgh – Innovation. I’ll give you a very short example. A couple of months ago, I had a company knocking on my door. They built AI to match people to jobs, which is really great, but the only way to make the algorithm better is more data.
And if they need more data, they’re basically competing directly with LinkedIn. So they knocked on my door saying: look, we don’t want this. We don’t want to harvest data, because the moment we do it, we’ve lost already. We want an ecosystem in which we don’t depend on data harvesting. So instead of having 50 percent of our people working on this, we can have 100 percent of our people working on innovation. And the data, we don’t want it; please keep it.
So innovation, to me, is a driver for these things.
George Anadiotis – OK, I know we could go on for a while, but we’re going to have to wrap up because there’s a panel coming up. So she’s saying I’m starting to say.
Connected Data London – We hope you enjoyed the podcast. To get more of our own material and to keep up with the latest industry and research news from our domain, we invite you to connect with us. Connected Data London has an omnichannel presence.
Besides all major podcast platforms, YouTube and SlideShare, you can find us on Twitter, LinkedIn, Facebook and Instagram. You can join our meetups, or you can keep up with our news and special offers by joining our mailing list.