* Talk from Connected Data London 2017 Conference *
Connected Data encompasses data acquisition and data management requirements from a range of areas including the Semantic Web, Linked Data, Knowledge Management, Knowledge Representation and many others. Yet realising the true value of many of these visions, both in the public domain and within organisations, requires the assembly of often huge datasets. Thus far this has proven problematic for humans to achieve within acceptable timeframes, budgets and quality levels.
This panel discussion by Paul Groth, Spyros Kotoulas, Tara Rafaat, Freddy Lecue and moderator Szymon Klarman tackles these issues.
You can listen to the podcast below:
(Transcript below)
If you are more of a visual type, you can also watch the presentation.
—Transcript —
So, yes. It's my great pleasure to chair this first panel of today, entitled "Does Connected Data need AI, or does AI need Connected Data?" Admittedly, the AI bandwagon is riding fast and loud these days, and it remains in the spotlight of many communities: computer science, tech, industry, enterprise, and, whether we like it or not, also popular media. So it's very tempting for any new aspiring technology to claim strong links to artificial intelligence. And Connected Data is certainly one of those concepts that is commonly associated with AI. It has kind of grown out of AI research, and it claims it can also solve many problems in that space.
So in this panel, we'll try to discuss and reflect on the nature and the extent of this relationship. You've seen most of the talks already this morning, so you know the panelists, but let me just quickly go through the names again. We have Spyros Kotoulas from IBM Research Ireland. We have Tara Rafaat, Chief Ontologist and Associate Vice President at Mphasis. We have Freddy Lecue, Principal Scientist and Research Manager in Artificial Intelligence Systems at Accenture Labs. And Paul Groth, Disruptive Technology Director at Elsevier.
Let's start. I wanted to introduce you a bit more deeply, but then I thought, okay, maybe you are actually the best qualified to say a few sentences yourselves. How do you work with connected data? And what does it mean for you in your space? Let's just start with you, that's easiest.
Freddy.
Yes. Okay. As I mentioned, Accenture is a services company, so we don't have any products. We build solutions for clients, and we try to bring different AI solutions depending on their needs. And they have heard all the buzzwords around AI, so when they think about AI, they think about deep learning; that's basically the shortcut they use. And they try to apply the hyped new technologies. So before we say, well, that's right, this is what you need to apply, we try to figure out what problem they actually have.
And sometimes very basic and simple models for doing predictions, whatever the problem they have, might just work. And sometimes connected data is not the approach you need to bring. So it really depends on the problem they have and the data they have. And those companies have just heard the news from Google, from IBM, from large companies, about the big data problems those companies have. Google and Amazon, for example, do have big data problems and large amounts of data. But our clients don't have that; they don't have everything digitalised.
They don't have large amounts of data in general, so they have a "big small data" problem, as we call it. Sometimes you need to enrich the data they have with more data, to contextualise the information. In a large number of cases we try to start from the data: what data do they have, and how can we achieve the goal they want? And in some cases we bring in connected data, knowledge, or external data that adds much more context to the information they have.
And sometimes the information they have internally is not enough to achieve the goal of the work they want to do. That's where we bring connected data. So again, not every problem will need it. But sometimes, as in the work I've presented, you need external data; you need to bring in much more contextual information. Paul and Spyros mentioned it to some extent in the work they're doing, bringing external information together. And that's the way we work; we have the same philosophy.
We try to bring in knowledge graphs or semantics if that's needed. We don't want to overkill the problem.
Thanks, Freddy. That's fine.
Okay. Hi, everyone. So I work for a company called Mphasis, which is also a services company. So we build solutions, but we also have products, built on semantic technologies; their foundational models are ontologies. And I actually want to mention a different aspect of talking to customers. A lot of them, yes, as Freddy mentioned, have heard the word AI, and they come over and say, oh, build us an AI solution, I want machine learning, I want deep learning. But a lot of them actually have a different approach.
We talk to them and say, hey, look, we've got connected data, we've got semantics that makes your data intelligent, we've got some AI working on it to give you recommendations, and they're like, oh, well, wait a second. AI? We're not there, we're not there yet. This is scary. Elon Musk said it's going to kill us. So it's really the other way around; their approach is, no, we don't want AI right now. And for us, for me, I've actually had a lot of webinars talking to people who are not in this field, because we're dealing with businesses like finance and insurance.
So they're really not in the AI field itself, but they're hearing about this technology and it's freaking them out. So what I try to do is really break down what AI means, because this word is really being overused, and I'm sure all of you have different definitions of it. It's really nice to break it down: what are the fields of AI, what are the subdivisions, and then come back to which of those can actually be useful to solve your problem, and also to make people understand that they're actually dealing with AI every single day, every minute.
All of you have your iPhones (I hope you have a Google Pixel, though), but you are dealing with Siri, you're talking to your Amazon Echo, you're using Google Home; everyone is just using it. Do you use Google Now to plan your trip? That's a lot of AI happening: a lot of natural language processing going through your emails, extracting where you're going, planning your trip for you, telling you to take this bus to the airport to get there on time. So all of this that you're dealing with every day is actually part of artificial intelligence, and when you get people to see that, the customers are like, oh, okay.
And then we can actually take exactly the same technologies and apply them to fields that are relevant to them, like their business, finance or insurance or manufacturing or anything they're interested in. So I just wanted to bring that aspect out as well, to show the other side of the spectrum.
At IBM Research, in our research group, what we're mostly dealing with is healthcare and social care solutions for large organisations. They can be government or other types of large organisations. So this is not an environment where you're going to change the database. These databases have been there for a long time; sometimes it's not even a matter of the databases, there are legacy systems, and any kind of change can take years. So the data integration part that comes together with linked data, I don't see it happening yet.
Just to be honest. The second part, where I do think linked data is having an immediate impact, is when we're combining it with something else. I can give you a specific example where we're using linked data. As I mentioned, most of the cases we have to deal with are people with significant and complicated needs, and to understand those significant and complicated needs, many times you need to train a recommendation model, a machine learning model, and so on. The problem you have with these kinds of approaches is that sometimes the results you get can be very wrong.
And if something is very wrong, for example if somebody is treating a female patient and you say, yes, check them for prostate cancer, it is not just that it's a stupid recommendation, don't do it. It is also that people lose faith in the system. A knowledge graph, and this is actually how we are using it to this effect, can help you clean up your unreasonable recommendations or your unreasonable predictions. And of course, to do this, you need connected data. In healthcare, for example, the data is inherently connected, because you have so many different terminologies being used, so much existing knowledge.
I think Paul, during his talk, also mentioned that there are so many approaches trying to extract knowledge out of, say, a document. That is really an area where the connection is there, but at the same time you want prediction there as well.
One thing I mentioned a moment ago is that a lot of effort, in your work, has to go into clarifying some overloaded concepts, right? And I was thinking, since we are all kind of ontologists here, maybe we could dare to come up with a concise definition. For each of you: in your work, how do you define for yourself what connected data really is, in a compact way? Is it this or that data format? Is it a whole philosophy of connecting data points? What is it really? What is this conference about?
Well, I think the first thing that comes to mind whenever you say connected data is being able to extract information from different sources and bring it together. This is literally what Paul was talking about when he showed all these different, variable sources you have, with different formats of data. (I think I lost this. Ah, that's working.) So I have different formats of data, and you really want to have as much information as possible when you're going through the process of decision making. So linked data, in that sense, and in the very simple sense, means that I can bring all of this information from these different sources and integrate it together. And the problem that usually exists is the interoperation problem between these different sources, because they all follow different schemas and people use different vocabularies. So that adds another layer, for me, to the linked data concept in general: yes, you bring the data in from these sources, but what if people have used different terminologies? How do I actually connect them together to get that whole view of the data?
And that's where the abstract level of the knowledge graph model comes in. That's where you're defining what the different concepts are and how they're related. So how can I fit this data I'm bringing in into this knowledge model, so it actually makes sense for me? In my work, that's really about building that abstract level and trying to bring information from all these different sources together, connected in a format that's understandable by the computer. And therefore I can do a lot more intelligent stuff with it, automate a lot of things that happen in the industry today.
So that's kind of what linked data really means to me in my work.
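As a toy illustration of that idea (all field names, values, and mappings below are invented for the sketch, not taken from any panelist's system), two sources can describe the same kind of entity under different schemas and vocabularies, and a small shared model maps both into one queryable view:

```python
# Two sources describing clients under different schemas and vocabularies.
source_a = [{"cust_name": "Acme Ltd", "sector": "ins"}]
source_b = [{"client": "Beta GmbH", "industry": "insurance"}]

# A tiny shared "ontology": canonical property names plus value mappings.
property_map = {
    "cust_name": "name", "client": "name",
    "sector": "industry", "industry": "industry",
}
value_map = {"ins": "insurance"}  # align vocabularies, not just field names

def to_shared_model(record):
    """Rewrite one source record into the shared conceptual model."""
    out = {}
    for field, value in record.items():
        canonical = property_map[field]
        out[canonical] = value_map.get(value, value)
    return out

integrated = [to_shared_model(r) for r in source_a + source_b]

# A query written once, against the shared concepts, now covers both sources.
insurance_clients = [r["name"] for r in integrated if r["industry"] == "insurance"]
print(insurance_clients)
```

The point of the abstraction layer is that queries are written against the shared concepts, regardless of how each source happened to spell its fields.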
So if I can follow up with a question to the rest of the panelists: how is this different from data integration?
(Am I on? I have my own mic.) I think this is really a fundamental thing. Part of it is that you have the data integration, but you, as an organisation, especially from an organisational perspective, need that same conceptual view of the world across, or at least across parts of, your organisation. So, if you want to be really old school about it, it's like creating an enterprise data model; that's a really old-school way of thinking about it.
But with the flexibility of being able to adapt that enterprise data model over time, to source the concepts in it from different places, combined with the importance of data in that process of data integration. So I think it's just two old-school technologies, right? We can call it enterprise data models plus data integration, on steroids. How about that?
So, comparing the avenues of enterprise data warehouses with linked data technology.
Okay, let's loop back for a moment to artificial intelligence, since we're trying to understand this relationship. There are two sides to this question, of course, and let's start with one of them. In what way, and in what sense, do AI systems need connected data? Where does it come in really useful? You mentioned building this one big conceptual view over the infrastructure involving AI systems; that's going to be one point here. In what other sense does connected data contribute to AI?
For me, it's the reduction of cost. If I'm using linked data, and that goes back to the previous discussion, though not intentionally, I get some of the integration for free. If I'm using a linked data format, for example, to describe a patient, I can just pull in Freebase or Wikidata or whatever, and I just get additional information practically for free. And for me, this is the essence of the value of linked data: reducing cost. Of course, I could build an enterprise data warehouse and integrate this data; that would not be the problem.
It would just take more time.
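Spyros's "integration for free" point can be sketched as follows (a hypothetical example with invented URIs, not his actual system): if a local record identifies a condition by a shared identifier, as linked data formats encourage, then facts published elsewhere under the same identifier can be joined in without writing any schema mapping:

```python
# A local patient record that identifies a condition by a shared URI,
# as a linked data format such as JSON-LD would.
local_record = {
    "@id": "urn:example:patient:42",
    "condition": "http://example.org/disease/D123",
}

# Facts published elsewhere under the same URI (hard-coded here; in
# practice they might be fetched from an open dataset like Wikidata).
external_facts = {
    "http://example.org/disease/D123": {
        "label": "Hypothetical disease",
        "typicalOnsetAge": 55,
    }
}

# "Integration for free": a join on the shared identifier, no mapping step.
enriched = dict(local_record)
enriched["conditionDetails"] = external_facts[local_record["condition"]]
print(enriched["conditionDetails"]["label"])
```

The saving is not that the join itself is clever; it is that nobody had to reconcile two schemas first, because both sides already agreed on the identifier.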
I actually want to touch upon the integration point that you mentioned as well. It's actually a layer of intelligence that you're adding, as opposed to just old-school data integration. One very interesting thing we hear from our customers when we work with them, when we say, okay, you want to work with this amount of data, we're going to pull it from different sources, is the question: how are you different from data integration systems, exactly? And my answer is that I'm not going to overload you with data, because it is really important to pull out the information relevant to you, and just bring that to the screen for you to observe. Because if I'm going to overload you with data and just say, okay, I've integrated everything for you, then there's a problem.
I'm not really helping you; I'm just making you more confused. So that top level of abstraction, the conceptual models that sit on top, is really there to help with this intelligent extraction and integration of data. That's what it's helping with. And coming back to your question of how it helps: cost, as was said, is a great point, because what we hear from a lot of people is that the cost of producing good data to feed into machine learning algorithms is really high.
So if you have clean, open linked data to feed into them, that's great. But there's also another aspect, and that is using knowledge models to help select the right data, which is very similar to what you mentioned about recommendations that are completely irrelevant. Your results from a machine learning algorithm are only as good as your data; if your data has problems, your results will have problems. And unfortunately, a lot of the datasets that currently exist are facts, but not necessarily complete and comprehensive facts.
Can I give you an example? Let's say you want to hire people, so you have a machine learning algorithm that's trying to figure out who you want to hire. You're feeding a lot of resumes and data into it, a lot of information about very successful CEOs, because you want to hire a CEO for your company. Obviously, the first thing the algorithm will see and learn is that, out of these 5,000 that you gave it, 4,000 were men and 1,000 were women.
So the first conclusion is that the one it's going to select is going to be a man. And then, among the men: okay, 80% of the successful ones were white, 20% were minorities of colour, so it's going to select a white man. Then it goes through people's CVs, and these are the first two filters the algorithm applies based on this data. So how do we avoid things like that? You really need something like a semi-supervised learning process, where you can use a knowledge base in the back to give you information about, say, the discrimination going on in the workplace right now.
Or what the rates of education were ten years ago as opposed to now. You get that information from other sources. So linked data here could actually provide all the extra information you could feed into your system, to help the learning process be a little bit more accurate.
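One concrete reading of that suggestion (a sketch under assumed numbers, not a method any panelist described in detail) is to reweight training examples using an external reference distribution, so the learner does not simply reproduce the historical imbalance in the hiring data:

```python
from collections import Counter

# Historical training set: group labels of past "successful" hires, skewed.
training_groups = ["men"] * 4000 + ["women"] * 1000

# External knowledge, e.g. open labour-market statistics (assumed figures):
# the qualified candidate pool is roughly balanced.
reference_share = {"men": 0.5, "women": 0.5}

counts = Counter(training_groups)
total = len(training_groups)

# Weight each example so weighted group shares match the reference:
# weight(group) = reference_share(group) / observed_share(group).
weights = {g: reference_share[g] / (counts[g] / total) for g in counts}
print(weights)  # women's examples count for more, men's for less

# Check: the weighted totals per group are now equal.
weighted = {g: counts[g] * weights[g] for g in counts}
```

Real bias mitigation is of course much harder than one reweighting step; the sketch only shows where externally curated, connected statistics would plug into the training pipeline.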
So the process becomes more accurate. So, for instance, can we improve machine learning, can we find something more, and better, if we use linked data? Can we maybe explain the results? You've been talking, Freddy, about explainable AI. Can we explain the results better, communicate the results of machine learning better, using connected data methods? Is that where the value lies?
I think this is one of the values: how you can use this linked data. We also use the term context: you contextualise the domain. An enterprise has internal data about the enterprise, they know what the policies are internally, but externally there is a lot of information they don't have access to, and bringing it back can help contextualise the decision you're taking and make the system somehow self-explainable. So we really try to expand the data with more context, to give people much more evidence about what the data they have might mean in a broader context.
So I want to push back here a little bit. I actually think linked data, or this kind of connected data, is useful for people, right? Machines think in large, ginormous matrices. What do you feed a machine? You feed it feature vectors. And I don't think, for most development of new models, linked data really gives you much. In fact, something I've been struggling with is how to communicate this connection we've been talking about. What I can say is that the results of models, our recommendations or predictive models, have to be tied to something.
And that's really where we've seen the benefit of having connected data: we can tie a recommendation directly to a person, and we know that it's a person, and which one person; we can tie a chemical directly to the thing we know it is. And that's helpful for people to consume. But, playing devil's advocate here, while there's obviously some benefit to having more data, why don't we just learn from our data lake? Because we've got one of those too, right? Why don't we just learn from the data there? Why do we need to go to this effort of data integration at all for building our AI models? That's what we've been talking about a lot internally: we understand the value of connected data for our developers, our end users, for people. It's less clear to me that it's actually valuable for straight-up building machine learning models.
Yeah, I get your point. If you take existing, simple machine learning systems and you train them on some of the data you have, say you start with decision trees, the first rules, the ones with high confidence and a high coverage rate, are things that everybody knows. It's kind of common-sense knowledge, and it sits at the top because it has high confidence and high support, but people don't care about those particular rules. So that's where you can use linked data: common-sense knowledge that you can capture from somewhere else, to prune those particular rules that don't really add much.
Rules that make sense, but that you don't want. And to go back to Spyros's point: when you do predictions, you also want to prune the ones that are actually inconsistent in meaning. He said that sometimes systems learn wrong things, so how do you discourage the system from learning those wrong things? I think this is where context, external information, might give you some ideas and some techniques to prune the search space.
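A minimal sketch of that pruning step (the rule, recommendation names, and scores are all invented for illustration): model predictions are filtered against consistency constraints that a knowledge graph could supply, such as "prostate screening applies only to male patients":

```python
# Model output: candidate recommendations with confidence scores.
predictions = [
    ("screen_prostate_cancer", 0.91),
    ("screen_diabetes", 0.40),
]

patient = {"sex": "female", "age": 62}

# Consistency constraints, as could be derived from a medical knowledge
# graph: recommendation -> predicate the patient must satisfy.
rules = {
    "screen_prostate_cancer": lambda p: p["sex"] == "male",
}

def prune(preds, patient):
    """Drop predictions that contradict the knowledge-graph constraints."""
    kept = []
    for rec, score in preds:
        constraint = rules.get(rec)
        if constraint is None or constraint(patient):
            kept.append((rec, score))
    return kept

# The high-confidence but impossible recommendation is discarded.
print(prune(predictions, patient))
```

Note that the highest-scoring prediction is exactly the one removed; confidence alone would not have caught it, which is the panel's point about pairing learning with knowledge.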
But if you're doing an end-to-end deep learning model, you have an end task in mind and you have some input data. Isn't that the fundamental revolution, that you have feature detectors that are built automatically? You don't have to organize the data for the machine; it organizes it itself.
Right.
So why do we need to organize it beforehand?
Well, it works very nicely for some applications, video processing, natural language processing, but in other areas it fails, because, first, we don't have the large amounts of data that deep learning needs, unfortunately. So deep learning might work in some cases, on some tasks, on some applications, but elsewhere you need something else.
Okay. Yeah. Let's keep this open for maybe some ideas from the other side.
So, just to add to this: I agree with Paul that it is not a fundamental revolution within the algorithm itself. Again, it's a cost proposition. The way I see this working, I can give you an example in the healthcare domain. You have, let's say, some huge hierarchy of diagnoses; ICD is pretty large, right? My data just doesn't cover all of those codes; there just aren't enough cases for each. What do I do?
Okay.
I can generalize, and with connected data that is very easy to do. If you're thinking of a hierarchy, and I'm not saying that you should always do this, you can move everything up one level in the hierarchy, and suddenly you have an order of magnitude denser data.
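Spyros's roll-up trick can be sketched like this (the codes below are toy ones shaped like ICD codes, not real entries): map each sparse leaf code one level up the hierarchy, so counts aggregate on the parent and the data becomes denser:

```python
from collections import Counter

# A toy code hierarchy: leaf code -> parent code (one level up).
parent = {
    "C50.1": "C50", "C50.2": "C50", "C50.9": "C50",
    "E11.2": "E11", "E11.9": "E11",
}

# Sparse observations: only a couple of cases per leaf code.
cases = ["C50.1", "C50.2", "C50.9", "C50.1", "E11.2", "E11.9"]

leaf_counts = Counter(cases)
rolled_up = Counter(parent[c] for c in cases)

print(leaf_counts)  # at most 2 cases per leaf code
print(rolled_up)    # denser counts at the parent level
```

The hierarchy that makes this one-line generalisation possible is exactly the kind of connected, curated knowledge the panel is discussing; without it, grouping sparse codes would itself be a modelling problem.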
So you're saying it's fundamentally the grouping capability of the models we typically develop, the taxonomic models, that is fundamental for aggregating enough data so that you can do the training.
In that use case, yeah.
Okay. So I would like to slowly start opening the floor for questions from the audience, but before that, let's quickly jump to the other side of this issue and think for a moment: in what sense does connected data need AI? We've been talking about these knowledge graphs, and of course, I guess we cannot expect everything to be created and built manually. So definitely linked data also needs AI to grow. Can you elaborate on this? Do you have any views on this problem?
Paul really touched upon this in his talk, really clearly: how AI could be used to help build the knowledge graph. And from an ontologist's perspective, I'll tell you that it's a really hard job to build these knowledge models and ontologies. It's really a way of thinking, a way of trying to get access to the knowledge that exists in people's heads or in books. There's just so much information, and you want to bring all that together and capture as much as you can, to create a model that is really reusable, that can solve your problems, and that you can apply algorithms and axioms to.
So it's a really complicated process, and that's why it takes years to build a medical ontology model; it's not an overnight thing. So anything that can help to even semi-automate this process is needed. And I think the whole NLP, machine learning, data extraction pipeline still needs human intervention to look at the model and make sure everything is correct, but it's a perfect first step.
I think one thing I didn't mention, and I think it's really important here: one of the reasons is that we have a limited number of domain experts, and we need to optimize their time. The same goes for people who do connected data. Finding ontologists is a hard problem, right? Hiring good ones is really difficult. So we want to optimize their level of effort, so that they're not just sitting there curating data. We want them designing the upper-level models.
We want them to make sure that the systems are designed appropriately. So I think we fundamentally have to have intelligence in order to just optimize our workforce correctly.
Okay.
Do we have any questions from the audience, please?
First of all, thanks, panel, a lot of healthy debate there. Unfortunately, I'm old enough to actually remember AI before the last AI winter, which was roughly the mid-90s through the early noughties. I attended a few conferences like this, maybe in the late 80s, early 90s, and we talked about connected data, we talked about semantic networks, we talked about machine learning. So I would like to throw this out to the panel: what has fundamentally changed since then?
Thank you.
I can give a quick answer. From my perspective, I think two things have changed: one is the availability of data, and the second is computational power. These two combined are what's making all these visions more possible today. At some point, if you had an expert system, that was AI for people at that time. Right now an expert system is just very basic first-order logic, and AI for people right now means autonomous cars or flying machines.
So the idea and the definition have completely changed, and what has empowered that is really the amount of data that's available and the computational power, which has increased so much. Yeah.
I would just say: examples of effectiveness and real-world applications. That makes it so much easier for organizations to really believe, right? And once you have belief, you can do a lot more. We just see it day to day; it's really hard to get around AI. My phone is going to be delivered tomorrow with facial recognition built in, right?
If it works. We'll see, right?
Maybe to add to this, regarding connected data and AI: maybe I missed it, but there has been no connected data revolution; there has been a connected data evolution. And this goes back to what I mentioned: we have more data, we have more computational power. And in the next five years, and the five years after that, and the five years after that, these are things that will have evolved further. In my opinion, nothing has fundamentally changed.
Also the impact of some success stories.
Nothing to add. Thanks.
Any other questions?
A question?
Hi.
So this is not so much a question as food for thought: trying to understand what gaps we really have, as a community, to work on in building these connections with AI and other big data technologies and so on, those connectors. I think it's not only a conceptual question of how those technologies work together, but also a practical one: how we create tools to enable other communities, like the AI community, to come and work together with us. So I just wanted to put that out there.
Okay, I did want to mention that. I think for us, the move around the big data space has really helped us begin our story around having a connected data space, and it's helped us drive development around AI. A practical application of this is that we have a data lake; it's up in the cloud.
Right.
We have an environment we use that's based on Spark, it's called Databricks, where all of our data scientists can go and start seeing all the data available for them to play with.
That's great, that's cool. They start training algorithms, and then they start asking, what the heck does this column mean? And then we have these big, long discussions on our data science mailing list, with somebody trying to figure out a particular Parquet file. And then what we're saying is: well, it would be a benefit to us internally to start having common definitions of key attributes in our data. Where should that live? We need, we should have, an ontology for that; we should have vocabularies around that. So in some sense, that trend has opened the eyes of at least that community within our organization to the need for better documentation and real definitions around important attributes and entities within the company.
So that's really where I think you end up coming to connected data, whether you like it or not.
Maybe.
So the question is: why would you not use a database to maintain all these definitions and so on? Why would you use linked data technologies, as opposed to adding an additional table to your data warehouse and saying, okay, this means that, that means that, and so on? You could do that.
And I would still call that connected data, right. I think the point is that starting to have those definitions is a key part of what you would advocate if you're an ontologist, right?
So connected data, you're saying, is more about, again, a philosophy of connecting pieces of information, rather than this or that data structure or format, right? Yeah.
I mean, I have this rant that I go on all the time: we start developers in our organization by teaching them about linked data or the notions of knowledge graphs, and then what ends up happening is that the developers want to perform yet another triple store evaluation, or they go off the deep end reading too much about OWL, and that's all. It's great that they're interested in this stuff, but the point is that it's more the conceptual framework, and choosing the right tool for the right job.
And somehow that kind of communication material, that kind of education material, is really lacking, especially when you look at, I guess, the semantic-web-style material that's out there, at least what I've seen. And if anybody wants to correct me, I'm happy to be corrected.
You mentioned tools; at the end of the day, if it's about practicality and you want to implement something, you have to choose a tool. Ontologies, and concepts and relationships at a high, abstract level: yes, it's super interesting, and we talk about it from a philosophical perspective. But really, when you want to use that notion, and here we come to linked data, and you want to implement it, it boils down to the tools. Which one do you want, right?
Yeah. You can pick and choose, but you really have to understand what each tool is capable of. And then there's a lot of discussion about why ontologies as opposed to databases, and one of the major answers is the maintenance of an ontology: when your data grows, when new concepts come in, redesigning a database is way more complicated than just adding some concepts and relationships to the graph model you have. I'm sorry, I'm an ontologist.
I have to kind of defend that, too.
What I would say is: for ontology maintenance, we're all in on SKOS and OWL, great for some of our stuff. Graph databases are great for a lot of our stuff. And I want a linked data API using JSON-LD, right, and I don't want to mess with anything else.
JSON-LD, which is another thing that grew out of the semantic web effort; annotations on the messages.
It keeps connecting. Okay.
Some of this afternoon's debates and panel sessions will pick up on the different standards and alternative approaches and so on. I realise we're eating into people's lunch, so, if you'll excuse the pun, a big thanks to Szymon for the semantic integration and facilitation of the panelists.