22 October 2018

Semantics, meet Data Science: GraphDB adds support for data wrangling and similarity search

With knowledge graph hype in the air, more people than ever want to find out how and why the Googles, Airbnbs and Ubers of the world are using knowledge graphs, and how such practices can be adopted elsewhere. With hype usually comes confusion, and knowledge graphs are no exception. On a superficial level, using a graph database to power a knowledge graph sounds like a good idea.


Digging a little deeper, however, it turns out that not everyone means the same thing when referring to knowledge graphs. If you are interested in knowledge graphs as originally conceived, namely as a semantics-rich graph structure governed by controlled vocabularies, then RDF graph databases, aka triple stores, are probably what you should be looking at.


One of the more prominent solutions in this space is GraphDB by Ontotext. GraphDB has been around since 2004 and is used in a variety of projects and organizations. Keeping up with the times is a must if you want to be around for as long as GraphDB has, and the latest additions to GraphDB’s repertoire are data wrangling and similarity search.


Data wrangling and semantic reconciliation


Data wrangling is the least sexy part of what is supposed to be the sexiest job these days: data science. The truth is a big part of what you do as a data scientist is data wrangling, which is another term for trying to come to terms with messy data. That does not sound like fun, and in all honesty, it’s not. Plus, less time spent on data wrangling means more time spent on solving business problems.


This is where tools for helping with data wrangling come in handy. One of the most popular tools in this category is OpenRefine. OpenRefine was originally released by Google, and subsequently open sourced. It has now been integrated with GraphDB. OpenRefine offers many capabilities that let users clean up data and apply filters, facets and transformations.


All of these are tasks data scientists typically perform by writing code. Using OpenRefine, however, makes things a bit simpler and more streamlined for typical data wrangling tasks. This lowers the barrier to data wrangling, meaning less senior or technically inclined team members can engage in it. In addition, a valuable feature of OpenRefine is the ability to export transformations in a standard format, and replay them.


Perhaps the most interesting thing OpenRefine brings to the table for GraphDB, however, is data reconciliation. When ingesting RDF documents, the ability to parse documents and identify entities (named entity recognition) and link information related to, or characterizing, those entities adds value by enriching connections and references.



The latter part, linking entities with information about them contained in controlled vocabularies and big public knowledge graphs such as Wikidata, is called semantic reconciliation. It adds value by enriching available information on those entities, and facilitating master data management, navigation and exploration.


OpenRefine comes with preconfigured access to some vocabularies, but it can also be additionally configured to access any vocabulary. By linking entities to a controlled vocabulary, they not only become unambiguously defined by means of URI reference, they also become associated with and gain access to properties of those entities, as well as their connections. This enables further exploration and association.
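
As a rough illustration of what reconciliation does, the sketch below matches free-text cell values against a tiny controlled vocabulary and returns a URI for each match. The vocabulary, the `example.org` URIs and the `reconcile` helper are made up for this illustration; real reconciliation services, including those used by OpenRefine, rank candidates with far richer signals than the plain string similarity used here.

```python
import difflib

# Hypothetical controlled vocabulary: preferred label -> URI.
# The example.org URIs are placeholders, not real identifiers.
VOCABULARY = {
    "Seoul": "http://example.org/place/seoul",
    "Pyongyang": "http://example.org/place/pyongyang",
    "Korea": "http://example.org/place/korea",
}


def reconcile(value, vocabulary=VOCABULARY, cutoff=0.8):
    """Return (label, uri) for the closest vocabulary label, or None."""
    # Compare case-insensitively so that "seoul" and "Seoul" match.
    by_lower = {label.lower(): label for label in vocabulary}
    matches = difflib.get_close_matches(
        value.strip().lower(), list(by_lower), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    label = by_lower[matches[0]]
    return label, vocabulary[label]


if __name__ == "__main__":
    # A clean value, a misspelled value, and a value outside the vocabulary.
    for cell in ["seoul", "Pyonyang", "Tokyo"]:
        print(cell, "->", reconcile(cell))
```

Once a cell is tied to a URI rather than a string, a misspelled and a correctly spelled value resolve to the same entity, which is what makes the further exploration and association described above possible.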


Atanas Kiryakov, Ontotext CEO and Founder, highlights a not-so-obvious aspect of this integration:


“One interesting thing about the way we integrated OpenRefine is that GraphDB makes the tabular data visible through a virtual SPARQL endpoint. One can explore it with SELECT queries and experiment with graph representations with CONSTRUCT queries. At the end, the data gets ingested in a concrete RDF repository via an INSERT query. Of course, such a query is automatically generated, for those who don’t want to be bothered.


This approach gives a very flexible, but also very standard, mechanism to define the transformation of the table into a graph. This is the same approach taken in an open source tool called TARQL. But when packed with OpenRefine in front, it becomes as easy to use and as visual as importing CSV in Excel”.
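
To make the table-to-graph idea concrete, here is a minimal sketch in plain Python rather than SPARQL. It does by hand what a CONSTRUCT query over tabular data expresses declaratively: each row becomes a resource, and each column becomes a predicate. The column names, the `http://example.org/` namespace and the `rows_to_ntriples` helper are assumptions for illustration; this is not GraphDB's or TARQL's code.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace

# Hypothetical tabular input, as it might appear in a CSV file.
CSV_TEXT = """id,name,country
1,Seoul,South Korea
2,Pyongyang,North Korea
"""


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_ntriples(csv_text, key_column="id"):
    """Turn each CSV row into one N-Triples line per non-key column."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}city/{row[key_column]}>"
        for column, value in row.items():
            if column == key_column or not value:
                continue  # the key is already part of the subject URI
            predicate = f"<{BASE}{column}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


if __name__ == "__main__":
    print("\n".join(rows_to_ntriples(CSV_TEXT)))
```

The appeal of the query-based approach described in the quote is that this mapping lives in a standard, declarative query instead of custom code like the above.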


Similarity search via semantic vectors and embedding


With data ingested in the database, the ability to search efficiently in a potentially vast collection is crucial. Fast key and property search is a sine qua non, and with the help of additional indexes and tools such as Lucene, Solr and Elasticsearch, SPARQL in GraphDB also supports full-text search.


Now GraphDB is taking things up a notch, adding support for semantic similarity search by integrating the Semantic Vectors library. Semantic Vectors enables search to retrieve results that go beyond exact matching, or even beyond full-text search backed by natural language processing techniques.


With Semantic Vectors, search can be expanded to results that are completely unrelated in terms of word similarity, but related on the semantic level. To use an example from GraphDB’s documentation: would you say Pyongyang and Seoul are related terms when looking for results on Korea? Would you like your results to include them?


The answer is probably yes, but even with the more advanced full-text search techniques, retrieving them would not be possible. This is why Semantic Vectors relies on a different technique, known as word embedding.



Without getting too much into the details, the key idea behind these techniques is representing words as multi-dimensional vectors. Semantic Vectors uses a Random Projection algorithm to achieve what word2vec achieves with neural networks: representing words in a low-dimensional vector space. By doing so, the theory goes, we can use the distance between vectors as a signal for the semantic relation between words. So with a sound mapping of words to vectors, the vectors for Korea, Pyongyang and Seoul should be close to each other, and the corresponding terms should therefore be treated as related.
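
The idea can be sketched in a few lines of Python. The toy example below follows the random-indexing flavour of random projection: every document gets a random vector of +1/-1 entries, and a term's vector is the sum of the vectors of the documents it appears in, so terms that share documents end up pointing in similar directions. The corpus, the dimension and all function names are assumptions made for this illustration; this is not the Semantic Vectors library's code.

```python
import math
import random

# Hypothetical toy corpus: each document is a bag of terms.
DOCUMENTS = [
    ["korea", "seoul", "pyongyang"],
    ["korea", "seoul"],
    ["korea", "pyongyang"],
    ["banana", "fruit"],
    ["banana", "smoothie"],
]


def build_term_vectors(documents, dim=256, seed=42):
    """Sum a random +1/-1 vector per document into each term it contains."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    term_vectors = {}
    for doc in documents:
        doc_vector = [rng.choice((-1, 1)) for _ in range(dim)]
        for term in set(doc):
            vec = term_vectors.setdefault(term, [0] * dim)
            for i, x in enumerate(doc_vector):
                vec[i] += x
    return term_vectors


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, near 0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    vectors = build_term_vectors(DOCUMENTS)
    for other in ("seoul", "pyongyang", "banana"):
        score = cosine(vectors["korea"], vectors[other])
        print(f"korea ~ {other}: {score:.2f}")
```

In this toy setting the vectors are longer than the corpus has documents, so nothing is actually being compressed. With a real corpus of millions of documents, a fixed dimension of a few hundred is what makes the representation low-dimensional.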


Embedding techniques are widely used in NLP and machine learning, as they can help leverage semantic similarity without explicitly encoding it as such. There are certain assumptions and pitfalls that apply when using embedding, as with all techniques. But the ability to close in on semantic search from two different directions is very promising.


In GraphDB, embedding similarity scores can be used for search both via a GUI and in SPARQL queries, and can apply to whole documents or to specific terms.


Going forward


Both data wrangling and embedding are key additions for GraphDB today, but they may also lay the groundwork for more features to be built on top of them in the future.


Data wrangling is a chronic pain point for everyone who works with data. For semantically rich data in particular, the ability to suggest vocabulary groundings can save a lot of time in data ingestion, and at the same time help data quality.


Embedding is a key technique for NLP and machine learning, and it also helps entity matching and reconciliation. Techniques of this kind will be used to offer more advanced and efficient reconciliation, on top of what OpenRefine gives you already.


Having the infrastructure to work with this opens the door to integrating more such features in the future. Combining semantics and machine learning is what top organizations are doing anyway.


Vassil Momtchev, GraphDB product owner, notes that GraphDB stayed strictly in symbolic semantics for a long time, and that releasing reconciliation and semantic vectors is a serious step towards statistical semantics:


“Unlike competitors who simply integrate machine learning we leverage our model which can differentiate strings (literals) from things (resources). Semantic vectors can find similar resources based on their text description – an algorithm quickly learning the significant words and their semantics from the current knowledge base.


Reconciliation uses the information schema to suggest similar entities based on their semantic contexts well beyond the simple string matching.


The challenge we try to solve is to give tools, well integrated with GraphDB, to maintain knowledge graphs at scale so we can lower the price and effort required to operate such systems”.



Kiryakov, on his part, notes:


“Why is reconciliation important? In our experience the biggest value of knowledge graphs in big enterprises is to use them as a data integration or warehousing platform. So, it is key to offer reconciliation that is efficient enough to make data integration via knowledge graphs work better than record linking and similar offerings of the mainstream vendors of master data management technology.


Reconciliation is a very important direction for us and we will be announcing more and more features to support this and make more complex scenarios possible. For example, reconciliation against datasets already loaded in the same or a different instance of GraphDB.


This is useful when you want to integrate multiple datasets into a single knowledge graph. Then you can load the first one, set up an OpenRefine reconciliation service on top of it and use it for entity matching when ingesting the next source. This way reconciliation is no longer a matter of linking entities to controlled vocabularies, but rather a new flavor of data integration.


Why is semantic search important? Because when you have piled together lots of diverse data in a knowledge graph, consuming the data only using a structured query language such as SPARQL becomes cumbersome. Figuring out which classes, predicates and graph patterns to use to get to specific information becomes too hard. So, the bigger and more diverse a knowledge graph is, the smaller the fraction of its value that can be used via SPARQL.


This is why we will be extending our offerings related to embeddings and semantic search. There is a lot more that can be done to be able to extract deeper insights, relationships, patterns, etc”.


People who work with knowledge graphs at scale need all the help they can get, and such features should be welcomed by GraphDB users.


To experience GraphDB first hand, join us at Connected Data London. Kiryakov will be discussing how Analytics on Big Knowledge Graphs deliver entity awareness and help data linking, and the GraphDB crew will be there as well.


