In the realm of data-driven businesses, formalized knowledge is a valuable resource for AI projects, but it is created at great expense.
IATE, with almost one million concepts storing multilingual terms and metadata, holds a large part of the textual knowledge of the EU. However, it can only be accessed lexically, and its concepts stand alone, with no links between them.
Taxonomization means linking a flat set of concepts into a hierarchical knowledge graph. If IATE were converted into a full-fledged ontology, its data could not only be consumed by linguists, but would also become accessible to machines, e.g. through a SPARQL endpoint.
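The linking step described above can be illustrated with a minimal Python sketch. It is not the method presented in the talk: the concept IDs, the labels, and the `broader` links are hypothetical examples rather than real IATE entries, and the helper names `ancestors` and `descendants` are assumptions made for illustration only.

```python
# Hypothetical flat concept list: each concept stands alone and is only
# reachable by looking up its terms. IDs and labels are illustrative.
concepts = {
    "c1": {"en": "disease", "de": "Krankheit"},
    "c2": {"en": "infectious disease", "de": "Infektionskrankheit"},
    "c3": {"en": "COVID-19", "de": "COVID-19"},
    "c4": {"en": "vaccine", "de": "Impfstoff"},
}

# Taxonomization adds "narrower -> broader" links between concepts.
# These links are assumed for the example.
broader = {"c2": "c1", "c3": "c2"}


def ancestors(concept_id):
    """Return the chain of broader concepts, nearest first."""
    chain = []
    seen = set()
    current = concept_id
    # The 'seen' set guards against accidental cycles in the links.
    while current in broader and current not in seen:
        seen.add(current)
        current = broader[current]
        chain.append(current)
    return chain


def descendants(concept_id):
    """Return all concepts that sit below concept_id in the hierarchy."""
    return [c for c in concepts if concept_id in ancestors(c)]


def label_path(concept_id, lang):
    """Render the path from the top concept down to concept_id in one language."""
    path = list(reversed(ancestors(concept_id))) + [concept_id]
    return " > ".join(concepts[c][lang] for c in path)


print(label_path("c3", "de"))
```

Once the links exist, a machine can answer questions that a purely lexical lookup cannot, such as "which concepts fall under disease?", and it can do so in any language the concepts carry terms for. In this sketch "c4" has no links yet, so it remains a stand-alone concept.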
In this talk, we will present our approach to the semi-automatic generation of taxonomized concept maps, elevating a sub-domain of IATE terminology into a multilingual knowledge graph. We taxonomized a flat list of concepts within the COVID sub-domain, benchmarking two approaches to this task: automatic concept map creation using an enhanced ML-powered language model, and manual creation of the graph by an expert linguist.
We will discuss the performance and resource-saving advantages of our collaborative method, made easy by Coreon's user-friendly UI, and show how the productivity rate achieved can make the taxonomization of even larger terminology databases economically viable.
To demonstrate the effectiveness of the semi-automatic approach empirically in a typical industry use case, the resulting IATE/Covid graph was used to initialize a CNN for a multilingual document classification task. Leveraging the created taxonomy, we achieved a classification granularity that state-of-the-art models, such as non-initialized CNNs and zero-shot classifiers, cannot reach.