1. Transforming Access to Culture & History with Connected Data: The case of Europeana. Image: Anonymous, “Arrival of a Portuguese ship”, 1660 - 1625, Rijksmuseum, Netherlands, Public Domain.
2. Who we are. Europeana: Transforming the World with Culture. Image: Anonymous, “Elegant Party on a Terrace of a Venetian-inspired Setting”, 1615, Rijksmuseum, Netherlands, Public Domain.
3. Europeana: Cultural Heritage Metadata and Content from across Europe • We aggregate data from Cultural Heritage organisations across Europe • predominantly, but not only, in EU member states • In most cases we harvest only metadata • increasingly, however, we also host the content itself • We make it available through our portal site: https://www.europeana.eu/portal/en • ultimately linking back to the originating institution. Transforming Access to Culture & History with Connected Data, CC BY-SA
4. Europeana in numbers • 53 million+ items • 30+ languages • 4500+ GLAM (Galleries, Libraries, Archives, and Museums) institutions
5. Europeana as ‘Big Data’: analysed as the four ‘V’s • Volume: relatively low by Big Data standards (< 2 TB of metadata) • Velocity: continuous updating, flushed to the datastore every 15 minutes • Veracity: significant data-quality issues • Variety: immense • multiple languages • multiple formats • different institutions • etc. … extremely heterogeneous
6. Who they are. Users: what they want and what they do. Image: Ernest Rude, “Ernest Marini - dancer in a costume”, 1921, Oslo Museum, Norway, CC BY-SA.
7. Who are they? • “Culture vultures” • Academic researchers • Teachers and students • Visual artists • Graphic designers • Amateurs (in the original sense of the word) • “Culture snackers” • casual browsers looking for entertainment
8. What are they looking for? It seems literally impossible to say … • Query pattern is extremely flat • analysis of logs shows no search term shared by more than 6 users • further analysis needed here • “serendipity search” is important: users are trying to surprise themselves
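The log analysis mentioned on this slide could be approximated as follows. This is a minimal sketch, not Europeana's actual pipeline: the `(user_id, query)` row format, the sample rows, and the normalisation rule (trim and lowercase) are all assumptions made for illustration.

```python
from collections import defaultdict

def users_per_term(log_rows):
    """Map each normalised search term to the number of distinct users who issued it."""
    users = defaultdict(set)
    for user_id, term in log_rows:
        # Assumed normalisation: trim whitespace and lowercase the query.
        users[term.strip().lower()].add(user_id)
    return {term: len(ids) for term, ids in users.items()}

# Hypothetical log rows: (user_id, raw query string).
sample_log = [
    ("u1", "Rembrandt"),
    ("u2", "rembrandt "),
    ("u1", "rembrandt"),
    ("u3", "cycles nautiques"),
]

counts = users_per_term(sample_log)
# The largest number of distinct users sharing any single term.
max_shared = max(counts.values())
```

A "flat" query pattern in this sense means `max_shared` stays very small relative to the number of users, however large the log.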
9. What are they like? User engagement • Culture vultures • engagement is extremely high • mean rank of clicked items: 82 (!) • once an item is clicked in the search engine results page (SERP), session length can stretch into hours • Culture snackers • bounce rate difficult to estimate, but high (> 85%)
10. What are they doing? Making new stuff! • school reports • university essays • presentations • exhibitions • research papers • new artworks
11. Where we’ve been, where we’re going. Visions for cultural heritage and connected data, past and present. Image: Luigi Garzi, “The birth of Adonis and the transformation of Myrrha”, The Wellcome Library, United Kingdom, CC BY.
12. Original Vision: as Linked Open Data provider. Image: Linking Open Data cloud diagram 2011, by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
13. The original vision, today: continued contributions • Ontological modelling • Europeana Data Model (EDM) • expressed in RDF for data-model mediation • internationally shared (DPLA, BBC, etc.) • Served on our SPARQL endpoint • … but more frequently as JSON-LD over our APIs • plug plug: received API World award this year for best Data API
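To make the SPARQL route concrete, here is a minimal sketch that only builds a query string; it does not contact any endpoint. EDM reuses Dublin Core properties such as `dc:creator` and `dc:title`, but the exact graph pattern below is an assumption about how records are laid out, and `build_creator_query` and `escape_literal` are hypothetical helper names.

```python
DC_PREFIX = "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n"

def escape_literal(value):
    """Escape backslashes and double quotes for use inside a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')

def build_creator_query(creator, limit=10):
    """Build a SELECT query for titles of resources with the given dc:creator literal.

    The pattern is an assumed, simplified view of the data, not a documented
    Europeana query.
    """
    return (
        DC_PREFIX
        + "SELECT ?resource ?title WHERE {\n"
        + f'  ?resource dc:creator "{escape_literal(creator)}" ;\n'
        + "            dc:title ?title .\n"
        + "}\n"
        + f"LIMIT {int(limit)}"
    )

query = build_creator_query('Rembrandt "van Rijn"', limit=5)
```

Matching on a creator literal like this is exactly the kind of string-based lookup that the later "dirty data" slides warn about; it is shown here only to illustrate the shape of a query.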
14. LOD: New Directions. Structuring content through LOD harvesting (i) • “Entity-fication” • 70%-80% of our searches are for named entities • People • Places • Concepts (subject headings) • Information on these can be harvested from: • DBpedia • Wikidata • Geonames • …
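One way to picture “entity-fication” is a label index built from harvested entity records, against which incoming searches are resolved. The sketch below assumes a tiny in-memory harvest; the entity identifiers, the record layout, and the label lists are hypothetical and stand in for whatever DBpedia, Wikidata, or Geonames would actually supply.

```python
def build_label_index(entities):
    """Index every known label (in any language) to its entity identifier."""
    index = {}
    for entity_id, info in entities.items():
        for label in info["labels"]:
            index[label.casefold()] = entity_id
    return index

def resolve_query(query, index):
    """Return the entity a search string names, or None if it is not a known label."""
    return index.get(query.strip().casefold())

# Hypothetical harvested entities with illustrative identifiers.
harvested = {
    "agent:rembrandt": {"type": "Person", "labels": ["Rembrandt", "Rembrandt van Rijn"]},
    "place:amsterdam": {"type": "Place", "labels": ["Amsterdam", "Amsterdão"]},
}

label_index = build_label_index(harvested)
```

For example, `resolve_query("  REMBRANDT ", label_index)` resolves to `"agent:rembrandt"`, while a string that is not in the index returns `None` and would fall back to ordinary keyword search.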
15. LOD: New Directions. Structuring content through LOD harvesting (ii) • “Workification” (FRBR data model) • creating abstract artistic or intellectual entities from numerous instantiations • for example, the novel “Oliver Twist” from its many printed editions and translations • Harvested (or at least seeded) from OCLC and VIAF
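The “workification” idea can be sketched as grouping edition records under a work identifier supplied by a seed table. Everything here is illustrative: the seed table stands in for data that might come from OCLC or VIAF, and the work identifier, record fields, and sample editions are hypothetical.

```python
from collections import defaultdict

def normalise(text):
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.casefold().split())

def group_editions(editions, seed_titles):
    """Group edition ids under a work id looked up by (author, title) in the seed table."""
    works = defaultdict(list)
    for edition in editions:
        key = (normalise(edition["author"]), normalise(edition["title"]))
        # Editions with no seeded match are set aside for manual curation.
        works[seed_titles.get(key, "unassigned")].append(edition["id"])
    return dict(works)

# Hypothetical seed table: normalised (author, title variant) -> work id.
seeds = {
    ("charles dickens", "oliver twist"): "work:oliver-twist",
    ("charles dickens", "olivier twist"): "work:oliver-twist",
}

# Hypothetical edition records.
editions = [
    {"id": "e1", "author": "Charles Dickens", "title": "Oliver Twist"},
    {"id": "e2", "author": "Charles DICKENS", "title": "Olivier  Twist"},
    {"id": "e3", "author": "Charles Dickens", "title": "Bleak House"},
]

works = group_editions(editions, seeds)
```

The `"unassigned"` bucket reflects the point made later in the deck: automatic grouping only goes so far, and the remainder needs supervised curation.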
16. LOD: New Directions. Structuring content through LOD harvesting (iii) • Knowledge Graphs linking … • authors to works • artists to their paintings, and other artists • concepts to concepts • … • Obvious applications • educational • research • “serendipity” • improved “snacker” engagement
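A small sketch of how such a graph could support “serendipity”: store the links as triples and collect everything within a few hops of the item a user is viewing. The artwork titles are taken from the case-study slide; the predicate names (`created`, `depicts`) and the undirected traversal are assumptions for illustration, not a description of the production Neo4J setup.

```python
from collections import defaultdict, deque

def build_graph(triples):
    """Build an undirected adjacency map from (subject, predicate, object) triples."""
    graph = defaultdict(set)
    for subject, _predicate, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph

def within_hops(graph, start, max_hops):
    """Breadth-first search returning {node: distance} for nodes within max_hops of start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    seen.pop(start)
    return seen

# Illustrative triples; predicate names are assumed.
triples = [
    ("Rembrandt van Rijn", "created", "Self-portrait"),
    ("Rembrandt van Rijn", "created", "The Great-Mughal Jahangir"),
    ("The Great-Mughal Jahangir", "depicts", "Jahangir"),
    ("Prince Salim, the future Jahangir, Enthroned", "depicts", "Jahangir"),
]

graph = build_graph(triples)
nearby = within_hops(graph, "Self-portrait", 2)
```

Raising the hop limit widens the suggestions: in this toy graph, four hops from the self-portrait reach the anonymous portrait of Prince Salim.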
17. Case Study: Linking Rembrandt to Jahangir • https://www.thetimes.co.uk/article/from-rhinos-to-rembrandt-how-india-inspired-the-world-hdsr8kls5 Images: “Self-portrait” (Rembrandt van Rijn), “The Great-Mughal Jahangir” (Rembrandt van Rijn), and “Prince Salim, the future Jahangir, Enthroned” (Anonymous), all in the public domain.
18. How we do it. Technical stack. Image: Agence de presse Meurisse, “Concours de cycles nautiques sur le lac d’Enghien : Berregent piloté par Austerling”, 1914, National Library of France, France, Public Domain.
19. The webapp stack • Data ingestion: Java + XSLT behemoth • Data enrichment: Java • Source-of-truth datastore: MongoDB • Information retrieval: Solr + Neo4j • API: Swagger with Java • UI: JS, variety of libraries • SPARQL endpoint: Virtuoso
20. Reality check. Where we are and how fast we can go. Image: Hendrik Goltzius, “Le dragon dévorant les compagnons de Cadmus”, 1588, Bibliothèque municipale de Lyon, France, Public Domain.
21. Dirty Data (i): Source data • getting from strings to things is a non-trivial process • Named Entity Recognition technology relatively unhelpful in this domain • exact-string matching only: precision good, but recall poor • the data are strongly multilingual • Limited number of tools to help with cleaning, enhancing, validating this data • OpenRefine potentially helpful • ShEx, SHACL not yet fully mature
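The precision/recall trade-off of exact-string matching can be shown with a toy evaluation. This is a sketch under stated assumptions: the vocabulary, the gold-standard mapping, and the entity identifiers are invented for illustration and are not Europeana data.

```python
def exact_match(labels, vocabulary):
    """Link a label to an entity only when the string matches a vocabulary entry exactly."""
    return {label: vocabulary[label] for label in labels if label in vocabulary}

def precision_recall(predicted, gold):
    """Precision: share of predicted links that are correct. Recall: share of gold links found."""
    correct = sum(1 for label, entity in predicted.items() if gold.get(label) == entity)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical controlled vocabulary with one preferred label per entity.
vocabulary = {"Rembrandt van Rijn": "agent:1", "Amsterdam": "place:1"}

# Hypothetical gold standard: name variants as they might appear in source metadata.
gold = {
    "Rembrandt van Rijn": "agent:1",
    "Rembrandt Harmensz. van Rijn": "agent:1",
    "Rijn, Rembrandt van": "agent:1",
    "Amsterdam": "place:1",
}

predicted = exact_match(gold.keys(), vocabulary)
precision, recall = precision_recall(predicted, gold)
```

In this toy case every link that is made is correct, but half of the name variants are never linked at all, which is the pattern the slide describes.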
22. Dirty Data (ii): Linked Data resources • Irregular data models • Large number of Wikidata, DBpedia properties applied irregularly • “defensive querying” • Incorrect data • more often questions of structure than inaccurate field values • e.g. Geonames hierarchies • Uncurated or aggregated data • e.g., many variants provided by VIAF
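“Defensive querying” can be read as never trusting a single property, or a single value shape, to be present. The sketch below is one interpretation, assuming harvested records arrive as plain dictionaries; the property names and the sample record are illustrative and `get_values` is a hypothetical helper.

```python
def get_values(record, *property_names):
    """Return the non-empty values of the first property that has any.

    Tolerates missing properties, empty strings, and values that may be
    either a single item or a list.
    """
    for name in property_names:
        raw = record.get(name)
        if raw is None:
            continue
        values = raw if isinstance(raw, list) else [raw]
        cleaned = [value for value in values if value not in ("", None)]
        if cleaned:
            return cleaned
    return []

# Hypothetical record in which the same fact is spread across two properties.
record = {"birthPlace": "", "placeOfBirth": ["Leiden", ""]}

birth_places = get_values(record, "birthPlace", "placeOfBirth")
death_places = get_values(record, "deathPlace")
```

Listing fallback properties in priority order keeps the irregularity of the source in one place instead of scattering checks through the enrichment code.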
23. Directions forward: dealing with dirty data • Manual or at least heavily supervised curation will be required for the foreseeable future • Tools to aid NER and entity-matching are the focus of two US efforts: • Institute of Museum and Library Services (IMLS) Local Authority Files project • Linked Data for Libraries Reconciliation Service Group • Work division • devolution to partners • crowdsourcing
25. Panel: Linked Open Data - is it failing or just getting out of the blocks? Tweet your questions via Direct Message to @Connected_Data or #ConnectedData • Moderator: James Phare, Connected Data London • Panelist: Chris Taggart, CEO, OpenCorporates, @CountCulture • Panelist: Chris Gutteridge, Linked Open Data Architect, University of Southampton • Panelist: Leigh Dodds, Data Infrastructure Programme Lead, Open Data Institute, @ldodds • Panelist: Sebastian Hellmann, Executive Director and Board Member, DBpedia