Things should be as simple as possible, but not simpler. As JSON and the applications that use it have proliferated, some of the ways people use JSON have come into question. Does having a schema hinder application development? And what happens when applications grow beyond their initial scope, team members come and go, and data has to be used in ways not originally anticipated?
Our panel of experts, moderated by George Anadiotis, addressed these questions and explored a whole range of related topics. The conversation, featuring Kurt Cagle, Managing Editor of Data Science Central; Brian Platz, co-founder and CEO of Fluree; and Benjamin Young, Principal Architect at John Wiley and Sons and co-chair of the W3C JSON-LD Working Group, will be released soon. Here are the takeaways.
The traits JSON brought to application development are also related to the NoSQL wave of databases. Until the early 2010s, relational databases were practically the only game in town. Relational databases need a schema for the tables used to store application data. There are some intermediate layers involved in "translating" data from the database to a format applications can work with, and schema changes have to propagate through those.
This makes the process cumbersome. Would it not be easier and faster if data could be stored directly using the format applications work with? Indeed, it would. Since that format increasingly was JSON, solutions for storing JSON (aka "document stores") started popping up. Today document stores are among the most popular database systems.
But what about schema? Using JSON and document stores, a schema is not strictly necessary. Document stores are able to store JSON as-is. JSON can be created and parsed without conforming to a schema, and not having one can give the impression of speeding things up. But is there really such a thing as a schemaless application?
"Do you or don't you need a schema? I think the answer is you always have a schema. There is no application that doesn't have a schema. The question is just whether the schema exists in your proprietary application code, or whether that schema resides with the data itself. Your application needs to know what it's dealing with at the end of the day", said Platz.
While this is a valid point, sometimes application developers unwittingly cut corners. This may be because they are under pressure to deliver, or because they don't see the long-term implications. If they did, they would probably realize that a little extra effort upfront can save a lot of effort later on.
"When you code something up for yourself and you're like, oh, this is promising. I should give it away. I'll put it on the web and people can see it. The first ten people that come can't make any sense of it, and you spend all your time explaining. This is what I meant. I'll write a better readme. I'll explain how to install it.
"These are the steps that essentially are missing from most applications in terms of semantics. We've created the data and we've talked to ourselves. But we haven't necessarily done the steps of how do we take that data into the wider world so people can understand it and remix it", said Young.
A little schema goes a long way towards serving each and every application, that much should be clear. But what about having to deal with a multitude of applications? This is what Young hinted at when he talked about "remixing" data.
JSON does actually have a way to implement a schema, aptly named JSON Schema. Using it enforces constraints on the shape of JSON documents: what values each key can take, how many of each value there can be, whether a value is a list, whether the list can hold more than five items, whether different terms are allowed. These are all things that can be specified using JSON Schema.
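As a sketch, a JSON Schema for a hypothetical contact record might express constraints like these (the property names here are invented for illustration):

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "emails": {
      "type": "array",
      "items": { "type": "string", "format": "email" },
      "maxItems": 5
    }
  },
  "required": ["name"]
}
```

A validator would then reject any document that omits `name` or puts more than five entries in `emails`, which is exactly the kind of contract the next paragraph describes.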
A JSON schema serves as a contract for the data each application manages. In certain scenarios, however, data will need to go beyond the confines of the application that generated it. Those scenarios are typically encountered at the enterprise level.
"Developers are used to working on the application, where they control all levels of the interaction with the database. We're now entering the stage where that's no longer true; where in fact, the definitions that you have with enterprise data are very different from what your typical developer is going to come up with. Part of the difficulty in building out enterprise scale data systems comes down to the fact that ontologically they're a mess.
There is the difficulty in being able to say, okay, we have SAP here, we have custom built over there, we've got Azure content coming from there in SharePoint, and each of them basically describes their own data sets. Each of them has been working on the underlying assumption that they don't need to share data, and that assumption is now colliding with the fact that when you start talking about language of any sort, you need a way of describing commonality", said Cagle.
This is where JSON-LD comes into play, as a way to integrate multiple schemas. A good way to explain JSON-LD would be to point to its most prominent use - schema.org. This is an initiative undertaken by major search engines, with the aim of semantically enhancing web content. A vocabulary has been developed by schema.org, and by using JSON-LD to annotate web content, people can point to schema.org to specify the meaning of terms.
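A typical schema.org annotation embedded in a web page looks roughly like this (the values are made up; `Person`, `name`, and `jobTitle` are actual schema.org terms):

```json
{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "Jane Doe",
  "jobTitle": "Editor"
}
```

On a real page this snippet would usually sit inside a `<script type="application/ld+json">` element, where search engines can pick it up.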
Let's take a simple example: identifying people's names and surnames. JSON-LD uses so-called contexts to map terms such as "surname" to a global namespace. In the case of schema.org, "surname" can be mapped to schema.org/familyName. When integrating data from all over the web, which is what search engines do, some people may use the term "surname" and some may use the term "last name".
Not everyone is going to use the same terms, and we can't make them do that. But what we can do using JSON-LD is provide context documents that map terms to something which helps disambiguate meaning. So if "surname" and "last name" are both mapped to schema.org/familyName, search engines know what they mean. Users apply JSON-LD annotations because there's something in it for them too: better ranking in search engine results.
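Concretely, the two documents below (shown as an array purely for side-by-side comparison; each element stands for a separate document from a different source) use different terms, but their contexts map both terms to the same schema.org IRI, so a JSON-LD processor treats them as saying the same thing:

```json
[
  {
    "@context": { "surname": "https://schema.org/familyName" },
    "surname": "Doe"
  },
  {
    "@context": { "last_name": "https://schema.org/familyName" },
    "last_name": "Doe"
  }
]
```

After JSON-LD expansion, both documents yield a value for `https://schema.org/familyName`, and the local naming difference disappears.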
The same principle can be applied at the enterprise level too. The difference is that instead of annotating terms on web pages and letting search engines scrape them, JSON-LD can be used to disambiguate terms used in JSON documents. Organizations add new applications as business needs dictate, but with each new application the complexity of the organization’s data landscape increases.
Young extended the previous example: let's imagine we need to integrate contact data from every contact system on the planet. They're not all going to follow the same schema, and we can't make them do that. But what we can do is provide context documents that map each application's schema terms to a vocabulary that helps disambiguate meaning. Then data can be integrated in a central store, or even queried in a distributed way.
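A context document for one such hypothetical contact system might look like this, mapping that application's own field names (invented here for illustration) onto schema.org terms, with `@vocab` as a fallback namespace for anything unmapped:

```json
{
  "@context": {
    "@vocab": "https://example.com/crm/vocab/",
    "fname": "https://schema.org/givenName",
    "lname": "https://schema.org/familyName",
    "mail": "https://schema.org/email"
  }
}
```

Each contact system gets its own context along these lines, and once every system's data is expanded against its context, the records share a common vocabulary and can be merged or queried together.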
As Platz pointed out, it's all about interoperability. Chief data officers and data scientists need to integrate the data their organizations use, internal and eventually external too. Interoperability is needed to build new applications and enable insights. Analytics and machine learning need integrated, quality data to work with.
Traditionally, the solution to this has been data warehouses. But data warehouses are expensive and they don't solve the data integration problem, because they are like copy-pasting on a mass scale. A better solution would be composable data that can be aggregated on the fly, and using JSON-LD for data integration is a step in that direction.
Cagle described schema.org as the pidgin of the web. It may not be a perfect vocabulary, but if you can express something in schema.org, chances are that someone else can use it. Schema.org may not necessarily be a good match for every enterprise's needs. In all likelihood, each organization will have to develop its own vocabulary that better captures its own needs.
But the mechanism of using enterprise vocabularies with JSON-LD for data integration can work in the same way search engines use JSON-LD to integrate web content with schema.org. In that sense, JSON-LD may be called the pidgin of enterprise data integration, in the same way schema.org has been called the gateway drug of linked data.
This is why Fluree has officially added support for JSON-LD. Fluree is an immutable, ledger-backed graph database, enabling Git-like control over data. Most databases sit behind an application and are protected by firewalls, but they're also very vertically integrated into the application. Platz thinks this is why we're having such a hard time aggregating and collaborating around data.
Fluree has always supported JSON for transacting, but never formally JSON-LD. Fluree's journey started back in 2016, and at that time the importance of JSON-LD wasn't obvious to Platz. It is now, however, especially with its underpinning concepts around decentralized identifiers and verifiable credentials.
“What JSON-LD does is it allows developers to have that ease that you would get from a document store, but it also brings a layer of relationships or references into it. You get to the point where you have a fully connected graph and all the power of a relational database into the simplicity of a document store.
“And then you also have the potential of the future of interoperability. We talked about schema.org as an example. It's really hard to model data and to come up with a schema from scratch. There are great vocabularies out there that can save you a lot of time, and JSON-LD is the gateway to using them”, said Platz.
JSON is the de facto data format for developers today because it’s easy to use, but it’s not without its issues. JSON-LD builds on top of JSON, facilitating enterprise data integration.
Pidgin (noun): A simplified speech used for communication between people with different languages.