Semantic Web and Linked Data. Corrections and additions

I want to present to the attention of the public a fragment of this recently published book:

Ontological modeling of the enterprise: methods and technologies [Text]: monograph / [S. V. Gorshkov, S. S. Kralin, O. I. Mushtak, et al.; executive editor S. V. Gorshkov]. - Yekaterinburg: Ural University Press, 2019. - 234 p.: ill., tab.; 20 cm. - Authors listed on the reverse of the title page. - Bibliography at the end of chapters. - ISBN 978-5-7996-2580-1: 200 copies.

The purpose of laying out this fragment on Habré is fourfold:

  • It is unlikely that anyone will be able to hold this book in their hands unless they are a client of the respected SergeIndex; it is definitely not for sale.
  • Corrections have been made to the text (they are not highlighted below) and additions have been made that are not very compatible with the format of the printed monograph: topical notes (under spoilers) and hyperlinks.
  • I want to collect questions and comments, to take them into account when this text is included, in revised form, in any other editions.
  • Many Semantic Web and Linked Data adherents still believe that their circle is so narrow mainly because the general public has not yet been properly told how great it is to be a Semantic Web and Linked Data adherent. The author of this fragment, although he belongs to this circle, does not share that opinion, but nevertheless considers himself obliged to make another attempt.

So,

Semantic Web

The evolution of the Internet can be represented as follows (or talk about its segments formed in the following order):

  1. Documents on the Internet. Key technologies: Gopher, FTP, etc.
    The Internet is a global network for exchanging local resources.
  2. The Internet of documents. Key technologies: HTML and HTTP.
    The nature of the exposed resources now takes into account the characteristics of the medium that transmits them.
  3. Data on the Internet. Key technologies: REST and SOAP APIs, XHR, etc.
    The era of Internet applications; people are no longer the only consumers of resources.
  4. The Internet of data. Key technologies: Linked Data technologies.
    This fourth stage, predicted by Berners-Lee, the creator of the key technologies of the second stage and director of the W3C, is called the Semantic Web; Linked Data technologies are designed to make data on the web not only machine-readable, but also "machine-understandable".

From what follows, it will become clear to the reader that the key concepts of the second and fourth stages correspond:

  • URLs have URIs as their analogue,
  • HTML has RDF as its analogue,
  • HTML hyperlinks have occurrences of URIs in RDF documents as their analogue.

The Semantic Web is more a systematic vision of the future of the Internet than a specific spontaneous or lobbied trend, although it is able to take the latter into account as well. For example, an important feature of what is called Web 2.0 is considered to be "user-generated content". The W3C recommendation "Web Annotation Ontology" and undertakings such as Solid are, in particular, called upon to take it into account.

Is the Semantic Web dead?

If you discard unrealistic expectations, the situation with the Semantic Web is about the same as with communism in the days of developed socialism (and let everyone decide for themselves whether loyalty to the conditional precepts of Ilyich is observed). Search engines quite successfully force websites to use RDFa and JSON-LD, and themselves use technologies related to those described below (Google Knowledge Graph, Bing Knowledge Graph).
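As an illustration, the kind of JSON-LD markup that search engines encourage sites to embed might look like this (the shape follows schema.org; the values are taken from this book's own imprint above):

```json
{
  "@context": "https://schema.org",
  "@type": "Book",
  "name": "Ontological modeling of the enterprise: methods and technologies",
  "editor": { "@type": "Person", "name": "S. V. Gorshkov" },
  "datePublished": "2019",
  "isbn": "978-5-7996-2580-1"
}
```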

In general terms, the author cannot say what prevents wider adoption, but he can speak from personal experience. There are tasks that would be solved "out of the box" had the SW taken hold, although they are not very widespread. As a consequence, those who have these tasks have no means of coercion against those who are able to provide a solution, while providing such a solution on their own initiative contradicts the latter's business models. So we continue to parse HTML and glue together various APIs, each one shittier than the last.

However, Linked Data technologies have spread beyond the mass web; the book is, in fact, devoted to their applications. Currently, the Linked Data community expects these technologies to become even more widespread thanks to Gartner fixing (or proclaiming, whichever you like) trends such as Knowledge Graphs and Data Fabric. One would like to believe that it is not "bicycle" implementations of these concepts that will succeed, but those related to the W3C standards discussed below.

Linked Data

Berners-Lee defined Linked Data as the Semantic Web "done right": a set of approaches and technologies for achieving its ultimate goals. Berners-Lee singled out the following basic principles of Linked Data.

Principle 1. Using URIs to name entities.

URIs are global entity identifiers, as opposed to local string identifiers of records. This principle later found its best expression in the Google Knowledge Graph slogan "things, not strings".

Principle 2. Using URIs in the HTTP scheme so that they can be dereferenced.

By dereferencing a URI, it should be possible to get the signified behind that signifier (an analogy with the "*" operator in C); more precisely, to get some representation of that signified, depending on the value of the Accept: HTTP header. Perhaps with the advent of the AR/VR era it will become possible to get the resource itself, but for now it will most likely be an RDF document that is the result of a SPARQL DESCRIBE query.
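For example (the URI below is hypothetical), a server resolving such a request with Accept: text/turtle might answer with the result of a query like:

```sparql
# Hypothetical: the representation a server might return when a client
# dereferences <http://example.com/persons/alice>
DESCRIBE <http://example.com/persons/alice>
```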

Principle 3. Use of W3C standards - primarily RDF(S) and SPARQL - in particular when dereferencing URIs.

These individual "layers" of the Linked Data technology stack, also known as the Semantic Web Layer Cake, will be described below.

Principle 4. Using references to other URIs when describing entities.

RDF allows you to limit yourself to a verbal description of a resource in natural language, and the fourth principle calls for not doing so. If the first principle is universally observed, it becomes possible, when describing a resource, to refer to other resources, including "foreign" ones, which is why the data is called linked. In practice, using URIs named in the RDFS vocabulary is almost inevitable.
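A sketch of such a description in Turtle (the example.com URIs are invented for illustration; foaf: and dbpedia: are real vocabularies and datasets): the resource is described by linking to "foreign" URIs rather than only by natural-language text.

```turtle
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .

<http://example.com/persons/alice>
    a foaf:Person ;                          # a class from the FOAF vocabulary
    rdfs:label "Alice" ;                     # the only natural-language bit
    foaf:based_near dbpedia:Yekaterinburg .  # a link to a "foreign" URI
```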

RDF

RDF (Resource Description Framework) is a formalism for describing interrelated entities.

Statements of the form "subject-predicate-object", called triples, are made about entities and their relationships. In the simplest case, the subject, predicate, and object are all URIs. The same URI can occupy different positions in different triples: it can be a subject, a predicate, and an object; the triples thus form a kind of graph, called an RDF graph.

Subjects and objects can be not only URIs but also so-called blank nodes, and objects can also be literals. Literals are instances of primitive types, consisting of a string representation and a type specification.

Examples of writing literals (in Turtle syntax, more on which below): "5.0"^^xsd:float and "five"^^xsd:string. Literals of type rdf:langString can also carry a language tag; in Turtle it is written like this: "five"@en and "пять"@ru.

Blank nodes are "anonymous" resources without global identifiers, about which, however, assertions can still be made; a sort of existential variables.

So (this, in fact, is the whole essence of RDF):

  • the subject is a URI or a blank node,
  • the predicate is a URI,
  • the object is a URI, a blank node, or a literal.
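In Turtle, all of this looks like the following sketch (the URIs under example.com are invented for illustration):

```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix :    <http://example.com/> .

:alice :knows :bob .              # subject, predicate, and object are URIs
:alice :age "42"^^xsd:integer .   # the object is a literal
:bob   :knows [ :name "Carol" ] . # the object is a blank node
```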

Why can't predicates be blank nodes?

The probable reason is the desire to informally understand and translate the triple s p o into the language of first-order predicate logic as something like P(S, O), where P is a predicate and S and O are constants. Traces of such an understanding can be found in the document "LBase: Semantics for Languages of the Semantic Web", which has the status of a W3C working group note. Under this understanding, the triple s p [], where [] is a blank node, will be translated as ∃x P(S, x), where x is a variable; but how then to translate s [] o? The W3C recommendation "RDF 1.1 Semantics" suggests another way of translating, but still does not consider the possibility of predicates being blank nodes.

However, Manu Sporny allowed it.

RDF is an abstract model. RDF can be written down (serialized) in various syntaxes: RDF/XML, Turtle (the most human-readable), JSON-LD, HDT (binary).

The same RDF can be serialized into RDF/XML in different ways, so it makes no sense, for example, to validate the resulting XML with XSD or to try to extract data from it with XPath. Likewise, JSON-LD is unlikely to satisfy the average JavaScript developer's desire to work with RDF using JavaScript dot and square-bracket notation (although JSON-LD is moving in that direction by offering a framing mechanism).

Most syntaxes offer ways to shorten long URIs. For example, the declaration @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> in Turtle will then allow you to write just rdf:type instead of <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>.

RDFS

RDFS (RDF Schema) is a basic modeling vocabulary; it introduces the notions of property and class, and properties such as rdf:type, rdfs:subClassOf, rdfs:domain, and rdfs:range. Using the RDFS vocabulary, for example, the following valid statements can be written:

rdf:type         rdf:type         rdf:Property .
rdf:Property     rdf:type         rdfs:Class .
rdfs:Class       rdfs:subClassOf  rdfs:Resource .
rdfs:subClassOf  rdfs:domain      rdfs:Class .
rdfs:domain      rdfs:domain      rdf:Property .
rdfs:domain      rdfs:range       rdfs:Class .
rdfs:label       rdfs:range       rdfs:Literal .

RDFS is a description and modeling vocabulary, but it is not a constraint language (although the official specification does leave the possibility of such use open). The word "Schema" should not be understood in the same sense as in the expression "XML Schema". For example, :author rdfs:range foaf:Person means that the rdf:type of all values of the :author property is foaf:Person, but it does not mean that this must be stated in advance.
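That is, given the following data (the example.com URIs are invented; foaf: is the real FOAF vocabulary), an RDFS-aware store does not reject the second statement but infers a new one:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix :     <http://example.com/> .

:author rdfs:range foaf:Person .
:someBook :author :alice .

# An RDFS reasoner infers (rather than requires in advance):
# :alice rdf:type foaf:Person .
```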

SPARQL

SPARQL (SPARQL Protocol and RDF Query Language) is a query language for RDF data. In the simple case, a SPARQL query is a set of patterns against which the triples of the queried graph are matched. Variables can occupy the positions of subjects, predicates, and objects in the patterns.

The query will return variable values whose substitution into the patterns can result in a subgraph of the queried RDF graph (a subset of its triples). Same-named variables in different triple patterns must have the same values.

For example, on the above set of seven RDFS axioms, the following query would return two solutions: ?s = rdf:Property with ?p = rdf:type, and ?s = rdfs:subClassOf with ?p = rdfs:domain:

SELECT * WHERE {
 ?s ?p rdfs:Class .
 ?p ?p rdf:Property .
}
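The matching just described can be sketched in pure Python (a toy illustration of basic graph pattern matching, not a real SPARQL engine; prefixed names stand in for full URIs):

```python
# The seven RDFS axioms from above, as (subject, predicate, object) triples.
TRIPLES = [
    ("rdf:type", "rdf:type", "rdf:Property"),
    ("rdf:Property", "rdf:type", "rdfs:Class"),
    ("rdfs:Class", "rdfs:subClassOf", "rdfs:Resource"),
    ("rdfs:subClassOf", "rdfs:domain", "rdfs:Class"),
    ("rdfs:domain", "rdfs:domain", "rdf:Property"),
    ("rdfs:domain", "rdfs:range", "rdfs:Class"),
    ("rdfs:label", "rdfs:range", "rdfs:Literal"),
]

def match(pattern, triple, bindings):
    """Try to unify one triple pattern with one triple under existing bindings."""
    bindings = dict(bindings)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):            # a variable position
            if bindings.get(p, t) != t:  # already bound to a different value
                return None
            bindings[p] = t
        elif p != t:                     # a constant that does not match
            return None
    return bindings

def select(patterns, triples=TRIPLES):
    """Return all bindings satisfying every pattern (same-named variables agree)."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [b for s in solutions for t in triples
                     if (b := match(pattern, t, s)) is not None]
    return solutions
```

Running select([("?s", "?p", "rdfs:Class"), ("?p", "?p", "rdf:Property")]) joins the two patterns exactly as a SPARQL engine would join the two triple patterns of the query above.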

It is worth noting that SPARQL is declarative and is not a graph traversal language (though some RDF stores offer ways to adjust the query execution plan). Therefore, some standard graph problems, such as finding the shortest path, cannot be solved in SPARQL, even using the property paths mechanism (but, again, individual RDF stores offer special extensions for these tasks).

SPARQL does not share the open world assumption and follows the "negation as failure" approach, which makes constructs such as FILTER NOT EXISTS {…} possible. Data distribution is accounted for via the federated queries mechanism.
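A federated query might look like the following sketch (dbo: is DBpedia's real ontology namespace; whether such a query is useful depends, of course, on what is in the local graph):

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>

# Join local triples with triples fetched from a remote SPARQL endpoint
SELECT ?city ?population WHERE {
  ?city a dbo:City .                       # matched against the local graph
  SERVICE <https://dbpedia.org/sparql> {   # evaluated at the remote endpoint
    ?city dbo:populationTotal ?population .
  }
}
```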

A SPARQL endpoint, an RDF store capable of processing SPARQL queries, has no direct analogue from the second stage (see the beginning of this chapter). It can be likened to a database, based on whose content HTML pages were generated, but which is exposed to the outside. A SPARQL endpoint is more like an API endpoint from the third stage, but with two main differences. Firstly, several "atomic" queries can be combined into one (which is considered a key characteristic of GraphQL), and secondly, such an API is completely self-documenting (which is what HATEOAS tried to achieve).

Polemic remark

RDF is a way of publishing data on the web, so RDF stores should be considered document DBMSs. True, since RDF is a graph rather than a tree, they turned out to be graph DBMSs at the same time. It's amazing that it worked out at all. Who would have thought there would be smart people who would implement blank nodes? Codd, after all, didn't manage it.

There are also less full-featured ways to organize access to RDF data, for example, Linked Data Fragments (LDF) and the Linked Data Platform (LDP).

OWL

OWL (Web Ontology Language) is a knowledge representation formalism, a syntactic version of the description logic SROIQ(D) (everywhere below, it would be more correct to say OWL 2; the first version of OWL was based on SHOIN(D)).

Concepts of description logic correspond in OWL to classes, roles to properties, and individuals retain their former name. Axioms are also called axioms.

For example, in the so-called Manchester syntax for writing OWL, the axiom we already know, Parent ≡ Human ⊓ ∃hasParent⁻.Human, will be written like this:

Class: Human
Class: Parent
   EquivalentTo: Human and (inverse hasParent) some Human
ObjectProperty: hasParent

There are other syntaxes for writing OWL, such as the functional syntax used in the official specification, and OWL/XML. OWL can also be serialized into the abstract RDF syntax and, further, into any of its concrete syntaxes.

OWL stands in a twofold relation to RDF. On the one hand, it can be viewed as a vocabulary that extends RDFS. On the other hand, it is a more powerful formalism for which RDF is merely a serialization format. Not all elementary OWL constructs can be written as a single RDF triple.
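For example, the single axiom from the Manchester-syntax fragment above takes several triples and two blank nodes when serialized into RDF/Turtle (the sketch below places the classes in an invented example.com namespace):

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix :    <http://example.com/> .

# Parent ≡ Human ⊓ ∃hasParent⁻.Human, spelled out in RDF:
:Parent owl:equivalentClass [
    a owl:Class ;
    owl:intersectionOf ( :Human
                         [ a owl:Restriction ;
                           owl:onProperty [ owl:inverseOf :hasParent ] ;
                           owl:someValuesFrom :Human ] )
] .
```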

Depending on which subset of OWL constructs is allowed to be used, one speaks of so-called OWL profiles. The standardized and best-known ones are OWL EL, OWL RL, and OWL QL. The choice of profile affects the computational complexity of typical problems. The complete set of OWL constructs, corresponding to SROIQ(D), is called OWL DL. Sometimes one also speaks of OWL Full, in which OWL constructs are allowed to be used with the full freedom inherent in RDF, without the semantic and computational restrictions of SROIQ(D). For example, something can be both a class and a property. OWL Full is undecidable.

The key principles of drawing inferences in OWL are the adoption of the open world assumption (OWA) and the rejection of the unique name assumption (UNA). Below we will see what these principles can lead to, and introduce some OWL constructs.

Let the ontology contain the following fragment (in Manchester syntax):

Class: manyChildren
   EquivalentTo: Human that hasChild min 3
Individual: John
   Types: Human
   Facts: hasChild Alice, hasChild Bob, hasChild Carol

Will it follow from the above that John has many children? The rejection of UNA forces the inference engine to answer this question in the negative: Alice and Bob could very well be the same person. For the desired inference to hold, the following axiom needs to be added:

DifferentIndividuals: Alice, Bob, Carol, John

Now let the ontology fragment have the following form (John is declared to have many children, but he has only two children):

Class: manyChildren
   EquivalentTo: Human that hasChild min 3
Individual: John
   Types: Human, manyChildren
   Facts: hasChild Alice, hasChild Bob
DifferentIndividuals: Alice, Bob, Carol, John

Will this ontology be inconsistent (which could be interpreted as a sign of invalid data)? The adoption of OWA causes the inference engine to answer in the negative: "somewhere else" (in another ontology) it could well be said that Carol is also John's child.

To eliminate this possibility, let's add a new fact about John:

Individual: John
   Facts: hasChild Alice, hasChild Bob, not hasChild Carol

To exclude the appearance of other children, let's state that all values of the hasChild property are humans, of whom there are only four:

ObjectProperty: hasChild
   Range: Human
   Characteristics: Irreflexive
Class: Human
   EquivalentTo: { Alice, Bob, Carol, John }

Now the ontology becomes inconsistent, which the inference engine will not fail to report. With the last of these axioms we effectively "closed" the world; note how the possibility of John being his own child is ruled out.

Linking Enterprise Data

The Linked Data set of approaches and technologies was originally intended for publishing data on the web. Using it in an intra-corporate environment runs into a number of difficulties.

For example, in a closed corporate environment, the deductive power of OWL based on the adoption of OWA and the rejection of UNA (decisions driven by the open and distributed nature of the web) is too weak. The following ways out are possible.

  • Endowing OWL with a semantics that implies the rejection of OWA and the adoption of UNA, and implementing a corresponding inference engine; the Stardog RDF store follows this path.
  • Abandoning the deductive power of OWL in favor of rule engines; Stardog supports SWRL, while Jena and GraphDB offer their own rule languages.
  • Abandoning the deductive power of OWL and using some subset close to RDFS for modeling; more on this below.

Another problem is the greater attention the corporate world may devote to data quality issues, together with the lack of data validation tools in the Linked Data stack. The ways out are as follows.

  • Again, using OWL constructs with closed-world and unique-name semantics for validation, given an appropriate inference engine.
  • Using SHACL, standardized after the list of Semantic Web Layer Cake layers had been fixed (it can, however, also be used as a rule engine), or ShEx.
  • Realizing that everything is ultimately done by SPARQL queries, and building your own simple data validation mechanism on top of them.
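Such a home-grown check might be a query that returns violations, here against the hypothetical :author property from the RDFS section (example.com is an invented namespace, foaf: a real vocabulary):

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX :     <http://example.com/>

# Every :author value not (known to be) a foaf:Person is reported as a violation
SELECT ?resource ?author WHERE {
  ?resource :author ?author .
  FILTER NOT EXISTS { ?author rdf:type foaf:Person }
}
```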

However, even a complete rejection of deductive capabilities and validation tools leaves the Linked Data stack without rivals in tasks whose landscape resembles the open and distributed web, that is, in data integration tasks.

How about a regular corporate information system?

This is possible, but one should, of course, understand exactly which problems the corresponding technologies will have to solve. I will describe here the typical reactions of development participants to show what this technology stack looks like from the point of view of conventional IT. It reminds me a bit of the parable of the blind men and the elephant:

  • Business analyst: RDF is something like a directly stored logical model.
  • Systems analyst: RDF is like EAV, only with a bunch of indexes and a convenient query language.
  • Developer: well, this is all in the spirit of the rich model and low code concepts, I read about them recently.
  • Project manager: why, this is collapsing the stack!

Practice shows that the stack is most often used in tasks related to the distribution and heterogeneity of data, for example, when building systems of the MDM (Master Data Management) or DWH (Data Warehouse) class. Such problems exist in any industry.

As for industry-specific applications, Linked Data technologies are currently most popular in the following industries.

  • biomedical technologies (where their popularity seems to be related to the complexity of the subject area);

relevant

The other day, at the "Boiling Point", a conference organized by the "National Medical Knowledge Base" association was held: "Unification of Ontologies. From Theory to Practical Application".

  • manufacturing and operation of complex products (heavy engineering, oil and gas production; most often this means the ISO 15926 standard);

relevant

Here, too, the reason is the complexity of the subject area, when, for example, at the upstream stage (if we are talking about the oil and gas industry), even simple accounting requires some CAD functions.

In 2008, Chevron hosted a representative kick-off conference.

ISO 15926 eventually seemed a little heavyweight to the oil and gas industry (and found almost more use in mechanical engineering). Only Statoil (Equinor) got hooked on it thoroughly; a whole ecosystem has grown up around it in Norway. Others are trying to do their own thing. For example, according to rumors, the domestic Ministry of Energy intends to create a "conceptual ontological model of the fuel and energy complex", similar, apparently, to the one created for the electric power industry.

  • financial institutions (even XBRL can be seen as a hybrid of SDMX and RDF Data Cube ontology);

relevant

At the beginning of the year, LinkedIn actively spammed the author with vacancies from almost all the giants of the financial industry known to him from the TV series Suits: Goldman Sachs, JPMorgan Chase and/or Morgan Stanley, Wells Fargo, SWIFT/Visa/Mastercard, Bank of America, Citigroup, the Fed, Deutsche Bank… Everyone was probably looking for someone to send to the Knowledge Graph Conference. Quite a few managed: financial institutions occupied the entire morning of the first day.

On HeadHunter, something interesting came across only from Sberbank; it was about an "EAV storage with an RDF-like data model".

Probably, the difference in the degree of love for the corresponding technologies of domestic and Western financial institutions is due to the transnational nature of the latter's activities. Apparently, integration across state borders requires qualitatively different organizational and technical solutions.

  • question-answer systems that have commercial applications (IBM Watson, Apple Siri, Google Knowledge Graph);

relevant

By the way, Siri's creator Thomas Gruber is the author of the very definition of an ontology (in the IT sense) as a "specification of a conceptualization". In my opinion, rearranging the words in this definition does not change its meaning, which perhaps indicates that there is none.

  • publication of structured data (with good reason this can already be attributed to Linked Open Data).

relevant

Big fans of Linked Data are the so-called GLAM: Galleries, Libraries, Archives, and Museums. Suffice it to say that to replace MARC21, the Library of Congress is promoting BIBFRAME, which "provides a foundation for the future of bibliographic description" and is, of course, based on RDF.

Wikidata is often cited as an example of a successful Linked Open Data project: a kind of machine-readable version of Wikipedia whose content, unlike DBpedia's, is not generated by importing infoboxes from articles but is created more or less manually (and subsequently becomes a source of information for those same infoboxes).

It is also recommended to review the list of users of the Stardog RDF store on the Stardog website in the "Customers" section.

Be that as it may, in Gartner's 2016 "Hype Cycle for Emerging Technologies", "Enterprise Taxonomy and Ontology Management" is placed in the middle of the descent into the trough of disillusionment, with the prospect of reaching the "plateau of productivity" no sooner than in 10 years.

Connecting Enterprise Data

Predictions, predictions, predictions…

Out of historical interest, I have summarized Gartner's forecasts of various years for the technologies of interest to us in the table below.

Year | Technology                     | Report                              | Position                      | Years to plateau
2001 | Semantic Web                   | Emerging Technologies               | Innovation Trigger            | 5-10
2006 | Corporate Semantic Web         | Emerging Technologies               | Peak of Inflated Expectations | 5-10
2012 | Semantic Web                   | Big Data                            | Peak of Inflated Expectations | > 10
2015 | Linked Data                    | Advanced Analytics and Data Science | Trough of Disillusionment     | 5-10
2016 | Enterprise Ontology Management | Emerging Technologies               | Trough of Disillusionment     | > 10
2018 | Knowledge Graphs               | Emerging Technologies               | Innovation Trigger            | 5-10

However, already in the 2018 "Hype Cycle…" another rising trend appeared: Knowledge Graphs. A certain reincarnation took place: graph DBMSs, to which users' attention and developers' efforts had switched, began, under the influence of the former's requests and the latter's habits, to take on the contours and positioning of their predecessor competitors.

Almost every graph DBMS now claims to be a suitable platform for building a corporate “knowledge graph” (“linked data” is sometimes replaced by “connected data”), but how justified are such claims?

Graph databases are still asemantic; data in a graph DBMS is still the same data silo. String identifiers instead of URIs make the task of integrating two graph DBMSs still just an integration task, whereas integrating two RDF stores often amounts to simply merging two RDF graphs. Another aspect of asemanticity is the non-reflexivity of the LPG graph model, which makes it difficult to manage metadata using the same platform.
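The difference is easy to illustrate with a toy sketch in Python: because RDF names nodes with global URIs, merging two graphs (blank nodes aside) is just a set union of triples, with identical facts deduplicating themselves (the example.com URIs are invented for the example; foaf: is the real FOAF namespace):

```python
# Toy illustration: RDF graphs as sets of (subject, predicate, object) triples.
FOAF = "http://xmlns.com/foaf/0.1/"

graph_a = {
    ("http://example.com/alice", FOAF + "knows", "http://example.com/bob"),
}
graph_b = {
    ("http://example.com/bob", FOAF + "name", '"Bob"'),
    # The same fact as in graph_a, stated independently elsewhere:
    ("http://example.com/alice", FOAF + "knows", "http://example.com/bob"),
}

# "Integration" of the two graphs is a plain set union;
# the duplicated triple collapses because the URIs are global.
merged = graph_a | graph_b
```

With local string identifiers, nothing guarantees that "alice" in one store and "alice" in another denote the same thing, so the same union becomes a full-blown entity-matching task.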

Finally, graph DBMSs do not have inference engines or rule engines. The results of such engines can be reproduced by complicating queries, but this is possible even in SQL.

The leading RDF stores, however, have no problem supporting the LPG model. The most solid approach is the one proposed at one time in Blazegraph: the RDF* model, which combines RDF and LPG.

Details

You can read more about LPG model support in RDF stores in a previous article on Habr: "What's going on with RDF repositories now". I hope a separate article will one day be written about Knowledge Graphs and Data Fabric. The final section, as is easy to understand, was written in a hurry; however, even six months later, these concepts have not become much clearer.

Literature

  1. Halpin, H., Monnin, A. (eds.) (2014). Philosophical Engineering: Toward a Philosophy of the Web
  2. Allemang, D., Hendler, J. (2011) Semantic Web for the Working Ontologist (2nd ed.)
  3. Staab, S., Studer, R. (eds.) (2009) Handbook on Ontologies (2nd ed.)
  4. Wood, D. (ed.). (2011) Linking Enterprise Data
  5. Keet, M. (2018) An Introduction to Ontology Engineering

Source: habr.com
