[caldeveloper-l] Fwd: WebDataCommons releases 38.7 billion quads Microdata, Embedded JSON-LD, RDFa and Microformat data originating from 7.4 million pay-level-domains

Rutger Geelen rutgergeelen at schedjoules.com
Thu Jan 11 12:03:48 PST 2018


Colleagues,

Interesting report on Microdata and JSON-LD, including hCalendar. I would be
interested in having this on the agenda of an upcoming conference for discussion.

Best,

Rutger

---------- Forwarded message ---------
From: Anna Primpeli <anna at informatik.uni-mannheim.de>
Date: Thu, Jan 11, 2018, 10:37
Subject: WebDataCommons releases 38.7 billion quads Microdata, Embedded
JSON-LD, RDFa and Microformat data originating from 7.4 million
pay-level-domains
To: <semantic-web at w3.org>, <public-schemaorg at w3.org>, <public-vocabs at w3.org>


Hi All,

we are happy to announce the new release of the WebDataCommons Microdata,
JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the November 2017 version of the Common
Crawl covering 3.2 billion HTML pages which originate from 26 million
websites (pay-level domains).

In summary, we found structured data within 1.2 billion HTML pages out of
the 3.2 billion pages contained in the crawl (38.9%). These pages originate
from 7.4 million different pay-level domains out of the 26 million
pay-level-domains covered by the crawl (28.4%).

Approximately 3.7 million of these websites use Microdata, 2.6 million
websites use JSON-LD, and 1.2 million websites make use of RDFa.
Microformats are used by more than 3.3 million websites within the crawl.



*Background:*

More and more websites annotate data describing, for instance, products,
people, organizations, places, events, reviews, and cooking recipes within
their HTML pages using markup formats such as Microdata, embedded JSON-LD,
RDFa and Microformats.

The WebDataCommons project extracts all Microdata, JSON-LD, RDFa, and
Microformat data from the Common Crawl web corpus, the largest web corpus
that is available to the public, and provides the extracted data for
download. In addition, we publish statistics about the adoption of the
different markup formats as well as the vocabularies that are used together
with each format. We have run yearly extractions since 2012 and provide the
dataset series as well as the related statistics at:

http://webdatacommons.org/structureddata/



*Statistics about the November 2017 Release:*

Basic statistics about the November 2017 Microdata, JSON-LD, RDFa, and
Microformat data sets as well as the vocabularies that are used together
with each markup format are found at:

http://webdatacommons.org/structureddata/2017-12/stats/stats.html



*Markup Format Adoption*

The page below provides an overview of the increase in the adoption of the
different markup formats as well as widely used schema.org classes from
2012 to 2017:

http://webdatacommons.org/structureddata/#toc10

Comparing the statistics from the new 2017 release to the statistics about
the October 2016 release of the data sets

http://webdatacommons.org/structureddata/2016-10/stats/stats.html

we see that the adoption of structured data continues to increase, while
Microdata remains the dominant markup syntax. The different nature of
the crawling strategy that was used makes it hard to compare absolute as
well as certain relative numbers between the two releases. More concretely,
we observe that the November 2017 Common Crawl corpus is much deeper for
certain domains like blogspot.com and wordpress.com while other domains are
covered in a shallower way, with fewer URLs crawled in comparison to the
October 2016 Common Crawl corpus. Nevertheless, it is clear that the growth
rate of Microdata and Microformats is much higher than that of RDFa and
embedded JSON-LD. Although JSON-LD is widespread, it is
mainly used to annotate metadata for search actions (80% of the domains
using JSON-LD), while only a few domains use it for annotating content
information such as Organizations (25% of the domains using JSON-LD),
Persons (4% of the domains using JSON-LD) or Offers (0.1% of the domains
using JSON-LD).



*Vocabulary Adoption*

Concerning vocabulary adoption, schema.org, the vocabulary recommended
by Google, Microsoft, Yahoo!, and Yandex, continues to dominate
in the context of Microdata: 78% of the webmasters use it, whereas
its predecessor, data-vocabulary, is used by only
14% of the websites containing Microdata. In the context of RDFa, the Open
Graph Protocol recommended by Facebook remains the most widely used
vocabulary.



*Parallel Usage of Multiple Formats*

Analyzing topic-specific subsets, we discover some interesting trends. As
observed in the previous extractions, content-related information is mostly
described either with the Microdata format or, less frequently, with the
JSON-LD format, in both cases using the schema.org vocabulary. However, we
find that 30% of the websites that use JSON-LD annotations to describe
product-related information use Microdata as well as JSON-LD to
cover the same topic. This is not the case for other topics, such as Hotels
or Job Postings, for which webmasters use only one format to annotate their
content.



*Richer Descriptions of Job Postings*

Following the release of the “Google for Jobs” search vertical and the more
detailed guidance by Google on how to annotate job postings (
https://developers.google.com/search/docs/data-types/job-posting), we see
an increase in the number of websites annotating job postings (2017: 7,023,
2016: 6,352). In addition, the job posting annotations tend to become
richer in comparison to the previous years as the number of Job Posting
related properties adopted by at least 30% of the websites containing job
offers has increased from 4 (2016) to 7 (2017). The newly adopted
properties are JobPosting/url, JobPosting/datePosted, and
JobPosting/employmentType.
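
The "adopted by at least 30% of the websites" criterion above can be
illustrated with a minimal Python sketch. The input mapping, the domain
names and the function name `widely_adopted` are hypothetical and only serve
as an example; this is not the WebDataCommons analysis code, and the toy
data does not reproduce the figures reported above.

```python
from collections import Counter

def widely_adopted(props_by_site, threshold=0.30):
    """Return the properties used by at least `threshold` of all sites.

    props_by_site maps a website (pay-level domain) to the set of
    properties it uses. This input layout is an assumption made for
    illustration only.
    """
    if not props_by_site:
        return []
    site_counts = Counter()
    for props in props_by_site.values():
        # Count each property at most once per website.
        site_counts.update(set(props))
    total = len(props_by_site)
    return sorted(p for p, n in site_counts.items() if n / total >= threshold)

# Toy input with made-up domains, not data from the release.
toy = {
    "a.example": {"JobPosting/title", "JobPosting/url"},
    "b.example": {"JobPosting/title", "JobPosting/datePosted"},
    "c.example": {"JobPosting/title"},
    "d.example": {"JobPosting/title", "JobPosting/url"},
}
adopted = widely_adopted(toy)
```

In the toy input, JobPosting/datePosted appears on one of four sites (25%)
and therefore falls below the 30% threshold.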

You can find a more extended analysis concerning specific topics, like Job
Posting and Product, here

http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html#extendedanalysis



*Download*

The overall size of the November 2017 RDFa, Microdata, Embedded JSON-LD and
Microformat data sets is 38.7 billion RDF quads. For download, we split the
data into 8,433 files with a total size of 858 GB.

http://webdatacommons.org/structureddata/2017-12/stats/how_to_get_the_data.html
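
For a first look into a downloaded file, a rough Python sketch follows. It
assumes that the files are gzip-compressed N-Quads with one quad per line
and the page URL as the fourth element; please check the download page above
for the actual file layout. The regular expression is a simplification and
not a full N-Quads parser, and the function names and sample lines are made
up for illustration.

```python
import gzip
import re
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Simplified N-Quads line: subject, predicate, object, graph, final dot.
# Assumption: the subject is an IRI or blank node and the graph is an IRI.
QUAD_RE = re.compile(
    r'^(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s+(<[^>]*>)\s*\.\s*$'
)

def parse_quad(line):
    """Return (subject, predicate, object, graph), or None if no match."""
    m = QUAD_RE.match(line)
    return m.groups() if m else None

def count_types(lines):
    """Count rdf:type objects (classes) over an iterable of N-Quads lines."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad and quad[1] == RDF_TYPE:
            counts[quad[2]] += 1
    return counts

def count_types_in_file(path):
    """Stream a gzip-compressed file line by line without loading it fully."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return count_types(fh)

# Made-up sample lines, not taken from the corpus.
SAMPLE = [
    '<http://ex.org/a> ' + RDF_TYPE + ' <http://schema.org/Event> <http://ex.org/p1> .\n',
    '_:b0 ' + RDF_TYPE + ' <http://schema.org/Event> <http://ex.org/p2> .\n',
    '_:b0 <http://schema.org/name> "Team meeting"@en <http://ex.org/p2> .\n',
    'not a quad\n',
]
sample_counts = count_types(SAMPLE)
```

Lines that do not match the pattern are skipped silently, which keeps the
sketch tolerant of malformed input at the cost of hiding parse problems.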

In addition, we have created separate files for over 40 different schema.org
classes, each including all quads extracted from pages that use the specific
schema.org class.

http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html



*Lots of thanks to:*

+ the Common Crawl project for providing their great web crawl and
thus enabling the WebDataCommons project.
+ the Any23 project for providing their great library of structured
data parsers.
+ Amazon Web Services in Education Grant for supporting WebDataCommons.
+ the Ministry of Economy, Research and Arts of Baden-Württemberg, which
supported the extraction and analysis of the November 2017 corpus through
the ViCE project.

*General Information about the WebDataCommons Project*

The WebDataCommons project extracts structured data from the Common Crawl,
the largest web corpus available to the public, and provides the extracted
data for public download in order to support researchers and companies in
exploiting the wealth of information that is available on the Web. Besides
the yearly extractions of semantic annotations from webpages, the
WebDataCommons project also provides large hyperlink graphs, the largest
public corpus of WebTables, a corpus of product data, as well as a
collection of hypernyms extracted from billions of web pages for public
download. General information about the WebDataCommons project is found at

http://webdatacommons.org/


Have fun with the new data set.

Cheers,
Anna Primpeli, Robert Meusel and Chris Bizer