SWC:Proposal

Overview
Regular content creators (such as bloggers, or even more experienced webmasters) do not use semantic annotations appropriately, whether they use RDFa, Microdata, or Microformats as the annotation language. Most training documents on semantic annotation are quite complex, so people learn instead from examples showing up on various blogs. However, as these examples are not always properly designed, people sometimes get a wrong understanding of how to add semantic annotations to their pages.

Recently (March 2012), a joint initiative of the Free University of Berlin and the Karlsruhe Institute of Technology, http://webdatacommons.org/, started offering a public set of semantic data extracted from the Web: "The Web Data Commons project extracts all Microformats, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download in the form of RDF-quads and also in the form of CSV-tables for common entity types (e.g. product, organization, location, ...)." In June 2012 they also published an analysis of the vocabularies used by webmasters for semantic annotations (see http://webdatacommons.org/vocabulary-usage-analysis/index.html ), showing a very significant growth of annotated data in 2012 versus 2010.

Indeed, after the [http://schema.org Google, Bing and Yahoo! initiative] to encourage semantic annotations, the number of annotated pages is booming.

What's wrong with Semantic Annotations?
For example (and there are many more issues):
 * A professional looking closer at the datasets offered by http://webdatacommons.org will find quite easily that, apart from crawls of professional websites such as http://freebase.com or http://yahoo.com, there are a lot of websites (the Web is so huge...) offering triples that are semantically wrong.
 * While a lot of clean data comes from metadata tags and from work licensing (such as Creative Commons), semantic data extracted from, for example, blog posts is much less present.

Indeed, big players such as Google have already started to help their customers by providing a partial annotation of the blogs on the http://blogger.com platform. However, as Google noticed, you cannot really annotate everything automatically, because you miss the most important part: the annotations of the post content itself.

Why do content creators not perform Semantic Annotations?
In the past there was little motivation for content creators to annotate their content. Nowadays, with the advent of the Schema.org vocabulary and the direct involvement of the major search engines in the Semantic Web business, everyone is more interested than ever in producing annotated content. But the main problem is that annotating is difficult, for at least the following reasons:
 * Webmasters must learn an annotation language such as RDFa, Microdata, or Microformats
 * There is a lack of tools allowing easy semantic annotation: when I publish my WordPress post I use the WP editor. The same is true for other blogging platforms, and in the CMS world the situation is the same.
 * While learning RDFa or Microdata is not so difficult, understanding the various vocabularies to be used is far more complex.

Let us consider an example. Most people commonly understand that a street address written as plain text is an address, and if a blogger finds markup in that style on another website, it is quite likely that they will borrow it.
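A hypothetical snippet illustrating this naive style of annotation (a sketch using Schema.org Microdata; the organization name and address values are invented for illustration):

```html
<!-- Naive markup: the Schema.org "address" property is given as plain text -->
<div itemscope itemtype="http://schema.org/Organization">
  <span itemprop="name">Example Corp</span>
  <span itemprop="address">1 Example Street, 10115 Berlin</span>
</div>
```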

While humans may consider such markup fair, from the machine's perspective it is incorrect.

Why?
Simply because the definition of the property address in the Schema.org vocabulary says that the value of this property should be an instance of PostalAddress. And PostalAddress is NOT a Text. It is a class carrying a number of properties, including streetAddress, postalCode, addressLocality, and so on (What? Class? Property? Inheritance? ... very difficult).

Therefore, a semantically consistent markup must nest a PostalAddress item inside the address property.
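A sketch of such consistent markup (Schema.org Microdata; the concrete values are invented for illustration):

```html
<!-- Consistent markup: "address" points to a nested PostalAddress item -->
<div itemscope itemtype="http://schema.org/Organization">
  <span itemprop="name">Example Corp</span>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">1 Example Street</span>
    <span itemprop="postalCode">10115</span>
    <span itemprop="addressLocality">Berlin</span>
  </div>
</div>
```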

This is far more complex than the naive version and REQUIRES knowledge about the Schema.org vocabulary, including knowledge about types, inheritance, and so on. Not to mention that there are many other vocabularies that can be used...

This problem has already been noticed by the Schema.org initiative, as they DO NOT REQUIRE A COMPLETE SEMANTIC MARKUP. Basically, they accept that PEOPLE MAKE MISTAKES when they mark up content.

What should we do?
When Google accepts that a content creator may mark up a PostalAddress as plain Text, they also understand that such annotations cannot be used correctly in triplestores using Schema.org as the underlying vocabulary (a SPARQL query asking, for example, for the street address of an organization annotated this way will never return the right information...). What we should do is:
 * Create tools allowing EASY semantic annotations
 * Clean the existing crawled semantic data into clean semantic data ready to be used in reasoning tasks.

We believe that cleaning can be done by:
 * 1) Using crowdsourcing solutions.
 * 2) Using data transformation to map the data from its given wrong format into the format expected by the appropriate application. This includes using heuristics for semantic value conversion, i.e., extracting structured values from strings whose intended type the machine "knows" a priori (e.g., parsing a plain-text address into the properties of a PostalAddress).

Related Discussions

 * From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas, http://googleresearch.blogspot.de/2012/05/from-words-to-concepts-and-back.html -- Admin66 15:29, 12 October 2012 (CEST)
 * Data Science Toolkit, http://www.datasciencetoolkit.org/developerdocs --Admin66 11:23, 20 January 2013 (CET)