Many computer scientists see the solution to the problem of object search in publishing objects with formal vocabularies. Initiatives such as Linked Open Data and the Semantic Web create such vocabularies so that publishers can annotate the objects they publish. Though this gives publishers more choice and freedom, it is freedom for technology experts only, considering that publishers today overwhelmingly fail to produce even syntactically correct Web pages.
DIADEM takes a bolder view: we believe that the hard and repetitive tasks necessary for object search can be automated, given a number of significant but realistic breakthroughs in automated Web data extraction. DIADEM lets publishers focus on writing object descriptions for humans and transforms these descriptions into objects with searchable attributes.
DIADEM’s web extraction is based on the observation that, at least within a given domain, object descriptions written for humans occur in a limited set of patterns. Book descriptions, for example, contain a title and an author, with the title usually set in a larger or bold font. Such patterns of occurrence form the phenomenology of objects in that domain.
We assemble domain knowledge and phenomenological knowledge for a domain of interest. With this knowledge, we can automatically analyze arbitrary object descriptions from that domain and identify which objects occur and in which patterns. The result of the analysis is an extraction program, which can then be used to automatically extract all similarly published objects and their attributes.
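To make the idea concrete, the following is a minimal sketch in Python of how phenomenological knowledge might be turned into an extraction program. It is an illustration under simplifying assumptions, not DIADEM's actual implementation: the `Node` model, the two rules, and the functions `analyze` and `extract` are all hypothetical names introduced here, and the "extraction program" is reduced to a mapping from attribute names to the CSS class under which each attribute was observed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One rendered text fragment of an object description (hypothetical model)."""
    text: str
    css_class: str
    font_size: int
    bold: bool = False


# Phenomenological knowledge (assumed for illustration): how the attributes
# of a book typically appear to a human reader.
def looks_like_title(node, record):
    largest = max(n.font_size for n in record)
    return node.bold or node.font_size == largest


def looks_like_author(node, record):
    return node.text.lower().startswith("by ")


PHENOMENOLOGY = {"title": looks_like_title, "author": looks_like_author}


def analyze(sample_records):
    """Derive an extraction program: attribute name -> CSS class.

    Each rule votes for the CSS class of every node it matches; the class
    with the most votes across the samples is chosen for that attribute.
    """
    program = {}
    for attribute, rule in PHENOMENOLOGY.items():
        votes = Counter()
        for record in sample_records:
            for node in record:
                if rule(node, record):
                    votes[node.css_class] += 1
        if votes:
            program[attribute] = votes.most_common(1)[0][0]
    return program


def extract(program, record):
    """Apply the extraction program to one similarly published record."""
    by_class = {node.css_class: node.text for node in record}
    return {attr: by_class.get(cls) for attr, cls in program.items()}


# Two sample descriptions used for analysis (made-up data).
samples = [
    [Node("Dune", "hd", 18, bold=True), Node("by Frank Herbert", "sub", 12),
     Node("7.99 GBP", "price", 12)],
    [Node("Emma", "hd", 18, bold=True), Node("by Jane Austen", "sub", 12),
     Node("5.49 GBP", "price", 12)],
]
program = analyze(samples)

# A further record from the same site; the program relies on the learned
# structure, so the visual cues need not be present here.
new_record = [Node("Ulysses", "hd", 12), Node("by James Joyce", "sub", 12)]
result = extract(program, new_record)
print(program)
print(result)
```

The design choice worth noting is the separation of the two steps: the visual rules are consulted only during analysis, while extraction depends solely on the structure the analysis found. This is why the last record is handled correctly even though its title is neither bold nor larger than the surrounding text.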