Web Extraction Languages and Foundations


This research area investigates languages and foundations of web data extraction. DIADEM needs languages for navigating web pages and interacting with them. To extract the data behind a web interface, the static DOM tree of the page is not enough; a live DOM, which reflects the changes caused by scripts and user interaction, must be used instead.

In addition, extracting data from a web site is a time-consuming task. For this reason, efficient and parallel extraction techniques are investigated.
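
As a rough illustration of why parallelism helps, the sketch below extracts several pages concurrently with a thread pool. It is only a minimal example under stated assumptions, not DIADEM's implementation: `extract_page` and `extract_all` are hypothetical names, and the per-page step is a stub that does no network access.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_page(url: str) -> dict:
    """Hypothetical stand-in for a per-page extraction step.

    A real extractor would load the page and interact with it here;
    this stub only derives a fake title from the last URL segment.
    """
    return {"url": url, "title": url.rsplit("/", 1)[-1].upper()}


def extract_all(urls, max_workers=4):
    """Run extract_page over all URLs using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return list(pool.map(extract_page, urls))


results = extract_all(["http://example.com/a", "http://example.com/b"])
```

Threads suit this kind of work because page loading is dominated by waiting on the network rather than by computation.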

Due to the intrinsically noisy nature of web data, DIADEM also investigates how probabilistic techniques can be used to improve the quality of the extracted data or of the extraction process.
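
One simple way to picture such a technique is confidence-weighted voting over conflicting candidate values for the same field. The sketch below is an assumption made for illustration, not the method used in DIADEM: the function name `fuse` and the confidence scores are hypothetical.

```python
from collections import defaultdict


def fuse(candidates):
    """Pick the most plausible value from noisy (value, confidence) pairs.

    Confidences for identical values are summed, and the winner's share
    of the total is returned as a normalised score.
    """
    totals = defaultdict(float)
    for value, confidence in candidates:
        totals[value] += confidence

    norm = sum(totals.values())
    if norm == 0:
        raise ValueError("no candidates with positive confidence")

    best = max(totals, key=totals.get)
    return best, totals[best] / norm


# Three extractions of the same price field, two of which agree.
value, score = fuse([("250000", 0.6), ("25000", 0.3), ("250000", 0.5)])
```

Here the two agreeing extractions outweigh the single conflicting one, so `value` is `"250000"`.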

The extraction language for DIADEM is OXPath.

The wrapper definition language for DIADEM is GLUE.