
AMBER - Adaptable Model-based Extraction of Result and Details Pages

Web extraction is the task of turning unstructured HTML into structured data. Previous approaches rely exclusively on detecting repeated structures in result pages and achieve precision only at the cost of intensive user interaction.

AMBER replaces the human interaction with a domain ontology that applies to all sites of a domain. The ontology models domain knowledge about (1) records and attributes of the domain, (2) low-level (textual) representations of these concepts, and (3) constraints linking representations to records and attributes. Parametrized with these constraints, otherwise domain-independent heuristics exploit the repeated structure of result pages to derive attributes and records. AMBER is implemented in logical rules, which allows an explicit formulation of the heuristics and easy adaptation to different domains.
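The interplay of the three kinds of domain knowledge can be illustrated with a minimal Python sketch. It is not AMBER's implementation (AMBER is written as logical rules), and every name in it is hypothetical: `ANNOTATORS` stands in for the textual representations of attributes, `MANDATORY` for a single simple constraint (a record needs exactly one price), and `extract_records` for a heuristic that accepts a repeated page block as a record only if the constraint holds. The regular expressions are rough assumptions for UK real estate text, not patterns taken from AMBER.

```python
import re

# (2) Hypothetical low-level textual representations of domain attributes.
ANNOTATORS = {
    "price": re.compile(r"£\s?\d[\d,]*"),
    "bedrooms": re.compile(r"\b\d+\s+bed(?:room)?s?\b", re.IGNORECASE),
    "postcode": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b"),
}

# (3) Hypothetical constraint: each record carries exactly one of these.
MANDATORY = {"price"}


def annotate(text):
    """Return {attribute: [matched strings]} for one block of page text."""
    found = {}
    for name, pattern in ANNOTATORS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found


def extract_records(candidate_blocks):
    """Keep the candidate blocks whose annotations satisfy the constraint.

    candidate_blocks is assumed to hold the text of structurally similar
    sibling nodes, i.e. the repeated structure of a result page.
    """
    records = []
    for block in candidate_blocks:
        attrs = annotate(block)
        if all(len(attrs.get(a, [])) == 1 for a in MANDATORY):
            # (1) A record is a mapping from attribute name to value.
            records.append({name: vals[0] for name, vals in attrs.items()})
    return records


blocks = [
    "3 bedroom house, Oxford OX1 3QD, £350,000",
    "Contact us for a valuation",
    "2 bed flat, £1,200 pcm",
]
records = extract_records(blocks)
```

Here the second block is rejected because it contains no price, while the other two become records. Adapting the sketch to another domain would only mean swapping `ANNOTATORS` and `MANDATORY`, which mirrors the idea that the heuristics themselves stay domain-independent.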

We apply AMBER to the UK real estate domain, where we achieve near-perfect accuracy on a representative sample of 50 agency websites.


Presentation


Evaluating AMBER

  • Result Pages
    • dataset - UK real estate
    • gold standard - UK real estate
  • Details Pages

References

  • Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, and Cheng Wang. 2011. Little Knowledge Rules the Web: Domain-Centric Result Page Extraction. In Proc. of RR.
  • Tim Furche, Georg Gottlob, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, and Cheng Wang. 2012. AMBER: Automatic Supervision for Multi-Attribute Extraction. CoRR abs/1210.5984.

Contacts

(name dot surname at cs dot ox dot ac dot uk)
  • Christian Schallhart
  • Giorgio Orsi
  • Cheng Wang