Articles

OXPath

The evolution of the web has outpaced itself: The growing wealth of information and the increasing sophistication of interfaces necessitate automated processing. Web automation and extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements of web extraction: (1) Interact with sophisticated web application interfaces, (2) Precisely capture the relevant data for most web extraction tasks, (3) Scale with the number of visited pages, and (4) Readily embed into existing web technologies.

OXPath is an extension of XPath for interacting with web applications and for extracting the information thus revealed. It addresses all of the above requirements. OXPath's page-at-a-time evaluation guarantees memory use independent of the number of visited pages, while evaluation time remains polynomial. We validate the theoretical complexity experimentally and demonstrate that OXPath's evaluation time is dominated by page rendering in the underlying browser.
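The idea behind page-at-a-time evaluation can be illustrated with a small sketch. This is not OXPath's actual algorithm or API. It is a simplified Python illustration that assumes a single linear chain of result pages, and `FAKE_SITE`, `Page`, `load_page`, and `extract_page_at_a_time` are hypothetical names introduced only for this example. The point it shows is that if each page is processed and its results are emitted before the next page is loaded, the extractor never holds more than one page, however many pages are visited.

```python
from typing import Dict, Iterator, List, NamedTuple, Optional


class Page(NamedTuple):
    """A rendered page: the records found on it and the link to the next page."""
    records: List[str]
    next_url: Optional[str]


# Hypothetical stand-in for a paginated web site (URL -> page content).
FAKE_SITE: Dict[str, Page] = {
    "page1": Page(["a", "b"], "page2"),
    "page2": Page(["c"], "page3"),
    "page3": Page(["d", "e"], None),
}


def load_page(url: str) -> Page:
    """Stand-in for rendering a page in a browser (assumed, not a real API)."""
    return FAKE_SITE[url]


def extract_page_at_a_time(start_url: str) -> Iterator[str]:
    """Yield records one page at a time, holding only the current page."""
    url: Optional[str] = start_url
    while url is not None:
        page = load_page(url)        # only the current page is referenced
        for record in page.records:
            yield record             # emitted immediately, not accumulated
        url = page.next_url          # the previous page can now be discarded


# Collect the stream into a list only to inspect it in this example.
results = list(extract_page_at_a_time("page1"))
```

A consumer that writes each record out as it arrives, instead of collecting the records into a list as done here for inspection, keeps the memory footprint independent of the number of pages in the chain.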

Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin. OXPath is available under an open source license.


Presentation


Demo

Try the OXPath EDBT 2011 Demo

Try the OXPath WWW 2011 Demo


Code

OXPath v1.0 is available as both a Java API and a command-line interface. Read this document to get started with OXPath in Maven, or download the binary packages (service offered by Google Code Project Hosting). Note: An out-of-sync version of the OXPath source code and binaries is hosted on GitHub. We plan to synchronise it in upcoming releases.



References


Contacts

(name dot surname at cs dot ox dot ac dot uk)
  • Giovanni Grasso
  • Andrew Jon Sellers