GitHub - seagatesoft/sde: Structured Data Extractor. An application to extract structured data from web pages. It uses the Data Extraction Based on Partial Tree Alignment (DEPTA) method. (UPDATE: I implemented a newer algorithm: https://github.com/seagatesoft/webdext)

Structured Data Extractor (SDE) is an implementation of DEPTA (Data Extraction based on Partial Tree Alignment), a method for extracting data from web pages (HTML documents). DEPTA was invented by Yanhong Zhai and Bing Liu of the University of Illinois at Chicago and was published in their paper "Structured Data Extraction from the Web based on Partial Tree Alignment" (IEEE Transactions on Knowledge and Data Engineering, 2006). Given a web page, SDE detects the data records it contains and extracts them into a table structure (rows and columns). You can download the application from this link: Download Structured Data Extractor.
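
To give a feel for the kind of tree comparison that partial tree alignment builds on, here is a minimal sketch of Simple Tree Matching, the tree similarity measure the DEPTA paper relies on. It is an illustration only and is not taken from the SDE source code: the class name SimpleTreeMatchingSketch, the Node type, and the match method are all hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.List;

public class SimpleTreeMatchingSketch {

    /** Minimal ordered, labeled tree node (hypothetical; not a class from SDE). */
    public static final class Node {
        final String tag;
        final List<Node> children;

        public Node(String tag, Node... children) {
            this.tag = tag;
            this.children = Arrays.asList(children);
        }
    }

    /**
     * Returns the size of the largest order-preserving matching between
     * two trees. Trees whose roots carry different tags do not match at all.
     */
    public static int match(Node a, Node b) {
        if (!a.tag.equals(b.tag)) {
            return 0;
        }
        int m = a.children.size();
        int n = b.children.size();
        // best[i][j] = best matching between the first i children of a
        // and the first j children of b (dynamic programming, like LCS).
        int[][] best = new int[m + 1][n + 1];
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int w = match(a.children.get(i - 1), b.children.get(j - 1));
                best[i][j] = Math.max(
                        Math.max(best[i - 1][j], best[i][j - 1]),
                        best[i - 1][j - 1] + w);
            }
        }
        return best[m][n] + 1; // +1 for the matched roots
    }

    public static void main(String[] args) {
        // Two table rows with slightly different cell contents.
        Node row1 = new Node("tr",
                new Node("td", new Node("img")), new Node("td"), new Node("td"));
        Node row2 = new Node("tr",
                new Node("td"), new Node("td", new Node("a")));
        System.out.println(match(row1, row2)); // prints 3
    }
}
```

In this sketch the two rows share the tr root and two td cells, so the score is 3. A record detector could treat sibling subtrees with a high score relative to their size as candidate data records; how SDE itself thresholds and aligns them is not shown here.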

Usage

Extract sde.zip.
Make sure that the Java Runtime Environment (version 5 or higher) is installed on your computer.
Open command prompt (Windows) or shell (UNIX).
Go to the directory where you extracted sde.zip.
Run this command: java -jar sde-runnable.jar URI_input path_to_output_file
The URI_input parameter can refer to a local file or a remote file, as long as it is a valid URI. A URI referring to a local file must start with "file:///". For example, in a Windows environment: "file:///D:/Development/Proyek/structured_data_extractor/bin/input/input.html", or in a UNIX environment: "file:///home/seagate/input/input.html".
The path_to_output_file parameter must be a valid path on the host operating system, such as "D:\Data\output.html" (Windows) or "/home/seagate/output/output.html" (UNIX).
The extracted data can be viewed in the output file. The output file is an HTML document in which the extracted data is presented as HTML tables.
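
If you are unsure how to write the "file:///" form for a local file, the following small helper may help. It is a sketch, not part of SDE: the class name FileUriExample and the method toFileUri are hypothetical, and it uses java.nio.file, which requires Java 7 or higher (newer than the minimum needed to run SDE itself).

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileUriExample {

    /** Converts a local path (absolute or relative) into a file:/// URI string. */
    public static String toFileUri(String localPath) {
        // Resolve against the current directory and clean up "." and ".." segments.
        Path absolute = Paths.get(localPath).toAbsolutePath().normalize();
        // Path.toUri() produces the "file:///..." form for local files.
        return absolute.toUri().toString();
    }

    public static void main(String[] args) {
        // Falls back to a sample file name when no argument is given.
        String path = args.length > 0 ? args[0] : "input.html";
        System.out.println(toFileUri(path));
    }
}
```

The printed value can then be passed as the URI_input argument of the java -jar sde-runnable.jar command described above.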

Source Code

The SDE source code is available on GitHub.

Dependencies

SDE was developed using these libraries:

Neko HTML Parser by Andy Clark and Marc Guillemot. Licensed under Apache License Version 2.0.
Xerces by The Apache Software Foundation. Licensed under Apache License Version 2.0.

License

SDE is licensed under the MIT license.

Author

Sigit Dewanto, sigitdewanto11[at]yahoo[dot]co[dot]uk, 2009.
