xpyth

A module for querying the DOM tree and writing XPath expressions using native Python syntax.

Example usage

>>> from xpyth import xpath, DOM, X

>>> xpath(X for X in DOM if X.name == 'main')
"//*[@name='main']"

>>> xpath(span for div in DOM for span in div if div.id == 'main')
"//div[@id='main']//span"

>>> xpath(a for a in DOM if '.com' not in a.href)
"//a[not(contains(@href, '.com'))]"

>>> xpath(a.href for a in DOM if any(p for p in a.ancestors if p.id))
"//a[./ancestor::p[@id]]/@href"

>>> xpath(X.data-bind for X in DOM if X.data-bind == '1')
"//*[@data-bind='1']/@data-bind"

>>> xpath(
...     form.action 
...     for form in DOM 
...     if all(
...         input 
...         for input in form.children 
...         if input.value == 'a'
...     )
... )
"//form[not(./input[not(@value='a')])]/@action"

>>> allowed_ids = list('abc')
>>> xpath(X for X in DOM if X.id in allowed_ids)
"//*[@id='a' or @id='b' or @id='c']"
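The last example suggests that a membership test against a Python list is unrolled into one equality comparison per element, joined with `or`. The sketch below reproduces that output string with a hypothetical helper, `membership_predicate`. The helper is not part of xpyth and makes no claim about how xpyth builds the expression internally.

```python
def membership_predicate(attribute, allowed_values):
    """Build an XPath predicate body that matches any of the allowed values.

    Hypothetical helper for illustration only; not part of xpyth.
    Assumes the values contain no single quotes.
    """
    comparisons = ["@{}='{}'".format(attribute, value) for value in allowed_values]
    return " or ".join(comparisons)


allowed_ids = list('abc')
expression = "//*[{}]".format(membership_predicate('id', allowed_ids))
print(expression)
```

Under these assumptions, the printed string is the same expression shown for `X.id in allowed_ids` above.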

Motivation

XPath is the de facto standard for querying XML and HTML documents. In Python (and most other languages), XPath expressions are represented as strings. This is a potential security risk when expressions are assembled from untrusted input, and it also denies developers standard text-editor and IDE features such as syntax highlighting and autocomplete when writing XPaths. Furthermore, having to learn XPath (or CSS selectors) presents a barrier to entry for developers who want to interact with the web.

Great inroads have been made in various programming languages in allowing the use of native list-comprehension-like syntax to generate SQL queries. xpyth piggybacks off one such effort, Pony, to extend this functionality to XPath. Now anyone familiar with Python comprehension syntax can query XML/HTML documents quickly and easily. Moreover, xpyth integrates with the popular lxml library to enable developers to go beyond the querying capabilities of XPath (when necessary).

Installation

pip install xpyth

Use with lxml

xpyth supports querying lxml ElementTrees using the query function. For example, given a document

<html>
    <div id='main' class='main'>
        <a href='http://www.google.com'>Google</a>
        <a href='http://www.chasestevens.com'>Not Google</a>
        <p>Lorem ipsum</p>
        <p id='123'>no numbers here</p>
        <p id='numbers_only'>123</p>
    </div>
    <div id='123' class='secondary'>
        <a href='http://www.google.org'>Google Charity</a>
        <a href='http://www.chasestevens.org'>Broken link!</a>
    </div>
</html>

accessible as the ElementTree tree, the following can be executed:

>>> import re
>>> from xpyth import query
>>> len(query(a for a in tree))
4
>>> query(a for a in tree if 'Not Google' not in a.text)[0].attrib.get('href')
"http://www.google.com"
>>> next(
...     node 
...     for node in 
...     query(
...         p 
...         for p in 
...         tree 
...         if p.id
...     ) 
...     if re.match(r'\D+', node.attrib.get('id'))
... ).text
"123"
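The three results above can be cross-checked without xpyth or lxml. The following is a minimal sketch using the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath. It assumes that the equivalent path expressions are `.//a` and `.//p[@id]`, and it performs the text and regex filtering in plain Python. It illustrates the expected results only, not how `query` is implemented.

```python
import re
import xml.etree.ElementTree as ET

DOCUMENT = """
<html>
    <div id='main' class='main'>
        <a href='http://www.google.com'>Google</a>
        <a href='http://www.chasestevens.com'>Not Google</a>
        <p>Lorem ipsum</p>
        <p id='123'>no numbers here</p>
        <p id='numbers_only'>123</p>
    </div>
    <div id='123' class='secondary'>
        <a href='http://www.google.org'>Google Charity</a>
        <a href='http://www.chasestevens.org'>Broken link!</a>
    </div>
</html>
"""

root = ET.fromstring(DOCUMENT.strip())

# Counterpart of: query(a for a in tree)
all_links = root.findall(".//a")

# Counterpart of: query(a for a in tree if 'Not Google' not in a.text)
# The text test is done in Python because ElementTree lacks contains().
other_links = [a for a in all_links if 'Not Google' not in (a.text or '')]
first_href = other_links[0].get('href')

# Counterpart of: query(p for p in tree if p.id), then the regex filter
paragraphs_with_id = root.findall(".//p[@id]")
non_numeric = next(
    p for p in paragraphs_with_id
    if re.match(r'\D+', p.get('id'))
)

print(len(all_links), first_href, non_numeric.text)
```

The final step mirrors the point of the last example: once a query has returned elements, ordinary Python (here a regular expression over the `id` attribute) can apply filters that XPath 1.0 cannot express directly.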

Known Issues

HTML tag names that contain special characters such as dashes cannot be selected, as they violate Python's generator expression syntax. HTML attributes containing dashes, e.g. data-bind, work normally.

Support for all is unreliable; for example, the following calls return incorrect expressions:

>>> xpath(X for X in DOM if all(p.id in ('a', 'b') for p in X))
"//*[not(.//p/@id='a' or //p/@id='b')]"  # expected "//*[not(.//p[./@id!='a' and ./@id!='b'])]"
>>> xpath(X for X in DOM if all('x' in p.id for p in X))
"//*[not(.contains(@id, //p))]"  # expected "//*[not(.//p[not(contains(@id, 'x'))])]"
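XPath 1.0 has no universal quantifier, so the expected expressions rely on a double negation: "every child matches" is written as "there is no child that fails to match". The sketch below builds that shape with a hypothetical helper, `all_predicate`, which is not part of xpyth. It could serve as a way to write the intended expression by hand while `all` misbehaves.

```python
def all_predicate(child_path, condition):
    """Express "every node at child_path satisfies condition" in XPath 1.0.

    Hypothetical helper for illustration only; not part of xpyth.
    Rewrites the universal test as not(child_path[not(condition)]).
    """
    return "not({}[not({})])".format(child_path, condition)


# Intended result of: all('x' in p.id for p in X)
contains_example = "//*[{}]".format(all_predicate(".//p", "contains(@id, 'x')"))

# Same shape as the working form.action example in "Example usage"
form_example = "//form[{}]/@action".format(all_predicate("./input", "@value='a'"))

print(contains_example)
print(form_example)
```

Both strings match expressions already given in this document: the first is the expected output noted for the second failing case, and the second is the output of the `form.action` example.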

Contacts

Name: H. Chase Stevens
Twitter: @hchasestevens
