POC: Using web_poet.Unset #21

Draft
wants to merge 17 commits into
base: main
6 changes: 6 additions & 0 deletions CHANGELOG.rst
@@ -2,6 +2,12 @@
Changelog
=========

TBD
===

* Use the ``web_poet.Unset`` sentinel value to represent fields that have not
  been assigned any value, which distinguishes them from fields explicitly set
  to ``None``.

0.2.0 (2022-09-22)
==================

10 changes: 5 additions & 5 deletions README.rst
@@ -20,12 +20,12 @@ zyte-common-items

.. description starts

``zyte-common-items`` is a Python 3.7+ library of item classes used by Zyte_ to
normalize different types of data extracted from websites.

It can be used in custom data extraction code for normalization purposes,
maximizing opportunities for code reuse.
``zyte-common-items`` is a Python 3.7+ library of item_ and `page object`_
classes for web data extraction that we use at Zyte_ to maximize opportunities
for code reuse.

.. _item: https://docs.scrapy.org/en/latest/topics/items.html
.. _page object: https://web-poet.readthedocs.io/en/stable/
.. _Zyte: https://www.zyte.com/

.. description ends
2 changes: 2 additions & 0 deletions docs/conf.py
@@ -35,6 +35,8 @@ def get_version_and_release():

autodoc_member_order = "groupwise"

intersphinx_disabled_reftypes = []
intersphinx_mapping = {
"python": ("https://docs.python.org/3", None),
"web-poet": ("https://web-poet.readthedocs.io/en/stable", None),
}
21 changes: 19 additions & 2 deletions docs/index.rst
@@ -7,10 +7,27 @@ zyte-common-items |version| documentation
:end-before: .. description ends

.. toctree::
:hidden:
:caption: Getting started
:maxdepth: 1

setup
usage

.. toctree::
:caption: Usage
:maxdepth: 1

usage/items
usage/pages

.. toctree::
:caption: Reference
:maxdepth: 1

reference/index
changelog

.. toctree::
:caption: Contributing
:maxdepth: 1

contributing
3 changes: 1 addition & 2 deletions docs/reference/adapter.rst
@@ -2,5 +2,4 @@
Adapter
=======

.. class:: zyte_common_items.ZyteItemAdapter
.. autoclass:: zyte_common_items.adapter.ZyteItemAdapter
.. autoclass:: zyte_common_items.ZyteItemAdapter
24 changes: 8 additions & 16 deletions docs/reference/components.rst
@@ -7,34 +7,26 @@ Components
These classes are used to map data within :ref:`items <items>`, and are not
tied to any specific item type.

.. class:: zyte_common_items.AdditionalProperty(**kwargs)
.. autoclass:: zyte_common_items.components.AdditionalProperty(**kwargs)
.. autoclass:: zyte_common_items.AdditionalProperty(**kwargs)
:members:

.. class:: zyte_common_items.AggregateRating(**kwargs)
.. autoclass:: zyte_common_items.components.AggregateRating(**kwargs)
.. autoclass:: zyte_common_items.AggregateRating(**kwargs)
:members:

.. class:: zyte_common_items.Brand(**kwargs)
.. autoclass:: zyte_common_items.components.Brand(**kwargs)
.. autoclass:: zyte_common_items.Brand(**kwargs)
:members:

.. class:: zyte_common_items.Breadcrumb(**kwargs)
.. autoclass:: zyte_common_items.components.Breadcrumb(**kwargs)
.. autoclass:: zyte_common_items.Breadcrumb(**kwargs)
:members:

.. class:: zyte_common_items.Gtin(**kwargs)
.. autoclass:: zyte_common_items.components.Gtin(**kwargs)
.. autoclass:: zyte_common_items.Gtin(**kwargs)
:members:

.. class:: zyte_common_items.Image(**kwargs)
.. autoclass:: zyte_common_items.components.Image(**kwargs)
.. autoclass:: zyte_common_items.Image(**kwargs)
:members:

.. class:: zyte_common_items.Link(**kwargs)
.. autoclass:: zyte_common_items.components.Link(**kwargs)
.. autoclass:: zyte_common_items.Link(**kwargs)
:members:

.. class:: zyte_common_items.Metadata(**kwargs)
.. autoclass:: zyte_common_items.components.Metadata(**kwargs)
.. autoclass:: zyte_common_items.Metadata(**kwargs)
:members:
1 change: 1 addition & 0 deletions docs/reference/index.rst
@@ -5,5 +5,6 @@ Reference
.. toctree::

items
pages
components
adapter
25 changes: 10 additions & 15 deletions docs/reference/items.rst
@@ -1,43 +1,38 @@
.. _items:
.. _item-api:

=====
Items
=====
========
Item API
========

Product
=======

.. class:: zyte_common_items.Product(**kwargs)
.. autoclass:: zyte_common_items.items.Product(**kwargs)
.. autoclass:: zyte_common_items.Product(**kwargs)
:members:
:inherited-members:

.. class:: zyte_common_items.ProductVariant(**kwargs)
.. autoclass:: zyte_common_items.items.ProductVariant(**kwargs)
.. autoclass:: zyte_common_items.ProductVariant(**kwargs)
:members:
:inherited-members:

Product List
============

.. class:: zyte_common_items.ProductList(**kwargs)
.. autoclass:: zyte_common_items.items.ProductList(**kwargs)
.. autoclass:: zyte_common_items.ProductList(**kwargs)
:members:
:inherited-members:

.. class:: zyte_common_items.ProductFromList(**kwargs)
.. autoclass:: zyte_common_items.items.ProductFromList(**kwargs)
.. autoclass:: zyte_common_items.ProductFromList(**kwargs)
:members:
:inherited-members:


Custom items
============

Subclass :class:`~zyte_common_items.base.Item` to create your own item classes.
Subclass :class:`~zyte_common_items.Item` to create your own item classes.

.. class:: zyte_common_items.Item(**kwargs)
.. autoclass:: zyte_common_items.base.Item(**kwargs)
.. autoclass:: zyte_common_items.Item(**kwargs)
:members:

.. attribute:: _unknown_fields_dict
76 changes: 76 additions & 0 deletions docs/reference/pages.rst
@@ -0,0 +1,76 @@
.. _page-object-api:

===============
Page object API
===============

Product
=======

.. autoclass:: zyte_common_items.BaseProductPage(**kwargs)
:show-inheritance:

.. autoclass:: zyte_common_items.ProductPage(**kwargs)
:show-inheritance:


Product List
============

.. autoclass:: zyte_common_items.BaseProductListPage(**kwargs)
:show-inheritance:

.. autoclass:: zyte_common_items.ProductListPage(**kwargs)
:show-inheritance:


Custom page objects
===================

Subclass :class:`~zyte_common_items.Page` to create your own page object
classes that rely on :class:`~zyte_common_items.HttpResponse`.

If you do not want :class:`~zyte_common_items.HttpResponse` as input, you can
inherit from :class:`~zyte_common_items.BasePage` instead.

.. autoclass:: zyte_common_items.BasePage(**kwargs)
:show-inheritance:

Base class for page object classes that have
:class:`~zyte_common_items.ResponseUrl` as a dependency.

.. data:: metadata
:type: zyte_common_items.Metadata

Data extraction process metadata.

:attr:`~zyte_common_items.Metadata.dateDownloaded` is set to the current
UTC date and time.

:attr:`~zyte_common_items.Metadata.probability` is set to ``1.0``.

.. data:: url
:type: web_poet.page_inputs.http.ResponseUrl

Main URL from which the data has been extracted.

.. autoclass:: zyte_common_items.Page(**kwargs)
:show-inheritance:

Base class for page object classes that have
:class:`~zyte_common_items.HttpResponse` as a dependency.

.. data:: metadata
:type: zyte_common_items.Metadata

Data extraction process metadata.

:attr:`~zyte_common_items.Metadata.dateDownloaded` is set to the current
UTC date and time.

:attr:`~zyte_common_items.Metadata.probability` is set to ``1.0``.

.. data:: url
:type: web_poet.page_inputs.http.ResponseUrl

Main URL from which the data has been extracted.
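
The default ``metadata`` behaviour described above can be sketched as follows. This is an illustration only: ``MetadataSketch`` and ``default_metadata`` are hypothetical names, and the timestamp format (ISO 8601 with a trailing ``Z``) is an assumption, since the documentation only says that the current UTC date and time is used.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class MetadataSketch:
    # Illustrative stand-in for zyte_common_items.Metadata.
    dateDownloaded: Optional[str] = None
    probability: Optional[float] = None


def default_metadata(now: Optional[datetime] = None) -> MetadataSketch:
    """Build metadata with the defaults described for the page classes."""
    now = now or datetime.now(timezone.utc)
    # Assumption: ISO 8601 string with a trailing "Z"; the real format is
    # not stated in the documentation above.
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return MetadataSketch(dateDownloaded=stamp, probability=1.0)


# Passing a fixed time makes the result reproducible, e.g. in tests.
fixed = datetime(2022, 9, 22, 12, 0, 0, tzinfo=timezone.utc)
metadata = default_metadata(fixed)
```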
8 changes: 7 additions & 1 deletion docs/usage.rst → docs/usage/items.rst
@@ -1,7 +1,12 @@
.. _items:

=====
Usage
Items
=====

The :ref:`provided item classes <item-api>` can be used to map data extracted
from web pages, e.g. using :ref:`page objects <page-objects>`.

Creating items from dictionaries
================================

@@ -31,6 +36,7 @@ nested data, such as :class:`~zyte_common_items.components.Image` and
>>> product.mainImage
Image(url='https://example.com/image.png')
>>> product.canonicalUrl
Unset
>>> product.gtin
[Gtin(type='gtin13', value='9504000059446')]

37 changes: 37 additions & 0 deletions docs/usage/pages.rst
@@ -0,0 +1,37 @@
.. _page-objects:

============
Page objects
============

The :ref:`provided page object classes <page-object-api>` are good base classes
for custom page object classes that implement website-specific :doc:`page
objects <web-poet:index>`.

They provide the following baseline:

- They declare the :ref:`item class <items>` that they return, allowing for
their ``to_item`` method to automatically build an instance of it from
``@field``-decorated methods. See :ref:`web-poet-fields`.

- They provide a default implementation for their
:attr:`~zyte_common_items.Page.metadata` and
:attr:`~zyte_common_items.Page.url` fields.
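
The mechanism in the list above (a declared item class, plus ``to_item`` collecting the results of decorated methods) can be sketched without any dependencies. This is not web-poet's implementation: the ``field`` decorator, ``PageSketch`` and ``ProductSketch`` below are simplified, hypothetical stand-ins, and the real ``@field`` does considerably more.

```python
from dataclasses import dataclass
from typing import Optional


def field(method):
    """Hypothetical marker decorator; web-poet's real ``field`` does more."""
    method._is_field = True
    return method


@dataclass
class ProductSketch:
    # Illustrative item class, not zyte_common_items.Product.
    name: Optional[str] = None
    url: Optional[str] = None


class PageSketch:
    item_cls = ProductSketch  # the item class this page declares

    def __init__(self, url):
        self._url = url

    @field
    def url(self):
        # Default field provided by the base class.
        return self._url

    def to_item(self):
        # Call every @field-marked method and pass results to the item class.
        values = {}
        for name in dir(type(self)):
            attr = getattr(type(self), name)
            if getattr(attr, "_is_field", False):
                values[name] = attr(self)
        return self.item_cls(**values)


class CustomPageSketch(PageSketch):
    @field
    def name(self):
        # A website-specific field added by the subclass.
        return "Example product"


item = CustomPageSketch("https://example.com/p/1").to_item()
```

The subclass only adds the ``name`` field. ``url`` comes from the base class, which mirrors how the provided page classes supply default ``metadata`` and ``url`` fields.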

The following code shows a :class:`~zyte_common_items.ProductPage` subclass
whose ``to_item`` method returns an instance of
:class:`~zyte_common_items.Product` with
:attr:`~zyte_common_items.Product.metadata`, a
:attr:`~zyte_common_items.Product.name`, and a
:attr:`~zyte_common_items.Product.url`:

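.. code-block:: python

    from web_poet import field
    from zyte_common_items import ProductPage


    class CustomProductPage(ProductPage):

        # Only ``name`` is defined here; ``metadata`` and ``url`` come from
        # the base class.
        @field
        def name(self):
            return self.css("h1::text").get()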
2 changes: 1 addition & 1 deletion setup.py
@@ -25,7 +25,7 @@
install_requires=[
"attrs>=21.3.0",
"itemadapter>=0.2.0",
"web-poet>=0.5.0",
"web-poet @ git+https://[email protected]/scrapinghub/web-poet@feat-unset#egg=web-poet",
],
classifiers=[
"Development Status :: 3 - Alpha",
3 changes: 2 additions & 1 deletion tests/test_adapter.py
@@ -9,6 +9,7 @@
import attrs
import pytest
from itemadapter import ItemAdapter
from web_poet import Unset

from zyte_common_items import Item, Product, ZyteItemAdapter

@@ -202,7 +203,7 @@ def test_known_field_get_missing():
product = Product(url=url)
with configured_adapter():
adapter = ItemAdapter(product)
assert adapter["canonicalUrl"] is None
assert adapter["canonicalUrl"] is Unset


def test_known_field_set():
6 changes: 4 additions & 2 deletions tests/test_components.py
@@ -1,3 +1,5 @@
from web_poet import Unset

from zyte_common_items import AggregateRating, Breadcrumb, Link, Metadata


@@ -19,5 +21,5 @@ def test_link_optional_fields():

def test_metadata_default_values():
metadata = Metadata()
assert metadata.dateDownloaded is None
assert metadata.probability is None
assert metadata.dateDownloaded is Unset
assert metadata.probability is Unset