add docs
jamesturk committed Mar 20, 2023
1 parent c5b190a commit 32a6c75
Showing 16 changed files with 1,095 additions and 50 deletions.
10 changes: 6 additions & 4 deletions LICENSE.md
@@ -1,3 +1,5 @@
# Hippocratic License

**HIPPOCRATIC LICENSE**

**Version 3.0, October 2021**
Expand Down Expand Up @@ -62,10 +64,10 @@ The rights granted to the Licensee by this License are expressly made subject to
* [3.1.17.](#3.1.17) Harm the environment in a manner inconsistent with local, state, national, or international law.
* [3.2.](#3.2) The Licensee SHALL:
* [3.2.1.](#3.2.1) Provide equal pay for equal work where the performance of such work requires equal skill, effort, and responsibility, and which are performed under similar working conditions, except where such payment is made pursuant to:
*[3.2.1.1.](#3.2.1.1) A seniority system;
* [3.2.1.2.](#3.2.1.2) A merit system;
*[3.2.1.3.](#3.2.1.3) A system which measures earnings by quantity or quality of production; or
* [3.2.1.4.](#3.2.1.4) A differential based on any other factor other than sex, gender, sexual orientation, race, ethnicity, nationality, religion, caste, age, medical disability or impairment, and/or any other like circumstances (See 29 U.S.C.A. § 206(d)(1); Article 23, _United Nations Universal Declaration of Human Rights_; Article 7, _International Covenant on Economic, Social and Cultural Rights_; Article 26, _International Covenant on Civil and Political Rights_); and
* [3.2.1.1.](#3.2.1.1) A seniority system;
* [3.2.1.2.](#3.2.1.2) A merit system;
* [3.2.1.3.](#3.2.1.3) A system which measures earnings by quantity or quality of production; or
* [3.2.1.4.](#3.2.1.4) A differential based on any other factor other than sex, gender, sexual orientation, race, ethnicity, nationality, religion, caste, age, medical disability or impairment, and/or any other like circumstances (See 29 U.S.C.A. § 206(d)(1); Article 23, _United Nations Universal Declaration of Human Rights_; Article 7, _International Covenant on Economic, Social and Cultural Rights_; Article 26, _International Covenant on Civil and Political Rights_); and
* [3.2.2.](#3.2.2) Allow for reasonable limitation of working hours and periodic holidays with pay (See Article 24, _United Nations Universal Declaration of Human Rights_; Article 7, _International Covenant on Economic, Social and Cultural Rights_).

**[4.](#4) SUPPLY CHAIN IMPACTED PARTIES:**
7 changes: 7 additions & 0 deletions README.md
@@ -140,6 +140,13 @@ See the [examples directory](https://github.com/jamesturk/scrapeghost/tree/main/

## Changelog

### 0.3.0 - WIP

* use `tiktoken` for tokenization instead of guessing
* compute cost of each call, add `total_cost` to scrapers
* add tests and complete examples
* list mode prompt improvements

### 0.2.0 - 2023-03-18

* Add list mode, auto-splitting, and pagination support.
123 changes: 123 additions & 0 deletions docs/LICENSE.md

Large diffs are not rendered by default.

8 changes: 8 additions & 0 deletions docs/_cost.md
@@ -0,0 +1,8 @@
**Cost in USD per 1,000 tokens (March 19th, 2023)**

| Model | Input Tokens | Output Tokens |
| --- | --- | --- |
| GPT-3.5-Turbo | 0.002 | 0.002 |
| GPT-4 (8k) | 0.03 | 0.06 |
| GPT-4 (32k) | 0.06 | 0.12 |
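
As a rough sketch of how these rates turn into a per-call cost, the arithmetic is just tokens divided by 1,000 times the rate, summed over input and output. The `PRICES` dictionary, its model keys, and the `estimate_cost` helper are assumptions made for illustration; they are not scrapeghost's own cost code.

```python
# Prices in USD per 1,000 tokens, copied from the table above (March 19th, 2023).
# The dictionary keys are assumed model identifiers, used only for this sketch.
PRICES = {
    "gpt-3.5-turbo": (0.002, 0.002),
    "gpt-4": (0.03, 0.06),
    "gpt-4-32k": (0.06, 0.12),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD of a single call (hypothetical helper)."""
    input_price, output_price = PRICES[model]
    return (input_tokens / 1000) * input_price + (output_tokens / 1000) * output_price
```

For example, a GPT-4 (8k) call with 2,000 input tokens and 500 output tokens would come to roughly $0.09 at these rates.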

71 changes: 71 additions & 0 deletions docs/api.md
@@ -0,0 +1,71 @@
# API Reference

## `SchemaScraper`

### Selectors

The main limitation you'll run into is the token limit. Depending on the model you're using you're limited to 4096 or 8192 tokens per call. Billing is also based on tokens sent and received.

One strategy to deal with this is to provide a CSS or XPath selector to the scraper. This will pre-filter the HTML that is sent to the server, keeping you under the limit and saving you money.

Pass the `css` or `xpath` arguments to the scraper to use a selector:

```python
>>> scrape_legislators("https://www.ilga.gov/house/rep.asp?MemberID=3071", xpath="//table[1]")
```

### SchemaScraper Options

* `model` - The GPT model to use, defaults to `gpt-4`, can also be `gpt-3.5-turbo`.
* `list_mode` - If `True` the scraper will return a list of objects instead of a single object. (Alters the prompts and some behavior.)
* `split_length` - If set, the scraper will split the page into multiple calls, each of at most this many tokens. (Only works with `list_mode`; requires passing a `css` or `xpath` selector when scraping.)
* `model_params` - A dictionary of parameters to pass to the underlying GPT model.
* `extra_instructions` - Additional instructions to pass to the GPT model.

### Auto-splitting

It's worth mentioning how `split_length` works because it allows for some interesting possibilities but can also become quite expensive.

If you pass `split_length` to the scraper, it assumes the page is made of multiple similar sections and will try to split the page into multiple calls.

When you call the scrape function of an auto-splitting enabled scraper, you are required to pass a `css` or `xpath` selector to the function. The resulting nodes will be combined into chunks no bigger than `split_length` tokens, sent to the API, and then stitched back together.
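
The combining step can be pictured as a greedy grouping. The sketch below is only an illustration of that idea, not scrapeghost's actual implementation: `chunk_nodes` is a hypothetical helper, and `count_tokens` is a stand-in that counts whitespace-separated words where a real tokenizer such as `tiktoken` would be used.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Hypothetical stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def chunk_nodes(
    nodes: List[str],
    split_length: int,
    count: Callable[[str], int] = count_tokens,
) -> List[List[str]]:
    """Greedily group nodes into chunks of at most split_length tokens."""
    chunks: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for node in nodes:
        node_tokens = count(node)
        # Start a new chunk when adding this node would exceed the budget.
        if current and current_tokens + node_tokens > split_length:
            chunks.append(current)
            current = []
            current_tokens = 0
        # A single node larger than split_length still gets a chunk of its own.
        current.append(node)
        current_tokens += node_tokens
    if current:
        chunks.append(current)
    return chunks
```

Each inner list would correspond to one API call, which is why a smaller `split_length` means more calls for the same page.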

This seems to work well for long lists of similar items, though whether it is worth the many calls is questionable.

Look at `examples/cbb.py` for an example of an 800+ item page that is split into many calls.

### Examples

See the [examples directory](https://github.com/jamesturk/scrapeghost/tree/main/examples) for current usage.

### Configuration

### `scrape`

## Exceptions

A scrape can raise the following exceptions:

### `TooManyTokens`

Raised when the number of tokens being sent exceeds the maximum allowed.

This indicates that the HTML is too large to be processed by the API.

:::{.callout-tip}
Consider using the `css` or `xpath` selectors to reduce the number of tokens being sent, or use the `split_length` parameter to split the request into multiple requests if necessary.
:::

### `BadStop`

Indicates that OpenAI ran out of space before the stop token was reached.

:::{.callout-tip}
OpenAI considers both the input and the response tokens when determining if the token limit has been exceeded.

If you are using `split_length`, consider decreasing the value to leave more space for responses.
:::
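
As a rough illustration of the arithmetic behind this tip, the space left for a response is whatever remains of the model's token limit once the prompt (instructions plus HTML) has been counted. The helper name `response_budget` is hypothetical and not part of scrapeghost.

```python
def response_budget(token_limit: int, prompt_tokens: int) -> int:
    """Return how many tokens remain for the response, never below zero."""
    return max(token_limit - prompt_tokens, 0)
```

With a 4,096-token limit, a 3,500-token prompt leaves 596 tokens for the response, so a long list of results could easily be cut off. Lowering `split_length` shrinks the prompt and leaves more room.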

### `InvalidJSON`

Indicates that the JSON returned by the API is invalid.
22 changes: 22 additions & 0 deletions docs/changelog.md
@@ -0,0 +1,22 @@
# Changelog

## 0.3.0 - WIP

* use `tiktoken` for tokenization instead of guessing
* compute cost of each call, add `total_cost` to scrapers
* add tests and complete examples
* list mode prompt improvements

## 0.2.0 - 2023-03-18

* Add list mode, auto-splitting, and pagination support.
* Improve `xpath` and `css` handling.
* Improve prompt for GPT 3.5.
* Make it possible to alter parameters when calling scrape.
* Logging & error handling.
* Command line interface.
* See blog post for details: <https://jamesturk.net/posts/scraping-with-gpt-part-2/>

## 0.1.0 - 2023-03-17

* Initial experiment, see blog post for more: <https://jamesturk.net/posts/scraping-with-gpt-4/>
15 changes: 15 additions & 0 deletions docs/cli.md
@@ -0,0 +1,15 @@
# Command Line Interface

scrapeghost offers a command line interface which is particularly useful for experimentation.

It can also be used as a step in a data pipeline.

## Configuration

In order to use the CLI, the `OPENAI_API_KEY` environment variable must be set.

## Usage

```{bash}
scrapeghost --help
```
134 changes: 134 additions & 0 deletions docs/code_of_conduct.md
@@ -0,0 +1,134 @@
# Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, caste, color, religion, or sexual identity
and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our
community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or
advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.

Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement:

* James Turk: <[email protected]>

All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the
reporter of any incident.

## Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:

### 1. Correction

**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.

**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.

### 2. Warning

**Community Impact**: A violation through a single incident or series
of actions.

**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.

### 3. Temporary Ban

**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.

**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.

### 4. Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within
the community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
[https://www.contributor-covenant.org/version/2/0/code_of_conduct.html][v2.0].

Community Impact Guidelines were inspired by
[Mozilla's code of conduct enforcement ladder][Mozilla CoC].

For answers to common questions about this code of conduct, see the FAQ at
[https://www.contributor-covenant.org/faq][FAQ]. Translations are available
at [https://www.contributor-covenant.org/translations][translations].

[homepage]: https://www.contributor-covenant.org
[v2.0]: https://www.contributor-covenant.org/version/2/0/code_of_conduct.html
[Mozilla CoC]: https://github.com/mozilla/diversity
[FAQ]: https://www.contributor-covenant.org/faq
[translations]: https://www.contributor-covenant.org/translations
57 changes: 57 additions & 0 deletions docs/faq.md
@@ -0,0 +1,57 @@
# FAQ

## Is this practical?

When I started this project, I really didn't think it would be. I was aiming at a fun proof of concept, but I've been surprised by the results.

Time will tell if this is a practical tool, but I'm somewhat hopeful now.

## Why not use a different model?

This was a toy project, not an attempt to build a production system. I'm open to trying other models if you have suggestions.

## What about pages where the data is loaded dynamically?

This won't work for those out of the box. It should be possible to use something like [selenium](https://selenium-python.readthedocs.io/) to load the page and then pass the rendered HTML to `scrapeghost`.

## What if a page is too big?

Try the following:

1. Provide a CSS or XPath selector to limit the scope of the page.

2. Pre-process the HTML. Trim tags or entire sections you don't need.

3. Finally, you can use the `split_length` parameter to split the page into smaller chunks. This only works for list-type pages, and requires a good choice of selector to split the page up.
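
For step 2, a minimal sketch of pre-processing using only the standard library might look like the following. It drops `<script>` and `<style>` elements and comments, which often account for a large share of a page's tokens. `TagStripper` and `strip_noise` are hypothetical names used for illustration; they are not part of scrapeghost, and which tags are safe to drop depends on the page.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Re-emit HTML while dropping <script>/<style> elements and comments."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        # Keep entities such as &amp; untouched so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are emitted once, with no end tag.
        if tag not in self.SKIP and self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(self.skip_depth - 1, 0)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&#{name};")


def strip_noise(html: str) -> str:
    """Return html without script/style elements, comments, or doctype."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

The cleaned string can then be used in place of the original HTML. Comments and the doctype are dropped simply because the parser's default handlers ignore them.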

## Why not ask the scraper to write CSS / XPath selectors?

While it'd seem like this would perform better, there are a few practical challenges standing in the way right now.

* Writing a robust CSS/XPath selector that'd run against a whole set of pages would require passing a lot of context to the model. The token limit is already the major limitation.
* The current solution does not require any changes when a page changes. A selector-based model would require retraining every time a page changes as well as a means to detect such changes.
* For some data, selectors alone are not enough. The current model can easily extract all of the addresses from a page and break them into city/state/etc. A selector-based model would not be able to do this.

I do think there is room for hybrid approaches, and I plan to continue to explore them.

## Does the model "hallucinate" data?

It is possible, but in practice it hasn't been a major problem so far.

Because the *temperature* is zero, the output is close to deterministic, which seems to make hallucinated data less likely.

Future versions of this tool will allow for automated error checking (and possibly correction).

## How much did you spend developing this?

So far, about $25 on API calls. Switching to GPT-3.5 as the default made a big difference.

My most expensive call was a paginated GPT-4 call that cost $2.20. I decided to add the cost-limiting features after that.

## What's with the license?

I'm still working on figuring this out.

For now, if you're working in a commercial setting and the license scares you away, that's fine.

If you really want to, you can contact me and we can work something out.