
WARCs with 3600 new records fail to fully index #828

Open
machawk1 opened this issue May 7, 2024 · 0 comments

I am attempting to index a WARC from Archive-It using ipwb from the current master branch [abaa35a](https://github.com/oduwsdl/ipwb/commit/abaa35ab13902dd06948a3e8d7b466ee89bf780f) and am receiving a timeout error. The WARC is about 131 MB and contains 3600 records, as reported by the indexer.

At record 3375, the indexing output stalls and eventually says:

```
IPFS failed to add, retrying attempt 1/59-ANNUAL-KBAWJW-20110217001046-00000-crawling113.us.archive.org-6682.warc: 3375/3600
(<class 'requests.exceptions.ConnectionError'>, ConnectionError(ReadTimeoutError("HTTPConnectionPool(host='localhost', port=5001): Read timed out.")), <traceback object at 0x1059252c0>)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ipwb/indexer.py", line 66, in push_to_ipfs
    http_header_ipfs_hash = push_bytes_to_ipfs(hstr)
                            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ipwb/indexer.py", line 367, in push_bytes_to_ipfs
    res = ipfs_client().add_bytes(bytes_in)  # bytes)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ipfshttpclient/utils.py", line 195, in wrapper
    res = cmd(*args, **kwargs)  # type: ty.Dict[str, T]
          ^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ipfshttpclient/client/base.py", line 229, in wrapper2
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ipfshttpclient/client/__init__.py", line 264, in add_bytes
    return self._client.request('/add', decoder='json',
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ipfshttpclient/http_common.py", line 594, in request
    return stream_decode_full(closables, res,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ipfshttpclient/http_common.py", line 189, in stream_decode_full
    result = list(response_iter)  # type: ty.List[T_co]
             ^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ipfshttpclient/http_common.py", line 131, in __next__
    data = next(self._response_iter)  # type: bytes
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/requests/models.py", line 822, in generate
    raise ConnectionError(e)
```

Kubo/ipfs 0.28.0

Ctrl-C'ing the IPFS daemon does not stop it; the process stays open. After killing the daemon, restarting it, and re-running the indexing, the process succeeds, i.e., the indexing completes. I wonder whether indexing this many records overloads the daemon with requests before it has responded to the earlier ones. Records added during the first run would not need to be re-added, so the second run has less to do and can perhaps complete.
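If the cause is a burst of add requests outpacing the daemon, raising the client-side read timeout and backing off between retries might work around it. A minimal sketch, assuming ipfshttpclient's default local daemon address; the `push_bytes_with_retry` helper and the attempt/timeout values are illustrative, not ipwb's actual code:

```python
import time

import ipfshttpclient


def push_bytes_with_retry(data: bytes, attempts: int = 5,
                          read_timeout: int = 120) -> str:
    """Add bytes to the local IPFS daemon, backing off on timeouts.

    Hypothetical helper; attempts/read_timeout are illustrative values.
    """
    for attempt in range(1, attempts + 1):
        try:
            # A longer read timeout gives a busy daemon time to answer
            # before requests raises ReadTimeoutError.
            with ipfshttpclient.connect(timeout=read_timeout) as client:
                return client.add_bytes(data)
        except (ipfshttpclient.exceptions.ConnectionError,
                ipfshttpclient.exceptions.TimeoutError):
            if attempt == attempts:
                raise
            # Exponential backoff lets the daemon drain queued adds.
            time.sleep(2 ** attempt)
```

In a real batch run the client connection would presumably be opened once and reused across records; it is opened per call here only to keep the sketch self-contained.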

Note that this effect does not seem to occur on WARCs that are slightly smaller (like this one (125 MB) and this one (108 MB) from the same crawl).
