Finished crawling with no results #175
Comments
Strange, especially since there's no error in the log! Does the extraction work for you when you use the library mode (see readme.md) instead of the CLI mode?
Actually not. The problem seems to be that one has to accept the advertisement popup first. The output was:
Did I understand you correctly that:
I had this problem myself; I am pretty sure I had a configuration issue that was failing silently. I remade my configuration file based on the examples and things seemed to start working. My assumption was some weird Python tabs-or-spaces problem.
To 1.: exactly! I just asked for maintext and title. edit:
@tobiasstrauss I agree with you, the issue was the website pop-up at https://www.zeit.de/zustimmung?url=https%3A%2F%2Fwww.zeit.de%2Findex
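Since the crawl finishes without any error, it can help to check whether a request ended up on the consent page instead of the article. Below is a minimal sketch using only the Python standard library. The helper name `consent_redirect_target` and the default `consent_path` are hypothetical, chosen to match the zeit.de URL quoted above; other sites will use different paths and parameters.

```python
from urllib.parse import urlparse, parse_qs


def consent_redirect_target(final_url, consent_path="/zustimmung"):
    """Return the originally requested URL if final_url is a consent page.

    Assumes the consent page keeps the original address in a `url` query
    parameter, as in the zeit.de link quoted in this thread. Returns None
    when final_url does not look like a consent page.
    """
    parsed = urlparse(final_url)
    if parsed.path.rstrip("/") != consent_path:
        return None
    # parse_qs percent-decodes the values for us
    targets = parse_qs(parsed.query).get("url")
    return targets[0] if targets else None


if __name__ == "__main__":
    blocked = "https://www.zeit.de/zustimmung?url=https%3A%2F%2Fwww.zeit.de%2Findex"
    print(consent_redirect_target(blocked))
    print(consent_redirect_target("https://www.zeit.de/index"))
```

A check like this only makes the silent failure visible; it does not get past the consent page by itself.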
Hey there @tobiasstrauss
Hey @woxxel,
@SamuelHelspr or @woxxel have either of you (or anyone reading) figured out how to send a cookie? I've been using the `from_url` function and it seems there's no option to pass it.
If no one has figured this out in a week, ping me and I'll write a quick patch for you. I was doing some decently large-scale crawls with this, and to get scale that was something I had to do.
Hey @JermellB, I'd gladly take you up on that patch.
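Until such a patch exists, one possible workaround is to download the page yourself with a `Cookie` header and hand the HTML to the extractor afterwards. The following is a rough sketch, not the patch discussed above. The helper names `build_request_with_cookie` and `fetch_html` are hypothetical, and so is the cookie name in the usage comment; the real consent cookie has to be looked up per site, for example in the browser's developer tools. It also assumes that `NewsPlease.from_html` from the library mode accepts the downloaded HTML, so please check that against your installed version.

```python
import urllib.request


def build_request_with_cookie(url, cookies, user_agent="Mozilla/5.0"):
    """Build a GET request that carries the given cookies in a Cookie header."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return urllib.request.Request(
        url,
        headers={"Cookie": cookie_header, "User-Agent": user_agent},
    )


def fetch_html(url, cookies, timeout=10):
    """Download url with the given cookies and return the decoded HTML."""
    request = build_request_with_cookie(url, cookies)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage sketch (cookie name and value are placeholders, not the real ones):
# html = fetch_html("https://www.zeit.de/index", {"example_consent": "1"})
# from newsplease import NewsPlease
# article = NewsPlease.from_html(html, url="https://www.zeit.de/index")
# print(article.title)
```

Whether a single cookie is enough depends on the site; some consent walls also check other headers or require JavaScript.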
I just had the same experience. Interestingly, some sites (Guardian, FAZ) are working fine even though there are ads in between. But for Spiegel the
@JermellB any updates from you? Do you need help getting started on this patch?
cf. #282
Mandatory
Related issues:
Describe your question
The given CLI example returns no pages from zeit.de. I have the same problem with other web pages. No error is thrown; it just returns and claims to be finished. So the question is whether there is a way to approach the problem. I attached the log file.
log.txt
Versions (please complete the following information):
Intent (optional; we'll use this info to prioritize upcoming tasks to work on)
I train language models in order to fine-tune them on other tasks such as NER or text classification.