Bug? splash 3.0+ instances locking up on certain SSL requests. Does not happen on 2.3.3 #1164
Comments
Same problem |
Since you say the issue started happening recently, without Splash itself changing, and assuming nothing has changed on the target websites, something other than Splash itself must have changed on your end. I assume a newer version of a dependency is at fault here. My best guess would be Twisted, as Splash 2.3.3 caps it at 16.3.0, while 3.0+ do not cap it, and there have been recent releases. It would be great if someone could check whether freezing Twisted at 16.3.0 works. If it does, we could then find the specific version where the issue starts happening, and that would help identify the issue. I would not rule out that the problem is not Twisted itself, but some indirect dependency that Splash gets through its dependency on Twisted. |
@Gallaecio i'll give it a try today and report back. edit: day got away from me, shooting for monday |
@Gallaecio forcing twisted to 16.3.0 in a splash 3.5 docker container did not resolve the issue. the symptoms are the same. for clarity in case i did something wrong, i did
and once connected, ran
afterward i ran then i ran my scraper that is known to cause the issue and observed the same symptoms |
Did running |
@Gallaecio one more piece of context: for these tests on my dev environment i'm running one splash 3.5 instance on twisted 16.3.0 and two on the default (twisted 19 something). although i did get the compatibility warning, the instance using twisted 16.3.0 works fine with sites that don't cause this issue, and exhibits the exact same failure behavior with the site that does cause the issue. edit: i noticed my (working) splash 2.3.3 on prod is actually running twisted 16.1.1, so i tried that version with splash 3.5 and observed the same issue. so i do not think the twisted version is the problem |
Which packages was it about? It is possible the issue is not Twisted, but an indirect dependency. If the issue is neither Twisted nor an indirect dependency, and it is actually an upstream change that is incompatible with newer Splash (i.e. with the WebKit version upgrade Splash 3.0 got), fixing the issue may be rather hard, and unlikely to be done any time soon, if ever. |
@Gallaecio the only warning was about splash incompatibility |
Then I don’t think Twisted is the issue :( |
@Gallaecio are there any more verbose logs i can produce for splash somehow, or from some directory? there is a splash verbosity setting that defaults to 1 during aquarium setup. I will try messing with that along with anything else you suggest |
I am not familiar enough with Splash to help much further.
I might have been wrong here, given dependencies are not an issue. Maybe those websites somehow stopped working with the version of WebKit that Splash 3.x uses. |
this might be true, but splash silently locking up and dying is not good behavior in this case |
bump. any ideas, anyone? |
In October, reCAPTCHA introduced code that breaks Splash 3.x; this is confirmed with 3.2 and 3.5. For simply reading a site, adding an on_request() hook at the beginning of your script that blocks any attempt to access a URL containing "recaptcha/releases" will prevent it from locking up. I'm not aware of any other workarounds, or of any root-cause information about what that JavaScript is doing to break Splash. |
@gtsupport-com thank you for the answer - and my apologies, i'm using the built in splash render.html - are you talking about the lua script? I never did learn lua, could you spell this out for me? thanks |
@minispeck All of my experience has been via /execute and lua scripts thus I'm not familiar with the options for the built in renderers. My first guess would be to place your own proxy in front of your splash instance and block it via that proxy. I don't see an option in the splash documentation to auto-blacklist certain urls; if you're dependent on render.html I don't have an easy answer for you. |
@gtsupport-com oh sorry i meant, i'm happy to move to execute endpoint, just 0 lua knowledge, so assuming i start with a copy of the default script, could you toss me some sample code for on_request to kill those requests? |
This will grab that page; delete the "args.url= ..." line if you are passing the URL in externally. There are a large number of examples on the Splash documentation site, and it would be worth your while to dig into the tutorial so you can troubleshoot/tweak if necessary.
|
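A minimal sketch of the workaround described above (not the script originally posted in this thread): a Lua script for the /execute endpoint whose on_request hook aborts any request with "recaptcha/releases" in its URL, sent to Splash from Python. The Splash address, the helper names `build_execute_request` and `fetch_html`, and the 0.5 second wait are assumptions for illustration.

```python
import json
import urllib.request

# Lua script for Splash's /execute endpoint. The on_request hook is
# registered before splash:go so it sees every request the page makes.
LUA_SOURCE = """
function main(splash, args)
  -- Abort any request whose URL contains "recaptcha/releases"
  -- (plain-text match, no Lua pattern characters interpreted).
  splash:on_request(function(request)
    if string.find(request.url, "recaptcha/releases", 1, true) then
      request:abort()
    end
  end)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return splash:html()
end
"""


def build_execute_request(splash_base, page_url):
    """Build a POST request for /execute (hypothetical helper name)."""
    payload = json.dumps({"lua_source": LUA_SOURCE, "url": page_url})
    return urllib.request.Request(
        splash_base.rstrip("/") + "/execute",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_html(splash_base, page_url, timeout=90):
    """Send the script to Splash and return the rendered HTML as text."""
    request = build_execute_request(splash_base, page_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_html("http://localhost:8050", url)` would stand in for a render.html call, assuming Splash is reachable at that address. Because the URL is passed in the request body, the script reads it from `args.url` rather than hard-coding it.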
@minispeck You should set the Another issue arises from the fact that the Please take into consideration: @kmike | @immerrr | @Gallaecio |
@minispeck If you insist on using the WebKit engine (it's lightweight and fast, but QtWebKit is awaiting updates - here I want to thank @annulen for his great efforts: thank you very much), you'll need to utilize the |
FYI, you can get updated version of QtWebKit maintained by @mnutt at https://github.com/movableink/webkit/ — it's very close to WebKit's bleeding edge and should have much better compatibility with modern web content (though it's not polished at the moment and can have quite a few rough edges). |
My issue happens on splash 3.0 and 3.5 but NOT on 2.3.3. i am currently running prod on 2.3.3 as a workaround and would like a permanent solution to run 3.x
i have been running splash + HAProxy set up by aquarium for years before experiencing this issue, including successfully rendering the sites in question without issue prior to the day before yesterday
here is a url that consistently produces the issue, even simply using render.html from [host]:8050
https://www.schooljobs.com/careers/kirkwoodcc/jobs/3776251/adjunct-dental-hygiene
happens with aquarium default configuration
this happens in both dev (macOS 15+) and prod (ubuntu) environments, and i did try wiping all my containers and starting over with aquarium. splash works fine for other urls but the above and some others kill it. every time, it locks up the entire docker container (immediately) and the HAProxy stats show a Level 7 timeout (splash 3.5) or Level 4 timeout (3.0).
i cannot attach to a splash docker instance that hangs in this way - if i try, my terminal hangs.
thanks to docker-compose with aquarium i can watch splash output live. on 3.5 i often don't even get to see output of the request starting. sometimes i just see the request and then no more output as the instance hangs
on 3.0 only i get the following info
i have googled the network issue and found a bunch of issues right here in this repo with no clear answers about what is going on.
happy to be very responsive. please let me know if more info is needed. I want to get back to splash 3.x
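The reproduction described above (a plain render.html request against the failing URL) can be sketched in Python as follows. The helper names `render_html_url` and `probe`, the Splash address, and the timeout values are assumptions for illustration. The client-side timeout is there so that a hung instance shows up as a failure instead of blocking the caller.

```python
import urllib.parse
import urllib.request


def render_html_url(splash_base, page_url, splash_timeout=30):
    """Build a render.html URL for the given page (hypothetical helper)."""
    query = urllib.parse.urlencode({"url": page_url, "timeout": splash_timeout})
    return splash_base.rstrip("/") + "/render.html?" + query


def probe(splash_base, page_url, client_timeout=60):
    """Request the page through Splash and describe what happened."""
    url = render_html_url(splash_base, page_url)
    try:
        with urllib.request.urlopen(url, timeout=client_timeout) as response:
            return "ok (HTTP %d)" % response.status
    except OSError as exc:
        # Covers connection errors, HTTP errors and socket timeouts.
        return "failed: %r" % (exc,)
```

Running `probe("http://localhost:8050", failing_url)` once per instance would give a quick check of which containers still answer, assuming each one is exposed on its own address.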