This repository has been archived by the owner on Mar 9, 2021. It is now read-only.

TumblThree sometimes stops crawling, but the application doesn't hang, without error. #390

Open
bland328 opened this issue Jan 24, 2019 · 4 comments

Comments

@bland328

bland328 commented Jan 24, 2019

TumblThree is wonderful! I'm having good luck with it, except when it simply stops crawling.

I've simplified my configuration all the way down to crawling only one blog at a time, and it is often successful, even when downloading tens of thousands of files.

Occasionally, however, it just stops, and the message in the queue is sometimes that it was Downloading and sometimes that it was Skipping a file.

My settings are Concurrent Connections = 90, Concurrent Video = 10, Concurrent blogs = 1, Timeout = 60, Scan connections = 4, Limiting API to 55 connections per 60 seconds.

In the Settings file, MaxNumberOfRetries = 1.

I'm downloading everything, saving metadata as JSON, 50 posts per page, Downloading specified size (1280x1080) even if it isn't offered, and Dumping crawler data.

My internet access is speedy and reliable, I have RAM to spare, and I'm saving to an SMB share, for what it's worth.

When TumblThree stops crawling, the application hasn't frozen; if I click Stop, it stops and cleans up duplicates.

Any thoughts on how I can troubleshoot this? Or a settings change that might help?

Edit 1: Also, it may be the case that this only happens near or at the completion of a blog; the last three times I ran into this, the Downloaded Files value was at least 99% of the Number of Downloads.

Edit 2: I just saw it happen at the 80% mark, so that goes against my previous edit. Also, I'll mention that clicking Pause and then Resume has no effect.

@bland328 bland328 changed the title TumblThree sometimes stops, but doesn't hang, without error. TumblThree sometimes stops crawling, but the application doesn't hang, without error. Jan 25, 2019
@apoapostolov

I confirm this issue happens for me as well. It happens with the same Tumblr blogs each time, not randomly.

@elipriaulx
Contributor

I can also confirm this is an issue for me at times. I haven't put much thought into troubleshooting it yet but maybe we could start working towards a troubleshooting method?

My immediate thoughts are to:

  1. Compile a list of blogs to download with varying types of content
  2. Install remote debugging tools
  3. Import list with TumblThree and crawl
  4. If stuff stops happening, break and investigate the state of the threads.
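To make step 4 a bit more concrete, here is a rough sketch of the idea in Python rather than TumblThree's own C#. It is not taken from the TumblThree code base: `StallDetector` and `dump_thread_stacks` are hypothetical names, and the progress value is assumed to be something like the Downloaded Files counter. The detector flags a stall when the counter stops changing for a given time, at which point the thread stacks can be dumped for inspection.

```python
import sys
import threading
import traceback


def dump_thread_stacks():
    """Return a dict mapping each live thread's name to its current stack trace."""
    names = {t.ident: t.name for t in threading.enumerate()}
    stacks = {}
    for ident, frame in sys._current_frames().items():
        name = names.get(ident, f"unknown-{ident}")
        stacks[name] = "".join(traceback.format_stack(frame))
    return stacks


class StallDetector:
    """Flags a stall when a progress counter stops changing for too long."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._last_value = None
        self._last_change = None

    def check(self, value, now):
        """Record the counter value seen at time `now`; return True if stalled."""
        if value != self._last_value:
            # Progress was made, so restart the stall timer.
            self._last_value = value
            self._last_change = now
            return False
        return (now - self._last_change) >= self.timeout_seconds
```

In use, something would call `check(downloaded_count, time.monotonic())` every few seconds and log the result of `dump_thread_stacks()` the first time it returns True. In .NET the equivalent would be breaking in with a debugger and looking at the threads and tasks windows.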

It would be great if we could narrow this down. Often when it happens to me, I can either restart TumblThree and continue downloading, or I am forced to delete all downloaded data entries for the problematic blog and start again. I don't know if these are two different issues.

I guess we need an understanding of what should be happening when stuff is downloading state-wise; this isn't something I have investigated much (yet). It might also help to add more logging first - maybe a straightforward exception is being thrown (yeah right 🙄) that could guide us towards a fix.

I probably won't make time to look at this myself until I have caught up on some logistics issues in moving to the new repo, but it's high on my priority list for after - it is bloody irritating. Maybe later in February.

@bland328
Author

bland328 commented Jan 31, 2019

A quick question for @johanneszab regarding my attempts to find the root cause of this problem (with apologies for not having studied the code to answer this on my own):

Am I correct that, when rescanning, TumblThree assumes any file listed in an Index\blogname_files.tumblr file to have been successfully downloaded, and that it does not check to see if the file actually exists locally?

Or do I misunderstand the purpose of the file? Thanks!
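For anyone who wants to test this from outside the application, here is a small Python sketch of the check I have in mind. I don't know the actual format of the index file, so this assumes the file names have already been extracted from it into a plain list; `find_missing` is a hypothetical helper, not something TumblThree provides.

```python
from pathlib import Path


def find_missing(indexed_names, download_dir):
    """Return the indexed file names that do not exist in download_dir."""
    folder = Path(download_dir)
    return sorted(name for name in indexed_names if not (folder / name).is_file())
```

If this returns a non-empty list for a blog that TumblThree reports as complete, that would suggest the index is trusted without checking the disk.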

@bland328
Author

bland328 commented Jan 31, 2019

@apoapostolov and @gpriaulx, do either of you have "Dump Crawler Data" turned on, so that TumblThree writes out .json files accompanying downloaded files?

I do have it turned on, and I ask because I just did a full rescan of just one blog with about 200,000 downloads, and very near the end I saw TumblThree stop on a "skipping" message, as I've seen before. It sat there for several minutes, as I've also seen before.

This time, though, I looked in Task Manager, and saw that TumblThree.exe was still actively accessing my target disk.

So, I quickly downloaded and fired up the Microsoft Process Monitor, spied on the TumblThree.exe file system activity, and saw it was re-writing what appeared to be all 200,000 .json files.

I waited, and about 30 minutes later, the blog properly completed and disappeared from the queue.

So, even though during the rescan only three new files were actually downloaded, TumblThree rewrote all the .json files. Unsurprisingly, writing 200,000 .json files takes a while.
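One way to avoid the rewrite, sketched in Python and not based on how TumblThree actually writes its metadata: compare the serialized JSON with what is already on disk and only write when it differs. `write_json_if_changed` is a hypothetical name.

```python
import json
from pathlib import Path


def write_json_if_changed(path, data):
    """Write data as JSON to path only if the content differs.

    Returns True if the file was written, False if it was left untouched.
    """
    path = Path(path)
    new_text = json.dumps(data, sort_keys=True, indent=2)
    if path.is_file() and path.read_text(encoding="utf-8") == new_text:
        return False
    path.write_text(new_text, encoding="utf-8")
    return True
```

This still reads every existing file, which is not free over an SMB share, so writing metadata only for posts that were newly downloaded in the current run might be the cheaper option if the metadata is not expected to change.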

Now I'm looking back and wondering how many times I stopped or killed the application in such a situation, thinking nothing was happening.

It would certainly be valuable if TumblThree would display a status message in this case.
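As an illustration of what I mean, here is a minimal Python sketch of periodic progress reporting during a long write phase. The names are hypothetical: `write_one` stands in for whatever writes a single .json file, and `report` for whatever updates the message shown in the queue.

```python
def write_with_progress(items, write_one, report, every=1000):
    """Call write_one on each item, reporting progress every `every` items."""
    total = len(items)
    for done, item in enumerate(items, start=1):
        write_one(item)
        # Report at regular intervals and once more when the last item is done.
        if done % every == 0 or done == total:
            report(f"Writing metadata: {done}/{total}")
```

Even a coarse message like this would make it clear that the application is still working rather than stuck.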
