Schedules Direct duplicate program requests #474
Comments
What's the expectation for how often these files would be created/updated? Will this feature only be enabled if specifically selected? I'm wondering about the impact on the life of an SSD if there are a significant number of writes.
How did you come across this? What's the current negative impact somebody can experience from this?
The files are relatively static. There are a lot of them, but they are very small and the churn should be very low. If that's a bigger concern than the disk space used, we could make cleanup only happen when files are older than 30 days.
@Narflex Bob from Schedules Direct is re-writing the backend in Python, and they have a mechanism that attempts to thwart DDoS attacks. This is one of the things it may temporarily block traffic from a specific IP for. It came up while I was testing several weeks ago, and I've been trying to find time to work on a solution.
OK, I see how that would be a problem then. So what are the specific cases we have right now where we are re-downloading the same data? This may be better solved by adding another field to some DB objects to track this instead.
I need to do a little more to find the worst offending lines of code, but the issues are around recommendations and getting series descriptions and images. I'll try to home in on what specifically we're gathering (it's been a while since I've looked at some of this code), but I still think the extra data would only slow down startup and eat up memory for content that doesn't need to be available at any time other than when we update the EPG.
I've boiled it down to two options because I don't want to leave this issue unchecked and I really do not have much personal time to write anything significant. Either we can use the caching I proposed, or I will just remove all the code that references programs outside of what is needed to fill the EPG. That may mean we get less series data or no series data, and that could result in fewer images to display in the UI.
What is this you are referring to? "I will just remove all the code that references programs outside of what is needed to fill the EPG."
I would remove API calls to the Schedules Direct programs API that are only an attempt to fill in missing series data. I know you're just trying to better understand my intent; I can assure you the end result would be usable, just with fewer images. The alternative is that I fix everything over the span of maybe a year and in the meantime accounts get blocked repeatedly.
I'm fine with you doing it the way that gets done sooner; just wanting to understand the implications more. Like the celebrity images, it seems like only trying to generate series info for ones that don't exist would greatly reduce the number of requests (see sagetv/java/sage/epg/sd/SDRipper.java, line 1764 in aff35f9).
That seems to indicate we try to update it for every show every time... so if you reduce that to only get ones that don't exist in the DB that should be fine; and maybe that is what you are actually suggesting. :)
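A minimal sketch of that idea, assuming some local lookup of which series already have info in the Wizard database; the class name, the knownSeriesIds set, and the idsToRequest helper are illustrative stand-ins, not the actual SageTV API:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative only: filter candidate series IDs down to the ones we do not
// already have locally before issuing any Schedules Direct requests.
public final class SeriesInfoFilter {
  /** Returns only the IDs that are missing from the local store. */
  public static List<String> idsToRequest(List<String> candidateIds,
                                          Set<String> knownSeriesIds) {
    return candidateIds.stream()
        .filter(id -> !knownSeriesIds.contains(id))
        .collect(Collectors.toList());
  }
}
```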
@enternoescape If you just want to go ahead with the caching solution you originally proposed I'm fine with that too. :)
@Narflex Sorry, I've been getting pulled a few directions over the past few weeks. I will try to get something submitted before July.
No problem at all... I really appreciate the help here, so whenever you get it done is fine. I just wanted you to know I'm not trying to push you towards any specific solution. :)
It has come to my attention that a SageTV client may get blocked temporarily for making multiple requests for the same program from Schedules Direct within a 24 hour period. While we compare MD5 hashes to make sure we are not re-downloading program data we already have, we do re-query programs directly from Schedules Direct for data we don't store in the Wizard database, such as series info that is not consistently available from Schedules Direct. We also do this when we are trying to generate recommendations.
I have written and am currently testing code that adds a folder named epgcache to the SageTV folder and exports the JSON for every program into files with the extension .program. When one of these files already exists and the MD5 hash is current, the file is used instead of querying Schedules Direct. Files in this folder older than 14 days will also be cleaned out regularly so disk space usage doesn't get out of hand.
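A rough sketch of how that cache could behave, based only on the description above; the class, the helper names, and the way freshness is checked (hashing the cached JSON and comparing it against the MD5 reported by Schedules Direct) are assumptions, not the actual patch:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public final class EpgCache {
  private final Path cacheDir;

  public EpgCache(Path sageTvHome) throws IOException {
    // "epgcache" folder inside the SageTV folder, created on first use.
    this.cacheDir = Files.createDirectories(sageTvHome.resolve("epgcache"));
  }

  /** Returns cached JSON when it still matches the current MD5; otherwise null so the caller queries Schedules Direct. */
  public String load(String programId, String currentMd5)
      throws IOException, NoSuchAlgorithmException {
    Path file = cacheDir.resolve(programId + ".program");
    if (!Files.exists(file)) return null;               // never cached
    byte[] bytes = Files.readAllBytes(file);
    // Assumption: the hash Schedules Direct reports can be compared against an
    // MD5 of the cached JSON; the real patch may track freshness differently.
    if (!md5Hex(bytes).equals(currentMd5)) return null;  // stale; re-download
    return new String(bytes, StandardCharsets.UTF_8);
  }

  /** Writes the program JSON so later runs can skip the remote request. */
  public void store(String programId, String json) throws IOException {
    Files.write(cacheDir.resolve(programId + ".program"),
        json.getBytes(StandardCharsets.UTF_8));
  }

  /** Deletes cache files not modified in the last 14 days. */
  public void cleanup() throws IOException {
    Instant cutoff = Instant.now().minus(14, ChronoUnit.DAYS);
    try (DirectoryStream<Path> files = Files.newDirectoryStream(cacheDir, "*.program")) {
      for (Path f : files) {
        if (Files.getLastModifiedTime(f).toInstant().isBefore(cutoff)) {
          Files.delete(f);
        }
      }
    }
  }

  private static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
    byte[] digest = MessageDigest.getInstance("MD5").digest(data);
    StringBuilder sb = new StringBuilder(digest.length * 2);
    for (byte b : digest) sb.append(String.format("%02x", b));
    return sb.toString();
  }
}
```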
So far it looks like, for a full cable lineup, you can expect 50,000-80,000 .program files consuming about 200-300MB of actual disk space (due to filesystem cluster sizes). I am open to suggestions if anyone finds this kind of disk usage outrageous, or if anyone would like to point me to something already in SageTV that would consolidate this data better. I don't think the data we're re-querying would be appropriately stored by expanding the Wizard database, as we're only using parts of it to attempt to fill in some blanks and generate recommendations.
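As a rough sanity check on that estimate, assuming a typical 4 KB filesystem cluster: 70,000 files × 4 KB ≈ 280 MB, which lines up with the 200-300MB range above even though the JSON in each file is much smaller than a cluster.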