
Storing data in a database #98

Open
ExpDev07 opened this issue Mar 19, 2020 · 15 comments
Labels
enhancement (New feature or request), performance (Issue related to performance and optimizations)

Comments

@ExpDev07
Owner

Right now the data is just stored in a cache. Would it perhaps be better to sync the data to an actual MySQL database? It would allow for fast querying.

@ExpDev07 ExpDev07 added the enhancement (New feature or request) label Mar 19, 2020
@Kilo59
Collaborator

Kilo59 commented Mar 21, 2020

From what I have seen you guys talking about on other issues, it seems like the format of the data is not totally stable.
If that's the case, wouldn't it be easier to use a NoSQL database than an RDBMS?

@ExpDev07
Owner Author

Yes, that would probably be better. MongoDB or the like.

@Kilo59
Collaborator

Kilo59 commented Mar 22, 2020

Happy to help with this as well.

@focus1691

This would be good too, because the API is currently broken for me and others.

@ExpDev07
Owner Author

@traderjosh can you explain in more detail how it’s broken for you? JHU (our data provider) made some pretty drastic changes lately which have caused the API’s outputs to change (notably the ID indexing, and provinces no longer being present for the USA).

@focus1691

@ExpDev07 it works now; it was the CORS header field missing from the API. The recovered field seems to be missing now, though. Is that gone forever? And you're right, some countries don't have an ID.

I think a database would help here in case of upstream website changes. MongoDB would be a good fit because the data can change unexpectedly.

@ExpDev07
Owner Author

ExpDev07 commented Mar 25, 2020

For JHU, yes, the recovery stats are gone forever unless they decide to bring them back. I’m going to see if I can find some other reputable sources that offer them and add them to the API.

I believe their reasoning was that no reputable sources were providing accurate recovery numbers, so they just decided to remove it.

@focus1691

Ok, that's not an issue. Do you want me to help set up MongoDB? I can write the boilerplate and you can integrate an account for it.

@ExpDev07
Owner Author

It would be awesome if you can start drafting a PR for it. It needs to be compatible with our service provider system (see “app.services.locations” module). But I think MongoDB will be perfect for it. I’m thinking we periodically sync the DB with data retrieved from the data sources.
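The periodic-sync idea could look roughly like the sketch below. It upserts each location keyed by its source ID, so repeated syncs update documents in place instead of duplicating them. The `FakeCollection` stand-in, the document shape, and the function names are my assumptions for illustration, not the project's actual service-provider API; with real pymongo you would pass a `MongoClient` collection instead.

```python
# Hypothetical sketch of "periodically sync the DB with data retrieved
# from the data sources". Names and document shape are assumptions.
import time
from typing import Iterable

class FakeCollection:
    """Tiny in-memory stand-in exposing pymongo's replace_one(upsert=True)."""
    def __init__(self):
        self.docs = {}

    def replace_one(self, filter, replacement, upsert=False):
        key = filter["_id"]
        if key in self.docs or upsert:
            self.docs[key] = replacement

def sync_locations(collection, locations: Iterable[dict]) -> None:
    # Upsert each location keyed by its source-specific id, so a re-sync
    # overwrites the old document rather than appending a new one.
    for loc in locations:
        doc = {**loc, "synced_at": time.time()}
        collection.replace_one({"_id": loc["id"]}, doc, upsert=True)

# With real pymongo this would be e.g.
#   collection = MongoClient(...).coronavirus.jhu
jhu = FakeCollection()
sync_locations(jhu, [{"id": 1, "country": "Norway", "confirmed": 4641}])
sync_locations(jhu, [{"id": 1, "country": "Norway", "confirmed": 4863}])
print(jhu.docs[1]["confirmed"])  # latest value wins: 4863
```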

@focus1691

And it should be in Python? Not my speciality, but I could do some research.

@ExpDev07
Owner Author

Yeah, feel like that would be best.

@Kilo59
Collaborator

Kilo59 commented Mar 25, 2020

And it should be in Python? Not my speciality but I could do some research.

https://api.mongodb.com/python/current/

@Kilo59 Kilo59 added this to the Performance Optimizations milestone Mar 26, 2020
@Kilo59 Kilo59 added the performance (Issue related to performance and optimizations) label Mar 29, 2020
@Kilo59
Collaborator

Kilo59 commented Mar 29, 2020

Perhaps we should use Mongo to store and update the normalized data?
We can keep the data in a format that is easy to translate into our various responses.

  1. Collections for each source
  2. Documents for the countries/locations
  3. Background tasks to refresh the sources according to how frequently they are each updated.
  4. Continue to use caching to minimize database reads.
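Points 3 and 4 above could be sketched as a per-source cache that only re-fetches when its copy is older than that source's refresh interval. The source names, intervals, and class shape here are illustrative assumptions, not the project's implementation.

```python
# Sketch of "refresh each source on its own schedule, cache in between".
# Intervals and source names are assumptions for illustration.
import time

REFRESH_INTERVALS = {"jhu": 3600, "csbs": 1800}  # seconds, assumed

class SourceCache:
    def __init__(self, name, fetch, interval):
        self.name, self.fetch, self.interval = name, fetch, interval
        self.data, self.fetched_at = None, None

    def get(self):
        # Re-fetch only when there is no cached copy, or it has expired.
        stale = (self.fetched_at is None
                 or time.monotonic() - self.fetched_at > self.interval)
        if stale:
            self.data = self.fetch()
            self.fetched_at = time.monotonic()
        return self.data

calls = {"n": 0}
def fetch_jhu():
    calls["n"] += 1  # pretend this hits the upstream source
    return [{"country": "US", "confirmed": 1}]

jhu = SourceCache("jhu", fetch_jhu, REFRESH_INTERVALS["jhu"])
jhu.get()
jhu.get()
print(calls["n"])  # fetched once; the second read is served from cache
```

In a real deployment the refresh would run as a background task (e.g. a scheduler) rather than lazily on read, but the staleness check is the same.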

@cyenyxe

cyenyxe commented Mar 31, 2020

From what I have seen in the code so far, both the JHU and CSBS location models derive from the Location class, with CSBS having some additional fields. This kind of inheritance relationship should be easy to represent (and query) in an RDBMS.

Splitting the data into multiple collections in Mongo wouldn't really add much value unless you want to support historical records in multiple formats, which is horrible to query anyway if they aren't backwards compatible.
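For what it's worth, the inheritance relationship described above maps onto a classic single-table layout: one `locations` table with a discriminator column and nullable columns for the CSBS-only fields. The column names below are guesses based on this discussion, not the project's actual schema; sqlite stands in for whatever RDBMS would be used.

```python
# Illustrative single-table-inheritance layout for Location / CSBS.
# Column names are assumptions; sqlite3 stands in for the real RDBMS.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE locations (
        id        INTEGER PRIMARY KEY,
        source    TEXT NOT NULL,   -- discriminator: 'jhu' or 'csbs'
        country   TEXT NOT NULL,
        confirmed INTEGER,
        county    TEXT,            -- CSBS-only; NULL for JHU rows
        state     TEXT             -- CSBS-only; NULL for JHU rows
    )
""")
db.execute("INSERT INTO locations VALUES (1, 'jhu',  'US', 100, NULL,  NULL)")
db.execute("INSERT INTO locations VALUES (2, 'csbs', 'US', 40,  'King', 'WA')")

# Querying across both subtypes is a plain WHERE clause, no joins needed.
rows = db.execute(
    "SELECT id, source FROM locations WHERE country = 'US' ORDER BY id"
).fetchall()
print(rows)  # [(1, 'jhu'), (2, 'csbs')]
```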

@Kilo59
Collaborator

Kilo59 commented Apr 25, 2020

We are deployed with gunicorn, which runs multiple worker processes (4), each with its own cache.
@cyenyxe once we are storing the data in any database (RDBMS, Mongo, Redis, etc.), the workers can use it like a shared cache. Then they don't have to independently rebuild their own separate caches every hour.
It also adds resiliency when one of the dependent services times out or encounters some kind of error, which is often the problem that causes this API to go down.
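The shared-cache idea above can be sketched as a cache-aside pattern: every worker checks one shared store before hitting the upstream source, so the source is fetched once instead of once per worker. The dict here stands in for Redis or a database; all names are illustrative.

```python
# Sketch of a shared cache across gunicorn workers (cache-aside).
# `shared_store` stands in for Redis/Mongo; names are illustrative.
shared_store = {}
upstream_calls = {"n": 0}

def fetch_from_source():
    upstream_calls["n"] += 1  # pretend this calls JHU/CSBS
    return {"confirmed": 123}

def get_data():
    # Cache-aside: read the shared store first, fall back to the source
    # and populate the store so the other workers can reuse the result.
    if "latest" not in shared_store:
        shared_store["latest"] = fetch_from_source()
    return shared_store["latest"]

for _ in range(4):  # four gunicorn workers handling requests
    get_data()
print(upstream_calls["n"])  # one upstream fetch serves all four workers
```

With a real shared store this also gives the resiliency mentioned above: if the upstream source times out, workers can keep serving the last stored copy.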

Related reading

https://realpython.com/python-memcache-efficient-caching/
https://redis.io/topics/lru-cache

Created a new issue for using a shared cache #304

4 participants