Add async worker for populating DB #45

Open
rashadg1030 opened this issue Jun 16, 2019 · 5 comments · May be fixed by #132

@rashadg1030
Collaborator

Is now a good time to work on the async worker that will populate our DB, once we have a better idea of the data we need? I've implemented a server endpoint that touches the DB, so I think this would be a good next step. One more thing, about file structure: the GitHub query functions currently live in src/IW/Server/Search.hs. Is it fine to keep that file in the Server folder, or should it move to another folder called Async or Worker, for example src/IW/Worker/Search.hs? I'm not sure, because technically Search.hs does run on the server, but I think files in src/IW/Server should be related to the issue-wanted API.

@rashadg1030 rashadg1030 self-assigned this Jun 16, 2019
@chshersh
Contributor

@rashadg1030 Let's have an IW/Sync directory and corresponding IW.Sync.* modules for the worker. Async and Worker seem too generic. Sync is for synchronisation, which is what it should do.

Regarding implementing the worker, I have some suggestions:

Implement a separate function that fetches data for a single repo or for a single issue. That way we can test such functions from GHCi and check the data in the SQL tables, and it's easy to call them for all fetched repos.
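
A minimal sketch of that suggestion. All names here (RepoName, Issue, IssueTable, fetchRepoIssues, upsertIssue, syncRepo, syncRepos) are hypothetical, the fetch returns canned data instead of calling GitHub, and an IORef stands in for the SQL table; none of this is taken from the issue-wanted code base.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Hypothetical types; the real schema is not shown in this thread.
newtype RepoName = RepoName String
    deriving (Eq, Show)

data Issue = Issue
    { issueRepo   :: RepoName
    , issueNumber :: Int
    , issueTitle  :: String
    } deriving (Eq, Show)

-- In-memory stand-in for the SQL issues table (assumption).
type IssueTable = IORef [Issue]

-- Stand-in for a GitHub query: returns canned data, no network access.
fetchRepoIssues :: RepoName -> IO [Issue]
fetchRepoIssues repo = pure
    [ Issue repo 1 "Add docs"
    , Issue repo 2 "Fix build"
    ]

-- Insert the issue, replacing any row with the same (repo, number) key.
upsertIssue :: IssueTable -> Issue -> IO ()
upsertIssue table new = modifyIORef' table $ \rows ->
    new : filter (not . sameKey new) rows
  where
    sameKey a b = issueRepo a == issueRepo b && issueNumber a == issueNumber b

-- Sync a single repo: small enough to call from GHCi and then inspect the table.
syncRepo :: IssueTable -> RepoName -> IO ()
syncRepo table repo = fetchRepoIssues repo >>= mapM_ (upsertIssue table)

-- Syncing everything is the single-repo function mapped over all repos.
syncRepos :: IssueTable -> [RepoName] -> IO ()
syncRepos table = mapM_ (syncRepo table)

main :: IO ()
main = do
    table <- newIORef []
    syncRepos table [RepoName "owner/repo-a", RepoName "owner/repo-b"]
    -- Re-running the sync for one repo replaces its rows instead of duplicating them.
    syncRepo table (RepoName "owner/repo-b")
    rows <- readIORef table
    print (length rows)
```

From GHCi, one could call syncRepo on a single repo and read the table back before wiring it into the loop over all repos.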

rashadg1030 added a commit that referenced this issue Jun 18, 2019
rashadg1030 added a commit that referenced this issue Jun 18, 2019
rashadg1030 added a commit that referenced this issue Jun 18, 2019
rashadg1030 added a commit that referenced this issue Jun 19, 2019
rashadg1030 added a commit that referenced this issue Jun 21, 2019
rashadg1030 added a commit that referenced this issue Jun 22, 2019
rashadg1030 added a commit that referenced this issue Jun 23, 2019
rashadg1030 added a commit that referenced this issue Jun 24, 2019
@chshersh chshersh added the github Synchronization with GitHub, parsing content from GitHub label Jul 7, 2019
@chshersh
Contributor

chshersh commented Jul 8, 2019

@rashadg1030 I'm going to describe a high-level overview of the sync algorithm.

Main function should look like this:

syncCache = forever $ syncWithGitHub `catch` \...

So we try to sync the cache in an infinite loop, handling all errors so that a failed pass does not stop the worker.
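
One way the loop could be sketched, using only base. The handler body is elided above, so the logging handler, the syncOnce helper, and the delay between passes are assumptions for illustration, and syncWithGitHub is a placeholder that always fails so the error path is visible.

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (SomeException, catch, displayException)
import Control.Monad (forever, replicateM_)
import System.IO (hPutStrLn, stderr)

-- Placeholder for the real sync pass; it always fails here (assumption).
syncWithGitHub :: IO ()
syncWithGitHub = ioError (userError "GitHub is unreachable")

-- One guarded pass: log the failure instead of letting it end the loop.
-- Note: catching SomeException also catches asynchronous exceptions,
-- so a real worker may want a narrower handler.
syncOnce :: IO () -> IO ()
syncOnce action = action `catch` \e ->
    hPutStrLn stderr ("sync failed: " <> displayException (e :: SomeException))

-- The infinite loop; the pause between passes (in microseconds) is an assumption.
syncCache :: Int -> IO a
syncCache delayMicros = forever $ do
    syncOnce syncWithGitHub
    threadDelay delayMicros

main :: IO ()
main = do
    -- Run a few guarded passes rather than the endless loop, so the demo terminates.
    replicateM_ 3 (syncOnce syncWithGitHub)
    putStrLn "loop survived 3 failing passes"
```

Keeping the guarded pass separate from the loop makes it possible to test the error handling without running forever.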

syncWithGitHub function looks like this:

syncWithGitHub = do
    repos <- fetchAllRepos
    upsertRepos repos
    populateCategoriesAsync <- async $ populateCategories repos
    issues <- fetchAllIssues
    upsertIssues issues
    wait populateCategoriesAsync

The idea is the following:

  1. First, we fetch all the repositories and upsert them.
  2. Second, we populate categories for repositories in a separate thread.
  3. While we are populating categories, we can fetch and upsert all issues.
  4. Finally, we wait until all categories have been populated.
Does this make sense for now?

@rashadg1030
Collaborator Author

@chshersh This is perfect. Thank you!

@chshersh chshersh added this to the Sync milestone Jul 10, 2019
@rashadg1030
Collaborator Author

@chshersh Now that everything is in place, I'm gonna implement the sync algorithm for issues and then close this issue. There will be a single function that syncs both repos and issues.

@rashadg1030
Collaborator Author

@chshersh
I'm gonna use the algorithm you mentioned above now:

syncWithGitHub = do
    repos <- fetchAllRepos
    upsertRepos repos
    populateCategoriesAsync <- async $ populateCategories repos
    issues <- fetchAllIssues
    upsertIssues issues
    wait populateCategoriesAsync

rashadg1030 added a commit that referenced this issue Aug 8, 2019
@rashadg1030 rashadg1030 linked a pull request Aug 8, 2019 that will close this issue
rashadg1030 added a commit that referenced this issue Aug 8, 2019
rashadg1030 added a commit that referenced this issue Aug 17, 2019
rashadg1030 added a commit that referenced this issue Aug 17, 2019
rashadg1030 added a commit that referenced this issue Aug 17, 2019
Resolves #45

Fix after review

[#45] Add syncWithGithub function

Resolves #45

Fix after review

[#45] Add syncCache function

Resolves #45

Fix after review
rashadg1030 added a commit that referenced this issue Aug 18, 2019
rashadg1030 added a commit that referenced this issue Aug 24, 2019
rashadg1030 added a commit that referenced this issue Aug 24, 2019
Resolves #45
rashadg1030 added a commit that referenced this issue Aug 24, 2019