Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto-download files from the staging directory to output #500

Draft
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

andrii-i
Copy link
Collaborator

@andrii-i andrii-i commented Mar 14, 2024

Adds auto-download functionality.

For packaging input folder, this PR was superseded by #510.

@andrii-i andrii-i added the enhancement New feature or request label Mar 14, 2024
@andrii-i andrii-i force-pushed the package-input-folder branch from c75a717 to eb8e18f Compare March 14, 2024 21:34
@bhadrip
Copy link

bhadrip commented Mar 20, 2024

Can we support the following use case:

jupyter-root-dir
  dep-dir
     init.py
  project1-dir
     analysis.ipynb (depends on dep-dir/init.py)
  project2-dir
     report-generation.ipynb (depends on dep-dir/init.py)

2 separate notebook jobs could depend on the same python file:
Job1 - analysis.ipynb
Job2 - report-generation.ipynb

With the current approach the customer has to copy the dep-dir to every project folder.

@andrii-i andrii-i force-pushed the package-input-folder branch from a28c881 to 383a37e Compare March 27, 2024 21:16
@andrii-i andrii-i force-pushed the package-input-folder branch 4 times, most recently from cfc4886 to 7bbed76 Compare April 10, 2024 18:03
@andrii-i andrii-i force-pushed the package-input-folder branch 4 times, most recently from bfb68dd to 92b9088 Compare April 19, 2024 00:11
@JasonWeill
Copy link
Collaborator

Upon first view, the term "Package input folder" doesn't mean anything to me. I don't know why, as a user, I would want to do that, or whether I would need to. This change also doesn't modify the documentation at all.

I would recommend:

  1. Modifying the documentation to describe what "package input folder" does
  2. Adding some kind of "info" ℹ️ button or expander near the new checkbox to explain what the option does, briefly

@andrii-i andrii-i force-pushed the package-input-folder branch from e868446 to ee0fdb8 Compare April 23, 2024 05:27
@andrii-i
Copy link
Collaborator Author

Kicking CI

@andrii-i andrii-i closed this Apr 23, 2024
@andrii-i andrii-i reopened this Apr 23, 2024
@andrii-i andrii-i closed this Apr 23, 2024
@andrii-i andrii-i reopened this Apr 23, 2024
Copy link
Collaborator

@dlqqq dlqqq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Andrii! I left a few comments regarding the auto-download capability being introduced in this branch, but didn't complete a full code review.

My main concern with this PR is that 1) the new default auto-download capability being proposed here is a potential breaking change, and 2) the changes to the download logic are not required to implement the 'include parent dir' capability, which was the original intent of this PR.

Let's explore the possibility opening another PR with just the 'include parent dir' capability such that it can be reviewed, merged, and released separately.

session.commit()


class DownloadManager:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we merge this class with DownloadRecordManager above? I don't see the benefit of splitting the logic here into two separate classes if they are only used together anyways.

Comment on lines +73 to +81
# Forces new processes to not be forked on Linux.
# This is necessary because `asyncio.get_event_loop()` is bugged in
# forked processes in Python versions below 3.12. This method is
# called by `jupyter_core` by `nbconvert` in the default executor.

# See: https://github.com/python/cpython/issues/66285
# See also: https://github.com/jupyter/jupyter_core/pull/362
multiprocessing.set_start_method("spawn", force=True)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with this is that this line affects the multiprocessing behavior globally for everything running on this main thread, i.e. the server and all server extensions running on a JupyterLab instance. We should really avoid doing this just to pass our GitHub workflows. Consider what happens if:

  • Another extension is also calling multiprocessing.set_start_method(..., force=True), or
  • Some part of the server / server extension breaks when the start method is changed during its lifetime by our extension's initialize_settings() method.

I don't have a solution for how this bug can be fixed. However, the error message is pretty specific about why an exception is being raised, so my intuition is that this bug can be fixed. I'm leaving some references here for us to review in the future.

Comment on lines +44 to +50
async def process_download_queue(self):
while not self.download_manager.queue.empty():
download = self.download_manager.queue.get()
download_record = self.download_manager.record_manager.get(download.download_id)
if not download_record:
continue
await self.job_files_manager.copy_from_staging(download.job_id, download.redownload)
self.download_manager.delete_download(download.download_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can avoid using multiprocessing.Queue if we are already writing pending downloads to a DB. Can we read directly from self.download_manager.record_manager instead?

I believe that this may fix the process bug previously raised on the E2E tests in this branch. This is the corresponding error message:

RuntimeError: A SemLock created in a fork context is being shared with a process in a spawn context. This is not supported. Please use the same context to create multiprocessing objects and Process.

If we remove the need for multiprocessing objects, we may be able to fix this bug without relying on multiprocessing.set_start_method(). Can you give this a try?

@sravyasdh
Copy link

Can we support the following use case:

jupyter-root-dir
  dep-dir
     init.py
  project1-dir
     analysis.ipynb (depends on dep-dir/init.py)
  project2-dir
     report-generation.ipynb (depends on dep-dir/init.py)

2 separate notebook jobs could depend on the same python file: Job1 - analysis.ipynb Job2 - report-generation.ipynb

With the current approach the customer has to copy the dep-dir to every project folder.

@andrii-i can you confirm if this usecase can be supported? Can we support changing the input folder path on the UI? Rest of this lgtm

@andrii-i
Copy link
Collaborator Author

@sravyasdh thank you for looking into this PR. In the initial version only packaging the input file's folder via checkbox control will be supported so use case in question will not be supported. Reason is that it's not clear if users would prefer more complex UI/UX that would allow packaging of the folder one level above or packaging of the arbitrary folder or ability to input folder path on the UI. So it makes sense to start with simpler implementation, let users try it and get feedback from them, and update the implementation based on feedback if needed.

@andrii-i andrii-i mentioned this pull request Apr 24, 2024
@andrii-i andrii-i marked this pull request as draft April 24, 2024 19:01
@andrii-i andrii-i changed the title Add an option to package input folder Add an option to package input folder, auto-download files from the staging directory to output Apr 24, 2024
@andrii-i andrii-i changed the title Add an option to package input folder, auto-download files from the staging directory to output Add an option to package input folder; auto-download files from the staging directory to output Apr 24, 2024
@andrii-i
Copy link
Collaborator Author

For package input folder, superseded by #510.
For auto-download, either this PR will be reopened or a new PR would be created based on relevant commits.

@andrii-i andrii-i closed this Apr 25, 2024
@andrii-i andrii-i reopened this May 8, 2024
@andrii-i andrii-i force-pushed the package-input-folder branch from 166ed05 to 9d41103 Compare May 9, 2024 22:04
@andrii-i andrii-i changed the title Add an option to package input folder; auto-download files from the staging directory to output Auto-download files from the staging directory to output May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants