Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

digits Slurm #1435

Open
wants to merge 106 commits into
base: master
Choose a base branch
from
Open

digits Slurm #1435

wants to merge 106 commits into from

Conversation

z1mmz
Copy link

@z1mmz z1mmz commented Feb 5, 2017

Updated nvidia digits to work with slurm workload manager should be easy to extend to other workload managers if someone has a system to develop with.

@z1mmz z1mmz changed the title Dev digits Slurm Feb 5, 2017
@lukeyeager
Copy link
Member

(Related to #829, #1353)

@lukeyeager
Copy link
Member

Wow, that's a lot of code. Do you have any documentation for how to use this? Any suggestions for how to review 20K lines of code in 100 commits?

@z1mmz
Copy link
Author

z1mmz commented Feb 13, 2017

Working on getting the documentation ready.

Most of the changes revolve around the task.py file and the sub classes of task.
I have made some new files to handle transforming the task arguments to slurm(set up to be so that it should be able to be extended). These files are within digits/extensions/cluster_management.

The workload managers are detected by the cluster factory - slurm is detected by checking if slurm home is set. The workload manager can then be selected via the gui under the info tab. When this is selected the selected workload manager is stored in the cluster_factory.

In the task class the run method has been edited to check if task should be run via a workload manager or just normally. If a workload manager is selected the cluster_manager will be called and then the arguments will be edited by the selected cluster_manager. Also there are some changes to make sure no tasks have missing variables for slurm.

There are other changes that are to capture the variables for slurm and changes to the gui to take the commands.

Unfortunately I no longer have access to a slurm cluster to continue work on this so making any changes to my current fork would not be able to be tested by me. As of my final commit all tests you have on travis where passing using slurm.

Feel free ask any questions.

@lukeyeager
Copy link
Member

As of my final commit all tests you have on travis where passing using slurm.

That's pretty impressive! We'll try to make some time to review this.

@inJeans
Copy link

inJeans commented Mar 3, 2017

Great to see that you guys are going to take the time to review this. Can't wait to see it integrated into the main branch 👍🏽

@z1mmz
Copy link
Author

z1mmz commented Apr 26, 2017

I have found the cause of the huge number of lines changed, I have fixed this and synced my fork. The number of lines changed is now 2,000 this is still inflated due to some line ending issues which made git think I rewrote a few files. I will run our tests again to make sure it is still passing after the sync

@z1mmz
Copy link
Author

z1mmz commented May 3, 2017

After the sync all tests are still passing in our slurm environment, travis is only failing dist and lint (from some files in examples) all other tests seem to be passing fine

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants