
Submit the SoS job submitter to a compute node #5

Open · gaow opened this issue Oct 2, 2020 · 2 comments
gaow commented Oct 2, 2020

Currently, DSC runs in two modes:

  1. Local mode: dsc ...
  2. Cluster mode: dsc ... --host, where the --host option loads a host configuration file so that an SoS job submitter keeps running in the background (on a cluster's login node, for example) and computation jobs are submitted to the cluster's compute nodes.
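
Concretely, the two invocations look like this (the benchmark and configuration filenames are placeholders):

dsc my_benchmark.dsc                    # 1. local mode
dsc my_benchmark.dsc --host config.yml  # 2. cluster mode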

The obvious problem with mode 2 is that running the SoS job submitter in the background can be somewhat resource intensive, which is not welcome on a cluster's login node. So to run a DSC job on the cluster, one currently has to do something like this:

#!/bin/bash
#SBATCH --time=36:00:00
#SBATCH --partition=broadwl
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2000
#SBATCH --job-name=dsc-submitter
#SBATCH --output=dsc-log-%J.out
#SBATCH --error=dsc-log-%J.err

# Run the main DSC
dsc fixed_mix.dsc --host dsc_mnm.yml -o mnm_20200510 -s existing -e ignore &> mnm_20200510.log
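
Such a wrapper is then submitted from the login node with sbatch (assuming the script above is saved as dsc-submitter.sbatch; the filename is illustrative):

sbatch dsc-submitter.sbatch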

This is a bit tedious. We'd like an interface and a mechanism to submit such a job to a compute node, which then submits jobs to the cluster. Perhaps it should be done on the SoS end? I'm going to open a ticket in the SoS repo.

gaow commented Oct 2, 2020

A possible interface is to change the default section of the host submitter file,

https://github.com/cumc/dsc/blob/master/vignettes/one_sample_location/midway.yml

adding to this section

default:
  queue: midway2
  instances_per_job: 40
  nodes_per_job: 1
  instances_per_node: 4
  cpus_per_instance: 1
  mem_per_instance: 2G
  time_per_instance: 3m

these extra lines:

  submitter_mem: 6G
  submitter_walltime: 36h
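
Putting the two together, the whole default section would read (submitter_mem and submitter_walltime are the proposed new keys; everything else is unchanged from the vignette):

default:
  queue: midway2
  instances_per_job: 40
  nodes_per_job: 1
  instances_per_node: 4
  cpus_per_instance: 1
  mem_per_instance: 2G
  time_per_instance: 3m
  submitter_mem: 6G
  submitter_walltime: 36h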

Then, with dsc ... -o output_name --host config.yml -c 8, where config.yml contains those extra lines, DSC will create a job script based on the corresponding template (in this case the midway2 template, because of queue: midway2) and submit a job like:

#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=6G
#SBATCH --time=36:00:00
#SBATCH --output={cur_dir}/{output_name}.out
#SBATCH --error={cur_dir}/{output_name}.err
sos run ... -j 8

to execute the DSC by submitting SoS jobs from a compute node. As you can see, the two 8s come from DSC's -c 8, the 6G comes from the new submitter_mem: 6G line, and the walltime for this submitter is the 36h from submitter_walltime. output_name is the DSC benchmark output folder name, but here we also use it for the standard output and error filenames.

This solution can be implemented in the DSC code without requesting a new feature from SoS.

And we can allow something like

  submitter_mem: None
  submitter_walltime: None

to say that we want to submit jobs from wherever we execute the command, rather than first submitting the submitter itself to a node that then submits all the jobs; this is the current behavior anyway.

BoPeng commented Dec 24, 2020

It is never a good idea to have long-lasting processes running on the head node, even just for job submission with controlled RAM and CPU usage. vatlab/sos#1407 now works (check the last few posts there for a sample configuration); let me know if it works for your cluster.

Note that -r cluster -q cluster would be needed even if you are on the cluster, because -r cluster, if configured as a PBS queue, is needed to submit the entire workflow to the cluster (so that it does not run on the head node). Then -q cluster is used for submitting tasks from the compute node.
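
For example, assuming a host named cluster is defined in your SoS host configuration (the workflow filename is a placeholder):

# run the workflow itself on the cluster (-r) and submit its tasks to the same queue (-q)
sos run my_workflow.sos -r cluster -q cluster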

sos status etc. can be used to check the status of workflows (workflow IDs start with w; task IDs all start with t, or t#t for master tasks).
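
For instance, to list the status of tasks on the queue (assuming the queue is named cluster in the host configuration):

sos status -q cluster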
