Submit tasks to the Slurm queue scheduling system #28

Open
1 of 2 tasks
perillaroc opened this issue May 16, 2022 · 2 comments

perillaroc commented May 16, 2022

To submit Slurm jobs, write TAKLER_JOB_CMD in one of two ways:

  • Run the sbatch command directly
  • Write a submission script and run the sbatch command inside it

Option 1: run directly

Write sbatch and scancel into the TAKLER_SHELL_JOB_CMD and TAKLER_SHELL_KILL_CMD variables.

from takler.core import Node  # assumed import path; adjust to your takler version

# Command templates; the {{ ... }} placeholders are expected to be filled in by takler.
job_cmd = "sbatch {{ TAKLER_JOB }}"
kill_cmd = "scancel {{ TAKLER_RID }}"


def slurm_serial_job(node: Node, partition: str = "serial"):
    """Configure a node to run as a serial Slurm job."""
    node.add_parameter("TAKLER_SHELL_JOB_CMD", job_cmd)
    node.add_parameter("TAKLER_SHELL_KILL_CMD", kill_cmd)
    node.add_parameter("PARTITION", partition)


def slurm_para_job(node: Node, nodes: int, tasks_per_node: int = 32, partition: str = "serial"):
    """Configure a node to run as a parallel Slurm job."""
    node.add_parameter("TAKLER_SHELL_JOB_CMD", job_cmd)
    node.add_parameter("TAKLER_SHELL_KILL_CMD", kill_cmd)
    node.add_parameter("PARTITION", partition)
    node.add_parameter("NODES", nodes)
    node.add_parameter("TASKS_PER_NODE", tasks_per_node)
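
The PARTITION, NODES and TASKS_PER_NODE parameters set above only take effect once the job script passes them on to Slurm. As a rough sketch of one way that could look, the hypothetical helper below (sbatch_header is not part of takler) turns a plain dict of parameter values into #SBATCH directive lines. The mapping to --partition, --nodes and --ntasks-per-node is an assumption about how these parameters are meant to be used.

```python
def sbatch_header(params: dict) -> str:
    """Build #SBATCH directive lines from a dict of node parameters.

    Hypothetical helper for illustration; parameter names follow the
    functions above, and the option mapping is an assumption.
    """
    option_names = {
        "PARTITION": "--partition",
        "NODES": "--nodes",
        "TASKS_PER_NODE": "--ntasks-per-node",
    }
    # Emit a directive only for parameters that are actually set.
    lines = [
        f"#SBATCH {option}={params[name]}"
        for name, option in option_names.items()
        if name in params
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(sbatch_header({"PARTITION": "normal", "NODES": 2, "TASKS_PER_NODE": 32}))
```

A serial job that sets only PARTITION would get a single directive line, so the same helper could serve both functions.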


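Option 2: write a utility script

Following slsubmit6 and slcancel4 from the operational system, write a submission utility script, so that when a submission fails the task node can be set to aborted and the failure is recorded in the job log.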
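
A minimal sketch of what such a submission wrapper might do, assuming Python as the scripting language. This is not the slsubmit6 implementation. The names submit_job, log_file and on_failure are hypothetical, and on_failure stands in for whatever mechanism sets the task node to aborted. The run argument exists only so the function can be exercised without a real Slurm installation.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional


def submit_job(
    job_script: str,
    log_file: str,
    on_failure: Callable[[str], None],
    run: Callable = subprocess.run,
) -> Optional[str]:
    """Submit job_script with sbatch; return the Slurm job id, or None on failure."""
    try:
        result = run(
            ["sbatch", "--parsable", job_script],
            capture_output=True,
            text=True,
        )
        ok = result.returncode == 0
        message = result.stderr.strip()
    except OSError as error:  # for example, sbatch is not installed on this host
        ok = False
        message = str(error)

    if ok:
        # With --parsable, sbatch prints "jobid" or "jobid;cluster".
        return result.stdout.strip().split(";")[0]

    # Record the failure in the job log, then let the caller abort the node.
    with Path(log_file).open("a", encoding="utf-8") as log:
        log.write(f"[submit failed] sbatch {job_script}: {message}\n")
    on_failure(message)
    return None
```

Keeping the abort step as a callback leaves the wrapper independent of how the node status is actually reported back.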

Projects
Status: In Progress