
Ability to define and use Global Constants #532

Open
markjschreiber opened this issue Jan 27, 2023 · 16 comments

Comments

@markjschreiber
Contributor

A mechanism to define a global constant that can be used in a workflow block, task block, or task command block, even if the constant isn't made part of the task's inputs.

An example use case:

A debug bash command snippet that I want to add to my tasks: declare something at the top level like String cmd = "foo baa -v" and then, in each task, have something like:

command {
   ~{cmd}
   # rest of bash goes here
}

I can do this if I define String cmd in each task, or make String cmd an input of each task, but that makes removing the debug command later more cumbersome.
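For illustration, here is roughly what that workaround looks like today - a minimal sketch in standard WDL 1.1, with a hypothetical task name, where every task must redeclare (or receive) the snippet:

version 1.1

task my_task {
  input {
    # must be repeated in every task, or wired through every call
    String cmd = "foo baa -v"
  }

  command <<<
    ~{cmd}
    # rest of bash goes here
  >>>
}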

Pat Magee has suggested the following pattern for declaring them:

version 1.x

constants {
  String cmd = "foo"
  File foo = "s3://some/file"
}

There are also use cases for global variables (that are not constants), although those could be added later.
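To make the intent concrete, here is a sketch of how a task might consume such a constant - purely hypothetical, since no released WDL version has a constants block:

version 1.x

constants {
  String cmd = "foo baa -v"
}

task my_task {
  command <<<
    # resolved from the global constants block, not from this task's inputs
    ~{cmd}
  >>>
}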

@rhpvorderman
Contributor

I am not very fond of global variables/constants. They lead to a lot of "Where the **** does that variable come from?!?!" shouting at computer screens and endless grep hunting to actually find out.

For your actual use case it seems much better to do this at the execution engine level. Just create a program myshell.sh:

#!/usr/bin/env bash
set -eu
foo baa -v
bash "$@"

Then configure Cromwell/miniwdl to use myshell.sh as the default shell.

@markjschreiber
Contributor Author

Adjusting the launch shell is possible, but not in the case of a managed workflow engine (such as one behind a GA4GH endpoint), or in a world where your system administrator doesn't want you messing with a system that's used by many others.

To help with the "where does it come from?" question, I think using imports and namespaces should be required if the constant is declared in another file (this also removes naming conflicts).
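A sketch of what that might look like (both the constants block and its namespacing are hypothetical syntax):

# constants.wdl
version 1.x

constants {
  String cmd = "foo baa -v"
}

# main.wdl
version 1.x

import "constants.wdl" as const

task my_task {
  command <<<
    # the namespace makes the constant's origin explicit and avoids conflicts
    ~{const.cmd}
  >>>
}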

@jdidion
Collaborator

jdidion commented Jan 30, 2023

I agree with @rhpvorderman. A primary goal of WDL is readability. Readability is sometimes sacrificed when it is outweighed by the potential benefit, but I don't see that being the case with globals.

@jdidion
Collaborator

jdidion commented Jan 30, 2023

One thing I might consider is some form of inheritance. For example:

version development

task Super {
  input {
    String cmd
  }

  command <<<
  ~{cmd}
  >>>
}

task Sub extends Super {
  input {
    File foo
  }

  command <<<
  ~{cmd}
  cp ~{foo} bar
  >>>
}

Inheritance would work via simple merge - the declaration with a given name at the lowest level replaces any with the same name at a higher level (and overriding a variable cannot change its type).

With inheritance it is at least clear where a task is inheriting from - another task that is in the same file or is explicitly imported from another file.

I could see this replacing a lot of boilerplate in my own WDL tasks. But it also comes with some risk of unintended behavior. So I'm not sure whether it would be worth the tradeoff.

@markjschreiber
Contributor Author

Wouldn't you still need to call Sub and provide cmd? So if all your workflow tasks extend Super, you have to provide cmd to every call to them? With a constant available to the whole workflow, you don't have to provide cmd to Sub, but you can still use it inside Sub.

@jdidion
Collaborator

jdidion commented Jan 30, 2023

Yes, you'd still need to explicitly call each task with cmd, but you wouldn't have to define cmd for every task. In the trivial example above it doesn't save much effort, but there are complex workflows where the same set of inputs is passed to every task.

Another place where things could be simplified is in the call input section. We recently added the ability to elide call input values when the name of the input matches the name of the variable, i.e.

workflow foo {
  input {
    String bar
  }
  # these two forms are equivalent
  call mytask { input: bar = bar }
  call mytask as mytask2 { input: bar }
}

We could further introduce syntax (e.g. ...) to insert all missing inputs that match the name of a variable in the current scope:

task SuperTask {
  input {
    String cmd
  }
  command <<<
  ~{cmd}
  >>>
}

task SubTask extends SuperTask {
  input {
    Int baz
  }
}

workflow SuperWf {
  input {
    String cmd
  }
}

workflow SubWf extends SuperWf {
  # the '...' means that the `cmd` input to this workflow will be passed to the
  # `cmd` input of SubTask (via the input inherited from SuperTask)
  call SubTask {
    input: baz = 1, ...
  }
}

Now you only have to pass cmd to the top-level workflow and it will be propagated. Of course this opens more cans of worms - extending workflows is more complicated and potentially problematic than extending tasks; SuperWf and SubWf would have to be defined in separate files unless we begin allowing more than one workflow definition in a file (or create the notion of an "abstract" task/workflow), ...

I'm guessing this would get shot down for introducing more complexity than it's worth.

@markjschreiber
Contributor Author

Another approach that would work for my case (and maybe yours) would be to add a before_all command-style section and an after_all command section - special tasks that run before all workflow tasks and after all workflow tasks. It would depend on the task's container being able to run them.
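One possible reading of that proposal, sketched at the workflow level (before_all and after_all are hypothetical sections, and step_one/step_two are placeholder tasks):

version 1.x

workflow wf {
  before_all {
    # runs once, before any task in the workflow
    command <<<
      foo baa -v
    >>>
  }

  call step_one
  call step_two

  after_all {
    # runs once, after all workflow tasks have finished
    command <<<
      echo "all tasks done"
    >>>
  }
}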

@jdidion
Collaborator

jdidion commented Jan 30, 2023

Where would those sections go? Would you put them in every task? Or would you put them in the workflow and expect every task to inherit them?

@jdidion
Collaborator

jdidion commented Jan 30, 2023

I'll propose yet another option that may be more palatable to everyone.

In development we have introduced the concept of hints (and this may get backported to WDL 1.2, assuming it comes to pass), which is like runtime except that the engine is allowed to ignore hints. We also define some reserved hints with specific meanings. We could introduce another reserved hint called, e.g., global. The global hint would have special semantics: if it is defined in a workflow, then every workflow or task called by that workflow inherits global from its parent, unless it defines global itself.

workflow A {
  call B
  call C

  hints {
    global: {
      cmd: "echo hi"
    }
  }
}

task B {
  # echoes "hi"
  command <<<
  ~{wdl.hints.global.cmd}
  >>>
}

task C {
  # echoes "bye"
  command <<<
  ~{wdl.hints.global.cmd}
  >>>
  
  hints {
    global: {
      cmd: "echo bye"
    }
  }
}

Keep in mind that in WDL 1.1 we introduced the ability to specify runtime attributes at runtime, and this is extended to hints in development. So you wouldn't have to explicitly define cmd in your top-level workflow - you could pass it in as an input.

The only prerequisite would be to make the runtime and hints attributes accessible within the command section (proposed here).

@markjschreiber
Contributor Author

Hints are optional and can be ignored if the engine chooses (I think). I don't know if a global constant should be optional.

@jdidion
Collaborator

jdidion commented Jan 30, 2023

For hints, "optional" means the engine doesn't have to do anything with them. If we add runtime access to runtime and hints then we would have to require that the engine at least makes their attribute values available at runtime.

@rhpvorderman
Contributor

I am struggling to see the problem being solved here. Is it a big enough deal to warrant extra language constructs when a simple configuration in the execution engine can solve it?

Adjusting the launch shell is possible, but not in the case of a managed workflow engine (such as one behind a GA4GH endpoint), or in a world where your system administrator doesn't want you messing with a system that's used by many others.

Yes, but would you be debugging your workflow on a system that is used by many others? Wouldn't you do that on your own system?

@markjschreiber
Contributor Author

Lots of scenarios:

  1. I want a solution that is independent of the engine. The command shell isn't necessarily a feature of any given engine for WDL.
  2. I might want to have the constant available at the end of the command, or in the middle, or in a trap or interpolated into a string.
  3. There are many uses for constants beyond debugging. The URI of a Docker container used in multiple tasks is another key one. Right now many workflows take the container URI as a task input, which would be redundant in many GATK best-practices workflows if the URI were a global constant (sketched below).
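For example, the container-URI case might look like this under the hypothetical constants block (the image tag is illustrative only):

version 1.x

constants {
  String gatk_docker = "broadinstitute/gatk:4.2.6.1"
}

task validate {
  command <<<
    gatk --version
  >>>

  runtime {
    # shared across many tasks without threading an input through every call
    docker: gatk_docker
  }
}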

Would I debug (or profile, or whatever) on my own system? Unlikely. Most workflows run in distributed cloud environments; replicating this on my own system is only possible at very small scale and doesn't reveal what happens at larger scale, where failures can happen due to the vagaries of large distributed systems under stress. Many workflow developers will not be in a position to deploy and configure their own test environments. Many workflow developers that I interact with are not given sufficient admin privileges to deploy these kinds of environments, and really shouldn't need them.

@rhpvorderman
Contributor

So you want to inject code X into any task, at a position designated in the task's template. You want to define it globally so you don't have to pass X around your workflow using task inputs.

Is that a correct problem definition?

In that case the least invasive option I see is adding two runtime options:

  • run_before
  • run_after

These could be any arbitrary command, inserted before or after the task's command. By default they are empty strings. You can set runtime options globally by using defaults. It is not as flexible as what you propose, but it is easy to implement, easy to understand, and does not require tremendous changes to the language.
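A sketch of how that might look in a task (run_before and run_after are hypothetical runtime attributes, and the command is a placeholder):

version 1.1

task align {
  command <<<
    ./do_alignment.sh
  >>>

  runtime {
    run_before: "foo baa -v"    # inserted before the command; defaults to ""
    run_after: "echo finished"  # inserted after the command; defaults to ""
  }
}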

For more flexibility you still have the option of a debug input for each task, wiring your workflow so that all these inputs are used. That is a hassle, sure, but if run_before and run_after catch 90% of the use cases it is not that bad.

@markjschreiber
Contributor Author

I think these would be valuable additions. Another valuable option would be run_on_error, which would run if a task's command exits with a non-zero code.

We would need a few rules about what to do if any of these fail, and what return code is eventually reported to the engine (see the sketch after this list):

  1. What happens if run_before fails (non-zero exit)? I think this should just be logged, and the task's command should still run.
  2. The return code sent to the engine should be based only on the command, not on any value generated by run_before, run_after, or run_on_error.
  3. If the command fails, run_after will not run and run_on_error will run. If the command succeeds, run_after will run and run_on_error will not.
  4. Only one each of run_before, run_after, and run_on_error is allowed in any workflow.
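A sketch tying these rules together (all three attributes remain hypothetical, and the command is a placeholder):

version 1.1

task example {
  command <<<
    ./process.sh  # only this exit code is reported to the engine (rule 2)
  >>>

  runtime {
    run_before: "echo starting"    # a failure here is logged; the command still runs (rule 1)
    run_after: "echo cleaning up"  # runs only if the command exits 0 (rule 3)
    run_on_error: "cat ./*.log"    # runs only if the command exits non-zero (rule 3)
  }
}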

There may be (probably are) more rules needed that I haven't thought of yet.

@rhpvorderman
Contributor

We would need a few rules about what to do if any of these fail, and what return code is eventually reported to the engine.

Oh dear, this run_before thing is getting too complex for my tastes already.

I can do this if I define String cmd in each task, or make String cmd an input of each task, but that makes removing the debug command later more cumbersome.

That seems preferable if the alternative is adding complex debugging machinery to the language that is going to be rarely used - or worse, regularly abused. If these things are added to the language, people will start relying on them and a whole lot of WDL's simplicity will be lost.

The problem can already be solved with existing WDL. I think the use case is rare: only when the bug cannot be reproduced on one's own system without access to the production system. That should not happen often when using containers, since much of the environment is fixed in that case. I do not have numbers on this, only my own experience on the cluster. And even the HPC cluster gives me back enough information that I do not need this sort of functionality to solve the problem.

Adding new language functionality seems to me quite a heavy-handed solution to such a rare problem. Not that I want to diminish any problems you have. How many failing tasks that need to be debugged are we talking about, exactly?
