This repository has been archived by the owner on Oct 3, 2021. It is now read-only.

Combined tasks #1104

Draft

lembergerth wants to merge 7 commits into master from combined-tasks

Conversation

lembergerth
Contributor

Description of tasks

New benchmark tasks that are a combination of existing benchmark tasks.

Each benchmark task in this directory is created from two other tasks
of other categories, according to the following pattern:

/* definitions of Task 1 ... */
int main1() { /* main-method of Task 1 ... */ }
/* definitions of Task 2 ... */
int main2() { /* main-method of Task 2 ... */ }

int main() {
  if (__VERIFIER_nondet_int()) {
    main1();
  } else {
    main2();
  }
}

Definitions are renamed to avoid conflicts, if necessary.
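
For illustration (the identifiers here are hypothetical, not taken from actual tasks), if both input tasks defined a function init, the combined program could contain something like:

/* from Task 1, originally init() */
void init_1(void) { /* ... */ }

/* from Task 2, originally init(), renamed to avoid the name clash */
void init_2(void) { /* ... */ }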

This construction leads to programs with independent program parts.

The name of each benchmark task reveals its origin:
All task names are created by the pattern ${TASK_1}+${TASK_2}.

Open Issue: Licensing

I'm not sure about how to license these tasks.
Some of the tasks that we used are part of sv-benchmarks, but do not have a separate license in their corresponding directory.
For now, I have written a LICENSE.txt that coarsely references the original licenses (if present),
but I don't think this is good enough.

I'd be happy about any constructive input on this matter.

Checklist

  • programs added to new and appropriately named directory

  • license present and acceptable (either in separate file or as comment at beginning of program)

  • contributed-by present (either in README file or as comment at beginning of program)

  • programs added to a .set file of an existing category, or new sub-category established (if justified)

  • intended property matches the corresponding .prp file

  • programs and expected answer added to a .yml file according to task definitions

  • architecture (32 bit vs. 64 bit) matches the corresponding .cfg file
  • original sources present
  • preprocessed files present
  • preprocessed files generated with correct architecture
  • Makefile added with correct content and without overly broad suppression of warnings

@PhilippWendler
Member

Yes, the licensing should be properly handled. The best and easiest way to do this would be to decide on #1099, add machine-readable (REUSE-style) license headers to all relevant source files, and adjust your tool that constructs the combined programs such that it puts the union of all such headers from the input files into the output files.
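
For example, a combined file could then carry the union of the REUSE-style headers of its two inputs, roughly like this (the license identifiers and attributions below are purely illustrative, not the actual licenses of any of these tasks):

// SPDX-FileCopyrightText: <contributors of Task 1>
// SPDX-FileCopyrightText: <contributors of Task 2>
//
// SPDX-License-Identifier: Apache-2.0 AND GPL-3.0-or-later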

@PhilippWendler added the labels "C (Task in language C)" and "new benchmarks" on Jul 23, 2020
@PhilippWendler
Member

Now that #1099 is close to getting merged: For which directories would you need REUSE-style license declarations in order to be able to update this PR?

After we have this list we can maybe split the workload over a few people. For some directories the task of adding license headers might be a lot of work, but then we could leave programs from these directories out for a first round of combined tasks.

@lembergerth
Contributor Author

Thanks for following up on this.

We have the following pairs:

  • systemc/pc_sfifo*, systemc/token_ring*
  • systemc/token_ring*, eca-rers2012/Problem05_label4*
  • eca-rers2012/Problem12_label0*, seq-mthreaded/pals_lcr.*
  • bitvector/gcd_*, floats-cdfpl/newton_*
  • floats-cdfpl/square*, bitvector/soft_float_*

So directory-wise, we need systemc/, eca-rers2012/, seq-mthreaded/, bitvector/, and floats-cdfpl/.

@PhilippWendler
Member

Oh, that's a problem. For both systemc and bitvector we still do not know the license (cf. #165), so if we want to do this correctly and add these new tasks only if their license is clear (to avoid more work in the future), we would have to ping the original submitters of these tasks first and ask for a license.

@lembergerth
Contributor Author

What about eca-rers2012/Problem12_label0*, seq-mthreaded/pals_lcr.*?
Is that pair easy?

We could try to get these merged for SV-COMP'21 and postpone the others.

@PhilippWendler
Member

Right, I overlooked this pair. This seems doable.

All ECA programs should be easy, because I don't think there were ever any cases where other people added tasks to those directories, so we can assume they are all from the same source.

For seq-mthreaded, one should maybe double-check the history of the directory. Furthermore, their README claims that these tasks were produced by processing the sources of some other applications. So we either need to hope that the license also covers the original applications, or have a look at whether we can find out more about this and confirm this assumption.

@lembergerth force-pushed the combined-tasks branch 3 times, most recently from 718eb2d to cbc1944, on November 3, 2020 18:35
…2012/Problem12_label0*

Used in 'Dirk Beyer, Marie-Christine Jakobs, Thomas Lemberger: Difference Verification with Conditions, SEFM20'
…ring

Used in 'Dirk Beyer, Marie-Christine Jakobs, Thomas Lemberger: Difference Verification with Conditions, SEFM20'
…ton_

Used in 'Dirk Beyer, Marie-Christine Jakobs, Thomas Lemberger: Difference Verification with Conditions, SEFM20'
…oft_float_

Used in 'Dirk Beyer, Marie-Christine Jakobs, Thomas Lemberger: Difference Verification with Conditions, SEFM20'
…systemc/token_ring

Used in 'Dirk Beyer, Marie-Christine Jakobs, Thomas Lemberger: Difference Verification with Conditions, SEFM20'
@lembergerth
Contributor Author

I've created #1210 to handle the feasible tasks.

@hernanponcedeleon
Contributor

I might be missing something, but I don't see what the added value of these benchmarks is.

Besides the fact that the new benchmarks are larger than the base ones in isolation, I don't think they pose any challenges to verifiers that are not already posed by the base benchmarks.

Is there anything I am missing?

@lembergerth
Contributor Author

The tasks are not just larger than the individual components, but simulate systems with loosely coupled components. This is almost non-existent in sv-benchmarks at the moment, as most tasks describe a single algorithm or program behavior.

@hernanponcedeleon
Contributor

I can see the value in having loosely coupled systems (microservices), but since "loosely" here actually means completely decoupled (exactly one of them is executed), one could argue that this does not faithfully simulate such systems.

As I see it, they just increase maintenance complexity without bringing much new in terms of verification challenges.

@lembergerth
Contributor Author

I didn't talk about microservices, and I'm not trying to simulate those.
The tasks submitted here are very loosely coupled, yes, but that doesn't mean that instances like this don't exist in the real world (strategy pattern, command-line tools with different commands, etc.).

You can read more about the tasks here, and also see that they actually are challenging: https://www.sosy-lab.org/research/pub/2020-SEFM.Difference_Verification_with_Conditions.pdf

@lembergerth lembergerth closed this Nov 3, 2020
@lembergerth lembergerth reopened this Nov 3, 2020
@hernanponcedeleon
Contributor

Thanks for the pointer. I see that in the context of evolving programs this makes sense.

@dbeyer dbeyer added this to the SV-COMP 2022 milestone Nov 4, 2020
@lembergerth lembergerth marked this pull request as draft November 4, 2020 16:34
@priyankadarke

I agree with @hernanponcedeleon. These programs are interesting if the original and evolved programs are provided. Are these programs being submitted in such a context? If not, then what value do they add in the context of reachability analysis? These tasks could be made more challenging if they were not as loosely coupled.

@lembergerth
Contributor Author

Since when is it necessary to argue for new benchmark contributions? I don't see this in the PR checklist. It requires time that is probably better spent on fixing other benchmark tasks.

These programs are interesting if the original and evolved programs are provided

Just to avoid confusion: The submitted programs are "original" and "evolved", but these concepts may be confusing because the tasks were artificially generated and "original" and "evolved" can be swapped arbitrarily. (We can say that A -> B or B -> A).
There is no task marked as coming before another, chronologically.

Are these programs being submitted in such a context?

Sorry, I'm not sure what you mean by this.

If not, then what value do they add in the context of reachability analysis?

This was already stated: the tasks are not just larger than the individual components, but simulate systems with loosely coupled components. This is almost non-existent in sv-benchmarks at the moment, as most tasks describe a single algorithm or program behavior.

Loose coupling is a feature for these tasks:
As far as I know, for tasks with expected verdict 'false', these are the first tasks that provide difficult-to-analyze program logic that is not relevant to finding a counterexample.

In case you're interested, all further information about how these tasks were selected and how well CPA-seq and UAutomizer perform on them can be found in the referenced paper.

@priyankadarke

Hi,
It is interesting to see how the combination programs have been created and used in the context of incremental verification. Very interesting indeed. We present our understanding of the combination programs here; please correct us if we are wrong. These programs have been submitted as a new sub-category of the Reach-Safety category. Each program, Pi, is defined as follows:
/* definitions of Task 1 ... */
int main1() { /* main-method of Task 1 ... */ }
/* definitions of Task 2 ... */
int main2() { /* main-method of Task 2 ... */ }

int main() {
  if (__VERIFIER_nondet_int()) {
    main1();
  } else {
    main2();
  }
}

main1 and main2 are independent of each other. So, they belong to different program slices. Any tool that can verify existing svcomp programs corresponding to main1 and main2 within a CPU time limit of 900 seconds can verify Pi. Thus, verifying Pi is like running the following commands, either in parallel or sequentially:
cbmc --unwind x --unwinding-assertions Pi --function main1
cbmc --unwind x --unwinding-assertions Pi --function main2
Replace cbmc with any other tool.
Any tool that cannot verify both the existing svcomp programs corresponding to main1 and main2 (within 900s) will not be able to verify Pi.
To the best of our knowledge, such a program Pi, when seen in practice, represents large 'software systems' that run independent or loosely coupled modules. Shall we then not add combinations as a new 'synthesized programs' sub-category of software systems?
A large number of programs is harder to maintain, while the main feature and the intent of adding each Pi are the same.

@lembergerth
Contributor Author

@priyankadarke,

main1 and main2 are independent of each other. So, they belong to different program slices.

Exactly.

Any tool that can verify existing svcomp programs corresponding to main1 and main2 within a CPU time limit of 900 seconds can verify Pi.

Yes, in theory. But the runtime of verifying Pi is unknown.
For example, if Pi is safe, let's assume we use predicate abstraction with predicates applied globally.
Then the predicates necessary to prove Pi safe are, in the worst case (but expectedly), the union of the predicates necessary to prove main1 and main2 correct.
Let's assume predicate abstraction first solves main1 fully and then goes on to solve main2. Predicate abstraction now has lots of predicates from main1 that it will try to apply at each abstraction step in main2, but none of them are relevant there.
This may worsen the runtime.

Similarly, traditional bounded model checking on Pi will have to check significantly larger formulas than if main1 and main2 were considered separately.
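
Roughly (informal notation; c stands for the value of the nondeterministic choice in main), a single bounded-model-checking query for Pi carries the unrollings of both parts at once:

  phi_Pi  ≈  (c != 0 && unroll(main1))  ||  (c == 0 && unroll(main2))

so the solver has to handle the encodings of both components in one formula, even though only one of them is ever executed.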

As another example, let's assume Pi is unsafe; main1 is unsafe and main2 is safe.
If a verifier is not directed and does not perform smart abstraction, it may take a significant amount of time (or even fail) to prove main2 safe.

The tasks that were chosen for main1 and main2 are already difficult to solve by themselves - so choosing an efficient abstraction or portfolio approach may be necessary to solve Pi.

In theory it is possible to simply run two instances of your tool, as you proposed; you run one instance on main1 and one instance on main2.
But how do you find out that this is the right strategy?
And, looking at CPU time, this may take too much time.

Many other issues regarding scalability of verifiers apply to Pi. I would argue that these scalability issues are even more evident because Pi consists of loosely coupled components that don't benefit from caching etc.

I hope this was able to show that solving these combinations without knowing their loose structure is not as trivial as it may seem.

To the best of our knowledge, such a program Pi, when seen in practice, represents large 'software systems' that run independent or loosely coupled modules. Shall we then not add combinations as a new 'synthesized programs' sub-category of software systems?

I think that's an interesting idea.
But I think the current idea of SoftwareSystems is that these are real-world software systems, which doesn't apply here.
I wouldn't mind changing the category name, though.

A large number of programs is harder to maintain, while the main feature and the intent of adding each Pi are the same.

Yes, that's certainly true. But these tasks, as they currently are, can still be useful for incremental verification. So removing some of them would diminish their purpose.
In addition, most of the maintenance work we do is either (a) automated through scripts or (b) fixing wrong verdicts or undefined behavior. Since the submitted tasks only slightly differ, if one of the tasks is wrong, (b) should also be automatable.

@priyankadarke

Hmm. Thanks for the clarification.

Labels
C (Task in language C), new benchmarks

5 participants