Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential optimization in the RISC-V example #16

Open
mbty opened this issue Aug 23, 2021 · 5 comments
Open

Potential optimization in the RISC-V example #16

mbty opened this issue Aug 23, 2021 · 5 comments

Comments

@mbty
Copy link

mbty commented Aug 23, 2021

I noticed that the getFields function of the RISC-V example often gets called just to get the value of a single field (with calls along the line of let funct3 := get(getFields(inst), funct3)). Since fetching the values of all the fields just to extract the value of one of them looked a bit overkill, I tried replacing getFields with individual getXxx functions (e.g. getFunct3). This provides a decent boost to simulation performance:

-- Running tests with Cuttlesim --
_objects/rv32i.v/rvcore.cuttlesim.opt tests/_build/rv32i/integ/rvbench_qsort.rv32 -1
[before] real: 0m0.018s	user: 0m0.017s	sys: 0m0.001s
[after ] real: 0m0.013s	user: 0m0.013s	sys: 0m0.000s
_objects/rv32i.v/rvcore.cuttlesim.opt tests/_build/rv32i/integ/rvbench_median.rv32 -1
[before] real: 0m0.012s	user: 0m0.012s	sys: 0m0.000s
[after ] real: 0m0.008s	user: 0m0.007s	sys: 0m0.001s
_objects/rv32i.v/rvcore.cuttlesim.opt tests/_build/rv32i/integ/img.rv32 -1
[before] real: 0m0.216s	user: 0m0.195s	sys: 0m0.020s
[after ] real: 0m0.160s	user: 0m0.142s	sys: 0m0.018s
_objects/rv32i.v/rvcore.cuttlesim.opt tests/_build/rv32i/integ/primes.rv32 -1
[before] real: 0m4.421s	user: 0m4.417s	sys: 0m0.001s
[after ] real: 0m3.090s	user: 0m3.084s	sys: 0m0.001s
_objects/rv32i.v/rvcore.cuttlesim.opt tests/_build/rv32i/integ/morse.rv32 -1
[before] real: 0m1.584s	user: 0m1.583s	sys: 0m0.000s
[after ] real: 0m1.086s	user: 0m1.085s	sys: 0m0.000s
-- Running tests with Verilator --
_objects/rv32i.v/obj_dir.opt/Vtop +VMH=tests/_build/rv32i/integ/rvbench_qsort.vmh -1
[before] real: 0m0.033s	user: 0m0.032s	sys: 0m0.001s
[after ] real: 0m0.030s	user: 0m0.030s	sys: 0m0.000s
_objects/rv32i.v/obj_dir.opt/Vtop +VMH=tests/_build/rv32i/integ/morse.vmh -1
[before] real: 0m2.689s	user: 0m2.687s	sys: 0m0.000s
[after ] real: 0m2.186s	user: 0m2.184s	sys: 0m0.000s
_objects/rv32i.v/obj_dir.opt/Vtop +VMH=tests/_build/rv32i/integ/rvbench_median.vmh -1
[before] real: 0m0.022s	user: 0m0.022s	sys: 0m0.000s
[after ] real: 0m0.018s	user: 0m0.017s	sys: 0m0.001s
_objects/rv32i.v/obj_dir.opt/Vtop +VMH=tests/_build/rv32i/integ/primes.vmh -1
[before] real: 0m8.358s	user: 0m8.349s	sys: 0m0.001s
[after ] real: 0m6.778s	user: 0m6.767s	sys: 0m0.001s
_objects/rv32i.v/obj_dir.opt/Vtop +VMH=tests/_build/rv32i/integ/img.vmh -1
[before] real: 0m0.399s	user: 0m0.385s	sys: 0m0.014s
[after ] real: 0m0.327s	user: 0m0.310s	sys: 0m0.017s

This results in a speedup of roughly 1.43 for Cuttlesim and of 1.23 for Verilator compared to master. I did not yet check the effects on synthesis but I might do it soon. I assume that this kind of optimization is applied automatically when synthesizing (isn't it?), so I don't expect changes on this side (alas). See commit 04cfdf8 for a demonstration.

@cpitclaudel
Copy link
Contributor

Nice sleuthing!

Indeed, this won't speed up actual circuits, because constructing a structure in Verilog doesn't cost anything (it just means remembering that a bunch of wires are bundled together).

On the simulation side it's a nice speedup, but we don't really want to "cheat" by rewriting code to make the simulator happier (especially because these sort of code changes then make it harder to say anything sinceere about Cuttlesim vs. other simulators, since we would have modified the code to fit Cuttlesim better). Instead, we should improve the simulator so that it handles this pattern better (It's a common pattern in the bluespec world).

I thought I did some experiments with this in the past, but I can't find my notes about them. What happens to performance if you force inlining of that function call at the C++ level (IIRC __attribute__((always_inline)) should do it)? Ideally inlining + dead code elimination should get rid of the unnecessary computation.

If that doesn't work, or if it's unreliable, then this would be a really interesting optimization pass to implement in Cuttlesim! We would check whether a function is pure and returns a struct, and if it does we would inline just the relevant computations. Sounds fun, actually :)

@mbty
Copy link
Author

mbty commented Aug 24, 2021

I just took a glance at how the C++ code for getFields is generated. As it is not a rule but a function, it returns by modifying its ret argument (initially passed by reference). This argument refers to a temporary variable created by the CALL_FN macro in the surrounding statement expression. Just a quick heads up, I guess that naming something tmp in a Kôika function (which should not be too rare) could lead to frustration since it would be hidden by this variable when used as an argument. This is especially true in the case where an element of type T called tmp is passed to a function returning T, since then the compiler wouldn't even complain about types not matching.

@cpitclaudel
Copy link
Contributor

Great catch, thanks! Fortunately, there's at least an inkling of the issue thanks to the compiler printing a warning (cuttlesim.hpp:253:47: warning: ‘tmp’ may be used uninitialized in this function [-Wmaybe-uninitialized]).

I pushed a fix just now.

@mbty
Copy link
Author

mbty commented Sep 30, 2021

An occurrence of tmp was left unchanged in the READ macro, which leads to errors when compiling anything that uses it (e.g. examples/conflicts_modular.v: cuttlesim.hpp:1403:27: error: ‘tmp’ was not declared in this scope; did you mean ‘tm’?).

cpitclaudel added a commit that referenced this issue Sep 30, 2021
@cpitclaudel
Copy link
Contributor

Ugh, thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants