-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
risc-v vector v1.0 support #4050
Comments
And I would like to make some PRs to solve the problem 2 firstly.
Maybe the explicit |
Not sure what happened to the lengthy comment I wrote yesterday ... anyway |
I agree. Indeed we should keep the x280 kernels unchanged and apply changes in a new copy. Now the isa spec of rvv has been stable, but the intrinsic spec has not. So it may be better to wait a little longer until the intrinsic spec be stable. I will follow the progress of the spec. |
@sh-zheng, our intent when creating this rvv.c versions using the C intrinsics was that it adhere to the risc-v vector 1.0 support and not be x280 specific. However, we did want to fully exploit the available RVV 1.0 features. We had the same quandary as you did when we initially reviewed the _vector.c implementations, and then chose to create a separate implementation. Perhaps in time this can all converge. We will continue to update as the C intrinsic spec is finalized with hopefully not too many more changes. I'm aware of one more change forthcoming to the segment load/store intrinsics. Our intent was that the implementation should work for various VLENs however as you've pointed out in #3 (vl=8) that in some cases we've tried to maximize the performance given the x280's VLEN=512. So, some alternate implementations may be required there. GEMM is such an important performance metric and we tried to maximize, without reverting to assembler. While we developed in QEMU, we tested in hardware emulation to measure actual performance on real RTL and validate correctness, since we do not yet have silicon. Once we do have silicon we will do a full performance/test sweep again. For the compiler flags noted in #1, those are used by the clang/llvm auto-vectorizer on the generic C code and should not have an impact on the C intrinsic code. More recently I believe those flags are being standardized through the use of 'zvl512b' in the -march string. One point #2 'Changing vl in a loop, leading to tail cleared without tail undisturbed' I'm a little surprised that this would be x280 specific and not RVV standard. I'll go review. To me it would seem like a flaw if it is not prescribed in the standard? BTW, thanks for your work on FFTW for risc-v. We've used your PR directly :). |
@ken-unger , thank you very much for explaining some details of the design. I think the design intention of vector-isa, is to operate the vectors in their "natural" length. So if we have to use strip mining to operate long vector, it should be order-independent. In fact the rvv spec couldn't even garantee that |
Hello @sh-zheng, we are also interested in a generic RVV implementation. |
I believe this is addressed now through the additional RISCV64_ZVL256B and RISCV64_ZVL128B targets that got merged from the risc-v branch ? |
Yes, PR #4159 has updated the tail policy. Now almost all kernels have supported rvv, except for *trsm and c/zsymv kernels. Another issue is that, if we can merge ZVL256B and ZVL128B, since there is no essential gap between them? |
I'm not familiar enough with RISC-V, aren't at least the current GEMM kernels for ZVL256B and ZVL128B different (beyond using different unrolling) ? |
Yes they are different realizations now, at least using different unrolling. Even so maybe it's better if they are in a unified framework in future, regardless of vlen. |
The discussions in #4049 inspire me to creat an issue for further discussions.
Differ from commercial ISAs, which have a clear development plan, the total amount of products supporting rvv may be large. Optimization for all individual products may lead to code bloat, and is contrary to the purpose of the vector isa, which is expected to be length-adaptive.
Until now the intrinsic spec of rvv 1.0 is stable enough to develop codes, and the support of rvv 1.0 has been fully submitted to openblas, based on sifive x280, an in-order cpu with vlen=512.
Would it be better to do more development, based on this x280 version? The final destination may be the compatibility in different vlen, instruction execution order, tail/mask policy. Of course the pursuing of compatibility may lead to suboptimum performance, a balance have to be considered.
There are some cpu specified features in kernels of x280 and may lead to incorrect results in other cpus. List as following
-riscv-v-vector-bits-min=512
and-ffast-math
.vl = VSETVL(k);
in symv_L_rvv.c, line 96.size_t vl = 8;
in gemm_tcopy_8_rvv.c, line 84.In addition to above, the registers tiling in gemm of different vlen should be considered. Now we set
GEMM_UNROLL_N_SHIFT 8
, which may waste other vector registers. 12 or 14 may be better?The text was updated successfully, but these errors were encountered: