Merge pull request #12 from mrcmry/fix-after-following-tutorial
Fix typos
Dolu1990 authored Dec 27, 2024
2 parents 666c586 + fdd8240 commit 43a5652
Showing 13 changed files with 146 additions and 146 deletions.
2 changes: 1 addition & 1 deletion source/VexiiRiscv/BranchPrediction/index.rst
@@ -37,7 +37,7 @@ Will :
Note that it may help to not make the BTB learn when there has been a non-taken branch.

- The BTB doesn't need to predict non-taken branches
- - Keep the BTB entry for something more usefull
+ - Keep the BTB entry for something more useful
- For configs in which multiple instructions can reside in a single fetch word (ex: dual issue with RVC),
  multiple branch/jump instructions can reside in a single fetch word => compromises are needed,
  in the hope that some of the branches/jumps in the chunk are rarely taken.
6 changes: 3 additions & 3 deletions source/VexiiRiscv/Debug/index.rst
@@ -1,8 +1,8 @@
Debug support
- ================================================
+ =============

Architecture
- -------------------
+ ------------
VexiiRiscv supports hardware debugging by implementing the official RISC-V debug spec.

- Compatible with OpenOCD (and maybe some other closed-source vendor tools, but untested)
@@ -40,7 +40,7 @@ via openocd and its TCP remote_bitbang bridge as if it was real hardware:
But note that the speed will be quite low (as it is a hardware simulation)
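
For reference, a minimal OpenOCD invocation for such a remote_bitbang setup could look like the sketch below (the TCP port and the target configuration file name are assumptions, not taken from the repository):

.. code-block:: bash

   # Hypothetical invocation: adjust the port to whatever the simulation
   # reports, and point -f at your actual VexiiRiscv target configuration.
   openocd -c "adapter driver remote_bitbang" \
           -c "remote_bitbang host localhost" \
           -c "remote_bitbang port 44853" \
           -f vexiiriscv_sim.cfg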

EmbeddedRiscvJtag
- -------------------
+ -----------------

EmbeddedRiscvJtag is a plugin which can be used to integrate the RISC-V debug module and its JTAG TAP directly inside
the VexiiRiscv. This simplifies its deployment, but it can only be used in single-core configs.
6 changes: 3 additions & 3 deletions source/VexiiRiscv/Docker/index.rst
@@ -22,7 +22,7 @@ where you cloned the repo to doesn't have the same uid as the ubuntu user inside
Docker container! The uid of the ubuntu user is 1000

Linux and MacOS X
- ------------------
+ -----------------

There's a bash script called run_docker.sh which automatically pulls the most
recent Docker image, starts it and then launches a VNC viewer.
@@ -37,7 +37,7 @@ Then you can simply run
./run_docker.sh
After the image has been fetched and the virtual X server has started, you should
- be greated with an XFCE4 desktop in a VNC viewer
+ be greeted with an XFCE4 desktop in a VNC viewer
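
As a rough sketch, the script automates steps along these lines (the image name and the VNC display are assumptions, not taken from the script itself):

.. code-block:: bash

   # Hypothetical equivalent of run_docker.sh: pull the latest image,
   # start the container, then attach a VNC viewer to its desktop.
   docker pull <vexiiriscv-image>
   docker run --rm -d <vexiiriscv-image>
   vncviewer localhost:5900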

Windows
-------
@@ -183,7 +183,7 @@ Next load the konata log by going into the folder as shown in the picture
:width: 400
:alt: Load konata log

- You should be greated with a colorful representation of the instructions
+ You should be greeted with a colorful representation of the instructions
in the RISC-V pipeline during boot up

.. image:: Screenshot_20241203_151124.png
4 changes: 2 additions & 2 deletions source/VexiiRiscv/Execute/fpu.rst
@@ -29,7 +29,7 @@ There is a few foundation plugins that compose the FPU :
.. image:: /asset/picture/fpu.png

Area / Timings options
- -----------------------
+ ----------------------

To improve the FPU area and timings (especially on FPGA), there are currently two main options implemented.

@@ -49,7 +49,7 @@ and if the user provide floating point constants which are subnormals number,
they will be considered as 2^exp_subnormal numbers.

In practice those two options do not seem to create issues (for regular use cases),
- as it was tested by running debian with various software and graphical environnements.
+ as it was tested by running debian with various software and graphical environments.

Optimized software
------------------
4 changes: 2 additions & 2 deletions source/VexiiRiscv/Execute/plugins.rst
@@ -131,11 +131,11 @@ CsrAccessPlugin
- Implement the CSR read and write instruction in the execute pipeline
- Provide an API for other plugins to specify the mapping between the CSR registers and the CSR instruction

- See the :ref:`privileges` chapter for more informations.
+ See the :ref:`privileges` chapter for more information.

EnvPlugin
^^^^^^^^^^^^^^^

- See the :ref:`privileges` chapter for more informations.
+ See the :ref:`privileges` chapter for more information.

- Implement a few instructions such as MRET, SRET, ECALL, EBREAK, FENCE.I, WFI by producing hardware traps
6 changes: 3 additions & 3 deletions source/VexiiRiscv/Fetch/index.rst
@@ -3,10 +3,10 @@ Fetch
=====

The goal of the fetch pipeline is to provide the CPU with a stream of words in which the instructions to execute are present.
- So more precisely, the fetch pipeline doesn't realy have the notion of instruction, but instead, just provide memory aligned chunks of memory block (ex 64 bits).
+ So more precisely, the fetch pipeline doesn't really have the notion of instruction, but instead, just provide memory aligned chunks of memory block (ex 64 bits).
Those chunks of memory (words) will later be handled by the "AlignerPlugin" to extract the instructions to be executed (and also handle the decompression in the case of RVC).

- Here is an example of fetch architecture with an instruction cache, branch predictor aswell as a prefetcher.
+ Here is an example of fetch architecture with an instruction cache, branch predictor as well as a prefetcher.

.. image:: /asset/picture/fetch_l1.png

@@ -87,7 +87,7 @@ Will :

To improve performance, consider first increasing the number of cache ways to 4.
The hardware prefetcher can help, but its benefit varies a lot with the workload. If you enable it, then consider
- increasing the number of refill slots to at least 2, idealy 3.
+ increasing the number of refill slots to at least 2, ideally 3.



2 changes: 1 addition & 1 deletion source/VexiiRiscv/Framework/index.rst
@@ -2,7 +2,7 @@ Framework
=========

Tools and API
- ------------------------
+ -------------

Overall VexiiRiscv is based on a few tools and APIs which aim at describing hardware in more productive/flexible ways than Verilog/VHDL.

2 changes: 1 addition & 1 deletion source/VexiiRiscv/HowToUse/index.rst
@@ -318,7 +318,7 @@ Konata is a Node JS application started with Electron, so you will have to insta

You can set up and start Konata by cloning it and using npm

- The make comman will execute npm electron ., which will open the Konata window
+ The make command will execute npm electron ., which will open the Konata window

.. code-block:: bash
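
For context, a typical Konata setup could look like the following sketch (the repository URL and the use of npm start are assumptions about the upstream project, not taken from this repository's Makefile):

.. code-block:: bash

   # Hypothetical setup: clone Konata, install its Node dependencies,
   # then launch the Electron application.
   git clone https://github.com/shioyadan/Konata.git
   cd Konata
   npm install
   npm start   # runs electron . and opens the Konata window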
44 changes: 22 additions & 22 deletions source/VexiiRiscv/Memory/index.rst
@@ -1,15 +1,15 @@
.. _lsu:

Memory (LSU)
- ###################
+ ############

LSU stands for Load Store Unit; VexiiRiscv currently has 2 implementations for it:

- LsuCachelessPlugin for microcontrollers, which doesn't implement any cache
- LsuPlugin / LsuL1Plugin which can work together to implement load and store through an L1 cache

Without L1
- ====================
+ ==========

Implemented by the LsuCachelessPlugin, it should be noted that to
reach good frequencies on an FPGA SoC, forking the memory request at
@@ -19,7 +19,7 @@ as it relax the AGU timings as well as the PMA (Physical Memory Attributes) chec
.. image:: /asset/picture/lsu_nol1.png

With L1
- ====================
+ =======

This configuration supports :

@@ -97,13 +97,13 @@ To improve the performances, consider first increasing the number of cache ways

The store buffer will help a lot with the store bandwidth by allowing the CPU to not be blocked by every store miss.
The hardware prefetcher will help with both store/load bandwidth (but if the store buffer is already enabled, it will not
- realy increase the store bandwidth).
+ really increase the store bandwidth).

For the hardware prefetcher to stretch its legs, consider using 4 refill/writeback slots. This will also help the store buffer.


Prefetching
- ----------------------
+ -----------

Currently there are two implementations of prefetching

@@ -173,11 +173,11 @@ Also, prefetch which fail (ex : because of hazards in L1) aren't replayed.
The prefetcher can be turned off by setting the CSR 0x7FF bit 1.

performance measurements
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ ^^^^^^^^^^^^^^^^^^^^^^^^

Here are a few performance gain measurements done on litex with a :

- - quad-core RV64GC running at 200 Mhz
+ - quad-core RV64GC running at 200 MHz
- 16 KB L1 cache for each core
- 512 KB of l2 cache shared (128 bits data bus)
- 4 refill slots + 4 writeback slots + 32 entry store queue + 4 slots store queue
@@ -206,19 +206,19 @@ Here are a few performance gain measurements done on litex with a :
- 50.2 fps

Hardware Memory coherency
- --------------------------------------------
+ -------------------------

Hardware memory coherency is the feature which allows multiple memory agents (ex : CPU, DMA, ...)
to work on the same memory locations and notify each other when they change their contents.
- Without it, the CPU software would have to manualy flush/invalidate their L1 caches to keep things in sync.
+ Without it, the CPU software would have to manually flush/invalidate their L1 caches to keep things in sync.

There are mostly 2 kinds of hardware memory coherency architectures :

- By invalidation : When a CPU/DMA writes some memory, it notifies the other CPU caches that they should invalidate any
-   old copy that they have of the written memory locations. This is generaly used for write-through L1 caches.
+   old copy that they have of the written memory locations. This is generally used for write-through L1 caches.
This isn't what VexiiRiscv implements.
- - By permition : Memory blocks copies (typicaly 64 aligned bytes blocks which resides in L1 cache lines) can have multiple states.
-   Some of which provide read only accesses, while others provide read/write accesses. This is generaly used in write-back L1 caches,
+ - By permission : Memory blocks copies (typically 64 aligned bytes blocks which resides in L1 cache lines) can have multiple states.
+   Some of which provide read only accesses, while others provide read/write accesses. This is generally used in write-back L1 caches,
and this is what VexiiRiscv uses.

In VexiiRiscv, the hardware memory coherency (L1) with other memory agents (CPU, DMA, L2, ..) is supported through a MESI implementation which can be bridged to a tilelink memory bus.
@@ -249,32 +249,32 @@ Here is the hardware interfaces :
When data needs to be written back, it will be done through the write_cmd channel.

Memory system
- ----------------------
+ -------------

Currently, VexiiRiscv can be used with the Tilelink memory interconnect from SpinalHDL and Chipyard (https://chipyard.readthedocs.io/en/latest/Generators/VexiiRiscv.html).

Why Tilelink
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ ^^^^^^^^^^^^

So, why use Tilelink, while most of the FPGA industry is using AXI4? Here are some issues / complexities that AXI4 brings with it.
- (Dolu1990 opinions, with the perspective of using it in FPGA, with limited man power, don't see this as an absolute truth)
+ (Dolu1990 opinions, with the perspective of using it in FPGA, with limited manpower, don't see this as an absolute truth)

- The AXI4 memory ordering, while allowing CPU/DMA to get preserved ordering between transactions with the same ID,
  creates complexities and bottlenecks in the memory system, typically in the interconnect decoders
- to avoid dead-locks, but even more in L2 caches and DRAM controllers which ideally would handle every request out of order.
+   to avoid dead-locks, but even more in L2 caches and DRAM controllers which ideally would handle every request out of order.
  Tilelink instead specifies that the CPU/DMAs shouldn't assume any memory ordering between inflight transactions.
- AXI4 specifies that the memory read response channel can interleave between multiple ongoing bursts.
  While this can be useful for very large bursts (which in itself is a bad idea, see next chapter),
this can lead to big area overhead for memory bridges, especially with width adapters.
- Tilelink doesn't allows this behaviour.
- - AXI4 splits write address from write data, which add additional synchronisations points in the interconnect decoders/arbiters and peripherals (bad for timings)
+ Tilelink doesn't allows this behavior.
+ - AXI4 splits write address from write data, which add additional synchronizations points in the interconnect decoders/arbiters and peripherals (bad for timings)
as well as potentially decrease performances when integrating multiple AXI4 modules which do not use similar address/data timings.
- AXI4 isn't great for low latency memory interconnects, mostly because of the previous point.
- AXI4 splits read and write channels (ar r / aw w b), which mostly double the area cost of address decoding/routing for DMA and non-coherent CPUs.
- AXI4 specifies a few "low value" features which increase complexity and area (ex: WRAP/FIXED bursts, unaligned memory accesses).

Efficiency cookbook
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ ^^^^^^^^^^^^^^^^^^^

Here is a set of design guidelines to keep a memory system lean and efficient (don't see this as an absolute truth) :

@@ -288,14 +288,14 @@ Here are a set of design guideline to keep a memory system lean and efficient (d
- DMA should access up to 64 aligned bytes per burst; this should be enough to reach peak bandwidth. No need for 4KB Rambo bursts.
  Asking a system to support bursts bigger than 64 aligned bytes can lead to extra cost, as it creates new ordering constraints between the memory blocks of the burst.
  For instance in a L2 cache it can lead to the implementation of a reorder buffer to deal with transactions which hit/miss the cache, adding extra complexity/area/timing costs.
-   Additionaly, big burst can create high latency spike for other agents (CPU/DMA).
+   Additionally, big burst can create high latency spike for other agents (CPU/DMA).
- DMA should only do burst aligned memory accesses (to keep them easily portable to Tilelink)
- It is fine for DMA to over fetch (let's say you need 48 bytes, but access aligned 64 bytes instead),
as long as the bulk of the memory bandwidth is not doing it.
- DMA should avoid doing multiple accesses in a 64 byte block if possible, and instead use a single access.
This can preserve the DRAM controller bandwidth (see DDR3/4/5 comments above),
but also, L2/L3 cache designs may block any additional memory request targeting a memory block which is already under operation.
- - When a DMA start a write burst, it has to complet as fast as possible. The reason is that the interconnect can lock itself on your burst until you finish it.
- - When a DMA start a read burst, it should avoid putting backpresure on the read responses. The reason is that the interconnect can lock itself on your burst until you finish it.
+ - When a DMA start a write burst, it has to complete as fast as possible. The reason is that the interconnect can lock itself on your burst until you finish it.
+ - When a DMA start a read burst, it should avoid putting backpressure on the read responses. The reason is that the interconnect can lock itself on your burst until you finish it.


82 changes: 41 additions & 41 deletions source/VexiiRiscv/Performance/index.rst
@@ -38,67 +38,67 @@ Here are a few synthesis results :
rv32i_noBypass ->
- 0.78 Dhrystone/MHz 0.60 Coremark/MHz
- - Artix 7 -> 210 Mhz 1182 LUT 1759 FF
- - Cyclone V -> 159 Mhz 1,015 ALMs
- - Cyclone IV -> 130 Mhz 1,987 LUT 2,017 FF
- - Trion -> 94 Mhz LUT 1847 FF 1990
- - Titanium -> 320 Mhz LUT 2005 FF 2030
+ - Artix 7 -> 210 MHz 1182 LUT 1759 FF
+ - Cyclone V -> 159 MHz 1,015 ALMs
+ - Cyclone IV -> 130 MHz 1,987 LUT 2,017 FF
+ - Trion -> 94 MHz LUT 1847 FF 1990
+ - Titanium -> 320 MHz LUT 2005 FF 2030
rv32i ->
- 1.12 Dhrystone/MHz 0.87 Coremark/MHz
- - Artix 7 -> 206 Mhz 1413 LUT 1761 FF
- - Cyclone V -> 138 Mhz 1,244 ALMs
- - Cyclone IV -> 124 Mhz 2,188 LUT 2,019 FF
- - Trion -> 78 Mhz LUT 2252 FF 1962
- - Titanium -> 300 Mhz LUT 2347 FF 2000
+ - Artix 7 -> 206 MHz 1413 LUT 1761 FF
+ - Cyclone V -> 138 MHz 1,244 ALMs
+ - Cyclone IV -> 124 MHz 2,188 LUT 2,019 FF
+ - Trion -> 78 MHz LUT 2252 FF 1962
+ - Titanium -> 300 MHz LUT 2347 FF 2000
rv64i ->
- 1.18 Dhrystone/MHz 0.77 Coremark/MHz
- - Artix 7 -> 186 Mhz 2157 LUT 2332 FF
- - Cyclone V -> 117 Mhz 1,760 ALMs
- - Cyclone IV -> 113 Mhz 3,432 LUT 2,770 FF
- - Trion -> 83 Mhz LUT 3883 FF 2681
- - Titanium -> 278 Mhz LUT 3909 FF 2783
+ - Artix 7 -> 186 MHz 2157 LUT 2332 FF
+ - Cyclone V -> 117 MHz 1,760 ALMs
+ - Cyclone IV -> 113 MHz 3,432 LUT 2,770 FF
+ - Trion -> 83 MHz LUT 3883 FF 2681
+ - Titanium -> 278 MHz LUT 3909 FF 2783
rv32im ->
- 1.20 Dhrystone/MHz 2.70 Coremark/MHz
- - Artix 7 -> 190 Mhz 1815 LUT 2078 FF
- - Cyclone V -> 131 Mhz 1,474 ALMs
- - Cyclone IV -> 125 Mhz 2,781 LUT 2,266 FF
- - Trion -> 83 Mhz LUT 2643 FF 2209
- - Titanium -> 324 Mhz LUT 2685 FF 2279
+ - Artix 7 -> 190 MHz 1815 LUT 2078 FF
+ - Cyclone V -> 131 MHz 1,474 ALMs
+ - Cyclone IV -> 125 MHz 2,781 LUT 2,266 FF
+ - Trion -> 83 MHz LUT 2643 FF 2209
+ - Titanium -> 324 MHz LUT 2685 FF 2279
rv32im_branchPredict ->
- 1.45 Dhrystone/MHz 2.99 Coremark/MHz
- - Artix 7 -> 195 Mhz 2066 LUT 2438 FF
- - Cyclone V -> 136 Mhz 1,648 ALMs
- - Cyclone IV -> 117 Mhz 3,093 LUT 2,597 FF
- - Trion -> 86 Mhz LUT 2963 FF 2568
- - Titanium -> 327 Mhz LUT 3015 FF 2636
+ - Artix 7 -> 195 MHz 2066 LUT 2438 FF
+ - Cyclone V -> 136 MHz 1,648 ALMs
+ - Cyclone IV -> 117 MHz 3,093 LUT 2,597 FF
+ - Trion -> 86 MHz LUT 2963 FF 2568
+ - Titanium -> 327 MHz LUT 3015 FF 2636
rv32im_branchPredict_cached8k8k ->
- 1.45 Dhrystone/MHz 2.97 Coremark/MHz
- - Artix 7 -> 210 Mhz 2721 LUT 3477 FF
- - Cyclone V -> 137 Mhz 1,953 ALMs
- - Cyclone IV -> 127 Mhz 3,648 LUT 3,153 FF
- - Trion -> 93 Mhz LUT 3388 FF 3204
- - Titanium -> 314 Mhz LUT 3432 FF 3274
+ - Artix 7 -> 210 MHz 2721 LUT 3477 FF
+ - Cyclone V -> 137 MHz 1,953 ALMs
+ - Cyclone IV -> 127 MHz 3,648 LUT 3,153 FF
+ - Trion -> 93 MHz LUT 3388 FF 3204
+ - Titanium -> 314 MHz LUT 3432 FF 3274
rv32imasu_cached_branchPredict_cached8k8k_linux ->
- 1.45 Dhrystone/MHz 2.96 Coremark/MHz
- - Artix 7 -> 199 Mhz 3351 LUT 3833 FF
- - Cyclone V -> 131 Mhz 2,612 ALMs
- - Cyclone IV -> 109 Mhz 4,909 LUT 3,897 FF
- - Trion -> 73 Mhz LUT 4367 FF 3613
- - Titanium -> 270 Mhz LUT 4409 FF 3724
+ - Artix 7 -> 199 MHz 3351 LUT 3833 FF
+ - Cyclone V -> 131 MHz 2,612 ALMs
+ - Cyclone IV -> 109 MHz 4,909 LUT 3,897 FF
+ - Trion -> 73 MHz LUT 4367 FF 3613
+ - Titanium -> 270 MHz LUT 4409 FF 3724
rv32im_branchPredictStressed_cached8k8k_ipcMax_lateAlu ->
- 1.74 Dhrystone/MHz 3.41 Coremark/MHz
- - Artix 7 -> 140 Mhz 3247 LUT 3755 FF
- - Cyclone V -> 99 Mhz 2,477 ALMs
- - Cyclone IV -> 85 Mhz 4,835 LUT 3,765 FF
- - Trion -> 60 Mhz LUT 4438 FF 3832
- - Titanium -> 228 Mhz LUT 4459 FF 3963
+ - Artix 7 -> 140 MHz 3247 LUT 3755 FF
+ - Cyclone V -> 99 MHz 2,477 ALMs
+ - Cyclone IV -> 85 MHz 4,835 LUT 3,765 FF
+ - Trion -> 60 MHz LUT 4438 FF 3832
+ - Titanium -> 228 MHz LUT 4459 FF 3963
Tuning
@@ -128,7 +128,7 @@ On FPGA there is a few options which can be key in order to scale up the IPC whi


Critical paths tool
- --------------------------------
+ -------------------

At the end of your synthesis/place/route tools, you get a critical path report where, hopefully, the source and destination registers are well named.
The issue is that in between, all the combinatorial logic and signal names become unrecognizable or misleading most of the time.