Merge pull request #12 from mrcmry/fix-after-following-tutorial
Fix typos
Dolu1990 authored Dec 27, 2024
2 parents 666c586 + fdd8240 commit 43a5652
Showing 13 changed files with 146 additions and 146 deletions.
2 changes: 1 addition & 1 deletion source/VexiiRiscv/BranchPrediction/index.rst
@@ -37,7 +37,7 @@ Will :
Note that it may help to not make the BTB learn when there has been a non-taken branch.

- The BTB doesn't need to predict non-taken branches
- - Keep the BTB entry for something more usefull
+ - Keep the BTB entry for something more useful
- For configs in which multiple instructions can reside in a single fetch word (ex: dual issue with RVC),
  multiple branch/jump instructions can reside in a single fetch word => compromises are needed,
  in the hope that some of the branches/jumps in the chunk are rarely taken.
6 changes: 3 additions & 3 deletions source/VexiiRiscv/Debug/index.rst
@@ -1,8 +1,8 @@
Debug support
- ================================================
+ =============

Architecture
- -------------------
+ ------------
VexiiRiscv supports hardware debugging by implementing the official RISC-V debug spec.

- Compatible with OpenOCD (and maybe some other closed-source vendor tools, but untested)
@@ -40,7 +40,7 @@ via openocd and its TCP remote_bitbang bridge as if it was real hardware:
But note that the speed will be quite low (as it is a hardware simulation)
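
For reference, a minimal OpenOCD invocation for such a remote_bitbang setup could look like the sketch below (the TCP port and the target configuration file name are assumptions, not taken from the repository):

.. code-block:: bash

   # Hypothetical invocation: adjust the port to whatever the simulation
   # reports, and point -f at your actual VexiiRiscv target configuration.
   openocd -c "adapter driver remote_bitbang" \
           -c "remote_bitbang host localhost" \
           -c "remote_bitbang port 44853" \
           -f vexiiriscv_sim.cfg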

EmbeddedRiscvJtag
- -------------------
+ -----------------

EmbeddedRiscvJtag is a plugin which can be used to integrate the RISC-V debug module and its JTAG TAP directly inside
the VexiiRiscv. This simplifies its deployment, but it can only be used in single-core configs.
6 changes: 3 additions & 3 deletions source/VexiiRiscv/Docker/index.rst
@@ -22,7 +22,7 @@ where you cloned the repo to doesn't have the same uid as the ubuntu user inside
Docker container! The uid of the ubuntu user is 1000

Linux and MacOS X
- ------------------
+ -----------------

There's a bash script called run_docker.sh which automatically pulls the most
recent Docker image, starts it and then launches a VNC viewer.
@@ -37,7 +37,7 @@ Then you can simply run
./run_docker.sh
After the image has been fetched and the virtual X server has started, you should
- be greated with an XFCE4 desktop in a VNC viewer
+ be greeted with an XFCE4 desktop in a VNC viewer
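
As a rough sketch, the script automates steps along these lines (the image name and the VNC display are assumptions, not taken from the script itself):

.. code-block:: bash

   # Hypothetical equivalent of run_docker.sh: pull the latest image,
   # start the container, then attach a VNC viewer to its desktop.
   docker pull <vexiiriscv-image>
   docker run --rm -d <vexiiriscv-image>
   vncviewer localhost:5900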

Windows
-------
@@ -183,7 +183,7 @@ Next load the konata log by going into the folder as shown in the picture
:width: 400
:alt: Load konata log

- You should be greated with a colorful representation of the instructions
+ You should be greeted with a colorful representation of the instructions
in the RISC-V pipeline during boot up

.. image:: Screenshot_20241203_151124.png
4 changes: 2 additions & 2 deletions source/VexiiRiscv/Execute/fpu.rst
@@ -29,7 +29,7 @@ There is a few foundation plugins that compose the FPU :
.. image:: /asset/picture/fpu.png

Area / Timings options
- -----------------------
+ ----------------------

To improve the FPU area and timings (especially on FPGA), there are currently two main options implemented.

@@ -49,7 +49,7 @@ and if the user provide floating point constants which are subnormals number,
they will be considered as 2^exp_subnormal numbers.

In practice those two options do not seem to create issues (for regular use cases),
- as it was tested by running debian with various software and graphical environnements.
+ as it was tested by running debian with various software and graphical environments.

Optimized software
------------------
4 changes: 2 additions & 2 deletions source/VexiiRiscv/Execute/plugins.rst
@@ -131,11 +131,11 @@ CsrAccessPlugin
- Implement the CSR read and write instruction in the execute pipeline
- Provide an API for other plugins to specify the mapping between the CSR registers and the CSR instruction

- See the :ref:`privileges` chapter for more informations.
+ See the :ref:`privileges` chapter for more information.

EnvPlugin
^^^^^^^^^^^^^^^

- See the :ref:`privileges` chapter for more informations.
+ See the :ref:`privileges` chapter for more information.

- Implement a few instructions such as MRET, SRET, ECALL, EBREAK, FENCE.I, WFI by producing hardware traps
6 changes: 3 additions & 3 deletions source/VexiiRiscv/Fetch/index.rst
@@ -3,10 +3,10 @@ Fetch
=====

The goal of the fetch pipeline is to provide the CPU with a stream of words in which the instructions to execute are present.
- So more precisely, the fetch pipeline doesn't realy have the notion of instruction, but instead, just provide memory aligned chunks of memory block (ex 64 bits).
+ So more precisely, the fetch pipeline doesn't really have the notion of instruction, but instead, just provide memory aligned chunks of memory block (ex 64 bits).
Those chunks of memory (words) will later be handled by the "AlignerPlugin" to extract the instructions to be executed (and also handle the decompression in the case of RVC).

- Here is an example of fetch architecture with an instruction cache, branch predictor aswell as a prefetcher.
+ Here is an example of fetch architecture with an instruction cache, branch predictor as well as a prefetcher.

.. image:: /asset/picture/fetch_l1.png

@@ -87,7 +87,7 @@ Will :

To improve performance, consider first increasing the number of cache ways to 4.
The hardware prefetcher can help, but its benefit varies a lot with the workload. If you enable it, then consider
- increasing the number of refill slots to at least 2, idealy 3.
+ increasing the number of refill slots to at least 2, ideally 3.



2 changes: 1 addition & 1 deletion source/VexiiRiscv/Framework/index.rst
@@ -2,7 +2,7 @@ Framework
=========

Tools and API
- ------------------------
+ -------------

Overall VexiiRiscv is based on a few tools and APIs which aim at describing hardware in more productive/flexible ways than Verilog/VHDL.

2 changes: 1 addition & 1 deletion source/VexiiRiscv/HowToUse/index.rst
@@ -318,7 +318,7 @@ Konata is a Node JS application started with Electron, so you will have to insta

You can set up and start Konata by cloning it and using npm

- The make comman will execute npm electron ., which will open the Konata window
+ The make command will execute npm electron ., which will open the Konata window

.. code-block:: bash
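
For context, a typical Konata setup could look like the following sketch (the repository URL and the use of npm start are assumptions about the upstream project, not taken from this repository's Makefile):

.. code-block:: bash

   # Hypothetical setup: clone Konata, install its Node dependencies,
   # then launch the Electron application.
   git clone https://github.com/shioyadan/Konata.git
   cd Konata
   npm install
   npm start   # runs electron . and opens the Konata window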
44 changes: 22 additions & 22 deletions source/VexiiRiscv/Memory/index.rst
@@ -1,15 +1,15 @@
.. _lsu:

Memory (LSU)
- ###################
+ ############

LSU stands for Load Store Unit; VexiiRiscv currently has 2 implementations for it:

- LsuCachelessPlugin for microcontrollers, which doesn't implement any cache
- LsuPlugin / LsuL1Plugin which can work together to implement load and store through an L1 cache

Without L1
- ====================
+ ==========

Implemented by the LsuCachelessPlugin, it should be noted that to
reach good frequencies on an FPGA SoC, forking the memory request at
@@ -19,7 +19,7 @@ as it relax the AGU timings as well as the PMA (Physical Memory Attributes) chec
.. image:: /asset/picture/lsu_nol1.png

With L1
- ====================
+ =======

This configuration supports :

@@ -97,13 +97,13 @@ To improve the performances, consider first increasing the number of cache ways

The store buffer will help a lot with the store bandwidth by allowing the CPU to not be blocked by every store miss.
The hardware prefetcher will help with both store/load bandwidth (but if the store buffer is already enabled, it will not
- realy increase the store bandwidth).
+ really increase the store bandwidth).

For the hardware prefetcher to stretch its legs, consider using 4 refill/writeback slots. This will also help the store buffer.


Prefetching
- ----------------------
+ -----------

Currently there are two implementations of prefetching

@@ -173,11 +173,11 @@ Also, prefetch which fail (ex : because of hazards in L1) aren't replayed.
The prefetcher can be turned off by setting the CSR 0x7FF bit 1.

performance measurements
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ ^^^^^^^^^^^^^^^^^^^^^^^^

Here are a few performance gain measurements done on litex with a :

- - quad-core RV64GC running at 200 Mhz
+ - quad-core RV64GC running at 200 MHz
- 16 KB L1 cache for each core
- 512 KB of l2 cache shared (128 bits data bus)
- 4 refill slots + 4 writeback slots + 32 entry store queue + 4 slots store queue
@@ -206,19 +206,19 @@ Here are a few performance gain measurements done on litex with a :
- 50.2 fps

Hardware Memory coherency
- --------------------------------------------
+ -------------------------

Hardware memory coherency is the feature which allows multiple memory agents (ex : CPU, DMA, ...)
to work on the same memory locations and notify each other when they change their contents.
- Without it, the CPU software would have to manualy flush/invalidate their L1 caches to keep things in sync.
+ Without it, the CPU software would have to manually flush/invalidate their L1 caches to keep things in sync.

There are mostly 2 kinds of hardware memory coherency architectures :

- By invalidation : When a CPU/DMA writes some memory, it notifies the other CPU caches that they should invalidate any
-   old copy that they have of the written memory locations. This is generaly used for write-through L1 caches.
+   old copy that they have of the written memory locations. This is generally used for write-through L1 caches.
This isn't what VexiiRiscv implements.
- - By permition : Memory blocks copies (typicaly 64 aligned bytes blocks which resides in L1 cache lines) can have multiple states.
-   Some of which provide read only accesses, while others provide read/write accesses. This is generaly used in write-back L1 caches,
+ - By permission : Memory blocks copies (typically 64 aligned bytes blocks which resides in L1 cache lines) can have multiple states.
+   Some of which provide read only accesses, while others provide read/write accesses. This is generally used in write-back L1 caches,
and this is what VexiiRiscv uses.

In VexiiRiscv, the hardware memory coherency (L1) with other memory agents (CPU, DMA, L2, ..) is supported through a MESI implementation which can be bridged to a tilelink memory bus.
@@ -249,32 +249,32 @@ Here is the hardware interfaces :
When data needs to be written back, it will be done through the write_cmd channel.

Memory system
- ----------------------
+ -------------

Currently, VexiiRiscv can be used with the Tilelink memory interconnect from SpinalHDL and Chipyard (https://chipyard.readthedocs.io/en/latest/Generators/VexiiRiscv.html).

Why Tilelink
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ ^^^^^^^^^^^^

So, why use Tilelink, while most of the FPGA industry is using AXI4? Here are some issues / complexities that AXI4 brings with it.
- (Dolu1990 opinions, with the perspective of using it in FPGA, with limited man power, don't see this as an absolute truth)
+ (Dolu1990 opinions, with the perspective of using it in FPGA, with limited manpower, don't see this as an absolute truth)

- The AXI4 memory ordering, while allowing CPU/DMA to get preserved ordering between transactions with the same ID,
  creates complexities and bottlenecks in the memory system, typically in the interconnect decoders
- to avoid dead-locks, but even more in L2 caches and DRAM controllers which ideally would handle every request out of order.
+   to avoid dead-locks, but even more in L2 caches and DRAM controllers which ideally would handle every request out of order.
  Tilelink instead specifies that the CPU/DMAs shouldn't assume any memory ordering between inflight transactions.
- AXI4 specifies that the memory read response channel can interleave between multiple ongoing bursts.
  While this can be useful for very large bursts (which in itself is a bad idea, see next chapter),
this can lead to big area overhead for memory bridges, especially with width adapters.
- Tilelink doesn't allows this behaviour.
- - AXI4 splits write address from write data, which add additional synchronisations points in the interconnect decoders/arbiters and peripherals (bad for timings)
+ Tilelink doesn't allows this behavior.
+ - AXI4 splits write address from write data, which add additional synchronizations points in the interconnect decoders/arbiters and peripherals (bad for timings)
as well as potentially decrease performances when integrating multiple AXI4 modules which do not use similar address/data timings.
- AXI4 isn't great for low latency memory interconnects, mostly because of the previous point.
- AXI4 splits read and write channels (ar r / aw w b), which mostly double the area cost of address decoding/routing for DMA and non-coherent CPUs.
- AXI4 specifies a few "low value" features which increase complexity and area (ex: WRAP/FIXED bursts, unaligned memory accesses).

Efficiency cookbook
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ ^^^^^^^^^^^^^^^^^^^

Here is a set of design guidelines to keep a memory system lean and efficient (don't see this as an absolute truth) :

@@ -288,14 +288,14 @@ Here are a set of design guideline to keep a memory system lean and efficient (d
- DMA should access up to 64 aligned bytes per burst; this should be enough to reach peak bandwidth. No need for 4KB Rambo bursts.
  Asking a system to support bursts bigger than 64 aligned bytes can lead to extra cost, as it creates new ordering constraints between the memory blocks of the burst.
  For instance in a L2 cache it can lead to the implementation of a reorder buffer to deal with transactions which hit/miss the cache, adding extra complexity/area/timing costs.
-   Additionaly, big burst can create high latency spike for other agents (CPU/DMA).
+   Additionally, big burst can create high latency spike for other agents (CPU/DMA).
- DMA should only do burst aligned memory accesses (to keep them easily portable to Tilelink)
- It is fine for DMA to over fetch (let's say you need 48 bytes, but access aligned 64 bytes instead),
as long as the bulk of the memory bandwidth is not doing it.
- DMA should avoid doing multiple accesses in a 64 byte block if possible, and instead use a single access.
This can preserve the DRAM controller bandwidth (see DDR3/4/5 comments above),
but also, L2/L3 cache designs may block any additional memory request targeting a memory block which is already under operation.
- - When a DMA start a write burst, it has to complet as fast as possible. The reason is that the interconnect can lock itself on your burst until you finish it.
- - When a DMA start a read burst, it should avoid putting backpresure on the read responses. The reason is that the interconnect can lock itself on your burst until you finish it.
+ - When a DMA start a write burst, it has to complete as fast as possible. The reason is that the interconnect can lock itself on your burst until you finish it.
+ - When a DMA start a read burst, it should avoid putting backpressure on the read responses. The reason is that the interconnect can lock itself on your burst until you finish it.


82 changes: 41 additions & 41 deletions source/VexiiRiscv/Performance/index.rst
@@ -38,67 +38,67 @@ Here are a few synthesis results :
rv32i_noBypass ->
- 0.78 Dhrystone/MHz 0.60 Coremark/MHz
- - Artix 7 -> 210 Mhz 1182 LUT 1759 FF
- - Cyclone V -> 159 Mhz 1,015 ALMs
- - Cyclone IV -> 130 Mhz 1,987 LUT 2,017 FF
- - Trion -> 94 Mhz LUT 1847 FF 1990
- - Titanium -> 320 Mhz LUT 2005 FF 2030
+ - Artix 7 -> 210 MHz 1182 LUT 1759 FF
+ - Cyclone V -> 159 MHz 1,015 ALMs
+ - Cyclone IV -> 130 MHz 1,987 LUT 2,017 FF
+ - Trion -> 94 MHz LUT 1847 FF 1990
+ - Titanium -> 320 MHz LUT 2005 FF 2030
rv32i ->
- 1.12 Dhrystone/MHz 0.87 Coremark/MHz
- - Artix 7 -> 206 Mhz 1413 LUT 1761 FF
- - Cyclone V -> 138 Mhz 1,244 ALMs
- - Cyclone IV -> 124 Mhz 2,188 LUT 2,019 FF
- - Trion -> 78 Mhz LUT 2252 FF 1962
- - Titanium -> 300 Mhz LUT 2347 FF 2000
+ - Artix 7 -> 206 MHz 1413 LUT 1761 FF
+ - Cyclone V -> 138 MHz 1,244 ALMs
+ - Cyclone IV -> 124 MHz 2,188 LUT 2,019 FF
+ - Trion -> 78 MHz LUT 2252 FF 1962
+ - Titanium -> 300 MHz LUT 2347 FF 2000
rv64i ->
- 1.18 Dhrystone/MHz 0.77 Coremark/MHz
- - Artix 7 -> 186 Mhz 2157 LUT 2332 FF
- - Cyclone V -> 117 Mhz 1,760 ALMs
- - Cyclone IV -> 113 Mhz 3,432 LUT 2,770 FF
- - Trion -> 83 Mhz LUT 3883 FF 2681
- - Titanium -> 278 Mhz LUT 3909 FF 2783
+ - Artix 7 -> 186 MHz 2157 LUT 2332 FF
+ - Cyclone V -> 117 MHz 1,760 ALMs
+ - Cyclone IV -> 113 MHz 3,432 LUT 2,770 FF
+ - Trion -> 83 MHz LUT 3883 FF 2681
+ - Titanium -> 278 MHz LUT 3909 FF 2783
rv32im ->
- 1.20 Dhrystone/MHz 2.70 Coremark/MHz
- - Artix 7 -> 190 Mhz 1815 LUT 2078 FF
- - Cyclone V -> 131 Mhz 1,474 ALMs
- - Cyclone IV -> 125 Mhz 2,781 LUT 2,266 FF
- - Trion -> 83 Mhz LUT 2643 FF 2209
- - Titanium -> 324 Mhz LUT 2685 FF 2279
+ - Artix 7 -> 190 MHz 1815 LUT 2078 FF
+ - Cyclone V -> 131 MHz 1,474 ALMs
+ - Cyclone IV -> 125 MHz 2,781 LUT 2,266 FF
+ - Trion -> 83 MHz LUT 2643 FF 2209
+ - Titanium -> 324 MHz LUT 2685 FF 2279
rv32im_branchPredict ->
- 1.45 Dhrystone/MHz 2.99 Coremark/MHz
- - Artix 7 -> 195 Mhz 2066 LUT 2438 FF
- - Cyclone V -> 136 Mhz 1,648 ALMs
- - Cyclone IV -> 117 Mhz 3,093 LUT 2,597 FF
- - Trion -> 86 Mhz LUT 2963 FF 2568
- - Titanium -> 327 Mhz LUT 3015 FF 2636
+ - Artix 7 -> 195 MHz 2066 LUT 2438 FF
+ - Cyclone V -> 136 MHz 1,648 ALMs
+ - Cyclone IV -> 117 MHz 3,093 LUT 2,597 FF
+ - Trion -> 86 MHz LUT 2963 FF 2568
+ - Titanium -> 327 MHz LUT 3015 FF 2636
rv32im_branchPredict_cached8k8k ->
- 1.45 Dhrystone/MHz 2.97 Coremark/MHz
- - Artix 7 -> 210 Mhz 2721 LUT 3477 FF
- - Cyclone V -> 137 Mhz 1,953 ALMs
- - Cyclone IV -> 127 Mhz 3,648 LUT 3,153 FF
- - Trion -> 93 Mhz LUT 3388 FF 3204
- - Titanium -> 314 Mhz LUT 3432 FF 3274
+ - Artix 7 -> 210 MHz 2721 LUT 3477 FF
+ - Cyclone V -> 137 MHz 1,953 ALMs
+ - Cyclone IV -> 127 MHz 3,648 LUT 3,153 FF
+ - Trion -> 93 MHz LUT 3388 FF 3204
+ - Titanium -> 314 MHz LUT 3432 FF 3274
rv32imasu_cached_branchPredict_cached8k8k_linux ->
- 1.45 Dhrystone/MHz 2.96 Coremark/MHz
- - Artix 7 -> 199 Mhz 3351 LUT 3833 FF
- - Cyclone V -> 131 Mhz 2,612 ALMs
- - Cyclone IV -> 109 Mhz 4,909 LUT 3,897 FF
- - Trion -> 73 Mhz LUT 4367 FF 3613
- - Titanium -> 270 Mhz LUT 4409 FF 3724
+ - Artix 7 -> 199 MHz 3351 LUT 3833 FF
+ - Cyclone V -> 131 MHz 2,612 ALMs
+ - Cyclone IV -> 109 MHz 4,909 LUT 3,897 FF
+ - Trion -> 73 MHz LUT 4367 FF 3613
+ - Titanium -> 270 MHz LUT 4409 FF 3724
rv32im_branchPredictStressed_cached8k8k_ipcMax_lateAlu ->
- 1.74 Dhrystone/MHz 3.41 Coremark/MHz
- - Artix 7 -> 140 Mhz 3247 LUT 3755 FF
- - Cyclone V -> 99 Mhz 2,477 ALMs
- - Cyclone IV -> 85 Mhz 4,835 LUT 3,765 FF
- - Trion -> 60 Mhz LUT 4438 FF 3832
- - Titanium -> 228 Mhz LUT 4459 FF 3963
+ - Artix 7 -> 140 MHz 3247 LUT 3755 FF
+ - Cyclone V -> 99 MHz 2,477 ALMs
+ - Cyclone IV -> 85 MHz 4,835 LUT 3,765 FF
+ - Trion -> 60 MHz LUT 4438 FF 3832
+ - Titanium -> 228 MHz LUT 4459 FF 3963
Tuning
@@ -128,7 +128,7 @@ On FPGA there is a few options which can be key in order to scale up the IPC whi


Critical paths tool
- --------------------------------
+ -------------------

At the end of your synthesis/place/route tools, you get a critical path report where, hopefully, the source and destination registers are well named.
The issue is that in between, all the combinatorial logic and signal names become unrecognizable or misleading most of the time.