This repository has been archived by the owner on Feb 14, 2023. It is now read-only.

add __ARM_NEON support #157

Open
wants to merge 7 commits into
base: master
@m6w6 m6w6 commented Aug 22, 2022

  • convert literally translated instructions to idiomatic ones
  • produce output bit-identical to that of lepton-scalar
  • test_suite PASS
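
To illustrate what "literally translated" versus "idiomatic" can mean for NEON code, here is a minimal sketch that sums the four lanes of a 32-bit vector in two ways. It is not taken from this PR: the function names `hsum_literal` and `hsum_idiomatic` are hypothetical, and the shuffle-and-add pattern is only an assumed example of an SSE-style sequence. On targets without AArch64 NEON, both functions fall back to plain scalar code so the sketch still compiles and runs.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical example: mirrors an SSE-style shuffle-and-add sequence step by step.
inline int32_t hsum_literal(const int32_t* v) {
    assert(v != nullptr);
#if defined(__ARM_NEON) && defined(__aarch64__)
    int32x4_t x = vld1q_s32(v);
    // Rotate by two lanes and add: lanes become {0+2, 1+3, 2+0, 3+1}.
    int32x4_t s = vaddq_s32(x, vextq_s32(x, x, 2));
    // Rotate by one lane and add: every lane now holds the full sum.
    s = vaddq_s32(s, vextq_s32(s, s, 1));
    return vgetq_lane_s32(s, 0);
#else
    // Scalar fallback for targets without AArch64 NEON.
    return v[0] + v[1] + v[2] + v[3];
#endif
}

// Hypothetical example: uses the AArch64 across-vector add directly.
inline int32_t hsum_idiomatic(const int32_t* v) {
    assert(v != nullptr);
#if defined(__ARM_NEON) && defined(__aarch64__)
    return vaddvq_s32(vld1q_s32(v));
#else
    // Scalar fallback for targets without AArch64 NEON.
    return v[0] + v[1] + v[2] + v[3];
#endif
}
```

Both variants should return the same value for the same input, which is the kind of property the "bit-identical output" item above asks of the NEON build relative to lepton-scalar.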

@CLAassistant

CLAassistant commented Aug 22, 2022

CLA assistant check
All committers have signed the CLA.

@m6w6
Author

m6w6 commented Aug 22, 2022

The performance gain is not huge yet, because most of the instructions are only literally translated so far and have not been iterated upon.

lepton{,-scalar} -benchmark results on an M1 mini (2020):

lepton-scalar                                                                   lepton (neon)                                                            
BENCHMARK: 16 trials                                                            BENCHMARK: 16 trials                                                            
 217.37ms ( 90.87Mbit/s) : Verified encode                                       212.32ms ( 93.04Mbit/s) : Verified encode                                      
 156.15ms (126.50Mbit/s) : Unverified encode                                     156.51ms (126.21Mbit/s) : Unverified encode                                    
  63.80ms (309.60Mbit/s) : decode                                                 61.57ms (320.82Mbit/s) : decode                                               
 609.13ms ( 32.43Mbit/s) : Single threaded Verified encode                       600.13ms ( 32.92Mbit/s) : Single threaded Verified encode                      
 328.37ms ( 60.15Mbit/s) : Single threaded Unverified encode                     324.12ms ( 60.94Mbit/s) : Single threaded Unverified encode                    
 283.65ms ( 69.64Mbit/s) : Single threaded decode                                277.37ms ( 71.22Mbit/s) : Single threaded decode                               
 304.66ms ( 64.84Mbit/s) : Loaded 2 Verified encode                              292.47ms ( 67.54Mbit/s) : Loaded 2 Verified encode                             
 203.42ms ( 97.10Mbit/s) : Loaded 2 Unverified encode                            196.53ms (100.51Mbit/s) : Loaded 2 Unverified encode                           
 107.45ms (183.84Mbit/s) : Loaded 2 decode                                       110.84ms (178.21Mbit/s) : Loaded 2 decode                                      
 491.56ms ( 40.18Mbit/s) : Loaded 4 Verified encode                              488.95ms ( 40.40Mbit/s) : Loaded 4 Verified encode                             
 288.03ms ( 68.58Mbit/s) : Loaded 4 Unverified encode                            286.09ms ( 69.04Mbit/s) : Loaded 4 Unverified encode                           
 223.50ms ( 88.38Mbit/s) : Loaded 4 decode                                       213.59ms ( 92.48Mbit/s) : Loaded 4 decode                                      
 669.68ms ( 29.50Mbit/s) : Loaded 6 Verified encode                              690.89ms ( 28.59Mbit/s) : Loaded 6 Verified encode                             
 394.21ms ( 50.11Mbit/s) : Loaded 6 Unverified encode                            393.49ms ( 50.20Mbit/s) : Loaded 6 Unverified encode                           
 322.71ms ( 61.21Mbit/s) : Loaded 6 decode                                       313.99ms ( 62.91Mbit/s) : Loaded 6 decode                                      
 935.54ms ( 21.11Mbit/s) : Loaded 8 Verified encode                              890.26ms ( 22.19Mbit/s) : Loaded 8 Verified encode                             
 498.29ms ( 39.64Mbit/s) : Loaded 8 Unverified encode                            498.52ms ( 39.62Mbit/s) : Loaded 8 Unverified encode                           
 445.81ms ( 44.31Mbit/s) : Loaded 8 decode                                       417.14ms ( 47.35Mbit/s) : Loaded 8 decode                                      
1425.39ms ( 13.86Mbit/s) : Loaded 12 Verified encode                            1377.82ms ( 14.34Mbit/s) : Loaded 12 Verified encode                            
 770.91ms ( 25.62Mbit/s) : Loaded 12 Unverified encode                           767.00ms ( 25.75Mbit/s) : Loaded 12 Unverified encode                          
 663.58ms ( 29.77Mbit/s) : Loaded 12 decode                                      638.00ms ( 30.96Mbit/s) : Loaded 12 decode                                     
1853.21ms ( 10.66Mbit/s) : Loaded 16 Verified encode                            2211.84ms (  8.93Mbit/s) : Loaded 16 Verified encode                            
1042.54ms ( 18.95Mbit/s) : Loaded 16 Unverified encode                          1036.09ms ( 19.07Mbit/s) : Loaded 16 Unverified encode                          
 918.81ms ( 21.50Mbit/s) : Loaded 16 decode                                      921.06ms ( 21.45Mbit/s) : Loaded 16 decode                                     
Backfill verified encode bandwidth 162.22 Mbit/s [12 threads]                   Backfill verified encode bandwidth 170.17 Mbit/s [8 threads]                    
Backfill unverified encode bandwidth 301.64 Mbit/s [8 threads]                  Backfill unverified encode bandwidth 315.18 Mbit/s [8 threads]                  
Backfill decode bandwidth 353.71 Mbit/s [6 threads]                             Backfill decode bandwidth 371.61 Mbit/s [6 threads]  

I also noticed that some "legacy" tests are failing, thus the "WIP/Draft" status of this PR.

EDIT: typos; resolved

@m6w6
Author

m6w6 commented Aug 23, 2022

Also, while hq.jpg, for example, is restored 100%, its .lep file differs vastly from the one produced by lepton-scalar and is about 1k bigger.

EDIT: resolved

@m6w6
Author

m6w6 commented Aug 24, 2022

Benchmark on C6g.4xlarge with clang-10 and ARCH_FLAGS=-mcpu=neoverse-n1

lepton-scalar                                                           lepton (neon)
BENCHMARK: 16 trials                                                    BENCHMARK: 16 trials
 333.65ms ( 59.20Mbit/s) : Verified encode                               329.43ms ( 59.96Mbit/s) : Verified encode
 258.18ms ( 76.51Mbit/s) : Unverified encode                             256.96ms ( 76.87Mbit/s) : Unverified encode
  75.78ms (260.66Mbit/s) : decode                                         72.83ms (271.24Mbit/s) : decode
1118.17ms ( 17.67Mbit/s) : Single threaded Verified encode              1072.64ms ( 18.42Mbit/s) : Single threaded Verified encode
 630.89ms ( 31.31Mbit/s) : Single threaded Unverified encode             611.47ms ( 32.30Mbit/s) : Single threaded Unverified encode
 488.95ms ( 40.40Mbit/s) : Single threaded decode                        459.32ms ( 43.01Mbit/s) : Single threaded decode
 340.60ms ( 58.00Mbit/s) : Loaded 2 Verified encode                      336.53ms ( 58.70Mbit/s) : Loaded 2 Verified encode
 261.09ms ( 75.66Mbit/s) : Loaded 2 Unverified encode                    261.56ms ( 75.52Mbit/s) : Loaded 2 Unverified encode
  78.10ms (252.91Mbit/s) : Loaded 2 decode                                78.74ms (250.88Mbit/s) : Loaded 2 decode
 435.00ms ( 45.41Mbit/s) : Loaded 4 Verified encode                      417.49ms ( 47.31Mbit/s) : Loaded 4 Verified encode
 311.89ms ( 63.33Mbit/s) : Loaded 4 Unverified encode                    308.04ms ( 64.13Mbit/s) : Loaded 4 Unverified encode
 131.23ms (150.52Mbit/s) : Loaded 4 decode                               131.77ms (149.91Mbit/s) : Loaded 4 decode
 560.87ms ( 35.22Mbit/s) : Loaded 6 Verified encode                      539.63ms ( 36.60Mbit/s) : Loaded 6 Verified encode
 366.90ms ( 53.84Mbit/s) : Loaded 6 Unverified encode                    357.37ms ( 55.27Mbit/s) : Loaded 6 Unverified encode
 220.63ms ( 89.53Mbit/s) : Loaded 6 decode                               214.97ms ( 91.89Mbit/s) : Loaded 6 decode
 748.58ms ( 26.39Mbit/s) : Loaded 8 Verified encode                      725.59ms ( 27.22Mbit/s) : Loaded 8 Verified encode
 455.86ms ( 43.33Mbit/s) : Loaded 8 Unverified encode                    449.56ms ( 43.94Mbit/s) : Loaded 8 Unverified encode
 269.47ms ( 73.30Mbit/s) : Loaded 8 decode                               270.47ms ( 73.03Mbit/s) : Loaded 8 decode
 947.95ms ( 20.84Mbit/s) : Loaded 12 Verified encode                     951.36ms ( 20.76Mbit/s) : Loaded 12 Verified encode
 547.49ms ( 36.08Mbit/s) : Loaded 12 Unverified encode                   529.14ms ( 37.33Mbit/s) : Loaded 12 Unverified encode
 347.81ms ( 56.79Mbit/s) : Loaded 12 decode                              357.71ms ( 55.22Mbit/s) : Loaded 12 decode
1191.98ms ( 16.57Mbit/s) : Loaded 16 Verified encode                    1135.81ms ( 17.39Mbit/s) : Loaded 16 Verified encode
 648.67ms ( 30.45Mbit/s) : Loaded 16 Unverified encode                   654.03ms ( 30.20Mbit/s) : Loaded 16 Unverified encode
 527.66ms ( 37.44Mbit/s) : Loaded 16 decode                              496.93ms ( 39.75Mbit/s) : Loaded 16 decode
Backfill verified encode bandwidth 262.20 Mbit/s [16 threads]           Backfill verified encode bandwidth 272.78 Mbit/s [16 threads]
Backfill unverified encode bandwidth 474.70 Mbit/s [16 threads]         Backfill unverified encode bandwidth 490.32 Mbit/s [16 threads]
Backfill decode bandwidth 597.99 Mbit/s [12 threads]                    Backfill decode bandwidth 633.80 Mbit/s [12 threads]

@m6w6 m6w6 marked this pull request as ready for review August 30, 2022 09:08
@AGSaidi

AGSaidi commented Oct 7, 2022

Please use an ‘isb’, not a ‘dmb’. See this commit as an example: haproxy/haproxy@1e237d0
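
For readers unfamiliar with the distinction, the following is a minimal sketch of a spin-wait hint that uses ‘isb’ on AArch64. It assumes the instruction in question serves as the Arm stand-in for an x86 `pause`-style hint inside a busy-wait loop, which this conversation does not spell out. The names `cpu_relax` and `spin_until_set` are hypothetical and do not come from the PR or the linked commit. The usual reasoning, as far as I understand it, is that ‘dmb’ is a memory barrier that orders memory accesses, while ‘isb’ briefly stalls the pipeline and so behaves more like a short pause.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

#if defined(__x86_64__)
#include <immintrin.h>
#endif

// Hypothetical helper: a short CPU hint for use inside busy-wait loops.
inline void cpu_relax() {
#if defined(__aarch64__)
    // Instruction synchronization barrier, used here as a brief pause.
    __asm__ __volatile__("isb" ::: "memory");
#elif defined(__x86_64__)
    _mm_pause();
#else
    // Portable fallback: compiler-only barrier, no hardware hint.
    std::atomic_signal_fence(std::memory_order_seq_cst);
#endif
}

// Hypothetical helper: poll `flag` up to `max_spins` times, relaxing between polls.
// Returns true if the flag was observed set.
inline bool spin_until_set(const std::atomic<bool>& flag, std::uint64_t max_spins) {
    assert(max_spins > 0);
    for (std::uint64_t i = 0; i < max_spins; ++i) {
        if (flag.load(std::memory_order_acquire)) {
            return true;
        }
        cpu_relax();
    }
    return flag.load(std::memory_order_acquire);
}
```

The bounded loop is a design choice for the sketch only: it keeps the example from hanging when nothing ever sets the flag.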

@AGSaidi AGSaidi left a comment

Thanks for making the isb change, lgtm.

@sebpop
Contributor

sebpop commented Oct 10, 2022

Looks good to me.

4 participants