Age | Commit message | Author | Files | Lines |
|
|
|
with these changes:
- clang: avx2 comparable to scalar
- gcc: avx2 still slower than scalar
bench results
-------------
gcc scalar:
> make clean all BACKEND=1 CC=gcc && ./bench
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=scalar num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,19.5,10.0,9.9,9.0,8.8
sha3_256,32,19.5,10.0,9.9,9.5,9.3
sha3_384,48,19.5,14.7,12.3,12.2,12.0
sha3_512,64,19.5,19.6,18.2,17.1,17.1
shake128,32,19.6,9.9,8.7,7.8,7.6
shake256,32,19.6,9.9,10.0,9.5,9.3
gcc avx2:
> make clean all BACKEND=6 CC=gcc && ./bench
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=avx2 num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,24.5,12.3,12.2,11.1,10.9
sha3_256,32,24.4,12.2,12.2,11.9,11.6
sha3_384,48,24.2,18.3,15.3,15.2,15.0
sha3_512,64,24.5,24.4,22.8,21.6,21.6
shake128,32,24.6,12.1,10.8,9.6,9.4
shake256,32,24.7,12.2,12.2,11.8,11.6
clang scalar:
> make clean all BACKEND=1 CC=clang && ./bench
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=scalar num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,21.8,9.9,9.7,8.8,8.7
sha3_256,32,21.1,9.9,9.8,9.4,9.2
sha3_384,48,21.1,14.6,12.1,12.0,11.8
sha3_512,64,21.2,19.2,17.9,16.9,16.9
shake128,32,21.0,9.9,8.6,7.7,7.5
shake256,32,20.9,9.9,9.8,9.5,9.2
clang avx2:
> make clean all BACKEND=6 CC=clang && ./bench
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=avx2 num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,19.9,10.0,9.9,9.0,8.9
sha3_256,32,19.9,10.0,9.9,9.6,9.4
sha3_384,48,20.1,14.9,12.4,12.3,12.2
sha3_512,64,19.9,19.6,18.4,17.4,17.4
shake128,32,19.9,10.0,8.8,7.9,7.7
shake256,32,20.0,10.0,9.9,9.6,9.4
|
shuffle_epi32()
|
bench results
-------------
before (gcc, avx2):
> make clean all BACKEND=6 CC=gcc && ./bench
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=avx2 num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,30.8,15.3,15.4,14.0,13.8
sha3_256,32,30.7,15.3,15.4,15.0,14.6
sha3_384,48,30.6,23.0,19.2,19.2,19.0
sha3_512,64,30.2,30.7,28.8,27.4,27.4
shake128,32,30.6,15.3,13.5,12.1,11.9
shake256,32,30.8,15.4,15.4,15.0,14.6
after (gcc, avx2):
> make clean all BACKEND=6 CC=gcc && ./bench
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=avx2 num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,29.4,14.8,14.8,13.5,13.2
sha3_256,32,29.4,14.8,14.8,14.4,14.0
sha3_384,48,29.3,22.1,18.5,18.5,18.3
sha3_512,64,29.4,29.4,27.6,26.3,26.3
shake128,32,29.4,14.8,13.0,11.6,11.4
shake256,32,29.5,14.8,14.9,14.4,14.1
|
|
remove unnecessary ands
|
|
|
|
scalar)
bench results
-------------
scalar (gcc):
> make clean all BACKEND=1 CC=gcc && ./bench
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=scalar num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,19.7,10.1,10.0,9.0,8.8
sha3_256,32,20.0,10.1,9.9,9.6,9.3
sha3_384,48,20.0,14.9,12.3,12.2,12.0
sha3_512,64,20.0,19.5,18.3,17.2,17.1
shake128,32,20.2,10.1,8.7,7.8,7.6
shake256,32,20.3,10.1,10.0,9.6,9.3
scalar (clang):
> make clean all BACKEND=1 CC=clang && ./bench
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=scalar num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,19.7,9.9,9.8,8.8,8.7
sha3_256,32,19.6,9.8,9.8,9.4,9.2
sha3_384,48,19.8,14.6,12.1,12.0,11.8
sha3_512,64,19.6,19.3,17.9,16.9,16.9
shake128,32,19.7,9.9,8.6,7.7,7.5
shake256,32,19.7,9.9,9.8,9.4,9.2
avx2 (gcc):
> make clean all BACKEND=6 CC=gcc && ./bench
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=avx2 num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,29.7,14.9,15.0,13.6,13.4
sha3_256,32,30.0,14.9,15.0,14.6,14.2
sha3_384,48,29.8,22.4,18.7,18.7,18.5
sha3_512,64,29.8,29.8,28.0,26.6,26.6
shake128,32,29.9,15.0,13.2,11.8,11.6
shake256,32,30.0,14.9,15.0,14.6,14.2
avx2 (clang):
> make clean all BACKEND=6 CC=clang && ./bench
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=avx2 num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,23.8,11.6,11.3,10.3,10.1
sha3_256,32,23.7,11.6,11.3,10.9,10.7
sha3_384,48,23.8,17.2,14.1,14.1,13.9
sha3_512,64,23.8,22.8,21.1,20.0,20.0
shake128,32,22.8,11.2,9.8,8.8,8.6
shake256,32,22.9,11.2,11.1,10.8,10.5
chi appears to be the culprit now; look at replacing mask/and/or with blend
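For reference, a minimal scalar sketch of what chi computes, plus the bit selection that a mask/and/or sequence performs. It assumes a flat 25-lane state indexed as x + 5*y. The names `chi_scalar` and `select_bits` are hypothetical and this is not the avx2 backend's code. A vector blend does the same kind of selection at element granularity in one instruction, which is presumably the appeal here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// What a mask/and/or sequence computes per bit: take a where m is set,
// otherwise take b. A blend performs this kind of selection directly.
static uint64_t select_bits(uint64_t a, uint64_t b, uint64_t m) {
  return (a & m) | (b & ~m);
}

// Scalar chi step: a[x,y] ^= ~a[x+1,y] & a[x+2,y] for every row y.
// Assumes a flat 25-lane state indexed as x + 5*y (hypothetical layout).
static void chi_scalar(uint64_t a[25]) {
  assert(a != NULL);
  for (int y = 0; y < 5; y++) {
    // copy the row first so every output lane reads the original inputs
    uint64_t row[5];
    for (int x = 0; x < 5; x++) row[x] = a[x + 5 * y];
    for (int x = 0; x < 5; x++) {
      a[x + 5 * y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5]);
    }
  }
}
```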
|
|
|
|
|
|
bench results (oof)
-------------------
scalar (gcc):
> make clean all BACKEND=1 && ./bench
...
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=scalar num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,19.5,10.0,10.0,8.9,8.8
sha3_256,32,19.5,10.0,9.9,9.5,9.3
sha3_384,48,19.5,14.7,12.3,12.2,12.0
sha3_512,64,19.5,19.4,18.2,17.2,17.1
shake128,32,19.7,9.9,8.7,7.8,7.6
shake256,32,20.2,10.1,10.0,9.6,9.3
scalar (clang):
> make clean all BACKEND=1 CC=clang && ./bench
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=scalar num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,19.7,9.9,9.8,8.8,8.7
sha3_256,32,19.7,9.9,9.8,9.4,9.2
sha3_384,48,19.7,14.6,12.1,12.0,11.8
sha3_512,64,19.7,19.3,17.8,16.9,16.9
shake128,32,19.7,9.8,8.6,7.7,7.5
shake256,32,19.7,9.9,9.8,9.4,9.2
avx2 (gcc):
> make clean all BACKEND=6 && ./bench
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=avx2 num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,80.0,40.5,40.9,37.2,36.6
sha3_256,32,79.8,40.5,40.9,39.8,38.9
sha3_384,48,80.1,61.0,51.2,51.3,50.7
sha3_512,64,79.9,81.5,76.7,73.0,73.1
shake128,32,86.0,43.6,38.6,34.5,33.9
shake256,32,86.0,43.6,44.1,42.8,41.9
avx2 (clang):
> make clean all BACKEND=6 CC=clang && ./bench
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=avx2 num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,62.0,31.6,31.9,29.0,28.5
sha3_256,32,61.9,31.6,32.0,31.1,30.3
sha3_384,48,62.6,47.7,40.1,40.2,39.8
sha3_512,64,62.0,63.8,60.2,57.2,57.3
shake128,32,62.5,31.7,27.9,24.9,24.5
shake256,32,62.0,31.5,31.9,31.0,30.3
the culprit is the use of gather instructions for the lane moves in the
pi step, so those will need to be rewritten as permutes:
> perf annotate -Mintel --stdio --quiet permute_n_avx2 | grep gatherqq
24.09 : 207e: vpgatherqq ymm0,QWORD PTR [rdi+ymm3*8],ymm6
14.07 : 20a0: vpgatherqq ymm4,QWORD PTR [rdi+ymm3*8],ymm6
13.36 : 20aa: vpgatherqq ymm8,QWORD PTR [rdi+ymm7*8],ymm6
13.50 : 20c4: vpgatherqq ymm3,QWORD PTR [rdi+ymm7*8],ymm6
13.61 : 20ce: vpgatherqq ymm7,QWORD PTR [rdi+ymm1*8],ymm6
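As a reminder of why permutes are enough here: pi moves every lane to a fixed destination, so all of the indices are known at compile time and nothing needs to be looked up at run time. A minimal scalar sketch, assuming a flat 25-lane state indexed as x + 5*y. The name `pi_scalar` is hypothetical and this is not the avx2 backend's code.

```c
#include <assert.h>
#include <stdint.h>

// Scalar pi step: lane (x, y) moves to (y, (2x + 3y) mod 5).
// Assumes a flat 25-lane state indexed as x + 5*y (hypothetical layout).
// Not in-place: src and dst must be different arrays.
static void pi_scalar(const uint64_t src[25], uint64_t dst[25]) {
  assert(src != dst);
  for (int y = 0; y < 5; y++) {
    for (int x = 0; x < 5; x++) {
      const int nx = y;
      const int ny = (2 * x + 3 * y) % 5;
      dst[nx + 5 * ny] = src[x + 5 * y];
    }
  }
}
```

Because the mapping is a fixed bijection on the 25 lanes, a vector implementation can in principle express it with constant-index shuffles, permutes, and blends.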
|
bench results (samish across the board)
---------------------------------------
x86-64-avx512-gcc before:
> make clean all BACKEND=2 && ./bench 10000
...
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=avx512 num_trials=10000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,15.4,7.8,7.8,7.1,7.0
sha3_256,32,15.4,7.8,7.8,7.6,7.4
sha3_384,48,15.4,11.6,9.8,9.8,9.7
sha3_512,64,15.5,15.5,14.5,13.8,13.8
shake128,32,15.6,7.8,6.9,6.2,6.1
shake256,32,15.6,7.8,7.9,7.6,7.4
x86-64-avx512-clang before:
> make clean all CC=clang BACKEND=2 && ./bench 10000
...
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=avx512 num_trials=10000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,21.6,10.8,10.7,9.6,9.5
sha3_256,32,21.6,10.8,10.7,10.3,10.1
sha3_384,48,21.5,16.0,13.3,13.2,13.1
sha3_512,64,21.5,21.2,19.8,18.8,18.7
shake128,32,21.8,10.8,9.4,8.4,8.2
shake256,32,21.6,10.8,10.7,10.3,10.1
x86-64-avx512-gcc after:
> make clean all BACKEND=2 && ./bench 10000
...
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=avx512 num_trials=10000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,15.4,7.8,7.8,7.1,7.0
sha3_256,32,15.4,7.8,7.8,7.6,7.4
sha3_384,48,15.4,11.6,9.8,9.8,9.7
sha3_512,64,15.5,15.5,14.6,13.8,13.8
shake128,32,15.5,7.8,6.9,6.2,6.1
shake256,32,15.5,7.8,7.9,7.6,7.4
x86-64-avx512-clang after:
> make clean all CC=clang BACKEND=2 && ./bench 10000
...
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=avx512 num_trials=10000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,21.5,10.8,10.7,9.6,9.5
sha3_256,32,21.5,10.8,10.7,10.3,10.0
sha3_384,48,21.6,16.0,13.3,13.2,13.1
sha3_512,64,21.5,21.2,19.8,18.8,18.7
shake128,32,21.7,10.8,9.3,8.3,8.2
shake256,32,21.7,10.8,10.7,10.3,10.1
|
|
bench results (samish across the board)
---------------------------------------
x86-64-scalar-gcc before:
> make clean all BACKEND=1 && perf record ./bench 10000
...
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=scalar num_trials=10000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,19.5,10.0,10.0,9.0,8.8
sha3_256,32,19.5,10.0,9.9,9.5,9.3
sha3_384,48,20.1,14.7,12.3,12.2,12.0
sha3_512,64,19.5,19.5,18.2,17.2,17.1
shake128,32,19.6,10.0,8.7,7.8,7.6
shake256,32,19.6,10.1,10.0,9.6,9.3
x86-64-scalar-clang before:
> make clean all CC=clang BACKEND=1 && perf record ./bench 10000
...
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=scalar num_trials=10000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,19.7,9.9,9.7,8.8,8.6
sha3_256,32,19.7,9.8,9.7,9.4,9.1
sha3_384,48,19.7,14.5,12.0,11.9,11.8
sha3_512,64,19.7,19.1,17.7,16.8,16.8
shake128,32,20.4,10.0,8.7,7.7,7.5
shake256,32,20.3,10.0,9.8,9.4,9.1
x86-64-scalar-gcc after:
> make clean all BACKEND=1 && perf record ./bench 10000
...
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=scalar num_trials=10000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,19.5,10.0,9.9,9.0,8.8
sha3_256,32,19.5,10.0,9.9,9.5,9.3
sha3_384,48,19.5,14.7,12.3,12.2,12.0
sha3_512,64,19.7,19.5,18.2,17.2,17.1
shake128,32,19.6,10.0,8.7,7.8,7.6
shake256,32,19.6,10.1,10.0,9.6,9.3
x86-64-scalar-clang after:
> make clean all CC=clang BACKEND=1 && perf record ./bench 10000
...
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=scalar num_trials=10000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,19.7,9.9,9.8,8.8,8.7
sha3_256,32,19.6,9.9,9.8,9.4,9.2
sha3_384,48,19.7,14.6,12.1,12.0,11.9
sha3_512,64,19.7,19.3,17.9,16.9,16.9
shake128,32,19.7,9.9,8.7,7.7,7.5
shake256,32,19.7,9.9,9.8,9.5,9.2
a76-scalar-gcc before:
> make clean all BACKEND=1 && ./bench 5000
info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
info: backend=scalar num_trials=5000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,20.2,10.1,10.2,9.3,9.2
sha3_256,32,20.2,10.3,10.3,10.0,9.7
sha3_384,48,20.9,15.3,12.8,12.7,12.5
sha3_512,64,20.9,20.3,18.9,18.0,17.9
shake128,32,20.2,10.3,9.0,8.1,7.9
shake256,32,20.2,10.1,10.3,9.9,9.7
a76-scalar-clang before:
> make clean all CC=clang BACKEND=1 && ./bench 5000
info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
info: backend=scalar num_trials=5000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,18.8,9.6,9.5,8.7,8.5
sha3_256,32,18.8,9.6,9.5,9.2,9.0
sha3_384,48,18.8,14.1,11.8,11.8,11.6
sha3_512,64,18.8,18.6,17.5,16.6,16.6
shake128,32,18.8,9.6,8.4,7.5,7.4
shake256,32,18.8,9.6,9.6,9.2,9.0
a76-scalar-gcc after:
> make clean all BACKEND=1 && ./bench 5000
info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
info: backend=scalar num_trials=5000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,20.2,10.1,10.3,9.3,9.2
sha3_256,32,20.2,10.3,10.3,10.0,9.7
sha3_384,48,20.9,15.3,12.8,12.7,12.5
sha3_512,64,20.9,20.3,18.9,18.0,17.9
shake128,32,20.2,10.3,9.0,8.1,7.9
shake256,32,20.2,10.3,10.3,9.9,9.7
a76-scalar-clang after:
> make clean all CC=clang BACKEND=1 && ./bench 5000
info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
info: backend=scalar num_trials=5000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,18.8,9.4,9.3,8.5,8.3
sha3_256,32,18.8,9.4,9.3,9.0,8.8
sha3_384,48,18.8,13.9,11.6,11.6,11.4
sha3_512,64,18.8,18.3,17.1,16.3,16.3
shake128,32,18.8,9.4,8.3,7.4,7.2
shake256,32,18.8,9.4,9.4,9.1,8.9
|
|
target as phony
|
with vsri(vshlq) and vsri(vshl), respectively
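The start of this message is cut off, but the vsri(vshl) pairing looks like the usual NEON rotate idiom: shift left, then use SRI (shift right and insert) to fill the vacated low bits. A scalar model of that idiom on a single 64-bit lane, as a sketch only. `sri64` and `rol64_sri` are hypothetical names, not functions from the neon backend.

```c
#include <assert.h>
#include <stdint.h>

// Scalar model of SRI on one 64-bit lane: keep the top n bits of a and
// fill the remaining low bits with b >> n. Modeled here for 1 <= n <= 63.
static uint64_t sri64(uint64_t a, uint64_t b, unsigned n) {
  assert(n >= 1 && n <= 63);
  const uint64_t keep = ~(UINT64_MAX >> n); // mask of the top n bits
  return (a & keep) | (b >> n);
}

// Rotate left by n (1 <= n <= 63) written as sri(x << n, x, 64 - n).
static uint64_t rol64_sri(uint64_t x, unsigned n) {
  assert(n >= 1 && n <= 63);
  return sri64(x << n, x, 64 - n);
}
```

The point of the idiom is that the insert replaces the separate OR that a shift-left/shift-right/or rotate needs.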
|
results, show results for multiple backends on all test systems
|
|
|
|
- switch row_eor() from macro to static inline function
- compress rho rotate values from 15 128-bit registers into two to
reduce register pressure (still spilling, though)
- remove PERMUTE macro
- switch from unrolled loop with macro in body of permute_n_neon() to
regular loop
- add documentation for register/lane layout and for compressed rho
rotations
with these changes the neon backend still uses ~50% more cycles than
the scalar backend, so i will probably leave it disabled for the initial
release.
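One plausible way to fit the rho rotate values into two 128-bit registers is to store them as one byte per lane: 25 bytes, padded to 32. The sketch below shows that packing with a scalar rho step. It assumes a flat state indexed as x + 5*y, and `RHO_BYTES`, `rol64`, and `rho_scalar` are hypothetical names. The actual compressed layout in the neon backend may differ.

```c
#include <assert.h>
#include <stdint.h>

// Keccak rho rotation amounts for a flat state indexed as x + 5*y,
// one byte per lane. 25 bytes padded to 32 (two 128-bit registers' worth).
static const uint8_t RHO_BYTES[32] = {
   0,  1, 62, 28, 27,
  36, 44,  6, 55, 20,
   3, 10, 43, 25, 39,
  41, 45, 15, 21,  8,
  18,  2, 61, 56, 14,
  // remaining 7 bytes are padding and are implicitly zero
};

static_assert(sizeof(RHO_BYTES) == 32, "must fit two 128-bit registers");

// Rotate left by n bits (n is reduced mod 64; a rotate by 0 is a no-op).
static uint64_t rol64(uint64_t v, unsigned n) {
  n &= 63;
  return n ? (v << n) | (v >> (64 - n)) : v;
}

// Scalar rho step: rotate each lane by its packed amount.
static void rho_scalar(uint64_t a[25]) {
  for (int i = 0; i < 25; i++) a[i] = rol64(a[i], RHO_BYTES[i]);
}
```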
scalar (pi5):
> ./bench 2000
info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
info: backend=scalar num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,20.2,10.3,10.3,9.3,9.2
sha3_256,32,20.2,10.3,10.3,9.9,9.7
sha3_384,48,20.9,15.3,12.8,12.7,12.5
sha3_512,64,20.2,20.2,18.9,25.3,17.9
shake128,32,20.2,10.1,9.0,8.1,7.9
shake256,32,20.2,10.3,10.3,9.9,9.7
neon backend bench results (pi5):
> ./bench 2000
info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
info: backend=neon num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,32.7,16.3,16.4,14.9,14.6
sha3_256,32,32.0,16.2,16.4,15.9,15.5
sha3_384,48,32.7,24.2,20.4,20.2,20.0
sha3_512,64,32.0,32.2,30.1,28.6,28.5
shake128,32,32.7,16.2,14.2,12.8,12.5
shake256,32,32.7,16.2,16.3,15.7,15.4
|
|
|