Age | Commit message | Author | Files | Lines
2024-05-09 | sha3.c: add internal documentation, s/absorb12/absorb_12/ | Paul Duncan | 1 | -39/+227
2024-05-08 | sha3.c: update internal documentation | Paul Duncan | 1 | -16/+50
2024-05-08 | rand-bytes.h: use getentropy() instead of getrandom() to support macos | Paul Duncan | 1 | -8/+24
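The rand-bytes.h entry above swaps getrandom() for getentropy() so the code also builds on macOS. A minimal sketch of what such a wrapper might look like follows. The helper name `fill_random()` and the chunking loop are assumptions for illustration, not code from rand-bytes.h. The chunking is there because getentropy() fails for requests larger than 256 bytes.

```c
#define _DEFAULT_SOURCE // expose getentropy() in glibc's <unistd.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#ifdef __APPLE__
#include <sys/random.h> // macOS declares getentropy() here
#endif

// Hypothetical helper (not taken from rand-bytes.h): fill buf with len
// random bytes. getentropy() rejects requests larger than 256 bytes,
// so larger buffers are filled in chunks.
// Returns 0 on success, -1 on failure.
static int fill_random(void *buf, size_t len) {
  assert(buf != NULL || len == 0);
  uint8_t *p = buf;

  while (len > 0) {
    const size_t n = (len < 256) ? len : 256; // size of this chunk
    if (getentropy(p, n) != 0) {
      return -1;
    }
    p += n;
    len -= n;
  }

  return 0;
}
```

A caller would check the return value, for example `if (fill_random(buf, sizeof(buf)) != 0) { /* handle error */ }`.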
2024-05-08 | sha3.c, Makefile: s/SHA3_TEST/TEST_SHA3/ | Paul Duncan | 2 | -5/+5
2024-05-08 | .gitignore: add tests/neon/neon | Paul Duncan | 1 | -0/+1
2024-05-08 | s/permute/permute_24/, s/permute12/permute_12/, misc comment cleanup | Paul Duncan | 1 | -99/+89
2024-05-08 | sha3.c: rename hybrid-neon to hybrid, fix comments | Paul Duncan | 1 | -30/+26
2024-05-08 | sha3.c: hybrid: do not mix neon and non-neon instructions (still slow) | Paul Duncan | 1 | -15/+26
2024-05-08 | s/SHA3_BACKEND/BACKEND/g | Paul Duncan | 5 | -66/+66
2024-05-08 | sha3.c: s/call permute_n_.*(/call permute_n(/ in test comments | Paul Duncan | 1 | -4/+4
2024-05-08 | sha3.c: add hybrid-neon backend (slow) | Paul Duncan | 1 | -1/+202
2024-05-08 | sha3.c: diet-neon: misc fixes. still too slow | Paul Duncan | 1 | -46/+36
2024-05-08 | sha3.c: neon, diet-neon: replace vorrq(vshlq, vshrq) and vorr(vshl, vshr) with vsri(vshlq) and vsri(vshl), respectively | Paul Duncan | 1 | -3/+4
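The entry above replaces a shift/shift/or rotate with a shift followed by a shift-right-and-insert. The sketch below models a single 64-bit lane in plain C so it runs on any machine. The function names are hypothetical, and the model only covers shift amounts from 1 to 63. On NEON hardware the second form would correspond to something like `vsriq_n_u64(vshlq_n_u64(a, n), a, 64 - n)`.

```c
#include <assert.h>
#include <stdint.h>

// Scalar model of one 64-bit lane of the NEON SRI (shift right and
// insert) instruction: shift src right by s bits and insert it into
// dst, keeping the top s bits of dst.
static uint64_t sri_lane(uint64_t dst, uint64_t src, unsigned s) {
  assert(s >= 1 && s <= 63); // this model only covers rotate-style shifts
  const uint64_t keep = ~(UINT64_MAX >> s); // top s bits of dst survive
  return (dst & keep) | (src >> s);
}

// rotate left built from shl + shr + orr (three operations)
static uint64_t rol_orr(uint64_t a, unsigned n) {
  assert(n >= 1 && n <= 63);
  return (a << n) | (a >> (64 - n));
}

// same rotate built from shl + sri (two operations)
static uint64_t rol_sri(uint64_t a, unsigned n) {
  assert(n >= 1 && n <= 63);
  return sri_lane(a << n, a, 64 - n);
}
```

Both forms produce the same rotate. The second one needs one fewer instruction per rotate, which is presumably the motivation for the change.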
2024-05-08 | sha3.c: move INTERNAL before includes | Paul Duncan | 1 | -2/+2
2024-05-08 | sha3.h: fix typo in sha3_backend() documentation | Paul Duncan | 1 | -1/+1
2024-05-08 | README.md: add median link | Paul Duncan | 1 | -0/+2
2024-05-08 | sha3.c: prefer scalar backend to neon for now (tag: v0.7) | Paul Duncan | 1 | -2/+2
2024-05-08 | README.md: add Backends and Benchmarks sections | Paul Duncan | 1 | -1/+71
2024-05-08 | tests/bench/README.md: fix cpb links, update default trial count, add pi5 results, show results for multiple backends on all test systems | Paul Duncan | 1 | -21/+103
2024-05-08 | tests/bench/bench.c: reduce default number of trials from 100k to 2k | Paul Duncan | 1 | -1/+1
2024-05-08 | sha3.c: neon: refactor, add documentation | Paul Duncan | 1 | -115/+157
    - switch row_eor() from macro to static inline function
    - compress rho rotate values from 15 128-bit registers into two to reduce register pressure (still spilling, though)
    - remove PERMUTE macro
    - switch from unrolled loop with macro in body of permute_n_neon() to regular loop
    - add documentation for register/lane layout and for compressed rho rotations

    with these changes the neon backend still uses ~50% more cycles than the scalar backend, so i will probably leave it disabled for the initial release.

    scalar (pi5):

        > ./bench 2000
        info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
        info: backend=scalar num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
        function,dst_len,64,256,1024,4096,16384
        sha3_224,28,20.2,10.3,10.3,9.3,9.2
        sha3_256,32,20.2,10.3,10.3,9.9,9.7
        sha3_384,48,20.9,15.3,12.8,12.7,12.5
        sha3_512,64,20.2,20.2,18.9,25.3,17.9
        shake128,32,20.2,10.1,9.0,8.1,7.9
        shake256,32,20.2,10.3,10.3,9.9,9.7

    neon backend bench results (pi5):

        > ./bench 2000
        info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
        info: backend=neon num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
        function,dst_len,64,256,1024,4096,16384
        sha3_224,28,32.7,16.3,16.4,14.9,14.6
        sha3_256,32,32.0,16.2,16.4,15.9,15.5
        sha3_384,48,32.7,24.2,20.4,20.2,20.0
        sha3_512,64,32.0,32.2,30.1,28.6,28.5
        shake128,32,32.7,16.2,14.2,12.8,12.5
        shake256,32,32.7,16.2,16.3,15.7,15.4
2024-05-08 | tests/neon/neon.c: port changes back from sha3.c | Paul Duncan | 1 | -302/+405
2024-05-07 | tests/neon/Makefile: add all target | Paul Duncan | 1 | -2/+5
2024-05-07 | tests/bench/Makefile: add test target | Paul Duncan | 1 | -0/+3
2024-05-07 | sha3.c: s/union/struct/ (yeesh) | Paul Duncan | 1 | -7/+2
2024-05-06 | sha3.c: neon backend now twice the speed of scalar backend (~50% fewer cycles, see commit message) | Paul Duncan | 1 | -153/+137
    made the following changes:

    - row_t contents are now 3 uint64x2_t instead of uint64x2x3_t (so they are stored as registers instead of memory)
    - fetch round constants 2 at a time
    - round loop unrolled once
    - drop convoluted ext/trn store (hard to read, doesn't help)

    bench results
    -------------
    scalar backend:

        > make clean all SHA3_BACKEND=1
        ...
        > ./bench 10000
        info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
        info: backend=scalar num_trials=10000 src_lens=64,256,1024,4096,16384 dst_lens=32
        function,dst_len,64,256,1024,4096,16384
        sha3_224,28,20.2,10.3,10.3,9.3,9.2
        sha3_256,32,20.2,10.3,10.3,9.9,9.7
        sha3_384,48,20.9,15.3,12.8,12.7,12.7
        sha3_512,64,20.2,20.2,18.9,17.9,18.1
        shake128,32,20.2,10.3,9.0,8.1,7.9
        shake256,32,20.2,10.1,10.3,9.9,9.7

    neon backend:

        > make clean all SHA3_BACKEND=3
        ...
        > ./bench 10000
        info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
        info: backend=neon num_trials=10000 src_lens=64,256,1024,4096,16384 dst_lens=32
        function,dst_len,64,256,1024,4096,16384
        sha3_224,28,9.7,5.0,5.0,4.6,4.5
        sha3_256,32,9.7,5.0,5.0,4.9,4.8
        sha3_384,48,9.7,7.3,6.2,6.2,6.1
        sha3_512,64,9.7,9.7,9.1,8.7,8.7
        shake128,32,9.7,5.0,4.5,4.0,4.0
        shake256,32,9.7,5.0,5.1,4.9,4.8
2024-05-05 | sha3.c: diet-neon: s/permute_n_neon/permute_n_diet_neon/ | Paul Duncan | 1 | -2/+2
2024-05-05 | sha3.c: add diet-neon backend (even slower, see commit message) | Paul Duncan | 1 | -0/+329
    scalar bench results:

        info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
        info: backend=scalar num_trials=50000 src_lens=64,256,1024,4096,16384 dst_lens=32
        function,dst_len,64,256,1024,4096,16384
        sha3_224,28,20.2,10.3,10.3,9.3,9.2
        sha3_256,32,20.2,10.1,10.3,9.9,9.7
        sha3_384,48,20.9,15.1,12.8,12.7,12.5
        sha3_512,64,20.2,20.2,18.9,18.0,18.0
        shake128,32,20.2,10.1,9.0,8.1,7.9
        shake256,32,20.2,10.1,10.3,9.9,9.7

    neon bench results:

        info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
        info: backend=neon num_trials=50000 src_lens=64,256,1024,4096,16384 dst_lens=32
        function,dst_len,64,256,1024,4096,16384
        sha3_224,28,32.7,16.2,16.3,14.8,14.5
        sha3_256,32,32.7,16.2,16.3,15.8,15.4
        sha3_384,48,32.7,24.2,20.3,20.3,20.0
        sha3_512,64,32.0,32.3,30.2,28.7,29.3
        shake128,32,34.8,16.9,14.9,13.3,13.4
        shake256,32,35.5,18.1,17.4,17.2,16.4

    diet-neon bench results:

        info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
        info: backend=diet-neon num_trials=50000 src_lens=64,256,1024,4096,16384 dst_lens=32
        function,dst_len,64,256,1024,4096,16384
        sha3_224,28,33.4,16.5,16.6,15.1,15.0
        sha3_256,32,33.4,16.5,16.6,16.1,15.9
        sha3_384,48,33.4,25.0,21.0,20.7,21.4
        sha3_512,64,33.4,34.9,33.5,31.1,32.0
        shake128,32,36.8,18.4,16.3,14.3,14.0
        shake256,32,34.1,17.7,18.2,17.6,17.3
2024-05-05 | sha3.c, Makefile, tests/bench/Makefile: allow overriding SHA3_BACKEND via command-line argument | Paul Duncan | 3 | -10/+23
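The entry above lets a build pick the backend, as in the `make clean all SHA3_BACKEND=1` (scalar) and `SHA3_BACKEND=3` (neon) invocations recorded elsewhere in this log. A minimal sketch of the preprocessor side of such an override follows. It assumes the Makefile forwards the variable as `-DSHA3_BACKEND=N`. Only the values 1 and 3 appear in the log; the value 2 for avx512, the default selection logic, and `backend_name()` are assumptions for illustration.

```c
// Backend IDs. The log shows 1 (scalar) and 3 (neon); 2 (avx512) is an
// assumption made for this sketch.
#define BACKEND_SCALAR 1
#define BACKEND_AVX512 2
#define BACKEND_NEON 3

// Only pick a default if the build did not pass -DSHA3_BACKEND=N.
#ifndef SHA3_BACKEND
#if defined(__AVX512F__)
#define SHA3_BACKEND BACKEND_AVX512
#else
#define SHA3_BACKEND BACKEND_SCALAR
#endif
#endif

// Hypothetical accessor: name of the backend selected at compile time.
static const char *backend_name(void) {
#if SHA3_BACKEND == BACKEND_SCALAR
  return "scalar";
#elif SHA3_BACKEND == BACKEND_AVX512
  return "avx512";
#elif SHA3_BACKEND == BACKEND_NEON
  return "neon";
#else
#error "unknown SHA3_BACKEND"
#endif
}
```

Using `#if SHA3_BACKEND == ...` rather than `#ifdef` is what makes a numeric override possible, and an unknown value fails at compile time instead of silently falling back.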
2024-05-05 | sha3.c: s/avx512/neon/ in neon test comments | Paul Duncan | 1 | -2/+2
2024-05-05 | .gitignore: add *.sw? | Paul Duncan | 1 | -0/+1
2024-05-04 | sha3.c: neon: add tests, improve performance (still too slow, see full commit message) | Paul Duncan | 1 | -198/+217
    scalar (odroid n2l):

        pabs@pizza:~/git/sha3/tests/bench> ./bench 1000
        info: cpucycles: version=20240318 implementation=arm64-vct persecond=1800000000
        info: backend=scalar num_trials=1000 src_lens=64,256,1024,4096,16384 dst_lens=32
        function,dst_len,64,256,1024,4096,16384
        sha3_224,28,34.0,16.4,15.5,14.0,13.7
        sha3_256,32,34.0,16.1,15.4,14.8,14.4
        sha3_384,48,34.0,23.4,19.0,18.8,18.6
        sha3_512,64,34.0,30.8,28.1,26.5,26.5
        shake128,32,34.0,16.1,13.6,12.1,11.8
        shake256,32,34.0,16.1,15.5,14.8,14.4

    neon (odroid n2l):

        pabs@pizza:~/git/sha3/tests/bench> ./bench 1000
        info: cpucycles: version=20240318 implementation=arm64-vct persecond=1800000000
        info: backend=neon num_trials=1000 src_lens=64,256,1024,4096,16384 dst_lens=32
        function,dst_len,64,256,1024,4096,16384
        sha3_224,28,65.6,32.5,31.8,28.7,28.2
        sha3_256,32,65.6,32.5,31.9,30.8,30.0
        sha3_384,48,65.6,48.0,39.7,39.5,39.0
        sha3_512,64,68.0,63.9,59.1,56.0,55.9
        shake128,32,65.6,32.5,28.4,25.4,24.8
        shake256,32,65.6,32.5,31.6,30.5,29.7
2024-05-04 | tests/bench/Makefile: add commented CFLAGS with scalar backend | Paul Duncan | 1 | -0/+1
2024-05-03 | sha3.c: add missing RHO_IDS | Paul Duncan | 1 | -0/+9
2024-05-03 | sha3.c: add neon backend | Paul Duncan | 1 | -9/+305
2024-05-03 | add tests/neon | Paul Duncan | 3 | -0/+989
2024-05-03 | sha3.c: refactor backends so they only implement permute_n() | Paul Duncan | 1 | -40/+21
    i verified that (gcc, at least) does constant propagation and inlines permute_n_<backend> and that this change does not affect performance.

    bench results, before:

        pabs@flex:~/git/sha3/tests/bench> ./bench
        info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
        info: backend=avx512 num_trials=100000 src_lens=64,256,1024,4096,16384 dst_lens=32
        function,dst_len,64,256,1024,4096,16384
        sha3_224,28,15.4,7.8,7.8,7.1,7.0
        sha3_256,32,15.5,7.8,7.8,7.6,7.4
        sha3_384,48,15.5,11.7,9.8,9.8,9.7
        sha3_512,64,15.6,15.5,14.6,13.9,13.9
        shake128,32,15.5,7.8,6.9,6.2,6.1
        shake256,32,15.5,7.8,7.9,7.6,7.4

    bench results, after change:

        pabs@flex:~/git/sha3/tests/bench> ./bench
        info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
        info: backend=avx512 num_trials=100000 src_lens=64,256,1024,4096,16384 dst_lens=32
        function,dst_len,64,256,1024,4096,16384
        sha3_224,28,15.4,7.8,7.8,7.1,7.0
        sha3_256,32,15.6,7.8,7.8,7.6,7.4
        sha3_384,48,15.6,11.7,9.8,9.8,9.7
        sha3_512,64,15.6,15.5,14.6,13.8,13.8
        shake128,32,15.6,7.9,6.9,6.2,6.1
        shake256,32,15.7,7.9,7.9,7.6,7.4
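The refactor described above leaves each backend with a single permute_n() routine that takes the round count as a parameter. Fixed-round wrappers can then rely on inlining and constant propagation. The sketch below shows the idea with a compact scalar Keccak permutation. It is not the code from sha3.c: the signature of `permute_n_scalar()`, the table names, and the `permute_n` macro dispatch are assumptions for illustration. The 12-round variant runs the last 12 rounds, following the usual reduced-round Keccak-p convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// round constants for the iota step, one per round
static const uint64_t RCS[24] = {
  0x0000000000000001ULL, 0x0000000000008082ULL, 0x800000000000808aULL,
  0x8000000080008000ULL, 0x000000000000808bULL, 0x0000000080000001ULL,
  0x8000000080008081ULL, 0x8000000000008009ULL, 0x000000000000008aULL,
  0x0000000000000088ULL, 0x0000000080008009ULL, 0x000000008000000aULL,
  0x000000008000808bULL, 0x800000000000008bULL, 0x8000000000008089ULL,
  0x8000000000008003ULL, 0x8000000000008002ULL, 0x8000000000000080ULL,
  0x000000000000800aULL, 0x800000008000000aULL, 0x8000000080008081ULL,
  0x8000000000008080ULL, 0x0000000080000001ULL, 0x8000000080008008ULL,
};

// rho rotation amounts and pi destination lanes, in the order visited
// by the combined rho/pi loop below
static const unsigned RHO[24] = {
  1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14,
  27, 41, 56, 8, 25, 43, 62, 18, 39, 61, 20, 44,
};
static const unsigned PI[24] = {
  10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4,
  15, 23, 19, 13, 12, 2, 20, 14, 22, 9, 6, 1,
};

// rotate left by n bits (n is always in 1..63 here)
static inline uint64_t rol(uint64_t v, unsigned n) {
  return (v << n) | (v >> (64 - n));
}

// Hypothetical scalar backend entry point: apply the last num_rounds
// rounds of Keccak-f[1600] to the 25-lane state s.
static inline void permute_n_scalar(uint64_t s[25], size_t num_rounds) {
  assert(num_rounds <= 24);
  for (size_t r = 24 - num_rounds; r < 24; r++) {
    uint64_t c[5], t;

    // theta: xor each lane with the parities of two neighboring columns
    for (size_t x = 0; x < 5; x++) {
      c[x] = s[x] ^ s[x + 5] ^ s[x + 10] ^ s[x + 15] ^ s[x + 20];
    }
    for (size_t x = 0; x < 5; x++) {
      t = c[(x + 4) % 5] ^ rol(c[(x + 1) % 5], 1);
      for (size_t y = 0; y < 25; y += 5) {
        s[y + x] ^= t;
      }
    }

    // rho and pi: rotate each lane and move it to its new position
    t = s[1];
    for (size_t i = 0; i < 24; i++) {
      const uint64_t next = s[PI[i]];
      s[PI[i]] = rol(t, RHO[i]);
      t = next;
    }

    // chi: nonlinear mix within each row
    for (size_t y = 0; y < 25; y += 5) {
      for (size_t x = 0; x < 5; x++) {
        c[x] = s[y + x];
      }
      for (size_t x = 0; x < 5; x++) {
        s[y + x] = c[x] ^ (~c[(x + 1) % 5] & c[(x + 2) % 5]);
      }
    }

    // iota: xor round constant into the first lane
    s[0] ^= RCS[r];
  }
}

// Backend dispatch happens in exactly one place (assumed here to be a
// macro); the fixed-round wrappers are shared by every backend.
#define permute_n permute_n_scalar
static void permute_24(uint64_t s[25]) { permute_n(s, 24); }
static void permute_12(uint64_t s[25]) { permute_n(s, 12); }
```

Because `num_rounds` is a compile-time constant at both call sites and the backend function is `static inline`, an optimizing compiler can specialize each wrapper. That is consistent with the unchanged bench results reported in the commit message.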
2024-05-01 | sha3.h: remove extraneous exclamation points | Paul Duncan | 1 | -2/+2
2024-05-01 | sha3.c: improve internal documentation, s/SHA3_BACKEND_/BACKEND_/ | Paul Duncan | 1 | -78/+134
2024-04-29 | sha3.c: add permute_n_{scalar,avx512}() and refactor permute{,12}_{scalar,avx512}() to use them | Paul Duncan | 1 | -198/+23
2024-04-29 | sha3.c: s/ifdef/if/ in a few places | Paul Duncan | 1 | -2/+2
2024-04-29 | tests/bench: add backend to metadata | Paul Duncan | 2 | -3/+4
2024-04-29 | .gitignore: add all-fns | Paul Duncan | 1 | -0/+1
2024-04-29 | sha3.[hc]: add sha3_backend() | Paul Duncan | 2 | -1/+26
2024-04-29 | examples/06-all/all-fns.c: add sha3_backend() example | Paul Duncan | 1 | -0/+11
2024-04-29 | sha3.c: add/use SHA3_BACKEND | Paul Duncan | 1 | -11/+26
2024-04-29 | tests/bench/README.md: add n2l example | Paul Duncan | 1 | -2/+8
2024-04-29 | tests/bench: refactor so bench prints a cpb table to stdout | Paul Duncan | 2 | -93/+129
2024-04-29 | tests/bench/README.md: remove mean_cpb, add "cycles per byte" link, misc cleanups | Paul Duncan | 1 | -4/+5
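The bench entries above report cycles per byte (cpb) per function and source length, and drop the mean in favor of another summary statistic. A minimal sketch of how one table cell could be computed follows, assuming the cells are median cycles per byte over the trials. The helper name `median_cpb()` and its signature are hypothetical and not taken from bench.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// qsort() comparator for 64-bit cycle counts
static int cmp_u64(const void *ap, const void *bp) {
  const uint64_t a = *(const uint64_t *)ap;
  const uint64_t b = *(const uint64_t *)bp;
  return (a > b) - (a < b);
}

// Hypothetical helper: median cycles per byte for one table cell.
// cycles[] holds one elapsed cycle count per trial (sorted in place);
// src_len is the input length in bytes.
static double median_cpb(uint64_t *cycles, size_t num_trials, size_t src_len) {
  assert(num_trials > 0 && src_len > 0);
  qsort(cycles, num_trials, sizeof(cycles[0]), cmp_u64);

  // middle element, or the average of the two middle elements
  const size_t mid = num_trials / 2;
  const double median = (num_trials % 2)
    ? (double)cycles[mid]
    : ((double)cycles[mid - 1] + (double)cycles[mid]) / 2.0;

  return median / (double)src_len;
}
```

A median is less sensitive than a mean to the occasional trial disturbed by an interrupt or context switch, which is a plausible reason to prefer it in a cycle-count benchmark.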
2024-04-29 | tests/bench/bench.c: fix bench function memory allocation, remove mean_cpb from output, use defines for src/dst lengths | Paul Duncan | 1 | -20/+24