Poly1305 implementations are characterized by several parameters:
- radix or base of inputs representation, or how many digits represent the 130-bit value the algorithm operates on;
- vectorization factor, or how many input blocks are processed per [loop] iteration and in parallel;
- floating-point vs. integer/scalar arithmetic;
We are most interested in performing calculations in
minimum time, and for this reason find all these parameters
interdependent. For example, the algorithm specification favours
base 2^{32} floating-point implementation utilizing extended-precision operations supported by Intel x87 ISA;- mixed
base 2^{16/32} floating-point implementation for double-precision arithmetics; base 2^{22}-ish vectorized floating-point implementation targeting x86_64 AVX unit;base 2^{26} vectorized implementation targeting integer SIMD units in x86, x86_64 and ARM processors;base 2^{26} scalar implementation as a reference or fall-back implementation;
This list is obviously incomplete in the sense that
there are no
As just mentioned, vectorized
SSE2, 32-bit | AVX, 64-bit | AVX2, 64-bit | |
---|---|---|---|
Core 2 | 1.80 | ||
Sandy Bridge | 1.36 | 1.10 | |
Haswell | 1.18 | 1.11 | 0.65 |
Atom Silvermont | 4.80 | ||
Bulldozer | 1.31 | 0.97 | |
NEON, 32-bit | NEON, 64-bit | ||
Cortex-A8 | 2.36 | ||
Cortex-A15 | 1.25 | ||
Cortex-A53 | 1.61 | 1.47 | |
Cortex-A57 | 1.29 | 1.14 | |
Apple A7 | 0.78 | 0.72 |
The table mixes different optimizations, which might be a little bit confusing, but it does provide a useful overview and underlines following point: Even though 32-bit results are inherently worse because of limited register bank capacity, they don’t differ significantly from 64-bit ones on the same processor. This is because it’s same radix and therefore the same amount of computational operations.
Impressive as they are, the above results come with a "price tag" of pre-calculation of key powers, which "costs" several 16-byte blocks. We should also keep in mind that to process a single block with a vectorized procedure, one must perform "throw-away" processing of several blocks and discard all of the results but one. The bottom line is that the vectorized procedure is far from optimal for short inputs, for which scalar implementations are preferable, even if large-block performance is measured to be several times worse:
base 2^{32} | base 2^{64} | |
---|---|---|
Core 2 | 4.85 | 2.39 |
Sandy Bridge | 3.90 | 1.39 |
Haswell | 3.88 | 1.10 |
Atom Silvermont | 11.0 | 2.83 |
Bulldozer | 4.53 | 2.21 |
Cortex-A8 | 6.25 | |
Cortex-A15 | 3.79 | |
Cortex-A53 | 4.07 | 2.63 |
Cortex-A57 | 3.51 | 2.70 |
Apple A7 | 3.62 | 1.86 |
It was observed that a single, up to 16 bytes long
block is fully hashed, i.e. accounting even for initialization
and finalization, about
Note that there are
As for floating point: In general non-vectorized
implementations are limited by a critical path length which is a complex
function of instruction latencies, pipeline throughput, and other
obscure factors. The thing to recognize about floating point is that the
critical path is likely to be dominated by floating-point operation
latency. This is because floating-point addition has the same latency as
multiplication. Given algorithmic dependencies an ideal processor with
sufficient resources should perform a block operation in 10 latencies.
For example, for latency of 6, performance would be 10*6/16=3.75 cycles
per byte. Once again, this is an asymptotic limit for an ideal
processor, and in practice we are likely to measure worse performance.
Either way the key point is that I failed to identify a processor
capable of supporting
base 2^{32} | floating-point | base 2^{64} | |
---|---|---|---|
Freescale e300 | 14.8 | 9.78 | |
PPC970 | 7.20 | 6.24 | 3.51 |
POWER7 | 3.67 | 3.50 | 1.87 |
POWER8 | - | 3.75 | 2.13 |
z10 | 11.2 | 6.40 | |
z196 | 7.30 | 2.20 |
The choice of IBM processors is rather circumstantial, the principle holds true for many other platforms.
There are odd-ball cases that are tricky to evaluate;
one example is SPARCv9. The floating-point option is distinctly superior
on the original UltraSPARC pre-Tx family, several times faster than
integer
In conclusion let’s note some processors of
historical interest. Floating point would surely be optimal for PA-RISC
and Itanium, because integer multiplications are performed by the
floating-point unit on these processors, and transfer between register
banks is not exactly gratis. In addition Itanium is capable of performing
extended-precision operations which makes
Summary and future work
Integer/scalar
As for future work, note that Power ISA 2.07 adds
instructions that make vectorized
But the closest goal is AVX512 which allows to double vectorization factor. This requires an even larger pre-computed table. The intention to not preserve second half of the table, but instead calculate it upon every call with long enough input. That may appear excessive, but the most common usage pattern is to make one "big" call per TLS record anyway.