Poly1305 implementations are characterized by several parameters:
- radix, or base, of the input representation, i.e. how many digits are used to represent the 130-bit value the algorithm operates on;
- vectorization factor, or how many input blocks are processed per [loop] iteration, in parallel;
- floating-point vs. integer/scalar arithmetic.
We are most interested in performing calculations in minimum time, and for this reason find all these parameters interdependent. For example, the algorithm specification favours base 2^32 by structuring the key in a manner that minimizes carries. But that radix is impossible with IEEE 754 double-precision operations, because the 53-bit mantissa can't accommodate the result of multiplying a pair of corresponding digits. Nor is it practical for a vectorized implementation, because such an implementation uses not only the key itself, but also powers of the key; the carry-friendly structure just mentioned is lost once the key is multiplied with itself, and one must either handle carries sequentially or make the radix smaller to preserve parallelism. Long story short, so far implementors have explored the following options:
- base 2^32 floating-point implementation utilizing the extended-precision operations supported by the Intel x87 ISA;
- base 2^16/2^32 floating-point implementation for double-precision arithmetic;
- base 2^22-ish vectorized floating-point implementation targeting the x86_64 AVX unit;
- base 2^26 vectorized implementation targeting the integer SIMD units in x86, x86_64 and ARM processors;
- base 2^26 scalar implementation as a reference or fall-back implementation.
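The carry-friendly key structure mentioned above comes from the clamping step in the Poly1305 specification. A minimal Python sketch of it, using the clamp mask from RFC 7539 (the helper names are illustrative, not from any particular library):

```python
# Sketch: how Poly1305 key clamping makes base 2^32 digits carry-friendly.
# The clamp mask is taken from the Poly1305 specification (RFC 7539).
import os

CLAMP = 0x0ffffffc0ffffffc0ffffffc0fffffff

def clamp(r_bytes: bytes) -> int:
    """Interpret 16 key bytes as a little-endian integer and clamp it."""
    return int.from_bytes(r_bytes, "little") & CLAMP

def base32_digits(x: int) -> list:
    """Split a 128-bit value into four base 2^32 digits, least significant first."""
    return [(x >> (32 * i)) & 0xffffffff for i in range(4)]

r = clamp(os.urandom(16))
d = base32_digits(r)

# Every digit has its top 4 bits clear, so each digit is below 2^28...
assert all(di < 1 << 28 for di in d)
# ...and the three upper digits additionally have their bottom 2 bits clear.
assert all(di % 4 == 0 for di in d[1:])
```

Products of such digits with full 32-bit digits stay below 2^60, so several partial products can be accumulated in a 64-bit register before a carry is needed. Powers of the key, however, generally have no such structure, which is why vectorized code can't rely on it.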
This list is obviously incomplete in the sense that there are no base 2^32 or base 2^64 integer/scalar implementations. A floating-point approach is often assumed to be superior. That assumption is effectively disproved by base 2^26 vectorized integer implementations, but what about processors that lack a sufficiently capable SIMD unit? This post addresses that situation.
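For reference, the whole algorithm fits in a few lines when expressed with big-integer arithmetic. The sketch below follows the RFC 7539 description directly (it is not OpenSSL's code); the implementations discussed in this post differ in how they decompose the 130-bit arithmetic into machine words:

```python
# A minimal big-integer Poly1305 reference in Python, following RFC 7539.
# Production code replaces the 130-bit arithmetic below with multi-digit
# arithmetic in one of the radixes discussed in this post.

def poly1305_mac(msg: bytes, key: bytes) -> bytes:
    p = (1 << 130) - 5                       # the prime 2^130 - 5
    r = int.from_bytes(key[:16], "little") & 0x0ffffffc0ffffffc0ffffffc0fffffff
    s = int.from_bytes(key[16:], "little")
    h = 0
    for i in range(0, len(msg), 16):
        block = msg[i:i + 16]
        # Append a 1-bit above the block: pads partial blocks unambiguously.
        n = int.from_bytes(block + b"\x01", "little")
        h = (h + n) * r % p
    return ((h + s) % (1 << 128)).to_bytes(16, "little")

# RFC 7539 section 2.5.2 test vector
key = bytes.fromhex("85d6be7857556d337f4452fe42d506a8"
                    "0103808afb0db2fd4abff6af4149f51b")
tag = poly1305_mac(b"Cryptographic Forum Research Group", key)
assert tag == bytes.fromhex("a8061dc1305136c6c22b8baf0c0127a9")
```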
As just mentioned, vectorized base 2^26 implementations turn out to be a better choice, at least in comparison to floating-point implementations, which allows us to say the following. As all contemporary processors in the x86 family have vector units, we can dismiss base 2^32 floating point as hardly relevant. For the sake of completeness we can mention that its large-block performance varies from ~2.6 cycles per processed byte on the latest desktop/server Intel processors to ~7.7 on Atom Silvermont. Among pre-SSE2 processors it varies from ~9.0 on Pentium to ~4.6 on PIII and ~3.1 on AMD Athlon. These results were collected with poly1305-util. Once again, the x87 code is the fastest option on pre-SSE2 processors, but those are stone-age and very uncommon now, hence the omission. Back to vectorized: here are selected results for the OpenSSL implementations, also in cycles per processed byte out of a large block:

                  SSE2, 32-bit   AVX, 64-bit   AVX2, 64-bit
  Core 2              1.80
  Sandy Bridge        1.36           1.10
  Haswell             1.18           1.11           0.65
  Atom Silvermont     4.80
  Bulldozer           1.31           0.97

                  NEON, 32-bit   NEON, 64-bit
  Cortex-A8           2.36
  Cortex-A15          1.25
  Cortex-A53          1.61           1.47
  Cortex-A57          1.29           1.14
  Apple A7            0.78           0.72
The table mixes different optimizations, which might be a little confusing, but it does provide a useful overview and underlines the following point: even though 32-bit results are inherently worse because of the limited register bank capacity, they don't differ significantly from 64-bit ones on the same processor. This is because it's the same radix and therefore the same amount of computational operations.
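A rough operation-count model makes the point concrete: the number of limbs, and hence the number of word multiplications per block, is fixed by the radix alone, not by the register width. This is a back-of-the-envelope sketch, not a cycle-accurate model:

```python
# Rough model: the number of digits (limbs) a 130-bit value needs, and hence
# the number of word multiplications per block, is set by the radix alone.
import math

def limbs(bits: int, radix_bits: int) -> int:
    return math.ceil(bits / radix_bits)

# 130-bit values in base 2^26 take 5 limbs whether registers are 32- or
# 64-bit wide, so a schoolbook limb-by-limb product is 5*5 = 25 multiplies.
n26 = limbs(130, 26)
assert n26 == 5 and n26 * n26 == 25

# Base 2^64 needs only 3 limbs, i.e. roughly 9 multiplies per block, but it
# requires 64x64->128-bit products, which 32-bit machines lack.
n64 = limbs(130, 64)
assert n64 == 3
```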
Impressive as they are, the above results come with a "price tag" of pre-calculating key powers, which "costs" several 16-byte blocks. We should also keep in mind that to process a single block with a vectorized procedure, one must perform "throw-away" processing of several blocks and discard all of the results but one. The bottom line is that the vectorized procedure is far from optimal for short inputs, for which scalar implementations are preferable, even if their large-block performance is measured to be several times worse:
                  base 2^32   base 2^64
  Core 2              4.85        2.39
  Sandy Bridge        3.90        1.39
  Haswell             3.88        1.10
  Atom Silvermont    11.0         2.83
  Bulldozer           4.53        2.21
  Cortex-A8           6.25
  Cortex-A15          3.79
  Cortex-A53          4.07        2.63
  Cortex-A57          3.51        2.70
  Apple A7            3.62        1.86
It was observed that a single block of up to 16 bytes is fully hashed, i.e. accounting even for initialization and finalization, about 4-5 times slower than one block out of a large buffer, which in turn is 16 times the result from the table above. For example, on Haswell a single block is hashed in ~80 cycles, which is faster than the vectorized pre-computation alone. Another factor that benefits small-block performance is that scalar code is really compact: the x86_64 base 2^64 code is less than 500 bytes, and the ARMv8 code less than 400. Does all this mean that input length plays a role in which approach is applied? Yes. In OpenSSL the internal context is always initialized for scalar processing, but as soon as a long input is passed, long enough to justify the additional overhead, the current hash value is converted to a different radix and the necessary pre-calculations are performed.
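The radix switch just described amounts to repacking the accumulator's bits. A conceptual sketch, converting from base 2^64 limbs to the five base 2^26 limbs a vectorized path would expect (function names are illustrative, not OpenSSL's):

```python
# Sketch of a radix switch: a 130-bit accumulator held as two 64-bit limbs
# plus a small top limb is repacked into five base 2^26 limbs.

def to_base26(h0: int, h1: int, h2: int) -> list:
    """Repack limbs (low 2^64, high 2^64, 2-bit top) into base 2^26 digits."""
    h = h0 | (h1 << 64) | (h2 << 128)
    return [(h >> (26 * i)) & ((1 << 26) - 1) for i in range(5)]

def from_base26(d: list) -> int:
    """Inverse conversion, for checking the round trip."""
    return sum(di << (26 * i) for i, di in enumerate(d))

# Round-trip check on an arbitrary 130-bit value.
h0, h1, h2 = 0x0123456789abcdef, 0xfedcba9876543210, 0x2
d = to_base26(h0, h1, h2)
assert all(di < 1 << 26 for di in d)
assert from_base26(d) == h0 | (h1 << 64) | (h2 << 128)
```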
Note that there are base 2^64 results that are not worse than vectorized SSE2 on some x86_64 CPUs, and some are even better, most notably on Atom Silvermont. This is basically why 64-bit SSE2 code is omitted from OpenSSL; it's not a better option on contemporary pre-AVX processors. One can argue that this is unfair to legacy systems, such as Westmere and its predecessors, but it's not an unreasonable compromise, because we're talking about processors that are approaching or have passed their end of life. Fortunately the performance gap is smaller for more recent processors.
As for floating point: in general, non-vectorized implementations are limited by a critical path length, which is a complex function of instruction latencies, pipeline throughput, and other obscure factors. The thing to recognize about floating point is that the critical path is likely to be dominated by floating-point operation latency, because floating-point addition has the same latency as multiplication. Given the algorithmic dependencies, an ideal processor with sufficient resources should perform a block operation in 10 latencies. For example, for a latency of 6 cycles, performance would be 10*6/16 = 3.75 cycles per byte. Once again, this is an asymptotic limit for an ideal processor, and in practice we are likely to measure worse performance. Either way, the key point is that I failed to identify a processor capable of supporting base 2^64 that performs worse than the floating-point estimate for the same processor. In other words, floating point makes sense only as an alternative to the base 2^32 scalar option. Here are comparison results for a few PowerPC and z/Architecture processors to demonstrate the assertion:

                  base 2^32   floating-point   base 2^64
  Freescale e300     14.8          9.78
  PPC970              7.20         6.24           3.51
  POWER7              3.67         3.50           1.87
  POWER8               -           3.75           2.13
  z10                11.2          6.40
  z196                7.30         2.20
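The asymptotic estimate used above reduces to a one-line formula; spelled out for clarity (an ideal-processor lower bound, not a measurement):

```python
# The asymptotic floating-point estimate: 10 dependent floating-point
# latencies per 16-byte block, on an ideal processor with ample resources.

def fp_cycles_per_byte(fp_latency: int, block: int = 16) -> float:
    return 10 * fp_latency / block

# A 6-cycle FPU latency gives the 3.75 cycles/byte figure quoted in the text.
assert fp_cycles_per_byte(6) == 3.75
```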
The choice of IBM processors is rather circumstantial; the principle holds true for many other platforms.
There are odd-ball cases that are tricky to evaluate; one example is SPARCv9. The floating-point option is distinctly superior on the original pre-Tx UltraSPARC family, several times faster than integer base 2^32. But the UltraSPARC T1 was a really poor floating-point "performer", and so are its successors. The Tx family is improving in that regard, but the VIS3 extension introduced in the T3 made a base 2^64 implementation possible, in both 32- and 64-bit builds, which turned out to be fast enough to render floating point irrelevant. As UltraSPARC I-IV is unlikely to be found in production, the floating-point implementation is omitted for SPARCv9.
In conclusion, let's note some processors of historical interest. Floating point would surely be optimal for PA-RISC and Itanium, because integer multiplications are performed by the floating-point unit on these processors, and transfers between register banks are not exactly gratis. In addition, Itanium is capable of performing extended-precision operations, which makes a base 2^32 floating-point implementation possible…
Summary and future work
Base 2^32 and base 2^64 scalar code is implemented as a fall-back and as the preferred choice for short inputs. The base 2^64 approach is shown to be the best choice for 64-bit processors that lack a SIMD unit. The vectorized base 2^26 implementation is optimized with all-round performance in mind, i.e. attempting to meet multiple optimization strategies for different microarchitectures.
As for future work, note that Power ISA 2.07 adds instructions that make a vectorized base 2^26 implementation possible, and so does z13 from the z/Architecture family… The base 2^64 approach might turn out to be suboptimal for MIPS64 processors because of a combination of two factors: carries must be handled programmatically, and it takes additional instructions to pull the high part of a multiplication result. Both mean an increased number of instructions, and in combination they might tip the balance toward a base 2^32 implementation, but one expressed in 64-bit instructions…
But the closest goal is AVX512, which allows the vectorization factor to be doubled. This requires an even larger pre-computed table. The intention is to not preserve the second half of the table, but instead calculate it upon every call with a long enough input. That may appear excessive, but the most common usage pattern is to make one "big" call per TLS record anyway.