Snapdragon S4: disappointing NEON performance

I find it unfortunate that Qualcomm leaves it to unaffiliated technical writers to describe its CPU internals, because the resulting information comes out inaccurate and misleading, at least for programmers such as myself. Take for example AnandTech (search for "VeNum"):

"Qualcomm calls its NEON engine VeNum and has increased its issue capabilities by 50%. Whereas Scorpion could only issue two NEON instructions in parallel, Krait can do three.

Qualcomm's NEON data paths are still 128-bits wide."

I didn't have an opportunity to benchmark the above-mentioned Scorpion CPU, but below is an ILP vs. IPC comparison table for Snapdragon S4 and Cortex-A8, with Cortex-A15 and Apple A7 added for reference. It covers sequences of simple logical instructions: integer-only, mixtures of integer and NEON, and NEON-only. ILP, instruction-level parallelism, is how many [independent] instructions can execute at the same time, while IPC, instructions per cycle, indicates how many instructions were actually executed per cycle. On an ideal processor with unlimited [or adequate] resources, IPC is equal to ILP. Larger is better, red is "bad":

ILP       IPC@Snapdragon S4  IPC@Cortex-A8  IPC@Cortex-A15  IPC@Apple A7
1x        0.55               0.99           1.00            1.00
1xNEON    0.33               0.50           0.33            0.36
2x        1.38               1.86           1.87            1.97
1+1xNEON  0.74               0.99           0.67            0.69
2xNEON    0.66               0.99           0.66            0.69
3x        2.66               1.90           1.91            2.52
2+1xNEON  0.99/1.99(*)       1.49/1.94(*)   1.00/1.99(*)    1.16/2.16(*)
1+2xNEON  0.99/1.41(*)       1.49/1.50(*)   0.99/1.65(*)    1.04/2.18(*)
3xNEON    0.99               0.99           1.00            1.05
4x        2.37               1.93           1.93            3.94
3+1xNEON  1.33               1.93           1.33            1.47
2+2xNEON  1.33               1.93           1.33            1.36
1+3xNEON  1.33               1.35           1.33            1.42
4xNEON    0.99               0.99           0.92            1.38
5x        2.66               1.94           1.94            3.96
4+1xNEON  1.66               1.99           1.66            1.90
3+2xNEON  1.66               1.99           1.66            1.75
2+3xNEON  1.66               1.72           1.66            1.82
1+4xNEON  1.09               1.30           1.39            1.77
5xNEON    1.00               1.00           1.22            1.67
6+3xNEON  2.63(**)           1.93           2.87(**)        2.99

(*) Numbers after the slash are for 4+2xNEON and 2+4xNEON, i.e. without dependencies between NEON instructions in adjacent 3x bundles.
(**) While these are exciting results, the trouble is that such sequences are rare, and apparently none are found in OpenSSL.

As can be seen, Snapdragon S4 cannot issue more than one NEON instruction per cycle. Moreover, because of its apparently higher NEON instruction latency, a number of sequences were observed to deliver lower IPC than the corresponding ones on Cortex-A8. Nor is it clear what "still 128-bit data path" is supposed to mean. The statement implies that the contender doesn't have 128-bit data paths, but the NEON instructions used in the tests were all 128-bit ones...
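The NEON-only rows can be approximated with a toy in-order issue model. This is only a sketch: the function name `simulate_ipc`, the issue width of one NEON instruction per cycle, and the latencies of 3 cycles (Snapdragon S4) and 2 cycles (Cortex-A8) are assumptions picked to fit the table, not figures from vendor documentation.

```python
def simulate_ipc(chains, latency, issue_width, n_instr=3000):
    """Toy in-order model (assumption, not a real pipeline description).

    The instruction stream interleaves `chains` independent dependency
    chains: instruction i belongs to chain i % chains and depends on the
    previous instruction of the same chain, whose result becomes
    available `latency` cycles after it was issued.
    """
    ready_at = [0] * chains  # earliest cycle each chain may issue again
    cycle = 0
    next_instr = 0
    while next_instr < n_instr:
        for _ in range(issue_width):
            if next_instr >= n_instr:
                break
            chain = next_instr % chains
            if ready_at[chain] > cycle:
                break  # in-order issue: stall on the first unready instruction
            ready_at[chain] = cycle + latency
            next_instr += 1
        cycle += 1
    return n_instr / cycle


if __name__ == "__main__":
    for ilp in range(1, 6):
        s4 = simulate_ipc(ilp, latency=3, issue_width=1)  # assumed S4-like
        a8 = simulate_ipc(ilp, latency=2, issue_width=1)  # assumed A8-like
        print(f"{ilp}xNEON  {s4:.2f}  {a8:.2f}")
```

With a latency of 3 the model gives roughly 0.33, 0.67 and 1.00 for one, two and three independent chains and stays at 1.00 beyond that, which is the shape of the Snapdragon S4 column. With a latency of 2 it saturates at two chains, like Cortex-A8. Raising `issue_width` would lift the ceiling above 1.00, which is not what was measured for 3xNEON through 5xNEON on Snapdragon S4.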

Bottom line. While integer-only code (with ILP of 3 and higher!) would run up to 40% faster on Snapdragon S4 [scaled to the same clock frequency], we cannot expect anything close to that for NEON code, at least not for the integer-arithmetic, non-VFP code of interest in a cryptography context. If there is an improvement for such NEON code, it's rather thanks to better timings of specific instructions and to out-of-order execution logic (for compiler-generated code). Otherwise it's not unlikely that you'll observe a regression, as chances are that the instructions were scheduled without regard to the higher NEON instruction latency. Also keep in mind that ILP can be limited by the algorithm, e.g. MD5 has limited ILP and as a result performs worse on Snapdragon S4, >25% slower [for integer-only compiler-generated code, see the 2x line for explanation].
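The percentages quoted above follow from the integer-only rows of the table if IPC is taken as a proxy for speed at equal clock frequency. A small sketch of that arithmetic; the dictionary `ipc` and the helper `relative_change` are hypothetical names used only for illustration:

```python
# (Snapdragon S4, Cortex-A8) IPC for the integer-only rows of the table
ipc = {
    "2x": (1.38, 1.86),
    "3x": (2.66, 1.90),
    "4x": (2.37, 1.93),
    "5x": (2.66, 1.94),
}


def relative_change(s4, a8):
    """Percent change of Snapdragon S4 IPC relative to Cortex-A8 IPC."""
    return (s4 / a8 - 1.0) * 100.0


if __name__ == "__main__":
    for row, (s4, a8) in ipc.items():
        print(f"{row}: {relative_change(s4, a8):+.1f}%")
```

The 3x row gives about +40%, which is where "up to 40% faster" comes from, and the 2x row gives about -26%, matching the ">25% slower" figure for code whose ILP is limited to 2.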

With more processors in the comparison matrix, one can argue that this page doesn't have much of a case. But the point was not to bash the processor in question, but rather the lack of useful and reliable information about its internals.