I find it unfortunate that Qualcomm leaves it to unaffiliated technical writers to describe its CPU internals, because the resulting information comes out inaccurate and misleading, at least for programmers such as myself. Take for example AnandTech (search for "VeNum"):
"Qualcomm calls its NEON engine VeNum and has increased its issue capabilities by 50%. Whereas Scorpion could only issue two NEON instructions in parallel, Krait can do three.
Qualcomm's NEON data paths are still 128-bits wide."
I didn't have the opportunity to benchmark the above-mentioned Scorpion CPU, but below is an ILP vs. IPC comparison table for Snapdragon S4 and
|ILP||IPC@Snapdragon S4||IPC@Cortex-A8||IPC@Cortex-A15||IPC@Apple A7|
As can be seen, Snapdragon S4 cannot issue more than one NEON instruction per cycle. Moreover, because of its apparently higher NEON instruction latency, a number of sequences were observed to deliver lower IPC than the corresponding ones achieved by
Bottom line. While integer-only code (with ILP of 3 and higher!) would run up to 40% faster on Snapdragon S4 [scaled to the same clock frequency], we cannot expect anything close to that for NEON code, at least not for the integer-arithmetic, non-VFP code of interest in a cryptography context. If there is an improvement for such NEON code, it is rather thanks to better timings of specific instructions and to out-of-order execution logic (for compiler-generated code). Otherwise it's not unlikely that you will observe a regression, as chances are that the instructions were scheduled without regard to the higher NEON instruction latency. Also keep in mind that ILP can be limited by the algorithm, e.g. MD5 has limited ILP and as a result performs worse on Snapdragon S4, >25% slower [for integer-only compiler-generated code, see the 2x line for explanation].
With more processors in the comparison matrix, one can argue that this page doesn't have much of a case. But the point was not to bash the processor in question, but rather the lack of useful and reliable information about its internals.