Draft: Try alternative cubic expression
This is an alternative cubic expression that should be less impacted by the high latency of fused multiply-add instructions on x86_64.
The cubic and bicubic interpolation of the distance integral is one of the code's hotspots. For a long time, we have been using Horner's method to evaluate the cubic polynomials for interpolation. Direct form is y(t) = a_0 t^3 + a_1 t^2 + a_2 t + a_3
for 5 multiplications and 3 additions. Horner's method is y(t) = (((a_0 t + a_1) t + a_2) t + a_3) t + a_0
for 3 multiplications and 3 additions.
Horner's method gets compiled down to only 3 fused multiply-add (VFMADDxxxxx
) instructions. However, each FMA instruction depends on the output of the previous one. On Skylake, for example, the FMA instructions have a fairly high latency of 4 cycles, so the CPU pipeline might not be kept as busy as it can possibly be (see Agner Fog's instruction tables).
The idea here is to express the cubic interpolant in a form that has fewer data dependencies and can keep the CPU pipeline more full.