# Draft: Try alternative cubic expression

This is an alternative cubic expression that should be less impacted by the high latency of fused multiply-add instructions on x86_64.

The cubic and bicubic interpolation of the distance integral is one of the code's hotspots. For a long time, we have been using Horner's method to evaluate the cubic polynomials for interpolation. Direct form is `y(t) = a_0 t^3 + a_1 t^2 + a_2 t + a_3`

for 5 multiplications and 3 additions. Horner's method is `y(t) = (((a_0 t + a_1) t + a_2) t + a_3) t + a_0`

for 3 multiplications and 3 additions.

Horner's method gets compiled down to only 3 fused multiply-add (`VFMADDxxxxx`

) instructions. However, each FMA instruction depends on the output of the previous one. On Skylake, for example, the FMA instructions have a fairly high latency of 4 cycles, so the CPU pipeline might not be kept as busy as it can possibly be (see Agner Fog's instruction tables).

The idea here is to express the cubic interpolant in a form that has fewer data dependencies and can keep the CPU pipeline more full.