Skip to content

Draft: Try alternative cubic expression

Leo Pound Singer requested to merge leo-singer/ligo.skymap:halfdirect into main

This is an alternative cubic expression that should be less impacted by the high latency of fused multiply-add instructions on x86_64.

The cubic and bicubic interpolation of the distance integral is one of the code's hotspots. For a long time, we have been using Horner's method to evaluate the cubic polynomials for interpolation. Direct form is y(t) = a_0 t^3 + a_1 t^2 + a_2 t + a_3 for 5 multiplications and 3 additions. Horner's method is y(t) = (((a_0 t + a_1) t + a_2) t + a_3) t + a_0 for 3 multiplications and 3 additions.

Horner's method gets compiled down to only 3 fused multiply-add (VFMADDxxxxx) instructions. However, each FMA instruction depends on the output of the previous one. On Skylake, for example, the FMA instructions have a fairly high latency of 4 cycles, so the CPU pipeline might not be kept as busy as it can possibly be (see Agner Fog's instruction tables).

The idea here is to express the cubic interpolant in a form that has fewer data dependencies and can keep the CPU pipeline more full.

Edited by Leo Pound Singer

Merge request reports