Precision errors in the SDF monitoring

Bugzilla ticket #951

From Jonathan Hanks 2015-11-23 13:47:52 PST

This entry is a record of a cds-software email.

While testing the slow controls monitoring via SDF, Dave noticed a few differences that surprised him. He found two channels with differences on the order of 1.0e-17. He could accept the values and go on. We thought it might have been a precision issue in burtrb (which he used to generate an initial OBSERVE.snap file). So we accepted, saved the .snap file, and restarted the SDF. The difference came back again.

The root cause of this appears to be the following:

.snap files are written with 15 digits of precision
the SDF (front end or the new channel access build) works with the full precision of a 64-bit double.

Whenever the .snap (or a burt file) is written all floating point values are written out with 15 digits of precision.

On the front end we don't see this because on startup the SDF system reads the safe.snap file and writes the values into the IOC/model, which essentially truncates the live value to 15 digits of precision. This works because you only restart the SDF system (the EPICS IOC) when you restart the model.

On the new channel access build of SDF we read in the safe.snap file but DO NOT write the values into EPICS (i.e., the Beckhoff system). So each time we start the SDF system we limit our reference value to 15 digits of precision, but the Beckhoff system may hold a value with more precision in memory, which shows up as a difference.
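The effect described above can be reproduced outside of SDF. The following is a minimal sketch, assuming the .snap writer formats doubles with "%.15e" (the BURT format string quoted later in this ticket). The helper names and the sample value are hypothetical and chosen only for illustration; this is not the SDF code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write a double the way a 15-digit .snap file would
   ("%.15e" is an assumption), then parse the text back into a double. */
static double snap_round_trip_15(double value)
{
    char text[64];
    snprintf(text, sizeof text, "%.15e", value);
    return strtod(text, NULL);
}

/* Difference a monitor would see between the live value and the reference
   value re-read from the 15-digit text. */
static double snap_residual_15(double live)
{
    return live - snap_round_trip_15(live);
}
```

For a live value of 0.30000000000000004 the text written is 3.000000000000000e-01 and the residual is about 5.6e-17, comparable to the differences Dave saw. A value such as 0.5, which needs far fewer digits, gives a residual of exactly zero.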

There are a few solutions:

1. Do nothing. Accept the values whenever we restart the SDF system.
2. Add more digits to the .snap file. At this point we may lose BURT compatibility and will introduce a one-time occurrence of a large number of differences into the SVN as new safe and OBSERVE files are saved.
3. Allow an epsilon value in the difference comparison. A channel is thus defined as not having a difference when the difference is <= 1.0e-16 or so.

For right now we are going with option 1 (do nothing); we are seeing this on two channels that we are monitoring.
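The epsilon comparison proposed above could be sketched as follows. This is only an illustration: the function name and the macro are hypothetical, and the tolerance is simply the "1.0e-16 or so" figure from the list.

```c
#include <math.h>
#include <stdbool.h>

/* Assumed tolerance, taken from the "1.0e-16 or so" suggested above. */
#define SDF_ABS_EPSILON 1.0e-16

/* Hypothetical comparison: report a difference only when the live value is
   more than SDF_ABS_EPSILON away from the reference value. */
static bool differs_abs(double reference, double live)
{
    return fabs(live - reference) > SDF_ABS_EPSILON;
}
```

With this check, a reference of 0.3 and a live value of 0.30000000000000004 would not be flagged. Note that a fixed epsilon implicitly assumes values of order one: for magnitudes of 2.0 and above, adjacent doubles are already more than 1.0e-16 apart, so a one-bit difference would still be reported.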

Patrick Thomas notes:

If you decide to change the precision in the snapshot file, in Conlog testing I am essentially doing:

    #include <float.h>
    printf("%.*e", FLT_DECIMAL_DIG - 1, float_value);
    printf("%.*e", DBL_DECIMAL_DIG - 1, double_value);

to convert floats and doubles to strings without losing precision.

http://stackoverflow.com/questions/32640620/ieee-floating-point-number-to-exact-base10-character-string
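As a check on the approach Patrick describes, the sketch below formats a value with FLT_DECIMAL_DIG - 1 or DBL_DECIMAL_DIG - 1 digits after the decimal point, parses the text back, and returns the recovered value. The helper names are hypothetical, and the fallback definitions are an assumption for pre-C11 toolchains that lack these macros.

```c
#include <float.h>
#include <stdio.h>
#include <stdlib.h>

/* FLT_DECIMAL_DIG and DBL_DECIMAL_DIG were added to <float.h> in C11;
   fall back to the IEEE 754 values if an older toolchain lacks them. */
#ifndef FLT_DECIMAL_DIG
#define FLT_DECIMAL_DIG 9
#endif
#ifndef DBL_DECIMAL_DIG
#define DBL_DECIMAL_DIG 17
#endif

/* Hypothetical helper: format a double with enough digits to round-trip,
   then parse the text back. */
static double round_trip_double(double value)
{
    char text[64];
    snprintf(text, sizeof text, "%.*e", DBL_DECIMAL_DIG - 1, value);
    return strtod(text, NULL);
}

/* Same idea for a float (promoted to double when passed to snprintf). */
static float round_trip_float(float value)
{
    char text[64];
    snprintf(text, sizeof text, "%.*e", FLT_DECIMAL_DIG - 1, (double)value);
    return strtof(text, NULL);
}
```

On a platform with IEEE 754 types and a correctly rounding strtod/strtof, both helpers should return exactly the value they were given, so a reference re-read from such text would compare equal to the live value.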

From Jonathan: 2016-01-13 14:41:21 PST

This should be fixed in trunk at r4097.

For regular front end builds there is no change. For CA_SDF builds we have increased the precision to 20 decimal places.

From Keith: 2017-02-14 14:30:16 PST

Hmm. Still, we build all our own EPICS stuff now, so why don't we patch the 'burt' source so it uses FLT_DECIMAL_DIG, etc. instead of hardcoding the precision, as in

from burtcommon.h

    /* the following 2 must agree ... (-x.)+6+(+exx)=13 */
    #define FLOATFORMATSTRING "%.6e"
    #define FLOATSTRINGLENGTH 13

    /* the following 2 must agree ... (-x.)+15+(+exx)=22 */
    #define DOUBLEFORMATSTRING "%.15e"
    #define DOUBLESTRINGLENGTH 22

** Yes, this is why the burt code only writes 15 digits after the decimal point for a double. There is no reason it can't be more.
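Because each format string has to agree with its length constant, any change in precision also changes the required length. The sketch below is not a patch to burt; it uses a hypothetical helper to measure how many characters "%.*e" produces for a given number of digits.

```c
#include <float.h>
#include <stdio.h>

/* Hypothetical helper: number of characters "%.*e" produces for `value`
   with `digits` digits after the decimal point (excluding the trailing NUL).
   Calling snprintf with a NULL buffer and size 0 only computes the length. */
static int e_format_width(int digits, double value)
{
    return snprintf(NULL, 0, "%.*e", digits, value);
}
```

With a two-digit exponent, e_format_width(6, -1.5) is 13 and e_format_width(15, -1.5) is 22, matching the two pairs above, while 16 digits (DBL_DECIMAL_DIG - 1) gives 23. A three-digit exponent adds one more character, so e_format_width(16, -DBL_MAX) is 24.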

From Keith: 2017-02-14 14:35:33 PST

FLT_DECIMAL_DIG and DBL_DECIMAL_DIG (from float.h, loaded by cfloat.h) appear to be defined as 9 for IEEE float and 17 for IEEE double.

I have EPICS 3.15.4 about ready to go. We could add this patch in and then be OK for any newly built front-end models (I think).

From Keith: 2017-04-20 06:19:36 PDT

The issue is not solved.

See https://services.ligo-la.caltech.edu/FRS/show_bug.cgi?id=7379 for examples.

Still seen in PSL FSS, etc.

-- As I detailed, we likely need to increase the number of digits in both the BURT and SDF snap files, following the IEEE guidance in float.h.

Steps

Since BURT is EOL, apply a CDS patch so that its output matches the IEEE guidance (17 digits for double precision).
Change the SNAP files to use 17 digits (instead of 15).

From Stuart Aston 2018-04-25 12:30:48 PDT

The SDF precision error is still present at LLO, for example on the l1iopasc0 model (r4743); see LLO aLOG entry 38776:

https://alog.ligo-la.caltech.edu/aLOG/index.php?callRep=38776

From Jonathan 2018-04-25 12:42:27 PDT

Stuart, Keith,

The fix we put in handles differences at startup. I'm not sure that extending the precision would help us in this case.

For the FE/models we have 15 digits of precision, which is less than can be fully expressed in the data type. However, the models do a hard reset of the values to the snap file, so the difference in precision is nullified at startup.

The reason we needed a change in the storage format was that the CA SDF system does not set the values on startup (as the lifetime of the SDF system in this case is separate from the lifetime of the systems being controlled).

What we are seeing here is likely a series of small modifications to the channel after startup that are not setting the channel back to its safe value. In this case precision isn't the issue. To solve this we should agree on some small delta that is allowed between the safe.snap value and the current value before it is declared to differ.

From Keith: 2018-04-25 15:04:46 PDT

A problem is that in some cases (calibration), this number can be inherently less than 1e-16. So it is not the specific difference that is important, but instead the fractional difference in the value. Yes, we likely need to rethink this.
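A fractional (relative) comparison of the kind suggested above could look like the following sketch. The function name and the idea of scaling by the larger magnitude are assumptions made for illustration; no such check is confirmed to exist in SDF, and the tolerance would still have to be agreed on.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical comparison: report a difference only when the two values
   differ by more than rel_tol times the larger of their magnitudes. */
static bool differs_rel(double reference, double live, double rel_tol)
{
    double a = fabs(reference);
    double b = fabs(live);
    double scale = (a > b) ? a : b;
    return fabs(live - reference) > rel_tol * scale;
}
```

With an assumed rel_tol of 1.0e-15, a reference of 0.3 and a live value of 0.30000000000000004 are treated as equal, while 1.0e-20 versus 2.0e-20 is reported as a difference even though the absolute difference is far below 1e-16. Two values that are both exactly zero compare as equal.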

Edited Feb 13, 2020 by Erik von Reis