Large Test Stand (DTS1) Full MX Testing.

Overview

The entire large test stand had MX adapters and switches purchased for it. The test stand was converted to MX from IX, and testing commenced.

Aug 1

It seems we have been able to isolate the long RTTs to the long fiber. Tests were done where we bypassed the Adnaco bus extender with the same FE/DHA/DSwitch, and they passed. We put the bus extender back in on a short fiber, and that also didn't show unreasonably long RTTs.

Aug 2

  • Had Will observe the LEDs on the Adnaco bus extender while the test had errors; no LEDs reported issues.
  • Old cards from the first set of three were swapped in, and we still had the long RTTs on the 4 km link.
  • Driver was downgraded from 5.21.4 to 5.21.2 and we still had the issue.

Aug 5

  • Configured MPS to be 128 max, no change in long cycles.
  • Disabled link watchdog ( linkWatchdogEnabled=0; in dis_irm.conf ), no change in long cycles.
  • Disabled session heartbeats ( sessionHeartbeatsEnabled=0 ), no change in long cycles.
  • Made other Dolphin config file changes and re-tested, no change in long cycles.
  • Adnaco adapter back to Gen 2 connection, no change in long cycles.

Aug 6

  • The long RTTs happen on both the EX and EY legs, so we converted the EX leg back to IX as well as two FEs. Benchmark shows no issue with the PCIe extender with IX.
  • Tested (again?) with just the EY card in the CDSRFM and one EY FE, still had long RTTs.
  • Filled out a bug report with Dolphin.

Aug 12

  • Got a new dev build from Dolphin: Dolphin IRM (GX) 5.22.0-d June 19th 2024 (rev 5f41689548).
  • Re-tested, but the LR Dolphin link had the same issues.

Aug 22

  • Using the Dolphin IRM (GX) 5.22.0-d June 19th 2024 (rev 5f41689548) dev build, we have been seeing some CS short link IPC errors. Started tracking on this page under CS Short Link Stability Testing

Test TODOs

Test Name | Complete? | Notes
mitigations off | | No change in long cycles
MPS of 512 too big? | | No change in long cycles

The setup is pictured below

image

The CS and EY Dolphin networks are using the new MXS924 switch while the EX network is using the MXS824 switch.

Model Testing

Models were rebuilt with the wait-for-IPC functionality added, which caused the IX setup to be completely free of IPC errors. However, the MX setup had periodic RFM IPC errors reported on EX and EY.

image

LR Stability Testing

Initial testing of MX does not show stability for even 1+ hours.

x2lsc (server) <-> x2oaf0 (client) Round Trip Time (RTT) Testing

Even after the changes to the dishosts.conf file (now using the correct Gen 4, x4 config), periodic slowdowns are still present.

Configuration

System | CPU | Driver Version | Full Config Dump
x2oaf | W-2275 CPU @ 3.30GHz | 5.21.2 | here
x2lac | W-3323 CPU @ 3.50GHz | 5.21.2 | here

Result, Network Manager Running, Other nodes on

Some (assumed) slow RTTs after ~ 2 hours. *

[Tue Jul 30 10:26:15 2024] rts_cpu_isolator: calling LIGO code
[Tue Jul 30 10:26:21 2024] smpboot: CPU 1 didn't die...
[Tue Jul 30 12:29:16 2024] dolphin_client: INFO - Got long RTT of 30486
[Tue Jul 30 12:29:16 2024] dolphin_client: INFO - Got long RTT of 26259
[Tue Jul 30 12:29:17 2024] dolphin_client: INFO - Got long RTT of 6339
[Tue Jul 30 12:29:17 2024] dolphin_client: INFO - Got long RTT of 37550
[Tue Jul 30 12:29:51 2024] dolphin_client: INFO - Got long RTT of 10278
[Tue Jul 30 12:29:51 2024] dolphin_client: INFO - Got long RTT of 47321
[Tue Jul 30 12:44:25 2024] dolphin_client: INFO - Got long RTT of 86171
[Tue Jul 30 12:44:25 2024] dolphin_client: INFO - Got long RTT of 69650
[Tue Jul 30 12:45:46 2024] min: 1467, max: 86171, mean: 2143
[Tue Jul 30 12:45:46 2024] <2000 : 3967847
[Tue Jul 30 12:45:46 2024] [2000, 2500) : 3806637209
[Tue Jul 30 12:45:46 2024] [2500, 3000) : 229
[Tue Jul 30 12:45:46 2024] [3000, 3500) : 2
[Tue Jul 30 12:45:46 2024] [3500, 4000) : 1
[Tue Jul 30 12:45:46 2024] [4000, 5000) : 0
[Tue Jul 30 12:45:46 2024] [5000, 6000) : 0
[Tue Jul 30 12:45:46 2024] [6000, 7000) : 1
[Tue Jul 30 12:45:46 2024] [7000, 8000) : 0
[Tue Jul 30 12:45:46 2024] [8000, 9000) : 0
[Tue Jul 30 12:45:46 2024] [9000, 10000) : 0
[Tue Jul 30 12:45:46 2024] [10000, 11000) : 1
[Tue Jul 30 12:45:46 2024] [11000, 12000) : 0
[Tue Jul 30 12:45:46 2024] [12000, 13000) : 0
[Tue Jul 30 12:45:46 2024] [13000, 14000) : 0
[Tue Jul 30 12:45:46 2024] [14000, 15000) : 0
[Tue Jul 30 12:45:46 2024] [15000, 16000) : 0
[Tue Jul 30 12:45:46 2024] [16000, 17000) : 0
[Tue Jul 30 12:45:46 2024] [17000, 18000) : 0
[Tue Jul 30 12:45:46 2024] [18000, 20000) : 0
[Tue Jul 30 12:45:46 2024] [20000, 22000) : 0
[Tue Jul 30 12:45:46 2024] [22000, 25000) : 0
[Tue Jul 30 12:45:46 2024] [25000, 30000) : 1
[Tue Jul 30 12:45:46 2024] >30000 : 5
[Tue Jul 30 12:45:46 2024] rts_cpu_isolator: LIGO code is done, calling regular shutdown code
[Tue Jul 30 12:45:46 2024] dolphin_client: INFO - Count was 3810605296, err_cnt: 0, total time (s): 2791
[Tue Jul 30 12:45:46 2024] dolphin_client: INFO - Estimated bandwidth (MB/s): 10
[Tue Jul 30 12:45:46 2024] dolphin_client: INFO - Histogram Of All Latencies (ns)
[Tue Jul 30 12:45:46 2024] dolphin_client: INFO - Any latency over 6000 ns:
[Tue Jul 30 12:45:46 2024] dolphin_client: INFO - RTT: 30486, at offset (us) 1801853483
[Tue Jul 30 12:45:46 2024] dolphin_client: INFO - RTT: 26259, at offset (us) 1801856872
[Tue Jul 30 12:45:46 2024] dolphin_client: INFO - RTT: 6339, at offset (us) 1801917581
[Tue Jul 30 12:45:46 2024] dolphin_client: INFO - RTT: 37550, at offset (us) 1801934914
[Tue Jul 30 12:45:46 2024] dolphin_client: INFO - RTT: 10278, at offset (us) 1836033655
[Tue Jul 30 12:45:46 2024] dolphin_client: INFO - RTT: 47321, at offset (us) 1836784994
[Tue Jul 30 12:45:46 2024] dolphin_client: INFO - RTT: 86171, at offset (us) 2710817881
[Tue Jul 30 12:45:46 2024] dolphin_client: INFO - RTT: 69650, at offset (us) 2710822126
[Tue Jul 30 12:45:46 2024] dolphin_client: INFO - It took 11 ms for the RT code to exit.
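
The histogram rows in these dumps are regular enough to check mechanically. Below is a minimal sketch, not part of dolphin_client, that parses the non-zero rows of the Jul 30 run above and counts the samples in bins starting at or above a threshold. The helper names (parse_histogram, count_at_or_above) are hypothetical and the row format is inferred only from the dumps on this page.

```python
import re

# Histogram rows look like "[2000, 2500) : 3806637209", "<2000 : 3967847" or ">30000 : 5".
BIN_RE = re.compile(r"(?:\[(\d+), (\d+)\)|([<>])(\d+)) : (\d+)\s*$")


def parse_histogram(lines):
    """Return a list of (low_ns, high_ns, count); open-ended bins use None."""
    bins = []
    for line in lines:
        m = BIN_RE.search(line)
        if not m:
            continue  # not a histogram row (e.g. min/max/mean or INFO lines)
        low, high, sign, edge, count = m.groups()
        if sign == "<":
            bins.append((None, int(edge), int(count)))
        elif sign == ">":
            bins.append((int(edge), None, int(count)))
        else:
            bins.append((int(low), int(high), int(count)))
    return bins


def count_at_or_above(bins, threshold_ns):
    """Sum the counts of every bin whose lower edge is >= threshold_ns."""
    return sum(c for low, _, c in bins if low is not None and low >= threshold_ns)


# Non-zero rows copied from the Jul 30 run above.
SAMPLE = """\
[Tue Jul 30 12:45:46 2024] min: 1467, max: 86171, mean: 2143
[Tue Jul 30 12:45:46 2024] <2000 : 3967847
[Tue Jul 30 12:45:46 2024] [2000, 2500) : 3806637209
[Tue Jul 30 12:45:46 2024] [2500, 3000) : 229
[Tue Jul 30 12:45:46 2024] [3000, 3500) : 2
[Tue Jul 30 12:45:46 2024] [3500, 4000) : 1
[Tue Jul 30 12:45:46 2024] [6000, 7000) : 1
[Tue Jul 30 12:45:46 2024] [10000, 11000) : 1
[Tue Jul 30 12:45:46 2024] [25000, 30000) : 1
[Tue Jul 30 12:45:46 2024] >30000 : 5
"""

bins = parse_histogram(SAMPLE.splitlines())
total = sum(count for _, _, count in bins)
long_rtts = count_at_or_above(bins, 6000)
print(f"total samples: {total}, samples at or above 6000 ns: {long_rtts}")
```

For this run the bin total equals the reported "Count was 3810605296", and the eight samples at or above 6000 ns match the eight entries listed under "Any latency over 6000 ns".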

x2lsc (server) <-> x2oaf0 (client) Round Trip Time (RTT) Testing

This time the network manager was turned off, more IPCs were written, and drivers were at 5.21.4. Same 5.10.149-2 kernel as above. Run command: sudo insmod dolphin_client.ko NUM_COUNTERS_OPEN=50 CFLUSH_INDIVIDUALLY=1

Results

Almost 4 hours, no long RTT.

[Wed Jul 31 12:59:42 2024] smpboot: CPU 1 didn't die...
[Wed Jul 31 16:41:14 2024] min: 2752, max: 10116, mean: 5399
[Wed Jul 31 16:41:14 2024] <2000 : 0
[Wed Jul 31 16:41:14 2024] [2000, 2500) : 0
[Wed Jul 31 16:41:14 2024] [2500, 3000) : 263975192
[Wed Jul 31 16:41:14 2024] [3000, 3500) : 7239843264
[Wed Jul 31 16:41:14 2024] [3500, 4000) : 7045206459
[Wed Jul 31 16:41:14 2024] [4000, 5000) : 12121402530
[Wed Jul 31 16:41:14 2024] [5000, 6000) : 10910298729
[Wed Jul 31 16:41:14 2024] [6000, 7000) : 10904439281
[Wed Jul 31 16:41:14 2024] [7000, 8000) : 10930136117
[Wed Jul 31 16:41:14 2024] [8000, 9000) : 1203056317
[Wed Jul 31 16:41:14 2024] [9000, 10000) : 259
[Wed Jul 31 16:41:14 2024] [10000, 11000) : 2
[Wed Jul 31 16:41:14 2024] [11000, 12000) : 0
[Wed Jul 31 16:41:14 2024] [12000, 13000) : 0
[Wed Jul 31 16:41:14 2024] [13000, 14000) : 0
[Wed Jul 31 16:41:14 2024] [14000, 15000) : 0
[Wed Jul 31 16:41:14 2024] [15000, 16000) : 0
[Wed Jul 31 16:41:14 2024] [16000, 17000) : 0
[Wed Jul 31 16:41:14 2024] [17000, 18000) : 0
[Wed Jul 31 16:41:14 2024] [18000, 20000) : 0
[Wed Jul 31 16:41:14 2024] [20000, 22000) : 0
[Wed Jul 31 16:41:14 2024] [22000, 25000) : 0
[Wed Jul 31 16:41:14 2024] [25000, 30000) : 0
[Wed Jul 31 16:41:14 2024] >30000 : 0
[Wed Jul 31 16:41:14 2024] rts_cpu_isolator: LIGO code is done, calling regular shutdown code
[Wed Jul 31 16:41:14 2024] dolphin_client: INFO - Count was 60618358150, err_cnt: 0, total time (s): 2148
[Wed Jul 31 16:41:14 2024] dolphin_client: INFO - Estimated bandwidth (MB/s): 225
[Wed Jul 31 16:41:14 2024] dolphin_client: INFO - Histogram Of All Latencies (ns)
[Wed Jul 31 16:41:14 2024] dolphin_client: INFO - Any latency over 15000 ns:
[Wed Jul 31 16:41:14 2024] dolphin_client: INFO - It took 9 ms for the RT code to exit.

x2cdsrfm (server) <-> x2oaf0 (client) Round Trip Time (RTT) Testing

This is the short link on the CDSRFM; the NM was running and all Dolphin nodes were on. We observed ~14 hours of stability. Client command: sudo insmod dolphin_client.ko

Configuration

System | CPU | Driver Version | Full Config Dump
x2cdsrfm | W-3323 CPU @ 3.50GHz | 5.21.4 | here

Results, no long RTT

[Wed Jul 31 19:11:00 2024] rts_cpu_isolator: calling LIGO code
[Wed Jul 31 19:11:06 2024] smpboot: CPU 1 didn't die...
[Thu Aug  1 04:38:34 2024] perf: interrupt took too long (2505 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
[Thu Aug  1 09:10:52 2024] min: 1477, max: 3524, mean: 2144
[Thu Aug  1 09:10:52 2024] <2000 : 12310
[Thu Aug  1 09:10:52 2024] [2000, 2500) : 22953645156
[Thu Aug  1 09:10:52 2024] [2500, 3000) : 300
[Thu Aug  1 09:10:52 2024] [3000, 3500) : 1
[Thu Aug  1 09:10:52 2024] [3500, 4000) : 1
[Thu Aug  1 09:10:52 2024] [4000, 5000) : 0
[Thu Aug  1 09:10:52 2024] [5000, 6000) : 0
[Thu Aug  1 09:10:52 2024] [6000, 7000) : 0
[Thu Aug  1 09:10:52 2024] [7000, 8000) : 0
[Thu Aug  1 09:10:52 2024] [8000, 9000) : 0
[Thu Aug  1 09:10:52 2024] [9000, 10000) : 0
[Thu Aug  1 09:10:52 2024] [10000, 11000) : 0
[Thu Aug  1 09:10:52 2024] [11000, 12000) : 0
[Thu Aug  1 09:10:52 2024] [12000, 13000) : 0
[Thu Aug  1 09:10:52 2024] [13000, 14000) : 0
[Thu Aug  1 09:10:52 2024] [14000, 15000) : 0
[Thu Aug  1 09:10:52 2024] [15000, 16000) : 0
[Thu Aug  1 09:10:52 2024] [16000, 17000) : 0
[Thu Aug  1 09:10:52 2024] [17000, 18000) : 0
[Thu Aug  1 09:10:52 2024] [18000, 20000) : 0
[Thu Aug  1 09:10:52 2024] [20000, 22000) : 0
[Thu Aug  1 09:10:52 2024] [22000, 25000) : 0
[Thu Aug  1 09:10:52 2024] [25000, 30000) : 0
[Thu Aug  1 09:10:52 2024] >30000 : 0
[Thu Aug  1 09:10:52 2024] rts_cpu_isolator: LIGO code is done, calling regular shutdown code
[Thu Aug  1 09:10:52 2024] dolphin_client: INFO - Count was 22953657768,  total time (s): 254
[Thu Aug  1 09:10:52 2024] dolphin_client: INFO - Estimated bandwidth (MB/s): 720
[Thu Aug  1 09:10:52 2024] dolphin_client: INFO - Histogram Of All Latencies (ns)
[Thu Aug  1 09:10:52 2024] dolphin_client: INFO - Any latency over 15000 ns:
[Thu Aug  1 09:10:52 2024] dolphin_client: INFO - It took 8 ms for the RT code to exit.

x2cdsrfm (server) <-> x2iscex (client) Round Trip Time (RTT) Testing

This is over the 4 km link, with H1A de-rated to Gen 1.

Configuration

System | CPU | Driver Version | Full Config Dump
x2cdsrfm | W-3323 CPU @ 3.50GHz | 5.21.4 | here
x2iscex | W-2245 CPU @ 3.90GHz | 5.21.4 | here
[67323.881494] rts_cpu_isolator: calling LIGO code
[67329.624247] smpboot: CPU 1 didn't die...
[68036.626256] min: 42834, max: 81596, mean: 43551
[68036.626257] <36000 : 0
[68036.626257] [36000, 38000) : 0
[68036.626257] [38000, 40000) : 0
[68036.626258] [40000, 42000) : 0
[68036.626258] [42000, 44000) : 16119566
[68036.626259] [44000, 46000) : 5
[68036.626259] [46000, 48000) : 0
[68036.626259] [48000, 50000) : 2
[68036.626260] [50000, 55000) : 9
[68036.626260] [55000, 60000) : 8
[68036.626260] [60000, 65000) : 7
[68036.626261] [65000, 70000) : 6
[68036.626261] [70000, 75000) : 12
[68036.626261] [75000, 80000) : 12
[68036.626262] [80000, 85000) : 2
[68036.626262] [85000, 90000) : 0
[68036.626262] [90000, 100000) : 0
[68036.626263] >100000 : 0
[68036.626390] dolphin_client: INFO - Count was 16119629,  total time (s): 702
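
As a rough sanity check on the ~43 us floor seen on this link, the sketch below estimates the round-trip propagation delay through the fiber alone. The group index (1.468, a typical figure for single-mode fiber) and the exact 4000 m length are assumptions; the installed fiber length is not recorded on this page. The two measured minimums are taken from the 4 km run above and from the short-fiber run with the bus extender.

```python
C_VACUUM_M_PER_S = 299_792_458  # speed of light in vacuum (exact, SI definition)
GROUP_INDEX = 1.468             # ASSUMED typical value for single-mode fiber
FIBER_LENGTH_M = 4000           # ASSUMED nominal length of the "4 km" link


def fiber_round_trip_ns(length_m, group_index=GROUP_INDEX):
    """Round-trip propagation delay through the fiber, in nanoseconds."""
    one_way_s = length_m * group_index / C_VACUUM_M_PER_S
    return 2 * one_way_s * 1e9


propagation_ns = fiber_round_trip_ns(FIBER_LENGTH_M)

# Minimum RTTs reported by dolphin_client on this page.
min_rtt_4km_ns = 42834      # x2cdsrfm <-> x2iscex over the 4 km link
min_rtt_short_ns = 3102     # same pair, short fiber with the Adnaco bus extender

extra_ns = min_rtt_4km_ns - min_rtt_short_ns
print(f"estimated fiber round trip: {propagation_ns:.0f} ns")
print(f"measured extra RTT on long fiber: {extra_ns} ns")
```

Under these assumptions the fiber alone accounts for roughly 39 us of round trip, close to the ~39.7 us difference between the two measured minimums, so the baseline RTT on the long link looks like plain propagation delay. The occasional 50-85 us samples are what remain unexplained.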

x2cdsrfm (server) <-> x2iscex (client) Round Trip Time (RTT) Testing

This configuration has a short fiber with the Adnaco bus extender.

[Thu Aug  1 15:52:01 2024] rts_cpu_isolator: calling LIGO code
[Thu Aug  1 15:52:07 2024] smpboot: CPU 1 didn't die...
[Thu Aug  1 16:14:19 2024] min: 3102, max: 5338, mean: 3205
[Thu Aug  1 16:14:19 2024] <36000 : 409014333
[Thu Aug  1 16:14:19 2024] [36000, 38000) : 0
[Thu Aug  1 16:14:19 2024] [38000, 40000) : 0
[Thu Aug  1 16:14:19 2024] [40000, 42000) : 0
[Thu Aug  1 16:14:19 2024] [42000, 44000) : 0
[Thu Aug  1 16:14:19 2024] [44000, 46000) : 0
[Thu Aug  1 16:14:19 2024] [46000, 48000) : 0
[Thu Aug  1 16:14:19 2024] [48000, 50000) : 0
[Thu Aug  1 16:14:19 2024] [50000, 55000) : 0
[Thu Aug  1 16:14:19 2024] [55000, 60000) : 0
[Thu Aug  1 16:14:19 2024] [60000, 65000) : 0
[Thu Aug  1 16:14:19 2024] [65000, 70000) : 0
[Thu Aug  1 16:14:19 2024] [70000, 75000) : 0
[Thu Aug  1 16:14:19 2024] [75000, 80000) : 0
[Thu Aug  1 16:14:19 2024] [80000, 85000) : 0
[Thu Aug  1 16:14:19 2024] [85000, 90000) : 0
[Thu Aug  1 16:14:19 2024] [90000, 100000) : 0
[Thu Aug  1 16:14:19 2024] >100000 : 0
[Thu Aug  1 16:14:19 2024] rts_cpu_isolator: LIGO code is done, calling regular shutdown code
[Thu Aug  1 16:14:19 2024] dolphin_client: INFO - Count was 409014333,  total time (s): 1328
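
One quick consistency check on the two x2iscex runs is to compare the reported mean RTT with the mean implied by the sample count and total run time. This is only a sketch: it assumes the client issues one round trip at a time, back to back, so that count times mean RTT should roughly equal the run time. That assumption is not confirmed here, and the helper name is hypothetical.

```python
# Values copied from the two x2cdsrfm <-> x2iscex dumps on this page.
RUNS = {
    "4 km link": {"count": 16119629, "total_s": 702, "mean_ns": 43551},
    "short fiber + extender": {"count": 409014333, "total_s": 1328, "mean_ns": 3205},
}


def implied_mean_ns(count, total_s):
    """Mean time per round trip if samples were taken back to back (ASSUMPTION)."""
    return total_s * 1e9 / count


results = {}
for name, run in RUNS.items():
    implied = implied_mean_ns(run["count"], run["total_s"])
    rel_diff = abs(implied - run["mean_ns"]) / run["mean_ns"]
    results[name] = (implied, rel_diff)
    print(f"{name}: reported {run['mean_ns']} ns, implied {implied:.1f} ns, "
          f"difference {rel_diff:.2%}")
```

Both runs agree to within about 2%, which suggests the count, total time, and mean in these two dumps are self-consistent.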

CS Short Link Stability Testing

Test

No network manager running, and disable_force_link_training=1; set in dis_irm.conf. After a manual configure of Dolphin and start of the models, there were no "PTE Write access is not set" errors in the first hour.

Results

Ran from Thu Aug 22, 12 PM to Mon Aug 26, 8 AM (~92 hours) with no errors besides the long-range ones. The test was left running; as of Tue Aug 27, 10 AM (~118 hours) there were still no short link errors.
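
The hour totals quoted above can be reproduced from the checkpoints with a few lines. The year 2024 is taken from the log timestamps elsewhere on this page, and the times are assumed to be on the hour.

```python
from datetime import datetime

# Checkpoints from the result text; year 2024 and on-the-hour times are assumptions.
start = datetime(2024, 8, 22, 12, 0)         # Thu Aug 22, 12 PM
first_check = datetime(2024, 8, 26, 8, 0)    # Mon Aug 26, 8 AM
second_check = datetime(2024, 8, 27, 10, 0)  # Tue Aug 27, 10 AM


def hours_between(begin, end):
    """Elapsed wall-clock hours between two datetimes."""
    return (end - begin).total_seconds() / 3600


hours_first = hours_between(start, first_check)
hours_second = hours_between(start, second_check)
print(f"first check: {hours_first:.0f} h, second check: {hours_second:.0f} h")
```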

Edited by Ezekiel Dohmen