Dolphin node 4 not working

Error code is 536874779.

In hex that is 0x20000f1b.

The high part, 0x20000000, marks it as a local SCI error.

The low part, 0xf1b, is ESCI_BUSY:

 ESCI_BUSY                        = _SCI_ERROR(0xF1B)
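The decoding above can be sketched in C. This is only an illustration: the helper names and the two mask constants are made up here, and the split into a top-nibble class and a low code is an assumption based on the values in these notes, not taken from the Dolphin headers.

```c
#include <stdint.h>

/* Assumed layout, inferred from the notes above (not from the Dolphin headers):
 * the top nibble carries the error class, the remaining bits the error code. */
#define ERR_CLASS_MASK  0xF0000000u  /* hypothetical name */
#define ERR_CODE_MASK   0x0FFFFFFFu  /* hypothetical name */

#define LOCAL_SCI_ERROR 0x20000000u  /* "local SCI error" class from the notes */
#define ESCI_BUSY_CODE  0x00000F1Bu  /* low part of ESCI_BUSY from the notes */

/* Return the class bits of an error value. */
static uint32_t err_class(uint32_t err)
{
    return err & ERR_CLASS_MASK;
}

/* Return the code bits of an error value. */
static uint32_t err_code(uint32_t err)
{
    return err & ERR_CODE_MASK;
}
```

With these helpers, err_class(536874779u) gives 0x20000000 and err_code(536874779u) gives 0xF1B, matching the manual decoding.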

Dear Erik,

On 11.10.2021 19.38, Erik von Reis wrote:

Dear Dolphinics,

My name is Erik von Reis. I work as a software developer at LIGO Hanford.

We are running an IX Dolphin network, version 5.17. The network is 26 nodes. Each node is controlled by a Linux kernel module. We use the broadcast feature.

Node number 4 does not work for us. There are some hints in the IX header files that node 4 might be the broadcast node, but it's not clear to us.

The error occurs in this code (error handling removed):

    err = sci_create_segment( NO_BINDING, 0, ii, DIS_BROADCAST, IPC_TOTAL_ALLOC_SIZE, create_segment_callback, 0, &segment[ ii ] );
    err = sci_set_local_segment_available( segment[ ii ], 0 );
    err = sci_export_segment( segment[ ii ], 0, DIS_BROADCAST );

sci_export_segment always returns an ESCI_BUSY error (0x20000f1b), even if the kernel module sleeps for 4 seconds.

create_segment_callback() returns 0 and does nothing else.

This same code works for nodes 8 to 104.

Is there a way to make node 4 work in this case?

Alternatively, is there a way to extend the maximum node number to a higher level?

There is nothing special about the node with NodeId 4, except that it is an NTB-capable node (as opposed to the last 2 of the 26 nodes, which are only reflective-memory capable due to hardware limitations). In this regard, NodeId 4 should behave identically to, say, NodeId 8.

Our developers suggest that you try to reproduce this behavior with one of the reflective-memory applications that come with the Dolphin eXpressWare installation, reflective_bench for instance. Its source code is available in the sisci-devel package, which can be installed from the packages built by the installer. If reflective_bench behaves differently from your application, you can review and rebuild the source to see how your implementation differs from ours.

If possible in your application, also perhaps shuffle the writer/receiver roles around on a smaller set of participating nodes to see if the issue remains with the NodeId 4 node.

If there is an issue that remains with NodeId 4, please review kernel output from dmesg and driver status with dis_diag (remember the -V 9 parameter to review error counters and similar), and see if you can narrow the issue down.

Please also make sure that the issue persists after a cluster-wide reboot, and that the switches have also been power-cycled as part of this. We've made several software releases since your 5.17 installation, and while we have not focused much on the IX platform, general bugs may have been resolved that would benefit your installation.

Best wishes, -S

Thank you for your help, Erik von Reis

Pci-support-list mailing list Pci-support-list@nypostle.dolphinics.no http://nypostle.dolphinics.no/mailman/listinfo/pci-support-list

-- Simen Timian Thoresen, Dolphin Interconnect Solutions

sci_export_segment() is a red herring. The ESCI_BUSY error only appears from the second time you start the model onward, because the segment was already exported. Maybe this is a problem with our back-out code on error.

The real error is three calls further down, at the memory mapping. It turns out we pass node 4 to the connect call. We may be able to pass arbitrary node numbers. 252 works. This is the node ID used for broadcast in the sisci driver. The node number just can't be the same as the node you're on.

Working now in the test stand with oaf1 as node 4. Interestingly, models started with '4' work fine alongside models started with '252'. If the argument doesn't matter, why does it cause a problem? My only guess is that the check that refuses same-node arguments comes before the check for broadcast.
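The constraint described above can be sketched in C as follows. This is a hypothetical illustration only: the helper names are invented here and are not part of the sisci driver or of our kernel module. The only facts taken from the notes are that 252 is the broadcast node ID and that the node ID passed to the connect call must differ from the local node.

```c
#include <stdbool.h>

/* Broadcast node ID in the sisci driver, as reported in the notes above. */
#define BROADCAST_NODE_ID 252u

/* Hypothetical helper: a connect argument is acceptable only if it is
 * not the ID of the node the caller is running on. */
static bool connect_node_id_ok(unsigned int local_node_id, unsigned int node_id)
{
    return node_id != local_node_id;
}

/* Hypothetical helper: keep the requested ID when it is acceptable,
 * otherwise fall back to the broadcast node ID. */
static unsigned int pick_connect_node_id(unsigned int local_node_id,
                                         unsigned int requested)
{
    return connect_node_id_ok(local_node_id, requested) ? requested
                                                        : BROADCAST_NODE_ID;
}
```

On node 4, a request for node 4 falls back to 252, while the same request on any other node is passed through unchanged. Always passing 252 would be simpler and should behave the same for broadcast segments, assuming the observation above holds in general.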

mentioned in merge request !262 (merged)

mentioned in commit 53b63376

closed with merge request !262 (merged)
