My name is Erik von Reis. I work as a software developer at LIGO Hanford.
We are running an IX Dolphin network, version 5.17. The network has 26
nodes. Each node is controlled by a Linux kernel module. We use the
broadcast feature.
Node number 4 does not work for us. There are some hints in the IX
header files that node 4 might be the broadcast node, but it's not
clear to us.
The error occurs in this code (error handling removed):
    err = sci_create_segment( NO_BINDING,
                              0,
                              ii,
                              DIS_BROADCAST,
                              IPC_TOTAL_ALLOC_SIZE,
                              create_segment_callback,
                              0,
                              &segment[ ii ] );
    err = sci_set_local_segment_available( segment[ ii ], 0 );
    err = sci_export_segment( segment[ ii ], 0, DIS_BROADCAST );
sci_export_segment always returns an ESCI_BUSY error (0x20000f1b), even
if the kernel module sleeps for 4 seconds.
create_segment_callback() returns 0 and does nothing else.
This same code works for nodes 8 to 104.
Is there a way to make node 4 work in this case?
Alternatively, is there a way to extend the maximum node number to a
higher level?
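For reference, the node numbers above fit a simple pattern: 26 nodes, with IDs from 4 to 104, is what you get if IDs are handed out in steps of 4. That stride is an inference from our own numbers, not something taken from the Dolphin documentation, and the names below (NODE_ID_STRIDE, node_id_from_index) are made up for illustration. A minimal sketch of that assumption:

```c
/* Assumption: node IDs are multiples of 4, starting at 4.
 * This is inferred from "26 nodes" and "nodes 4 to 104", nothing more. */
enum { NODE_ID_STRIDE = 4, NODE_COUNT = 26 };

/* Map a zero-based node index (0 .. NODE_COUNT-1) to its assumed node ID.
 * Returns -1 for an index outside the cluster. */
int node_id_from_index(int index)
{
    if (index < 0 || index >= NODE_COUNT)
        return -1;
    return (index + 1) * NODE_ID_STRIDE;
}
```

Under this assumption, index 0 is the problem node (ID 4) and index 25 is the highest node (ID 104).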
There is nothing special about the NodeId 4 node, except that it is
an NTB-capable node (as opposed to the last 2 of the 26 nodes, which are
reflective-memory capable only, due to hardware limitations). In this
regard, NodeId 4 should behave identically to, say, NodeId 8.
Our developers suggest that you try to reproduce this behavior with one
of the reflective-memory applications that come with the Dolphin
eXpressWare installation, reflective_bench for instance. Its source
code is available in the sisci-devel package, which can be installed
from the packages built by the installer. If reflective_bench behaves
differently from your application, you can review and rebuild the
source to see how your implementation differs from ours.
If your application allows it, also try shuffling the writer/receiver
roles around on a smaller set of participating nodes to see if the
issue remains with the NodeId 4 node.
If there is an issue that remains with NodeId 4, please review kernel
output from dmesg and driver status with dis_diag (remember the -V 9
parameter to review error counters and similar), and see if you can
narrow the issue down.
Please also make sure that the issue persists after a cluster-wide
reboot during which the switches have also been power-cycled.
We've made several software releases after your 5.17 installation, and
while we have not focused much on the IX platform, there may be general
bugs that have been resolved that would benefit your installation.
sci_export_segment() is a red herring. The ESCI_BUSY error only happens the second time you start the model, and every time thereafter, because the segment is already exported. Maybe this is a problem with our back-out code on error.
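To make the suspected back-out problem concrete, here is a toy model of it. It does not call the real driver: fake_export, fake_unexport, start_model and the FAKE_* codes are all hypothetical stand-ins, and the only behavior assumed is that exporting an already-exported segment reports "busy". A sketch under those assumptions:

```c
#include <stdbool.h>

#define FAKE_OK    0
#define FAKE_BUSY  (-1)   /* stand-in for ESCI_BUSY, not the real value */
#define FAKE_FAIL  (-2)   /* stand-in for the later, unrelated failure */

/* Hypothetical segment state; the real driver tracks far more than this. */
struct fake_segment {
    bool exported;
};

/* Exporting twice without an unexport in between reports "busy". */
int fake_export(struct fake_segment *seg)
{
    if (seg->exported)
        return FAKE_BUSY;
    seg->exported = true;
    return FAKE_OK;
}

void fake_unexport(struct fake_segment *seg)
{
    seg->exported = false;
}

/* One model start: export the segment, then run a later step that may fail.
 * If back_out is false, the export is left in place on failure. */
int start_model(struct fake_segment *seg, bool later_step_fails, bool back_out)
{
    int err = fake_export(seg);
    if (err != FAKE_OK)
        return err;

    if (later_step_fails) {
        if (back_out)
            fake_unexport(seg);   /* undo the export before returning */
        return FAKE_FAIL;
    }
    return FAKE_OK;
}
```

With back_out set to false, the first start returns FAKE_FAIL and every later start returns FAKE_BUSY, which hides the real error. With back_out set to true, every start returns FAKE_FAIL, so the real error stays visible.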
The real error is three calls further down, in the memory mapping. It turns out we pass node 4 to the connect call. We may be able to pass arbitrary node numbers: 252 works, and this is the node ID used for broadcast in the sisci driver. The node number just can't be the same as the node you're on.
Working now in the test stand with oaf1 as node 4. Interestingly, models started with '4' work fine together with models started with '252'. If the argument doesn't matter, why does it cause a problem? My only guess is that the check that refuses same-node arguments comes before the check for broadcast.
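That guess about check ordering can be written down as a small model. This is not the driver's code: connect_accepted and its parameters are hypothetical, and the only facts taken from the notes above are that 252 works and that a target equal to the local node is refused. A sketch of the guessed ordering:

```c
#include <stdbool.h>

#define BROADCAST_NODE_ID 252   /* value that worked in the test stand */

/* Guessed ordering: the same-node check runs first, the broadcast check
 * second. For a broadcast connect the target value is otherwise ignored. */
bool connect_accepted(int local_node, int target_node, bool is_broadcast)
{
    if (target_node == local_node)
        return false;   /* refused before the broadcast flag is looked at */

    if (is_broadcast)
        return true;    /* any other target value is accepted */

    /* Non-broadcast: placeholder rule, only requires a positive node ID. */
    return target_node > 0;
}
```

In this model, passing 4 fails only on node 4 itself, passing 4 from node 8 succeeds, and passing BROADCAST_NODE_ID succeeds everywhere, which matches what we see in the test stand.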