
ETA Reactor crash with multiple slow consumers

We're using the latest available ETA SDK (2.2.1) from GitHub.

We have an interactive provider application using the ETA Reactor. Normally it works without any issue. Occasionally an apparent network issue causes several connected clients to stop properly servicing their RSSL channels, eventually leading to disconnects due to ping timeouts.


In some cases when this occurs we get a crash with the following call stack:

rtr_dfltcIntFreeMsg cutildfltcbuffer.c:460
rtr_dfltcFreeMsg cutildfltcbuffer.c:760
ipcWriteSession rsslSocketTransportImpl.c:2941
rsslSocketWrite rsslSocketTransportImpl.c:9848
rsslWrite rsslImpl.c:1944
rsslReactorSubmit rsslReactor.c:3596
TREPReactorChannelTask::sendRsslMsg ReactorChannelTask.cpp:2129
...


I have created a standalone test that connects one or more clients (which do no data handling at all) and then force-feeds them messages via rsslReactorSubmit(...):


// We have our standard Reactor code listening here:
#include "rtr/rsslTransport.h"  // ETA transport: rsslConnect, rsslCloseChannel
#include &lt;gtest/gtest.h&gt;        // ASSERT_TRUE
#include &lt;vector&gt;

char hostName[] = "127.0.0.1";
char serviceName[] = "23000";

RsslConnectOptions copt = RSSL_INIT_CONNECT_OPTS;
RsslError err;
copt.hostName = hostName;
copt.serviceName = serviceName;
copt.blocking = RSSL_TRUE;
copt.majorVersion = RSSL_RWF_MAJOR_VERSION;
copt.minorVersion = RSSL_RWF_MINOR_VERSION;

std::vector<RsslChannel*> channels;

// rsslConnect() returns NULL on failure; NULLs are filtered out below.
for (int x = 0; x < num_connections; ++x)
{
    channels.push_back(rsslConnect(&copt, &err));
}

// Force-feed the clients messages here by calling rsslReactorSubmit(...):
// sends 4096 messages to each client every second.

for (auto c : channels)
{
    if (c)
    {
        if (RSSL_RET_SUCCESS > rsslCloseChannel(c, &err))
        {
            ASSERT_TRUE(false) << err.text;
        }
    }
}


I have set the bind options for our reactor like this:

.bindopts.port=23000
.bindopts.guaranteed_output_buffers=128
.bindopts.max_output_buffers=256


If rsslReactorGetBuffer() fails with RSSL_RET_BUFFER_NO_BUFFERS, we have code that dynamically increases the buffer count by 1000 via rsslReactorChannelIoctl(), as sketched below.
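
For reference, the retry path looks roughly like this (a minimal sketch, not our production code; currentGuaranteed, msgSize, and pReactorChannel are placeholder names, and I'm assuming the RSSL_NUM_GUARANTEED_BUFFERS ioctl code with a pointer-to-count value, following the transport-layer ioctl convention):

RsslErrorInfo errInfo;
RsslBuffer* pBuf = rsslReactorGetBuffer(pReactorChannel, msgSize, RSSL_FALSE, &errInfo);
if (!pBuf && errInfo.rsslError.rsslErrorId == RSSL_RET_BUFFER_NO_BUFFERS)
{
    // Grow the channel's guaranteed output buffer count by 1000, then retry.
    RsslUInt32 newCount = currentGuaranteed + 1000;
    if (rsslReactorChannelIoctl(pReactorChannel, RSSL_NUM_GUARANTEED_BUFFERS,
                                &newCount, &errInfo) == RSSL_RET_SUCCESS)
    {
        currentGuaranteed = newCount;
        pBuf = rsslReactorGetBuffer(pReactorChannel, msgSize, RSSL_FALSE, &errInfo);
    }
}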

If I run this test application with num_connections=1, it runs (apparently) indefinitely, with rsslReactorChannelBufferUsage() showing a growing number of buffers in use.


However, increasing num_connections to anything above 1 produces crashes like the one above relatively quickly (within 2 or 3 seconds).


This gives me pause about our application's threading model. We have called rsslInitialize(RSSL_LOCK_GLOBAL_AND_CHANNEL,...), and when this crash occurs I can see other threads in rsslReactorSubmit() blocked on reactorLockInterface() (as expected). However, other threads may be calling rsslReactorGetBuffer() (each specifying its own channel) without any synchronization.


We have one thread that owns the RsslReactor; this thread calls rsslReactorAccept() and rsslReactorDispatch().

When rsslReactorAccept() returns a new connection, that connection is handed off to a dedicated thread, which calls rsslReactorGetBuffer() and rsslReactorSubmit().

So with two threads (one handling rsslReactorDispatch() and one calling rsslReactorGetBuffer()/rsslReactorSubmit()) there seems to be no problem, but with more than two threads (one handling rsslReactorDispatch() and n calling rsslReactorGetBuffer()/rsslReactorSubmit()), the crash occurs.
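
In outline, each sender thread looks like this (a simplified sketch under the model described above; message encoding and the NO_BUFFERS retry are elided, and channelSendLoop is an illustrative name):

// Runs in its own thread per accepted channel; no synchronization with
// the dispatch thread or with the other sender threads.
void channelSendLoop(RsslReactor* pReactor, RsslReactorChannel* pChannel)
{
    RsslErrorInfo errInfo;
    for (;;)
    {
        RsslBuffer* pBuf = rsslReactorGetBuffer(pChannel, 512, RSSL_FALSE, &errInfo);
        if (!pBuf)
            continue;  // NO_BUFFERS handling elided; see the ioctl sketch above

        // ... encode an RSSL message into pBuf ...

        RsslReactorSubmitOptions submitOpts;
        rsslClearReactorSubmitOptions(&submitOpts);
        if (rsslReactorSubmit(pReactor, pChannel, pBuf, &submitOpts, &errInfo) < RSSL_RET_SUCCESS)
            break;  // error path elided (buffer release, reconnect handling)
    }
}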

Is the Reactor thread-safe when used this way, with one Reactor servicing multiple reactor channels, each in its own thread? Note that this seems to work fine in the absence of "slow clients": this application sometimes runs for weeks on end without issue.





#technology #eta-api #reactor


Hello @ckoehler

The support case number 13811057 has been created and the RDC (Real-Time APIs Support) team will look into the case to investigate further.


TREP_Fail.txt
See the attached TREP_Fail.txt to reproduce. The crash appears related to fragmented buffers; if you uncomment line 93 of the attached test, the crash goes away.


Workaround: ensure that calls to rsslReactorGetBuffer() request buffers smaller than the max fragment size.

Fix: ensure sharedPoolLock is set to true in the bind options, as sketched below.
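
For anyone hitting the same crash, the fix corresponds to binding the server with the shared-pool lock enabled (a sketch using the port and buffer counts from the question; sharedPoolLock is the RsslBindOptions field I believe is meant here):

RsslBindOptions bindOpts = RSSL_INIT_BIND_OPTS;
RsslError err;

bindOpts.serviceName = (char*)"23000";
bindOpts.guaranteedOutputBuffers = 128;
bindOpts.maxOutputBuffers = 256;

// The fix: serialize multi-threaded access to the shared buffer pool.
bindOpts.sharedPoolLock = RSSL_TRUE;

// Workaround alternative: cap every rsslReactorGetBuffer() size at the
// channel's maxFragmentSize so no fragmented buffers are needed.

RsslServer* pServer = rsslBind(&bindOpts, &err);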
