We have received several reports from different customers over the past few months of our OMM provider application crashing seemingly at random. Upon examining the memory dumps provided by the customers, the common point of failure is in the rfa::sessionLayer::RSSL_Prov_ChannelSession::removeToken(rfa::sessionLayer::RSSLRequestToken*) method. This is called when posting an OMM status message that closes the stream.
Partial stack trace from LLDB:
frame #0: 0x00007f6193959033 libRFA_RSSL_Prov_Adapter.so`rfa::sessionLayer::RSSL_Prov_ChannelSession::removeToken(rfa::sessionLayer::RSSLRequestToken*) + 483
frame #1: 0x00007f6193959414 libRFA_RSSL_Prov_Adapter.so`rfa::sessionLayer::RSSL_Prov_ChannelSession::processStatusMsg(RsslMsg*, rfa::sessionLayer::RSSLRequestToken*, unsigned char, unsigned char, rfa::common::RFA_String&) + 260
frame #2: 0x00007f61932f03a8 libRFA_SessionLayer_OMM.so`rfa::sessionLayer::OMMProviderImpl::submitCmd(long, rfa::sessionLayer::OMMSolicitedItemCmd const&, void*) + 664
frame #3: 0x00007f61932f05c1 libRFA_SessionLayer_OMM.so`rfa::sessionLayer::OMMProviderImpl::submit(rfa::sessionLayer::OMMCmd const*, void*) + 337
frame #4: Our code...
From the disassembly of removeToken, the crash appears to occur outside of a critical section (pthread_mutex_unlock appears to have been called shortly beforehand), so I don't know whether this is some kind of race condition. If you could let me know what the root cause might be and/or whether a fix is available, I'd be very grateful.
The weirdest part is that this can happen twice in a week, then go silent for months. We've had 3 reports of this in the last year from different customers. The changelog for 7.5.1.L1 lists a fix for "OMM RFA provider crashes randomly in removeToken function". Could this be a regression of that fix, or is something else at fault (either on the RFA side or in our application code)?
- We're currently using RFA 7.6.2.E2 on Linux (RHEL6 x64 GCC44). Unfortunately we cannot upgrade to RFA8 since we need to continue supporting a limited number of RHEL5 customers as well as RHEL6.
- The crashing application is a multi-threaded OMM provider application.
- The full stack trace indicates that our application attempted to send a Status message that closed the stream (in this case, the requested item was not found on the server).
- We are not able to reliably replicate the issue in our lab tests.
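To help rule out our own code, one check we could add on our side is a guard that lets only one thread submit the closing Status message for a given request token. The following is only a sketch of that idea, not something we run today: ClosedTokenGuard, tryClaimClose and forget are hypothetical names of ours and are not part of RFA, and the token is treated as an opaque pointer. It uses plain pthreads so that it builds with GCC 4.4.

```cpp
#include <pthread.h>
#include <set>

// Hypothetical helper (not part of RFA): remembers which request tokens the
// application has already closed, so that at most one thread goes on to
// submit the closing Status message for a given token.
class ClosedTokenGuard {
public:
    ClosedTokenGuard() { pthread_mutex_init(&mutex_, 0); }
    ~ClosedTokenGuard() { pthread_mutex_destroy(&mutex_); }

    // Returns true only for the first caller that claims this token.
    // Any later caller gets false and should skip its submit.
    bool tryClaimClose(const void* token) {
        pthread_mutex_lock(&mutex_);
        const bool first = closed_.insert(token).second;
        pthread_mutex_unlock(&mutex_);
        return first;
    }

    // Drops a token from the set once it can no longer be referenced,
    // so that a reused address is not mistaken for an old token.
    void forget(const void* token) {
        pthread_mutex_lock(&mutex_);
        closed_.erase(token);
        pthread_mutex_unlock(&mutex_);
    }

private:
    // Non-copyable: the mutex must not be duplicated.
    ClosedTokenGuard(const ClosedTokenGuard&);
    ClosedTokenGuard& operator=(const ClosedTokenGuard&);

    pthread_mutex_t mutex_;
    std::set<const void*> closed_;
};
```

In our submit path, each thread would call tryClaimClose(token) before sending the closing Status message and skip the submit when it returns false. If the crash still occurs with such a guard in place, that would suggest the problem is not a double close from our application threads.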
I am more than happy to provide more information as required.