Crash in C++ channelset handling when failing over

Using the C++ Real-Time SDK API, I configured a <ChannelSet> with two hosts to act as a cold standby. If I deliberately cut the TCP connection (either by misconfiguring one of the hosts, or by using tcpkill on an established connection), the SDK crashes during failover. Without the <ChannelSet>, using the normal single <Channel> setup, I see no such crash.
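
For context, the configuration difference between the two setups is roughly the following, inside the <Consumer> element of the EMA config (a sketch; the channel names are illustrative):

    <!-- Normal setup: one channel, no crash observed on disconnect -->
    <Channel value="Channel_1"/>

    <!-- Cold-standby setup: two channels, crashes during failover -->
    <ChannelSet value="Channel_1, Channel_2"/>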


I filed this as a GitHub issue: https://github.com/Refinitiv/Real-Time-SDK/issues/220 - but figured I should ask here too in case there's something obvious I'm doing wrong. The code is simple (but proprietary), and since I can see no evidence of my code in the call stack (nor any obvious callback of mine invoked during reconnection), I can't see how the crash could be caused by my code.

Answers

  • zoya faberov

    Hello @matthew.godbolt ,

    Thank you for reporting the issue. I think you have done the right thing by reporting it on GitHub and including the steps; let me share my thoughts.

    I have run a quick test with the next-to-last release of RTSDK, on Windows:

    1. Run two instances of the EMA IProvider example 110, on ports 14002 and 14003
    2. Connect the EMA Consumer example 110 via ChannelSet (see the config sketch after these steps)
    3. Kill the connection to port 14002
    4. See the ChannelSet fail over to the second IProvider
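
    For reference, the consumer-side configuration implied by steps 1 and 2 might look roughly like this (a sketch only: the channel names and the localhost hosts are assumptions here, and the surrounding ConsumerGroup/ChannelGroup wrappers are omitted):

        <Consumer>
            <Name value="Consumer_1"/>
            <ChannelSet value="Channel_1, Channel_2"/>
        </Consumer>

        <Channel>
            <Name value="Channel_1"/>
            <ChannelType value="ChannelType::RSSL_SOCKET"/>
            <Host value="localhost"/>
            <Port value="14002"/>
        </Channel>
        <Channel>
            <Name value="Channel_2"/>
            <ChannelType value="ChannelType::RSSL_SOCKET"/>
            <Host value="localhost"/>
            <Port value="14003"/>
        </Channel>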

    I also ran the same test with the EMA Consumer example 110 connecting to testbed infrastructure ADSs, to test a connectivity interrupt and failover using the same port on both ADSs.

    I was not able to reproduce the crash this way.

    This is why I would suggest re-testing in your environment with the latest RTSDK; please find the latest release at RTSDK C/C++ Downloads. To me, the most effective approach would be to run the same test with EMA example 110 from the SDK rather than with the proprietary code: it does not require creating new tester code, and it should show whether the behavior you observe with the example and the latest RTSDK in your local environment is the same or different, thereby narrowing down the possible causes of the issue.

  • Thanks @zoya faberov - I'll try that. Due to other bugs and issues we're working around, we currently use a branched build of the Real-Time SDK code, but I will bring it up to date and then see if I can reproduce in the way you suggest.

  • I haven't upgraded the SDK yet, but the attached code is a minimal reproducer for me. Note that it _does not_ crash if I use a server on localhost, but it does if I use a remote server, which makes me wonder about threading/race conditions. main.cpp.txt

    I will upgrade the RTSDK and check again. I'm on Linux and have no access to Windows.

  • zoya faberov
    Hello @matthew.godbolt ,

    My infrastructure test on port 14002 was with both ADSs remote to my consumer, so this aspect can be excluded.

  • zoya faberov

    Hello @matthew.godbolt ,

    I have tested ChannelSet failover with EMA C++ example 110 on Linux (Oracle Linux 7.4), and it ran as expected. I killed the connection, so the example consumer reported the service lost and then connected to another ADS. I killed the connection again and it reconnected to the original ADS.

    You may wish to enable trace to learn more:

     <XmlTraceToStdout value="1"/>

    It may be helpful, on your side, to first run EMA example 110 with trace on, break the connection, and examine the output, and then run your custom consumer with trace on and compare the output. As I notice that the custom consumer uses in-code config, you may wish to keep all the code as is and just swap in the XML config read used by example 110, to eliminate a possible incomplete-config issue and to test more quickly.
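
    A minimal consumer along those lines, relying purely on EmaConfig.xml, might look like the following sketch (assuming the RTSDK EMA C++ API with the refinitiv::ema::access namespace and a "Consumer_1" entry in the XML file; the service and item names are placeholders):

        // Sketch only: no programmatic config at all. EMA reads EmaConfig.xml
        // from the working directory, and "Consumer_1" is assumed to name a
        // consumer whose ChannelSet lists the two channels under test.
        #include <chrono>
        #include <iostream>
        #include <thread>
        #include "Ema.h"

        using namespace refinitiv::ema::access;

        class AppClient : public OmmConsumerClient
        {
            void onRefreshMsg( const RefreshMsg& msg, const OmmConsumerEvent& ) { std::cout << msg << std::endl; }
            void onUpdateMsg( const UpdateMsg& msg, const OmmConsumerEvent& ) { std::cout << msg << std::endl; }
            void onStatusMsg( const StatusMsg& msg, const OmmConsumerEvent& ) { std::cout << msg << std::endl; }
        };

        int main()
        {
            try
            {
                AppClient client;
                OmmConsumer consumer( OmmConsumerConfig().consumerName( "Consumer_1" ) );
                consumer.registerClient( ReqMsg().serviceName( "DIRECT_FEED" ).name( "IBM.N" ), client );
                // EMA dispatches on its own thread by default; keep the process
                // alive long enough to break the connection and watch the failover.
                std::this_thread::sleep_for( std::chrono::seconds( 60 ) );
            }
            catch ( const OmmException& excp )
            {
                std::cout << excp << std::endl;
            }
            return 0;
        }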

    Hope that this information helps

  • Thanks @zoya faberov for checking on Linux too. I'm still in the process of updating to the newer SDK so maybe it will be fixed when I do so.

    I already tried logging to stdout but nothing obvious appeared.

    I am using the custom configuration as that's how it's configured in our application: we have never used the XML configuration, and I have no experience doing so.

    If there's an issue with the configuration I supplied, I would still expect the SDK not to crash.

    I will follow up once I've looked into this more.

    Can you try my sample code and see if it crashes for you?

    thanks

  • I have updated my SDK and still experience this crash with the code as posted above. I'm on Ubuntu 22.04, and I am compiling both the SDK and my example with GCC 12.1.0, if that helps.

    I will try and debug a little more to see what further information I can glean. I'll also try using the XML configuration you describe.

  • @zoya faberov I've now tried with an XML config only (removing my code for generating the config) and I still see the crash with the latest SDK. I used the default EMA C++ EmaConfig.xml with these differences:

    $ diff  ../Real-Time-SDK/Cpp-C/Ema/EmaConfig.xml EmaConfig.xml 
    25c25
    < <XmlTraceToStdout value="0"/>
    ---
    > <XmlTraceToStdout value="1"/>
    346c346
    < <CompressionType value="CompressionType::None"/>
    ---
    > <CompressionType value="CompressionType::Zlib"/>
    355,356c355,356
    < <Host value="localhost"/>
    < <Port value="14002"/>
    ---
    > <Host value="146.242.129.1"/>
    > <Port value="14007"/>
    363c363
    < <Host value="localhost"/>
    ---
    > <Host value="146.242.133.1"/>
    600c600
    < <LoggerSeverity value="LoggerSeverity::Success"/>
    ---
    > <LoggerSeverity value="LoggerSeverity::Verbose"/>

  • Despite setting "XmlTraceToStdout" to 1, I don't see anything on stdout. It seems the minimum change I need to cause the segfault is just to alter the Host and Port of Channel_1, and the Host of Channel_2.

  • Debugging further, the line of the crash is:
    4952     case RSSL_DMT_LOGIN:
    4953     {
    4954         RsslRDMLoginRequest loginRequest;
    4955
    4956         loginRequest = *pWatchlistImpl->login.pRequest[pWatchlistImpl->login.index]->pLoginReqMsg; /// CRASH IS HERE
    4957         loginRequest.rdmMsgBase.streamId = pStream->base.streamId;
    4958
    4959         if (!pWatchlistImpl->base.config.supportOptimizedPauseResume)
    4960             loginRequest.flags &= ~RDM_LG_RQF_PAUSE_ALL;
  • Debugging around this point:


    (gdb) p pWatchlistImpl
    $1 = (RsslWatchlistImpl *) 0x61f000000e80
    (gdb) p pWatchlistImpl->login
    $2 = {pStream = 0x60b000009360, pRequest = 0x6020002974f0, index = 1, count = 1}
    (gdb) p pWatchlistImpl->login.index
    $3 = 1
    (gdb) p pWatchlistImpl->login.pRequest[pWatchlistImpl->login.index]
    $4 = (WlLoginRequest *) 0x0


    The login request is clearly nullptr here for login.index 1 (the index being dereferenced). It's non-null for index 0:

    (gdb) p pWatchlistImpl->login.pRequest[0]
    $5 = (WlLoginRequest *) 0x60e000004e20


    But also note that the request "count" is also 1, which makes me wonder whether this is reading off the end of the array (with count == 1, the only valid index would be 0).
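
    To make that suspicion concrete, here is a sketch of the kind of guard that appears to be missing at the crash line (not a proposed patch; the field names are the ones visible in the gdb output above):

        /* Sketch: with count == 1 the only populated slot is pRequest[0], so
           indexing pRequest[index] with index >= count lands on an empty slot;
           here that slot is NULL and the ->pLoginReqMsg dereference faults. */
        WlLoginRequest *pLoginRequest = NULL;

        if (pWatchlistImpl->login.index < pWatchlistImpl->login.count)
            pLoginRequest = pWatchlistImpl->login.pRequest[pWatchlistImpl->login.index];

        if (pLoginRequest != NULL && pLoginRequest->pLoginReqMsg != NULL)
        {
            loginRequest = *pLoginRequest->pLoginReqMsg;
            loginRequest.rdmMsgBase.streamId = pStream->base.streamId;
        }
        /* else: nothing valid to re-request for this login slot during recovery */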

  • zoya faberov

    Hello @matthew.godbolt ,

    RTSDK is not supported on Ubuntu. Please find the list of supported OS options in the API Compatibility Matrix and also in the RTSDK GitHub README file.

    Hope that this helps. If you would like to get started on a supported platform, the RTSDK (EMA/ETA) Quick Start guidance can be helpful to save you time and effort.

  • Thanks. A colleague was able to test 2.0.6.L1 on CentOS and he does _not_ see the issue. I am now checking 2.0.6.L1 on Ubuntu (I know it's unsupported, but I'm confident this is not an OS problem).

    Looking at the changes between 2.0.6 and 2.0.7, there are definitely some that appear to affect the part of the code I am seeing the crash in: https://github.com/Refinitiv/Real-Time-SDK/compare/Real-Time-SDK-2.0.6.L1...Real-Time-SDK-2.0.7.L1#diff-4f38211129ed66138518990f0994f9394e75c02d9545ece0213f81bbb4c3c476

    These indices do not appear to be bounds-checked against the size of the login request array.

    I will continue investigating.

  • I just built against the 2.0.6.L1 release and no longer see this issue. I updated the GitHub issue accordingly.