Crash in C++ ChannelSet handling when failing over

Using the C++ Real-Time SDK API, I configured a <ChannelSet> with two hosts to act as a cold standby. If I deliberately cut the TCP connection (either by misconfiguring one of the hosts or by using tcpkill on an established connection), the SDK crashes during failover. Without the <ChannelSet>, using just the normal <Channel> setup, I see no such crash.


I filed this as a GitHub issue: https://github.com/Refinitiv/Real-Time-SDK/issues/220 - but figured I should ask here too in case there's something obvious I'm doing wrong. The code is simple (but proprietary). As I can see no evidence of any of my code in the call stack, nor any obvious callback of mine called during reconnection, I can't see how my code could be the cause.

ema-api, #technology, c++, refinitiv-realtime-sdk

I'd like to confirm that this is fixed in Real-Time-SDK-2.0.8.L1 (more detail can be found in GitHub issue #220).

Hello @matthew.godbolt ,

Thank you for reporting the issue. I think you did the right thing by reporting it on GitHub and including the steps. Let me share my thoughts.

I ran a quick test with the next-to-last release of RTSDK on Windows:

  1. Run two instances of the IProvider example EMA 110, on ports 14002 and 14003
  2. Connect Consumer EMA 110 via ChannelSet
  3. Kill the connection to port 14002
  4. See ChannelSet fail over to the second IProvider

I ran the same test with Consumer EMA 110 connecting to testbed infrastructure ADSs, to test a connectivity interruption and failover on the same port.

I was not able to reproduce the crash this way.

This is why I would suggest re-testing in your environment with the latest RTSDK; please find the latest release at RTSDK C/C++ Downloads. To me, the most effective approach would be to run the same test with EMA example 110 from the SDK rather than the proprietary code. This does not require creating new test code, and it should show whether the behavior you observe with the example and the latest RTSDK in your local environment is the same or different, thereby narrowing down the possible causes of the issue.

Thanks @zoya faberov - I'll try that. Due to other bugs and issues we're working around, we currently use a branched build of the real-time code but I will bring it up to date and then see if I can reproduce in the way you suggest.

I haven't upgraded the SDK yet, but the attached code (main.cpp.txt) is a minimal reproducer for me. Note that if I use a localhost server it _does not_ crash, but if I use a remote server it does, which makes me wonder about threading or race conditions.

I will upgrade the RTSDK and check again. I'm on Linux and have no access to Windows.


maincpp.txt (5.0 KiB)

Hello @matthew.godbolt ,

My infrastructure test on port 14002 was run with both ADSs remote to my consumer, so this aspect can be excluded.

Hello @matthew.godbolt ,

I have tested ChannelSet failover with EMA C++ Example 110 on Linux (Oracle Linux 7.4), and it ran as expected. I killed the connection, the example consumer reported Service Lost, and it then connected to another ADS. I killed the connection again and it reconnected to the original ADS.

You may wish to enable trace to learn more:

 <XmlTraceToStdout value="1"/>

It may be helpful, on your side, to first run example EMA 110 with trace on, break the connection, and examine the output; then run the custom consumer with trace on and compare the output. As I notice that the custom consumer uses in-code config, you may wish to keep all the code as is and just swap in the XML config read from example 110, to eliminate a possible incomplete config read issue and to test more quickly.

Hope that this information helps.

Thanks @zoya faberov for checking on Linux too. I'm still in the process of updating to the newer SDK so maybe it will be fixed when I do so.

I already tried logging to stdout but nothing obvious appeared.

I am using the custom configuration as that's how it's configured in our application: we have never used the XML configuration, and I have no experience doing so.

If there's an issue with the configuration I supplied, I would still expect the SDK not to crash.

I will follow up once I've looked into this more.

Can you try my sample code and see if it crashes for you?

thanks

I have updated my SDK and still experience this crash with the code as posted above. I'm on Ubuntu 22.04, and I am compiling both the SDK and my example with GCC 12.1.0, if that helps.

I will try and debug a little more to see what further information I can glean. I'll also try using the XML configuration you describe.

@zoya faberov I've now tried with an XML config only (removing my code for generating config) and I still see the crash with the latest SDK. I used the default EmaConfig.xml from Cpp-C/Ema with these differences:

$ diff  ../Real-Time-SDK/Cpp-C/Ema/EmaConfig.xml EmaConfig.xml 
25c25
< <XmlTraceToStdout value="0"/>
---
> <XmlTraceToStdout value="1"/>
346c346
< <CompressionType value="CompressionType::None"/>
---
> <CompressionType value="CompressionType::Zlib"/>
355,356c355,356
< <Host value="localhost"/>
< <Port value="14002"/>
---
> <Host value="146.242.129.1"/>
> <Port value="14007"/>
363c363
< <Host value="localhost"/>
---
> <Host value="146.242.133.1"/>
600c600
< <LoggerSeverity value="LoggerSeverity::Success"/>
---
> <LoggerSeverity value="LoggerSeverity::Verbose"/>

Despite the "XmlTraceToStdout" I don't see anything on stdout. It seems the minimum change I need to cause the segfault is just to alter the Host and Port of Channel_1, and the Host of Channel_2.

After further debugging, the line of the crash is:

    case RSSL_DMT_LOGIN:
    {
            RsslRDMLoginRequest loginRequest;

            loginRequest = *pWatchlistImpl->login.pRequest[pWatchlistImpl->login.index]->pLoginReqMsg;  /// CRASH IS HERE
            loginRequest.rdmMsgBase.streamId = pStream->base.streamId;

            if (!pWatchlistImpl->base.config.supportOptimizedPauseResume)
                    loginRequest.flags &= ~RDM_LG_RQF_PAUSE_ALL;

Debugging around this point:


(gdb) p pWatchlistImpl
$1 = (RsslWatchlistImpl *) 0x61f000000e80
(gdb) p pWatchlistImpl->login
$2 = {pStream = 0x60b000009360, pRequest = 0x6020002974f0, index = 1, count = 1}
(gdb) p pWatchlistImpl->login.index
$3 = 1
(gdb) p pWatchlistImpl->login.pRequest[pWatchlistImpl->login.index]
$4 = (WlLoginRequest *) 0x0


The login request is clearly a null pointer at login.index 1 (the entry being dereferenced here). It's non-null for index 0:

(gdb) p pWatchlistImpl->login.pRequest[0]
$5 = (WlLoginRequest *) 0x60e000004e20


But also note that "count" is 1 too, which makes me wonder whether this is reading off the end of the array.
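
To illustrate the suspicion, here is a minimal sketch of the kind of guard that seems to be missing at the crash site. The types and the selectedRequest helper are hypothetical stand-ins, not the SDK's actual code; only the field names pRequest, index, and count are taken from the gdb output, and the assumption that count is the number of valid slots in pRequest is mine.

```cpp
#include <cstddef>

// Hypothetical stand-in for the SDK's WlLoginRequest.
struct LoginRequest {
    int streamId;
};

// Hypothetical stand-in for the `login` member printed in gdb.
struct LoginState {
    LoginRequest** pRequest;  // array of request pointers
    std::size_t index;        // slot currently selected
    std::size_t count;        // assumed: number of valid slots in pRequest
};

// Returns the selected request, or nullptr when the index is out of range
// or the array is missing, instead of dereferencing the slot blindly.
// An empty (null) slot inside the valid range is also returned as nullptr.
inline const LoginRequest* selectedRequest(const LoginState& login) {
    if (login.pRequest == nullptr || login.index >= login.count) {
        return nullptr;
    }
    return login.pRequest[login.index];
}
```

With the state seen in the debugger (index 1, count 1), selectedRequest returns nullptr, so a caller could report an error rather than segfault. Whether rejecting the index or clamping it is the right behavior during failover is a question for the SDK maintainers.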

Hello @matthew.godbolt ,

RTSDK is not supported on Ubuntu. Please find the list of supported OS options in the API Compatibility Matrix and in the Readme file of RTSDK on GitHub.

Hope that this helps. If you would like to get started on a supported platform, the RTSDK (EMA/ETA) Quick Start guidance can save you time and effort.

Thanks. A colleague was able to test 2.0.6.L1 on CentOS and he does _not_ see the issue. I am now checking 2.0.6.L1 on Ubuntu (I know it's unsupported, but I'm confident this is not an OS problem).

Looking at the changes between 2.0.6 and 2.0.7, there are definitely some that appear to affect the part of the code where I am seeing the crash: https://github.com/Refinitiv/Real-Time-SDK/compare/Real-Time-SDK-2.0.6.L1...Real-Time-SDK-2.0.7.L1#diff-4f38211129ed66138518990f0994f9394e75c02d9545ece0213f81bbb4c3c476

These indices do not appear to be bounds-checked against the size of the login request array.

I will continue investigating.

I just built against the 2.0.6.L1 release and no longer see this issue. I updated the GitHub issue tracker accordingly.

Hello @matthew.godbolt

We appreciate you keeping us informed!
