Crash in C++ channelset handling when failing over

Using the C++ Real-Time SDK API, I configured a <ChannelSet> with two hosts to act as a cold standby. If I deliberately cut the TCP connection (either by misconfiguring one of the hosts, or by using tcpkill on an established connection), the SDK crashes during failover. Without the <ChannelSet>, just using the normal <Channel> setup, I see no such crash.
I filed this as a GH issue: https://github.com/Refinitiv/Real-Time-SDK/issues/220 - but figured I should ask here too in case there's something obvious I'm doing wrong. The code is simple (but proprietary); and as I can see no evidence of any of my code in the call stack (nor any obvious callback of mine called during reconnection) I can't see how it's my code.
Best Answer
-
I'd like to confirm that this is fixed in Real-Time-SDK-2.0.8.L1 (more detail can be found in GitHub issue #220).
0
Answers
-
Hello @matthew.godbolt ,
Thank you for reporting the issue. I think you did the right thing by reporting it on GitHub and including the steps. Let me share my thoughts.
I ran a quick test with the next-to-last release of RTSDK, on Windows:
- Run two instances of IProvider example EMA 110, on ports 14002 and 14003
- Connect Consumer EMA 110 via ChannelSet
- Kill the connection to port 14002
- See ChannelSet fail over to the second IProvider
I then ran the same test with Consumer EMA 110 connecting to testbed infrastructure ADSs, to test a connectivity interruption and failover on the same port.
I was not able to reproduce the crash this way.
This is why I would suggest re-testing in your environment with the latest RTSDK; please find the latest release at RTSDK C/C++ Downloads. To me, the most effective approach would be to run the same test with EMA example 110 from the SDK rather than the proprietary code. This does not require writing new test code, and it should show whether the behavior you observe with the example and the latest RTSDK in your local environment is the same or different, thereby narrowing down the possible causes of the issue.
0 -
Thanks @zoya faberov - I'll try that. Due to other bugs and issues we're working around, we currently use a branched build of the real-time code but I will bring it up to date and then see if I can reproduce in the way you suggest.
0 -
I haven't upgraded the SDK yet, but the attached code is a minimal reproducer for me. Note that if I use localhost server it _does not_ crash, but if I use a remote server it does, which makes me wonder about some threading/race conditions. main.cpp.txt
I will upgrade the rtsdk and check again. I'm on Linux and have no access to windows.
0 -
Hello @matthew.godbolt ,
My infra test on port 14002 was with both ADSs remote to my consumer, so this aspect can be excluded.
0 -
Hello @matthew.godbolt ,
I have tested ChannelSet failover with EMA C++ Example 110 on Linux (Ora Linux 7.4), and it ran as expected. I killed the connection, the example consumer reported Service Lost, and it then connected to another ADS. I killed the connection again and it reconnected to the original ADS.
You may wish to enable trace to learn more:
<XmlTraceToStdout value="1"/>
It may be helpful, on your side, to first run example EMA 110 with trace on, break the connection, and examine the output, then run the custom consumer with trace on and compare the output. As I notice that the custom consumer uses in-code config, you may wish to keep all the code as is and just swap in the XML config read from example 110, to eliminate a possible incomplete config read issue and to test more quickly.
Hope that this information helps.
0 -
Thanks @zoya faberov for checking on Linux too. I'm still in the process of updating to the newer SDK so maybe it will be fixed when I do so.
I already tried logging to stdout but nothing obvious appeared.
I am using the custom configuration as that's how it's configured in our application: we have never used the XML configuration, and I have no experience doing so.
If there's an issue with the configuration I supplied, I would still expect the SDK not to crash.
I will follow up once I've looked into this more.
Can you try my sample code and see if it crashes for you?
thanks
0 -
I have updated my SDK and still experience this crash with the code as posted above. I'm on Ubuntu 22.04, and I am compiling both the SDK and my example with GCC 12.1.0, if that helps.
I will try and debug a little more to see what further information I can glean. I'll also try using the XML configuration you describe.
0 -
@zoya faberov I've now tried with an XML config only (removing my code for generating config) and I still see the crash with the latest SDK. I used the default Ema-C xml config with this difference:
$ diff ../Real-Time-SDK/Cpp-C/Ema/EmaConfig.xml EmaConfig.xml
25c25
< <XmlTraceToStdout value="0"/>
---
> <XmlTraceToStdout value="1"/>
346c346
< <CompressionType value="CompressionType::None"/>
---
> <CompressionType value="CompressionType::Zlib"/>
355,356c355,356
< <Host value="localhost"/>
< <Port value="14002"/>
---
> <Host value="146.242.129.1"/>
> <Port value="14007"/>
363c363
< <Host value="localhost"/>
---
> <Host value="146.242.133.1"/>
600c600
< <LoggerSeverity value="LoggerSeverity::Success"/>
---
> <LoggerSeverity value="LoggerSeverity::Verbose"/>
0 -
Despite the "XmlTraceToStdout" I don't see anything on stdout. It seems the minimum change I need to cause the segfault is just to alter the Host and Port of Channel_1, and the Host of Channel_2.
0 -
After further debugging, the crash is on the marked line below (line 4956 of the source file):
case RSSL_DMT_LOGIN:
{
    RsslRDMLoginRequest loginRequest;

    loginRequest = *pWatchlistImpl->login.pRequest[pWatchlistImpl->login.index]->pLoginReqMsg; /// CRASH IS HERE
    loginRequest.rdmMsgBase.streamId = pStream->base.streamId;

    if (!pWatchlistImpl->base.config.supportOptimizedPauseResume)
        loginRequest.flags &= ~RDM_LG_RQF_PAUSE_ALL;
0 -
Debugging around this point:
(gdb) p pWatchlistImpl
$1 = (RsslWatchlistImpl *) 0x61f000000e80
(gdb) p pWatchlistImpl->login
$2 = {pStream = 0x60b000009360, pRequest = 0x6020002974f0, index = 1, count = 1}
(gdb) p pWatchlistImpl->login.index
$3 = 1
(gdb) p pWatchlistImpl->login.pRequest[pWatchlistImpl->login.index]
$4 = (WlLoginRequest *) 0x0
The login request is clearly nullptr here for login.index 1 (the one that's being looked at here). It's non-null for index 0:
(gdb) p pWatchlistImpl->login.pRequest[0]
$5 = (WlLoginRequest *) 0x60e000004e20
But also note that the "count" of the request is 1 too, which makes me wonder if this is reading off the end of the array?
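To make the suspicion concrete, here is a minimal sketch of what a guarded lookup could look like. It is not the SDK's code: LoginRequest, LoginTable and currentLoginRequest are hypothetical stand-ins, and only the field names pRequest, index and count are taken from the gdb output above. With index = 1 and count = 1, the only valid slot is 0, so the guarded lookup returns nullptr instead of reading past the end of the array.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the SDK types; only the field names
// (pRequest, index, count) mirror what gdb printed.
struct LoginRequest {
    int streamId;
};

struct LoginTable {
    LoginRequest** pRequest;  // array of request pointers
    std::size_t index;        // slot currently selected
    std::size_t count;        // number of valid slots in pRequest
};

// Returns the selected request, or nullptr when the table is empty or
// index lies outside [0, count). The caller can then handle the error
// rather than dereferencing a null or out-of-range slot.
LoginRequest* currentLoginRequest(const LoginTable& table) {
    if (table.pRequest == nullptr || table.index >= table.count) {
        return nullptr;
    }
    return table.pRequest[table.index];
}
```

A caller would test the returned pointer before copying the request. Whether the real fix belongs at the point of use or at the point where the index is advanced is something only the SDK source can answer.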
0 -
Hello @matthew.godbolt ,
RTSDK is not supported on Ubuntu. Please find the list of supported OS options in the API Compatibility Matrix and also in the RTSDK Readme file on GitHub.
Hope that this helps. If you would like to get started on a supported platform, the RTSDK (EMA/ETA) Quick Start guidance can save you time and effort.
0 -
Thanks. A colleague was able to test 2.0.6.L1 on CentOS and he does _not_ see the issue. I am now checking 2.0.6.L1 on Ubuntu (I know it's unsupported, but I'm confident this is not an OS problem).
While looking at the changes between 2.0.6 and 2.0.7, I found some that definitely appear to affect the part of the code I am seeing the crash in: https://github.com/Refinitiv/Real-Time-SDK/compare/Real-Time-SDK-2.0.6.L1...Real-Time-SDK-2.0.7.L1#diff-4f38211129ed66138518990f0994f9394e75c02d9545ece0213f81bbb4c3c476
These indices do not appear to be bounds-checked against the size of the login request array.
I will continue investigating.
0 -
I just built against the 2.0.6.L1 release and no longer see this issue. I updated the GitHub issue tracker accordingly.
0 -
Hello @matthew.godbolt
We appreciate you keeping us informed!
0