
Network outage not reported by ETA reactor when reconnect enabled in ConnectOptions

We ran a network test to compare what events an ETA application and an SSL application receive when connected to an Elektron server.

The tests were run with:

cnctOpts.reconnectAttemptLimit = -1;
cnctOpts.reconnectMinDelay = 5000;
cnctOpts.reconnectMaxDelay = 10000;

If the server is rebooted, the ETA application receives repeated RSSL_RC_CET_CHANNEL_DOWN_RECONNECTING events in the event callback until the server is back online and ready to accept connections.

Next we simulated a network outage (broken cable, switch problem, etc.) for different time intervals. The simulation was performed with the cooperation of a Reuters engineer disabling the NIC card at the Elektron server.

The result for each interval was the same: the ETA application did not receive any event callback until the NIC was re-enabled. The longest interval we tested was 15 minutes, and the application received no notification through its callbacks during that time.

Is this by design?

A cursory examination of the reactor source code seems to indicate that this is the case (whether intentional or not).

If so, our application (and your documentation) may need updating. The reconnect logic in the reactor is of limited use if the application receives no notification that the connection is no longer healthy. We could either add a watchdog thread that continuously polls the last event time for each session, or disable the reconnect logic in the reactor.
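
A minimal sketch of the watchdog idea, assuming the application records a timestamp whenever one of its callbacks fires. The `Session` struct and the `session_touch` and `count_stale_sessions` helpers are hypothetical names for illustration and are not part of the ETA API; a watchdog thread would call `count_stale_sessions` on a timer.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical per-session bookkeeping; not an ETA type. */
typedef struct {
    const char *name;
    time_t last_event;   /* time of the most recent callback for this session */
} Session;

/* Call from every event/message callback to record activity. */
static void session_touch(Session *s, time_t now)
{
    s->last_event = now;
}

/* Return how many sessions have been silent for longer than max_idle seconds. */
static size_t count_stale_sessions(const Session *sessions, size_t n,
                                   time_t now, double max_idle)
{
    size_t stale = 0;
    for (size_t i = 0; i < n; ++i) {
        if (difftime(now, sessions[i].last_event) > max_idle)
            ++stale;
    }
    return stale;
}
```

The idle threshold would need to be comfortably larger than the expected gap between events, otherwise a quiet but healthy session would be flagged.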

Perhaps we've missed some setting when writing our application code.

We plan on testing again early next week with the source modified to disable the reconnect logic in the reactor, to see whether the application is then notified that the connection has gone bad.


@joe.ferraro


One way the Reactor detects connection failures is by being notified on the actual TCP socket that it should read from or write to it, at which point the failure surfaces as a read/write error. If the disconnect is graceful (e.g. the server has closed the connection), it is usually detected immediately, but in this sort of test the notification is usually not immediate and the time it takes may vary.

However, the Reactor also expects to see periodic pings (or actual data) within a negotiated ping interval, and 15 minutes would be a pretty high setting (ETA's default is 60 seconds but the server may set it differently). If your application is not already doing so, can you try calling rsslReactorDispatch periodically (e.g. once every second or few seconds)?
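
To illustrate why periodic dispatch matters, here is a simplified model. It is not the ETA implementation: it assumes the ping-timeout check runs only inside the dispatch call, so a missed ping can be noticed only when the application dispatches. `ModelChannel`, `model_dispatch`, and the timing values are hypothetical.

```c
#include <stdbool.h>

/* Simplified model of a channel with a negotiated ping timeout (seconds). */
typedef struct {
    long last_rx;        /* time the last ping or data arrived */
    long ping_timeout;   /* silence longer than this means the peer is gone */
    bool down_reported;  /* set once the timeout has been noticed */
} ModelChannel;

/* Stand-in for a dispatch call: the timeout is evaluated only when this
 * function runs, so a thread blocked elsewhere never sees the failure. */
static bool model_dispatch(ModelChannel *ch, long now)
{
    if (!ch->down_reported && now - ch->last_rx > ch->ping_timeout)
        ch->down_reported = true;
    return ch->down_reported;
}
```

Under this model, an application that stops dispatching never learns of the outage, however long it lasts.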


@jim.carroll

The rsslReactorDispatch function is called repeatedly.

The application creates a worker thread that periodically calls select on the fd_sets for read and exceptions, using no timeout. If the select completes successfully, it calls rsslReactorDispatch.

In practice, once a connection is established, the select returns periodically, about once every 10 seconds, despite the lack of a timeout. It looks like the wakeup is the result of the receipt of an inbound PING message. This results in a call to rsslReactorDispatch followed by another call to select (which returns in 10 seconds, and so on).

Looking at the reactor source code (a cursory examination), it looks like the 'Down' event is not delivered to the application if the reconnect logic is enabled by setting reconnectAttemptLimit to a non-zero value. That is, the reporting of the event is bypassed since the reactor will try to reconnect.

I will be attempting a simulated network outage again this afternoon, this time with a few changes to the code:

  • reconnectAttemptLimit set to 0, 1, and 5
  • a timeout on the select in the worker to ensure that rsslReactorDispatch is called periodically, rather than assuming that once a connection is established the select will always return every 10 seconds, or that the reactor will manage the ping logic in its own thread.

Once that test is complete, I'll post the results here.


@jim.carroll

It looks like the application stopped calling rsslReactorDispatch after the network connection was disconnected on the Elektron server, because the select() call was not returning.

When writing the application, I assumed that the select would return when the network connection failed due to the arrival of an event on the fd_set for exceptions (or to a lesser extent fds for read).

For whatever reason, neither fd_set receives a change of status, so the select() call keeps blocking the worker thread and never returns, which prevents the logic from continuing to the rsslReactorDispatch call.

Adding a timeout to the select() call so that rsslReactorDispatch is called periodically solves the problem (thanks for the tip, Jim).

I'm not certain why the select isn't completing with an indication that the connection was lost on one of the sockets in either fd_set, but it doesn't matter since I've added the timeout to the select so that rsslReactorDispatch will be called regularly.
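
A minimal sketch of this fix, assuming a POSIX system: `select()` is given a timeout so the worker loop regains control even when no descriptor changes state. `wait_readable` is a hypothetical helper, and the trailing comment marks where the application's own rsslReactorDispatch call would go.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>   /* older systems declare select() here */

/* Wait until fd is readable or has an exceptional condition, or until
 * timeout_ms elapses. Returns >0 if ready, 0 on timeout, -1 on error. */
static int wait_readable(int fd, long timeout_ms)
{
    fd_set read_fds, except_fds;
    struct timeval tv;

    FD_ZERO(&read_fds);
    FD_ZERO(&except_fds);
    FD_SET(fd, &read_fds);
    FD_SET(fd, &except_fds);

    /* select() may modify tv, so it is rebuilt on every call. */
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    return select(fd + 1, &read_fds, NULL, &except_fds, &tv);
}

/* In the worker loop, dispatch whether select() reported activity or
 * simply timed out:
 *
 *     int rc = wait_readable(reactor_fd, 1000);
 *     if (rc >= 0) { call rsslReactorDispatch here }
 */
```

The one-second timeout in the comment is only an example; any value shorter than the negotiated ping interval should let the dispatch call run often enough.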
