UPA Reactor skipping "READY"

Joe.Ferraro · August 2016

Occasionally after calling rsslReactorConnect, after the call completes, the worker thread waiting on the select will start to return, resulting in calls to rsslReactorDispatch and calls to the channel event callback routine.

Normally, the callback will be called with the channelEventType RSSL_RC_CET_CHANNEL_UP followed by RSSL_RC_CET_CHANNEL_READY.
At this point, the connection can be used to send queries.

In the error case, however, this is NOT happening. The RSSL_RC_CET_CHANNEL_UP event IS RECEIVED, but the RSSL_RC_CET_CHANNEL_READY event *NEVER ARRIVES*.
Instead, after a few passes through select/rsslReactorDispatch, an RSSL_RC_CET_CHANNEL_DOWN_RECONNECTING event arrives.
After one or more RECONNECTING events, the reactor once again invokes the callback with RSSL_RC_CET_CHANNEL_UP, followed by more select/rsslReactorDispatch cycles and one or more RSSL_RC_CET_CHANNEL_DOWN_RECONNECTING events, and the whole thing repeats.

Is this normal behavior, the result of my code misusing the reactor, or a bug in the reactor?

Currently the only way my program can recover from this endless UP/RECONNECTING cycle is for me to restart it manually.
If this behavior is part of the normal pattern of communications, that is, NEVER getting a READY event, what needs to be done to successfully establish a usable connection to the reactor?

Thanks

Joe.Ferraro · September 2016

This problem is caused by the same issue described in the Question:
"Calling ETA Reactor rsslReactorConnect in tight loop to stress test corrupts reactor"

The simple answer is that rsslReactorConnect requires the structures whose pointers it is passed to remain valid beyond the lifetime of the function making the call.
This means that if the structures are on the stack when the call is made, then by the time the reactor worker thread services the request, the data may have been overwritten due to stack reuse.
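
A minimal sketch of the lifetime point, using hypothetical stand-in types rather than the real UPA/ETA structures. `ConnectOpts`, `FakeReactor`, `submit_connect`, `service_connect` and `start_connect` are all invented for illustration: `submit_connect` plays the role of a call that keeps the caller's pointer, and `service_connect` plays the role of the worker that reads it later. The only thing it is meant to show is that heap (or static) storage survives the function that starts the connection, whereas a stack variable would not.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the options passed to a connect call. */
typedef struct {
    char host[64];
    char port[16];
} ConnectOpts;

/* Hypothetical stand-in for a reactor that keeps the caller's pointer
 * and only reads it later, when the request is serviced. */
typedef struct {
    const ConnectOpts *pending;
} FakeReactor;

/* Stores the pointer without copying the structure it points to. */
static void submit_connect(FakeReactor *r, const ConnectOpts *opts) {
    r->pending = opts;
}

/* Runs "later": copies out the host the worker would connect to. */
static void service_connect(const FakeReactor *r, char *out, size_t n) {
    assert(r->pending != NULL && n > 0);
    strncpy(out, r->pending->host, n - 1);
    out[n - 1] = '\0';
}

/* Safe pattern: the options live on the heap, so the stored pointer is
 * still valid after this function returns. The caller frees the result
 * once the connection is no longer needed. Returns NULL on allocation
 * failure. */
static ConnectOpts *start_connect(FakeReactor *r, const char *host,
                                  const char *port) {
    ConnectOpts *opts = calloc(1, sizeof *opts);
    if (opts == NULL) {
        return NULL;
    }
    /* calloc zero-filled the buffers, so copying at most size-1 bytes
     * leaves them NUL-terminated. */
    strncpy(opts->host, host, sizeof opts->host - 1);
    strncpy(opts->port, port, sizeof opts->port - 1);
    submit_connect(r, opts);
    return opts;
}
```

Declaring `ConnectOpts opts;` as a local inside `start_connect` and passing `&opts` would compile just as well, but the stored pointer would dangle as soon as the function returned, which is the failure mode described above.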

jim.carroll · August 2016

Try looking at the reconnecting RsslReactorChannelEvent's RsslErrorInfo. It should be present and should contain error text which may help identify the cause.

For example, if the error text was "Connection reset by peer," it would likely indicate that the server closed your connection for some reason.

Joe.Ferraro · September 2016

This is what is happening in a loop:

Sep 02 12:30:29.345 : Channel UP, pReactorChannel: 0x605E0000F080, socket: 20

12:30:29.345 Channel down, RECONNECTING, pReactorChannel: 0x605E0000F080, socket:20, reason: errcode: SUCCESS, errtext: <Impl/ripcsrvr.c:6868> Error:1002 ripcRead() failure. Connection reset by peer (ret: -1, sys: 0), location: Reactor/Impl/rsslReactor.c:2993

Or, put back into the structure:
RsslErrorInfo {
    rsslErrorInfoCode = RSSL_EIC_SUCCESS
    RsslError {
        rsslErrorId = -1
        sysError = 0
        text = "<Impl/ripcsrvr.c:6868> Error: 1002 ripcRead() failure. Connection reset by peer"
    }
    errorLocation = "Reactor/Impl/rsslReactor.c:2993"
}

Looking at rsslReactor.c:2993 shows the 'default' case in the switch.

This loops until I kill the process.
How can I recover from this?
Since RsslErrorInfo.rsslErrorInfoCode is set to SUCCESS (?!), it would appear that the reactor is reporting that the condition has been handled.

As a result, would the program have to inspect RsslErrorInfo.RsslError.text or RsslErrorInfo.RsslError.rsslErrorId to determine that an error occurred?

Or possibly just count the number of times an UP/RECONNECTING pair occurs with no other intervening events, and then close the connection and try to reestablish it?

Or perhaps it's a bug in the Reactor and the way this particular sequence of events is handled.

@jim.carroll

jim.carroll · September 2016

Based on this set of events and the error text, it looks like you are successfully connecting, but then the server (or possibly something in between, if that exists) is disconnecting you -- this error indicates that your underlying network connection was disconnected from the other end. The Reactor is responding by reconnecting, only for the server to disconnect you again.

Do you know what it is you are connecting to (is it an ADS, one of our provider examples, or your own device)? It might log or otherwise indicate the reason why it is disconnecting you. That may provide a clue.

What version of the API is this? Offhand, I don't recall encountering any bug like this, but the source lines in the error text suggest that it is very old.

Joe.Ferraro · September 2016

I captured the full log of the output on a subsequent occurrence of this error.

I've subscribed to many symbols in the course of the execution of the program, some from RICs specified individually, and some as the result of a query using a symbol list.

After running for some time, the event that triggers this deadly cycle is a RECONNECTING.
All the information above was the same, except that on this occasion RsslErrorInfo.RsslError.sysError was 104.

As for the version of UPA that I'm using, it's 7.6.1.L1.

I'm connecting to an ELEKTRON edge device, version 2.5.4, using the ELEKTRON_DD service.

Looking at the logs on the server, I see the following:

Warning Output threshold OK for <username> at <ip_address>/net on <hostname> using application 256 on channel 2740. Resuming normal operation.

Following this are many, many of the following:

Harmless RSSL disconnect from host "<hostname>" at address <ip_address> on channel 2700. Reason: rsslRead() failed with code -1 and a system error 0. Text: <..\..\Ripc\Impl\ripcsrvr.c:6868> Error: 1002 RipcRead() failure. Connection reset by peer.

These continue until I kill my program.

Is it normal for the edge server to disconnect / reconnect after the "output threshold" has been reached?

@jim.carroll

Joe.Ferraro · September 2016

A bit more info, looking at the log a bit more:

The sequence of events seems to be similar to the following:

Warning Output threshold breached... on channel 2816

Warning Output threshold OK..... Resuming normal operation.

Warning Output threshold breached..... on channel 2700.

Harmless User <user> at position .... using application 256 on channel 2816 has been disconnected due to an overflow condition.

Harmless RSSL disconnect from ...... Reason: user .... has been disconnected due to an overflow condition.

Harmless dacsapi - DACS_UserLogout(enter) (894)

Harmless dacsapi - DACS_UserLogout(), calling logout callback or handle - 894

Harmless dacsapi - DACS_UserLogout(success) (894)

Warning Output threshold breached..... on channel 2700.

Harmless RSSL disconnect...... rsslRead() failed with code -1 and system error: 0......

It looks like some output threshold is being met and the reactor cannot recover from that condition.

I'm guessing that the application is not processing the data fast enough and the edge server is disconnecting us.

The big problem is the inability to re-establish communications.

@jim.carroll

jim.carroll · September 2016

I believe your guess regarding your application's processing is correct -- your application is probably reading too slowly, backing up your connection to the edge, which (unless configured otherwise) will cause it to disconnect you. I would, however, expect that the subsequent reconnect would succeed.

I don't think the 104 sysError means much here; this should be the errno from the disconnection -- on Linux that would be ECONNRESET.
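
For reference, a small sketch of how an application might interpret that value portably instead of hard-coding 104. The helpers `is_peer_reset` and `describe_sys_error` are hypothetical names, and the sketch assumes the value in sysError is a plain errno, as described above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true when the reported system error is the
 * platform's "connection reset by peer" errno. Comparing against the
 * ECONNRESET macro avoids depending on its numeric value. */
static bool is_peer_reset(int sys_error) {
    return sys_error == ECONNRESET;
}

/* Hypothetical helper: human-readable text for logging. A value of 0
 * means no system error was recorded with the event. */
static const char *describe_sys_error(int sys_error) {
    return sys_error == 0 ? "no system error recorded"
                          : strerror(sys_error);
}
```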

Can you confirm that you are keeping your file descriptor sets updated properly when reconnecting, i.e. clearing the channel's FD on the reconnecting events, and setting the FD on the channel-up events? It sounds like you are probably doing this properly but please confirm that your FD read set contains the Reactor's FD and the channel's updated FD after the post-reconnect channel-up event.
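
The bookkeeping being asked about might look something like the sketch below. It is self-contained and does not use the Reactor headers: `ChannelEvent`, `FdState`, `fd_state_init` and `fd_state_on_event` are hypothetical names, and the two enum values only stand in for the channel-up and reconnecting events. One design choice is an assumption of this sketch: it remembers which descriptor it registered and clears that one, rather than relying on the socket id carried by the down event.

```c
#include <assert.h>
#include <sys/select.h>

/* Hypothetical event codes standing in for the reactor's channel events. */
typedef enum { EV_CHANNEL_UP, EV_CHANNEL_DOWN_RECONNECTING } ChannelEvent;

typedef struct {
    fd_set read_fds;
    fd_set except_fds;
    int channel_fd;  /* descriptor currently registered, or -1 for none */
} FdState;

/* Start with only the reactor's own descriptor in the read set. */
static void fd_state_init(FdState *s, int reactor_fd) {
    FD_ZERO(&s->read_fds);
    FD_ZERO(&s->except_fds);
    FD_SET(reactor_fd, &s->read_fds);
    s->channel_fd = -1;
}

/* Keep the sets in step with the channel: on any event the previously
 * registered channel descriptor is removed, and on channel-up the new
 * descriptor (which may differ after a reconnect) is added. */
static void fd_state_on_event(FdState *s, ChannelEvent ev, int socket_id) {
    if (s->channel_fd >= 0) {
        FD_CLR(s->channel_fd, &s->read_fds);
        FD_CLR(s->channel_fd, &s->except_fds);
        s->channel_fd = -1;
    }
    if (ev == EV_CHANNEL_UP && socket_id >= 0) {
        assert(socket_id < FD_SETSIZE);
        FD_SET(socket_id, &s->read_fds);
        FD_SET(socket_id, &s->except_fds);
        s->channel_fd = socket_id;
    }
}
```

Because select modifies the sets it is given, the loop that calls it would pass copies of `read_fds` and `except_fds` on each iteration and keep these as the master sets.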

I don't know if it will help, but consider upgrading to the latest version of UPA (now ETA). You should not need to make any code changes to do so.

Otherwise I would recommend contacting support regarding this -- it may be difficult to diagnose your problem without taking a closer look.

Joe.Ferraro · September 2016

From my code:

case RSSL_RC_CET_CHANNEL_DOWN_RECONNECTING:
    /* Stop watching the old socket while the reactor reconnects. */
    if (REACTOR_INVALID_SOCKET != pReactorChannel->socketId) {
        FD_CLR(pReactorChannel->socketId, &g_fdsRead);
        FD_CLR(pReactorChannel->socketId, &g_fdsExcept);
    }

I've just built a new version using ElektronSDK 1.0.6.

Let's see if I can get the problem to occur again using the new code.

Hopefully I can get that version deployed to our prod environment.

@jim.carroll
