Reliable Reconnect

Diffusion™ communication is composed of discrete messages. Each client sends various types of message to its server, and separately receives messages from the server. There are messages that carry new values for subscribed topics; application messages sent using the messaging API; and service messages that implement API features. All Diffusion’s protocols assure message delivery while a client remains connected to a server.

Diffusion achieves exceptional data rates by keeping messages in memory and not persisting them to disk. The trade-off for high performance is that messages can be lost if a client or server process crashes, or there is a network failure between client and server.

This article covers how Diffusion protects against message loss in the event of network failure, including the Diffusion 5.8 feature – Reliable Reconnection. Reliable Reconnection ensures that if a network connection is lost, a client session will either establish a new connection to the server with no message loss, or message loss will be detected and the session will be closed.

Connections and Reconnection

The server has a queue of messages to be delivered in order for each of its client sessions. Similarly, each client has a separate queue of messages to be delivered in order to the server. Depending on the protocol, each client maintains one or many long-lived TCP connections to a server. The WebSocket protocol uses a single bi-directional connection. HTTP long-polling protocols use a connection for messages from server to client, and a separate connection for messages from client to server.

Connections can fail and be re-established without loss of a session – a process known as reconnection. Reconnection is configured on the server, and further controlled by the reconnection strategy of the client’s session factory. The server configuration determines the reconnection timeout – a limit on how long the server will maintain a session for a disconnected client. If the client fails to reconnect before the reconnection timeout expires, the server will close the session. Reconnection can be disabled by setting the timeout to 0.

The server configuration allows different reconnection settings for each connector. Here’s an example configuration from Connectors.xml that enables reconnection for an Example connector with a reconnection timeout of 60 seconds.

<!-- Connectors.xml -->
  ...
      <connector name="Example">
  ...
          <reconnect>
              <!-- The number of milliseconds the server will maintain a
                     disconnected session. If the client fails to reconnect
                     within this time, the server will close the session.
              -->
              <keep-alive>60000</keep-alive>
          </reconnect>
      </connector>
  ...

Even though a session is disconnected, both server and client will continue to queue messages for delivery. For example, the server will queue new update messages for topics to which the session is subscribed. Each message queue has a maximum size. If a new message causes the queue to overflow the configured maximum number of messages, the session will be closed.

The maximum queue size for server queues is configured in Server.xml.

<!-- Server.xml -->
  <server>
  ...
      <client-queues>
          <default-queue-definition>default</default-queue-definition>
          <queue-definition name="default">
              <!-- The maximum queue size -->
              <max-depth>1000</max-depth>
          </queue-definition>
  ...

A larger backlog is likely to accumulate in the message queue while a session is disconnected than while it is connected. To let queues grow to accommodate more messages for disconnected sessions, a higher maximum queue size can be configured in the reconnection settings.

<!-- Connectors.xml -->
  ...
      <connector name="Example">
  ...
          <reconnect>
              <keep-alive>60000</keep-alive>
              <!-- Maximum queue size that applies when disconnected.
                     Ignored if less than the queue-definition setting.
              -->
              <max-depth>5000</max-depth>
          </reconnect>
      </connector>
  ...
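
To make the interaction between the two max-depth settings concrete, here’s a minimal sketch of a per-session queue. It is a model of the behaviour described above, not Diffusion’s implementation: the class name SessionQueueSketch and all of its methods are hypothetical, and messages are plain strings.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of a per-session message queue; not Diffusion code. */
final class SessionQueueSketch {
    private final Deque<String> messages = new ArrayDeque<>();
    private final int connectedMaxDepth;    // models the queue-definition max-depth
    private final int disconnectedMaxDepth; // models the reconnect max-depth
    private boolean connected = true;
    private boolean closed = false;

    SessionQueueSketch(int queueMaxDepth, int reconnectMaxDepth) {
        this.connectedMaxDepth = queueMaxDepth;
        // The reconnect setting is ignored if less than the queue-definition setting.
        this.disconnectedMaxDepth = Math.max(queueMaxDepth, reconnectMaxDepth);
    }

    void setConnected(boolean connected) {
        this.connected = connected;
    }

    boolean isClosed() {
        return closed;
    }

    int depth() {
        return messages.size();
    }

    /** Queues a message, or closes the session if the queue would overflow. */
    boolean enqueue(String message) {
        if (closed) {
            return false;
        }
        int limit = connected ? connectedMaxDepth : disconnectedMaxDepth;
        if (messages.size() >= limit) {
            closed = true;    // overflow: the session is closed
            messages.clear();
            return false;
        }
        messages.addLast(message);
        return true;
    }
}
```

With new SessionQueueSketch(1000, 5000), a disconnected session can accumulate 5 000 messages before the next one closes it. With new SessionQueueSketch(1000, 500), the limit stays at 1 000 because the smaller reconnect value is ignored.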

For some applications, the message backlog can be effectively managed using conflation. Conflation reduces the likelihood of a message queue overflowing, discards stale data, and shrinks the number of messages to be delivered when the client reconnects.
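
As an illustration of why conflation keeps the backlog small, here’s a sketch of one simple policy: keep only the latest value for each topic. This is an assumption made for the example, not a description of Diffusion’s conflation policies, and ConflatingQueueSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical "latest value per topic" conflating queue; not Diffusion code. */
final class ConflatingQueueSketch {
    // LinkedHashMap keeps insertion order, and replacing the value of an
    // existing key leaves that key in its original position.
    private final Map<String, String> latestByTopic = new LinkedHashMap<>();

    /** Queues an update, replacing any stale update queued for the same topic. */
    void enqueue(String topic, String value) {
        latestByTopic.put(topic, value);
    }

    int depth() {
        return latestByTopic.size();
    }

    /** Removes and returns the queued updates as "topic=value" strings. */
    List<String> drain() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : latestByTopic.entrySet()) {
            out.add(e.getKey() + "=" + e.getValue());
        }
        latestByTopic.clear();
        return out;
    }
}
```

However many updates arrive while the session is disconnected, the queue depth in this sketch never exceeds the number of distinct topics. Whether a replaced update keeps its queue position or moves to the back is a policy choice; the sketch keeps the position.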

For reconnection to work, the client needs to detect failure in a timely fashion. Intermediate network devices such as routers, load balancers, and firewalls add connection hops between the client and the server, which can delay failure detection. To address this problem, Diffusion has several features to monitor the end-to-end health of connections, including automatic server pings (a “heartbeat” mechanism where the server regularly sends a request message to the client and expects it to respond with a separate message), and client-side connection activity monitoring.
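
The idea behind activity monitoring can be sketched in a few lines: if nothing has been received for longer than some threshold, treat the connection as suspect rather than waiting for TCP to report a failure. The class below is a hypothetical illustration. The name ActivityMonitorSketch, the threshold, and the caller-supplied clock are all assumptions; Diffusion’s own monitoring is internal to the client libraries.

```java
/** Hypothetical connection activity monitor; not Diffusion code. */
final class ActivityMonitorSketch {
    private final long timeoutMillis;
    private long lastActivityMillis;

    ActivityMonitorSketch(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastActivityMillis = nowMillis;
    }

    /** Call whenever a message or ping arrives on the connection. */
    void onActivity(long nowMillis) {
        lastActivityMillis = nowMillis;
    }

    /** True if the connection has been silent for longer than the timeout. */
    boolean isConnectionSuspect(long nowMillis) {
        return nowMillis - lastActivityMillis > timeoutMillis;
    }
}
```

Regular server pings guarantee a minimum level of traffic, so prolonged silence on an otherwise idle connection becomes a meaningful signal.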

Reliable Reconnection

From Diffusion 5.8, applications can depend on Reliable Reconnection: if a client session loses connection and reconnects successfully, no messages have been lost. During reconnection, client and server exchange the number of messages they have sent and received. This information is used to determine whether messages have been lost, and if so, close the session.

Detecting failure is not enough to provide a useful quality of service. In typical applications, messages are flowing all the time. When a connection is lost, a message might have been partly transmitted. Routers and load balancers buffer TCP data transmission for efficiency, and any in-flight messages in these buffers will be lost together with the connection.

To mitigate message loss and improve the chances of successful reconnection, the server maintains a recovery buffer of recently sent messages. A reconnection request from the client includes the sequence number of the next message it expects. The server re-sends messages from its recovery buffer to satisfy the client’s expectation. It’s still possible that messages have been irrecoverably lost if the recovery buffer is too small. In this case, the server will close the session.
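
The recovery decision can be sketched as follows. This is a simplified model under stated assumptions, not the Diffusion protocol: sequence numbers start at 0, messages are strings, and RecoveryBufferSketch and its methods are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Optional;

/** Hypothetical server-side recovery buffer; not Diffusion code. */
final class RecoveryBufferSketch {
    private final Deque<String> recent = new ArrayDeque<>();
    private final int capacity;
    private long nextSequence = 0; // sequence number of the next message to send

    RecoveryBufferSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Records a message as sent, evicting the oldest if the buffer is full. */
    void recordSent(String message) {
        recent.addLast(message);
        if (recent.size() > capacity) {
            recent.removeFirst();
        }
        nextSequence++;
    }

    /**
     * Handles a reconnection request. Returns the messages to re-send, or
     * an empty Optional if messages were irrecoverably lost and the session
     * must be closed.
     */
    Optional<List<String>> recover(long clientExpects) {
        long oldestBuffered = nextSequence - recent.size();
        if (clientExpects < oldestBuffered || clientExpects > nextSequence) {
            return Optional.empty();
        }
        List<String> resend = new ArrayList<>();
        long sequence = oldestBuffered;
        for (String message : recent) {
            if (sequence >= clientExpects) {
                resend.add(message);
            }
            sequence++;
        }
        return Optional.of(resend);
    }
}
```

With a capacity of 3 and five messages sent (sequence numbers 0 to 4), a client expecting message 3 gets messages 3 and 4 re-sent. A client expecting message 1 cannot be satisfied, because that message has already been evicted, so the session would be closed.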

Like the message queue max-depth setting, the appropriate recovery buffer size depends on a number of factors including application data rates, buffer sizes, the complexity of the intermediate network, the typical message size, and available memory. The server recovery buffer size is set per connector, in Connectors.xml.

<!-- Connectors.xml -->
  ...
      <connector name="Example">
  ...
          <reconnect>
  ...
              <!-- Size of the server-side recovery buffer for
                     clients that use this connector. The default
                     value is 128 messages. Higher values increase the
                     chance of successful reconnection, but increase
                     the per-client memory footprint.
              -->
              <recovery-buffer-size>1000</recovery-buffer-size>
          </reconnect>
      </connector>
  ...

The recovery buffer size is an upper limit on the number of messages retained in the buffer. To reduce the memory footprint, messages are removed from the recovery buffers of connected sessions after a few minutes.

The Java, Android, and JavaScript client libraries also have a client-side recovery buffer for messages sent to the server. The reconnection request from the client advertises the available messages it can resend from its recovery buffer. The server instructs the client to resend the appropriate number of messages based on the last message it received.

There is usually a higher rate of messages from server to client than from client to server, so there is less need to tune client recovery buffer sizes. This may not be true for control clients, particularly those that send topic updates to the server. The client-side maximum queue and recovery buffer sizes are configured through the session factory. The default values are 1 000 and 128 messages respectively. Here’s an example using the Java API that starts a session with a maximum queue size of 10 000 messages and a recovery buffer size of 500 messages.

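import com.pushtechnology.diffusion.client.Diffusion;
import com.pushtechnology.diffusion.client.session.Session;

// Client-side queue and recovery buffer sizes are set on the session factory.
Session session = Diffusion.sessions()
                .maximumQueueSize(10000)
                .recoveryBufferSize(500)
                .open("ws://localhost:8080");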

The current .NET, Apple, and C client libraries do not have a client-side recovery buffer. We plan to address this in future releases. Meanwhile, all client libraries benefit from the message loss detection and the server-side recovery buffer.

Reconnection check-list

To improve the chance of successful reconnection:

  • Configure the reconnection timeout to match the longest network outage you wish to support. Larger reconnection timeouts need more server memory, and require larger message queues.
  • Tune the message queue max-depth settings to accommodate the largest expected backlog when connected and disconnected. The appropriate values will depend on application data rates. Larger values will require more memory.
  • Tune recovery buffer sizes on the server and client. Appropriate values will depend on several factors, including application data rates and configured buffer sizes, so fine-tuning requires some experimentation. Again, larger values will require more memory.

Use different connectors to specialise settings for different classes of client. It’s common to configure separate connectors for control clients, fan-out connections, and external clients.
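
As a starting point for the queue settings, a rough backlog estimate is the per-session message rate multiplied by the reconnection timeout. The helper below is a hypothetical back-of-envelope calculation, not part of any Diffusion API. It assumes a steady message rate and no conflation, so treat the result as an upper-end guide to be refined by experiment.

```java
/** Hypothetical sizing helper; the names and formula are illustrative only. */
final class ReconnectSizingSketch {
    private ReconnectSizingSketch() {
    }

    /** Messages that accumulate for one session during an outage. */
    static long backlogMessages(double messagesPerSecond, long reconnectTimeoutMillis) {
        return (long) Math.ceil(messagesPerSecond * reconnectTimeoutMillis / 1000.0);
    }

    /** Approximate memory needed to hold that backlog for one session. */
    static long backlogBytes(long messages, long averageMessageBytes) {
        return Math.multiplyExact(messages, averageMessageBytes);
    }
}
```

For example, at an assumed 50 messages per second, the 60-second timeout used earlier gives backlogMessages(50, 60000) = 3 000 messages. That would overflow a max-depth of 1000 but fit within a reconnect max-depth of 5000. At an assumed 200 bytes per message, the backlog is roughly 600 000 bytes per disconnected session.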

Session replication and fail-over

To wrap up this article, I’ll briefly mention how a Diffusion session can survive loss of a server.

Where possible, reconnection to the existing server is the preferred way to deal with network failure. Load balancers should be configured to use sticky routing, that is, to route reconnection requests back to the server to which a client was previously connected. But what if that server has crashed or a network route to that server is no longer available? If the failed server is part of a cluster with session replication enabled, a client session can connect to another server in the cluster and recover topic state. This is known as fail-over.

As a recovery process, fail-over is less clean than reconnection.

  • The session’s topic selections will be re-evaluated, generating unsubscription and subscription notifications. The current value for each subscribed stateful topic will be resent to the client. Depending on the number of subscriptions, this can require transmission of a large amount of data.
  • Operations for which the client session is awaiting a callback response will be cancelled. For example, the client may have requested that a topic be added. The original server may or may not have received the request – there’s no way for the client to know. The onDiscard method of the AddCallback will be called, indicating that the operation may have failed.
  • Client-side components registered with the original server, such as authentication handlers or exclusive updaters, will be closed. The application should use the onClose callback to re-register the handlers with the new server.

So you can see, dealing with fail-over involves much more work. Configure and use reconnection where you can.