Recovering from Connectivity Outages

One of the most basic problems in high-availability communications is disconnection between distributed endpoints. An AMPS client, for example, may disconnect from a remote server for a variety of reasons: a physically disconnected network segment, a malfunctioning or misconfigured firewall or router, or even an AMPS server that is down. As a starting point, highly available AMPS clients need to reconnect to their existing server when a disconnection occurs.

HA Client

The AMPS Client distributions include HA clients that provide capabilities for detecting disconnection and automatically recovering.

Disconnect Detection

In most cases, you become aware that your connection is no longer alive when the TCP stack on your client system reports a failure to your application. To ensure that your application detects a failure in a timely manner, you can request that your application exchange heartbeat messages with the AMPS server. These messages are handled automatically by the AMPS client infrastructure, and prompt an automatic reconnect from the HAClient when disconnection is detected.

HAClient c(...);
c.setHeartbeat(3); // 3 second heartbeats

// Set up server chooser, connect and logon
// here (see Developer Guides for details)

Without heartbeating, it can be difficult for your application to detect disconnection in a timely fashion. For example, unless your client uses the heartbeat mechanism, a publisher or subscriber may remain running for minutes with an invalid connection before the operating system reports that the connection is not valid.

  The operating system may not notify the client that an idle socket has been closed until an extended period of time has elapsed, or the client attempts to send data to the socket. Current AMPS servers and clients provide a heartbeat mechanism for reducing the amount of elapsed time before detecting that the connection is no longer active.
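To make the idea of heartbeat-based detection concrete, here is a minimal sketch of the bookkeeping involved. It does not show how the AMPS client implements heartbeats internally: the `HeartbeatMonitor` class, its method names, and the "allow N silent intervals" policy are all assumptions made for illustration.

```cpp
#include <chrono>

// Hypothetical helper for illustration only; not part of the AMPS client API.
class HeartbeatMonitor {
public:
    using Clock = std::chrono::steady_clock;

    // interval:      how often the peer is expected to send a heartbeat.
    // missedAllowed: how many intervals may pass in silence before the
    //                connection is treated as dead (an assumed policy).
    HeartbeatMonitor(std::chrono::seconds interval, int missedAllowed,
                     Clock::time_point now)
        : timeout_(interval * missedAllowed), lastSeen_(now) {}

    // Call whenever any traffic (heartbeat or data) arrives from the peer.
    void onTraffic(Clock::time_point now) { lastSeen_ = now; }

    // True once the peer has been silent for longer than the timeout.
    bool isDead(Clock::time_point now) const {
        return now - lastSeen_ > timeout_;
    }

private:
    std::chrono::seconds timeout_;
    Clock::time_point lastSeen_;
};
```

With a 3 second interval and two allowed silent intervals, this monitor reports a dead connection after a little more than 6 seconds of silence, rather than waiting for the operating system to report the failure.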

Choosing a Client Name

In AMPS, each client must log on with a unique client name. An AMPS instance will not allow multiple connections to log on with the same client name. When publishing, AMPS uses the client name as a key to record the highest sequence number seen from each client. For this reason, when a publisher disconnects and reconnects to the same AMPS instance or to a secondary, it should log on with the same client name that it used previously.

  AMPS will not allow two connections to use the same client name. If a name is in use by another connection and the new connection presents different credentials, AMPS refuses the new connection. If the new connection presents the same credentials, AMPS removes the old connection, on the assumption that this is the same application reconnecting.
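The name-collision rule can be modeled in a few lines. The sketch below is not AMPS server code: `ClientNameRegistry`, `LogonResult`, and the use of a plain string to stand in for credentials are hypothetical and exist only to illustrate the rule as described here.

```cpp
#include <map>
#include <string>

// Hypothetical model of the client-name rule; not AMPS server code.
enum class LogonResult { Accepted, ReplacedOldConnection, Refused };

class ClientNameRegistry {
public:
    // Applies the rule: a name already in use is refused when the
    // credentials differ, and takes over the old connection when they match.
    LogonResult logon(const std::string& clientName,
                      const std::string& credentials) {
        auto it = active_.find(clientName);
        if (it == active_.end()) {
            active_[clientName] = credentials;   // name is free
            return LogonResult::Accepted;
        }
        if (it->second != credentials) {
            return LogonResult::Refused;         // someone else owns the name
        }
        // Same name and same credentials: assume the same application
        // is reconnecting, so the new connection replaces the old one.
        return LogonResult::ReplacedOldConnection;
    }

    // Called when a connection using this name goes away.
    void disconnect(const std::string& clientName) {
        active_.erase(clientName);
    }

private:
    std::map<std::string, std::string> active_;  // client name -> credentials
};
```

In this model, a publisher that reconnects under its previous name and credentials takes over its old entry, while a different application that picks the same name by accident is refused.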

With the AMPS Client Library, applications pass the client name to the HAClient constructor.

Choosing a Server

Because disconnection can occur for so many reasons, an application needs to choose its next server carefully. With a single AMPS instance, the only choices are to attempt reconnection to the same server or to give up entirely. The decision is more complicated in a system with a primary and a secondary, and more complicated still when the system mixes backup servers in the same rack with others in entirely different data centers.

As a rule, you should always attempt to reconnect to the primary when a disconnection occurs. Network “hiccups” happen even when both the client and server hosts are healthy. Choose to move on to secondary servers only when reconnection to the primary fails. In most cases, your code should prefer a secondary that is close by rather than far away; but it is important to keep this logic as simple as possible, since the code for choosing a new server will be among the hardest in your application to fully test.
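One way to keep this logic simple is to compute a fixed failover order once and walk it in sequence. The sketch below assumes a hypothetical `Candidate` record with a site-defined `distance` rank; neither the struct nor the `failoverOrder` function comes from the AMPS Client Library.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical description of one server. "distance" is an assumed,
// site-defined rank: 0 = same rack, larger values = farther away.
struct Candidate {
    std::string uri;
    bool isPrimary;
    int distance;
};

// Returns the servers in the order they should be tried:
// the primary first, then secondaries from nearest to farthest.
std::vector<Candidate> failoverOrder(std::vector<Candidate> servers) {
    std::stable_sort(servers.begin(), servers.end(),
        [](const Candidate& a, const Candidate& b) {
            if (a.isPrimary != b.isPrimary) {
                return a.isPrimary;          // the primary sorts first
            }
            return a.distance < b.distance;  // then nearer before farther
        });
    return servers;
}
```

Because the order is computed from static data, it can be covered by a unit test without simulating network failures, which helps with the testing concern raised above.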

The AMPS Client Library allows you to encapsulate your logic for choosing a server and reuse it across multiple applications, by implementing the ServerChooser interface and using it with the HAClient. You provide the ServerChooser, both to establish an initial connection and to choose an alternate connection. The HAClient works by implementing a disconnect handler for you, and by interacting with the ServerChooser. The reference implementation, DefaultServerChooser, keeps a list of URIs to connect with; upon an initial disconnection, it attempts to reconnect to the previously connected server, and if that fails, it keeps trying alternates in its list. You can use this reference implementation out of the box, extend it for your own needs, or create your own ServerChooser from scratch.

Here is an example using the reference implementation:

HAClient haClient(...);

// Default server chooser is provided for demonstration
// purposes: for production, implement a ServerChooser
// with the behavior that you want.

DefaultServerChooser myServerList = new DefaultServerChooser();
myServerList.add("tcp://primary.amps:9004/fix");
myServerList.add("tcp://secondary.amps:10000/fix");
myServerList.add("tcp://far.away.amps:9004/fix");

haClient.setServerChooser(myServerList);

haClient.connectAndLogon();

// After this point, the haClient will
// automatically attempt reconnection
// and connection to the secondary servers
// provided by "myServerList".
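The behavior described for the reference implementation (retry the server that was last connected, then move through the alternates) can be illustrated with a small stand-in class. This is a simplified sketch, not the DefaultServerChooser source: `ListChooser` and its method names are hypothetical and do not match the actual ServerChooser interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a list-based server chooser.
// Method names are illustrative, not the AMPS ServerChooser interface.
class ListChooser {
public:
    void add(const std::string& uri) { uris_.push_back(uri); }

    // The URI that the client should try next.
    const std::string& current() const {
        assert(!uris_.empty());
        return uris_[index_];
    }

    // A connection attempt to current() failed: advance to the next
    // entry, wrapping around to the start of the list.
    void reportFailure() {
        assert(!uris_.empty());
        index_ = (index_ + 1) % uris_.size();
    }

    // A connection attempt succeeded: stay on this entry, so the first
    // attempt after a later disconnect goes back to the same server.
    void reportSuccess() {}

private:
    std::vector<std::string> uris_;
    std::size_t index_ = 0;
};
```

A client loop built on this sketch would call `current()`, attempt the connection, and then call `reportSuccess()` or `reportFailure()` depending on the outcome.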

Implementing your own ServerChooser can be a distinct advantage, because it lets you incorporate knowledge about the particulars of the local network environment. In large organizations with multiple AMPS deployments, a central team may be responsible for implementing a custom ServerChooser that uses proprietary knowledge to provide greater reliability and efficiency than the DefaultServerChooser. Contact your site’s AMPS team for more details.

Establishing a new connection after connectivity is broken is an important part of creating highly available AMPS clients in nearly every scenario. Though AMPS does not segregate workloads, many AMPS clients are primarily either publishers or subscribers. For the remainder of this white paper, we will focus first on issues specific to publishers, followed by issues specific to subscribers. If your application is both a publisher and a subscriber, then both sets of considerations apply.