28. Troubleshooting AMPS

This chapter presents common techniques for troubleshooting AMPS. Additional troubleshooting information and answers to common questions about AMPS are included on our support site at http://support.crankuptheamps.com/hc.

Planning for Troubleshooting

There are several steps that you can take before you need to troubleshoot a problem that will make troubleshooting easier. 60East recommends that you consider taking the following steps for a production instance of AMPS:

  1. Configure the instance to log messages of at least warning or more verbose. Some problems require more information, so increasing the amount of logging may make troubleshooting easier, if your instance has storage available. 60East recommends logging at info level or more verbose for production instances. Even if storage space for logs is extremely restricted, an instance should log at at least warning level.
  2. Ensure that client applications use unique names. Wherever possible, ensure that those names can easily be traced back to the instance of the application. For example, you might use the name of application combined with the name of the logged on user as a unique name. This will help you to more quickly find log messages related to a problem.
  3. Enable the administrative server. The administrative console is a good way to get a snapshot of the current state of a running instance.
  4. If you are using replication, ensure that your AMPS instances have unique names. Where possible, use names that make it easy to relate replication messages to the servers that process the message. For example, you might relate the AMPS instance name to the purpose that the instance serves, the physical server that the instance runs on, or both.
  5. Learn what normal operation looks like for your application. If possible, take the time to inspect the AMPS logs and the output of the administrator console when everything is working as expected. Applications vary in how they use AMPS, and what is normal for your application might indicate a problem in a different application. For example, if your application normally has a few publishers and many subscribers, seeing dozens of publishers come online may indicate that an application has unexpectedly started more publishers. Likewise, if no publishers are online, that may indicate an issue with connectivity to the AMPS server. Understanding normal behavior will help you to more easily and accurately spot problems.

Finding Information in the Log

The AMPS log is one of the most useful places to find information when there’s a problem with your application. Here are some techniques to use for finding relevant information in the log.

  • Ensure the log is capturing information that will be useful for diagnosing the problem. In general production use, 60East recommends logging at info level (or more verbose). To fully troubleshoot an error, it may be necessary to log at trace level to see the exact behavior in AMPS. Some deployments may find it useful to keep a separate log at warning level and above to more easily detect errors and potential problems, while maintaining an info level log for being able to troubleshoot problems.

  • To find log messages that may indicate a problem, use amps-grep or the Linux grep tool to find log messages at warning, error, critical, or emergency levels. For example, you might use the following command line:

    amps-grep -E 'warning|error|critical|emergency' log_file

    or

    grep -E 'warning|error|critical|emergency' log_file

    This will show lines from the log that contain messages logged at those levels. The text that AMPS uses for log messages is guaranteed not to include strings that duplicate one of the log levels, although information that you configure (such as client names, topic names, and so on) may contain those strings.

  • If you know the name of the client that experienced the problem, you can use that name to get information about the client. It’s often helpful to get log messages that include the client name and several lines of output after the client name to help you understand the context in which AMPS produced the message for the client name. To do this, you might you use the following command line:

    amps-grep client_name log_file

    or

    grep –B2 –A10 client_name log_file

This command line looks for all occurences of the client_name in the log file. The amps-grep version prints the full messages where the client_name appears. The grep version attempts to approximate this by printing two lines of context before the line that contains the client name, and ten lines of context after the line that contains the client name.

Once you’ve found the information you’re looking for, the ampserr utility can help you look up more information on messages, as described in Looking Up Errors with ampserr.

Reading Replication Log Messages

For replication connections, the replication source creates a client name that it uses to connect to the downstream instance. This client name contains the source, destination, sync setting, and protocol for the connection. The client name uses the following format:

source!destination!sync_setting!protocol

Notice, however, that this is a client name. The client name is the name used for the connection, but it does not indicate the direction of any particular message. As an example, consider a client name of:

OrderServer!HotBackup!sync!amps-replication

This client name is used for a connection that the AMPS instance named OrderServer has made to AMPS instance named HotBackup. The connection uses the amps-replication protocol, and was configured for synchronous replication at the time the client connected. In this case, a message like the following:

12-1002 client[OrderServer!HotBackup!sync!amps-replication] replication ack received: publish ack
      [txid=35922]

Means that a publish acknowledgment was received on the connection that OrderServer made to HotBackup.

Troubleshooting Disconnected Clients

One common symptom of problems in an AMPS application is that AMPS disconnects clients unexpectedly. AMPS disconnects clients in the following situations:

  • When transaction logging is configured for the instance and a client with a duplicate name logs on
  • When heartbeating is enabled, and the client misses a heartbeat
  • When a slow client falls behind by more than the configured threshold
  • When the entitlement cache for an instance is reset
  • When the administration console disconnects a client
  • When the transport is disabled

This section presents techniques to help you identify why clients are disconnected and correct any problems that may exist.

Locating the Reason for Disconnection

To discover the reason that a client was disconnected, use the following command to find the client name in the logs:

grep -B2 -A5 client_name log_file

The results of this can provide information as to why the client was disconnected. AMPS logs a reason for the disconnection if the disconnection was the result of an internal action by AMPS. If the disconnection was the result of an action from the Admin console, or the client chose to disconnect, the disconnection is logged, but no further information is given.

Duplicate Client Name Disconnection

When a client is disconnected due to another client with the same name logging on, the messages produced might look like:

2014-11-20T16:26:59.6408410-08:00 [5] warning: 02-0025 A client logon with an 'in use' client name for the same user id forced a disconnect of client: client[my-name] with user id:

To resolve this issue, ensure that clients use unique names when connecting to instances that configure a transaction log.

Missed Heartbeat Disconnection

When AMPS disconnects a client due to the client failing to heartbeat, the log messages produced look like the following:

2014-11-20T16:35:23.9185690-08:00 [6] error: 07-0042 AMPS heartbeat manager is disconnecting an unresponsive client: no-heartbeat-client

This error most often arises from severe network congestion, a deadlock or similar problem in the application that is preventing the AMPS client library from producing heartbeats, or a problem in AMPS that prevents AMPS from servicing heartbeat requests.

Slow Client Disconnection

The following shows sample log entries for slow client disconnection. If a client named sleepy-client was disconnected for being a slow client, the relevant entries in the transaction log might look like:

2014-11-20T15:33:06.8496430-08:00 [7] warning: 70-0011 client[sleepy-client] slow consumption detected, offline messages.
2014-11-20T15:33:06.8498130-08:00 [7] error: 70-0004 client[sleepy-client] is not consuming messages, disconnecting slow client

Notice that there may be a considerable period of time between the client being offlined and the client being disconnected.

There are several approaches to solving the problem:

  • Reduce the number of messages returned. Clients most often fall behind when a SOW query or a replay from the transaction log returns a large number of messages. If possible, use content filtering to return a more precise set of messages.
  • Improve the rate at which the client handles messages. If the client message handler takes a relatively long time to process the message, moving message processing onto a different thread or streamlining the processing may improve the speed of the client and allow the client to keep up.
  • Adjust the client offlining threshold. You can also increase the capacity of messages that AMPS will buffer for clients connected to a specific transport, as described in Slow Client Management and Capacity Limits.

Admin Console Client Disconnection

Disconnection from the admin console provides no additional information, and produces a log message like the following:

2014-11-20T15:33:06.8502350-08:00 [4] info: 07-0013 client[sleepy-client] disconnected.

Admin Console Transport Disabled

A transport being disabled through the admin console produces messages like the following:

2014-11-20T16:04:00.9548130-08:00 [10] info: 07-0047 Transport[json-tcp] being disabled.
2014-11-20T16:04:00.9550150-08:00 [4] info: 07-0013 client[amps-json-tcp-18] disconnected.