Catchpoint Linux Nodes TCP SYN Retries


Occasionally, the default Linux TCP retransmission rules will cause a [50005] - "Request exceeded the maximum allowed response time" error in Catchpoint’s HTTP-based monitors (Object, API, and Chrome). This may lead users to believe the test failed due to Catchpoint system timeouts when there was actually a TCP connection failure. Because of this, we are changing the way our Linux nodes handle TCP retransmission.

How TCP Retransmission Works

TCP (Transmission Control Protocol) is designed to ensure the delivery of all packets. This design includes built-in logic to handle cases where packets are lost (not received), using a process referred to as 'retransmission'. Packet loss may occur during the initial TCP three-way handshake, or later when the data payload is being transferred. In both cases, the protocol provides a method for retransmitting the lost packet(s), but TCP implementations treat the two cases differently.

A TCP three-way handshake works as follows:

  1. The initiator (Host A) sends a small TCP packet with a special "SYN" flag indicating a connection request.
  2. The receiver (Host B) replies to Host A with a single TCP packet that has both the "SYN" and "ACK" flags set (a SYN-ACK): the ACK acknowledges the request, and the SYN opens the connection in the reverse direction.
  3. Once the initiator receives the SYN-ACK, it sends an acknowledgment ("ACK") back to the receiver to end the handshake and complete the connection.

If the initiator does not receive acknowledgment of the initial TCP Packet (SYN flag) from the receiver, the initiator can resend the same TCP packet (SYN flag), as in this example:

Screenshot 2024-10-20 at 11.02.19 PM.png

Figure 1: TCP Retransmission

TCP has a special requirement for handling cases where the initial SYN packet in the three-way handshake receives no acknowledgment. This requirement was established so that the protocol still has a chance of functioning when connectivity issues cause some packets to be lost. The TCP RFC provides a recommendation on this topic; however, different OSes (and even different OS versions) implement it using different logic, resulting in slightly different behaviors.

In all cases, the logic involves two key parameters:

  • SYN Retransmit Count: the number of times the OS will resend the SYN packet without receiving acknowledgment before giving up.
  • Retransmission Timeout (RTO): the length of time the OS waits for acknowledgment of each SYN request before retransmitting or giving up.

How Each OS Handles Retransmission

The RFC states that the OS should progressively increase the time between retries, doubling the previous RTO each time. However, each OS uses different values for the initial RTO and SYN Retransmit Count, which results in different overall timeouts for the handshake attempt (the point at which the OS gives up completely and the connection attempt fails).

  • Windows Server 2012 R2 is configured out of the box with an initial RTO of 3 seconds and 2 SYN retransmits. This results in a total handshake timeout of about 21 seconds:
    • Initial SYN at 0s
    • First retransmit 3 seconds later at 3s
    • Second retransmit 6 seconds later at 9s
    • Final timeout 12 seconds later at 21s

See our documentation here.

  • Windows 10 is configured out of the box with an initial RTO of 1 second and 4 SYN retransmits:

    • Initial SYN at 0s
    • First retransmit at 1s
    • Second retransmit at 3s
    • Third retransmit at 7s
    • Fourth retransmit at 15s
    • Final timeout at 21s. (Based on the RTO doubling logic, it should time out at 31s; the reason for the earlier timeout is unclear.)
  • Linux is configured with an initial RTO of 1 second and 6 SYN retransmits, which results in a much longer total timeout:

    • Initial SYN at 0s
    • First retransmit at 1s
    • Second retransmit at 3s
    • Third retransmit at 7s
    • Fourth retransmit at 15s
    • Fifth retransmit at 31s
    • Sixth retransmit at 63s
    • Final timeout at 127 seconds.
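
The doubling rule and the three schedules above can be reproduced with a short calculation. The following Python sketch is illustrative only: the function name syn_schedule and its parameters are hypothetical, and it models the idealized doubling logic described in this article, not any particular OS implementation.

```python
def syn_schedule(initial_rto, retransmits):
    """Return (send_times, give_up_time) in seconds for a SYN that is never answered.

    send_times[0] is the initial SYN; every later entry is a retransmit.
    The wait doubles after each unanswered SYN.
    """
    send_times = [0]
    rto = initial_rto
    for _ in range(retransmits):
        send_times.append(send_times[-1] + rto)
        rto *= 2
    # After the last SYN, the OS waits one more (doubled) RTO before giving up.
    give_up_time = send_times[-1] + rto
    return send_times, give_up_time


# Initial RTO and retransmit counts as listed in this article.
for name, rto, count in [
    ("Windows Server 2012 R2", 3, 2),
    ("Windows 10", 1, 4),
    ("Linux", 1, 6),
]:
    sends, give_up = syn_schedule(rto, count)
    print(f"{name}: SYNs sent at {sends}, gives up at {give_up}s")
```

For Windows 10 this calculation yields 31 seconds, which is the value the doubling logic predicts; the 21 seconds observed in practice is not explained by this model.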

Below is a table depicting the timing of each retransmission and the final timeout for each OS. In each column, the last value is the time at which the OS gives up completely and the connection attempt fails; the values before it are the times at which each SYN retransmit is sent. Windows 10 should give up at 31s based on the doubling logic, but in practice gives up at 21s (the reason is unclear).

Screen_Shot_2021-06-21_at_9.49.21_AM.png
Table 1: All retransmit times are in seconds.

Catchpoint Issues Related to This OS Difference

Catchpoint’s default test timeout period is currently 30 seconds, which is greater than the Windows TCP timeout period of 21 seconds, but much shorter than the Linux TCP timeout of 127 seconds. As a result, when there are connectivity issues to the target server from a Linux node, the test almost never fails with a TCP connection timeout. Instead, it fails with [50005] - "Request exceeded the maximum allowed response time". In addition, because of how Chrome and other clients require our monitors to work, it might not be clear to users that the timeout occurred because the TCP handshake could not be completed within 30 seconds. This can confuse users and trigger support tickets and investigations to figure out what happened, which is usually wasted effort because the issue is almost always the result of a failed TCP handshake.
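
The comparison above comes down to which timer expires first. Here is a minimal Python sketch of that reasoning; the function name first_failure and the returned labels are hypothetical and are not Catchpoint identifiers, and the tie case (both timers equal) is arbitrarily assigned to the test timeout.

```python
def first_failure(test_timeout, handshake_timeout):
    """Name the timer that expires first when no SYN is ever answered."""
    if handshake_timeout < test_timeout:
        return "TCP connection timeout"
    # The test timer expires before the OS gives up on the handshake.
    return "[50005] test timeout"


TEST_TIMEOUT = 30  # default test timeout from this article, in seconds

# Handshake timeouts from this article, in seconds.
for os_name, handshake_timeout in [("Windows", 21), ("Linux", 127)]:
    print(f"{os_name}: {first_failure(TEST_TIMEOUT, handshake_timeout)}")
```

With the values in this article, a Windows node reports the TCP failure at 21 seconds, while a Linux node reaches the 30-second test timeout long before the OS gives up at 127 seconds.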

Proposed Solution

Catchpoint has announced EOL of Windows support and is in the process of migrating all our nodes to Linux. As part of this process, we should address this timeout issue by adjusting the TCP settings in Linux so that it is clear when a test timeout is the result of a failed TCP handshake.

We plan to keep the initial RTO setting of 1 second, as changing this could adversely impact other TCP logic. We will reduce the SYN Retransmit Count to 3. As a result, the OS will give up at 15 seconds, having sent a total of 4 SYN packets (the initial SYN plus 3 retransmits) without receiving a SYN-ACK.
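
The 15-second figure follows from the closed form of the doubling rule: total timeout = initial RTO × (2^(retransmits + 1) − 1). The Python sketch below checks that arithmetic and, as an optional extra, reads the retry count on the local host. It assumes the setting in question is the Linux sysctl net.ipv4.tcp_syn_retries, which this article does not name explicitly; the function names are hypothetical.

```python
from pathlib import Path

# Assumption: the SYN Retransmit Count maps to the net.ipv4.tcp_syn_retries sysctl.
SYN_RETRIES_PATH = Path("/proc/sys/net/ipv4/tcp_syn_retries")


def handshake_timeout(initial_rto, syn_retries):
    """Seconds until the OS gives up: initial_rto * (2 ** (syn_retries + 1) - 1)."""
    return initial_rto * (2 ** (syn_retries + 1) - 1)


def read_syn_retries(path=SYN_RETRIES_PATH):
    """Return the configured retry count, or None if it cannot be read (e.g. not Linux)."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


print("Proposed setting (RTO 1s, 3 retransmits):", handshake_timeout(1, 3), "seconds")

current = read_syn_retries()
if current is not None:
    print(f"This host: {current} retransmits -> {handshake_timeout(1, current)} seconds")
```

With 3 retransmits the handshake timeout of 15 seconds falls well inside the 30-second test timeout, so a failed handshake surfaces as a TCP connection failure before the test timer expires.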

Screen_Shot_2021-06-21_at_9.49.33_AM.png

Table 2: All retransmit times are in seconds.