Saturday, 18 July 2015

TCP ACK generation rate


A simple but also very important question about TCP: how often should a TCP ACK be generated?

According to the RFC, an ACK SHOULD be generated for at least every second full-sized segment (to be precise, the current wording is in RFC 5681; the older RFC 1122 text was confusing, saying MUST in one place and SHOULD in another). Sounds simple, but if we look at Wireshark (after running an iperf test) we see that the TCP ACK generation rate is much lower than the TCP data (segment) rate.
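
A quick way to quantify this ratio without opening the Wireshark GUI is to count data segments versus pure ACKs in a capture. A small sketch (the interface name, port and file path are only examples; the display filters are standard Wireshark ones):

tcpdump -i eth1 -s 100 -w /tmp/iperf.cap 'tcp port 5001'                   # capture during the iperf run
tshark -r /tmp/iperf.cap -Y 'tcp.len > 0' | wc -l                          # segments carrying data
tshark -r /tmp/iperf.cap -Y 'tcp.len == 0 && tcp.flags.ack == 1' | wc -l   # pure ACKs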


First of all, I have to note that this is not a problem of the system or the TCP implementation, because the RFC allows sending TCP ACKs less often (the ACK SHOULD be generated...). And in the kernel code (I am only speaking about the Linux kernel) it is defined to send one ACK after two full-sized TCP data segments.
Without going too deep, the TCP ACK rate difference is caused by two conditions:



1. TCP offloading to the NIC - almost all modern NICs allow offloading some basic functions to the NIC (like TCP segmentation, IP checksum calculation, etc.).
2. A high TCP receive rate - in some cases, if the incoming data overfills the TCP receive buffer, the kernel reduces the TCP ACK generation rate.

TCP offloading to the NIC 

TCP offloading to the NIC is mainly used to reduce CPU load, because many minor checks and calculations are done on the NIC itself. It not only reduces system load but also allows faster transmission (it has a dark side too :). To control NIC offloading we use the ethtool command in Linux.

To see current setting we use:

ethtool -k eth1
Features for eth1:
rx-checksumming: on
tx-checksumming: on
    tx-checksum-ipv4: off [fixed]
    tx-checksum-ip-generic: on
    tx-checksum-ipv6: off [fixed]
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-ecn-segmentation: off [fixed]
    tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
......

......

In our case we would like to turn off the following offloading features:
tcp-segmentation-offload - tso
generic-segmentation-offload - gso
generic-receive-offload - gro
 
To change the current NIC offloading settings we use:

ethtool -K eth1 gro off tso off  gso off 

(this should be done on both systems - sender and receiver)

We also turned off the receive offloading (GRO), because otherwise on some NICs the capture shows large aggregated TCP packets instead of the segments actually on the wire.
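
After the change it is worth verifying that the features really went off (the interface name is just an example):

ethtool -k eth1 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload'
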
The resulting TCP ACK and TCP data rates with segmentation offloading turned off can be seen in the picture below:




Now the TCP ACK generation rate looks much more like what the RFC describes.
But by turning the offloading off we increase the system CPU load and may reduce the data throughput of the system.
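
If you want to see that CPU cost for yourself, one simple (illustrative) check is to watch per-core utilisation while iperf is running, first with offloading on and then with it off (mpstat comes from the sysstat package):

mpstat -P ALL 1        # run on the receiver in one terminal
iperf -s -i 1          # the usual iperf server in another terminal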

For testing purposes it is also good to turn off pause-frame support (so the receiver or a network node cannot ask the sender to slow down because of heavy load on the receiver):

ethtool -A eth0 rx off tx off autoneg off

To check whether pause frames were turned off:

ethtool  -a eth0
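
On many drivers you can also check whether any pause frames were actually exchanged by looking at the NIC statistics (the counter names vary by driver, so treat this as a sketch):

ethtool -S eth0 | grep -i pause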

TCP acknowledgment in Linux Kernel

To understand TCP ACK generation in the Linux kernel we first have to look at the source code of the TCP stack. The function responsible for the ACK decision is called __tcp_ack_snd_check(); it checks whether an ACK should be sent now (tcp_send_ack()) or delayed (tcp_send_delayed_ack()). The full code of the function is shown below:


static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
{
        struct tcp_sock *tp = tcp_sk(sk);

            /* More than one full frame received... */
        if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
             /* ... and right edge of window advances far enough.
              * (tcp_recvmsg() will send ACK otherwise). Or...
              */
             __tcp_select_window(sk) >= tp->rcv_wnd) ||
            /* We ACK each frame or... */
            tcp_in_quickack_mode(sk) ||
            /* We have out of order data. */
            (ofo_possible && skb_peek(&tp->out_of_order_queue))) {
                /* Then ack it now */
                tcp_send_ack(sk);
        } else {
                /* Else, send delayed ack. */
                tcp_send_delayed_ack(sk);
        }
}


The __tcp_ack_snd_check() function decides whether the TCP acknowledgment must be sent now, by calling tcp_send_ack(), or can be delayed, by calling tcp_send_delayed_ack(). Four conditions are evaluated to decide whether the kernel sends the TCP ACK immediately:


  1. The first condition is that the received but still unacknowledged data must be more than one maximum segment size, as stored in the per-connection icsk_ack.rcv_mss variable. This comes from the RFC 1122 (STD 3) specification, section 4.2.3.2, saying that an ACK SHOULD be generated for at least every second full-sized segment (to be precise, the current wording is in RFC 5681; the older RFC 1122 text was confusing, saying MUST in one place and SHOULD in another).
    (tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss && __tcp_select_window(sk) >= tp->rcv_wnd

    This logical condition has two parts. The first part says that the unacknowledged data must be more than the
    inet_csk(sk)->icsk_ack.rcv_mss value; strictly speaking that is not exactly the RFC wording, but in practice the condition is met after receiving the second TCP segment. What is very important here is that the RFC says the ACK "SHOULD", not "MUST", be generated. So basically the RFC allows us to generate ACKs more rarely.
  2. The second condition says that the receive window that would be advertised now (the usable buffer space) must be at least as big as the receive window already advertised to the other side (rcv_wnd). This is done to avoid the Silly Window Syndrome (SWS) problem, first described in RFC 813. SWS occurs with a poor TCP flow-control implementation or with a slow system that consumes the received data too slowly. In that situation the receive window rcv_wnd fills with data much faster than the application can drain the buffer, so the kernel has to keep shrinking the advertised window, sending the updated size to the client, until the window reaches its minimal allowed size and the transmission becomes ineffective. By holding back the TCP ACK we reduce the packet flow rate: the client must wait for an ACK before sending more data, which reduces the load on the server. This behaviour still satisfies RFC 5681's rule that "an ACK SHOULD be generated for at least every second full-sized segment".
  3. The third condition, tcp_in_quickack_mode(sk), checks whether the connection is currently in "quick ACK" mode. The kernel enters this mode, for example, at the start of a connection (so the sender's window can grow quickly) and leaves it when the session becomes interactive "ping-pong" traffic (like telnet or other remote-access applications), where the ACK can ride on the response data instead. When quick ACK mode is on, the ACK is sent immediately; otherwise the kernel may delay it (the RFC allows a delay of up to 500 ms; Linux uses shorter timeouts in practice).
  4. The fourth and final condition checks whether the receiver has out-of-order data, by checking the ofo_possible flag and peeking at the out-of-order queue (tp->out_of_order_queue). Sending an immediate ACK here improves TCP recovery time after a loss, per RFC 5681: "A TCP receiver SHOULD send an immediate duplicate ACK when an out-of-order segment arrives. The purpose of this ACK is to inform the sender that a segment was received out-of-order and which sequence number is expected." This condition usually triggers when packet loss or corruption occurs on the link between the client and the server.

After evaluating these conditions (the first and second are joined by a logical AND), the kernel sends the ACK immediately if any of the three resulting checks is true; otherwise the ACK is delayed by calling the tcp_send_delayed_ack() function, which schedules the send time based on the RTT and the system minimum and maximum delayed-ACK timers. When there is no packet loss and the session has no ping-pong data going back to the sender, ACK generation depends directly on the first two conditions (more than one full-sized segment received, and enough free space in the receive buffer).
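
A quick way to see how often the delayed-ACK path is actually taken on a live system is to look at the kernel's own counters (the exact counter names can differ slightly between kernel and tool versions):

netstat -s | grep -i "delayed ack"     # delayed ACKs sent since boot
nstat -az | grep -i delayedack         # the same counters via nstat (iproute2)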

Results/Outcome :

The TCP ACK generation rate mostly depends on two things: the NIC (TCP) offloading, which reduces the observed TCP ACK rate (and also reduces system load), and the Linux kernel decision function __tcp_ack_snd_check(). According to one condition of that function (__tcp_select_window(sk) >= tp->rcv_wnd), the ACK must be delayed until the receive buffer gets freed.

 
 

 

Saturday, 11 July 2015

TCP Bottleneck (Increasing TCP performance on Linux) 1

The default system settings are usually good and work fine in most cases, but default settings are boring :) We all want more power, faster performance, less latency and so on. Increasing TCP performance in Linux is easy, but finding the bottleneck in your system on your own is more interesting :)

What we need:
OS - Debian ( Jessie/Wheezy) - Because Debian ROCKS...
Hardware - 2 PC, Ethernet cable
Tools : tc (Traffic Control), iperf, tcpdump, wireshark, ethtool

     PC1                    1Gbit/s                  PC2
|192.168.1.10|------>------Ethernet------>------|192.168.1.100|

How we test:

First of all we need to know the maximum performance of our setup without any changes - the default configuration (hard for my systems :)

Data sending from PC1 --> PC2 :

     PC1                    1Gbit/s                  PC2
|192.168.1.10|------>------Ethernet------>------|192.168.1.100|

iperf -c 192.168.1.100 -i 1                  iperf -s -i 1

------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.10 port 49372 connected with 192.168.1.100 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   110 MBytes   921 Mbits/sec
[  3]  1.0- 2.0 sec   109 MBytes   915 Mbits/sec
[  3]  2.0- 3.0 sec   111 MBytes   930 Mbits/sec
[  3]  3.0- 4.0 sec   111 MBytes   928 Mbits/sec
[  3]  4.0- 5.0 sec   111 MBytes   934 Mbits/sec
[  3]  5.0- 6.0 sec   110 MBytes   924 Mbits/sec
[  3]  6.0- 7.0 sec   111 MBytes   931 Mbits/sec
[  3]  7.0- 8.0 sec   111 MBytes   931 Mbits/sec
[  3]  8.0- 9.0 sec   111 MBytes   933 Mbits/sec
[  3]  9.0-10.0 sec   110 MBytes   926 Mbits/sec
[  3]  0.0-10.0 sec  1.08 GBytes   927 Mbits/sec


Good, we get ~1 Gbit/s (the difference goes to the Ethernet, IP and TCP headers). Note that this value (throughput) can deviate depending on the system and setup.
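
A rough back-of-the-envelope check of where the "missing" bandwidth goes, assuming a 1500-byte MTU and TCP timestamps (so an MSS of 1448 bytes; the exact numbers depend on the options in use):

# per-MSS on-wire cost: 1448 payload + 52 bytes IP/TCP headers + 38 bytes Ethernet framing = 1538 bytes
echo "scale=1; 1448 * 1000 / 1538" | bc
# ~941.4 Mbit/s theoretical goodput on GigE, so ~930 Mbit/s from iperf is in the right ballpark
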
Now let's make conditions a little bit worse and delay the traffic going from PC1 to PC2 by 10 ms with the tc command (a daemon of the ancient world :)
And the manual to rule it -->

tc qdisc add dev eth0 root netem delay 10ms
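
Before rerunning iperf it is worth a quick sanity check that the delay is really in place (run from PC1, addresses as in the diagram above; the RTT should now be roughly 10 ms higher than before):

ping -c 4 192.168.1.100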

and rerun the test:

------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.10 port 49425 connected with 192.168.1.100 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  84.9 MBytes   712 Mbits/sec
[  3]  1.0- 2.0 sec  91.2 MBytes   765 Mbits/sec
[  3]  2.0- 3.0 sec  90.9 MBytes   762 Mbits/sec
[  3]  3.0- 4.0 sec  91.2 MBytes   765 Mbits/sec
[  3]  4.0- 5.0 sec  90.5 MBytes   759 Mbits/sec
[  3]  5.0- 6.0 sec  91.0 MBytes   763 Mbits/sec
[  3]  6.0- 7.0 sec  90.9 MBytes   762 Mbits/sec
[  3]  7.0- 8.0 sec  91.2 MBytes   765 Mbits/sec
[  3]  8.0- 9.0 sec  90.4 MBytes   758 Mbits/sec
[  3]  9.0-10.0 sec  91.8 MBytes   770 Mbits/sec
[  3]  0.0-10.0 sec   904 MBytes   758 Mbits/sec


Now we only get an average of 758 Mbit/s; not so bad considering that we increased the latency from about 1 ms to 10 ms. When making such changes it is very important to check whether the performance degradation comes only from the delay and not from system instability (system load and other factors).



First we check tc for information about the queuing discipline (it does all the hard work of delaying our traffic):



 tc -s qdisc show
qdisc netem 8001: dev eth0 root refcnt 2 limit 1000 delay 10.0ms
 Sent 991282850 bytes 655137 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0 



No drops, overlimits or requeues - good, so the traffic delaying has not dropped any packets. The second (I would say more important) check is to see whether we had any TCP packet drops, by sniffing the packets.
We can do this on either system (PC1 or PC2); it does not matter which, since we only want to see whether any packets were dropped.

tcpdump -i eth0 -s 80 -w /tmp/as.cap
 

Here -i selects the interface, -s the snapshot length (capture only the first 80 bytes of every packet) and -w writes the capture to a file.
The -s or snaplen option specifies how many bytes of every packet to capture: the smaller it is, the fewer system resources are needed, but set it too low and we lose header data. Setting it lower than 74 bytes is a bad idea; I use 80 bytes here, but it is a good idea to set it to 100 bytes.

Now let's look at the TCP session details in Wireshark; if we don't see errors, we don't have drops - good (Analyze -> Expert Info).
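
The same check can be done from the command line, which is handy on a headless test box (the display filters are standard Wireshark ones; the capture file path matches the tcpdump command above):

tshark -r /tmp/as.cap -Y 'tcp.analysis.retransmission' | wc -l   # retransmitted segments
tshark -r /tmp/as.cap -Y 'tcp.analysis.duplicate_ack' | wc -l    # duplicate ACKs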


Now let's make the delay a little bit bigger and change it from 10 ms to 100 ms.



 
tc qdisc change dev eth0 root netem delay 100ms limit 10000

We have added an additional parameter, limit 10000, defining the queue length of netem (in packets); it is needed when the delay value or the packet rate is high (I should write an additional post about this -->)
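
A rough way to size that limit: the netem queue must hold everything "in flight" during the artificial delay, i.e. about rate * delay / packet size. A sketch assuming full-sized 1514-byte frames at 1 Gbit/s and the 100 ms delay used here:

# packets buffered by netem ~= (1,000,000,000 bit/s * 0.1 s) / (1514 byte * 8 bit)
echo "1000000000 / 10 / (1514 * 8)" | bc
# ~8256 packets, so the default limit of 1000 would drop packets; 10000 leaves headroom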

If we want to remove the tc configuration we can do it with:

tc qdisc del dev eth0 root  

More info on the netem module can be found here (I love this tool) -->

So here are the TCP throughput results with 100 ms delay - not good at all :) ~94 Mbit/s

------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.10 port 50200 connected with 192.168.1.100 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  5.75 MBytes  48.2 Mbits/sec
[  3]  1.0- 2.0 sec  11.8 MBytes  98.6 Mbits/sec
[  3]  2.0- 3.0 sec  12.5 MBytes   105 Mbits/sec
[  3]  3.0- 4.0 sec  12.4 MBytes   104 Mbits/sec
[  3]  4.0- 5.0 sec  12.4 MBytes   104 Mbits/sec
[  3]  5.0- 6.0 sec  10.8 MBytes  90.2 Mbits/sec
[  3]  6.0- 7.0 sec  12.8 MBytes   107 Mbits/sec
[  3]  7.0- 8.0 sec  11.5 MBytes  96.5 Mbits/sec
[  3]  8.0- 9.0 sec  11.2 MBytes  94.4 Mbits/sec
[  3]  9.0-10.0 sec  11.9 MBytes  99.6 Mbits/sec
[  3]  0.0-10.0 sec   113 MBytes  94.4 Mbits/sec


The packet trace (sniff) did not show any drops, so as before the impact on TCP performance/throughput comes only from the delay value (in my case).

To find the bottleneck we must go back to Wireshark (Statistics -> TCP StreamGraph -> Window Scaling Graph). It is important to select a packet with source IP 192.168.1.100:




 
In our case (not always) the bottleneck looks to be the window size: after about 1 s the window stops growing and stays constant until the end of the test, at a value of 3145728 bytes.
So let's check what performance we should get with such a window by calculating the BDP (more info -->):

Bandwidth = tcp_rwin / Delay = (3145728 bytes * 8) / 100 ms = 251658240 bit/s ~ 240 Mbit/s
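
The same calculation as a one-liner, so it is easy to redo for other window sizes or delays (window in bytes, RTT in seconds; the values are the ones from the graph):

# BDP-limited throughput in bit/s = window_bytes * 8 / RTT_seconds
echo "3145728 * 8 / 0.1" | bc
# 251658240 bit/s, i.e. roughly 240 Mbit/s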

But we get only 94 Mbit/s, so let's check the sending side tcp_wmem value.
TCP send and receive buffer sizes on my PC1:

# check the TCP send buffer sizes (min, default, max) used by the TCP stack
# in our case we must look at the last number (maximum buffer size in bytes)
cat /proc/sys/net/ipv4/tcp_wmem
4096    16384    4194304


# check the TCP receive buffer sizes (min, default, max) used by the TCP stack
# in our case we must look at the last number (maximum buffer size in bytes)
cat /proc/sys/net/ipv4/tcp_rmem
4096    87380    6291456


# maximum send socket buffer size an application may request with setsockopt(SO_SNDBUF), for all protocols; it does not cap TCP autotuning via tcp_wmem
cat /proc/sys/net/core/wmem_max
212992



# maximum receive socket buffer size an application may request with setsockopt(SO_RCVBUF), for all protocols; it does not cap TCP autotuning via tcp_rmem
cat /proc/sys/net/core/rmem_max
212992




Let's increase the tcp_wmem maximum to a higher value. Note that tcp_wmem takes three values (min, default, max), so we have to write all three; echoing a single number would only change the first (min) value:


echo "4096 16384 6291456" > /proc/sys/net/ipv4/tcp_wmem

So we get slightly better results:

------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.10 port 54642 connected with 192.168.1.100 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  6.88 MBytes  57.7 Mbits/sec
[  3]  1.0- 2.0 sec  19.9 MBytes   167 Mbits/sec
[  3]  2.0- 3.0 sec  17.9 MBytes   150 Mbits/sec
[  3]  3.0- 4.0 sec  18.5 MBytes   155 Mbits/sec
[  3]  4.0- 5.0 sec  17.1 MBytes   144 Mbits/sec
[  3]  5.0- 6.0 sec  18.1 MBytes   152 Mbits/sec
[  3]  6.0- 7.0 sec  16.4 MBytes   137 Mbits/sec
[  3]  7.0- 8.0 sec  19.5 MBytes   164 Mbits/sec
[  3]  8.0- 9.0 sec  17.8 MBytes   149 Mbits/sec
[  3]  9.0-10.0 sec  18.0 MBytes   151 Mbits/sec
[  3]  0.0-10.0 sec   170 MBytes   143 Mbits/sec





After increasing it further (up to 62914560, ~60 MB) we get 239 Mbit/s (close to our calculated value). Checking it in Wireshark (Statistics -> IO Graph) we see:



We use filters to separate the incoming and outgoing traffic, with a tick interval of 0.01 s:

Filter for TCP Data : ip.src==192.168.1.10
Filter for TCP ACK  : ip.src==192.168.1.100

The interesting part is the gap with no traffic, the idle time; this idle period is "eating our throughput". In our case it happens because of the small receiving side TCP window (TCP_RWIN). After increasing it to 12.5 MB (the value calculated with the BDP formula) we got only 456 Mbit/s, and only after increasing TCP_WMEM up to 24 MB did we get our maximum throughput; setting only the PC2 side to 24 MB did not give the same result, so where was I wrong?

The only logical answer is the system's own processing delay, which we forgot to include in our delay calculation (I still have to prove it).

Results:

1. TCP throughput depends on the TCP window/buffer sizes on both sides (tcp_wmem on the sender and tcp_rmem on the receiver).
2. The TCP window size calculated with the BDP formula is not always correct, because the equation uses only the network delay (RTT) value.
3. The best way to check for TCP throughput issues is Wireshark (one interesting thing we missed in the last graph is the TCP ACK rate; according to the RFC, a TCP ACK SHOULD be sent for at least every second full-sized data segment. But about that next time).
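
To make this kind of tuning survive a reboot, the usual place is /etc/sysctl.conf (the buffer values below are only the ones experimented with above - pick whatever matches your own BDP):

# example lines for /etc/sysctl.conf (illustrative values, ~24 MB maximum buffers)
net.ipv4.tcp_rmem = 4096 87380 25165824
net.ipv4.tcp_wmem = 4096 16384 25165824

# apply without rebooting
sysctl -p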