Saturday, 11 July 2015

TCP Bottleneck (Increasing TCP performance on Linux) 1

The default system settings are usually good and work fine in most cases, but default settings are boring :) We all want more power, faster performance, less latency and so on. Increasing TCP performance in Linux is easy, but finding the bottleneck in your system on your own is more interesting :)

What we need:
OS - Debian (Jessie/Wheezy) - because Debian ROCKS...
Hardware - 2 PCs, an Ethernet cable
Tools: tc (Traffic Control), iperf, tcpdump, Wireshark, ethtool

     PC1                    1Gbit/s                  PC2
|192.168.1.10|------>------Ethernet------>------|192.168.1.100|
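Before we start, it is worth checking that both NICs actually negotiated 1 Gbit/s full duplex; a quick look with ethtool (assuming the interface is eth0, as used in the rest of this post):

# check negotiated link speed and duplex on the test interface
ethtool eth0 | grep -i -E "speed|duplex"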

How we test:

First of all we need to know the maximum performance of our system setup without any changes - the default configuration (hard for my systems :)

Data sending from PC1 --> PC2 :

     PC1                    1Gbit/s                  PC2
|192.168.1.10|------>------Ethernet------>------|192.168.1.100|

iperf -c 192.168.1.100 -i 1                  iperf -s -i 1

------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.10 port 49372 connected with 192.168.1.100 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   110 MBytes   921 Mbits/sec
[  3]  1.0- 2.0 sec   109 MBytes   915 Mbits/sec
[  3]  2.0- 3.0 sec   111 MBytes   930 Mbits/sec
[  3]  3.0- 4.0 sec   111 MBytes   928 Mbits/sec
[  3]  4.0- 5.0 sec   111 MBytes   934 Mbits/sec
[  3]  5.0- 6.0 sec   110 MBytes   924 Mbits/sec
[  3]  6.0- 7.0 sec   111 MBytes   931 Mbits/sec
[  3]  7.0- 8.0 sec   111 MBytes   931 Mbits/sec
[  3]  8.0- 9.0 sec   111 MBytes   933 Mbits/sec
[  3]  9.0-10.0 sec   110 MBytes   926 Mbits/sec
[  3]  0.0-10.0 sec  1.08 GBytes   927 Mbits/sec


Good, we get ~1 Gbit/s (the difference goes to the Ethernet, IP and TCP headers). Note that the value (throughput) can vary depending on the system and setup.
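For reference, ~927 Mbit/s is very close to the theoretical TCP goodput on Gigabit Ethernet. A rough calculation with a 1500 byte MTU (assuming TCP timestamps are enabled, i.e. 12 bytes of TCP options):

TCP payload per frame = 1500 - 20 (IP) - 20 (TCP) - 12 (timestamps) = 1448 bytes
Bytes on the wire per frame = 1500 + 14 (Ethernet) + 4 (FCS) + 8 (preamble) + 12 (inter-frame gap) = 1538 bytes
Max goodput ~ 1448 / 1538 * 1000 Mbit/s ~ 941 Mbit/s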
Now let's make conditions a little bit worse and delay the traffic going from PC1 to PC2 by 10 ms with the tc command (a daemon of the ancient world :)
And the manual to rule it -->

tc qdisc add dev eth0 root netem delay 10ms
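Before rerunning iperf, a quick way to confirm the delay is really in place (assuming ICMP is not blocked) is to check the qdisc and ping PC2 from PC1 - the RTT should jump to roughly 10 ms:

# show the qdisc that is now attached to eth0
tc qdisc show dev eth0
# the round-trip time should now be ~10 ms instead of the usual sub-millisecond LAN value
ping -c 5 192.168.1.100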

and rerun the test:

------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.10 port 49425 connected with 192.168.1.100 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  84.9 MBytes   712 Mbits/sec
[  3]  1.0- 2.0 sec  91.2 MBytes   765 Mbits/sec
[  3]  2.0- 3.0 sec  90.9 MBytes   762 Mbits/sec
[  3]  3.0- 4.0 sec  91.2 MBytes   765 Mbits/sec
[  3]  4.0- 5.0 sec  90.5 MBytes   759 Mbits/sec
[  3]  5.0- 6.0 sec  91.0 MBytes   763 Mbits/sec
[  3]  6.0- 7.0 sec  90.9 MBytes   762 Mbits/sec
[  3]  7.0- 8.0 sec  91.2 MBytes   765 Mbits/sec
[  3]  8.0- 9.0 sec  90.4 MBytes   758 Mbits/sec
[  3]  9.0-10.0 sec  91.8 MBytes   770 Mbits/sec
[  3]  0.0-10.0 sec   904 MBytes   758 Mbits/sec


Now we only get an average of 758 Mbit/s, not so bad considering that we increased the latency from ~1 ms to 10 ms. When making such changes it is very important to check that the performance degradation was caused only by the delay and not by system instability (system load and other factors).
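A simple sanity check for system load is to watch both PCs while iperf is running, for example:

# CPU usage, interrupts and context switches once per second during the test
vmstat 1
# top also shows softirq time (si) in the %Cpu(s) line, where much of the network processing is accounted
top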



First we check tc for information about the queuing (it does all the hard work of delaying our traffic):



 tc -s qdisc show
qdisc netem 8001: dev eth0 root refcnt 2 limit 1000 delay 10.0ms
 Sent 991282850 bytes 655137 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0 



No drops, overlimits or requeues - good, so delaying the traffic has not dropped any packets. The second (and I would say more important) check is to see whether any TCP packets were dropped, by sniffing the traffic.
We can do it on either system (PC1 or PC2); for now it does not matter which, since we only want to see whether we had any packet drops.

tcpdump -i eth0 -s 80 -w /tmp/as.cap
 

-i selects the interface, -s the packet size (capture 80 bytes of every packet) and -w writes the capture to a file.
The -s (snaplen) option specifies how many bytes of every packet to capture; the smaller it is, the fewer system resources are needed, but if it is set too low we end up losing header data, so setting it below 74 bytes is a bad idea. I use 80 bytes here, but setting it to around 100 bytes is a good idea.

Now let's look at the TCP session details in Wireshark; if we don't see errors, we don't have drops - good (Analyze -> Expert Info).
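If you prefer the command line, the same check can be done with tshark (the Wireshark CLI); note that the -Y display-filter option needs a reasonably recent Wireshark version:

# count retransmitted segments and zero-window events in the capture
tshark -r /tmp/as.cap -Y "tcp.analysis.retransmission" | wc -l
tshark -r /tmp/as.cap -Y "tcp.analysis.zero_window" | wc -l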


Now let's make the delay a little bit bigger and change it from 10 ms to 100 ms:



 
tc qdisc change dev eth0 root netem delay 100ms limit 10000

We have added an additional value, limit 10000, which defines the queue length of netem; it is needed when the delay value or packet rate is high (I should write a separate post about this -->)
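A rough way to size the limit: netem has to buffer every packet that is "in flight" inside the artificial delay, so at full line rate:

packets to hold = rate * delay / packet size = (1 Gbit/s / 8) * 0.1 s / 1514 bytes ~ 8300 packets

which is well above the default limit of 1000, so without raising it netem itself would start dropping packets.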

If we want to delete the tc we can make it with:

tc qdisc del dev eth0 root  

More info on the netem module can be found here (I love this tool) -->

So here are the TCP throughput results with 100 ms - not good at all :) ~94 Mbit/s:

------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.10 port 50200 connected with 192.168.1.100 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  5.75 MBytes  48.2 Mbits/sec
[  3]  1.0- 2.0 sec  11.8 MBytes  98.6 Mbits/sec
[  3]  2.0- 3.0 sec  12.5 MBytes   105 Mbits/sec
[  3]  3.0- 4.0 sec  12.4 MBytes   104 Mbits/sec
[  3]  4.0- 5.0 sec  12.4 MBytes   104 Mbits/sec
[  3]  5.0- 6.0 sec  10.8 MBytes  90.2 Mbits/sec
[  3]  6.0- 7.0 sec  12.8 MBytes   107 Mbits/sec
[  3]  7.0- 8.0 sec  11.5 MBytes  96.5 Mbits/sec
[  3]  8.0- 9.0 sec  11.2 MBytes  94.4 Mbits/sec
[  3]  9.0-10.0 sec  11.9 MBytes  99.6 Mbits/sec
[  3]  0.0-10.0 sec   113 MBytes  94.4 Mbits/sec


The packet trace (sniff) did not show any drops, so as before the impact on TCP performance/throughput comes only from the delay value (in my case).

To find the bottleneck we must go back to Wireshark (Statistics -> TCP StreamGraph -> Window Scaling Graph). It is important to select a packet with source IP 192.168.1.100:




 
In our case (not always) the bottleneck looks to be the window size: after about 1 s the window size stops growing and stays constant until the end of the test. The value is 3145728 bytes.
So let's check what performance I should get with such a window by calculating the BDP (more info -->):

Bandwidth = tcp_rwin / Delay = (3145728 bytes * 8) / 100 ms = 251658240 bit/s ~ 240 Mbit/s
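Turning the same formula around, to fill the whole 1 Gbit/s link at 100 ms we would need a window of about:

tcp_window = Bandwidth * Delay = 1 Gbit/s * 100 ms / 8 = 12500000 bytes ~ 12.5 MBytes

(this value will come up again below).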

But we only get 94 Mbit/s, so let's check the sending side's tcp_wmem value.
TCP send and receive buffer sizes on my PC1:

#check the TCP send buffer sizes (min, default, max) used by the TCP protocol
# In our case we must look at the last number (the maximum buffer size in bytes)
cat /proc/sys/net/ipv4/tcp_wmem
4096    16384    4194304


#check the TCP receive buffer sizes (min, default, max) used by the TCP protocol
# In our case we must look at the last number (the maximum buffer size in bytes)
cat /proc/sys/net/ipv4/tcp_rmem
4096    87380    6291456


# the maximum OS send socket buffer size for all connections and protocols; it caps buffers that applications request explicitly (SO_SNDBUF), while TCP autotuning is governed by tcp_wmem
cat /proc/sys/net/core/wmem_max
212992



# the maximum OS receive socket buffer size for all connections and protocols; it caps buffers that applications request explicitly (SO_RCVBUF), while TCP autotuning is governed by tcp_rmem
cat /proc/sys/net/core/rmem_max
212992
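The same values can also be read (and later changed) with sysctl instead of poking /proc directly:

# print all four settings in one go
sysctl net.ipv4.tcp_wmem net.ipv4.tcp_rmem net.core.wmem_max net.core.rmem_max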




Let's increase the tcp_wmem maximum to a higher value:


echo "4096 16384 6291456" > /proc/sys/net/ipv4/tcp_wmem
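The same change can be made with sysctl (and put into /etc/sysctl.conf if it should survive a reboot):

# tcp_wmem takes three values (min default max); raise only the max
sysctl -w net.ipv4.tcp_wmem="4096 16384 6291456"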

So we get slightly better results:

------------------------------------------------------------
Client connecting to 192.168.1.100, TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.10 port 54642 connected with 192.168.1.100 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  6.88 MBytes  57.7 Mbits/sec
[  3]  1.0- 2.0 sec  19.9 MBytes   167 Mbits/sec
[  3]  2.0- 3.0 sec  17.9 MBytes   150 Mbits/sec
[  3]  3.0- 4.0 sec  18.5 MBytes   155 Mbits/sec
[  3]  4.0- 5.0 sec  17.1 MBytes   144 Mbits/sec
[  3]  5.0- 6.0 sec  18.1 MBytes   152 Mbits/sec
[  3]  6.0- 7.0 sec  16.4 MBytes   137 Mbits/sec
[  3]  7.0- 8.0 sec  19.5 MBytes   164 Mbits/sec
[  3]  8.0- 9.0 sec  17.8 MBytes   149 Mbits/sec
[  3]  9.0-10.0 sec  18.0 MBytes   151 Mbits/sec
[  3]  0.0-10.0 sec   170 MBytes   143 Mbits/sec





After increasing it further (up to 62914560 ~ 60 MB) we get 239 Mbit/s, close to our calculated value. Checking it in Wireshark (Statistics -> IO Graph) we see:



We use one filter per traffic direction (data and ACKs) and a tick interval of 0.01 s:

Filter for TCP Data : ip.src==192.168.1.10
Filter for TCP ACK  : ip.src==192.168.1.100
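The same per-interval statistics can be pulled from the capture on the command line (again assuming a reasonably recent tshark):

# bytes/packets per 0.01 s for data (from PC1) and ACKs (from PC2)
tshark -r /tmp/as.cap -q -z io,stat,0.01,ip.src==192.168.1.10,ip.src==192.168.1.100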

The interesting part is the gap of no traffic, the idle time. This idle period is "eating" our throughput; in our case it happens because of the small receiving-side TCP window, or TCP_RWIN. After increasing it to 12.5 MBytes (the value calculated with the BDP formula) we get only 456 Mbit/s, and only after increasing it up to 24 MB did we get our maximum throughput; setting the sending side to 24 MB did not give the same results, so where was I wrong?

The only logical answer is additional system delay, which we forgot to include in our delay calculation (I still have to prove it).

Results:

1. TCP throughput depends on the sending and receiving side TCP window sizes (tcp_wmem on the sender and tcp_rmem on the receiver).
2. The TCP window size calculated with the BDP formula is not always correct, because the equation uses only the network delay (RTT).
3. The best way to check for TCP throughput issues is Wireshark (one interesting thing we missed in the last graph is the TCP ACK rate: according to the RFC, a TCP ACK SHOULD be sent after at least every second full-sized data segment. But about that next time).






