
    JGroups specific

     

    If you do a Channel.connect("DefaultPartition-sessioncache"), which looks innocent enough, you add 29 + 2 == 31 bytes to each JGroups message !

     

    I'll investigate what I can do with canonicalization in https://jira.jboss.org/jira/browse/JGRP-872, to transparently replace cluster names with shorts, but until then the best approach is to shorten those long names.
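
    Below is a minimal sketch (assuming JGroups 2.x, its JChannel API and a udp.xml stack configuration on the classpath; the class name and short cluster name are purely illustrative) of connecting with a short cluster name instead of a long one:

    import org.jgroups.JChannel;

    public class ShortClusterName {
        public static void main(String[] args) throws Exception {
            JChannel ch = new JChannel("udp.xml"); // assumed stack config on the classpath
            // The cluster name travels with every message, so "sc" (2 bytes)
            // is much cheaper than "DefaultPartition-sessioncache" (29 bytes)
            ch.connect("sc");
            // ... send and receive messages ...
            ch.close();
        }
    }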

     

    This is not a bug, but it may affect performance negatively. Note that I'm talking about high-end perf scenarios; if you only send a couple of hundred messages per second, you don't need to be concerned about this !

    Ethernet flow control (IEEE 802.3x)

     

    JGroups performs flow control so that fast senders don't overwhelm slow receivers, fill up their input buffers and cause packet loss.

     

    However, some network interface cards (NICs) and switches perform ethernet flow control (IEEE 802.3x), which applies back pressure to senders when a receiver's buffers start to fill up, in order to avoid packet loss.

     

    This (flow control) feature is not good for TCP, as TCP already has flow control built in, and so 802.3x duplicates that. In addition, TCP relies on packet loss to throttle the sender (slow start, congestion avoidance and recovery), so with 802.3x enabled, TCP won't be able to find the optimal congestion window size (cwnd) quickly.

     

    With UDP (IP multicast or regular datagram packets), however, ethernet flow control is usually a good thing, as UDP doesn't provide flow control itself. I therefore recommend turning ethernet flow control on.

     

    For managed switches, this can usually be done via a web or telnet/ssh interface. For unmanaged switches, unfortunately, the only option is to hope that 802.3x is not implemented, or to replace the switch.

     

    The NICs usually provide a means to configure this option; on Linux this can be done with ethtool, e.g. to turn flow control on for eth0:

     

    /sbin/ethtool -A eth0 tx on rx on

     

    /sbin/ethtool -a eth0 can be used to verify ethernet flow control is on.

     

     

    I ran 4 JGroups processes, each on a different host (connected to the same 1GBit switch), each sending 1'000'000 1K messages with tcp.xml and udp.xml. With 802.3x disabled (autoneg=on, rx=off, tx=off), the results were (on 2.7 CVS head, Oct 9 2008):

     

    • UDP: 34'849 messages/sec/instance

    • TCP: 63'312 messages/sec/instance

     

    With 802.3x enabled (autoneg=on, rx=on, tx=on):

     

    • UDP: 126'463 messages/sec/instance !

    • TCP: 83'829 messages/sec/instance

     

    This means that by simply turning ethernet flow control on, TCP's throughput increases by roughly a third, but UDP's throughput increases almost by a factor of 4 !

     

    Latest update (Oct 28 2008): I re-ran the above test with udp.xml and 4 nodes and got 136'000 messages/sec/instance !

     

    Increasing UDP receive buffers

     

    Packet loss is usually caused by receivers not having enough space to buffer all incoming packets, so the kernel starts dropping them. If the sender's send rate is permanently higher than the rate at which the receiver(s) can process packets, the UDP receive buffers will fill up and packets will get discarded no matter what, leading to costly retransmissions. For temporary packet spikes, however, we can increase the receiver's UDP input buffers by calling Datagram/MulticastSocket.setReceiveBufferSize().

     

    Say we're sending 1'000 1'000-byte packets to a receiver. The default receive buffer is 65'000 bytes, which is not enough to hold all 1'000'000 bytes sent by the sender if the receiver cannot process the packets (and thus free buffer space) fast enough.

     

    So the easiest solution is to increase the receiver's buffer space. However, the kernel has a limit on the maximum size: e.g. on Linux 2.6.21, sysctl net.core.rmem_max shows that this is around 130'000 bytes. So if we try to set the receive buffer to 1'000'000 bytes, the kernel will silently cap it at 130'000 bytes !
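
    Here's a minimal sketch (plain java.net API; the port number is just an example) that requests a 1 MB receive buffer and reads back what the kernel actually granted, which makes the rmem_max cap visible:

    import java.net.MulticastSocket;

    public class RecvBufCheck {
        public static void main(String[] args) throws Exception {
            MulticastSocket sock = new MulticastSocket(7600); // example port
            sock.setReceiveBufferSize(1000000); // ask for ~1 MB
            // The kernel silently caps the value at net.core.rmem_max,
            // so read it back to see the effective size:
            System.out.println("effective receive buffer: " + sock.getReceiveBufferSize() + " bytes");
            sock.close();
        }
    }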

     

     

    If we want all 1'000 packets to get buffered, we can increase the max size:

     

    sysctl -w net.core.rmem_max=5000000

     

    This increases the max size to 5MB, so we can easily set a buffer space of 1MB now.

     

    So if your application sends a lot of data via UDP, then increasing the receive buffer sizes is a good tool to reduce packet loss and costly retransmission, speeding up a JGroups application.

     

    Update: in Linux kernel 2.6.25, net.ipv4.udp_mem, net.ipv4.udp_rmem_min and net.ipv4.udp_wmem_min were added to tune the UDP buffer sizes.

     

     

    Using jumbo frames

     

    The default MTU is usually 1'500 bytes; with jumbo frames this can be increased to (say) 9'000 bytes. Jumbo frames have to be enabled on all of the NICs attached to the gigabit switch and on the switch itself. On Linux, we can set a frame size of 9'000 bytes with ifconfig:

     

    ifconfig eth0 mtu 9000

     

    Using jumbo frames means that we send much larger IP packets, so the ratio of payload to ethernet, IP and UDP headers improves: with an MTU of 1'500 bytes, the roughly 42 bytes of headers make up about 3% of every packet, whereas with 9'000 bytes they drop to about 0.5%. We therefore spend more of the bandwidth on actual data, increasing throughput.

     

    Increasing the NIC's input buffer

     

    The length of the NIC's device queues can be increased on Linux with ifconfig; e.g. to set eth0's transmit queue length to 5'000 packets:

     

    /sbin/ifconfig eth0 txqueuelen 5000

     

     

    IP Bonding

    Bonding can be used to create a virtual NIC that runs over multiple physical NICs. Packets sent out over the virtual NIC can then be load balanced across the physical NICs. This should increase performance, as we can theoretically double or triple the available bandwidth. This has to be confirmed in the lab, though...

     

    More information on IP bonding can be found at:

     

    JVM performance tuning links