
Thread: 10.04 64bit TCP connections close

  1. #1

    10.04 64bit TCP connections close

    I have been working on this problem part-time for over 1.5 weeks.

    I have 10 SuperMicro systems spread over a wide geography. Four are newer 2009 6025W or 6026T machines whose onboard 1 Gbit LAN controllers are the 82575EB and 82576 respectively, both using the igb driver; the other six are 2005-era units with 1 Gbit 82546EB LAN controllers using the e1000 driver. I checked the driver versions with ethtool -i eth0 and they are consistent with the current 2.6.32-22 kernel as of June 1, 2010.

    Prior to the upgrade to 10.04 all units were on 9.10 and I did not see this problem.

    I run 2 custom applications: one uses C library socket calls directly, and the other uses gstreamer souphttpsrc, which ultimately uses C library socket calls as well. Both connect to the devices over HTTP. One application transfers 8*64k/sec, the other 2*16k/sec. Sites vary in device count; the largest has 24 devices, the smallest 9. When a socket disconnect occurs, the application layer starts a timer and then reconnects. The timeouts are random across all devices, and with 2 applications talking to each device for different purposes, the timeouts occur in both applications but never at the same time - unless a device is restarted, which is an understandable reason for a timeout. This smells like something gets lost and never gets regenerated by the protocol.

    On all but one of the 64 bit machines I see from 0 to 20 disconnects per day. All devices are 100BaseT. One site has Netgear 724 1000BaseT switches; all the others have 100BaseT switches. One 64 bit site does not have disconnects at all, even though every machine was installed the same way using the same method. The 32 bit machines typically have fewer devices, but they have been running continuously for 2 weeks with both applications clean - no timeouts or disconnects.

    If I run ethtool -S I see no errors at the board level.

    If I run netstat -s I see "packets collapsed in receive queue due to low socket buffer". Here is a sample output from the worst offender:

    Ip:
    60980702 total packets received
    0 forwarded
    0 incoming packets discarded
    60980702 incoming packets delivered
    36474830 requests sent out
    Icmp:
    171 ICMP messages received
    67 input ICMP message failed.
    ICMP input histogram:
    destination unreachable: 125
    echo requests: 6
    echo replies: 40
    186 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
    destination unreachable: 126
    echo request: 54
    echo replies: 6
    IcmpMsg:
    InType0: 40
    InType3: 125
    InType8: 6
    OutType0: 6
    OutType3: 126
    OutType8: 54
    Tcp:
    124 active connections openings
    2 passive connection openings
    39 failed connection attempts
    0 connection resets received
    35 connections established
    60978538 segments received
    36473062 segments send out
    66 segments retransmited
    0 bad segments received.
    3 resets sent
    Udp:
    1504 packets received
    1 packets to unknown port received.
    0 packet receive errors
    1513 packets sent
    UdpLite:
    TcpExt:
    4 packets pruned from receive queue because of socket buffer overrun
    8 TCP sockets finished time wait in fast timer
    1 time wait sockets recycled by time stamp
    297353 delayed acks sent
    34 delayed acks further delayed because of locked socket
    Quick ack mode was activated 2 times
    43 packets directly queued to recvmsg prequeue.
    38 bytes directly received in process context from prequeue
    59068530 packet headers predicted
    2 packets header predicted and directly queued to user
    160 acknowledgments not containing data payload received
    263 predicted acknowledgments
    33 other TCP timeouts
    5642 packets collapsed in receive queue due to low socket buffer
    3 DSACKs sent for old packets
    IpExt:
    InMcastPkts: 421
    OutMcastPkts: 24
    InBcastPkts: 348
    OutBcastPkts: 260
    InOctets: -812390296
    OutOctets: 1900581807
    InMcastOctets: 15834
    OutMcastOctets: 3146
    InBcastOctets: 36591
    OutBcastOctets: 19760
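
    The counters I am watching over time are the "pruned", "collapsed", and retransmit lines. A quick way to pull just those out of the full netstat -s dump is something like:

    netstat -s | grep -i -E 'pruned|collapsed|retransmit'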


    I edited /etc/sysctl.conf as recommended by http://www.acc.umu.se/~maswan/linux-netperf.txt
    I also found a reference for setting these parameters in the kernel source docs, in the ixgb.txt document under kernel-source/Documentation/networking (that one is for a 10 Gbit controller), and I still had socket close problems after applying it. The recommendations in these articles boil down to giving socket connections more buffer space at the IP layer:

    ########
    # Added to tune networking memory parameters
    net/core/rmem_max = 8738000
    net/core/wmem_max = 6553600

    net/ipv4/tcp_rmem = 8192 873800 8738000
    net/ipv4/tcp_wmem = 4096 655360 6553600

    vm/min_free_kbytes = 65536

    After a reboot I verified that the values took effect, but I still get the problem. I have not run long enough yet to determine whether the problem frequency has been reduced.
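
    For anyone who wants to double-check values like these after a reboot, they can be read back directly with sysctl, e.g.:

    sysctl net.core.rmem_max net.core.wmem_max
    sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
    sysctl vm.min_free_kbytes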


    Sorry for the long background; I have tried to be thorough, so there is a lot of information here.

    Questions:
    1) Is it possible to run the e1000 driver on the 82575EB / 82576 LAN interfaces instead of the igb driver? It may be more mature.

    2) I went to the Intel site and pulled down the sources to build igb 2.2.9 (the 10.04 version is 2.1.0). The sudo make install compiled and pushed igb.ko into the correct /lib/modules/... directory, but after a reboot the 2.1.0 driver still loads. I see there are instructions to run rmmod igb and then insmod igb or modprobe igb - does the driver get cached somewhere else that gets used at boot? (See the commands after these questions for what I plan to try.) I am also having trouble with the IPMI on this machine, and since it is 3000 miles away I can't risk an rmmod unless IPMI works so I can still reach the machine through the remote IPMI console - but X after login gives me a blank screen on this machine.

    3) Any suggestions, if you got this far reading my post?
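
    For question 2, my current guess (and it is only a guess) is that the stock 2.1.0 igb module is also packed into the initramfs, so it gets loaded at boot before my 2.2.9 build is ever considered. What I plan to try once I have safe remote access is roughly:

    sudo depmod -a              # rebuild the module dependency lists
    sudo update-initramfs -u    # regenerate the initramfs so it contains the new igb.ko
    modinfo igb | grep version  # confirm which igb version modprobe will now pick up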

    Thanks - I have read the forum often and found several solutions to situations that have come up; this is my first post.

    Ron

  2. #2

    Re: 10.04 64bit TCP connections close

    I believe I have found a solution for this problem. The worst machine has now run for over 20 hours without a disconnect. One of the processes writes the data to disk for each device - about 1 GByte/hour per device, with hourly roll-over to a new file. After each file is closed, some post-processing is done which writes a new file of about the same size, and that creates a high I/O load on the disk. The system then goes into I/O-wait freezes, the socket input buffers fill up, and things start to break down intermittently. I found this bug listed on the kernel.org site:

    https://bugzilla.kernel.org/show_bug.cgi?id=12309

    Comment 390 from Perlover said that using the command

    # echo deadline > /sys/block/sda/queue/scheduler

    was a cure.

    In my case the 6 TByte HW RAID is built on /dev/sdc, so that device was switched to the deadline scheduler. After that the problem is gone from all of the affected systems. The 64 bit aspect is probably a red herring - the 32 bit machines simply have less work to do, since they have fewer devices to service.
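
    One caveat for anyone else applying this: the echo only lasts until the next reboot. One simple way to make it stick is to add the same command to /etc/rc.local, above the exit 0 line, with the device name adjusted for your machine:

    echo deadline > /sys/block/sdc/queue/scheduler

    The elevator=deadline kernel boot parameter is another option, but that changes the default scheduler for every block device instead of just the RAID volume.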


    Ron



