I have been working on this problem for over a week and a half, part time.
I have 10 SuperMicro systems spread over a wide geography. Four are new 2009-model 6025W or 6026T machines whose motherboard 1 GBit LAN controllers are the 82575EB and 82576 respectively, which use the igb driver; six are 2005 units with 1 GBit 82546EB LAN controllers using the e1000 driver. I checked the driver versions with ethtool -i eth0 and they are consistent with the current 2.6.32-22 kernel as of June 1, 2010.
Prior to the upgrade to 10.04 all units were on 9.10 and I did not see this problem.
I run 2 custom applications: one uses C library socket calls directly; the other uses gstreamer souphttpsrc, which ultimately uses C library socket calls. Both connect to the devices over HTTP. One application transfers 8*64k/sec, the other 2*16k/sec. Sites vary in device count; the largest has 24 devices, the smallest 9. When a socket disconnect occurs, the application layer pops a timer and then reconnects. The timeouts are random across all devices, and with 2 applications talking to each device for different purposes, the timeouts occur in both applications but never at the same time, unless a device is restarted, which is an understandable reason for a timeout. This smells like something gets lost and is never regenerated by the protocol.
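Since the two applications never time out on the same device at the same time, one way to quantify that is to log every reconnect with a timestamp and tally them across both logs. A minimal sketch, assuming each application appends a line ending in the device name (the log file names and the "RECONNECT <device>" line format here are hypothetical, not the real application output):

```shell
# Tally reconnect events per device across both application logs so the
# worst offenders stand out and coincident failures can be spotted.
reconnects_by_device() {
    grep -h 'RECONNECT' "$@" |    # merge matching lines from all logs
        awk '{print $NF}' |       # keep just the device name (last field)
        sort | uniq -c | sort -rn # count per device, worst first
}

# Usage: reconnects_by_device app1.log app2.log
```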
On all but one of the 64-bit machines I see from 0 to 20 disconnects per day. All devices are 100BaseT. One site has Netgear 724 1000BaseT switches; all others are 100BaseT switches. One 64-bit site has no disconnects at all, even though all machines were installed the same way using the same method. The 32-bit machines typically have fewer devices, but they have been running continuously for 2 weeks with both applications clean - no timeouts and no disconnects.
If I run ethtool -S I see no errors at the board level.
If I run netstat -s I see "packets collapsed in receive queue due to low socket buffer". Here is sample output from the worst offender:
Ip:
    60980702 total packets received
    0 forwarded
    0 incoming packets discarded
    60980702 incoming packets delivered
    36474830 requests sent out
Icmp:
    171 ICMP messages received
    67 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 125
        echo requests: 6
        echo replies: 40
    186 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 126
        echo request: 54
        echo replies: 6
IcmpMsg:
        InType0: 40
        InType3: 125
        InType8: 6
        OutType0: 6
        OutType3: 126
        OutType8: 54
Tcp:
    124 active connections openings
    2 passive connection openings
    39 failed connection attempts
    0 connection resets received
    35 connections established
    60978538 segments received
    36473062 segments send out
    66 segments retransmited
    0 bad segments received.
    3 resets sent
Udp:
    1504 packets received
    1 packets to unknown port received.
    0 packet receive errors
    1513 packets sent
UdpLite:
TcpExt:
    4 packets pruned from receive queue because of socket buffer overrun
    8 TCP sockets finished time wait in fast timer
    1 time wait sockets recycled by time stamp
    297353 delayed acks sent
    34 delayed acks further delayed because of locked socket
    Quick ack mode was activated 2 times
    43 packets directly queued to recvmsg prequeue.
    38 bytes directly received in process context from prequeue
    59068530 packet headers predicted
    2 packets header predicted and directly queued to user
    160 acknowledgments not containing data payload received
    263 predicted acknowledgments
    33 other TCP timeouts
    5642 packets collapsed in receive queue due to low socket buffer
    3 DSACKs sent for old packets
IpExt:
    InMcastPkts: 421
    OutMcastPkts: 24
    InBcastPkts: 348
    OutBcastPkts: 260
    InOctets: -812390296
    OutOctets: 1900581807
    InMcastOctets: 15834
    OutMcastOctets: 3146
    InBcastOctets: 36591
    OutBcastOctets: 19760
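The two counters worth watching over time in that dump are the "pruned" and "collapsed" lines; if they keep climbing after tuning, the receive buffers are still too small for the bursts. A small helper to pull just those out of netstat -s (run it periodically and diff the numbers):

```shell
# Filter a `netstat -s` dump down to the receive-buffer pressure counters.
buffer_pressure() {
    grep -E 'pruned from receive queue|collapsed in receive queue'
}

# On the live system:
#   netstat -s | buffer_pressure
```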
I did edit /etc/sysctl.conf as recommended by http://www.acc.umu.se/~maswan/linux-netperf.txt
I also found a reference for setting these parameters in the kernel source docs: the ixgb.txt document in kernel-source/Documentation/networking, which is for a 10 GBit controller and still had socket close problems. The recommendations in these articles amount to increased buffer space for socket connections at the IP layer.
########
# Added to tune networking memory parameters
net/core/rmem_max = 8738000
net/core/wmem_max = 6553600
net/ipv4/tcp_rmem = 8192 873800 8738000
net/ipv4/tcp_wmem = 4096 655360 6553600
vm/min_free_kbytes = 65536
After reboot I verified the values took, but I still get the problem. I have not run long enough yet to determine whether the problem frequency has been reduced.
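For reference, the slash-form keys used in sysctl.conf map directly onto /proc/sys, so the running values can be double-checked after a reboot without any extra tools:

```shell
# Print the current values of the tuned keys straight from /proc/sys;
# they should match what /etc/sysctl.conf sets.
for key in net/core/rmem_max net/core/wmem_max \
           net/ipv4/tcp_rmem net/ipv4/tcp_wmem; do
    printf '%s = %s\n' "$key" "$(cat /proc/sys/$key)"
done
```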
Sorry for the long background; I have, I think, been thorough, so a lot of information has been gathered.
Questions:
1) Is it possible to run the e1000 driver on the 82575EB / 82576 LAN interfaces instead of the igb driver? It may be more mature.
2) I went to the Intel site and pulled down the sources to build igb 2.2.9 (the 10.04 version is 2.1.0). The sudo make install compiled and pushed igb.ko into the correct /lib/modules/... directory, but the reboot still loads the 2.1.0 driver. I see there are instructions to run rmmod igb and then insmod igb or modprobe igb, so does the driver get saved somewhere else that gets loaded first? I am also having trouble with IPMI on this machine, and since it is 3000 miles away I can't risk rmmod unless IPMI works so I can still see the machine through the remote IPMI console - but X after login gives me a blank screen on this machine.
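On the "saved somewhere else" point, a sketch of the checks worth running (hedged suggestion: on Ubuntu the initramfs carries its own copy of the network drivers, so a rebuilt igb.ko in /lib/modules alone may not be the one loaded at boot - rebuilding the initramfs is my guess at the cause, not a confirmed fix):

```shell
# Refresh the module dependency maps, see which igb.ko modprobe actually
# resolves and its version, then rebuild the initramfs so boot uses it too.
sudo depmod -a
modinfo -n igb            # path of the module modprobe will load
modinfo -F version igb    # its version string
sudo update-initramfs -u  # Ubuntu loads network drivers from the initramfs
# In-place reload (risky on a remote machine if the link drops):
#   sudo rmmod igb && sudo modprobe igb
```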
3) Any suggestions, if you got this far reading my post?
Thanks - I have read the forum often and found several solutions to situations that have come up. This is my first post.
Ron