Red Hat Bugzilla – Bug 572668
Potential shadow/schedd protocol error
Last modified: 2018-10-27 10:20:00 EDT
Description of problem:

While analyzing condor_schedd responsiveness we noticed that when looking at "netstat -ap| grep schedd" the UDP command queue for the schedd had backed up to 3 million packets and we were receiving many, many errors per second in the SchedLog that look like the following:

03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.

We captured SchedLog, ShadowLog, and a packet capture using

tshark -R udp.port==(UDP port backlogged) -i lo -w /tmp/stuff.cap

I'll upload that info to dropbox.redhat.com given this IT.

1819c32a28ef5a20ccb018239f325207 IT596093-logs.tar.bz2

This was with condor-7.4.3-0.3.el5

------------------------------------

FWIW, after a night's run: (on gemco)

netstat -apn | grep -i schedd
udp 67106984 0 0.0.0.0:41800 0.0.0.0:* 15551/condor_schedd

That's 67 million queued UDP packets. We've clearly fallen off the cliff here.

I'm assuming that it might be desirable to inspect that packet queue at a system level so that you can more readily identify a packet with corrupted protocol. That seems like a job for the SystemTap experts. We're on RHEL 5.4 here, so if someone can sling a tap script to dump out the packets in queue to help the debugging here it might be worthwhile.
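For anyone trying to quantify the symptoms above, the following is a minimal Python sketch rather than anything shipped with Condor. It tallies the quoted SchedLog error by timestamp and message kind, and pulls the Recv-Q value for condor_schedd UDP sockets out of "netstat -apn" style output. The function names (tally_udp_errors, schedd_recv_q) are hypothetical, and the patterns assume the exact log and netstat layouts quoted in this report. Note that netstat(8) documents Recv-Q as a count of bytes, so the figure may be bytes rather than packets; the sketch returns the raw number without interpreting it.

```python
import re
from collections import Counter

# Matches the SchedLog error quoted in this report. The timestamp and pid
# layout are assumed from the sample lines; other log formats are not handled.
ERROR_RE = re.compile(
    r"(?P<ts>\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(?P<pid>\d+)\) "
    r"ERROR: receiving new UDP message but found a "
    r"(?P<kind>long|short) message still waiting to be closed"
)


def tally_udp_errors(log_text):
    """Count matching errors per (timestamp, kind) pair.

    finditer is used so the tally works whether or not the log text
    still has one message per line.
    """
    counts = Counter()
    for match in ERROR_RE.finditer(log_text):
        counts[(match.group("ts"), match.group("kind"))] += 1
    return counts


def schedd_recv_q(netstat_text):
    """Return {local_address: recv_q} for UDP sockets owned by condor_schedd.

    Assumes `netstat -apn` rows of the form:
    proto recv-q send-q local-address foreign-address pid/program
    """
    queues = {}
    for line in netstat_text.splitlines():
        fields = line.split()
        if (
            len(fields) >= 6
            and fields[0] == "udp"
            and fields[-1].endswith("/condor_schedd")
        ):
            queues[fields[3]] = int(fields[1])
    return queues


if __name__ == "__main__":
    sample_netstat = "udp 67106984 0 0.0.0.0:41800 0.0.0.0:* 15551/condor_schedd"
    for address, recv_q in schedd_recv_q(sample_netstat).items():
        print(address, recv_q)
```

Running the tally over a SchedLog at intervals, alongside the Recv-Q reading, would show whether the error rate and the backlog grow together.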
Fixed upstream, see http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1270

Built in 7.4.3-0.5
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, the UDP command queue for the scheduler daemon could back up several millions of packets and the user received, large amounts of errors when checking "netstat -ap| grep schedd". This update resolves this issue.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Previously, the UDP command queue for the scheduler daemon could back up several millions of packets and the user received, large amounts of errors when checking "netstat -ap| grep schedd". This update resolves this issue.
+Previously, the UDP command queue for the scheduler daemon could back up several millions of packets and the user received, large amounts of errors when checking netstat -ap grep schedd". This update resolves this issue.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html