Bug 572668

Summary: Potential shadow/schedd protocol error
Product: Red Hat Enterprise MRG
Component: condor
Version: 1.2
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Scott Spurrier <spurrier>
Assignee: Matthew Farrellee <matt>
QA Contact: Luigi Toscano <ltoscano>
CC: fnadge, jthomas, ltoscano, matt, tao
Target Milestone: 1.3
Hardware: All
OS: Linux
Doc Type: Bug Fix
Doc Text:
Previously, the UDP command queue for the scheduler daemon could back up by several million packets, as seen when checking "netstat -ap | grep schedd", and a large number of errors was written to the SchedLog. This update resolves this issue.
Last Closed: 2010-10-14 16:13:04 UTC

Description Scott Spurrier 2010-03-11 19:54:27 UTC
Description of problem:

While analyzing condor_schedd responsiveness, we noticed in the output of "netstat -ap | grep schedd" that the UDP command queue for the schedd had backed up to 3 million packets. We were also receiving many errors per second in the SchedLog that look like the following:

03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
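
To gauge how fast these errors accumulate, the SchedLog lines can be tallied per second and by message kind. The following is a minimal Python sketch based only on the line format shown above; the function name tally_udp_errors and the regular expression are illustrative assumptions, not part of Condor.

```python
import re
from collections import Counter

# Pattern derived from the SchedLog excerpt above. This is an assumption
# about the line format, not an official Condor log grammar.
UDP_ERROR = re.compile(
    r"^(?P<stamp>\d\d/\d\d \d\d:\d\d:\d\d) \(pid:(?P<pid>\d+)\) ERROR: "
    r"receiving new UDP message but found a (?P<kind>long|short) message "
    r"still waiting to be closed"
)


def tally_udp_errors(lines):
    """Count matching error lines, keyed by (timestamp, message kind)."""
    counts = Counter()
    for line in lines:
        match = UDP_ERROR.match(line.strip())
        if match:
            counts[(match.group("stamp"), match.group("kind"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python tally.py /path/to/SchedLog
    with open(sys.argv[1], errors="replace") as log:
        for (stamp, kind), n in sorted(tally_udp_errors(log).items()):
            print(stamp, kind, n)
```

Applied to the 13 lines quoted above, this would report 6 "long" and 7 "short" errors, all within the single second 03/03 11:32:52.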

We captured SchedLog, ShadowLog, and a packet capture using

tshark -R udp.port==(UDP port backlogged) -i lo -w /tmp/stuff.cap

I'll upload that info to dropbox.redhat.com given this IT.
1819c32a28ef5a20ccb018239f325207  IT596093-logs.tar.bz2

This was with condor-7.4.3-0.3.el5

------------------------------------
FWIW, after a night's run:

(on gemco)
netstat -apn | grep -i schedd

udp   67106984      0 0.0.0.0:41800               0.0.0.0:*                               15551/condor_schedd

That's 67 million queued UDP packets.  We've clearly fallen off the cliff here.
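
For monitoring this over time, the Recv-Q column can be extracted from the netstat output rather than read by eye. The sketch below is a minimal Python example that assumes the whitespace-separated column layout shown above; the function name parse_netstat_udp and the returned keys are hypothetical. Note that netstat(8) documents Recv-Q as a count of bytes not yet copied by the program, so the value may be better read as queue depth in bytes than as an exact packet count.

```python
def parse_netstat_udp(line):
    """Parse one UDP line of `netstat -apn` output into a dict.

    Assumes the column order seen in this report: proto, Recv-Q, Send-Q,
    local address, foreign address, then PID/program as the last field.
    Returns None for lines that do not look like a UDP socket entry.
    """
    fields = line.split()
    if len(fields) < 6 or not fields[0].startswith("udp"):
        return None
    pid, _, program = fields[-1].partition("/")
    return {
        "recv_q": int(fields[1]),
        "send_q": int(fields[2]),
        "local": fields[3],
        "pid": int(pid) if pid.isdigit() else None,
        "program": program,
    }


if __name__ == "__main__":
    line = ("udp   67106984      0 0.0.0.0:41800               "
            "0.0.0.0:*                               15551/condor_schedd")
    print(parse_netstat_udp(line))
```

A periodic job could feed each line of "netstat -apn" through this function and log recv_q for condor_schedd to see whether the backlog is growing or draining.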

I'm assuming that it might be desirable to inspect that packet queue at a system level so that you can more readily identify a packet with corrupted protocol.  That seems like a job for the SystemTap experts.  We're on RHEL 5.4 here, so if someone can sling a tap script to dump out the packets in queue to help the debugging here it might be worthwhile.

Comment 1 Matthew Farrellee 2010-03-16 02:40:45 UTC
Fixed upstream, see http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1270

Built in 7.4.3-0.5

Comment 5 Florian Nadge 2010-10-07 18:40:00 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, the UDP command queue for the scheduler daemon could back up several millions of packets and the user received, large amounts of errors when checking "netstat -ap| grep schedd". This update resolves this issue.

Comment 6 Florian Nadge 2010-10-08 10:21:18 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Previously, the UDP command queue for the scheduler daemon could back up several millions of packets and the user received, large amounts of errors when checking "netstat -ap| grep schedd". This update resolves this issue.
+Previously, the UDP command queue for the scheduler daemon could back up several millions of packets and the user received, large amounts of errors when checking netstat -ap  grep schedd". This update resolves this issue.

Comment 8 errata-xmlrpc 2010-10-14 16:13:04 UTC
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html