Bug 572668

Summary:	Potential shadow/schedd protocol error
Product:	Red Hat Enterprise MRG	Reporter:	Scott Spurrier <spurrier>
Component:	condor	Assignee:	Matthew Farrellee <matt>
Status:	CLOSED ERRATA	QA Contact:	Luigi Toscano <ltoscano>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	1.2	CC:	fnadge, jthomas, ltoscano, matt, tao
Target Milestone:	1.3
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Previously, the UDP command queue for the scheduler daemon could back up by several million packets, visible when checking "netstat -ap | grep schedd", and large numbers of errors were written to the SchedLog. This update resolves this issue.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2010-10-14 16:13:04 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Scott Spurrier 2010-03-11 19:54:27 UTC

Description of problem:

While analyzing condor_schedd responsiveness we noticed that when looking at "netstat -ap| grep schedd" the UDP command queue for the schedd had backed up to 3 million packets and we were receiving many, many errors per second in the SchedLog that look like the following:

03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
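To put a number on "many errors per second", the lines above can be tallied per timestamp and per message kind. The following is a minimal sketch, not part of Condor: the function name, the regular expression, and the sample list are illustrative assumptions based only on the log format quoted above.

```python
import re
from collections import Counter

# Matches the SchedLog error lines quoted in this report.
# The pattern is an assumption derived from the excerpt, not from Condor sources.
ERROR_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(?P<pid>\d+)\) "
    r"ERROR: receiving new UDP message but found a (?P<kind>long|short) "
    r"message still waiting to be closed"
)


def count_unclosed_message_errors(lines):
    """Return a Counter keyed by (timestamp, kind) for matching log lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.match(line)
        if match:
            counts[(match.group("ts"), match.group("kind"))] += 1
    return counts


# Three lines copied from the excerpt, plus a placeholder that must not match.
SAMPLE_LINES = [
    "03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.",
    "03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.",
    "03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.",
    "placeholder line that is not an error",
]

if __name__ == "__main__":
    for (timestamp, kind), n in sorted(count_unclosed_message_errors(SAMPLE_LINES).items()):
        print(timestamp, kind, n)
```

In practice the function would be fed the lines of the real SchedLog file instead of the sample list.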

We captured SchedLog, ShadowLog, and a packet capture using

tshark -R udp.port==(UDP port backlogged) -i lo -w /tmp/stuff.cap

I'll upload that info to dropbox.redhat.com given this IT.
1819c32a28ef5a20ccb018239f325207  IT596093-logs.tar.bz2

This was with condor-7.4.3-0.3.el5

------------------------------------
FWIW, after a night's run:

(on gemco)
netstat -apn | grep -i schedd

udp   67106984      0 0.0.0.0:41800               0.0.0.0:*                               15551/condor_schedd

That's 67 million queued UDP packets.  We've clearly fallen off the cliff here.
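The netstat check above could be scripted so the backlog is tracked over time. This is a rough sketch under stated assumptions: the function name and returned dictionary keys are made up for illustration, and the column layout is taken from the single row shown above.

```python
def parse_netstat_udp_line(line):
    """Parse one UDP row of `netstat -apn` output into a dict, or return None.

    Assumed column order (taken from the row quoted in this report):
    Proto, Recv-Q, Send-Q, Local Address, Foreign Address, [State], PID/Program.
    """
    fields = line.split()
    if len(fields) < 6 or not fields[0].startswith("udp"):
        return None
    # The last column has the form "PID/Program name".
    pid, _, program = fields[-1].partition("/")
    return {
        "proto": fields[0],
        # Recv-Q as printed by netstat; the net-tools man page describes
        # this column as a byte count.
        "recv_q": int(fields[1]),
        "send_q": int(fields[2]),
        "local": fields[3],
        "foreign": fields[4],
        "pid": int(pid) if pid.isdigit() else None,
        "program": program,
    }


# The row reported above, copied verbatim.
ROW = (
    "udp   67106984      0 0.0.0.0:41800               0.0.0.0:*"
    "                               15551/condor_schedd"
)

if __name__ == "__main__":
    info = parse_netstat_udp_line(ROW)
    print(info["program"], info["local"], info["recv_q"])
```

Running the parser periodically against live netstat output and logging `recv_q` would show whether the queue drains or keeps growing.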

I'm assuming that it might be desirable to inspect that packet queue at a system level so that you can more readily identify a packet with corrupted protocol.  That seems like a job for the SystemTap experts.  We're on RHEL 5.4 here, so if someone can sling a tap script to dump out the packets in queue to help the debugging here it might be worthwhile.

Comment 1 Matthew Farrellee 2010-03-16 02:40:45 UTC

Fixed upstream, see http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1270

Built in 7.4.3-0.5

Comment 5 Florian Nadge 2010-10-07 18:40:00 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, the UDP command queue for the scheduler daemon could back up several millions of packets and the user received, large amounts of errors when checking "netstat -ap| grep schedd". This update resolves this issue.

Comment 6 Florian Nadge 2010-10-08 10:21:18 UTC

    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Previously, the UDP command queue for the scheduler daemon could back up several millions of packets and the user received, large amounts of errors when checking "netstat -ap| grep schedd". This update resolves this issue.
+Previously, the UDP command queue for the scheduler daemon could back up several millions of packets and the user received, large amounts of errors when checking netstat -ap  grep schedd". This update resolves this issue.

Comment 8 errata-xmlrpc 2010-10-14 16:13:04 UTC

An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html