Bug 572668 - Potential shadow/schedd protocol error
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.2
Hardware: All Linux
Priority: medium Severity: medium
Target Milestone: 1.3
Target Release: ---
Assigned To: Matthew Farrellee
QA Contact: Luigi Toscano
Depends On:
Blocks:
Reported: 2010-03-11 14:54 EST by Scott Spurrier
Modified: 2018-10-27 10:20 EDT
CC List: 5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, the UDP command queue for the scheduler daemon could back up by several million packets, and the user received large numbers of errors when checking "netstat -ap | grep schedd". This update resolves the issue.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-10-14 12:13:04 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0773 normal SHIPPED_LIVE Moderate: Red Hat Enterprise MRG Messaging and Grid Version 1.3 2010-10-14 11:56:44 EDT

Description Scott Spurrier 2010-03-11 14:54:27 EST
Description of problem:

While analyzing condor_schedd responsiveness we noticed that when looking at "netstat -ap | grep schedd" the UDP command queue for the schedd had backed up to 3 million packets, and we were receiving many, many errors per second in the SchedLog that look like the following:

03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a short message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.
03/03 11:32:52 (pid:29312) ERROR: receiving new UDP message but found a long message still waiting to be closed (consumed=0). Closing it now.

We captured SchedLog, ShadowLog, and a packet capture using

tshark -R udp.port==(UDP port backlogged) -i lo -w /tmp/stuff.cap

I'll upload that info to dropbox.redhat.com given this IT.
1819c32a28ef5a20ccb018239f325207  IT596093-logs.tar.bz2

This was with condor-7.4.3-0.3.el5

------------------------------------
FWIW, after a night's run:

(on gemco)
netstat -apn | grep -i schedd

udp   67106984      0 0.0.0.0:41800               0.0.0.0:*                               15551/condor_schedd

That's 67 million queued UDP packets.  We've clearly fallen off the cliff here.
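A check like the one above can be scripted. The sketch below is a hypothetical monitoring snippet, not part of condor: it scans netstat output for condor_schedd UDP sockets and warns when the Recv-Q column (which counts queued-but-unread bytes) exceeds an arbitrary illustrative threshold.

```shell
#!/bin/sh
# Minimal sketch: flag a runaway schedd UDP receive queue from netstat output.
# The 1 MiB threshold is an arbitrary illustration, not a condor default.
THRESHOLD=1048576

netstat -apn 2>/dev/null | awk -v limit="$THRESHOLD" '
    /^udp/ && /condor_schedd/ {
        # Column 2 of netstat output is Recv-Q, the queued-but-unread bytes.
        if ($2 > limit)
            printf "WARNING: schedd UDP Recv-Q is %d bytes on %s\n", $2, $4
    }'
```

Run from cron, a warning on the socket shown above would have fired long before the queue reached tens of millions of entries.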

I'm assuming it might be desirable to inspect that packet queue at the system level so that a packet with a corrupted protocol can be identified more readily.  That seems like a job for the SystemTap experts.  We're on RHEL 5.4 here, so if someone can sling a tap script to dump out the packets in queue to help the debugging here it might be worthwhile.
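Short of a SystemTap script, some of that queue state is visible from userspace: on Linux, /proc/net/udp exposes a per-socket rx_queue field (hex, in bytes, not packets, which is also worth keeping in mind when reading the netstat figures above). A rough sketch, assuming the standard /proc/net/udp column layout and gawk's strtonum:

```shell
#!/bin/sh
# Rough sketch (assumes the Linux /proc/net/udp layout and gawk): decode the
# hex rx_queue field to list UDP sockets with a nonzero receive backlog.
awk 'NR > 1 {
    # Field 2 is the local address as HEXIP:HEXPORT; field 5 is tx_queue:rx_queue.
    split($2, addr, ":"); split($5, q, ":")
    port = strtonum("0x" addr[2])
    rx   = strtonum("0x" q[2])
    if (rx > 0)
        printf "port %d: %d bytes queued\n", port, rx
}' /proc/net/udp
```

Matching the port printed here against the schedd's UDP command port confirms which socket is backed up without attaching anything to the process.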
Comment 1 Matthew Farrellee 2010-03-15 22:40:45 EDT
Fixed upstream, see http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1270

Built in 7.4.3-0.5
Comment 5 Florian Nadge 2010-10-07 14:40:00 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, the UDP command queue for the scheduler daemon could back up by several million packets, and the user received large numbers of errors when checking "netstat -ap | grep schedd". This update resolves this issue.
Comment 6 Florian Nadge 2010-10-08 06:21:18 EDT
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Previously, the UDP command queue for the scheduler daemon could back up several millions of packets and the user received, large amounts of errors when checking "netstat -ap| grep schedd". This update resolves this issue.
+Previously, the UDP command queue for the scheduler daemon could back up several millions of packets and the user received, large amounts of errors when checking netstat -ap  grep schedd". This update resolves this issue.
Comment 8 errata-xmlrpc 2010-10-14 12:13:04 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html
