641020 – [RFE] improve sos plugin to capture MRG GRID related information

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	sos
Sub Component:
Version:	5.7
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Bryn M. Reeves
QA Contact:	David Kutálek
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	769266

Reported:	2010-10-07 14:52 UTC by Jeremy Eder
Modified:	2012-08-10 09:06 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Enhancement
Doc Text:	Cause: Prior versions of sos only collected basic configuration information for MRG components Consequence: Users of these components would have to manually retrieve required data from the system Change: The set of data collected by the mrggrid module has been greatly expanded to include full logs, configuration and status information Result: With this release the full set of information required for initial analysis of these components is collected automatically on qualified systems
Clone Of:
Environment:
Last Closed:	2012-02-21 03:24:38 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Update mrggrid.py (1.24 KB, patch) 2011-11-01 17:16 UTC, Bryn M. Reeves	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2012:0153	0	normal	SHIPPED_LIVE	Low: sos security, bug fix, and enhancement update	2012-02-21 07:25:08 UTC

Description Jeremy Eder 2010-10-07 14:52:38 UTC

Description of problem:

List of additional config/debug to add to mrggrid.py.

Comment 1 Jeremy Eder 2010-10-07 15:08:17 UTC

First pass at defining this BZ...



Files to grab:

1) entire directory:  /etc/condor
2) entire directory:  /var/log/condor

Command output:

3) # condor_q
4) # condor_q -better-analyze -long
5) # condor_status
6) # condor_status -l
7) # condor_history
8) # condor_version

Comment 3 Bryn M. Reeves 2011-01-14 16:40:53 UTC

MRG can ship its own sos plugin, and this causes sosreports on systems with the MRG packages installed to automatically pull in this data.

Alternatively, if MRG has a script that collects this, we're happy to call it from sos and bundle up the data into the tarball (cf. rhn/lvmdump/satellite-debug etc.), but this requires more maintenance in sos if options etc. change (or if there's a need to support multiple versions with different command-line or naming conventions).

Comment 6 Bryn M. Reeves 2011-01-14 18:25:18 UTC

OK, so for now I think we can quite easily just cook up a mrggrid.py per Jeremy's suggestions. If MRG introduces its own wrap-up script (a la sat-debug et al.) at a later time we can switch the plugin to use that.

I'll try to have a play with this over the weekend.

Comment 7 Jon Thomas 2011-01-14 19:18:31 UTC

Hi,

This might take a few postings to get everything that we would need. 

in 1.3, config is in 

/etc/condor
and (depending on whether wallaby is used)

/var/lib/condor/wallaby_node.config 

prior to 1.3, config was in 

/var/lib/condor

what to collect: 

all with subdirectories:  /etc/condor
toplevel: /var/lib/condor

--------------------

The running config can be obtained with

condor_config_val -dump

what to collect: condor_config_val -dump

----------------------------

DAEMON_LIST tells us what type of node we are on

$ condor_config_val -dump | grep DAEMON_LIST
DAEMON_LIST = MASTER, STARTD, SCHEDD, COLLECTOR, NEGOTIATOR, QMF_CONFIGD
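
A minimal sketch of how a script could pull the daemon list out of that output, assuming the `NAME = value` line format shown above. The function name `parse_daemon_list` is hypothetical and is not part of sos or Condor.

```python
def parse_daemon_list(dump_text):
    """Return the daemons named on the DAEMON_LIST line of
    `condor_config_val -dump` output, or an empty list if the line is absent."""
    for line in dump_text.splitlines():
        name, sep, value = line.partition("=")
        # Match the exact parameter name so that e.g. FOO_DAEMON_LIST is skipped.
        if sep and name.strip().upper() == "DAEMON_LIST":
            return [d.strip() for d in value.split(",") if d.strip()]
    return []


# Sample line copied from the output shown above.
sample = "DAEMON_LIST = MASTER, STARTD, SCHEDD, COLLECTOR, NEGOTIATOR, QMF_CONFIGD\n"
daemons = parse_daemon_list(sample)
print("SCHEDD" in daemons, "NEGOTIATOR" in daemons)
```

A caller could then decide which node-specific files are worth looking for, for example Accountantnew.log only when NEGOTIATOR is listed.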

--------------

logs are in 

/var/log/condor
/var/lib/condor/spool/Accountantnew.log (it's a log, but also a persistent data store of usage related info)

what to collect: 

all: /var/log/condor except on schedd nodes
all: /var/lib/condor/spool/Accountantnew.log (really should only be found on negotiator node)

On SCHEDD nodes, we generally don't want to collect every per job log such as StarterLog.slot1. There will be a StarterLog.slotx for every slot (a cluster with 1000 cpus might have 1000 slots and 1000 individual logs). The problem is that we actually might need a sampling of these logs, but collecting every one may cause the sosreport to become huge.
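
The sampling idea could be sketched roughly as below. This is only an illustration: the helper name `sample_starter_logs`, the default limit of 10, and the assumption that slot logs are named exactly `StarterLog.slot<N>` are all hypothetical and not taken from sos or Condor.

```python
import re

# Assumed naming pattern for per-slot logs: StarterLog.slot1, StarterLog.slot2, ...
_SLOT_LOG = re.compile(r"^StarterLog\.slot(\d+)$")


def sample_starter_logs(filenames, limit=10):
    """Pick at most `limit` per-slot StarterLog names, lowest slot numbers first."""
    slots = []
    for name in filenames:
        match = _SLOT_LOG.match(name)
        if match:
            slots.append((int(match.group(1)), name))
    slots.sort()  # numeric order, so slot2 sorts before slot10
    return [name for _, name in slots[:limit]]


names = ["MasterLog", "StarterLog.slot10", "StarterLog.slot2", "StarterLog.slot1"]
print(sample_starter_logs(names, limit=2))
```

In practice the list of names would come from something like `os.listdir("/var/log/condor")`, and the remaining non-slot logs would still be collected in full.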

-------------------------------

/var/lib/condor/spool

This is another case where we might want to see the contents. The spool is where transient data is placed. On the remote node this might be the job's executable and data.

what to collect: ls -l /var/lib/condor/spool

* The history file lives in /var/lib/condor/spool. It lists the classad of every job that has been run, so it is likely too large to collect.

Comment 8 Jon Thomas 2011-01-14 20:15:55 UTC

oops, disregard the previous section about logs

--------------

logs are in 

/var/log/condor
/var/lib/condor/spool/Accountantnew.log (it's a log, but also a persistent data
store of usage related info)
/var/lib/condor/spool/job_queue.log

what to collect: 

all: /var/log/condor
all: /var/lib/condor/spool/Accountantnew.log (really should only be found on
negotiator node)
all: /var/lib/condor/spool/job_queue.log

There will be a StarterLog.slotx for every slot (a cluster with 1000 cpus might have 1000 slots and 1000 individual logs). These will be on the startd machines. If we are troubleshooting a specific job problem, we need this from the machine on which the job is running. If it's the case where the job isn't running at all, we need to obtain this (and rest of sos) on a machine that represents a typical machine in the cluster. What we want to avoid is obtaining a sos report for every machine in the cluster. 

-------------------------------

Comment 10 RHEL Program Management 2011-06-21 06:00:16 UTC

This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.7 and Red Hat does not plan to fix this issue in the currently developed update.

Contact your manager or support representative in case you need to escalate this bug.

Comment 12 Bryn M. Reeves 2011-11-01 17:16:38 UTC

Created attachment 531166 [details]
Update mrggrid.py

+        self.addCopySpec("/etc/condor")
+        self.addCopySpec("/var/log/condor")
+        self.addCopySpec("/var/lib/condor/spool/Accountantnew.log")
+        self.addCopySpec("/var/lib/condor/spool/job_queue.log")
+        self.collectExtOutput("ls -l /var/lib/condor/spool")
+        self.collectExtOutput("condor_config_val -dump")
+        self.collectExtOutput("condor_q")
+        self.collectExtOutput("condor_q -better-analyze -long")
+        self.collectExtOutput("condor_status")
+        self.collectExtOutput("condor_status -l")
+        self.collectExtOutput("condor_history")
+        self.collectExtOutput("condor_version")

Comment 13 Matthew Farrellee 2011-11-01 17:32:11 UTC

Get config loading order: condor_config_val -config

Get list of wallaby configured nodes: wallaby inventory

Record wallaby database: wallaby dump FILE, then copy FILE

Get list of agents connected to messaging bus, on host running qpidd (Messaging broker): qpid-stat -c

List of running Grid components: condor_status -any

Cumin log files: /var/log/cumin

Comment 16 Bryn M. Reeves 2011-12-06 19:09:44 UTC

Can wallaby dump be run with '-' as the file argument to dump on stdout?

Otherwise this would need a nasty hack to handle in sosreport since all the file copying APIs are asynchronous - the only synchronous collection is for tool output. This makes collecting the data and cleaning up messy without making changes outside the mrg plugin.

Also, what is the standard path for the various mrg commands? We prefer to use absolute paths in sos rather than depending on the content of $PATH.

Comment 17 Matthew Farrellee 2011-12-06 19:26:26 UTC

(In reply to comment #16)
> Can wallaby dump be run with '-' as the file argument to dump on stdout?
> 
> Otherwise this would need a nasty hack to handle in sosreport since all the
> file copying APIs are asynchronous - the only synchronous collection is for
> tool output. This makes collecting the data and cleaning up messy without
> making changes outside the mrg plugin.

'/usr/bin/wallaby dump' without additional arguments will go to stdout


> Also, what is the standard path for the various mrg commands? We prefer to use
> absolute paths in sos rather than depending on the content of $PATH.

$ rpm -ql wallaby-utils | grep -e bin/wallaby
/usr/bin/wallaby
$ rpm -ql qpid-tools | grep -e bin/qpid-stat
/usr/bin/qpid-stat
$ rpm -ql condor | grep -e bin/condor_config_val -e bin/condor_status
/usr/bin/condor_config_val
/usr/bin/condor_status
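
A rough sketch of how the commands requested in comment 13 could be rewritten to use these absolute paths before being handed to a collector. The helper `absolute_commands` and the injectable `is_executable` check are hypothetical and not part of sos; only the paths and command lines come from this bug.

```python
import os

# Absolute paths as reported by the rpm -ql output above.
MRG_BINARIES = {
    "wallaby": "/usr/bin/wallaby",
    "qpid-stat": "/usr/bin/qpid-stat",
    "condor_config_val": "/usr/bin/condor_config_val",
    "condor_status": "/usr/bin/condor_status",
}

# Commands requested in comment 13 ("wallaby dump" with no FILE goes to stdout).
REQUESTED = [
    "condor_config_val -config",
    "wallaby inventory",
    "wallaby dump",
    "qpid-stat -c",
    "condor_status -any",
]


def absolute_commands(commands, binaries=MRG_BINARIES, is_executable=None):
    """Rewrite each command to start with its absolute path, skipping any
    command whose tool is unknown or not executable on this host."""
    if is_executable is None:
        is_executable = lambda path: os.access(path, os.X_OK)
    resolved = []
    for cmd in commands:
        tool, _, args = cmd.partition(" ")
        path = binaries.get(tool)
        if path and is_executable(path):
            resolved.append((path + " " + args).strip())
    return resolved


for cmd in absolute_commands(REQUESTED):
    print(cmd)
```

Skipping tools that are not installed keeps the plugin quiet on hosts that run only some of the MRG components, for example a node without qpid-tools.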

Comment 18 Lukáš Zachar 2011-12-09 16:17:01 UTC

Verified.
All requested outputs/logs/configuration files are collected.

Comment 20 Bryn M. Reeves 2012-01-25 17:20:36 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Prior versions of sos only collected basic configuration information for MRG
components

Consequence: Users of these components would have to manually retrieve required data form
the system

Change: The set of data collected by the mrggrid module has been greatly expanded to
include full logs, configuration and status information

Result: With this release the full set of information required for initial analysis
of these components is collected automatically on qualified systems

Comment 21 Bryn M. Reeves 2012-01-25 17:47:27 UTC

    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,11 +1,7 @@
-Cause: Prior versions of sos only collected basic configuration information for MRG
-components
+Cause: Prior versions of sos only collected basic configuration information for MRG components
 
-Consequence: Users of these components would have to manually retrieve required data form
-the system
+Consequence: Users of these components would have to manually retrieve required data from the system
 
-Change: The set of data collected by the mrggrid module has been greatly expanded to
-include full logs, configuration and status information
+Change: The set of data collected by the mrggrid module has been greatly expanded to include full logs, configuration and status information
 
-Result: With this release the full set of information required for initial analysis
-of these components is collected automatically on qualified systems
+Result: With this release the full set of information required for initial analysis of these components is collected automatically on qualified systems

Comment 22 errata-xmlrpc 2012-02-21 03:24:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0153.html
