Description of problem: List of additional configuration and debug data to collect in mrggrid.py.
First pass at defining this BZ...

Files to grab:
1) entire directory: /etc/condor
2) entire directory: /var/log/condor

Command output:
3) # condor_q
4) # condor_q -better-analyze -long
5) # condor_status
6) # condor_status -l
7) # condor_history
8) # condor_version
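For illustration, a first cut of such a plugin might look like the sketch below. This assumes the sos 2.x plugintools API - addCopySpec() for file collection and collectExtOutput() for command output, the same calls that appear in the attachment later in this BZ; the module layout and class skeleton are assumptions, not a shipped file:

import sos.plugintools

class mrggrid(sos.plugintools.PluginBase):
    """MRG Grid related information"""

    def setup(self):
        # Whole directories requested above
        self.addCopySpec("/etc/condor")
        self.addCopySpec("/var/log/condor")
        # Command output requested above
        for cmd in ("condor_q",
                    "condor_q -better-analyze -long",
                    "condor_status",
                    "condor_status -l",
                    "condor_history",
                    "condor_version"):
            self.collectExtOutput(cmd)
        return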
MRG can ship its own sos plugin, and this would cause sosreports on systems with the MRG packages installed to automatically pull in this data. Alternately, if MRG has a script that collects this, we're happy to call it from SoS and bundle up the data into the tarball (cf. rhn/lvmdump/satellite-debug etc.), but this requires more maintenance in sos if options etc. change (or if there's a need to support multiple versions with different command line / naming conventions).
OK, so for now I think we can quite easily just cook up a mrggrid.py per Jeremy's suggestions. If MRG introduces its own wrap-up script (a la sat-debug et al.) at a later time we can switch the plugin to use that. I'll try to have a play with this over the weekend.
Hi,

This might take a few postings to get everything that we would need.

In 1.3, config is in /etc/condor and (depending on whether wallaby is used) /var/lib/condor/wallaby_node.config. Prior to 1.3, config was in /var/lib/condor.

What to collect:
all, with subdirectories: /etc/condor
top level only: /var/lib/condor

--------------------

The running config can be obtained with condor_config_val -dump.

What to collect:
condor_config_val -dump

----------------------------

DAEMON_LIST tells us what type of node we are on (a sketch of querying this from the plugin follows after this comment):

$ condor_config_val -dump | grep DAEMON_LIST
DAEMON_LIST = MASTER, STARTD, SCHEDD, COLLECTOR, NEGOTIATOR, QMF_CONFIGD

--------------

Logs are in:
/var/log/condor
/var/lib/condor/spool/Accountantnew.log (it's a log, but also a persistent data store of usage-related info)

What to collect:
all: /var/log/condor, except on schedd nodes
all: /var/lib/condor/spool/Accountantnew.log (really should only be found on the negotiator node)

On SCHEDD nodes, we generally don't want to collect every per-job log such as StarterLog.slot1. There will be a StarterLog.slotx for every slot (a cluster with 1000 cpus might have 1000 slots and 1000 individual logs). The problem is that we actually might need a sampling of these logs, but collecting every one may cause the sosreport to become huge.

-------------------------------

/var/lib/condor/spool

This is another case where we might want to see the contents. The spool is where transient data is placed. On the remote node this might be the job's executable and data.

What to collect:
ls -l /var/lib/condor/spool

* The history file lives in /var/lib/condor/spool. This is a listing of every job's classad that has been run. This is likely too large to collect.
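One way a plugin could act on DAEMON_LIST is sketched below. Querying the single value with "condor_config_val DAEMON_LIST" rather than grepping the full -dump output is an assumption about condor_config_val usage, and the helper name is hypothetical:

import subprocess

def condor_daemon_list():
    """Return the daemons configured on this node as a list,
    e.g. ['MASTER', 'STARTD', 'SCHEDD', ...]."""
    proc = subprocess.Popen(["condor_config_val", "DAEMON_LIST"],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        return []
    return [d.strip() for d in out.decode().split(",")]

# Example: skip negotiator-only data unless this node runs one.
# if "NEGOTIATOR" in condor_daemon_list(): ...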
Oops, disregard the previous section about logs.

--------------

Logs are in:
/var/log/condor
/var/lib/condor/spool/Accountantnew.log (it's a log, but also a persistent data store of usage-related info)
/var/lib/condor/spool/job_queue.log

What to collect:
all: /var/log/condor
all: /var/lib/condor/spool/Accountantnew.log (really should only be found on the negotiator node)
all: /var/lib/condor/spool/job_queue.log

There will be a StarterLog.slotx for every slot (a cluster with 1000 cpus might have 1000 slots and 1000 individual logs). These will be on the startd machines. If we are troubleshooting a specific job problem, we need this from the machine on which the job is running. If the job isn't running at all, we need to obtain this (and the rest of sos) on a machine that represents a typical machine in the cluster. What we want to avoid is obtaining a sos report for every machine in the cluster.

-------------------------------
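Given that, one way to keep the report bounded would be to sample the per-slot logs rather than copy them all. A rough sketch only - the cap of 10, the glob pattern, and the helper name are illustrative assumptions, not requirements from this BZ:

import glob

MAX_SLOT_LOGS = 10  # arbitrary cap, for illustration only

def sample_slot_logs(plugin, limit=MAX_SLOT_LOGS):
    """Copy a bounded sample of per-slot StarterLogs via the
    plugin's addCopySpec(), instead of pulling 1000 logs on a
    1000-slot cluster."""
    for log in sorted(glob.glob("/var/log/condor/StarterLog.slot*"))[:limit]:
        plugin.addCopySpec(log)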
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.7, and Red Hat does not plan to fix this issue in the currently developed update. Contact your manager or support representative in case you need to escalate this bug.
Created attachment 531166 [details]
Update mrggrid.py

+ self.addCopySpec("/etc/condor")
+ self.addCopySpec("/var/log/condor")
+ self.addCopySpec("/var/lib/condor/spool/Accountantnew.log")
+ self.addCopySpec("/var/lib/condor/spool/job_queue.log")
+ self.collectExtOutput("ls -l /var/lib/condor/spool")
+ self.collectExtOutput("condor_config_val -dump")
+ self.collectExtOutput("condor_q")
+ self.collectExtOutput("condor_q -better-analyze -long")
+ self.collectExtOutput("condor_status")
+ self.collectExtOutput("condor_status -l")
+ self.collectExtOutput("condor_history")
+ self.collectExtOutput("condor_version")
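Not shown in the diff: gating the plugin so it only fires on MRG systems. One hedged option, assuming sos's plugintools exposes a checkenabled() hook for enabling plugins as other plugins of that era do, is a simple path test rather than a package check; keying off the config directory is an assumption about the desired gating, not part of the attached patch:

import os
import sos.plugintools

class mrggrid(sos.plugintools.PluginBase):
    """MRG Grid related information"""

    def checkenabled(self):
        # Only collect on systems that look like condor/MRG nodes.
        return os.path.exists("/etc/condor")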
Get config loading order:
condor_config_val -config

Get list of wallaby configured nodes:
wallaby inventory

Record wallaby database:
wallaby dump FILE, then copy FILE

Get list of agents connected to messaging bus, on host running qpidd (messaging broker):
qpid-stat -c

List of running Grid components:
condor_status -any

Cumin log files:
/var/log/cumin
Can wallaby dump be run with '-' as the file argument to dump to stdout?

Otherwise this would need a nasty hack to handle in sosreport, since all the file copying APIs are asynchronous - the only synchronous collection is for tool output. This makes collecting the data and cleaning up messy without making changes outside the mrg plugin.

Also, what is the standard path for the various mrg commands? We prefer to use absolute paths in sos rather than depending on the content of $PATH.
(In reply to comment #16)
> Can wallaby dump be run with '-' as the file argument to dump to stdout?
>
> Otherwise this would need a nasty hack to handle in sosreport, since all the
> file copying APIs are asynchronous - the only synchronous collection is for
> tool output. This makes collecting the data and cleaning up messy without
> making changes outside the mrg plugin.

'/usr/bin/wallaby dump' without additional arguments will go to stdout.

> Also, what is the standard path for the various mrg commands? We prefer to use
> absolute paths in sos rather than depending on the content of $PATH.

$ rpm -ql wallaby-utils | grep -e bin/wallaby
/usr/bin/wallaby
$ rpm -ql qpid-tools | grep -e bin/qpid-stat
/usr/bin/qpid-stat
$ rpm -ql condor | grep -e bin/condor_config_val -e bin/condor_status
/usr/bin/condor_config_val
/usr/bin/condor_status
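With those paths confirmed and wallaby dump writing to stdout, the additional items requested above could presumably become plain collectExtOutput()/addCopySpec() calls in setup(). A sketch only, not the final patch:

import sos.plugintools

class mrggrid(sos.plugintools.PluginBase):
    """MRG Grid related information"""

    def setup(self):
        # Follow-up items, using the absolute paths confirmed above.
        # '/usr/bin/wallaby dump' with no file argument goes to
        # stdout, so it can be captured like any other command
        # output - no temporary file needed.
        self.collectExtOutput("/usr/bin/condor_config_val -config")
        self.collectExtOutput("/usr/bin/condor_status -any")
        self.collectExtOutput("/usr/bin/wallaby inventory")
        self.collectExtOutput("/usr/bin/wallaby dump")
        self.collectExtOutput("/usr/bin/qpid-stat -c")
        self.addCopySpec("/var/log/cumin")
        return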
Verified. All requested outputs/logs/configuration files are collected.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:

Cause: Prior versions of sos only collected basic configuration information for MRG
components

Consequence: Users of these components would have to manually retrieve required data form
the system

Change: The set of data collected by the mrggrid module has been greatly expanded to
include full logs, configuration and status information

Result: With this release the full set of information required for initial analysis
of these components is collected automatically on qualified systems
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,11 +1,7 @@
-Cause: Prior versions of sos only collected basic configuration information for MRG
-components
+Cause: Prior versions of sos only collected basic configuration information for MRG components

-Consequence: Users of these components would have to manually retrieve required data form
-the system
+Consequence: Users of these components would have to manually retrieve required data from the system

-Change: The set of data collected by the mrggrid module has been greatly expanded to
-include full logs, configuration and status information
+Change: The set of data collected by the mrggrid module has been greatly expanded to include full logs, configuration and status information

-Result: With this release the full set of information required for initial analysis
-of these components is collected automatically on qualified systems
+Result: With this release the full set of information required for initial analysis of these components is collected automatically on qualified systems
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-0153.html