Bug 472129 - Confusing log messages in VMGaHPLog - and unexplained shutdown
Confusing log messages in VMGaHPLog - and unexplained shutdown
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: grid (Show other bugs)
1.0
All Linux
urgent Severity urgent
: 1.1
: ---
Assigned To: Matthew Farrellee
Kim van der Riet
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2008-11-18 16:24 EST by William Henry
Modified: 2009-02-04 11:04 EST (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-02-04 11:04:06 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description William Henry 2008-11-18 16:24:03 EST
Description of problem:

A cryptic an not very explicit message is showing up in the the VMGaHPLog.  It seems to happen every minute or so.  After a few cycles the vm is halted and restarted by condor:

The message is about "Command returned non-zero : ..."

The there is a "No more children processes to reap"

I tested whti while logged into the console and witnessed getting kicked off as it killed the vm.

Snippet of log:

11/18 14:01:13 VMGAHP[9460]: Command: CONDOR_VM_STATUS
11/18 14:01:13 VMGAHP[9460]: Inside XenType::Status
11/18 14:01:13 VMGAHP[9460]: Command returned non-zero: /usr/sbin/condor_vm_xen.sh status /var/lib/condor/execute/dir_9456/xen_vm.config
11/18 14:01:13 DaemonCore: No more children processes to reap.
11/18 14:01:15 VMGAHP[9460]: Command: RESULTS
11/18 14:01:16 VMGAHP[9460]: Command: RESULTS
11/18 14:02:16 VMGAHP[9460]: Command: CONDOR_VM_STATUS
11/18 14:02:16 VMGAHP[9460]: Inside XenType::Status
11/18 14:02:16 VMGAHP[9460]: Command returned non-zero: /usr/sbin/condor_vm_xen.sh status /var/lib/condor/execute/dir_9456/xen_vm.config
11/18 14:02:16 DaemonCore: No more children processes to reap.
11/18 14:02:18 VMGAHP[9460]: Command: RESULTS
11/18 14:02:19 VMGAHP[9460]: Command: RESULTS
11/18 14:03:19 VMGAHP[9460]: Command: CONDOR_VM_STATUS
11/18 14:03:19 VMGAHP[9460]: Inside XenType::Status
11/18 14:03:19 VMGAHP[9460]: Command returned non-zero: /usr/sbin/condor_vm_xen.sh status /var/lib/condor/execute/dir_9456/xen_vm.config
11/18 14:03:19 DaemonCore: No more children processes to reap.
11/18 14:03:21 VMGAHP[9460]: Command: RESULTS
11/18 14:03:23 VMGAHP[9460]: Command: CONDOR_VM_STOP
11/18 14:03:23 VMGAHP[9460]: Inside XenType::Shutdown
11/18 14:03:23 VMGAHP[9460]: Inside XenType::ResumeFromSoftSuspend
11/18 14:03:23 VMGAHP[9460]: Inside XenType::Shutdown
11/18 14:03:23 VMGAHP[9460]: executeStop success!
11/18 14:03:23 DaemonCore: No more children processes to reap.
11/18 14:03:25 VMGAHP[9460]: Command: RESULTS
11/18 14:03:26 VMGAHP[9460]: Command: QUIT
11/18 14:03:26 VMGAHP[9460]: Inside XenType::killVMFast
11/18 14:03:26 VMGAHP[9460]: killVMFast is called
11/18 14:03:27 Reading condor configuration from '/etc/condor/condor_config'
11/18 14:03:27 ******************************************************
11/18 14:03:27 ** condor_vm-gahp (CONDOR_VM_GAHP) STARTING UP
11/18 14:03:27 ** /usr/sbin/condor_vm-gahp
11/18 14:03:27 ** SubsystemInfo: name=VM_GAHP type=GAHP(9) class=DAEMON(1)
11/18 14:03:27 ** Configuration: subsystem:VM_GAHP local:<NONE> class:DAEMON
11/18 14:03:27 ** $CondorVersion: 7.2.0 Nov 15 2008 BuildID: RH-7.2.0-0.1.el5 PRE-RELEASE-UWCS $
11/18 14:03:27 ** $CondorPlatform: I386-LINUX_RHEL5 $
11/18 14:03:27 ** PID = 10492
11/18 14:03:27 ** Log last touched 11/18 14:03:26
11/18 14:03:27 ******************************************************
11/18 14:03:27 Using config source: /etc/condor/condor_config
11/18 14:03:27 Using local config sources: 
11/18 14:03:27    /var/lib/condor/condor_config.local
11/18 14:03:27 DaemonCore: Command Socket at <127.0.0.1:43967>
11/18 14:03:27 Will use UDP to update collector localhost.localdomain <127.0.0.1:9618>
11/18 14:03:27 VMGAHP[10492]: VM-GAHP initialized with run-mode 4
11/18 14:03:27 VMGAHP[10492]: Initial UID/GUID=0/0, EUID/EGUID=100/100, Condor UID/GID=100,103
11/18 14:03:27 VMGAHP[10492]: Initialize Uids: caller=root, job user=condor
11/18 14:03:27 VMGAHP[10492]: Inside XenType::killVMFast
11/18 14:03:27 **** condor_vm-gahp (condor_VM_GAHP) pid 10492 EXITING WITH STATUS 0
11/18 14:03:28 **** condor_vm-gahp (condor_VM_GAHP) pid 9460 EXITING WITH STATUS 0
11/18 14:03:28 Reading condor configuration from '/etc/condor/condor_config'
11/18 14:03:28 ******************************************************
11/18 14:03:28 ** condor_vm-gahp (CONDOR_VM_GAHP) STARTING UP
11/18 14:03:28 ** /usr/sbin/condor_vm-gahp
11/18 14:03:28 ** SubsystemInfo: name=VM_GAHP type=GAHP(9) class=DAEMON(1)
11/18 14:03:28 ** Configuration: subsystem:VM_GAHP local:<NONE> class:DAEMON
11/18 14:03:28 ** $CondorVersion: 7.2.0 Nov 15 2008 BuildID: RH-7.2.0-0.1.el5 PRE-RELEASE-UWCS $
11/18 14:03:28 ** $CondorPlatform: I386-LINUX_RHEL5 $
11/18 14:03:28 ** PID = 10512
11/18 14:03:28 ** Log last touched 11/18 14:03:28
11/18 14:03:28 ******************************************************
11/18 14:03:28 Using config source: /etc/condor/condor_config
11/18 14:03:28 Using local config sources: 
11/18 14:03:28    /var/lib/condor/condor_config.local
11/18 14:03:28 DaemonCore: Command Socket at <127.0.0.1:38446>
11/18 14:03:28 Will use UDP to update collector localhost.localdomain <127.0.0.1:9618>
11/18 14:03:28 VMGAHP[10512]: VM-GAHP initialized with run-mode 0
11/18 14:03:28 VMGAHP[10512]: Initial UID/GUID=0/0, EUID/EGUID=100/100, Condor UID/GID=100,103
11/18 14:03:28 VMGAHP[10512]: Initialize Uids: caller=root, job user=condor
11/18 14:03:28 **** condor_vm-gahp (condor_VM_GAHP) pid 10512 EXITING WITH STATUS 0

condor_config.local vm settings:

##  VM Universe Parameters 
VM_TYPE = xen
VM_GAHP_SERVER = $(SBIN)/condor_vm-gahp
VM_GAHP_LOG = $(LOG)/VMGahpLog
MAX_VM_GAHP_LOG   = 1000000
VM_GAHP_DEBUG = D_FULLDEBUG
VM_VERSION = 0.0
VM_MEMORY = 1024
VM_MAX_MEMORY = 1024
XEN_SCRIPT = $(SBIN)/condor_vm_xen.sh
XEN_BOOTLOADER = /usr/bin/pygrub
XEN_DEFAULT_KERNEL = /boot/vmlinuz-2.6.18-92.1.18.el5xen
XEN_DEFAULT_INITRD = /boot/initrd-2.6.18-92.1.18.el5xen.img


job submission file:

universe = vm
vm_type = xen
vm_memory = 512
xen_kernel = included
# below is the disk image created by the Virtual Machine Manager
xen_disk = /mnt/fedora-home/vms/rhel5.img:xvda:w
# the executable has no meaning except to be a title for a VM job
executable = Test Xen Job
queue





Version-Release number of selected component (if applicable):
Comment 1 William Henry 2008-11-18 16:33:10 EST
The StarterLog.slot2:

11/18 14:25:50 Job 64.0 set to execute immediately
11/18 14:25:50 Starting a VM universe job with ID: 64.0
11/18 14:25:50 About to start new VM
11/18 14:25:50 About to exec /usr/sbin/condor_vm-gahp -f -M 3
11/18 14:25:50 VMGAHP server pid=13937
11/18 14:26:49 StartJob for VM succeeded
11/18 14:26:49 condor_write(): Socket closed when trying to write 320 bytes to
<127.0.0.1:43363>, fd is 9
11/18 14:26:49 Buf::write(): condor_write() failed
11/18 14:26:49 SECMAN: Error sending response classad!
MyType = "(unknown type)"
TargetType = "(unknown type)"
AuthMethods = "FS,KERBEROS"
CryptoMethods = "3DES,BLOWFISH"
OutgoingNegotiation = "PREFERRED"
Authentication = "OPTIONAL"
Encryption = "OPTIONAL"
Integrity = "OPTIONAL"
Enact = "NO"
Subsystem = "VM_GAHP"
ParentUniqueID = "localhost:13933:1227043549"
ServerPid = 13937
SessionDuration = "86400"
NewSession = "YES"
RemoteVersion = "$CondorVersion: 7.2.0 Nov 15 2008 BuildID: RH-7.2.0-0.1.el5
PRE-RELEASE-UWCS $"
ServerCommandSock = "<127.0.0.1:47600>"
Command = 60010
AuthCommand = 60008
11/18 14:26:50 ERROR: DC_AUTHENTICATE unable to receive auth_info!
11/18 14:27:52 Received VM status, result(1,VMGAHP_ERR_CRITICAL)
11/18 14:28:55 Received VM status, result(1,VMGAHP_ERR_CRITICAL)
11/18 14:29:58 Received VM status, result(1,VMGAHP_ERR_CRITICAL)
11/18 14:31:01 Received VM status, result(1,VMGAHP_ERR_CRITICAL)
11/18 14:32:04 Repeated attempts to receive valid VM status failed up to 5
times
11/18 14:32:05 ShutdownFast all jobs.
11/18 14:32:10 Process exited, pid=13937, status=0
11/18 14:32:10 Last process exited, now Starter is exiting
11/18 14:32:10 **** condor_starter (condor_STARTER) pid 13933 EXITING WITH
STATUS 0
Comment 2 William Henry 2008-11-18 16:44:22 EST
Nothing in the xend-debug.log to report.

The only lines that get posted when the error occurs in xend.log are:
[2008-11-18 14:42:26 xend.XendDomainInfo 3251] DEBUG (XendDomainInfo:1560) XendDomainInfo.destroy: domid=11
[2008-11-18 14:42:26 xend.XendDomainInfo 3251] DEBUG (XendDomainInfo:1568) XendDomainInfo.destroyDomain(11)


The previous line while running was 6 minutes earlier:
[2008-11-18 14:36:35 xend 3251] INFO (XendDomain:380) Domain whenry_localhost.localdomain_64_0 (11) unpaused.
Comment 3 Matthew Farrellee 2008-11-19 20:39:48 EST
So, the cryptic message should be fixed - it has details now. What I'd be concerned with here is that the VM is being shutdown. What it looks like is happening is STATUS is tried a number of times (polling) and never returns because of whatever is causing the non-zero return. The result is Condor kills the VM.

Can you run 7.2.0-0.2 and report if you still see the non-zero messages and if so what the error truly is?
Comment 4 Jeff Needle 2008-11-20 08:41:19 EST
Adding additional debug stuff seems to have fixed this.  At least, the non-zero message is gone and the guest no longer gets killed.  Go figure.  Verified in condor-7.2.0-0.2.el5.  While it would be nice to know what the error was, I can't seem to make it reappear, and the behavior is now as expected, so I'm going to mark this as VERIFIED and we'll open a new bug with more details if we see the non-zero behavior resurface.
Comment 6 errata-xmlrpc 2009-02-04 11:04:06 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0036.html

Note You need to log in before you can comment on or make changes to this bug.