Bug 916612
| Summary: | Suspend and continue of VM universe job | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Daniel Horák <dahorak> |
| Component: | condor | Assignee: | grid-maint-list <grid-maint-list> |
| Status: | CLOSED WONTFIX | QA Contact: | MRG Quality Engineering <mrgqe-bugs> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 2.2 | CC: | matt, tstclair |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-26 20:01:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | Condor logs (attachment 703899) | | |
This also applies to MRG 2.2: condor-7.6.5-0.22.
Created attachment 703899 [details]: Condor logs.

Description of problem:
What is the expected behaviour when suspending a VM universe job? After suspending a VM job, it appears suspended (in condor_q), but after a while it goes to the held state.

Version-Release number of selected component (if applicable):
MRG 2.3: condor-7.8.8-0.4.1

How reproducible:
100%

Steps to Reproduce:
1. Configure condor to run VM universe jobs (KVM in this case):

```
VM_TYPE = kvm
VM_MEMORY = 512
VM_MAX_MEMORY = 1024
VM_NETWORKING = TRUE
VM_NETWORKING_TYPE = nat,bridge
VM_NETWORKING_DEFAULT_TYPE = bridge
VM_NETWORKING_BRIDGE_INTERFACE = br0
MAX_VM_GAHP_LOG = 1000000
VM_GAHP_DEBUG = D_FULLDEBUG
```

2. Prepare a VM universe job:

```
# cat /tmp/vmuniverse.job
Universe=vm
Log=/tmp/vm_universe.log.$(cluster)
Executable=testvm
VM_TYPE=kvm
VM_MEMORY=512
VM_DISK=/var/lib/libvirt/images/vmuniverse-rhel6x.img:vda:w
Queue
```

3. Submit the job, wait for it to start, then try to suspend it:

```
$ condor_submit vmuniverse.job
Submitting job(s).
1 job(s) submitted to cluster 1.

$ condor_suspend 1.0
Job 1.0 suspended

# condor_q
-- Submitter: HOSTNAME : <192.168.199.1:56377> : HOSTNAME
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     test    2/28 14:04   0+00:00:34 R  0   0.0  testvm

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

# condor_q
-- Submitter: HOSTNAME : <192.168.199.1:56377> : HOSTNAME
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     test    2/28 14:04   0+00:01:14 S  0   0.0  testvm

1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 1 suspended

# condor_status
Name           OpSys  Arch   State     Activity LoadAv Mem  ActvtyTime
slot1@HOSTNAME LINUX  X86_64 Claimed   Suspende 1.120  1206 0+00:00:04
slot2@HOSTNAME LINUX  X86_64 Unclaimed Idle     1.000  1206 0+00:00:54
slot3@HOSTNAME LINUX  X86_64 Unclaimed Idle     1.000  1206 0+00:00:55
slot4@HOSTNAME LINUX  X86_64 Unclaimed Idle     1.000  1206 0+00:00:56

             Machines Owner Claimed Unclaimed Matched Preempting
X86_64/LINUX        4     0       1         3       0          0
       Total        4     0       1         3       0          0

# condor_q
-- Submitter: HOSTNAME : <192.168.199.1:56377> : HOSTNAME
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     test    2/28 14:04   0+00:01:50 H  0   0.0  testvm

1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended

# condor_q -bet
-- Submitter: HOSTNAME : <192.168.199.1:56377> : HOSTNAME
---
001.000:  Request is held.

Hold reason: Error from slot1@HOSTNAME: STARTER at IP failed to send file(s) to <IP:41383>: error reading from /var/lib/condor/execute/dir_16408/xen.mem.ckpt: (errno 13) Permission denied; SHADOW failed to receive file(s) from <IP:35593>
```

Actual results:
The VM job is not suspended; after a while it goes held.

Expected results:
The VM job is correctly suspended.

Additional info:
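The hold reason above ends in "(errno 13) Permission denied" on the checkpoint file the starter tried to transfer back. As a small illustration (not part of the report), the reason string can be parsed mechanically and the errno mapped to its symbolic name; the regex and variable names here are hypothetical, and the string is copied verbatim from the hold reason:

```python
import errno
import os
import re

# Hold reason as reported by condor_q (HOSTNAME/IP placeholders as shown
# in the bug report).
hold_reason = (
    "Error from slot1@HOSTNAME: STARTER at IP failed to send file(s) to "
    "<IP:41383>: error reading from "
    "/var/lib/condor/execute/dir_16408/xen.mem.ckpt: "
    "(errno 13) Permission denied; SHADOW failed to receive file(s) from "
    "<IP:35593>"
)

# Pull out the failing path, the errno value, and the message text.
m = re.search(r"error reading from (\S+): \(errno (\d+)\) ([^;]+)", hold_reason)
path, err, msg = m.group(1), int(m.group(2)), m.group(3)

print(path)                 # /var/lib/condor/execute/dir_16408/xen.mem.ckpt
print(err == errno.EACCES)  # True: errno 13 is EACCES on Linux
print(os.strerror(err))     # Permission denied
```

That is, the starter process lacked read permission on the `xen.mem.ckpt` checkpoint file in the job's execute directory, which is why the suspend ends with the job held rather than suspended.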