Bug 723351

Summary: vsphere instances are put in a hold state by condor during deployment
Product: [Retired] CloudForms Cloud Engine
Reporter: wes hayutin <whayutin>
Component: aeolus-conductor
Assignee: Angus Thomas <athomas>
Status: CLOSED CURRENTRELEASE
QA Contact: wes hayutin <whayutin>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 0.3.1
CC: akarol, clalance, dajohnso, deltacloud-maint, dgao, morazi, rwsu, ssachdev
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 723894 (view as bug list)
Environment:

Description wes hayutin 2011-07-19 21:08:03 UTC
Description of problem:

It seems that when you bring up a vmware-based deployment with more than two instances, it takes quite some time and condor places the jobs in "hold".

Going into condor and releasing the jobs resolves the issue; however, I'm wondering if there is some default timeout in condor that can be adjusted for vmware.
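
For reference, the workaround amounts to listing the held jobs and releasing them by id. A quick sketch using the same condor commands that appear in the transcript below (the job ids are just the ones from that run):

# list held jobs along with the (truncated) hold reason
condor_q -hold

# release specific held jobs by cluster.proc id
condor_release 22.0 23.0 24.0 25.0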


Recreate:
1. set up conductor for vmware
2. create a deployment w/ four or more instances
3. start the deployable
4. vmware will take 10-15 minutes to start



[root@hp-dl180g6-01 ~]# condor_q


-- Submitter: hp-dl180g6-01.rhts.eng.bos.redhat.com : <10.16.65.63:41877> : hp-dl180g6-01.rhts.eng.bos.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  17.0   aeolus          7/19 16:05   0+00:41:39 R  0   0.0  job_1_frontend_21 
  18.0   aeolus          7/19 16:05   0+00:41:39 R  0   0.0  job_1_backend_22  
  19.0   aeolus          7/19 16:05   0+00:42:10 R  0   0.0  job_1_middle01_23 
  20.0   aeolus          7/19 16:05   0+00:42:40 R  0   0.0  job_1_middle02_24 
  21.0   aeolus          7/19 16:11   0+00:36:06 R  0   0.0  job_2_frontend_25 
  22.0   aeolus          7/19 16:13   0+00:00:00 H  0   0.0  job_vmware1_fronte
  23.0   aeolus          7/19 16:13   0+00:00:00 H  0   0.0  job_vmware1_backen
  24.0   aeolus          7/19 16:13   0+00:00:00 H  0   0.0  job_vmware1_middle
  25.0   aeolus          7/19 16:13   0+00:00:00 H  0   0.0  job_vmware1_middle
  26.0   aeolus          7/19 16:17   0+00:30:13 R  0   0.0  job_userquota01_fr
  27.0   aeolus          7/19 16:17   0+00:30:13 R  0   0.0  job_userquota01_ba
  28.0   aeolus          7/19 16:17   0+00:30:13 R  0   0.0  job_userquota01_mi
  29.0   aeolus          7/19 16:17   0+00:29:43 R  0   0.0  job_userquota01_mi
  30.0   aeolus          7/19 16:21   0+00:26:40 R  0   0.0  job_userquota02_fr
  31.0   aeolus          7/19 16:21   0+00:26:39 R  0   0.0  job_userquota02_ba
  32.0   aeolus          7/19 16:34   0+00:13:03 R  0   0.0  job_userquota03_fr
  33.0   aeolus          7/19 16:34   0+00:13:18 R  0   0.0  job_userquota03_ba
  34.0   aeolus          7/19 16:34   0+00:13:00 R  0   0.0  job_userquota03_mi
  35.0   aeolus          7/19 16:34   0+00:13:03 R  0   0.0  job_userquota03_mi
  36.0   aeolus          7/19 16:36   0+00:12:18 R  0   0.0  job_userquota04_fr
  37.0   aeolus          7/19 16:36   0+00:12:02 R  0   0.0  job_userquota04_ba
  38.0   aeolus          7/19 16:36   0+00:11:48 R  0   0.0  job_userquota04_mi
  39.0   aeolus          7/19 16:36   0+00:11:48 R  0   0.0  job_userquota04_mi

23 jobs; 0 idle, 19 running, 4 held
[root@hp-dl180g6-01 ~]# condor_release 22.0 23.0 24.0 25.0
Job 22.0 released
Job 23.0 released
Job 24.0 released
Job 25.0 released
[root@hp-dl180g6-01 ~]# condor_q


-- Submitter: hp-dl180g6-01.rhts.eng.bos.redhat.com : <10.16.65.63:41877> : hp-dl180g6-01.rhts.eng.bos.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  17.0   aeolus          7/19 16:05   0+00:42:05 R  0   0.0  job_1_frontend_21 
  18.0   aeolus          7/19 16:05   0+00:42:05 R  0   0.0  job_1_backend_22  
  19.0   aeolus          7/19 16:05   0+00:42:36 R  0   0.0  job_1_middle01_23 
  20.0   aeolus          7/19 16:05   0+00:43:06 R  0   0.0  job_1_middle02_24 
  21.0   aeolus          7/19 16:11   0+00:36:32 R  0   0.0  job_2_frontend_25 
  22.0   aeolus          7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_fronte
  23.0   aeolus          7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_backen
  24.0   aeolus          7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_middle
  25.0   aeolus          7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_middle
  26.0   aeolus          7/19 16:17   0+00:30:39 R  0   0.0  job_userquota01_fr
  27.0   aeolus          7/19 16:17   0+00:30:39 R  0   0.0  job_userquota01_ba
  28.0   aeolus          7/19 16:17   0+00:30:39 R  0   0.0  job_userquota01_mi
  29.0   aeolus          7/19 16:17   0+00:30:09 R  0   0.0  job_userquota01_mi
  30.0   aeolus          7/19 16:21   0+00:27:06 R  0   0.0  job_userquota02_fr
  31.0   aeolus          7/19 16:21   0+00:27:05 R  0   0.0  job_userquota02_ba
  32.0   aeolus          7/19 16:34   0+00:13:29 R  0   0.0  job_userquota03_fr
  33.0   aeolus          7/19 16:34   0+00:13:44 R  0   0.0  job_userquota03_ba
  34.0   aeolus          7/19 16:34   0+00:13:26 R  0   0.0  job_userquota03_mi
  35.0   aeolus          7/19 16:34   0+00:13:29 R  0   0.0  job_userquota03_mi
  36.0   aeolus          7/19 16:36   0+00:12:44 R  0   0.0  job_userquota04_fr
  37.0   aeolus          7/19 16:36   0+00:12:28 R  0   0.0  job_userquota04_ba
  38.0   aeolus          7/19 16:36   0+00:12:14 R  0   0.0  job_userquota04_mi
  39.0   aeolus          7/19 16:36   0+00:12:14 R  0   0.0  job_userquota04_mi

23 jobs; 4 idle, 19 running, 0 held
[root@hp-dl180g6-01 ~]# condor_q


-- Submitter: hp-dl180g6-01.rhts.eng.bos.redhat.com : <10.16.65.63:41877> : hp-dl180g6-01.rhts.eng.bos.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  17.0   aeolus          7/19 16:05   0+00:42:21 R  0   0.0  job_1_frontend_21 
  18.0   aeolus          7/19 16:05   0+00:42:21 R  0   0.0  job_1_backend_22  
  19.0   aeolus          7/19 16:05   0+00:42:52 R  0   0.0  job_1_middle01_23 
  20.0   aeolus          7/19 16:05   0+00:43:22 R  0   0.0  job_1_middle02_24 
  21.0   aeolus          7/19 16:11   0+00:36:48 R  0   0.0  job_2_frontend_25 
  22.0   aeolus          7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_fronte
  23.0   aeolus          7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_backen
  24.0   aeolus          7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_middle
  25.0   aeolus          7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_middle
  26.0   aeolus          7/19 16:17   0+00:30:55 R  0   0.0  job_userquota01_fr
  27.0   aeolus          7/19 16:17   0+00:30:55 R  0   0.0  job_userquota01_ba
  28.0   aeolus          7/19 16:17   0+00:30:55 R  0   0.0  job_userquota01_mi
  29.0   aeolus          7/19 16:17   0+00:30:25 R  0   0.0  job_userquota01_mi
  30.0   aeolus          7/19 16:21   0+00:27:22 R  0   0.0  job_userquota02_fr
  31.0   aeolus          7/19 16:21   0+00:27:21 R  0   0.0  job_userquota02_ba
  32.0   aeolus          7/19 16:34   0+00:13:45 R  0   0.0  job_userquota03_fr
  33.0   aeolus          7/19 16:34   0+00:14:00 R  0   0.0  job_userquota03_ba
  34.0   aeolus          7/19 16:34   0+00:13:42 R  0   0.0  job_userquota03_mi
  35.0   aeolus          7/19 16:34   0+00:13:45 R  0   0.0  job_userquota03_mi
  36.0   aeolus          7/19 16:36   0+00:13:00 R  0   0.0  job_userquota04_fr
  37.0   aeolus          7/19 16:36   0+00:12:44 R  0   0.0  job_userquota04_ba
  38.0   aeolus          7/19 16:36   0+00:12:30 R  0   0.0  job_userquota04_mi
  39.0   aeolus          7/19 16:36   0+00:12:30 R  0   0.0  job_userquota04_mi

23 jobs; 4 idle, 19 running, 0 held
[root@hp-dl180g6-01 ~]# condor_q


-- Submitter: hp-dl180g6-01.rhts.eng.bos.redhat.com : <10.16.65.63:41877> : hp-dl180g6-01.rhts.eng.bos.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  17.0   aeolus          7/19 16:05   0+00:42:36 R  0   0.0  job_1_frontend_21 
  18.0   aeolus          7/19 16:05   0+00:42:36 R  0   0.0  job_1_backend_22  
  19.0   aeolus          7/19 16:05   0+00:43:07 R  0   0.0  job_1_middle01_23 
  20.0   aeolus          7/19 16:05   0+00:43:37 R  0   0.0  job_1_middle02_24 
  21.0   aeolus          7/19 16:11   0+00:37:03 R  0   0.0  job_2_frontend_25 
  22.0   aeolus          7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_fronte
  23.0   aeolus          7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_backen
  24.0   aeolus          7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_middle
  25.0   aeolus          7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_middle
  26.0   aeolus          7/19 16:17   0+00:31:10 R  0   0.0  job_userquota01_fr
  27.0   aeolus          7/19 16:17   0+00:31:10 R  0   0.0  job_userquota01_ba
  28.0   aeolus          7/19 16:17   0+00:31:10 R  0   0.0  job_userquota01_mi
  29.0   aeolus          7/19 16:17   0+00:30:40 R  0   0.0  job_userquota01_mi
  30.0   aeolus          7/19 16:21   0+00:27:37 R  0   0.0  job_userquota02_fr
  31.0   aeolus          7/19 16:21   0+00:27:36 R  0   0.0  job_userquota02_ba
  32.0   aeolus          7/19 16:34   0+00:14:00 R  0   0.0  job_userquota03_fr
  33.0   aeolus          7/19 16:34   0+00:14:15 R  0   0.0  job_userquota03_ba
  34.0   aeolus          7/19 16:34   0+00:13:57 R  0   0.0  job_userquota03_mi
  35.0   aeolus          7/19 16:34   0+00:14:00 R  0   0.0  job_userquota03_mi
  36.0   aeolus          7/19 16:36   0+00:13:15 R  0   0.0  job_userquota04_fr
  37.0   aeolus          7/19 16:36   0+00:12:59 R  0   0.0  job_userquota04_ba
  38.0   aeolus          7/19 16:36   0+00:12:45 R  0   0.0  job_userquota04_mi
  39.0   aeolus          7/19 16:36   0+00:12:45 R  0   0.0  job_userquota04_mi

23 jobs; 4 idle, 19 running, 0 held
[root@hp-dl180g6-01 ~]# condor_q


-- Submitter: hp-dl180g6-01.rhts.eng.bos.redhat.com : <10.16.65.63:41877> : hp-dl180g6-01.rhts.eng.bos.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  17.0   aeolus          7/19 16:05   0+00:42:42 R  0   0.0  job_1_frontend_21 
  18.0   aeolus          7/19 16:05   0+00:42:42 R  0   0.0  job_1_backend_22  
  19.0   aeolus          7/19 16:05   0+00:43:13 R  0   0.0  job_1_middle01_23 
  20.0   aeolus          7/19 16:05   0+00:43:43 R  0   0.0  job_1_middle02_24 
  21.0   aeolus          7/19 16:11   0+00:37:09 R  0   0.0  job_2_frontend_25 
  22.0   aeolus          7/19 16:13   0+00:00:05 R  0   0.0  job_vmware1_fronte
  23.0   aeolus          7/19 16:13   0+00:00:05 R  0   0.0  job_vmware1_backen
  24.0   aeolus          7/19 16:13   0+00:00:04 R  0   0.0  job_vmware1_middle
  25.0   aeolus          7/19 16:13   0+00:00:05 R  0   0.0  job_vmware1_middle
  26.0   aeolus          7/19 16:17   0+00:31:16 R  0   0.0  job_userquota01_fr
  27.0   aeolus          7/19 16:17   0+00:31:16 R  0   0.0  job_userquota01_ba
  28.0   aeolus          7/19 16:17   0+00:31:16 R  0   0.0  job_userquota01_mi
  29.0   aeolus          7/19 16:17   0+00:30:46 R  0   0.0  job_userquota01_mi
  30.0   aeolus          7/19 16:21   0+00:27:43 R  0   0.0  job_userquota02_fr
  31.0   aeolus          7/19 16:21   0+00:27:42 R  0   0.0  job_userquota02_ba
  32.0   aeolus          7/19 16:34   0+00:14:06 R  0   0.0  job_userquota03_fr
  33.0   aeolus          7/19 16:34   0+00:14:21 R  0   0.0  job_userquota03_ba
  34.0   aeolus          7/19 16:34   0+00:14:03 R  0   0.0  job_userquota03_mi
  35.0   aeolus          7/19 16:34   0+00:14:06 R  0   0.0  job_userquota03_mi
  36.0   aeolus          7/19 16:36   0+00:13:21 R  0   0.0  job_userquota04_fr
  37.0   aeolus          7/19 16:36   0+00:13:05 R  0   0.0  job_userquota04_ba
  38.0   aeolus          7/19 16:36   0+00:12:51 R  0   0.0  job_userquota04_mi
  39.0   aeolus          7/19 16:36   0+00:12:51 R  0   0.0  job_userquota04_mi

23 jobs; 0 idle, 23 running, 0 held

Comment 1 Matthew Farrellee 2011-07-20 19:50:31 UTC
FYI - condor_q 22.0 -l | grep LastHoldReason -> "Create_Instance_Failure: Failed to perform transfer: Server returned nothing (no headers, no data)"

The jobs were running.

Should investigate why the transfer failed. Possibly a timing issue?
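
To pull the full hold reason for every held job rather than one id at a time, something like the following works against the condor_q output above (a rough sketch; the awk pattern just picks the cluster.proc ids out of the condor_q -hold listing):

# dump LastHoldReason for each currently held job
for id in $(condor_q -hold | awk '/^ *[0-9]+\./ {print $1}'); do
    echo "== job $id =="
    condor_q "$id" -l | grep LastHoldReason
done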

Comment 2 Chris Lalancette 2011-07-20 20:11:38 UTC
Yeah, those sorts of errors are usually some sort of timeout, or a bug in deltacloud itself.  At the very least, deltacloudd should always return an error code (rather than no headers and no data).
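
A quick way to check whether deltacloudd is returning any HTTP response at all (the "no headers, no data" error suggests the connection closed with nothing on it) is to hit the API entry point directly; the host, port and path below are placeholders for whatever provider URL conductor is configured with:

# -i prints the response status line and headers; an empty reply here matches the error above
curl -s -i http://<deltacloud-host>:<port>/api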

Comment 3 wes hayutin 2011-07-22 14:28:14 UTC
OK, seeing this now with just one instance.



-- Submitter: hp-sl2x170zg6-01.rhts.eng.bos.redhat.com : <10.16.66.29:48530> : hp-sl2x170zg6-01.rhts.eng.bos.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   aeolus          7/22 10:24   0+00:00:00 I  0   0.0  job_RHEL02_fronten

1 jobs; 1 idle, 0 running, 0 held
[root@hp-sl2x170zg6-01 ~]# condor_q


-- Submitter: hp-sl2x170zg6-01.rhts.eng.bos.redhat.com : <10.16.66.29:48530> : hp-sl2x170zg6-01.rhts.eng.bos.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   aeolus          7/22 10:24   0+00:00:00 H  0   0.0  job_RHEL02_fronten

1 jobs; 0 idle, 0 running, 1 held
[root@hp-sl2x170zg6-01 ~]# condor_q


-- Submitter: hp-sl2x170zg6-01.rhts.eng.bos.redhat.com : <10.16.66.29:48530> : hp-sl2x170zg6-01.rhts.eng.bos.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   aeolus          7/22 10:24   0+00:00:00 H  0   0.0  job_RHEL02_fronten

1 jobs; 0 idle, 0 running, 1 held
[root@hp-sl2x170zg6-01 ~]# condor_q -hold


-- Submitter: hp-sl2x170zg6-01.rhts.eng.bos.redhat.com : <10.16.66.29:48530> : hp-sl2x170zg6-01.rhts.eng.bos.redhat.com
 ID      OWNER           HELD_SINCE HOLD_REASON                   
   1.0   aeolus          7/22 10:25 Create_Instance_Failure: Failed to perform

Comment 4 wes hayutin 2011-07-22 14:39:36 UTC
This may just be a performance issue, but dev needs to re-review this.

Comment 5 wes hayutin 2011-07-22 15:39:53 UTC
It sounds like there isn't anything we can do about this. Let's just be sure to check whether this indeed happens for everyone. It wouldn't hurt to check whether there is a setting we can tweak, too.

Thank you!

Comment 6 Ian Main 2011-07-25 23:07:16 UTC
We can make the timeout in condor longer.  There's a classad variable to do this for each job.  The default timeout is 90 seconds.  You could try this patch to double the timeout and see if that works.


diff --git a/src/app/util/condormatic.rb b/src/app/util/condormatic.rb
index b52ff46..3ecf9be 100644
--- a/src/app/util/condormatic.rb
+++ b/src/app/util/condormatic.rb
@@ -90,6 +90,7 @@ def condormatic_instance_create(task)
 
     pipe_and_log(pipe,
                  "grid_resource = deltacloud #{found.account.provider.url}\n")
+    pipe_and_log(pipe, "DeltacloudRetryTimeout = 180\n")
     pipe_and_log(pipe, "DeltacloudUsername = #{found.account.credentials_hash['username']}\n")
     pipe_and_log(pipe, "DeltacloudPasswordFile = #{pwfilename}")
     pipe_and_log(pipe, "DeltacloudImageId = #{found.provider_image.target_identifier}\n")

Comment 7 Richard Su 2011-07-25 23:14:08 UTC
I have seen the "Create Instance Failure" whenever vsphere runs out of disk space.

I wasn't able to reproduce the condition with a single instance. I don't have an environment to bring up multiple instances at the moment, because the vsphere in Westford is maxed out.

Comment 8 Shveta 2011-07-27 12:26:57 UTC
condor_q, just after launching the instance
===========================================================

 105.0   aeolus          7/27 17:44   0+00:00:00 I  0   0.0  job_wednesday_vm_f

67 jobs; 1 idle, 35 running, 31 held
[root@snowstorm ~]# date
Wed Jul 27 17:45:04 IST 2011


condor_q 
===============================================

105.0   aeolus          7/27 17:44   0+00:00:07 R  0   0.0  job_wednesday_vm_f

67 jobs; 0 idle, 36 running, 31 held
[root@snowstorm ~]# date
Wed Jul 27 17:52:28 IST 2011

==========================================================
I didn't notice the job going into the held state; it remained in the idle state for around 7-8 minutes and then went to running. Tried twice.

Comment 9 wes hayutin 2011-07-27 13:33:01 UTC
OK, dev and Shveta are not hitting this issue. Let's *not* include it in the beta.

Comment 10 wes hayutin 2011-08-01 18:52:22 UTC
NFS datastores in vsphere were found to be the root cause of this issue.

Comment 11 wes hayutin 2011-08-01 19:43:21 UTC
BZ 723894 - VMware deployments to low spec NFS datastores error out
Low-spec NFS datastores are not recommended due to poor performance.

Comment 12 wes hayutin 2011-08-01 19:48:52 UTC
removing from tracker

Comment 13 wes hayutin 2011-08-01 19:56:09 UTC
release pending...

Comment 14 wes hayutin 2011-08-01 19:57:47 UTC
release pending...

Comment 16 wes hayutin 2011-12-08 13:54:32 UTC
closing out old bugs

Comment 17 wes hayutin 2011-12-08 14:07:47 UTC
perm close