Description of problem:
When you bring up a VMware-based deployment with more than two instances, it takes quite some time and condor places the jobs on "hold". Going into condor and releasing the jobs resolves the issue, but is there a default timeout in condor that can be adjusted for VMware? (A minimal shell sketch of this manual workaround follows the condor_q output below.)

Recreate:
1. Set up conductor for VMware.
2. Create a deployment with four or more instances.
3. Start the deployable.
4. VMware will take 10-15 minutes to start.

[root@hp-dl180g6-01 ~]# condor_q
-- Submitter: hp-dl180g6-01.rhts.eng.bos.redhat.com : <10.16.65.63:41877> : hp-dl180g6-01.rhts.eng.bos.redhat.com
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 17.0   aeolus   7/19 16:05   0+00:41:39 R  0   0.0  job_1_frontend_21
 18.0   aeolus   7/19 16:05   0+00:41:39 R  0   0.0  job_1_backend_22
 19.0   aeolus   7/19 16:05   0+00:42:10 R  0   0.0  job_1_middle01_23
 20.0   aeolus   7/19 16:05   0+00:42:40 R  0   0.0  job_1_middle02_24
 21.0   aeolus   7/19 16:11   0+00:36:06 R  0   0.0  job_2_frontend_25
 22.0   aeolus   7/19 16:13   0+00:00:00 H  0   0.0  job_vmware1_fronte
 23.0   aeolus   7/19 16:13   0+00:00:00 H  0   0.0  job_vmware1_backen
 24.0   aeolus   7/19 16:13   0+00:00:00 H  0   0.0  job_vmware1_middle
 25.0   aeolus   7/19 16:13   0+00:00:00 H  0   0.0  job_vmware1_middle
 26.0   aeolus   7/19 16:17   0+00:30:13 R  0   0.0  job_userquota01_fr
 27.0   aeolus   7/19 16:17   0+00:30:13 R  0   0.0  job_userquota01_ba
 28.0   aeolus   7/19 16:17   0+00:30:13 R  0   0.0  job_userquota01_mi
 29.0   aeolus   7/19 16:17   0+00:29:43 R  0   0.0  job_userquota01_mi
 30.0   aeolus   7/19 16:21   0+00:26:40 R  0   0.0  job_userquota02_fr
 31.0   aeolus   7/19 16:21   0+00:26:39 R  0   0.0  job_userquota02_ba
 32.0   aeolus   7/19 16:34   0+00:13:03 R  0   0.0  job_userquota03_fr
 33.0   aeolus   7/19 16:34   0+00:13:18 R  0   0.0  job_userquota03_ba
 34.0   aeolus   7/19 16:34   0+00:13:00 R  0   0.0  job_userquota03_mi
 35.0   aeolus   7/19 16:34   0+00:13:03 R  0   0.0  job_userquota03_mi
 36.0   aeolus   7/19 16:36   0+00:12:18 R  0   0.0  job_userquota04_fr
 37.0   aeolus   7/19 16:36   0+00:12:02 R  0   0.0  job_userquota04_ba
 38.0   aeolus   7/19 16:36   0+00:11:48 R  0   0.0  job_userquota04_mi
 39.0   aeolus   7/19 16:36   0+00:11:48 R  0   0.0  job_userquota04_mi
23 jobs; 0 idle, 19 running, 4 held

[root@hp-dl180g6-01 ~]# condor_release 22.0 23.0 24.0 25.0
Job 22.0 released
Job 23.0 released
Job 24.0 released
Job 25.0 released

[root@hp-dl180g6-01 ~]# condor_q
-- Submitter: hp-dl180g6-01.rhts.eng.bos.redhat.com : <10.16.65.63:41877> : hp-dl180g6-01.rhts.eng.bos.redhat.com
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 17.0   aeolus   7/19 16:05   0+00:42:05 R  0   0.0  job_1_frontend_21
 18.0   aeolus   7/19 16:05   0+00:42:05 R  0   0.0  job_1_backend_22
 19.0   aeolus   7/19 16:05   0+00:42:36 R  0   0.0  job_1_middle01_23
 20.0   aeolus   7/19 16:05   0+00:43:06 R  0   0.0  job_1_middle02_24
 21.0   aeolus   7/19 16:11   0+00:36:32 R  0   0.0  job_2_frontend_25
 22.0   aeolus   7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_fronte
 23.0   aeolus   7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_backen
 24.0   aeolus   7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_middle
 25.0   aeolus   7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_middle
 26.0   aeolus   7/19 16:17   0+00:30:39 R  0   0.0  job_userquota01_fr
 27.0   aeolus   7/19 16:17   0+00:30:39 R  0   0.0  job_userquota01_ba
 28.0   aeolus   7/19 16:17   0+00:30:39 R  0   0.0  job_userquota01_mi
 29.0   aeolus   7/19 16:17   0+00:30:09 R  0   0.0  job_userquota01_mi
 30.0   aeolus   7/19 16:21   0+00:27:06 R  0   0.0  job_userquota02_fr
 31.0   aeolus   7/19 16:21   0+00:27:05 R  0   0.0  job_userquota02_ba
 32.0   aeolus   7/19 16:34   0+00:13:29 R  0   0.0  job_userquota03_fr
 33.0   aeolus   7/19 16:34   0+00:13:44 R  0   0.0  job_userquota03_ba
 34.0   aeolus   7/19 16:34   0+00:13:26 R  0   0.0  job_userquota03_mi
 35.0   aeolus   7/19 16:34   0+00:13:29 R  0   0.0  job_userquota03_mi
 36.0   aeolus   7/19 16:36   0+00:12:44 R  0   0.0  job_userquota04_fr
 37.0   aeolus   7/19 16:36   0+00:12:28 R  0   0.0  job_userquota04_ba
 38.0   aeolus   7/19 16:36   0+00:12:14 R  0   0.0  job_userquota04_mi
 39.0   aeolus   7/19 16:36   0+00:12:14 R  0   0.0  job_userquota04_mi
23 jobs; 4 idle, 19 running, 0 held

[root@hp-dl180g6-01 ~]# condor_q
-- Submitter: hp-dl180g6-01.rhts.eng.bos.redhat.com : <10.16.65.63:41877> : hp-dl180g6-01.rhts.eng.bos.redhat.com
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 17.0   aeolus   7/19 16:05   0+00:42:21 R  0   0.0  job_1_frontend_21
 18.0   aeolus   7/19 16:05   0+00:42:21 R  0   0.0  job_1_backend_22
 19.0   aeolus   7/19 16:05   0+00:42:52 R  0   0.0  job_1_middle01_23
 20.0   aeolus   7/19 16:05   0+00:43:22 R  0   0.0  job_1_middle02_24
 21.0   aeolus   7/19 16:11   0+00:36:48 R  0   0.0  job_2_frontend_25
 22.0   aeolus   7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_fronte
 23.0   aeolus   7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_backen
 24.0   aeolus   7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_middle
 25.0   aeolus   7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_middle
 26.0   aeolus   7/19 16:17   0+00:30:55 R  0   0.0  job_userquota01_fr
 27.0   aeolus   7/19 16:17   0+00:30:55 R  0   0.0  job_userquota01_ba
 28.0   aeolus   7/19 16:17   0+00:30:55 R  0   0.0  job_userquota01_mi
 29.0   aeolus   7/19 16:17   0+00:30:25 R  0   0.0  job_userquota01_mi
 30.0   aeolus   7/19 16:21   0+00:27:22 R  0   0.0  job_userquota02_fr
 31.0   aeolus   7/19 16:21   0+00:27:21 R  0   0.0  job_userquota02_ba
 32.0   aeolus   7/19 16:34   0+00:13:45 R  0   0.0  job_userquota03_fr
 33.0   aeolus   7/19 16:34   0+00:14:00 R  0   0.0  job_userquota03_ba
 34.0   aeolus   7/19 16:34   0+00:13:42 R  0   0.0  job_userquota03_mi
 35.0   aeolus   7/19 16:34   0+00:13:45 R  0   0.0  job_userquota03_mi
 36.0   aeolus   7/19 16:36   0+00:13:00 R  0   0.0  job_userquota04_fr
 37.0   aeolus   7/19 16:36   0+00:12:44 R  0   0.0  job_userquota04_ba
 38.0   aeolus   7/19 16:36   0+00:12:30 R  0   0.0  job_userquota04_mi
 39.0   aeolus   7/19 16:36   0+00:12:30 R  0   0.0  job_userquota04_mi
23 jobs; 4 idle, 19 running, 0 held

[root@hp-dl180g6-01 ~]# condor_q
-- Submitter: hp-dl180g6-01.rhts.eng.bos.redhat.com : <10.16.65.63:41877> : hp-dl180g6-01.rhts.eng.bos.redhat.com
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 17.0   aeolus   7/19 16:05   0+00:42:36 R  0   0.0  job_1_frontend_21
 18.0   aeolus   7/19 16:05   0+00:42:36 R  0   0.0  job_1_backend_22
 19.0   aeolus   7/19 16:05   0+00:43:07 R  0   0.0  job_1_middle01_23
 20.0   aeolus   7/19 16:05   0+00:43:37 R  0   0.0  job_1_middle02_24
 21.0   aeolus   7/19 16:11   0+00:37:03 R  0   0.0  job_2_frontend_25
 22.0   aeolus   7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_fronte
 23.0   aeolus   7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_backen
 24.0   aeolus   7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_middle
 25.0   aeolus   7/19 16:13   0+00:00:00 I  0   0.0  job_vmware1_middle
 26.0   aeolus   7/19 16:17   0+00:31:10 R  0   0.0  job_userquota01_fr
 27.0   aeolus   7/19 16:17   0+00:31:10 R  0   0.0  job_userquota01_ba
 28.0   aeolus   7/19 16:17   0+00:31:10 R  0   0.0  job_userquota01_mi
 29.0   aeolus   7/19 16:17   0+00:30:40 R  0   0.0  job_userquota01_mi
 30.0   aeolus   7/19 16:21   0+00:27:37 R  0   0.0  job_userquota02_fr
 31.0   aeolus   7/19 16:21   0+00:27:36 R  0   0.0  job_userquota02_ba
 32.0   aeolus   7/19 16:34   0+00:14:00 R  0   0.0  job_userquota03_fr
 33.0   aeolus   7/19 16:34   0+00:14:15 R  0   0.0  job_userquota03_ba
 34.0   aeolus   7/19 16:34   0+00:13:57 R  0   0.0  job_userquota03_mi
 35.0   aeolus   7/19 16:34   0+00:14:00 R  0   0.0  job_userquota03_mi
 36.0   aeolus   7/19 16:36   0+00:13:15 R  0   0.0  job_userquota04_fr
 37.0   aeolus   7/19 16:36   0+00:12:59 R  0   0.0  job_userquota04_ba
 38.0   aeolus   7/19 16:36   0+00:12:45 R  0   0.0  job_userquota04_mi
 39.0   aeolus   7/19 16:36   0+00:12:45 R  0   0.0  job_userquota04_mi
23 jobs; 4 idle, 19 running, 0 held

[root@hp-dl180g6-01 ~]# condor_q
-- Submitter: hp-dl180g6-01.rhts.eng.bos.redhat.com : <10.16.65.63:41877> : hp-dl180g6-01.rhts.eng.bos.redhat.com
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 17.0   aeolus   7/19 16:05   0+00:42:42 R  0   0.0  job_1_frontend_21
 18.0   aeolus   7/19 16:05   0+00:42:42 R  0   0.0  job_1_backend_22
 19.0   aeolus   7/19 16:05   0+00:43:13 R  0   0.0  job_1_middle01_23
 20.0   aeolus   7/19 16:05   0+00:43:43 R  0   0.0  job_1_middle02_24
 21.0   aeolus   7/19 16:11   0+00:37:09 R  0   0.0  job_2_frontend_25
 22.0   aeolus   7/19 16:13   0+00:00:05 R  0   0.0  job_vmware1_fronte
 23.0   aeolus   7/19 16:13   0+00:00:05 R  0   0.0  job_vmware1_backen
 24.0   aeolus   7/19 16:13   0+00:00:04 R  0   0.0  job_vmware1_middle
 25.0   aeolus   7/19 16:13   0+00:00:05 R  0   0.0  job_vmware1_middle
 26.0   aeolus   7/19 16:17   0+00:31:16 R  0   0.0  job_userquota01_fr
 27.0   aeolus   7/19 16:17   0+00:31:16 R  0   0.0  job_userquota01_ba
 28.0   aeolus   7/19 16:17   0+00:31:16 R  0   0.0  job_userquota01_mi
 29.0   aeolus   7/19 16:17   0+00:30:46 R  0   0.0  job_userquota01_mi
 30.0   aeolus   7/19 16:21   0+00:27:43 R  0   0.0  job_userquota02_fr
 31.0   aeolus   7/19 16:21   0+00:27:42 R  0   0.0  job_userquota02_ba
 32.0   aeolus   7/19 16:34   0+00:14:06 R  0   0.0  job_userquota03_fr
 33.0   aeolus   7/19 16:34   0+00:14:21 R  0   0.0  job_userquota03_ba
 34.0   aeolus   7/19 16:34   0+00:14:03 R  0   0.0  job_userquota03_mi
 35.0   aeolus   7/19 16:34   0+00:14:06 R  0   0.0  job_userquota03_mi
 36.0   aeolus   7/19 16:36   0+00:13:21 R  0   0.0  job_userquota04_fr
 37.0   aeolus   7/19 16:36   0+00:13:05 R  0   0.0  job_userquota04_ba
 38.0   aeolus   7/19 16:36   0+00:12:51 R  0   0.0  job_userquota04_mi
 39.0   aeolus   7/19 16:36   0+00:12:51 R  0   0.0  job_userquota04_mi
23 jobs; 0 idle, 23 running, 0 held
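For reference, here is a minimal shell sketch of the manual workaround described above: find the held jobs, inspect the hold reason, and release them. The -constraint expression (matching the aeolus owner and held status, JobStatus == 5) is an assumption added for convenience; the original report simply released the four job IDs by hand.

  # show which jobs are held and why (manual workaround, not a fix)
  condor_q -hold
  # dump the full hold reason for a single job, e.g. cluster 22.0
  condor_q 22.0 -l | grep LastHoldReason
  # release every held job owned by aeolus in one command
  # (assumed constraint; plain "condor_release 22.0 23.0 24.0 25.0" works too)
  condor_release -constraint 'Owner == "aeolus" && JobStatus == 5'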
FYI - condor_q 22.0 -l | grep LastHoldReason gives:
"Create_Instance_Failure: Failed to perform transfer: Server returned nothing (no headers, no data)"
The jobs were running. We should investigate why the transfer failed - possibly a timing issue?
Yeah, those sorts of errors are usually some kind of timeout, or a bug in deltacloud itself. At the very least, deltacloudd should always return an error code (not "no headers, no data").
OK, seeing this now with just one instance:

-- Submitter: hp-sl2x170zg6-01.rhts.eng.bos.redhat.com : <10.16.66.29:48530> : hp-sl2x170zg6-01.rhts.eng.bos.redhat.com
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
  1.0   aeolus   7/22 10:24   0+00:00:00 I  0   0.0  job_RHEL02_fronten
1 jobs; 1 idle, 0 running, 0 held

[root@hp-sl2x170zg6-01 ~]# condor_q
-- Submitter: hp-sl2x170zg6-01.rhts.eng.bos.redhat.com : <10.16.66.29:48530> : hp-sl2x170zg6-01.rhts.eng.bos.redhat.com
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
  1.0   aeolus   7/22 10:24   0+00:00:00 H  0   0.0  job_RHEL02_fronten
1 jobs; 0 idle, 0 running, 1 held

[root@hp-sl2x170zg6-01 ~]# condor_q
-- Submitter: hp-sl2x170zg6-01.rhts.eng.bos.redhat.com : <10.16.66.29:48530> : hp-sl2x170zg6-01.rhts.eng.bos.redhat.com
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
  1.0   aeolus   7/22 10:24   0+00:00:00 H  0   0.0  job_RHEL02_fronten
1 jobs; 0 idle, 0 running, 1 held

[root@hp-sl2x170zg6-01 ~]# condor_q -hold
-- Submitter: hp-sl2x170zg6-01.rhts.eng.bos.redhat.com : <10.16.66.29:48530> : hp-sl2x170zg6-01.rhts.eng.bos.redhat.com
 ID      OWNER          HELD_SINCE  HOLD_REASON
  1.0   aeolus          7/22 10:25  Create_Instance_Failure: Failed to perform
This may just be a performance issue, but dev needs to re-review it.
It sounds like there isn't anything we can do about this. Let's just be sure to check whether this happens for everyone. It wouldn't hurt to check if there is a setting we can tweak, too. Thank you!
We can make the timeout in condor longer. There's a classad variable to do this for each job. The default timeout is 90 seconds. You could try this patch to double the timeout and see if that works.

diff --git a/src/app/util/condormatic.rb b/src/app/util/condormatic.rb
index b52ff46..3ecf9be 100644
--- a/src/app/util/condormatic.rb
+++ b/src/app/util/condormatic.rb
@@ -90,6 +90,7 @@ def condormatic_instance_create(task)
   pipe_and_log(pipe, "grid_resource = deltacloud #{found.account.provider.url}\n")
+  pipe_and_log(pipe, "DeltacloudRetryTimeout = 180\n")
   pipe_and_log(pipe, "DeltacloudUsername = #{found.account.credentials_hash['username']}\n")
   pipe_and_log(pipe, "DeltacloudPasswordFile = #{pwfilename}")
   pipe_and_log(pipe, "DeltacloudImageId = #{found.provider_image.target_identifier}\n")
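If you try the patch, one way to sanity-check it (a sketch, assuming the attribute is copied into the job classad unchanged by condor_submit) is to look for the new value on a freshly submitted job:

  # hypothetical verification step - confirm the longer timeout made it into the job ad
  condor_q <cluster.id> -l | grep -i DeltacloudRetryTimeout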
I have seen the "Create Instance Failure" whenever vsphere runs out of disk space. I wasn't able to reproduce the condition with a single instance. I don't have an environment to bring up multiple instances atm, because the vsphere in westford is maxed out.
condor_q, just after launching the instance:
===========================================================
 105.0   aeolus   7/27 17:44   0+00:00:00 I  0   0.0  job_wednesday_vm_f
67 jobs; 1 idle, 35 running, 31 held

[root@snowstorm ~]# date
Wed Jul 27 17:45:04 IST 2011

condor_q:
===============================================
 105.0   aeolus   7/27 17:44   0+00:00:07 R  0   0.0  job_wednesday_vm_f
67 jobs; 0 idle, 36 running, 31 held

[root@snowstorm ~]# date
Wed Jul 27 17:52:28 IST 2011
==========================================================

I didn't see the job go into the held state; it remained idle for around 7-8 minutes and then went to running. Tried twice.
OK - dev and Shveta are not hitting this issue. Let's *not* include it in the beta.
NFS datastores in vSphere were found to be the root cause of this issue.
BZ 723894 - VMware deployments to low-spec NFS datastores error out. Low-spec NFS datastores are not recommended due to poor performance.
removing from tracker
release pending...
closing out old bugs
perm close