| Summary: | deploying on vsphere, instances are put in a hold state by condor |
|---|---|
| Product: | [Retired] CloudForms Cloud Engine |
| Component: | aeolus-conductor |
| Version: | 0.3.1 |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | unspecified |
| Priority: | unspecified |
| Reporter: | wes hayutin <whayutin> |
| Assignee: | Angus Thomas <athomas> |
| QA Contact: | wes hayutin <whayutin> |
| Docs Contact: | |
| CC: | akarol, clalance, dajohnso, deltacloud-maint, dgao, morazi, rwsu, ssachdev |
| Target Milestone: | rc |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| : | 723894 |
| Environment: | |
| Last Closed: | |
| Type: | --- |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
Description
wes hayutin
2011-07-19 21:08:03 UTC
FYI -

```
condor_q 22.0 -l | grep LastHoldReason
-> "Create_Instance_Failure: Failed to perform transfer: Server returned nothing (no headers, no data)"
```

The jobs were running. We should investigate why the transfer failed. Possibly a timing issue?

Yeah, those sorts of errors are usually some sort of timeout, or a bug in deltacloud itself. At the very least, deltacloudd should always be returning an error code (and not "no headers, no data").

OK, seeing this now with just one instance:

```
-- Submitter: hp-sl2x170zg6-01.rhts.eng.bos.redhat.com : <10.16.66.29:48530> : hp-sl2x170zg6-01.rhts.eng.bos.redhat.com
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     aeolus   7/22 10:24   0+00:00:00 I  0   0.0  job_RHEL02_fronten
1 jobs; 1 idle, 0 running, 0 held

[root@hp-sl2x170zg6-01 ~]# condor_q
-- Submitter: hp-sl2x170zg6-01.rhts.eng.bos.redhat.com : <10.16.66.29:48530> : hp-sl2x170zg6-01.rhts.eng.bos.redhat.com
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     aeolus   7/22 10:24   0+00:00:00 H  0   0.0  job_RHEL02_fronten
1 jobs; 0 idle, 0 running, 1 held

[root@hp-sl2x170zg6-01 ~]# condor_q
-- Submitter: hp-sl2x170zg6-01.rhts.eng.bos.redhat.com : <10.16.66.29:48530> : hp-sl2x170zg6-01.rhts.eng.bos.redhat.com
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     aeolus   7/22 10:24   0+00:00:00 H  0   0.0  job_RHEL02_fronten
1 jobs; 0 idle, 0 running, 1 held

[root@hp-sl2x170zg6-01 ~]# condor_q -hold
-- Submitter: hp-sl2x170zg6-01.rhts.eng.bos.redhat.com : <10.16.66.29:48530> : hp-sl2x170zg6-01.rhts.eng.bos.redhat.com
 ID      OWNER    HELD_SINCE   HOLD_REASON
 1.0     aeolus   7/22 10:25   Create_Instance_Failure: Failed to perform
```

This may just be a performance issue, but dev needs to re-review this.

It sounds like there isn't anything we can do about this. Let's just be sure to check whether this indeed happens for everyone. It wouldn't hurt to check whether there is a setting we can tweak, too. Thank you!

We can make the timeout in condor longer. There's a ClassAd variable to do this for each job. The default timeout is 90 seconds.
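The hold reason quoted above can be pulled out of the job ClassAd with the same `grep` pattern. A minimal, self-contained sketch, using a sample ClassAd fragment in place of live output (in a real deployment the text would come from `condor_q <job-id> -l` instead of the `classad` variable):

```shell
# Sample ClassAd fragment, standing in for live `condor_q 1.0 -l` output.
classad='JobStatus = 5
LastHoldReason = "Create_Instance_Failure: Failed to perform transfer: Server returned nothing (no headers, no data)"'

# Same extraction as in the report: condor_q <job-id> -l | grep LastHoldReason
printf '%s\n' "$classad" | grep LastHoldReason
```

If the hold reason turns out to be transient, the job can afterwards be re-queued with `condor_release <job-id>`.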
You could try this patch to double the timeout and see if that works.
```diff
diff --git a/src/app/util/condormatic.rb b/src/app/util/condormatic.rb
index b52ff46..3ecf9be 100644
--- a/src/app/util/condormatic.rb
+++ b/src/app/util/condormatic.rb
@@ -90,6 +90,7 @@ def condormatic_instance_create(task)
     pipe_and_log(pipe,
                  "grid_resource = deltacloud #{found.account.provider.url}\n")
+    pipe_and_log(pipe, "DeltacloudRetryTimeout = 180\n")
     pipe_and_log(pipe, "DeltacloudUsername = #{found.account.credentials_hash['username']}\n")
     pipe_and_log(pipe, "DeltacloudPasswordFile = #{pwfilename}")
     pipe_and_log(pipe, "DeltacloudImageId = #{found.provider_image.target_identifier}\n")
```
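After the patch is applied and a job is resubmitted, the doubled timeout should show up in the job's ClassAd. A hedged sketch of that check, again with a sample ClassAd line standing in for live `condor_q -l` output (the attribute name and value are the ones the patch writes):

```shell
# In a live deployment this line would come from: condor_q <job-id> -l
classad='DeltacloudRetryTimeout = 180'

# Confirm the doubled (default 90 s) timeout made it into the job ad.
if printf '%s\n' "$classad" | grep -q 'DeltacloudRetryTimeout = 180'; then
  echo "timeout raised to 180s"
fi
```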
I have seen the "Create Instance Failure" whenever vsphere runs out of disk space.

I wasn't able to reproduce the condition with a single instance. I don't have an environment to bring up multiple instances at the moment, because the vsphere in Westford is maxed out.

condor_q, just after launching the instance:

```
 105.0   aeolus   7/27 17:44   0+00:00:00 I  0   0.0  job_wednesday_vm_f
67 jobs; 1 idle, 35 running, 31 held
[root@snowstorm ~]# date
Wed Jul 27 17:45:04 IST 2011
```

condor_q:

```
 105.0   aeolus   7/27 17:44   0+00:00:07 R  0   0.0  job_wednesday_vm_f
67 jobs; 0 idle, 36 running, 31 held
[root@snowstorm ~]# date
Wed Jul 27 17:52:28 IST 2011
```

I didn't notice the job going into the held state; it remained in the Idle state for around 7-8 minutes and then went to running. Tried twice.

OK, dev and Shveta are not hitting this issue. Let's *not* include it in the beta.

NFS datastores in vsphere were found to be the root cause of this issue: BZ 723894 - VMware deployments to low spec NFS datastores error out. Low spec NFS datastores are not recommended due to poor performance. Removing from tracker.

release pending...

release pending...

closing out old bugs

perm close