| Summary: | Failed push leaves the temporary image (created while pushing) running in aws account | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [Retired] CloudForms Cloud Engine | Reporter: | Shveta <ssachdev> | ||||
| Component: | imagefactory | Assignee: | Ian McLeod <imcleod> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Martin Kočí <mkoci> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 1.0.0 | CC: | akarol, brad, dajohnso, deltacloud-maint, dgao, hbrock, ssachdev, whayutin | ||||
| Target Milestone: | beta6 | ||||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2012-05-15 20:21:53 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Attachments: |
|
||||||
|
Description
Shveta
2012-02-14 08:08:55 UTC
Given that Conductor is making the push call to imagefactory, and the creation of a running instance is an internal implementation step for imagefactory, I'm reclassifying this as an imagefactory bug. moving version to 1.0.0 . version = found in version Yikes. This looks like some combined bad behaviour on the part of boto and EC2. The instance object comes to us as an array element in this call: https://github.com/aeolusproject/imagefactory/blob/master/imgfac/builders/Fedora_ec2_Builder.py#L627 And we pass it to wait_for_ec2_instance_start here: https://github.com/aeolusproject/imagefactory/blob/master/imgfac/builders/Fedora_ec2_Builder.py#L639 And yet when we ask EC2 to update our view of the status of the instance that it has given us, it claims it does not exist. It turns out we have seen this before and added the hacky workaround a few lines earlier where we wait a full 10 seconds for EC2 to remember that it started the instance. I suppose the more robust workaround is to simply ignore exceptions when calling update() against the instance object and assume that the represent startup transients. In the worst case, where the exception represents a genuine failure, this will result in us waiting 5 minutes for an instance that is never going to start. I'll create a patch and will also research this on the boto/EC2 side to see if anyone else has complained about it. Seen by others: http://groups.google.com/group/boto-users/browse_thread/thread/441dfaf6f3acf1fa/54a317b5d1f3e6f7?lnk=gst&q=InvalidInstanceID#54a317b5d1f3e6f7 I've implemented the approach outlined in Comment 5 here: https://github.com/aeolusproject/imagefactory/commit/0366ae8c934e3b2c6e77d8953e0b666af8971baf This is brewed as 1.0.0rc7 Note, if you want to verify that this is working as expected just look for the exception message from the original report in the imagefactory log: EC2ResponseError: EC2ResponseError: 400 Bad Request <?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-5bbafc3e' does not exist</Message></Error></Errors><RequestID>be9dfe37-f896-4512-ae0d-fdabaebc9af3</RequestID></Response> The exception can still occur. The revised code catches it and logs the fact that it is being ignored. I've already seen this happen once in my own testing and I'd expect it to be visible with some frequency since I have removed the original 10 second delay before the first query. Thanks for info. Btw so far I'm not able to reproduce the issue.... martin can you please add an automated test to scan for that message At the risk of being super-redundant, I just want to make it clear that the message in question is now:
1) Expected, at least some of the time
and
2) Benign, because we ignore the exception under the assumption that it is a startup transient
I've already seen this in my own testing. Here is what the log looks like:
2012-02-24 17:09:06,586 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Creating temporary security group (imagefactory-fac19108-5d30-4695-94d2-06a41e7dec38)
2012-02-24 17:09:09,920 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Starting ami ami-ddf0ad98 with instance_type m1.large
2012-02-24 17:09:10,888 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Waiting for EC2 instance to start: 0/300
2012-02-24 17:09:11,168 ERROR boto thread(fac19108) Message: 400 Bad Request
2012-02-24 17:09:11,168 ERROR boto thread(fac19108) Message: <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-9fe4dad8' does not exist</Message></Error></Errors><RequestID>3818bf09-96a0-4c2e-912a-ad61912bdd9b</RequestID></Re
sponse>
2012-02-24 17:09:11,169 WARNING imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: EC2ResponseError encountered when querying EC2 instance (i-9fe4dad8) - trying to continue
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_ec2_Builder.py", line 490, in wait_for_ec2_instance_start
instance.update()
File "/usr/lib/python2.6/site-packages/boto/ec2/instance.py", line 221, in update
rs = self.connection.get_all_instances([self.id])
File "/usr/lib/python2.6/site-packages/boto/ec2/connection.py", line 456, in get_all_instances
[('item', Reservation)], verb='POST')
File "/usr/lib/python2.6/site-packages/boto/connection.py", line 680, in get_list
raise self.ResponseError(response.status, response.reason, body)
EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-9fe4dad8' does not exist</Message></Error></Errors><RequestID>3818bf09-96a0-4c2e-912a-ad61912bdd9b</RequestID></Response>
2012-02-24 17:09:24,079 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Waiting for EC2 instance to start: 10/300
2012-02-24 17:09:37,286 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Waiting for EC2 instance to start: 20/300
The instance in question did eventually start and the snapshot build completed successfully.
Again, apols if I am repeating myself.
OK understand.. I would like to just check if I get this message then the instance will start and snapshot build successfully. But so far I'm not able to reproduce this message. However if I'm not able to reproduce this message until I have done the automation test (to scan the message) I will move bug to VERIFIED. I guess I will have the test done today. Either I have missed smt. or I'm not so lucky to get the message in question. Moving bug to VERIFIED though. Based on the commit from comment 7 (and based on comments 8 and 11) I believe that the exception is ignored and build will be completed. I will keep testing this. (Automated test is quite difficult to implement as I need private credentials to access the EC2 which I don't want to spread among others). Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2012-0588.html |