Bug 790311 - Failed push leaves the temporary image (created while pushing) running in aws account
Summary: Failed push leaves the temporary image (created while pushing) running in aws...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: CloudForms Cloud Engine
Classification: Retired
Component: imagefactory
Version: 1.0.0
Hardware: Unspecified
OS: Unspecified
unspecified
medium
Target Milestone: beta6
Assignee: Ian McLeod
QA Contact: Martin Kočí
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-02-14 08:08 UTC by Shveta
Modified: 2012-05-15 20:21 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-05-15 20:21:53 UTC


Attachments (Terms of Use)
running_img (129.48 KB, image/png)
2012-02-14 08:08 UTC, Shveta
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2012:0588 0 normal SHIPPED_LIVE new packages: imagefactory 2012-05-15 22:31:27 UTC

Description Shveta 2012-02-14 08:08:55 UTC
Created attachment 561800 [details]
running_img

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Build for ec2
2. Pushed via UI 
3. as soon as i start pushing i can see an image coming up in my aws account.
Push failed for the reason below :
http://pastebin.test.redhat.com/76803

and the image was left running .

Tried push for the second time . This time it suceeded and VM was deprovisioned after push.
  
Actual results:


Expected results:


Additional info:
Push failure
================================================
2012-02-14 01:50:19,911 DEBUG imgfac.ReservationManager.ReservationManager thread(7a351362) Message: SUCCESS ENTERING queue: (ec2)
2012-02-14 01:50:19,911 DEBUG imgfac.BuildJob.BuildJob thread(7a351362) Message: Builder (7a351362-73a2-4061-8aa6-324ae21ce537) changed status from NEW to PUSHING
2012-02-14 01:50:19,911 DEBUG imgfac.BuildJob.BuildJob thread(7a351362) Message: Builder (7a351362-73a2-4061-8aa6-324ae21ce537) changed percent complete from 0 to 0
2012-02-14 01:50:20,872 DEBUG imgfac.builders.BaseBuilder.Fedora_ec2_Builder thread(7a351362) Message: Creating temporary security group (imagefactory-7a351362-73a2-4061-8aa6-324ae21ce537)
2012-02-14 01:50:24,756 DEBUG imgfac.builders.BaseBuilder.Fedora_ec2_Builder thread(7a351362) Message: Starting ami ami-c31cd8aa with instance_type m1.large
2012-02-14 01:50:35,465 DEBUG imgfac.builders.BaseBuilder.Fedora_ec2_Builder thread(7a351362) Message: Waiting for EC2 instance to start: 0/300
2012-02-14 01:50:35,609 ERROR boto thread(7a351362) Message: 400 Bad Request
2012-02-14 01:50:35,609 ERROR boto thread(7a351362) Message: <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-5bbafc3e' does not exist</Message></Error></Errors><RequestID>be9dfe37-f896-4512-ae0d-fdabaebc9af3</RequestID></Response>
2012-02-14 01:50:35,609 DEBUG imgfac.builders.BaseBuilder.Fedora_ec2_Builder thread(7a351362) Message: Exception caught in ImageFactory
2012-02-14 01:50:35,611 DEBUG imgfac.builders.BaseBuilder.Fedora_ec2_Builder thread(7a351362) Message: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_ec2_Builder.py", line 458, in push_image
    credentials)
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_ec2_Builder.py", line 616, in push_image_snapshot_ec2
    self.wait_for_ec2_instance_start(self.instance)
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_ec2_Builder.py", line 489, in wait_for_ec2_instance_start
    instance.update()
  File "/usr/lib/python2.6/site-packages/boto/ec2/instance.py", line 221, in update
    rs = self.connection.get_all_instances([self.id])
  File "/usr/lib/python2.6/site-packages/boto/ec2/connection.py", line 456, in get_all_instances
    [('item', Reservation)], verb='POST')
  File "/usr/lib/python2.6/site-packages/boto/connection.py", line 680, in get_list
    raise self.ResponseError(response.status, response.reason, body)
EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-5bbafc3e' does not exist</Message></Error></Errors><RequestID>be9dfe37-f896-4512-ae0d-fdabaebc9af3</RequestID></Response>

2012-02-14 01:50:35,611 DEBUG imgfac.BuildJob.BuildJob thread(7a351362) Message: Builder (7a351362-73a2-4061-8aa6-324ae21ce537) changed status from PUSHING to FAILED
=================================================
rpm -qa|grep aeolus
aeolus-conductor-doc-0.8.0-26.el6.noarch
rubygem-aeolus-cli-0.3.0-8.el6.noarch
rubygem-aeolus-image-0.3.0-7.el6.noarch
aeolus-configure-2.5.0-13.el6.noarch
aeolus-conductor-daemons-0.8.0-26.el6.noarch
aeolus-conductor-0.8.0-26.el6.noarch
aeolus-all-0.8.0-26.el6.noarch

Comment 1 Angus Thomas 2012-02-14 09:41:14 UTC
Given that Conductor is making the push call to imagefactory, and the creation of a running instance is an internal implementation step for imagefactory, I'm reclassifying this as an imagefactory bug.

Comment 2 wes hayutin 2012-02-22 23:46:07 UTC
moving version to 1.0.0 .  version = found in version

Comment 5 Ian McLeod 2012-02-23 15:15:05 UTC
Yikes.  This looks like some combined bad behaviour on the part of boto and EC2.  

The instance object comes to us as an array element in this call:

https://github.com/aeolusproject/imagefactory/blob/master/imgfac/builders/Fedora_ec2_Builder.py#L627

And we pass it to wait_for_ec2_instance_start here:

https://github.com/aeolusproject/imagefactory/blob/master/imgfac/builders/Fedora_ec2_Builder.py#L639

And yet when we ask EC2 to update our view of the status of the instance that it has given us, it claims it does not exist.

It turns out we have seen this before and added the hacky workaround a few lines earlier where we wait a full 10 seconds for EC2 to remember that it started the instance.

I suppose the more robust workaround is to simply ignore exceptions when calling update() against the instance object and assume that the represent startup transients.  In the worst case, where the exception represents a genuine failure, this will result in us waiting 5 minutes for an instance that is never going to start.

I'll create a patch and will also research this on the boto/EC2 side to see if anyone else has complained about it.

Comment 7 Ian McLeod 2012-02-24 19:23:44 UTC
I've implemented the approach outlined in Comment 5 here:

https://github.com/aeolusproject/imagefactory/commit/0366ae8c934e3b2c6e77d8953e0b666af8971baf

This is brewed as 1.0.0rc7

Comment 8 Ian McLeod 2012-02-27 15:09:30 UTC
Note, if you want to verify that this is working as expected just look for the exception message from the original report  in the imagefactory log:

EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The
instance ID 'i-5bbafc3e' does not
exist</Message></Error></Errors><RequestID>be9dfe37-f896-4512-ae0d-fdabaebc9af3</RequestID></Response>

The exception can still occur.  The revised code catches it and logs the fact that it is being ignored.  I've already seen this happen once in my own testing and I'd expect it to be visible with some frequency since I have removed the original 10 second delay before the first query.

Comment 9 Martin Kočí 2012-02-27 20:04:30 UTC
Thanks for info. Btw so far I'm not able to reproduce the issue....

Comment 10 wes hayutin 2012-02-27 21:17:42 UTC
martin can you please add an automated test to scan for that message

Comment 11 Ian McLeod 2012-02-27 21:52:44 UTC
At the risk of being super-redundant, I just want to make it clear that the message in question is now:

1) Expected, at least some of the time

and

2) Benign, because we ignore the exception under the assumption that it is a startup transient

I've already seen this in my own testing.  Here is what the log looks like:

2012-02-24 17:09:06,586 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Creating temporary security group (imagefactory-fac19108-5d30-4695-94d2-06a41e7dec38)
2012-02-24 17:09:09,920 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Starting ami ami-ddf0ad98 with instance_type m1.large
2012-02-24 17:09:10,888 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Waiting for EC2 instance to start: 0/300
2012-02-24 17:09:11,168 ERROR boto thread(fac19108) Message: 400 Bad Request
2012-02-24 17:09:11,168 ERROR boto thread(fac19108) Message: <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-9fe4dad8' does not exist</Message></Error></Errors><RequestID>3818bf09-96a0-4c2e-912a-ad61912bdd9b</RequestID></Re
sponse>
2012-02-24 17:09:11,169 WARNING imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: EC2ResponseError encountered when querying EC2 instance (i-9fe4dad8) - trying to continue
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_ec2_Builder.py", line 490, in wait_for_ec2_instance_start
    instance.update()
  File "/usr/lib/python2.6/site-packages/boto/ec2/instance.py", line 221, in update
    rs = self.connection.get_all_instances([self.id])
  File "/usr/lib/python2.6/site-packages/boto/ec2/connection.py", line 456, in get_all_instances
    [('item', Reservation)], verb='POST')
  File "/usr/lib/python2.6/site-packages/boto/connection.py", line 680, in get_list
    raise self.ResponseError(response.status, response.reason, body)
EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-9fe4dad8' does not exist</Message></Error></Errors><RequestID>3818bf09-96a0-4c2e-912a-ad61912bdd9b</RequestID></Response>
2012-02-24 17:09:24,079 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Waiting for EC2 instance to start: 10/300
2012-02-24 17:09:37,286 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Waiting for EC2 instance to start: 20/300


The instance in question did eventually start and the snapshot build completed successfully.

Again, apols if I am repeating myself.

Comment 12 Martin Kočí 2012-02-28 14:12:19 UTC
OK understand.. I would like to just check if I get this message then the instance will start and snapshot build successfully. But so far I'm not able to reproduce this message. 
However if I'm not able to reproduce this message until I have done the automation test (to scan the message) I will move bug to VERIFIED. I guess I will have the test done today.

Comment 13 Martin Kočí 2012-02-29 11:43:44 UTC
Either I have missed smt. or I'm not so lucky to get the message in question. Moving bug to VERIFIED though. Based on the commit from comment 7 (and based on comments 8 and 11) I believe that the exception is ignored and build will be completed. I will keep testing this.
(Automated test is quite difficult to implement as I need private credentials to access the EC2 which I don't want to spread among others).

Comment 14 errata-xmlrpc 2012-05-15 20:21:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2012-0588.html


Note You need to log in before you can comment on or make changes to this bug.