Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 790311

Summary:

Failed push leaves the temporary image (created while pushing) running in aws account

Product:

[Retired] CloudForms Cloud Engine

Reporter:

Shveta <ssachdev>

Component:

imagefactory

Assignee:

Ian McLeod <imcleod>

Status:

CLOSED ERRATA

QA Contact:

Martin Kočí <mkoci>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

1.0.0

CC:

akarol, brad, dajohnso, deltacloud-maint, dgao, hbrock, ssachdev, whayutin

Target Milestone:

beta6

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2012-05-15 20:21:53 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
running_img	none

Description Shveta 2012-02-14 08:08:55 UTC

Created attachment 561800 [details]
running_img

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Build for ec2
2. Pushed via UI 
3. as soon as i start pushing i can see an image coming up in my aws account.
Push failed for the reason below :
http://pastebin.test.redhat.com/76803

and the image was left running .

Tried push for the second time . This time it suceeded and VM was deprovisioned after push.
  
Actual results:


Expected results:


Additional info:
Push failure
================================================
2012-02-14 01:50:19,911 DEBUG imgfac.ReservationManager.ReservationManager thread(7a351362) Message: SUCCESS ENTERING queue: (ec2)
2012-02-14 01:50:19,911 DEBUG imgfac.BuildJob.BuildJob thread(7a351362) Message: Builder (7a351362-73a2-4061-8aa6-324ae21ce537) changed status from NEW to PUSHING
2012-02-14 01:50:19,911 DEBUG imgfac.BuildJob.BuildJob thread(7a351362) Message: Builder (7a351362-73a2-4061-8aa6-324ae21ce537) changed percent complete from 0 to 0
2012-02-14 01:50:20,872 DEBUG imgfac.builders.BaseBuilder.Fedora_ec2_Builder thread(7a351362) Message: Creating temporary security group (imagefactory-7a351362-73a2-4061-8aa6-324ae21ce537)
2012-02-14 01:50:24,756 DEBUG imgfac.builders.BaseBuilder.Fedora_ec2_Builder thread(7a351362) Message: Starting ami ami-c31cd8aa with instance_type m1.large
2012-02-14 01:50:35,465 DEBUG imgfac.builders.BaseBuilder.Fedora_ec2_Builder thread(7a351362) Message: Waiting for EC2 instance to start: 0/300
2012-02-14 01:50:35,609 ERROR boto thread(7a351362) Message: 400 Bad Request
2012-02-14 01:50:35,609 ERROR boto thread(7a351362) Message: <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-5bbafc3e' does not exist</Message></Error></Errors><RequestID>be9dfe37-f896-4512-ae0d-fdabaebc9af3</RequestID></Response>
2012-02-14 01:50:35,609 DEBUG imgfac.builders.BaseBuilder.Fedora_ec2_Builder thread(7a351362) Message: Exception caught in ImageFactory
2012-02-14 01:50:35,611 DEBUG imgfac.builders.BaseBuilder.Fedora_ec2_Builder thread(7a351362) Message: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_ec2_Builder.py", line 458, in push_image
    credentials)
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_ec2_Builder.py", line 616, in push_image_snapshot_ec2
    self.wait_for_ec2_instance_start(self.instance)
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_ec2_Builder.py", line 489, in wait_for_ec2_instance_start
    instance.update()
  File "/usr/lib/python2.6/site-packages/boto/ec2/instance.py", line 221, in update
    rs = self.connection.get_all_instances([self.id])
  File "/usr/lib/python2.6/site-packages/boto/ec2/connection.py", line 456, in get_all_instances
    [('item', Reservation)], verb='POST')
  File "/usr/lib/python2.6/site-packages/boto/connection.py", line 680, in get_list
    raise self.ResponseError(response.status, response.reason, body)
EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-5bbafc3e' does not exist</Message></Error></Errors><RequestID>be9dfe37-f896-4512-ae0d-fdabaebc9af3</RequestID></Response>

2012-02-14 01:50:35,611 DEBUG imgfac.BuildJob.BuildJob thread(7a351362) Message: Builder (7a351362-73a2-4061-8aa6-324ae21ce537) changed status from PUSHING to FAILED
=================================================
rpm -qa|grep aeolus
aeolus-conductor-doc-0.8.0-26.el6.noarch
rubygem-aeolus-cli-0.3.0-8.el6.noarch
rubygem-aeolus-image-0.3.0-7.el6.noarch
aeolus-configure-2.5.0-13.el6.noarch
aeolus-conductor-daemons-0.8.0-26.el6.noarch
aeolus-conductor-0.8.0-26.el6.noarch
aeolus-all-0.8.0-26.el6.noarch

Comment 1 Angus Thomas 2012-02-14 09:41:14 UTC

Given that Conductor is making the push call to imagefactory, and the creation of a running instance is an internal implementation step for imagefactory, I'm reclassifying this as an imagefactory bug.

Comment 2 wes hayutin 2012-02-22 23:46:07 UTC

moving version to 1.0.0 .  version = found in version

Comment 5 Ian McLeod 2012-02-23 15:15:05 UTC

Yikes.  This looks like some combined bad behaviour on the part of boto and EC2.  

The instance object comes to us as an array element in this call:

https://github.com/aeolusproject/imagefactory/blob/master/imgfac/builders/Fedora_ec2_Builder.py#L627

And we pass it to wait_for_ec2_instance_start here:

https://github.com/aeolusproject/imagefactory/blob/master/imgfac/builders/Fedora_ec2_Builder.py#L639

And yet when we ask EC2 to update our view of the status of the instance that it has given us, it claims it does not exist.

It turns out we have seen this before and added the hacky workaround a few lines earlier where we wait a full 10 seconds for EC2 to remember that it started the instance.

I suppose the more robust workaround is to simply ignore exceptions when calling update() against the instance object and assume that the represent startup transients.  In the worst case, where the exception represents a genuine failure, this will result in us waiting 5 minutes for an instance that is never going to start.

I'll create a patch and will also research this on the boto/EC2 side to see if anyone else has complained about it.

Comment 6 Ian McLeod 2012-02-23 15:31:15 UTC

Seen by others:

http://groups.google.com/group/boto-users/browse_thread/thread/441dfaf6f3acf1fa/54a317b5d1f3e6f7?lnk=gst&q=InvalidInstanceID#54a317b5d1f3e6f7

Comment 7 Ian McLeod 2012-02-24 19:23:44 UTC

I've implemented the approach outlined in Comment 5 here:

https://github.com/aeolusproject/imagefactory/commit/0366ae8c934e3b2c6e77d8953e0b666af8971baf

This is brewed as 1.0.0rc7

Comment 8 Ian McLeod 2012-02-27 15:09:30 UTC

Note, if you want to verify that this is working as expected just look for the exception message from the original report  in the imagefactory log:

EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The
instance ID 'i-5bbafc3e' does not
exist</Message></Error></Errors><RequestID>be9dfe37-f896-4512-ae0d-fdabaebc9af3</RequestID></Response>

The exception can still occur.  The revised code catches it and logs the fact that it is being ignored.  I've already seen this happen once in my own testing and I'd expect it to be visible with some frequency since I have removed the original 10 second delay before the first query.

Comment 9 Martin Kočí 2012-02-27 20:04:30 UTC

Thanks for info. Btw so far I'm not able to reproduce the issue....

Comment 10 wes hayutin 2012-02-27 21:17:42 UTC

martin can you please add an automated test to scan for that message

Comment 11 Ian McLeod 2012-02-27 21:52:44 UTC

At the risk of being super-redundant, I just want to make it clear that the message in question is now:

1) Expected, at least some of the time

and

2) Benign, because we ignore the exception under the assumption that it is a startup transient

I've already seen this in my own testing.  Here is what the log looks like:

2012-02-24 17:09:06,586 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Creating temporary security group (imagefactory-fac19108-5d30-4695-94d2-06a41e7dec38)
2012-02-24 17:09:09,920 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Starting ami ami-ddf0ad98 with instance_type m1.large
2012-02-24 17:09:10,888 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Waiting for EC2 instance to start: 0/300
2012-02-24 17:09:11,168 ERROR boto thread(fac19108) Message: 400 Bad Request
2012-02-24 17:09:11,168 ERROR boto thread(fac19108) Message: <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-9fe4dad8' does not exist</Message></Error></Errors><RequestID>3818bf09-96a0-4c2e-912a-ad61912bdd9b</RequestID></Re
sponse>
2012-02-24 17:09:11,169 WARNING imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: EC2ResponseError encountered when querying EC2 instance (i-9fe4dad8) - trying to continue
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_ec2_Builder.py", line 490, in wait_for_ec2_instance_start
    instance.update()
  File "/usr/lib/python2.6/site-packages/boto/ec2/instance.py", line 221, in update
    rs = self.connection.get_all_instances([self.id])
  File "/usr/lib/python2.6/site-packages/boto/ec2/connection.py", line 456, in get_all_instances
    [('item', Reservation)], verb='POST')
  File "/usr/lib/python2.6/site-packages/boto/connection.py", line 680, in get_list
    raise self.ResponseError(response.status, response.reason, body)
EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-9fe4dad8' does not exist</Message></Error></Errors><RequestID>3818bf09-96a0-4c2e-912a-ad61912bdd9b</RequestID></Response>
2012-02-24 17:09:24,079 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Waiting for EC2 instance to start: 10/300
2012-02-24 17:09:37,286 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Waiting for EC2 instance to start: 20/300


The instance in question did eventually start and the snapshot build completed successfully.

Again, apols if I am repeating myself.

Comment 12 Martin Kočí 2012-02-28 14:12:19 UTC

OK understand.. I would like to just check if I get this message then the instance will start and snapshot build successfully. But so far I'm not able to reproduce this message. 
However if I'm not able to reproduce this message until I have done the automation test (to scan the message) I will move bug to VERIFIED. I guess I will have the test done today.

Comment 13 Martin Kočí 2012-02-29 11:43:44 UTC

Either I have missed smt. or I'm not so lucky to get the message in question. Moving bug to VERIFIED though. Based on the commit from comment 7 (and based on comments 8 and 11) I believe that the exception is ignored and build will be completed. I will keep testing this.
(Automated test is quite difficult to implement as I need private credentials to access the EC2 which I don't want to spread among others).

Comment 14 errata-xmlrpc 2012-05-15 20:21:53 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2012-0588.html