Hide Forgot
Created attachment 561800 [details] running_img Description of problem: Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. Build for ec2 2. Pushed via UI 3. as soon as i start pushing i can see an image coming up in my aws account. Push failed for the reason below : http://pastebin.test.redhat.com/76803 and the image was left running . Tried push for the second time . This time it suceeded and VM was deprovisioned after push. Actual results: Expected results: Additional info: Push failure ================================================ 2012-02-14 01:50:19,911 DEBUG imgfac.ReservationManager.ReservationManager thread(7a351362) Message: SUCCESS ENTERING queue: (ec2) 2012-02-14 01:50:19,911 DEBUG imgfac.BuildJob.BuildJob thread(7a351362) Message: Builder (7a351362-73a2-4061-8aa6-324ae21ce537) changed status from NEW to PUSHING 2012-02-14 01:50:19,911 DEBUG imgfac.BuildJob.BuildJob thread(7a351362) Message: Builder (7a351362-73a2-4061-8aa6-324ae21ce537) changed percent complete from 0 to 0 2012-02-14 01:50:20,872 DEBUG imgfac.builders.BaseBuilder.Fedora_ec2_Builder thread(7a351362) Message: Creating temporary security group (imagefactory-7a351362-73a2-4061-8aa6-324ae21ce537) 2012-02-14 01:50:24,756 DEBUG imgfac.builders.BaseBuilder.Fedora_ec2_Builder thread(7a351362) Message: Starting ami ami-c31cd8aa with instance_type m1.large 2012-02-14 01:50:35,465 DEBUG imgfac.builders.BaseBuilder.Fedora_ec2_Builder thread(7a351362) Message: Waiting for EC2 instance to start: 0/300 2012-02-14 01:50:35,609 ERROR boto thread(7a351362) Message: 400 Bad Request 2012-02-14 01:50:35,609 ERROR boto thread(7a351362) Message: <?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-5bbafc3e' does not exist</Message></Error></Errors><RequestID>be9dfe37-f896-4512-ae0d-fdabaebc9af3</RequestID></Response> 2012-02-14 01:50:35,609 DEBUG imgfac.builders.BaseBuilder.Fedora_ec2_Builder thread(7a351362) Message: Exception caught in ImageFactory 2012-02-14 01:50:35,611 DEBUG imgfac.builders.BaseBuilder.Fedora_ec2_Builder thread(7a351362) Message: Traceback (most recent call last): File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_ec2_Builder.py", line 458, in push_image credentials) File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_ec2_Builder.py", line 616, in push_image_snapshot_ec2 self.wait_for_ec2_instance_start(self.instance) File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_ec2_Builder.py", line 489, in wait_for_ec2_instance_start instance.update() File "/usr/lib/python2.6/site-packages/boto/ec2/instance.py", line 221, in update rs = self.connection.get_all_instances([self.id]) File "/usr/lib/python2.6/site-packages/boto/ec2/connection.py", line 456, in get_all_instances [('item', Reservation)], verb='POST') File "/usr/lib/python2.6/site-packages/boto/connection.py", line 680, in get_list raise self.ResponseError(response.status, response.reason, body) EC2ResponseError: EC2ResponseError: 400 Bad Request <?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-5bbafc3e' does not exist</Message></Error></Errors><RequestID>be9dfe37-f896-4512-ae0d-fdabaebc9af3</RequestID></Response> 2012-02-14 01:50:35,611 DEBUG imgfac.BuildJob.BuildJob thread(7a351362) Message: Builder (7a351362-73a2-4061-8aa6-324ae21ce537) changed status from PUSHING to FAILED ================================================= rpm -qa|grep aeolus aeolus-conductor-doc-0.8.0-26.el6.noarch rubygem-aeolus-cli-0.3.0-8.el6.noarch rubygem-aeolus-image-0.3.0-7.el6.noarch aeolus-configure-2.5.0-13.el6.noarch aeolus-conductor-daemons-0.8.0-26.el6.noarch aeolus-conductor-0.8.0-26.el6.noarch aeolus-all-0.8.0-26.el6.noarch
Given that Conductor is making the push call to imagefactory, and the creation of a running instance is an internal implementation step for imagefactory, I'm reclassifying this as an imagefactory bug.
moving version to 1.0.0 . version = found in version
Yikes. This looks like some combined bad behaviour on the part of boto and EC2. The instance object comes to us as an array element in this call: https://github.com/aeolusproject/imagefactory/blob/master/imgfac/builders/Fedora_ec2_Builder.py#L627 And we pass it to wait_for_ec2_instance_start here: https://github.com/aeolusproject/imagefactory/blob/master/imgfac/builders/Fedora_ec2_Builder.py#L639 And yet when we ask EC2 to update our view of the status of the instance that it has given us, it claims it does not exist. It turns out we have seen this before and added the hacky workaround a few lines earlier where we wait a full 10 seconds for EC2 to remember that it started the instance. I suppose the more robust workaround is to simply ignore exceptions when calling update() against the instance object and assume that the represent startup transients. In the worst case, where the exception represents a genuine failure, this will result in us waiting 5 minutes for an instance that is never going to start. I'll create a patch and will also research this on the boto/EC2 side to see if anyone else has complained about it.
Seen by others: http://groups.google.com/group/boto-users/browse_thread/thread/441dfaf6f3acf1fa/54a317b5d1f3e6f7?lnk=gst&q=InvalidInstanceID#54a317b5d1f3e6f7
I've implemented the approach outlined in Comment 5 here: https://github.com/aeolusproject/imagefactory/commit/0366ae8c934e3b2c6e77d8953e0b666af8971baf This is brewed as 1.0.0rc7
Note, if you want to verify that this is working as expected just look for the exception message from the original report in the imagefactory log: EC2ResponseError: EC2ResponseError: 400 Bad Request <?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-5bbafc3e' does not exist</Message></Error></Errors><RequestID>be9dfe37-f896-4512-ae0d-fdabaebc9af3</RequestID></Response> The exception can still occur. The revised code catches it and logs the fact that it is being ignored. I've already seen this happen once in my own testing and I'd expect it to be visible with some frequency since I have removed the original 10 second delay before the first query.
Thanks for info. Btw so far I'm not able to reproduce the issue....
martin can you please add an automated test to scan for that message
At the risk of being super-redundant, I just want to make it clear that the message in question is now: 1) Expected, at least some of the time and 2) Benign, because we ignore the exception under the assumption that it is a startup transient I've already seen this in my own testing. Here is what the log looks like: 2012-02-24 17:09:06,586 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Creating temporary security group (imagefactory-fac19108-5d30-4695-94d2-06a41e7dec38) 2012-02-24 17:09:09,920 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Starting ami ami-ddf0ad98 with instance_type m1.large 2012-02-24 17:09:10,888 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Waiting for EC2 instance to start: 0/300 2012-02-24 17:09:11,168 ERROR boto thread(fac19108) Message: 400 Bad Request 2012-02-24 17:09:11,168 ERROR boto thread(fac19108) Message: <?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-9fe4dad8' does not exist</Message></Error></Errors><RequestID>3818bf09-96a0-4c2e-912a-ad61912bdd9b</RequestID></Re sponse> 2012-02-24 17:09:11,169 WARNING imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: EC2ResponseError encountered when querying EC2 instance (i-9fe4dad8) - trying to continue Traceback (most recent call last): File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_ec2_Builder.py", line 490, in wait_for_ec2_instance_start instance.update() File "/usr/lib/python2.6/site-packages/boto/ec2/instance.py", line 221, in update rs = self.connection.get_all_instances([self.id]) File "/usr/lib/python2.6/site-packages/boto/ec2/connection.py", line 456, in get_all_instances [('item', Reservation)], verb='POST') File "/usr/lib/python2.6/site-packages/boto/connection.py", line 680, in get_list raise self.ResponseError(response.status, response.reason, body) EC2ResponseError: EC2ResponseError: 400 Bad Request <?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-9fe4dad8' does not exist</Message></Error></Errors><RequestID>3818bf09-96a0-4c2e-912a-ad61912bdd9b</RequestID></Response> 2012-02-24 17:09:24,079 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Waiting for EC2 instance to start: 10/300 2012-02-24 17:09:37,286 DEBUG imgfac.builders.BaseBuilder.RHEL6_ec2_Builder thread(fac19108) Message: Waiting for EC2 instance to start: 20/300 The instance in question did eventually start and the snapshot build completed successfully. Again, apols if I am repeating myself.
OK understand.. I would like to just check if I get this message then the instance will start and snapshot build successfully. But so far I'm not able to reproduce this message. However if I'm not able to reproduce this message until I have done the automation test (to scan the message) I will move bug to VERIFIED. I guess I will have the test done today.
Either I have missed smt. or I'm not so lucky to get the message in question. Moving bug to VERIFIED though. Based on the commit from comment 7 (and based on comments 8 and 11) I believe that the exception is ignored and build will be completed. I will keep testing this. (Automated test is quite difficult to implement as I need private credentials to access the EC2 which I don't want to spread among others).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2012-0588.html