Bug 699883 - condor failed to find ami built by factory in us-west
Summary: condor failed to find ami built by factory in us-west
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: CloudForms Cloud Engine
Classification: Retired
Component: aeolus-conductor
Version: 1.0.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Jan Provaznik
QA Contact: wes hayutin
URL: https://hp-xw8600-01.rhts.eng.bos.red...
Whiteboard:
Depends On: 719382
Blocks: ce-beta ce-ami
 
Reported: 2011-04-26 20:02 UTC by wes hayutin
Modified: 2012-01-26 12:28 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-09-29 16:04:52 UTC
Embargoed:


Attachments
ami listed in ec2 us-west (113.34 KB, image/png)
2011-04-26 20:02 UTC, wes hayutin

Description wes hayutin 2011-04-26 20:02:38 UTC
Created attachment 495025 [details]
ami listed in ec2 us-west

Note: I'll be trying to recreate this, as I am not sure how repeatable it is.


-- Submitter: hp-xw8600-01.rhts.eng.bos.redhat.com : <10.16.65.43:47121> : hp-xw8600-01.rhts.eng.bos.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   3.0   aeolus          4/26 15:24   0+00:16:58 R  0   0.0  job_test2_2       
   4.0   aeolus          4/26 15:50   0+00:00:00 H  0   0.0  job_test03_3      

2 jobs; 0 idle, 1 running, 1 held
[root@hp-xw8600-01 ~]# condor_q -better


-- Submitter: hp-xw8600-01.rhts.eng.bos.redhat.com : <10.16.65.43:47121> : hp-xw8600-01.rhts.eng.bos.redhat.com
---
003.000:  Request is being serviced

---
004.000:  Request is held.

Hold reason: Create_Instance_Failure: InvalidAMIID.NotFound: The AMI ID 'ami-51693a14' does not exist

[root@hp-xw8600-01 ~]# 



[root@hp-xw8600-01 ~]# cat /var/log/imagefactory.log | grep ami-51693a14
2011-04-26 13:58:20,016 DEBUG imagefactory.builders.BaseBuilder.FedoraBuilder pid(21666) Message: Register output: IMAGE	ami-51693a14
2011-04-26 13:58:20,016 DEBUG imagefactory.builders.BaseBuilder.FedoraBuilder pid(21666) Message: Extracted AMI ID: ami-51693a14 
2011-04-26 13:58:20,020 DEBUG imagefactory.ImageWarehouse.ImageWarehouse pid(21666) Message: Setting metadata ({'provider': 'ec2-us-west-1', 'uuid': '259bb81e-5f8e-4b30-8f06-55a7339bbc15', 'icicle': 'none', 'target_identifier': 'ami-51693a14', 'object_type': 'provider_image', 'image': '1970c4ed-1fb5-48f0-8676-5343a82fbf21'}) for http://localhost:9090/provider_images/259bb81e-5f8e-4b30-8f06-55a7339bbc15
2011-04-26 13:58:21,022 DEBUG imagefactory.builders.BaseBuilder.FedoraBuilder pid(21666) Message: FedoraBuilder instance 41418960 pushed image with uuid 1970c4ed-1fb5-48f0-8676-5343a82fbf21 to provider_image UUID (259bb81e-5f8e-4b30-8f06-55a7339bbc15) and set metadata: {'target_identifier': 'ami-51693a14', 'icicle': 'none', 'image': '1970c4ed-1fb5-48f0-8676-5343a82fbf21', 'provider': 'ec2-us-west-1'}
[root@hp-xw8600-01 ~]# 
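
(Aside: the grep above shows the factory recorded provider 'ec2-us-west-1' and target_identifier 'ami-51693a14' in the warehouse. As a hedged sketch - assuming iwhd's usual /bucket/object/attribute URL scheme, which may differ on a given build - those attributes can be read back directly:)

curl http://localhost:9090/provider_images/259bb81e-5f8e-4b30-8f06-55a7339bbc15/provider
# expected: ec2-us-west-1
curl http://localhost:9090/provider_images/259bb81e-5f8e-4b30-8f06-55a7339bbc15/target_identifier
# expected: ami-51693a14

So the warehouse side looks consistent: the factory believes the AMI lives in us-west-1.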


Recreate:
1. create a provider account for us-west and us-east
2. create a template
3. build and push template
4. create a realm for us-east and us-west
5. start the instance in us-west realm

error: AMI not found (diagnostic sketch below)
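
(A hedged diagnostic sketch: confirm where the AMI actually lives by asking each region directly. The commands below use the modern AWS CLI purely for illustration - at the time ec2-api-tools served the same purpose - and the expected outcomes follow from the logs above:)

# The AMI was registered in us-west-1, so it resolves there:
aws ec2 describe-images --image-ids ami-51693a14 --region us-west-1
# Against the default region it does not exist, matching the hold reason:
aws ec2 describe-images --image-ids ami-51693a14 --region us-east-1
#   -> InvalidAMIID.NotFound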

Comment 1 wes hayutin 2011-04-26 20:21:25 UTC
---
004.000:  Request is held.

Hold reason: Create_Instance_Failure: InvalidAMIID.NotFound: The AMI ID 'ami-51693a14' does not exist

---
005.000:  Request is held.

Hold reason: Create_Instance_Failure: InvalidAMIID.NotFound: The AMI ID 'ami-2d693a68' does not exist


Hold reason: Create_Instance_Failure: InvalidAMIID.NotFound: The AMI ID 'ami-2d693a68' does not exist

[root@hp-xw8600-01 ~]# cat /var/log/imagefactory.log | grep ami-2d693a68
2011-04-26 16:14:17,473 DEBUG imagefactory.builders.BaseBuilder.FedoraBuilder pid(21666) Message: Register output: IMAGE	ami-2d693a68
2011-04-26 16:14:17,473 DEBUG imagefactory.builders.BaseBuilder.FedoraBuilder pid(21666) Message: Extracted AMI ID: ami-2d693a68 
2011-04-26 16:14:17,477 DEBUG imagefactory.ImageWarehouse.ImageWarehouse pid(21666) Message: Setting metadata ({'provider': 'ec2-us-west-1', 'uuid': '28f60499-0228-493b-b653-89cd4e04b338', 'icicle': 'none', 'target_identifier': 'ami-2d693a68', 'object_type': 'provider_image', 'image': 'b8041cd6-367a-4c96-bd83-bb30ca1de74c'}) for http://localhost:9090/provider_images/28f60499-0228-493b-b653-89cd4e04b338
2011-04-26 16:14:20,439 DEBUG imagefactory.builders.BaseBuilder.FedoraBuilder pid(21666) Message: FedoraBuilder instance 41419856 pushed image with uuid b8041cd6-367a-4c96-bd83-bb30ca1de74c to provider_image UUID (28f60499-0228-493b-b653-89cd4e04b338) and set metadata: {'target_identifier': 'ami-2d693a68', 'icicle': 'none', 'image': 'b8041cd6-367a-4c96-bd83-bb30ca1de74c', 'provider': 'ec2-us-west-1'}
[root@hp-xw8600-01 ~]#

Comment 2 Dave Johnson 2011-04-26 21:31:46 UTC
=======================================================
Wes asked me to recreate this, and it appears I did: I followed his reproduction steps from comment 0, at which point I got:
=======================================================

[root@hp-ml370g6-01 deltacloud-ec2-us-west-1]# condor_q


-- Submitter: hp-ml370g6-01.rhts.eng.bos.redhat.com : <10.16.66.124:50937> : hp-ml370g6-01.rhts.eng.bos.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   aeolus          4/26 17:09   0+00:00:00 I  0   0.0  job_dave_1        

1 jobs; 1 idle, 0 running, 0 held
[root@hp-ml370g6-01 deltacloud-ec2-us-west-1]# condor_q -better


-- Submitter: hp-ml370g6-01.rhts.eng.bos.redhat.com : <10.16.66.124:50937> : hp-ml370g6-01.rhts.eng.bos.redhat.com
error: bad form
error: problem with ExprToProfile
---
001.000:  Run analysis summary.  Of 1 machines,
      1 are rejected by your job's requirements 
      0 reject your job because of their own requirements 
      0 match but are serving users with a better priority in the pool 
      0 match but reject the job for unknown reasons 
      0 match but will not currently preempt their existing job 
      0 match but are currently offline 
      0 are available to run your job
	No successful match recorded.
	Last failed match: Tue Apr 26 17:10:47 2011
	Reason for last match failure: no match found

WARNING:  Be advised:
   No resources matched request's constraints

The Requirements expression for your job is:

( target.front_end_hardware_profile_id == "14" && target.image == "1" &&
target.realm == "2" && conductor_quota_check(1,other.provider_account_id) )

[root@hp-ml370g6-01 deltacloud-ec2-us-west-1]#

====================================================================
Wasn't sure about this (seems like a hwp matching issue); I spoke to Wes, who directed me to restart condor:
====================================================================

[root@hp-ml370g6-01 deltacloud-ec2-us-west-1]# /etc/init.d/condor restart
Stopping Condor daemons: [  OK  ]
Starting Condor daemons: [  OK  ]
[root@hp-ml370g6-01 deltacloud-ec2-us-west-1]# condor_q -better
Warning:  Found no submitters


-- Submitter: hp-ml370g6-01.rhts.eng.bos.redhat.com : <10.16.66.124:44533> : hp-ml370g6-01.rhts.eng.bos.redhat.com
---
001.000:  Run analysis summary.  Of 0 machines,
      0 are rejected by your job's requirements 
      0 reject your job because of their own requirements 
      0 match but are serving users with a better priority in the pool 
      0 match but reject the job for unknown reasons 
      0 match but will not currently preempt their existing job 
      0 match but are currently offline 
      0 are available to run your job
	No successful match recorded.
	Last failed match: Tue Apr 26 17:14:48 2011
	Reason for last match failure: no match found

WARNING:  Be advised:
   No resources matched request's constraints

WARNING:  Be advised:   Request 1.0 did not match any resource's constraints

=========================================================================
At this point I considered the instance orphaned by the condor restart, so I created a second instance:
=========================================================================

[root@hp-ml370g6-01 deltacloud-ec2-us-west-1]# condor_q


-- Submitter: hp-ml370g6-01.rhts.eng.bos.redhat.com : <10.16.66.124:44533> : hp-ml370g6-01.rhts.eng.bos.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   aeolus          4/26 17:09   0+00:00:00 H  0   0.0  job_dave_1        
   2.0   aeolus          4/26 17:18   0+00:00:00 I  0   0.0  job_dave2_2       

2 jobs; 1 idle, 0 running, 1 held
[root@hp-ml370g6-01 deltacloud-ec2-us-west-1]# condor_q -better


-- Submitter: hp-ml370g6-01.rhts.eng.bos.redhat.com : <10.16.66.124:44533> : hp-ml370g6-01.rhts.eng.bos.redhat.com
---
001.000:  Request is held.

Hold reason: Create_Instance_Failure: InvalidAMIID.NotFound: The AMI ID 'ami-ef693aaa' does not exist

---
002.000:  Request has been matched.

=========================================================================
Shortly thereafter, they both showed Not Found
=========================================================================

[root@hp-ml370g6-01 deltacloud-ec2-us-west-1]# condor_q


-- Submitter: hp-ml370g6-01.rhts.eng.bos.redhat.com : <10.16.66.124:44533> : hp-ml370g6-01.rhts.eng.bos.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   aeolus          4/26 17:09   0+00:00:00 H  0   0.0  job_dave_1        
   2.0   aeolus          4/26 17:18   0+00:00:00 H  0   0.0  job_dave2_2       

2 jobs; 0 idle, 0 running, 2 held
You have mail in /var/spool/mail/root
[root@hp-ml370g6-01 deltacloud-ec2-us-west-1]# condor_q -better


-- Submitter: hp-ml370g6-01.rhts.eng.bos.redhat.com : <10.16.66.124:44533> : hp-ml370g6-01.rhts.eng.bos.redhat.com
---
001.000:  Request is held.

Hold reason: Create_Instance_Failure: InvalidAMIID.NotFound: The AMI ID 'ami-ef693aaa' does not exist

---
002.000:  Request is held.

Hold reason: Create_Instance_Failure: InvalidAMIID.NotFound: The AMI ID 'ami-ef693aaa' does not exist

[root@hp-ml370g6-01 deltacloud-ec2-us-west-1]#

Comment 4 Shveta 2011-04-28 07:28:08 UTC
For me, the template build in us-east succeeded, whereas it failed for us-west:

Thu, 28 Apr 2011 06:45:21 GMT
/
2011-04-28 02:45:21,950 DEBUG boto pid(13116) Message: Method: GET
2011-04-28 02:45:21,951 DEBUG boto pid(13116) Message: Path: /?AWSAccessKeyId=AKIAI2KPFDYVZKSRTJMQ&Action=TerminateInstances&InstanceId.1=i-427e5006&SignatureMethod=HmacSHA256&SignatureVersion=2&Timestamp=2011-04-28T06%3A45%3A21&Version=2009-11-30&Signature=FH3FrJ48SauxlYytCU0QQcZJq6nY3auZS8AI6csG/Gk%3D
2011-04-28 02:45:21,951 DEBUG boto pid(13116) Message: Data:
2011-04-28 02:45:21,951 DEBUG boto pid(13116) Message: Headers: {'Date': 'Thu, 28 Apr 2011 06:45:21 GMT', 'Content-Length': '0', 'Authorization': 'AWS AKIAI2KPFDYVZKSRTJMQ:cI6VvOo0G/dTpPsi+5hY9c+KhTc=', 'User-Agent': 'Boto/1.9b (linux2)'}
2011-04-28 02:45:21,951 DEBUG boto pid(13116) Message: Host: None
2011-04-28 02:45:22,603 DEBUG boto pid(13116) Message: <?xml version="1.0" encoding="UTF-8"?>
<TerminateInstancesResponse xmlns="http://ec2.amazonaws.com/doc/2009-11-30/">
    <requestId>079cf389-4808-4586-80aa-d5d037fb5366</requestId>
    <instancesSet>
        <item>
            <instanceId>i-427e5006</instanceId>
            <currentState>
                <code>32</code>
                <name>shutting-down</name>
            </currentState>
            <previousState>
                <code>16</code>
                <name>running</name>
            </previousState>
        </item>
    </instancesSet>
</TerminateInstancesResponse>
2011-04-28 02:45:22,604 DEBUG imagefactory.builders.BaseBuilder.FedoraBuilder pid(13116) Message: Exception during push_image
2011-04-28 02:45:22,604 DEBUG imagefactory.builders.BaseBuilder.FedoraBuilder pid(13116) Message: Unexpected error: (<class 'imagefactory.ImageFactoryException.ImageFactoryException'>)
2011-04-28 02:45:22,604 DEBUG imagefactory.builders.BaseBuilder.FedoraBuilder pid(13116) Message:              value: (Unable to gain ssh access after 300 seconds - aborting)
2011-04-28 02:45:22,605 DEBUG imagefactory.builders.BaseBuilder.FedoraBuilder pid(13116) Message:          traceback: ['  File "/usr/lib/python2.6/site-packages/imagefactory/builders/FedoraBuilder.py", line 409, in push_image\n    self.push_image_snapshot(image_id, provider, credentials)\n', '  File "/usr/lib/python2.6/site-packages/imagefactory/builders/FedoraBuilder.py", line 562, in push_image_snapshot\n    raise ImageFactoryException("Unable to gain ssh access after 300 seconds - aborting")\n']
2011-04-28 02:45:22,605 DEBUG imagefactory.qmfagent.BuildAdaptor.BuildAdaptor pid(13116) Message: Raising event with agent handler (<ImageFactoryAgent(Thread-1, initial)>), changed status from PUSHING to FAILED


==========================================================================

rpm -qa |grep aeolus
aeolus-conductor-daemons-0.2.0-2.el6.x86_64
aeolus-configure-2.0.0-9.el6.noarch
aeolus-conductor-0.2.0-2.el6.x86_64
aeolus-conductor-doc-0.2.0-2.el6.x86_64


=============================================================================

Comment 5 Jan Provaznik 2011-04-28 08:35:02 UTC
The problem is most probably in deltacloud-api:
with my account I can find random public images in us-east, but I can't see any images in us-west. In the AWS console, though, I can switch between regions and see public images from both. Will discuss this with Michal Fojtik.
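
(As a hedged illustration of that check: deltacloud exposes an images collection over HTTP, with the cloud credentials passed via basic auth. The port and credential variables below are placeholders:)

curl -H "Accept: application/xml" \
     --user "$EC2_ACCESS_KEY:$EC2_SECRET_KEY" \
     http://localhost:3002/api/images
# Against the us-east driver this returns image entries; against the
# us-west driver (before the fix in comment 6) it comes back empty.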

Comment 6 Jan Provaznik 2011-04-28 08:57:06 UTC
So it seems we should set the API_PROVIDER env variable when starting the driver; otherwise the default (us-east-1) is used.
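
(A minimal sketch of that fix, assuming the driver is launched via deltacloudd; the port is illustrative. For the EC2 driver, API_PROVIDER selects the region endpoint:)

# Pin the EC2 driver to us-west-1 instead of the implicit us-east-1 default:
export API_PROVIDER=us-west-1
deltacloudd -i ec2 -p 3002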

Comment 7 Jan Provaznik 2011-04-28 09:47:24 UTC
With API_PROVIDER set I can see images from us-west without problems.

The deltacloud-core init daemons on hp-ml370g6-01.rhts.eng.bos.redhat.com are now fixed, but I can't reproduce the bug because I'm hitting the same problem as Shveta - imagefactory can't connect to us-west with my account. Wes, could you please test it with your account, which worked (I think)? I believe the problem should be fixed now.

Comment 8 Shveta 2011-05-02 10:01:24 UTC
[root@ip-10-118-63-61 ~]# condor_q


-- Submitter: ip-10-118-63-61.ec2.internal : <10.118.63.61:45189> : ip-10-118-63-61.ec2.internal
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   aeolus          5/2  05:22   0+00:33:53 R  0   0.0  job_ease_insta_1  
   2.0   aeolus          5/2  05:24   0+00:29:48 R  0   0.0  job_realm_insta_2 
   3.0   aeolus          5/2  05:41   0+00:00:00 I  0   0.0  job_west_insta_3  
   4.0   aeolus          5/2  05:45   0+00:00:00 I  0   0.0  job_west_insta_2_4

4 jobs; 2 idle, 2 running, 0 held


condor_q now shows jobs for us-west as well.

Comment 9 Jan Provaznik 2011-05-09 11:43:53 UTC
This problem should be fixed in the current version of aeolus-configure.

Comment 10 wes hayutin 2011-06-14 15:39:44 UTC
moving to on_qa for review

Comment 11 wes hayutin 2011-07-08 21:40:18 UTC
woot.. finally working :)

http://hp-xw6600-02.rhts.eng.bos.redhat.com:3006	


10.16.65.48 - - [08/Jul/2011 17:39:25] "GET / HTTP/1.1" 301 - 0.0011
10.16.65.48 - - [08/Jul/2011 17:39:25] "GET /api HTTP/1.1" 200 926 0.0094
10.16.65.48 - - [08/Jul/2011 17:39:25] "GET / HTTP/1.1" 301 - 0.0008
10.16.65.48 - - [08/Jul/2011 17:39:25] "GET /api HTTP/1.1" 200 926 0.0150
10.16.65.48 - - [08/Jul/2011 17:39:26] "GET / HTTP/1.1" 301 - 0.0011
10.16.65.48 - - [08/Jul/2011 17:39:26] "GET /api HTTP/1.1" 200 926 0.0091
10.16.65.48 - - [08/Jul/2011 17:39:26] "GET /api/hardware_profiles HTTP/1.1" 200 1813 0.0123

[root@hp-xw6600-02 ~]# rpm -qa | grep aeolus
rubygem-aeolus-cli-0.0.1-1.el6.20110708135911gitdb1097c.noarch
aeolus-all-0.3.0-0.el6.20110708135911gitdb1097c.noarch
aeolus-configure-2.0.1-0.el6.20110707131907gitfaa220b.noarch
aeolus-conductor-0.3.0-0.el6.20110708135911gitdb1097c.noarch
aeolus-conductor-daemons-0.3.0-0.el6.20110708135911gitdb1097c.noarch
aeolus-conductor-doc-0.3.0-0.el6.20110708135911gitdb1097c.noarch

Comment 12 wes hayutin 2011-07-08 21:43:22 UTC
Ignore comment 11 - that text was added to the wrong bug.
This bug is blocked by bug 719382.

Comment 13 wes hayutin 2011-09-28 16:39:42 UTC
making sure all the bugs are at the right version for future queries

Comment 15 wes hayutin 2011-09-29 16:04:52 UTC
condor is gone..


[root@unused bin]# rpm -qa | grep aeolus
aeolus-conductor-doc-0.4.0-0.20110929145941git7594098.fc15.noarch
aeolus-conductor-0.4.0-0.20110929145941git7594098.fc15.noarch
aeolus-conductor-daemons-0.4.0-0.20110929145941git7594098.fc15.noarch
aeolus-conductor-devel-0.4.0-0.20110929145941git7594098.fc15.noarch
aeolus-all-0.4.0-0.20110929145941git7594098.fc15.noarch
rubygem-aeolus-image-0.1.0-3.20110919115936gitd1d24b4.fc15.noarch
aeolus-configure-2.0.2-4.20110926142838git5044e56.fc15.noarch
[root@unused bin]# rpm -qa | grep condor
[root@unused bin]#

