Created attachment 496409 [details]
debug log for image factory and iwhd

Description of problem:
I am not very confident in the recreate steps here; however, I do seem to have the warehouse in a state that restarting the service does fix.

How I got into this mess:
1. Create two provider accounts: ec2-east and ec2-west.
2. Create a few templates with two package groups; create, build, and push with no problem.
3. Create a template with a large number of packages (I'll attach the package list).
4. Build and push to both east and west. After a few failed attempts, and after restarting iwhd because the service went down, I am now unable to get the template to even attempt to build. The template stays in "queued".

Current iwhd.log:

[root@ip-10-82-251-157 ~]# cat /var/log/iwhd.log
0 replication servers defined
iwhd: convert-provider: ITER key: name
iwhd: convert-provider: ITER key: type
iwhd: convert-provider: ITER key: path

I'm guessing logrotate cleaned the log; however, I don't see where it saved the old log.

[root@ip-10-82-251-157 ~]# cd /var/log
[root@ip-10-82-251-157 log]# ls -ltra
total 1104
drwxr-xr-x. 2 root root 4096 Jan 5 2010 ConsoleKit
drwxr-xr-x. 2 ntp ntp 4096 May 13 2010 ntpstats
drwxr-xr-x. 2 puppet puppet 4096 Apr 6 18:05 puppet
-rw-------. 1 root root 0 May 2 10:21 tallylog
drwxr-xr-x. 18 root root 4096 May 2 10:21 ..
-rw-------. 1 root root 0 May 2 10:21 spooler
drwx------. 5 root root 4096 May 2 10:21 libvirt
-rw-r--r--. 1 root root 100910 May 2 10:24 dracut.log
-rw-r--r--. 1 root root 11999 May 2 11:42 dmesg
drwxr-x---. 2 root root 4096 May 2 11:42 audit
-rw-r--r--. 1 root root 282 May 2 11:42 choose_repo.log
-rw-------. 1 root root 211 May 2 11:42 maillog
drwxr-xr-x. 2 root root 4096 May 2 11:42 aeolus-configure
drwxr-xr-x. 2 mongodb root 4096 May 2 11:43 mongodb
drwxr-xr-x. 2 root root 4096 May 2 11:44 deltacloud-mock
drwxr-xr-x. 2 root root 4096 May 2 11:44 deltacloud-ec2-us-west-1
drwxr-xr-x. 2 root root 4096 May 2 11:44 deltacloud-ec2-us-east-1
drwx------. 2 root root 4096 May 2 11:45 httpd
drwxr-xr-x. 2 aeolus aeolus 4096 May 2 11:45 aeolus-conductor
-rw-r--r--. 1 root root 177626 May 2 11:46 aeolus-configure.out
-rw-r--r--. 1 root root 1768 May 2 11:46 boot.log
drwxr-xr-x. 15 root root 4096 May 2 15:43 .
-rw-------. 1 root root 106 May 2 15:46 yum.log
-rw-------. 1 root utmp 76800 May 2 17:14 btmp
drwxr-xr-x. 2 condor condor 4096 May 2 17:52 condor
-rw-------. 1 root root 3612 May 2 20:01 cron
-rw-r--r--. 1 root root 147 May 2 20:17 iwhd.log
-rw-rw-r--. 1 root utmp 4608 May 2 20:20 wtmp
-rw-r--r--. 1 root root 146000 May 2 20:20 lastlog
-rw-rw-rw-. 1 root root 428038 May 2 20:21 imagefactory.log
-rw-rw-rw-. 1 root root 51585 May 2 20:21 aeolus-connector.log
-rw-------. 1 root root 91558 May 2 20:25 secure
-rw-------. 1 root root 28428 May 2 20:28 messages
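Side note: since the template just sits in "queued" and iwhd.log is nearly empty, a quick sanity check is to confirm whether iwhd is actually accepting HTTP connections at all, independent of what the init script reports. Below is a minimal probe sketch, not something taken from this bug: it assumes the warehouse REST API listens on localhost:9090 (verify the port against /etc/iwhd/conf.js) and targets the Python 2 interpreter installed on the box.

# Minimal sketch, assumptions noted: iwhd's REST API is assumed to be on
# localhost:9090 -- check /etc/iwhd/conf.js before relying on this.
import httplib
import socket

def iwhd_reachable(host="localhost", port=9090, timeout=5):
    """Return True if iwhd answers an HTTP request, whatever the status code."""
    try:
        conn = httplib.HTTPConnection(host, port, timeout=timeout)
        conn.request("GET", "/")
        resp = conn.getresponse()
        print "iwhd answered with HTTP %d" % resp.status
        return True
    except (socket.error, httplib.HTTPException), err:
        print "iwhd not reachable: %s" % err
        return False

if __name__ == "__main__":
    iwhd_reachable()

If this fails while the init script still claims the daemon is up, that narrows the problem to iwhd itself rather than conductor or imagefactory.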
Created attachment 496410 [details]
screenshot1

Screenshot of the web UI.
Created attachment 496411 [details]
/var/log dir

tar.gz of /var/log
Oops, I messed up the tar command: it's not gzipped, just a plain tar. To extract, just run tar -xvf.
05/02/11 20:41:05 [3684] Received CHECK_LEASES signal
05/02/11 20:41:05 [3684] in doContactSchedd()
05/02/11 20:41:05 [3684] querying for renewed leases
05/02/11 20:41:05 [3684] querying for removed/held jobs
05/02/11 20:41:05 [3684] Using constraint ((Owner=?="aeolus"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
05/02/11 20:41:05 [3684] Fetched 0 job ads from schedd
05/02/11 20:41:05 [3684] leaving doContactSchedd()
05/02/11 20:41:07 [3684] BaseResource::DoBatchStatus for http://localhost:3004/api.
05/02/11 20:41:07 [3684] BaseResource::DoBatchStatus: Starting bulk job poll of http://localhost:3004/api
05/02/11 20:41:07 [3684] GAHP[3685] <- 'DELTACLOUD_VM_STATUS_ALL 1896 http://localhost:3004/api AKIAJ557U7P7OIHRV2EQ 7COhvBzxrz2agtOPgk3bZM7W3ej9QI/j6WvYiHdg'
05/02/11 20:41:07 [3684] GAHP[3685] -> 'S'
05/02/11 20:41:08 [3684] GAHP[3685] <- 'RESULTS'
05/02/11 20:41:08 [3684] GAHP[3685] -> 'R'
05/02/11 20:41:08 [3684] GAHP[3685] -> 'S' '1'
05/02/11 20:41:08 [3684] GAHP[3685] -> '1896' 'NULL' 'i-d69fb792' 'RUNNING' 'i-805279c4' 'PENDING'
05/02/11 20:41:08 [3684] BaseResource::DoBatchStatus for http://localhost:3004/api.
05/02/11 20:41:08 [3684] BaseResource::DoBatchStatus: Finished bulk job poll of http://localhost:3004/api
05/02/11 20:41:10 [3684] in doContactSchedd()
05/02/11 20:41:10 [3684] querying for removed/held jobs
05/02/11 20:41:10 [3684] Using constraint ((Owner=?="aeolus"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
05/02/11 20:41:10 [3684] Fetched 0 job ads from schedd
05/02/11 20:41:10 [3684] Updating classad values for 2.0:
05/02/11 20:41:10 [3684] LastRemoteStatusUpdate = 1304383268
05/02/11 20:41:10 [3684] Updating classad values for 3.0:
05/02/11 20:41:10 [3684] LastRemoteStatusUpdate = 1304383268
05/02/11 20:41:10 [3684] leaving doContactSchedd()
05/02/11 20:41:19 [3684] BaseResource::UpdateResource: Machine = "ip-10-82-251-157.ec2.internal" NumJobs = 2 IdleJobs = 1 RunningJobs = 1 SubmitsAllowed = 2 DeltacloudUserName = "AKIAJ557U7P7OIHRV2EQ" Name = "deltacloud http://localhost:3004/api" CondorVersion = "$CondorVersion: 7.6.0 Apr 07 2011 PRE-RELEASE-UWCS $" ScheddIpAddr = "<10.82.251.157:49328>" JobLimit = 1000 HashName = "http://localhost:3004/api#AKIAJ557U7P7OIHRV2EQ#7COhvBzxrz2agtOPgk3bZM7W3ej9QI/j6WvYiHdg" MyAddress = "<10.82.251.157:53077>" MyCurrentTime = 1304383279 SubmitsWanted = 0 CondorPlatform = "$CondorPlatform: X86_64-Redhat_6.0 $" CurrentTime = time() MyType = "Grid" ScheddName = "ip-10-82-251-157.ec2.internal" Owner = "aeolus"
05/02/11 20:41:19 [3684] Trying to update collector <127.0.0.1:9618>
05/02/11 20:41:19 [3684] Attempting to send update via UDP to collector localhost.localdomain <127.0.0.1:9618>
05/02/11 20:41:19 [3684] BaseResource::DoBatchStatus for http://localhost:3003/api.
05/02/11 20:41:19 [3684] BaseResource::DoBatchStatus: Starting bulk job poll of http://localhost:3003/api
05/02/11 20:41:19 [3684] GAHP[3685] <- 'DELTACLOUD_VM_STATUS_ALL 1897 http://localhost:3003/api AKIAJ557U7P7OIHRV2EQ 7COhvBzxrz2agtOPgk3bZM7W3ej9QI/j6WvYiHdg'
05/02/11 20:41:19 [3684] GAHP[3685] -> 'S'
05/02/11 20:41:19 [3684] GAHP[3685] <- 'RESULTS'
05/02/11 20:41:19 [3684] GAHP[3685] -> 'R'
05/02/11 20:41:19 [3684] GAHP[3685] -> 'S' '1'
05/02/11 20:41:19 [3684] GAHP[3685] -> '1897' 'NULL' 'i-006e5e6f' 'RUNNING' 'i-1431017b' 'RUNNING' 'i-26d2e249' 'RUNNING' 'i-fcbd8d93' 'RUNNING'
05/02/11 20:41:19 [3684] BaseResource::DoBatchStatus for http://localhost:3003/api.
05/02/11 20:41:19 [3684] Job http://localhost:3003/api#i-006e5e6f on remote host is unknown. Skipping.
05/02/11 20:41:19 [3684] Job http://localhost:3003/api#i-26d2e249 on remote host is unknown. Skipping.
05/02/11 20:41:19 [3684] Job http://localhost:3003/api#i-fcbd8d93 on remote host is unknown. Skipping.
05/02/11 20:41:19 [3684] BaseResource::DoBatchStatus: Finished bulk job poll of http://localhost:3003/api
05/02/11 20:41:19 [3684] in doContactSchedd()
05/02/11 20:41:19 [3684] querying for removed/held jobs
05/02/11 20:41:19 [3684] Using constraint ((Owner=?="aeolus"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
05/02/11 20:41:19 [3684] Fetched 0 job ads from schedd
05/02/11 20:41:19 [3684] Updating classad values for 1.0:
05/02/11 20:41:19 [3684] LastRemoteStatusUpdate = 1304383279
05/02/11 20:41:19 [3684] leaving doContactSchedd()
05/02/11 20:41:20 [3684] Evaluating staleness of remote job statuses.
05/02/11 20:41:23 [3684] GAHP[3685] <- 'RESULTS'
05/02/11 20:41:23 [3684] GAHP[3685] -> 'S' '0'
05/02/11 20:41:38 [3684] BaseResource::DoBatchStatus for http://localhost:3004/api.
05/02/11 20:41:38 [3684] BaseResource::DoBatchStatus: Starting bulk job poll of http://localhost:3004/api
05/02/11 20:41:38 [3684] GAHP[3685] <- 'DELTACLOUD_VM_STATUS_ALL 1898 http://localhost:3004/api AKIAJ557U7P7OIHRV2EQ 7COhvBzxrz2agtOPgk3bZM7W3ej9QI/j6WvYiHdg'
05/02/11 20:41:38 [3684] GAHP[3685] -> 'S'
05/02/11 20:41:40 [3684] GAHP[3685] <- 'RESULTS'
05/02/11 20:41:40 [3684] GAHP[3685] -> 'R'
05/02/11 20:41:40 [3684] GAHP[3685] -> 'S' '1'
05/02/11 20:41:40 [3684] GAHP[3685] -> '1898' 'NULL' 'i-d69fb792' 'RUNNING' 'i-805279c4' 'PENDING'
05/02/11 20:41:40 [3684] BaseResource::DoBatchStatus for http://localhost:3004/api.
05/02/11 20:41:40 [3684] BaseResource::DoBatchStatus: Finished bulk job poll of http://localhost:3004/api
05/02/11 20:41:40 [3684] in doContactSchedd()
05/02/11 20:41:40 [3684] querying for removed/held jobs
05/02/11 20:41:40 [3684] Using constraint ((Owner=?="aeolus"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
05/02/11 20:41:40 [3684] Fetched 0 job ads from schedd
05/02/11 20:41:40 [3684] Updating classad values for 2.0:
05/02/11 20:41:40 [3684] LastRemoteStatusUpdate = 1304383300
05/02/11 20:41:40 [3684] Updating classad values for 3.0:
05/02/11 20:41:40 [3684] LastRemoteStatusUpdate = 1304383300
05/02/11 20:41:40 [3684] leaving doContactSchedd()
05/02/11 20:41:49 [3684] BaseResource::DoBatchStatus for http://localhost:3003/api.
05/02/11 20:41:49 [3684] BaseResource::DoBatchStatus: Starting bulk job poll of http://localhost:3003/api
05/02/11 20:41:49 [3684] GAHP[3685] <- 'DELTACLOUD_VM_STATUS_ALL 1899 http://localhost:3003/api AKIAJ557U7P7OIHRV2EQ 7COhvBzxrz2agtOPgk3bZM7W3ej9QI/j6WvYiHdg'
05/02/11 20:41:49 [3684] GAHP[3685] -> 'S'
05/02/11 20:41:49 [3684] GAHP[3685] <- 'RESULTS'
05/02/11 20:41:49 [3684] GAHP[3685] -> 'R'
05/02/11 20:41:49 [3684] GAHP[3685] -> 'S' '1'
05/02/11 20:41:49 [3684] GAHP[3685] -> '1899' 'NULL' 'i-006e5e6f' 'RUNNING' 'i-1431017b' 'RUNNING' 'i-26d2e249' 'RUNNING' 'i-fcbd8d93' 'RUNNING'
05/02/11 20:41:49 [3684] BaseResource::DoBatchStatus for http://localhost:3003/api.
05/02/11 20:41:49 [3684] Job http://localhost:3003/api#i-006e5e6f on remote host is unknown. Skipping.
05/02/11 20:41:49 [3684] Job http://localhost:3003/api#i-26d2e249 on remote host is unknown. Skipping.
05/02/11 20:41:49 [3684] Job http://localhost:3003/api#i-fcbd8d93 on remote host is unknown. Skipping.
05/02/11 20:41:49 [3684] BaseResource::DoBatchStatus: Finished bulk job poll of http://localhost:3003/api
05/02/11 20:41:49 [3684] in doContactSchedd()
05/02/11 20:41:49 [3684] querying for removed/held jobs
05/02/11 20:41:49 [3684] Using constraint ((Owner=?="aeolus"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
05/02/11 20:41:49 [3684] Fetched 0 job ads from schedd
05/02/11 20:41:49 [3684] Updating classad values for 1.0:
05/02/11 20:41:49 [3684] LastRemoteStatusUpdate = 1304383309
05/02/11 20:41:49 [3684] leaving doContactSchedd()
mages/c9ef49a6-4882-48e0-bec1-7c0dd74760c6
2011-05-02 20:46:01,492 ERROR imagefactory.qmfagent.ImageFactoryAgent.ImageFactoryAgent pid(2018) Message: "Problem encountered trying to reach image warehouse. Please check that iwhd is running and reachable.\nException text: 'NoneType' object has no attribute 'makefile'"
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/imagefactory/qmfagent/ImageFactoryAgent.py", line 92, in method
    result = getattr(target_obj, methodName)(**args)
  File "/usr/lib/python2.6/site-packages/imagefactory/qmfagent/ImageFactory.py", line 92, in provider_image
    image_metadata = self.warehouse.metadata_for_id_of_type(("template", "target"), image_id, "image")
  File "/usr/lib/python2.6/site-packages/imagefactory/ImageWarehouse.py", line 176, in metadata_for_id_of_type
    return self.metadata_for_object_at_url(metadata_keys, object_url)
  File "/usr/lib/python2.6/site-packages/imagefactory/ImageWarehouse.py", line 187, in metadata_for_object_at_url
    raise WarehouseError("Problem encountered trying to reach image warehouse. Please check that iwhd is running and reachable.\nException text: %s" % (e, ))
WarehouseError: "Problem encountered trying to reach image warehouse. Please check that iwhd is running and reachable.\nException text: 'NoneType' object has no attribute 'makefile'"
^C
[root@ip-10-82-251-157 tmp]# /etc/init.d/iwhd status
iwhd (pid 19588) is running...
[root@ip-10-82-251-157 tmp]# ps -ef | grep 19588
root 19588 1 3 20:17 ? 00:01:07 iwhd -c /etc/iwhd/conf.js -d localhost 27017
root 21571 20793 0 20:46 pts/0 00:00:00 grep 19588
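For what it's worth, the "'NoneType' object has no attribute 'makefile'" text in that traceback looks like a generic dead-socket symptom rather than anything specific to the metadata call: in Python 2's httplib, an HTTPResponse is built around the connection's socket via sock.makefile(), so a connection whose socket is already gone (e.g. iwhd dropping it mid-request) surfaces as exactly this AttributeError. That is only my reading of the traceback, not something confirmed against the imagefactory sources; the tiny sketch below just reproduces the message.

# Sketch only (Python 2, matching the /usr/lib/python2.6 paths above):
# HTTPResponse.__init__ immediately calls sock.makefile(), so a socket that
# is already gone (faked here with None) yields the error text from the log.
import httplib

try:
    httplib.HTTPResponse(None)
except AttributeError, err:
    print err   # 'NoneType' object has no attribute 'makefile'

Either way it points at the connection to iwhd dying, not at the arguments to the warehouse call itself.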
Created attachment 496845 [details]
coredump from dell-pe860-01.rhts.eng.bos.redhat.com
Created attachment 496846 [details] sos report
Created attachment 496903 [details]
secondCore

Recreate:
1. Built two more templates:
   a. template_simpledr14j
   b. template_complexrjl5v

It looks like iwhd went down while template_complexrjl5v was being pushed.
I've fixed this. Patches coming shortly.
Explanation and patches here: http://thread.gmane.org/gmane.comp.lib.iwhd.devel/552
Fantastic! Jim, thanks for the in-depth explanation!
New release, including that fix: http://thread.gmane.org/gmane.comp.lib.iwhd.devel/559
Built and pushed templates again and again, and didn't notice iwhd going down.
release pending...
release pending 3
release pending.. 2
closing out old bugs
perm close