Bug 701482

Summary: building/pushing templates may bring down warehouse
Product: [Retired] CloudForms Cloud Engine Reporter: wes hayutin <whayutin>
Component: iwhdAssignee: Jim Meyering <meyering>
Status: CLOSED CURRENTRELEASE QA Contact: Shveta <ssachdev>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 0.3.1CC: dajohnso, deltacloud-maint, morazi, ssachdev, zaitcev
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 684278, 697919    
Attachments:
Description Flags
debug log for image factory and iwhd
none
screenshot1
none
/var/log dir
none
coredump from dell-pe860-01.rhts.eng.bos.redhat.com
none
sos report
none
secondCore none

Description wes hayutin 2011-05-03 00:32:31 UTC
Created attachment 496409 [details]
debug log for image factory and iwhd

Description of problem:

I am not very confident in the recreate here, however I do seem to have warehouse in a state that restarting the service does fix.

How I got into this mess.

1. create two provider accounts ec2-east, ec2-west
2. create a few templates with two package groups, create,build,push no problem
3. create a template with a large amount of packages, I'll attach the package list
4. build and push to both east and west

after a few failed attempts and restarting iwhd because the service went down. I am now unable to get the template to even attempt to build. The template stays in queued 

current iwhd.log

[root@ip-10-82-251-157 ~]# cat /var/log/iwhd.log 
0 replication servers defined
iwhd: convert-provider: ITER key: name
iwhd: convert-provider: ITER key: type
iwhd: convert-provider: ITER key: path


I'm guessing log rotate cleaned the log, however I dont see where it saved the old log.

[root@ip-10-82-251-157 ~]# cd /var/log
[root@ip-10-82-251-157 log]# ls -ltra
total 1104
drwxr-xr-x.  2 root    root     4096 Jan  5  2010 ConsoleKit
drwxr-xr-x.  2 ntp     ntp      4096 May 13  2010 ntpstats
drwxr-xr-x.  2 puppet  puppet   4096 Apr  6 18:05 puppet
-rw-------.  1 root    root        0 May  2 10:21 tallylog
drwxr-xr-x. 18 root    root     4096 May  2 10:21 ..
-rw-------.  1 root    root        0 May  2 10:21 spooler
drwx------.  5 root    root     4096 May  2 10:21 libvirt
-rw-r--r--.  1 root    root   100910 May  2 10:24 dracut.log
-rw-r--r--.  1 root    root    11999 May  2 11:42 dmesg
drwxr-x---.  2 root    root     4096 May  2 11:42 audit
-rw-r--r--.  1 root    root      282 May  2 11:42 choose_repo.log
-rw-------.  1 root    root      211 May  2 11:42 maillog
drwxr-xr-x.  2 root    root     4096 May  2 11:42 aeolus-configure
drwxr-xr-x.  2 mongodb root     4096 May  2 11:43 mongodb
drwxr-xr-x.  2 root    root     4096 May  2 11:44 deltacloud-mock
drwxr-xr-x.  2 root    root     4096 May  2 11:44 deltacloud-ec2-us-west-1
drwxr-xr-x.  2 root    root     4096 May  2 11:44 deltacloud-ec2-us-east-1
drwx------.  2 root    root     4096 May  2 11:45 httpd
drwxr-xr-x.  2 aeolus  aeolus   4096 May  2 11:45 aeolus-conductor
-rw-r--r--.  1 root    root   177626 May  2 11:46 aeolus-configure.out
-rw-r--r--.  1 root    root     1768 May  2 11:46 boot.log
drwxr-xr-x. 15 root    root     4096 May  2 15:43 .
-rw-------.  1 root    root      106 May  2 15:46 yum.log
-rw-------.  1 root    utmp    76800 May  2 17:14 btmp
drwxr-xr-x.  2 condor  condor   4096 May  2 17:52 condor
-rw-------.  1 root    root     3612 May  2 20:01 cron
-rw-r--r--.  1 root    root      147 May  2 20:17 iwhd.log
-rw-rw-r--.  1 root    utmp     4608 May  2 20:20 wtmp
-rw-r--r--.  1 root    root   146000 May  2 20:20 lastlog
-rw-rw-rw-.  1 root    root   428038 May  2 20:21 imagefactory.log
-rw-rw-rw-.  1 root    root    51585 May  2 20:21 aeolus-connector.log
-rw-------.  1 root    root    91558 May  2 20:25 secure
-rw-------.  1 root    root    28428 May  2 20:28 messages

Comment 1 wes hayutin 2011-05-03 00:33:38 UTC
Created attachment 496410 [details]
screenshot1

screenshot of webui

Comment 2 wes hayutin 2011-05-03 00:37:15 UTC
Created attachment 496411 [details]
/var/log dir

tar.gz of /var/log

Comment 3 wes hayutin 2011-05-03 00:38:41 UTC
oops.. I messed up the tar command.. its not gz.. just a tar.
to extract.. just tar -xvf

Comment 4 wes hayutin 2011-05-03 00:42:09 UTC
05/02/11 20:41:05 [3684] Received CHECK_LEASES signal
05/02/11 20:41:05 [3684] in doContactSchedd()
05/02/11 20:41:05 [3684] querying for renewed leases
05/02/11 20:41:05 [3684] querying for removed/held jobs
05/02/11 20:41:05 [3684] Using constraint ((Owner=?="aeolus"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
05/02/11 20:41:05 [3684] Fetched 0 job ads from schedd
05/02/11 20:41:05 [3684] leaving doContactSchedd()
05/02/11 20:41:07 [3684] BaseResource::DoBatchStatus for http://localhost:3004/api.
05/02/11 20:41:07 [3684] BaseResource::DoBatchStatus: Starting bulk job poll of http://localhost:3004/api
05/02/11 20:41:07 [3684] GAHP[3685] <- 'DELTACLOUD_VM_STATUS_ALL 1896 http://localhost:3004/api AKIAJ557U7P7OIHRV2EQ 7COhvBzxrz2agtOPgk3bZM7W3ej9QI/j6WvYiHdg'
05/02/11 20:41:07 [3684] GAHP[3685] -> 'S'
05/02/11 20:41:08 [3684] GAHP[3685] <- 'RESULTS'
05/02/11 20:41:08 [3684] GAHP[3685] -> 'R'
05/02/11 20:41:08 [3684] GAHP[3685] -> 'S' '1'
05/02/11 20:41:08 [3684] GAHP[3685] -> '1896' 'NULL' 'i-d69fb792' 'RUNNING' 'i-805279c4' 'PENDING'
05/02/11 20:41:08 [3684] BaseResource::DoBatchStatus for http://localhost:3004/api.
05/02/11 20:41:08 [3684] BaseResource::DoBatchStatus: Finished bulk job poll of http://localhost:3004/api
05/02/11 20:41:10 [3684] in doContactSchedd()
05/02/11 20:41:10 [3684] querying for removed/held jobs
05/02/11 20:41:10 [3684] Using constraint ((Owner=?="aeolus"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
05/02/11 20:41:10 [3684] Fetched 0 job ads from schedd
05/02/11 20:41:10 [3684] Updating classad values for 2.0:
05/02/11 20:41:10 [3684]    LastRemoteStatusUpdate = 1304383268
05/02/11 20:41:10 [3684] Updating classad values for 3.0:
05/02/11 20:41:10 [3684]    LastRemoteStatusUpdate = 1304383268
05/02/11 20:41:10 [3684] leaving doContactSchedd()
05/02/11 20:41:19 [3684] BaseResource::UpdateResource: 
Machine = "ip-10-82-251-157.ec2.internal"
NumJobs = 2
IdleJobs = 1
RunningJobs = 1
SubmitsAllowed = 2
DeltacloudUserName = "AKIAJ557U7P7OIHRV2EQ"
Name = "deltacloud http://localhost:3004/api"
CondorVersion = "$CondorVersion: 7.6.0 Apr 07 2011 PRE-RELEASE-UWCS $"
ScheddIpAddr = "<10.82.251.157:49328>"
JobLimit = 1000
HashName = "http://localhost:3004/api#AKIAJ557U7P7OIHRV2EQ#7COhvBzxrz2agtOPgk3bZM7W3ej9QI/j6WvYiHdg"
MyAddress = "<10.82.251.157:53077>"
MyCurrentTime = 1304383279
SubmitsWanted = 0
CondorPlatform = "$CondorPlatform: X86_64-Redhat_6.0 $"
CurrentTime = time()
MyType = "Grid"
ScheddName = "ip-10-82-251-157.ec2.internal"
Owner = "aeolus"

05/02/11 20:41:19 [3684] Trying to update collector <127.0.0.1:9618>
05/02/11 20:41:19 [3684] Attempting to send update via UDP to collector localhost.localdomain <127.0.0.1:9618>
05/02/11 20:41:19 [3684] BaseResource::DoBatchStatus for http://localhost:3003/api.
05/02/11 20:41:19 [3684] BaseResource::DoBatchStatus: Starting bulk job poll of http://localhost:3003/api
05/02/11 20:41:19 [3684] GAHP[3685] <- 'DELTACLOUD_VM_STATUS_ALL 1897 http://localhost:3003/api AKIAJ557U7P7OIHRV2EQ 7COhvBzxrz2agtOPgk3bZM7W3ej9QI/j6WvYiHdg'
05/02/11 20:41:19 [3684] GAHP[3685] -> 'S'
05/02/11 20:41:19 [3684] GAHP[3685] <- 'RESULTS'
05/02/11 20:41:19 [3684] GAHP[3685] -> 'R'
05/02/11 20:41:19 [3684] GAHP[3685] -> 'S' '1'
05/02/11 20:41:19 [3684] GAHP[3685] -> '1897' 'NULL' 'i-006e5e6f' 'RUNNING' 'i-1431017b' 'RUNNING' 'i-26d2e249' 'RUNNING' 'i-fcbd8d93' 'RUNNING'
05/02/11 20:41:19 [3684] BaseResource::DoBatchStatus for http://localhost:3003/api.
05/02/11 20:41:19 [3684] Job http://localhost:3003/api#i-006e5e6f on remote host is unknown. Skipping.
05/02/11 20:41:19 [3684] Job http://localhost:3003/api#i-26d2e249 on remote host is unknown. Skipping.
05/02/11 20:41:19 [3684] Job http://localhost:3003/api#i-fcbd8d93 on remote host is unknown. Skipping.
05/02/11 20:41:19 [3684] BaseResource::DoBatchStatus: Finished bulk job poll of http://localhost:3003/api
05/02/11 20:41:19 [3684] in doContactSchedd()
05/02/11 20:41:19 [3684] querying for removed/held jobs
05/02/11 20:41:19 [3684] Using constraint ((Owner=?="aeolus"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
05/02/11 20:41:19 [3684] Fetched 0 job ads from schedd
05/02/11 20:41:19 [3684] Updating classad values for 1.0:
05/02/11 20:41:19 [3684]    LastRemoteStatusUpdate = 1304383279
05/02/11 20:41:19 [3684] leaving doContactSchedd()
05/02/11 20:41:20 [3684] Evaluating staleness of remote job statuses.
05/02/11 20:41:23 [3684] GAHP[3685] <- 'RESULTS'
05/02/11 20:41:23 [3684] GAHP[3685] -> 'S' '0'
05/02/11 20:41:38 [3684] BaseResource::DoBatchStatus for http://localhost:3004/api.
05/02/11 20:41:38 [3684] BaseResource::DoBatchStatus: Starting bulk job poll of http://localhost:3004/api
05/02/11 20:41:38 [3684] GAHP[3685] <- 'DELTACLOUD_VM_STATUS_ALL 1898 http://localhost:3004/api AKIAJ557U7P7OIHRV2EQ 7COhvBzxrz2agtOPgk3bZM7W3ej9QI/j6WvYiHdg'
05/02/11 20:41:38 [3684] GAHP[3685] -> 'S'
05/02/11 20:41:40 [3684] GAHP[3685] <- 'RESULTS'
05/02/11 20:41:40 [3684] GAHP[3685] -> 'R'
05/02/11 20:41:40 [3684] GAHP[3685] -> 'S' '1'
05/02/11 20:41:40 [3684] GAHP[3685] -> '1898' 'NULL' 'i-d69fb792' 'RUNNING' 'i-805279c4' 'PENDING'
05/02/11 20:41:40 [3684] BaseResource::DoBatchStatus for http://localhost:3004/api.
05/02/11 20:41:40 [3684] BaseResource::DoBatchStatus: Finished bulk job poll of http://localhost:3004/api
05/02/11 20:41:40 [3684] in doContactSchedd()
05/02/11 20:41:40 [3684] querying for removed/held jobs
05/02/11 20:41:40 [3684] Using constraint ((Owner=?="aeolus"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
05/02/11 20:41:40 [3684] Fetched 0 job ads from schedd
05/02/11 20:41:40 [3684] Updating classad values for 2.0:
05/02/11 20:41:40 [3684]    LastRemoteStatusUpdate = 1304383300
05/02/11 20:41:40 [3684] Updating classad values for 3.0:
05/02/11 20:41:40 [3684]    LastRemoteStatusUpdate = 1304383300
05/02/11 20:41:40 [3684] leaving doContactSchedd()
05/02/11 20:41:49 [3684] BaseResource::DoBatchStatus for http://localhost:3003/api.
05/02/11 20:41:49 [3684] BaseResource::DoBatchStatus: Starting bulk job poll of http://localhost:3003/api
05/02/11 20:41:49 [3684] GAHP[3685] <- 'DELTACLOUD_VM_STATUS_ALL 1899 http://localhost:3003/api AKIAJ557U7P7OIHRV2EQ 7COhvBzxrz2agtOPgk3bZM7W3ej9QI/j6WvYiHdg'
05/02/11 20:41:49 [3684] GAHP[3685] -> 'S'
05/02/11 20:41:49 [3684] GAHP[3685] <- 'RESULTS'
05/02/11 20:41:49 [3684] GAHP[3685] -> 'R'
05/02/11 20:41:49 [3684] GAHP[3685] -> 'S' '1'
05/02/11 20:41:49 [3684] GAHP[3685] -> '1899' 'NULL' 'i-006e5e6f' 'RUNNING' 'i-1431017b' 'RUNNING' 'i-26d2e249' 'RUNNING' 'i-fcbd8d93' 'RUNNING'
05/02/11 20:41:49 [3684] BaseResource::DoBatchStatus for http://localhost:3003/api.
05/02/11 20:41:49 [3684] Job http://localhost:3003/api#i-006e5e6f on remote host is unknown. Skipping.
05/02/11 20:41:49 [3684] Job http://localhost:3003/api#i-26d2e249 on remote host is unknown. Skipping.
05/02/11 20:41:49 [3684] Job http://localhost:3003/api#i-fcbd8d93 on remote host is unknown. Skipping.
05/02/11 20:41:49 [3684] BaseResource::DoBatchStatus: Finished bulk job poll of http://localhost:3003/api
05/02/11 20:41:49 [3684] in doContactSchedd()
05/02/11 20:41:49 [3684] querying for removed/held jobs
05/02/11 20:41:49 [3684] Using constraint ((Owner=?="aeolus"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
05/02/11 20:41:49 [3684] Fetched 0 job ads from schedd
05/02/11 20:41:49 [3684] Updating classad values for 1.0:
05/02/11 20:41:49 [3684]    LastRemoteStatusUpdate = 1304383309
05/02/11 20:41:49 [3684] leaving doContactSchedd()

Comment 5 wes hayutin 2011-05-03 00:46:45 UTC
mages/c9ef49a6-4882-48e0-bec1-7c0dd74760c6
2011-05-02 20:46:01,492 ERROR imagefactory.qmfagent.ImageFactoryAgent.ImageFactoryAgent pid(2018) Message: "Problem encountered trying to reach image warehouse. Please check that iwhd is running and reachable.\nException text: 'NoneType' object has no attribute 'makefile'"
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/imagefactory/qmfagent/ImageFactoryAgent.py", line 92, in method
    result = getattr(target_obj, methodName)(**args)
  File "/usr/lib/python2.6/site-packages/imagefactory/qmfagent/ImageFactory.py", line 92, in provider_image
    image_metadata = self.warehouse.metadata_for_id_of_type(("template", "target"), image_id, "image")
  File "/usr/lib/python2.6/site-packages/imagefactory/ImageWarehouse.py", line 176, in metadata_for_id_of_type
    return self.metadata_for_object_at_url(metadata_keys, object_url)
  File "/usr/lib/python2.6/site-packages/imagefactory/ImageWarehouse.py", line 187, in metadata_for_object_at_url
    raise WarehouseError("Problem encountered trying to reach image warehouse. Please check that iwhd is running and reachable.\nException text: %s" % (e, ))
WarehouseError: "Problem encountered trying to reach image warehouse. Please check that iwhd is running and reachable.\nException text: 'NoneType' object has no attribute 'makefile'"
^C
[root@ip-10-82-251-157 tmp]# /etc/init.d/iwhd status
iwhd (pid  19588) is running...
[root@ip-10-82-251-157 tmp]# ps -ef | grep 19588
root     19588     1  3 20:17 ?        00:01:07 iwhd -c /etc/iwhd/conf.js -d localhost 27017
root     21571 20793  0 20:46 pts/0    00:00:00 grep 19588

Comment 6 wes hayutin 2011-05-04 16:35:44 UTC
Created attachment 496845 [details]
coredump from dell-pe860-01.rhts.eng.bos.redhat.com

coredump from dell-pe860-01.rhts.eng.bos.redhat.com

Comment 7 wes hayutin 2011-05-04 16:36:33 UTC
Created attachment 496846 [details]
sos report

Comment 8 wes hayutin 2011-05-04 19:42:06 UTC
Created attachment 496903 [details]
secondCore

recreate:
1. built two more templates
   a. template_simpledr14j
   b. template_complexrjl5v

looks like iwhd went down while template_complexrjl5v was getting pushed

Comment 9 Jim Meyering 2011-05-06 12:20:55 UTC
I've fixed this.  Patches coming shortly.

Comment 10 Jim Meyering 2011-05-06 13:48:07 UTC
Explanation and patches here:

http://thread.gmane.org/gmane.comp.lib.iwhd.devel/552

Comment 11 wes hayutin 2011-05-06 14:06:29 UTC
fantastic, Jim thanks for the in depth explanation!

Comment 12 Jim Meyering 2011-05-06 20:06:24 UTC
New release, including that fix:
  http://thread.gmane.org/gmane.comp.lib.iwhd.devel/559

Comment 13 Shveta 2011-07-13 13:04:16 UTC
Build and Pushed templates again and again and didn't notice iwhd going down .

Comment 14 wes hayutin 2011-08-01 20:02:16 UTC
release pending...

Comment 15 wes hayutin 2011-08-01 20:02:30 UTC
release pending 3

Comment 16 wes hayutin 2011-08-01 20:02:35 UTC
release pending...

Comment 17 wes hayutin 2011-08-01 20:02:48 UTC
release pending.. 2

Comment 19 wes hayutin 2011-12-08 14:04:42 UTC
closing out old bugs

Comment 20 wes hayutin 2011-12-08 14:16:30 UTC
perm close