Bug 888833
| Summary: | After upgrade, AsyncTaskZombieTaskLifeInMinutes remains 0 and not reverted to default value | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | james labocki <jlabocki> | ||||||||||
| Component: | ovirt-engine | Assignee: | Moran Goldboim <mgoldboi> | ||||||||||
| Status: | CLOSED WONTFIX | QA Contact: | Haim <hateya> | ||||||||||
| Severity: | high | Docs Contact: | |||||||||||
| Priority: | urgent | ||||||||||||
| Version: | 3.1.0 | CC: | abaron, alourie, bazulay, dajohnso, dyasny, gfidente, iheim, lpeer, mhuth, Rhev-m-bugs, sgrinber, yeylon, ykaul | ||||||||||
| Target Milestone: | --- | ||||||||||||
| Target Release: | --- | ||||||||||||
| Hardware: | Unspecified | ||||||||||||
| OS: | Unspecified | ||||||||||||
| Whiteboard: | integration | ||||||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||||||
| Doc Text: | Story Points: | --- | |||||||||||
| Clone Of: | Environment: | ||||||||||||
| Last Closed: | 2013-01-30 10:04:41 UTC | Type: | Bug | ||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||
| Documentation: | --- | CRM: | |||||||||||
| Verified Versions: | Category: | --- | |||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
| Embargoed: | |||||||||||||
| Attachments: |
|
||||||||||||
thanks for reporting this; from a first look at the logs the error seems to be caused by a problem on the hypervisor during image copy from the import domain (as per vdsm.log):
ea197f92-b3e9-42bc-9a7f-6d1666e01cb3::ERROR::2012-12-19 09:07:51,697::task::868::TaskManager.Task::(_setError) Unexpected error
Traceback (most recent call last):
File "/usr/share/vdsm/storage/task.py", line 876, in _run
return fn(*args, **kargs)
File "/usr/share/vdsm/storage/task.py", line 303, in run
return self.cmd(*self.argslist, **self.argsdict)
File "/usr/share/vdsm/storage/spm.py", line 115, in run
return self.func(*args, **kwargs)
File "/usr/share/vdsm/storage/spm.py", line 743, in copyImage
volType, volFormat, preallocate, postZero, force)
File "/usr/share/vdsm/storage/image.py", line 817, in copy
raise se.CopyImageError("src image=%s, dst image=%s: msg=%s" % (srcImgUUID, dstImgUUID, str(e)))
CopyImageError: low level Image copy failed: ('src image=4c62079b-2112-4b08-ac63-dc9b33254c4d, dst image=4c62079b-2112-4b08-ac63-dc9b33254c4d: msg=Action was stopped: ()',)
I think this should be reassigned against the vdsm component in RHEL or to RHEV, cause imagefactory seems to be 'correctly' reporting a problem on the backend and push failure
Dup of https://bugzilla.redhat.com/show_bug.cgi?id=803724 ? Can you describe your RHEV env. (data domain storage type, etc.) Please attach complete vdsm.log (compressed) to the bug. It's impossible to look at them as a huge comment. (In reply to comment #3) > Dup of https://bugzilla.redhat.com/show_bug.cgi?id=803724 ? Can you describe > your RHEV env. (data domain storage type, etc.) Not at all a dup. The task was deliberately stopped: Thread-5448640::INFO::2012-12-19 09:07:51,035::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: getAllTasksStatuses, Return response: {'status': {'message': 'OK', 'code': 0}, 'allTasksStatus': {'ea197f92-b3e9-42bc-9a7f-6d1666e01cb3': {'code': 0, 'message': 'Task is initializing', 'taskState': 'running', 'taskResult': '', 'taskID': 'ea197f92-b3e9-42bc-9a7f-6d1666e01cb3'}}} Thread-5448641::DEBUG::2012-12-19 09:07:51,057::clientIF::239::Storage.Dispatcher.Protect::(wrapper) [10.16.46.112] Thread-5448641::INFO::2012-12-19 09:07:51,058::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: stopTask, args: ( taskID=ea197f92-b3e9-42bc-9a7f-6d1666e01cb3) James, can you explain the flow? what did you do? (steps) Is 10.16.46.112 your rhevm? Please attach the engine log Workflow 1. Install CloudForms 1.1 from RHN. 2. Upgrade RHEV 3.0 to RHEV 3.1 using the following provided to overcome zombie process problem [test] name=Test baseurl=http://bob.eng.lab.tlv.redhat.com/builds/si25.1/ enabled=1 gpgcheck=0 3. Install patch to CloudForms 1.1 for support of RHEV 3.1 24 wget http://download.devel.redhat.com/brewroot/packages/deltacloud-core/0.5.0/11.el6cf/noarch/deltacloud-core-0.5.0-11.el6cf.noarch.rpm 25 wget http://download.devel.redhat.com/brewroot/packages/deltacloud-core/0.5.0/11.el6cf/noarch/deltacloud-core-all-0.5.0-11.el6cf.noarch.rpm 26 wget http://download.devel.redhat.com/brewroot/packages/deltacloud-core/0.5.0/11.el6cf/noarch/deltacloud-core-doc-0.5.0-11.el6cf.noarch.rpm 27 wget http://download.devel.redhat.com/brewroot/packages/deltacloud-core/0.5.0/11.el6cf/noarch/deltacloud-core-ec2-0.5.0-11.el6cf.noarch.rpm 28 wget http://download.devel.redhat.com/brewroot/packages/deltacloud-core/0.5.0/11.el6cf/noarch/deltacloud-core-rhevm-0.5.0-11.el6cf.noarch.rpm 29 wget http://download.devel.redhat.com/brewroot/packages/deltacloud-core/0.5.0/11.el6cf/noarch/deltacloud-core-vsphere-0.5.0-11.el6cf.noarch.rpm 4. Build, Push - receive failures above. engine.log coming Created attachment 666396 [details]
engine.log vdsm.log imagefactory.log
The logs are not the same logs as above but problem looks the same. The zombie killer is killing the wrong task: Engine log: 2012-12-19 16:53:18,209 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-56) Cleaning zombie tasks: Stopping async task ImportVmTemplate that started at 12/19/12 4:53 PM 2012-12-19 16:53:18,209 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-56) SPMAsyncTask::StopTask: Attempting to stop task ee5a89b1-00cd-4aec-b4b4-d812b87c8617 (Parent Command ImportVmTemplate, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters). vdsm log: Thread-5470473::INFO::2012-12-19 16:50:21,667::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: stopTask, args: ( taskID=ee5a89b1-00cd-4aec-b4b4-d812b87c8617) Ayal, I'm not familiar with the zombie killer architecture. What is the next action to resolution? Do I need to provide anything else or should I await a patch? Is there a workaround to keep the task from being killed in the meantime? zombie cleanup at upgrade isn't relevant, since the task to push is created post the upgrade here. but maybe the config isn't reverted? what does this show: rhevm-config --get AsyncTaskZombieTaskLifeInMinutes [root@rhc-rhevm ~]# rhevm-config --get AsyncTaskZombieTaskLifeInMinutes AsyncTaskZombieTaskLifeInMinutes: 0 version: general [root@rhc-rhevm ~]# James - please set this value to 3000. could it be you killed upgrade in the middle? This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux. Created attachment 671513 [details]
/var/log/rhevm/
Contents of /var/log/rhevm/ directory
Created attachment 671526 [details]
/var/log/ovirt-engine/
Contents of /var/log/rhevm/
Attached archives of /var/log/rhevm and /var/log/ovirt-engine directories above. I attempted the following to see if it resolves the issue. 1. I set AsyncTaskZombieTaskLifeInMinutes to 3000 on the rhevm [root@rhc-rhevm ~]# rhevm-config --set AsyncTaskZombieTaskLifeInMinutes=3000 [root@rhc-rhevm ~]# rhevm-config --get AsyncTaskZombieTaskLifeInMinutes AsyncTaskZombieTaskLifeInMinutes: 3000 version: general 2. On the Cloud Engine server I attempted to repush the image. The result is the same (failure). Here is the imagefactory.log. 2013-01-02 11:06:41,105 DEBUG paste.httpserver.ThreadPool thread(MainThread) Message: Added task (0 tasks queued) 2013-01-02 11:06:41,150 DEBUG imgfac.BuildDispatcher.BuildDispatcher thread(worker 5) Message: Testing provider for XML: syntax error: line 1, column 0 2013-01-02 11:06:41,151 DEBUG imgfac.BuildDispatcher.BuildDispatcher thread(worker 5) Message: Testing provider for JSON: No JSON object could be decoded 2013-01-02 11:06:41,151 DEBUG imgfac.ImageWarehouse.ImageWarehouse thread(worker 5) Message: Querying (http://localhost:9090/target_images/_query) with expression ($build == "b0886e8a-8afa-4cf1-8390-06182e58cd9f" && $target == "rhevm") 2013-01-02 11:06:41,155 DEBUG imgfac.ImageWarehouse.ImageWarehouse thread(worker 5) Message: Getting metadata (['template']) from http://localhost:9090/target_images/07050038-8bf7-4080-90a9-3743f93970a7 2013-01-02 11:06:41,157 DEBUG imgfac.ImageWarehouse.ImageWarehouse thread(worker 5) Message: Created Image Warehouse instance http://localhost:9090 - buckets(target_images, templates, icicles, provider_images) 2013-01-02 11:06:41,162 DEBUG imgfac.ImageWarehouse.ImageWarehouse thread(worker 5) Message: Created Image Warehouse instance http://localhost:9090 - buckets(target_images, templates, icicles, provider_images) 2013-01-02 11:06:41,163 DEBUG imgfac.BuildJob.BuildJob thread(caa39b64) Message: caa39b64-5ce1-4a85-a335-a6b5b428fb35 for rhevm about to enter None queue... 2013-01-02 11:06:41,163 DEBUG imgfac.BuildJob.BuildJob thread(caa39b64) Message: Builder (caa39b64-5ce1-4a85-a335-a6b5b428fb35) changed status from NEW to PUSHING 2013-01-02 11:06:41,163 DEBUG imgfac.BuildJob.BuildJob thread(caa39b64) Message: Builder (caa39b64-5ce1-4a85-a335-a6b5b428fb35) changed percent complete from 0 to 0 2013-01-02 11:06:41,164 DEBUG imgfac.BuildDispatcher.BuildDispatcher thread(caa39b64) Message: Testing provider for XML: syntax error: line 1, column 0 2013-01-02 11:06:41,164 DEBUG imgfac.BuildDispatcher.BuildDispatcher thread(caa39b64) Message: Testing provider for JSON: No JSON object could be decoded 2013-01-02 11:06:41,165 DEBUG imgfac.builders.BaseBuilder.RHEL6_rhevm_Builder thread(caa39b64) Message: Image file /var/lib/imagefactory/images/rhevm-image-07050038-8bf7-4080-90a9-3743f93970a7.dsk already present - skipping warehouse download 2013-01-02 11:06:41,166 DEBUG imgfac.builders.BaseBuilder.RHEL6_rhevm_Builder thread(caa39b64) Message: Produced provider json: { "apipass": "REDACTED", "apiurl": "https://rhc-rhevm.lab.eng.bos.redhat.com:8443/api", "apiuser": "admin@internal", "cluster": "_any_", "image": "/tmp/caa39b64-5ce1-4a85-a335-a6b5b428fb35", "name": "rhevm-default", "nfsdir": "/mnt/rhevm-nfs", "nfshost": "192.168.10.11", "nfspath": "/storage/rhev_export", "target": "rhevm", "timeout": 1800 } 2013-01-02 11:06:41,167 DEBUG imgfac.builders.BaseBuilder.RHEL6_rhevm_Builder thread(caa39b64) Message: Executing external RHEV-M push command (['/usr/bin/dc-rhev-image', '/tmp/tmpEJzny2']) 2013-01-02 11:06:41,619 DEBUG paste.httpserver.ThreadPool thread(MainThread) Message: Added task (0 tasks queued) 2013-01-02 11:07:13,131 DEBUG imgfac.builders.BaseBuilder.RHEL6_rhevm_Builder thread(caa39b64) Message: Exception caught in ImageFactory 2013-01-02 11:07:13,131 DEBUG imgfac.builders.BaseBuilder.RHEL6_rhevm_Builder thread(caa39b64) Message: Traceback (most recent call last): File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_rhevm_Builder.py", line 198, in push_image self.rhevm_push_image_upload(target_image_id, provider, credentials) File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_rhevm_Builder.py", line 284, in rhevm_push_image_upload (stdout, stderr, retcode) = subprocess_check_output(rhevm_push_command) File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_rhevm_Builder.py", line 47, in subprocess_check_output raise ImageFactoryException("'%s' failed(%d): %s\nstdout: %s" % (cmd, retcode, stderr, stdout)) ImageFactoryException: '/usr/bin/dc-rhev-image /tmp/tmpEJzny2' failed(1): None stdout: ERROR import `failed org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = b9db8bcd-8e6b-4fc7-a296-b6f14b487694' import cluster 17559862-0325-11e2-8541-5254001ef55e import target 2c5c14f2-6e6b-4a3f-8c8a-6fda96d96e44 import href /api/storagedomains/580e4724-0002-451a-9ff2-e24e07518993/templates/b8c2daac-a1d6-4593-b3eb-0753d2c36659/import/b9db8bcd-8e6b-4fc7-a296-b6f14b487694 2013-01-02 11:07:13,131 DEBUG imgfac.BuildJob.BuildJob thread(caa39b64) Message: Builder (caa39b64-5ce1-4a85-a335-a6b5b428fb35) changed status from PUSHING to FAILED 2013-01-02 11:07:13,132 DEBUG imgfac.BuildJob.BuildJob thread(caa39b64) Message: caa39b64-5ce1-4a85-a335-a6b5b428fb35 for rhevm about to exit None queue... 2013-01-02 11:07:13,230 DEBUG paste.httpserver.ThreadPool thread(MainThread) Message: Added task (0 tasks queued) 2013-01-02 11:07:43,248 DEBUG paste.httpserver.ThreadPool thread(MainThread) Message: Added task (0 tasks queued) 2013-01-02 11:08:13,042 DEBUG paste.httpserver.ThreadPool thread(MainThread) Message: Added task (0 tasks queued) 2013-01-02 11:08:43,140 DEBUG paste.httpserver.ThreadPool thread(MainThread) Message: Added task (0 tasks queued) For what it's worth I can see the image is pushed to the export domain and begins importing as a template, then it appears to be deleted from the template section and remains in the export domain. If I perform an import on the template in the export domain manually via the web UI it also fails in the same manner. In the events tab of the Web UI I see: Failed to complete copy of Template 905c2aa0-f11d-43a9-8c12-a236e68682fd to Domain <UNKNOWN>. It seems that <UNKNOWN> might be a problem. When I select to import the template I select "Data", the name of my data storage domain. I'm not sure why it says <UNKNOWN>. after restarting ovirt-engine service (I mistakenly restart jbossas instead) the image push is successful. The problem lies in the work flow. During one of the upgrade attempts, the process was stopped in the middle (either by Ctrl+C, a kill or some other system event), leaving timeout value at 0, wrongly. All the consequent runs are working as designed, this switching to 0 and returning to the original value, which is again, 0. To resolve the problem on the system, it is enough to set the timeout to its original value 3000 and rerun the upgrade. closing this one, since it looks like a corner case. please reopen if it comes back from the field and felt like we should cover this corner case. |
Created attachment 915649 [details] Comment (This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).