Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 888833

Summary: After upgrade, AsyncTaskZombieTaskLifeInMinutes remains 0 and not reverted to default value
Product: Red Hat Enterprise Virtualization Manager Reporter: james labocki <jlabocki>
Component: ovirt-engineAssignee: Moran Goldboim <mgoldboi>
Status: CLOSED WONTFIX QA Contact: Haim <hateya>
Severity: high Docs Contact:
Priority: urgent    
Version: 3.1.0CC: abaron, alourie, bazulay, dajohnso, dyasny, gfidente, iheim, lpeer, mhuth, Rhev-m-bugs, sgrinber, yeylon, ykaul
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: integration
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-01-30 10:04:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
engine.log vdsm.log imagefactory.log
none
/var/log/rhevm/
none
/var/log/ovirt-engine/
none
Comment none

Description james labocki 2012-12-19 15:34:46 UTC
Created attachment 915649 [details]
Comment

(This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).

Comment 2 Giulio Fidente 2012-12-19 19:48:07 UTC
thanks for reporting this; from a first look at the logs the error seems to be caused by a problem on the hypervisor during image copy from the import domain (as per vdsm.log):

ea197f92-b3e9-42bc-9a7f-6d1666e01cb3::ERROR::2012-12-19 09:07:51,697::task::868::TaskManager.Task::(_setError) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 876, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 303, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/spm.py", line 115, in run
    return self.func(*args, **kwargs)
  File "/usr/share/vdsm/storage/spm.py", line 743, in copyImage
    volType, volFormat, preallocate, postZero, force)
  File "/usr/share/vdsm/storage/image.py", line 817, in copy
    raise se.CopyImageError("src image=%s, dst image=%s: msg=%s" % (srcImgUUID, dstImgUUID, str(e)))
CopyImageError: low level Image copy failed: ('src image=4c62079b-2112-4b08-ac63-dc9b33254c4d, dst image=4c62079b-2112-4b08-ac63-dc9b33254c4d: msg=Action was stopped: ()',)

I think this should be reassigned against the vdsm component in RHEL or to RHEV, cause imagefactory seems to be 'correctly' reporting a problem on the backend and push failure

Comment 3 Yaniv Kaul 2012-12-19 20:02:34 UTC
Dup of https://bugzilla.redhat.com/show_bug.cgi?id=803724 ? Can you describe your RHEV env. (data domain storage type, etc.)

Please attach complete vdsm.log (compressed) to the bug. It's impossible to look at them as a huge comment.

Comment 5 Ayal Baron 2012-12-19 21:14:40 UTC
(In reply to comment #3)
> Dup of https://bugzilla.redhat.com/show_bug.cgi?id=803724 ? Can you describe
> your RHEV env. (data domain storage type, etc.)

Not at all a dup.
The task was deliberately stopped:
Thread-5448640::INFO::2012-12-19 09:07:51,035::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: getAllTasksStatuses, Return response: {'status': {'message': 'OK', 'code': 0}, 'allTasksStatus': {'ea197f92-b3e9-42bc-9a7f-6d1666e01cb3': {'code': 0, 'message': 'Task is initializing', 'taskState': 'running', 'taskResult': '', 'taskID': 'ea197f92-b3e9-42bc-9a7f-6d1666e01cb3'}}}
Thread-5448641::DEBUG::2012-12-19 09:07:51,057::clientIF::239::Storage.Dispatcher.Protect::(wrapper) [10.16.46.112]
Thread-5448641::INFO::2012-12-19 09:07:51,058::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: stopTask, args: ( taskID=ea197f92-b3e9-42bc-9a7f-6d1666e01cb3)

James, can you explain the flow? what did you do? (steps)
Is 10.16.46.112 your rhevm?

Please attach the engine log

Comment 7 james labocki 2012-12-19 21:58:44 UTC
Created attachment 666396 [details]
engine.log vdsm.log imagefactory.log

Comment 8 Ayal Baron 2012-12-19 22:59:07 UTC
The logs are not the same logs as above but problem looks the same.
The zombie killer is killing the wrong task:

Engine log:
2012-12-19 16:53:18,209 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-56) Cleaning zombie tasks: Stopping async task ImportVmTemplate that started at 12/19/12 4:53 PM
2012-12-19 16:53:18,209 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-56) SPMAsyncTask::StopTask: Attempting to stop task ee5a89b1-00cd-4aec-b4b4-d812b87c8617 (Parent Command ImportVmTemplate, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters).

vdsm log:
Thread-5470473::INFO::2012-12-19 16:50:21,667::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: stopTask, args: ( taskID=ee5a89b1-00cd-4aec-b4b4-d812b87c8617)

Comment 9 james labocki 2012-12-20 00:01:27 UTC
Ayal, I'm not familiar with the zombie killer architecture. What is the next action to resolution? Do I need to provide anything else or should I await a patch? Is there a workaround to keep the task from being killed in the meantime?

Comment 11 Itamar Heim 2012-12-20 07:23:40 UTC
zombie cleanup at upgrade isn't relevant, since the task to push is created post the upgrade here.
but maybe the config isn't reverted?
what does this show:
rhevm-config --get AsyncTaskZombieTaskLifeInMinutes

Comment 12 james labocki 2012-12-20 18:42:34 UTC
[root@rhc-rhevm ~]# rhevm-config --get AsyncTaskZombieTaskLifeInMinutes
AsyncTaskZombieTaskLifeInMinutes: 0 version: general
[root@rhc-rhevm ~]#

Comment 14 Itamar Heim 2012-12-23 17:08:22 UTC
James - please set this value to 3000.
could it be you killed upgrade in the middle?

Comment 15 RHEL Program Management 2012-12-27 06:47:21 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 18 james labocki 2013-01-02 15:59:59 UTC
Created attachment 671513 [details]
/var/log/rhevm/

Contents of /var/log/rhevm/ directory

Comment 19 james labocki 2013-01-02 16:02:54 UTC
Created attachment 671526 [details]
/var/log/ovirt-engine/

Contents of /var/log/rhevm/

Comment 20 james labocki 2013-01-02 16:03:48 UTC
Attached archives of /var/log/rhevm and /var/log/ovirt-engine directories above.

Comment 21 james labocki 2013-01-02 16:12:42 UTC
I attempted the following to see if it resolves the issue.

1. I set AsyncTaskZombieTaskLifeInMinutes to 3000 on the rhevm

[root@rhc-rhevm ~]# rhevm-config --set AsyncTaskZombieTaskLifeInMinutes=3000

[root@rhc-rhevm ~]# rhevm-config --get AsyncTaskZombieTaskLifeInMinutes
AsyncTaskZombieTaskLifeInMinutes: 3000 version: general


2. On the Cloud Engine server I attempted to repush the image. The result is the same (failure). Here is the imagefactory.log.

2013-01-02 11:06:41,105 DEBUG paste.httpserver.ThreadPool thread(MainThread) Message: Added task (0 tasks queued)
2013-01-02 11:06:41,150 DEBUG imgfac.BuildDispatcher.BuildDispatcher thread(worker 5) Message: Testing provider for XML: syntax error: line 1, column 0
2013-01-02 11:06:41,151 DEBUG imgfac.BuildDispatcher.BuildDispatcher thread(worker 5) Message: Testing provider for JSON: No JSON object could be decoded
2013-01-02 11:06:41,151 DEBUG imgfac.ImageWarehouse.ImageWarehouse thread(worker 5) Message: Querying (http://localhost:9090/target_images/_query) with expression ($build == "b0886e8a-8afa-4cf1-8390-06182e58cd9f" && $target == "rhevm")
2013-01-02 11:06:41,155 DEBUG imgfac.ImageWarehouse.ImageWarehouse thread(worker 5) Message: Getting metadata (['template']) from http://localhost:9090/target_images/07050038-8bf7-4080-90a9-3743f93970a7
2013-01-02 11:06:41,157 DEBUG imgfac.ImageWarehouse.ImageWarehouse thread(worker 5) Message: Created Image Warehouse instance http://localhost:9090 - buckets(target_images, templates, icicles, provider_images)
2013-01-02 11:06:41,162 DEBUG imgfac.ImageWarehouse.ImageWarehouse thread(worker 5) Message: Created Image Warehouse instance http://localhost:9090 - buckets(target_images, templates, icicles, provider_images)
2013-01-02 11:06:41,163 DEBUG imgfac.BuildJob.BuildJob thread(caa39b64) Message: caa39b64-5ce1-4a85-a335-a6b5b428fb35 for rhevm about to enter None queue...
2013-01-02 11:06:41,163 DEBUG imgfac.BuildJob.BuildJob thread(caa39b64) Message: Builder (caa39b64-5ce1-4a85-a335-a6b5b428fb35) changed status from NEW to PUSHING
2013-01-02 11:06:41,163 DEBUG imgfac.BuildJob.BuildJob thread(caa39b64) Message: Builder (caa39b64-5ce1-4a85-a335-a6b5b428fb35) changed percent complete from 0 to 0
2013-01-02 11:06:41,164 DEBUG imgfac.BuildDispatcher.BuildDispatcher thread(caa39b64) Message: Testing provider for XML: syntax error: line 1, column 0
2013-01-02 11:06:41,164 DEBUG imgfac.BuildDispatcher.BuildDispatcher thread(caa39b64) Message: Testing provider for JSON: No JSON object could be decoded
2013-01-02 11:06:41,165 DEBUG imgfac.builders.BaseBuilder.RHEL6_rhevm_Builder thread(caa39b64) Message: Image file /var/lib/imagefactory/images/rhevm-image-07050038-8bf7-4080-90a9-3743f93970a7.dsk already present - skipping warehouse download
2013-01-02 11:06:41,166 DEBUG imgfac.builders.BaseBuilder.RHEL6_rhevm_Builder thread(caa39b64) Message: Produced provider json: 
{
    "apipass": "REDACTED", 
    "apiurl": "https://rhc-rhevm.lab.eng.bos.redhat.com:8443/api", 
    "apiuser": "admin@internal", 
    "cluster": "_any_", 
    "image": "/tmp/caa39b64-5ce1-4a85-a335-a6b5b428fb35", 
    "name": "rhevm-default", 
    "nfsdir": "/mnt/rhevm-nfs", 
    "nfshost": "192.168.10.11", 
    "nfspath": "/storage/rhev_export", 
    "target": "rhevm", 
    "timeout": 1800
}
2013-01-02 11:06:41,167 DEBUG imgfac.builders.BaseBuilder.RHEL6_rhevm_Builder thread(caa39b64) Message: Executing external RHEV-M push command (['/usr/bin/dc-rhev-image', '/tmp/tmpEJzny2'])
2013-01-02 11:06:41,619 DEBUG paste.httpserver.ThreadPool thread(MainThread) Message: Added task (0 tasks queued)
2013-01-02 11:07:13,131 DEBUG imgfac.builders.BaseBuilder.RHEL6_rhevm_Builder thread(caa39b64) Message: Exception caught in ImageFactory
2013-01-02 11:07:13,131 DEBUG imgfac.builders.BaseBuilder.RHEL6_rhevm_Builder thread(caa39b64) Message: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_rhevm_Builder.py", line 198, in push_image
    self.rhevm_push_image_upload(target_image_id, provider, credentials)
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_rhevm_Builder.py", line 284, in rhevm_push_image_upload
    (stdout, stderr, retcode) = subprocess_check_output(rhevm_push_command)
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_rhevm_Builder.py", line 47, in subprocess_check_output
    raise ImageFactoryException("'%s' failed(%d): %s\nstdout: %s" % (cmd, retcode, stderr, stdout))
ImageFactoryException: '/usr/bin/dc-rhev-image /tmp/tmpEJzny2' failed(1): None
stdout: ERROR import `failed org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = b9db8bcd-8e6b-4fc7-a296-b6f14b487694'
import cluster 17559862-0325-11e2-8541-5254001ef55e
import target 2c5c14f2-6e6b-4a3f-8c8a-6fda96d96e44
import href /api/storagedomains/580e4724-0002-451a-9ff2-e24e07518993/templates/b8c2daac-a1d6-4593-b3eb-0753d2c36659/import/b9db8bcd-8e6b-4fc7-a296-b6f14b487694


2013-01-02 11:07:13,131 DEBUG imgfac.BuildJob.BuildJob thread(caa39b64) Message: Builder (caa39b64-5ce1-4a85-a335-a6b5b428fb35) changed status from PUSHING to FAILED
2013-01-02 11:07:13,132 DEBUG imgfac.BuildJob.BuildJob thread(caa39b64) Message: caa39b64-5ce1-4a85-a335-a6b5b428fb35 for rhevm about to exit None queue...
2013-01-02 11:07:13,230 DEBUG paste.httpserver.ThreadPool thread(MainThread) Message: Added task (0 tasks queued)
2013-01-02 11:07:43,248 DEBUG paste.httpserver.ThreadPool thread(MainThread) Message: Added task (0 tasks queued)
2013-01-02 11:08:13,042 DEBUG paste.httpserver.ThreadPool thread(MainThread) Message: Added task (0 tasks queued)
2013-01-02 11:08:43,140 DEBUG paste.httpserver.ThreadPool thread(MainThread) Message: Added task (0 tasks queued)

Comment 22 james labocki 2013-01-02 16:18:03 UTC
For what it's worth I can see the image is pushed to the export domain and begins importing as a template, then it appears to be deleted from the template section and remains in the export domain. 

If I perform an import on the template in the export domain manually via the web UI it also fails in the same manner. In the events tab of the Web UI I see:

Failed to complete copy of Template 905c2aa0-f11d-43a9-8c12-a236e68682fd to Domain <UNKNOWN>.

It seems that <UNKNOWN> might be a problem. When I select to import the template I select "Data", the name of my data storage domain. I'm not sure why it says <UNKNOWN>.

Comment 23 james labocki 2013-01-02 18:32:29 UTC
after restarting ovirt-engine service (I mistakenly restart jbossas instead) the image push is successful.

Comment 24 Alex Lourie 2013-01-08 10:04:15 UTC
The problem lies in the work flow.

During one of the upgrade attempts, the process was stopped in the middle (either by Ctrl+C, a kill or some other system event), leaving timeout value at 0, wrongly. All the consequent runs are working as designed, this switching to 0 and returning to the original value, which is again, 0.

To resolve the problem on the system, it is enough to set the timeout to its original value 3000 and rerun the upgrade.

Comment 27 Moran Goldboim 2013-01-30 10:04:41 UTC
closing this one, since it looks like a corner case.
please reopen if it comes back from the field and felt like we should cover this corner case.