Bug 1130222

Summary: [engine-backend] StopSpmOnIrs - When there are "unknown" tasks the spm isn't being stopped but it's not reflected on the execution results
Product: [Retired] oVirt Reporter: Elad <ebenahar>
Component: ovirt-engine-coreAssignee: Liron Aravot <laravot>
Status: CLOSED CURRENTRELEASE QA Contact: Elad <ebenahar>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.5CC: amureini, bugs, ebenahar, ecohen, gklein, iheim, laravot, rbalakri, tnisan, yeylon
Target Milestone: ---   
Target Release: 3.5.0   
Hardware: x86_64   
OS: Unspecified   
Whiteboard: storage
Fixed In Version: ovirt-3.5.0_rc2 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1136840 (view as bug list) Environment:
Last Closed: 2014-10-17 12:42:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
logs from engine and host
none
engine and vdsm logs (3.9) none

Description Elad 2014-08-14 14:57:05 UTC
Created attachment 926848 [details]
logs from engine and host

Description of problem:
I've created a storage domain and after a few minutes tried to deactivate it. The operation failed with "Operation not allowed while SPM is active" on vdsm. Then I checked and saw that there are finished downloadImageFromStream tasks that are waiting to be cleared in the tasks queue in SPM. Engine doesn't seem to poll those tasks after sending them to vdsm. 

Version-Release number of selected component (if applicable):
ovirt-3.5 RC1
ovirt-engine-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch
vdsm-4.14.13-1.el7ev.x86_64

How reproducible:
2/2

Steps to Reproduce:
On a new shared 3.5 DC with 2 hosts cluster without storage domains:
1. Create a storage domain (FC in my case). Do not create any VMs or disks on it
2. After 30 minutes, try to put the storage domain to maintenance


Actual results:
The operation of putting the master domain to maintenance fails with "Operation not allowed while SPM is active" on vdsm. 

In engine.log can see that UploadStreamVDSCommand was sent to vdsm:

2014-08-14 17:32:29,322 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (DefaultQuartzScheduler_Worker-38) [513ee86] START, UploadStreamVDSCommand(HostName = green-vdsb, HostId = 24b5f42c-8567-466e-bb29-be6488324f70), log id: e6a1b97
2014-08-14 17:32:29,324 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (DefaultQuartzScheduler_Worker-38) [513ee86] -- executeVdsBrokerCommand:
2014-08-14 17:32:29,324 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (DefaultQuartzScheduler_Worker-38) [513ee86] -- parameters:
                spUUID=6914dd84-1f1d-44ad-8323-7f68066e4a13
                sdUUID=4db95ed4-3750-450c-9784-895563dd9db1
                imageGUID=71bb0bf4-89c5-43e5-a0ea-95ce22a5acc5
                volUUID=ace1ba03-16e6-40e6-a1ef-f49c31ca610b
                size=10,240


Checked the running tasks (vdsm):

[root@green-vdsb ~]# vdsClient -s 0 getAllTasks
bd335386-7857-4190-b4be-cf8553f6e577 :
        verb = downloadImageFromStream
        code = 0
        state = finished
        tag = spm
        result =
        message = 1 jobs completed successfully
        id = bd335386-7857-4190-b4be-cf8553f6e577
e5f8c458-fac4-4349-87a1-0022af291f52 :
        verb = downloadImageFromStream
        code = 0
        state = finished
        tag = spm
        result =
        message = 1 jobs completed successfully
        id = e5f8c458-fac4-4349-87a1-0022af291f52

Async tasks table in DB is empty:

[root@vm-161-52 ~]# su - postgres -c "psql -U postgres engine -c  'select * from async_tasks;'"  | less -S

 task_id | action_type | status | result | step_id | command_id | started_at | storage_pool_id | task_type | task_parameters | task_params_class | vdsm_task_id
---------+-------------+--------+--------+---------+------------+------------+-----------------+-----------+-----------------+-------------------+--------------
(0 rows)



The tasks are not being polled by engine, they are stuck on SPM

Expected results:
UploadStreamVDSCommand tasks should be polled by engine and should be cleared once they are finished

Additional info:
logs from engine and host

Comment 1 Liron Aravot 2014-08-17 12:47:26 UTC
The tasks aren't polled by the engine because of the vdsm version isn't updated and doesn't return the created task id as required by the engine code.

2014-08-14 17:32:30,733 ERROR [org.ovirt.engine.core.bll.UploadStreamCommand] (DefaultQuartzScheduler_Worker-38) [513ee86] Command org.ovirt.engine.core.bll.U
ploadStreamCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericExcepti
on: VDSErrorException: response was missing the following header: Task-Id (Failed with error GeneralException and code 100)

The return of the task-id in header was merged in this change:
http://gerrit.ovirt.org/#/c/27997/

Elad, please update your vdsm version and try again - it should work, therefore i'm closing this bug.


there are two other issues here:

1. when there are tasks on vdsm side that are completed but aren't known by the engine, they should be cleared automatically by the async task manager - elad, if you can please verify if this happens after X time or not, if not..than it's an async task manager bug (infra).

2. SpmStopOnIrsCommand always marked as succeeded even if the spm stop failed. That's what will be handled in this bug.

Comment 2 Elad 2014-08-17 13:12:16 UTC
Liron, the vdsm version I'm using the the latest for RHEL7, from ovirt-3.5-RC1 repos provided by CI:

vdsm-4.14.13-1.el7ev.x86_64

Comment 3 Elad 2014-08-17 13:17:50 UTC
(In reply to Liron Aravot from comment #1)
> The tasks aren't polled by the engine because of the vdsm version isn't
> updated and doesn't return the created task id as required by the engine
> code.
> 
> 2014-08-14 17:32:30,733 ERROR
> [org.ovirt.engine.core.bll.UploadStreamCommand]
> (DefaultQuartzScheduler_Worker-38) [513ee86] Command
> org.ovirt.engine.core.bll.U
> ploadStreamCommand throw Vdc Bll exception. With error message
> VdcBLLException:
> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException:
> VDSGenericExcepti
> on: VDSErrorException: response was missing the following header: Task-Id
> (Failed with error GeneralException and code 100)
> 
> The return of the task-id in header was merged in this change:
> http://gerrit.ovirt.org/#/c/27997/
> 
> Elad, please update your vdsm version and try again - it should work,
> therefore i'm closing this bug.
> 
> 
> there are two other issues here:
> 
> 1. when there are tasks on vdsm side that are completed but aren't known by
> the engine, they should be cleared automatically by the async task manager -
> elad, if you can please verify if this happens after X time or not, if
> not..than it's an async task manager bug (infra).
After a few hours, the tasks list is full of downloadImageFromStream tasks:

[root@RHEL7Server yum.repos.d]# vdsClient -s 0 getAllTasks 
b422f1c2-1ef3-4990-b6a7-909f40d74cdb :                                                                                                                                                                               
        verb = downloadImageFromStream                                                                                                                                                                               
        code = 0                                                                                                                                                                                                     
        state = finished                                                                                                                                                                                             
        tag = spm                                                                                                                                                                                                    
        result =
        message = 1 jobs completed successfully
        id = b422f1c2-1ef3-4990-b6a7-909f40d74cdb
5b5cb243-4956-47d3-be8e-be1370af5f33 :
        verb = downloadImageFromStream
        code = 0
        state = finished
        tag = spm
        result =
        message = 1 jobs completed successfully
        id = 5b5cb243-4956-47d3-be8e-be1370af5f33
a5c712bb-93b2-4ae5-a08c-dbedd5eae759 :
        verb = downloadImageFromStream
        code = 0
        state = finished
        tag = spm
        result =
        message = 1 jobs completed successfully
        id = a5c712bb-93b2-4ae5-a08c-dbedd5eae759
986e4119-31bb-42e4-9197-400e624bf33c :
        verb = downloadImageFromStream
        code = 0
        state = finished
        tag = spm
        result =
        message = 1 jobs completed successfully
        id = 986e4119-31bb-42e4-9197-400e624bf33c
6b2bac66-b147-4e04-a297-5b2c720180eb :
        verb = downloadImageFromStream
        code = 0
        state = finished
        tag = spm
        result =
        message = 1 jobs completed successfully
        id = 6b2bac66-b147-4e04-a297-5b2c720180eb
8e27fd49-128b-46a9-9a16-ad03bb4713ff :
        verb = downloadImageFromStream
        code = 0
        state = finished
        tag = spm
        result =
        message = 1 jobs completed successfully
        id = 8e27fd49-128b-46a9-9a16-ad03bb4713ff
b88247c4-abc6-4d30-81fd-7d83ff8cec17 :
        verb = downloadImageFromStream
        code = 0
        state = finished
        tag = spm
        result =
        message = 1 jobs completed successfully
        id = b88247c4-abc6-4d30-81fd-7d83ff8cec17
6604ea5e-4c34-4288-90e7-a9d00721ac85 :
        verb = downloadImageFromStream
        code = 0
        state = finished
        tag = spm
        result =
        message = 1 jobs completed successfully
        id = 6604ea5e-4c34-4288-90e7-a9d00721ac85

Comment 4 Liron Aravot 2014-08-17 13:28:30 UTC
If your vdsm version is up to date, i assume that we might have an issue with the version on the repo.

Comment 5 Liron Aravot 2014-08-17 13:50:59 UTC
moving to 3.6, the fixed issue here isn't critical for 3.5 as currently it'll just save another call that we can now that will fail, if we'll need this fix to solve other bugs we can backport this one as well.

Comment 6 Elad 2014-09-03 11:19:49 UTC
Created attachment 934046 [details]
engine and vdsm logs (3.9)

I'm encountering the issue with vdsm-4.16.2-1.gite8cba75.el7.x86_64
I think this is a major issue that need to be fixed ASAP since when the uploadImageToStream is stuck in the tasks queue in vdsm, there is no option to put the host in maintenance.

Comment 7 Liron Aravot 2014-09-03 12:11:45 UTC
We looked into the logs and saw that the issue in this log is bit different. We saw that the handled issue in this bug (https://bugzilla.redhat.com/show_bug.cgi?id=1130222#c1) was solved, the other issue was opened as:

https://bugzilla.redhat.com/show_bug.cgi?id=1136840

Comment 8 Elad 2014-09-14 08:32:34 UTC
Tested the following:
- Got a situation in which I have unknown tasks running on SPM (uploadImageToStream tasks that are not being polled by engine due to https://bugzilla.redhat.com/show_bug.cgi?id=1136840)
- Tried to put the host to maintenance

spmStop is not sent to vdsm. 

2014-09-14 11:19:48,121 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-6-thread-12) [5c0b4281] SpmStopVDSCommand::Not stopping SPM on vds green-vdsb, pool id b7cf7a2a-4fde-4732-91f8-a9808ff85b93 as there are uncleared tasks


Verified using rhev3.5 vt3.1

Comment 9 Elad 2014-09-14 13:03:21 UTC
Got confused with https://bugzilla.redhat.com/show_bug.cgi?id=1130766.
Steps to reproduce of this bug should be:
- Have unknown task to engine running on SPM
- Put master storage domain to maintenance


Putting master domain to maintenance still fails with an uninformative message:

2014-09-14 15:50:55,077 ERROR [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (org.ovirt.thread.pool-6-thread-22) [617167d3] Aborting execution due to failure stopping SPM. Stop SPM failed due to null


Liron, this bug was originally opened for the failure of putting the master domain to maintenance when having unknown tasks. I think this should be handled like it's being done for putting the SPM to maintenance while having unknown tasks. 



I'm verifying  https://bugzilla.redhat.com/show_bug.cgi?id=1130766 according to comment #8

Comment 10 Liron Aravot 2014-09-14 13:17:21 UTC
You should have another print before indicating that there are tasks (log of SpmStopCommand).

The "due to null" issue in your last comment was already fixed by me a while ago in this patch and will be in 3.6 as it has no effect.


http://gerrit.ovirt.org/#/c/31564/

please verify that and let me know if there's any further issue.

Comment 11 Elad 2014-09-14 13:29:53 UTC
spmStop isn't sent to vdsm as part of trying to put the master domain to maintenance while there are unknown running tasks in SPM.

Moving back to VERIFIED 

Used vt3.1

Comment 12 Sandro Bonazzola 2014-10-17 12:42:04 UTC
oVirt 3.5 has been released and should include the fix for this issue.