Created attachment 926848 [details] logs from engine and host Description of problem: I've created a storage domain and after a few minutes tried to deactivate it. The operation failed with "Operation not allowed while SPM is active" on vdsm. Then I checked and saw that there are finished downloadImageFromStream tasks that are waiting to be cleared in the tasks queue in SPM. Engine doesn't seem to poll those tasks after sending them to vdsm. Version-Release number of selected component (if applicable): ovirt-3.5 RC1 ovirt-engine-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch vdsm-4.14.13-1.el7ev.x86_64 How reproducible: 2/2 Steps to Reproduce: On a new shared 3.5 DC with 2 hosts cluster without storage domains: 1. Create a storage domain (FC in my case). Do not create any VMs or disks on it 2. After 30 minutes, try to put the storage domain to maintenance Actual results: The operation of putting the master domain to maintenance fails with "Operation not allowed while SPM is active" on vdsm. In engine.log can see that UploadStreamVDSCommand was sent to vdsm: 2014-08-14 17:32:29,322 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (DefaultQuartzScheduler_Worker-38) [513ee86] START, UploadStreamVDSCommand(HostName = green-vdsb, HostId = 24b5f42c-8567-466e-bb29-be6488324f70), log id: e6a1b97 2014-08-14 17:32:29,324 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (DefaultQuartzScheduler_Worker-38) [513ee86] -- executeVdsBrokerCommand: 2014-08-14 17:32:29,324 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (DefaultQuartzScheduler_Worker-38) [513ee86] -- parameters: spUUID=6914dd84-1f1d-44ad-8323-7f68066e4a13 sdUUID=4db95ed4-3750-450c-9784-895563dd9db1 imageGUID=71bb0bf4-89c5-43e5-a0ea-95ce22a5acc5 volUUID=ace1ba03-16e6-40e6-a1ef-f49c31ca610b size=10,240 Checked the running tasks (vdsm): [root@green-vdsb ~]# vdsClient -s 0 getAllTasks bd335386-7857-4190-b4be-cf8553f6e577 : verb = downloadImageFromStream code = 0 state = finished tag = spm result = message = 1 jobs completed successfully id = bd335386-7857-4190-b4be-cf8553f6e577 e5f8c458-fac4-4349-87a1-0022af291f52 : verb = downloadImageFromStream code = 0 state = finished tag = spm result = message = 1 jobs completed successfully id = e5f8c458-fac4-4349-87a1-0022af291f52 Async tasks table in DB is empty: [root@vm-161-52 ~]# su - postgres -c "psql -U postgres engine -c 'select * from async_tasks;'" | less -S task_id | action_type | status | result | step_id | command_id | started_at | storage_pool_id | task_type | task_parameters | task_params_class | vdsm_task_id ---------+-------------+--------+--------+---------+------------+------------+-----------------+-----------+-----------------+-------------------+-------------- (0 rows) The tasks are not being polled by engine, they are stuck on SPM Expected results: UploadStreamVDSCommand tasks should be polled by engine and should be cleared once they are finished Additional info: logs from engine and host
The tasks aren't polled by the engine because of the vdsm version isn't updated and doesn't return the created task id as required by the engine code. 2014-08-14 17:32:30,733 ERROR [org.ovirt.engine.core.bll.UploadStreamCommand] (DefaultQuartzScheduler_Worker-38) [513ee86] Command org.ovirt.engine.core.bll.U ploadStreamCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericExcepti on: VDSErrorException: response was missing the following header: Task-Id (Failed with error GeneralException and code 100) The return of the task-id in header was merged in this change: http://gerrit.ovirt.org/#/c/27997/ Elad, please update your vdsm version and try again - it should work, therefore i'm closing this bug. there are two other issues here: 1. when there are tasks on vdsm side that are completed but aren't known by the engine, they should be cleared automatically by the async task manager - elad, if you can please verify if this happens after X time or not, if not..than it's an async task manager bug (infra). 2. SpmStopOnIrsCommand always marked as succeeded even if the spm stop failed. That's what will be handled in this bug.
Liron, the vdsm version I'm using the the latest for RHEL7, from ovirt-3.5-RC1 repos provided by CI: vdsm-4.14.13-1.el7ev.x86_64
(In reply to Liron Aravot from comment #1) > The tasks aren't polled by the engine because of the vdsm version isn't > updated and doesn't return the created task id as required by the engine > code. > > 2014-08-14 17:32:30,733 ERROR > [org.ovirt.engine.core.bll.UploadStreamCommand] > (DefaultQuartzScheduler_Worker-38) [513ee86] Command > org.ovirt.engine.core.bll.U > ploadStreamCommand throw Vdc Bll exception. With error message > VdcBLLException: > org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: > VDSGenericExcepti > on: VDSErrorException: response was missing the following header: Task-Id > (Failed with error GeneralException and code 100) > > The return of the task-id in header was merged in this change: > http://gerrit.ovirt.org/#/c/27997/ > > Elad, please update your vdsm version and try again - it should work, > therefore i'm closing this bug. > > > there are two other issues here: > > 1. when there are tasks on vdsm side that are completed but aren't known by > the engine, they should be cleared automatically by the async task manager - > elad, if you can please verify if this happens after X time or not, if > not..than it's an async task manager bug (infra). After a few hours, the tasks list is full of downloadImageFromStream tasks: [root@RHEL7Server yum.repos.d]# vdsClient -s 0 getAllTasks b422f1c2-1ef3-4990-b6a7-909f40d74cdb : verb = downloadImageFromStream code = 0 state = finished tag = spm result = message = 1 jobs completed successfully id = b422f1c2-1ef3-4990-b6a7-909f40d74cdb 5b5cb243-4956-47d3-be8e-be1370af5f33 : verb = downloadImageFromStream code = 0 state = finished tag = spm result = message = 1 jobs completed successfully id = 5b5cb243-4956-47d3-be8e-be1370af5f33 a5c712bb-93b2-4ae5-a08c-dbedd5eae759 : verb = downloadImageFromStream code = 0 state = finished tag = spm result = message = 1 jobs completed successfully id = a5c712bb-93b2-4ae5-a08c-dbedd5eae759 986e4119-31bb-42e4-9197-400e624bf33c : verb = downloadImageFromStream code = 0 state = finished tag = spm result = message = 1 jobs completed successfully id = 986e4119-31bb-42e4-9197-400e624bf33c 6b2bac66-b147-4e04-a297-5b2c720180eb : verb = downloadImageFromStream code = 0 state = finished tag = spm result = message = 1 jobs completed successfully id = 6b2bac66-b147-4e04-a297-5b2c720180eb 8e27fd49-128b-46a9-9a16-ad03bb4713ff : verb = downloadImageFromStream code = 0 state = finished tag = spm result = message = 1 jobs completed successfully id = 8e27fd49-128b-46a9-9a16-ad03bb4713ff b88247c4-abc6-4d30-81fd-7d83ff8cec17 : verb = downloadImageFromStream code = 0 state = finished tag = spm result = message = 1 jobs completed successfully id = b88247c4-abc6-4d30-81fd-7d83ff8cec17 6604ea5e-4c34-4288-90e7-a9d00721ac85 : verb = downloadImageFromStream code = 0 state = finished tag = spm result = message = 1 jobs completed successfully id = 6604ea5e-4c34-4288-90e7-a9d00721ac85
If your vdsm version is up to date, i assume that we might have an issue with the version on the repo.
moving to 3.6, the fixed issue here isn't critical for 3.5 as currently it'll just save another call that we can now that will fail, if we'll need this fix to solve other bugs we can backport this one as well.
Created attachment 934046 [details] engine and vdsm logs (3.9) I'm encountering the issue with vdsm-4.16.2-1.gite8cba75.el7.x86_64 I think this is a major issue that need to be fixed ASAP since when the uploadImageToStream is stuck in the tasks queue in vdsm, there is no option to put the host in maintenance.
We looked into the logs and saw that the issue in this log is bit different. We saw that the handled issue in this bug (https://bugzilla.redhat.com/show_bug.cgi?id=1130222#c1) was solved, the other issue was opened as: https://bugzilla.redhat.com/show_bug.cgi?id=1136840
Tested the following: - Got a situation in which I have unknown tasks running on SPM (uploadImageToStream tasks that are not being polled by engine due to https://bugzilla.redhat.com/show_bug.cgi?id=1136840) - Tried to put the host to maintenance spmStop is not sent to vdsm. 2014-09-14 11:19:48,121 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-6-thread-12) [5c0b4281] SpmStopVDSCommand::Not stopping SPM on vds green-vdsb, pool id b7cf7a2a-4fde-4732-91f8-a9808ff85b93 as there are uncleared tasks Verified using rhev3.5 vt3.1
Got confused with https://bugzilla.redhat.com/show_bug.cgi?id=1130766. Steps to reproduce of this bug should be: - Have unknown task to engine running on SPM - Put master storage domain to maintenance Putting master domain to maintenance still fails with an uninformative message: 2014-09-14 15:50:55,077 ERROR [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (org.ovirt.thread.pool-6-thread-22) [617167d3] Aborting execution due to failure stopping SPM. Stop SPM failed due to null Liron, this bug was originally opened for the failure of putting the master domain to maintenance when having unknown tasks. I think this should be handled like it's being done for putting the SPM to maintenance while having unknown tasks. I'm verifying https://bugzilla.redhat.com/show_bug.cgi?id=1130766 according to comment #8
You should have another print before indicating that there are tasks (log of SpmStopCommand). The "due to null" issue in your last comment was already fixed by me a while ago in this patch and will be in 3.6 as it has no effect. http://gerrit.ovirt.org/#/c/31564/ please verify that and let me know if there's any further issue.
spmStop isn't sent to vdsm as part of trying to put the master domain to maintenance while there are unknown running tasks in SPM. Moving back to VERIFIED Used vt3.1
oVirt 3.5 has been released and should include the fix for this issue.