Bug 1656399

Summary: Failed to deactivate host by API with error SpmStopVDSCommand::Not stopping SPM on vds.
Product: [oVirt] ovirt-engine Reporter: Polina <pagranat>
Component: BLL.VirtAssignee: Ahmad Khiet <akhiet>
Status: CLOSED WORKSFORME QA Contact: Polina <pagranat>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.2.7CC: bugs, pagranat, rbarry, tnisan
Target Milestone: ovirt-4.3.5Keywords: Automation
Target Release: ---Flags: rule-engine: ovirt-4.3+
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-02 12:42:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
engine & vdsm logs
none
engine.log for the automation run none

Description Polina 2018-12-05 12:31:07 UTC
Created attachment 1511662 [details]
engine & vdsm logs

Description of problem: API deactivate host Post request failed with 
ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (default task-17) [hosts_syncAction_c88085a1-1fa6-4f1b] SpmStopVDSCommand::Not stopping SPM on vds 'host_mixed_2', pool id '55b614dc-f75d-11e8-80fd-001a4a168bfc' as there are uncleared tasks 'Task '791e2033-3b6a-469b-90c6-966ed6691d5e', status 'finished''

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.7.5-0.1.el7ev.noarch
vdsm	vdsm-4.20.43-1.el7ev.x86_64
libvirt	libvirt-4.5.0-10.el7_6.3.x86_64
qemu-img-rhev-2.12.0-18.el7_6.1.x86_64

How reproducible: once in automation run https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.2-ge-runner-tier1/272/

Steps to Reproduce:
1. Send deactivate host action

Action request content is --  url:/ovirt-engine/api/hosts/5c19ab08-36c7-4efe-bec2-47b75ccec856/deactivate body:<action>
    <async>false</async>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
</action>

Actual results:
deactivation fails with ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (default task-17) [hosts_syncAction_c88085a1-1fa6-4f1b] SpmStopVDSCommand::Not stopping SPM on vds 'host_mixed_2', pool id '55b614dc-f75d-11e8-80fd-001a4a168bfc' as there are uncleared tasks 'Task '791e2033-3b6a-469b-90c6-966ed6691d5e', status 'finished''

Expected results: host is deactivated

Additional info:for qe -  rhevmtests/compute/sla/multiple_pinning/multiple_pinning_test.py (TestMultiplePinning01)

Comment 1 Raz Tamir 2018-12-05 14:33:55 UTC
Raising severity as this bug fails 10 cases in our last tier1 cycle.

Also adding this piece from the engine.log that looks very suspicious:

2018-12-04 08:05:31,284+02 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-34) [63214c63] SPMAsyncTask::PollTask: Polling task '791e2033-3b6a-469b-90c6-966ed6691d5e' (Parent Command 'Unknown', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'success'.
2018-12-04 08:05:31,284+02 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-34) [63214c63] BaseAsyncTask::logEndTaskFailure: Task '791e2033-3b6a-469b-90c6-966ed6691d5e' (Parent Command 'Unknown', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'success'
-- Message: '',
-- Exception: '[null]'

Comment 2 Michal Skrivanek 2018-12-06 12:50:19 UTC
the engine.log is not complete and I suppose we'll need to see the whole thing, but it anyway looks like some pending/stuck/stale storage operation, reassigning to Storage

Comment 3 Tal Nisan 2018-12-10 09:17:13 UTC
It seems like you have a pending task in which case the SPM should indeed not be deactivated, any actions you ran before trying to deactivate the host?

Comment 4 Polina 2018-12-10 13:18:21 UTC
Created attachment 1513078 [details]
engine.log for the automation run

The full engine.log from the run is attached.

Nothing specific was done to the host before. The test starts on the environment with three hosts in Up state.
scenario:
Start VM golden_env_mixed_virtio_1_0
Get the VM's host  - host_mixed_1 , Stop VM.
Deactivate the host host_mixed_1. with {'expected_status': 'maintenance'}
Start VM golden_env_mixed_virtio_1_0
Get the VM golden_env_mixed_virtio_1_0 host - host_mixed_2.
Stop vm golden_env_mixed_virtio_1_0.
Deactivate the host host_mixed_2. with {'expected_status': 'maintenance'} 

05:34:27 2018-12-04 05:34:27,047 ERROR Result: FAILED

Comment 5 Tal Nisan 2018-12-10 13:23:24 UTC
Can you check in the database or on the host which tasks was running on the SPM?

Comment 6 Polina 2018-12-10 14:48:16 UTC
This environment was changed already. So, I can't run 'vdsm-client Host getAllTasks' on the host. Could be this info extracted from the vdsm.logs?

Comment 7 Sandro Bonazzola 2019-01-28 09:37:08 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 8 Ahmad Khiet 2019-02-13 12:24:34 UTC
need more logs for vdsm and engine, the command for the task is not in the logs.

Comment 9 Polina 2019-02-14 08:12:10 UTC
The deactivation action happens at 2018-12-04 05:34:26
and the attached logs include this period of time.
   engine.log starts at 2018-12-04 04:55:25 - ends at 2018-12-04 06:01:01
   vdsm.log 2018-12-04 04:55:25 - 2018-12-04 06:01:01

Since this run was in December I have no more logs to add since the infra team removes the old jobs. 
If I see such incident again in further runs I'll add additional info to this BZ.

Comment 10 Ahmad Khiet 2019-05-29 06:41:23 UTC
(In reply to Polina from comment #9)
> The deactivation action happens at 2018-12-04 05:34:26
> and the attached logs include this period of time.
>    engine.log starts at 2018-12-04 04:55:25 - ends at 2018-12-04 06:01:01
>    vdsm.log 2018-12-04 04:55:25 - 2018-12-04 06:01:01
> 
> Since this run was in December I have no more logs to add since the infra
> team removes the old jobs. 
> If I see such incident again in further runs I'll add additional info to
> this BZ.

@Polin do you think it's the same as?
https://bugzilla.redhat.com/show_bug.cgi?id=1664045

Comment 11 Polina 2019-05-29 07:28:06 UTC
not sure . I think that Bz 1664045 talks about "no error to the user".
Here the deactivation fails with error reported to logs

Comment 12 Ahmad Khiet 2019-05-29 14:54:49 UTC
followed reproducing steps, and I got no error.

when the virtual machines where running, the result was that there virtual machines running.
when I shut down virtual machines, the action has been completed successfully.

from the attached logs, we can get enough info about the running task on SPM host. if its long task or stuck task!
 


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<action>
    <async>false</async>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
    <host href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412" id="f8c911e3-e205-4afd-9f06-2fb51b3ef412">
        <actions>
            <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/refresh" rel="refresh"/>
            <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/install" rel="install"/>
            <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/activate" rel="activate"/>
            <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/deactivate" rel="deactivate"/>
            <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/upgrade" rel="upgrade"/>
            <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/setupnetworks" rel="setupnetworks"/>
            <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/syncallnetworks" rel="syncallnetworks"/>
            <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/approve" rel="approve"/>
            <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/commitnetconfig" rel="commitnetconfig"/>
            <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/enrollcertificate" rel="enrollcertificate"/>
            <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/fence" rel="fence"/>
            <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/forceselectspm" rel="forceselectspm"/>
            <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/iscsidiscover" rel="iscsidiscover"/>
            <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/iscsilogin" rel="iscsilogin"/>
            <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/unregisteredstoragedomainsdiscover" rel="unregisteredstoragedomainsdiscover"/>
            <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/upgradecheck" rel="upgradecheck"/>
        </actions>
        <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/nics" rel="nics"/>
        <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/affinitylabels" rel="affinitylabels"/>
        <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/fenceagents" rel="fenceagents"/>
        <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/devices" rel="devices"/>
        <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/hooks" rel="hooks"/>
        <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/numanodes" rel="numanodes"/>
        <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/storage" rel="storage"/>
        <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/networkattachments" rel="networkattachments"/>
        <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/storageconnectionextensions" rel="storageconnectionextensions"/>
        <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/unmanagednetworks" rel="unmanagednetworks"/>
        <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/externalnetworkproviderconfigurations" rel="externalnetworkproviderconfigurations"/>
        <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/katelloerrata" rel="katelloerrata"/>
        <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/permissions" rel="permissions"/>
        <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/tags" rel="tags"/>
        <link href="/ovirt-engine/api/hosts/f8c911e3-e205-4afd-9f06-2fb51b3ef412/statistics" rel="statistics"/>
    </host>
    <status>complete</status>
</action>

Comment 13 Polina 2019-05-30 07:38:48 UTC
Hi Ahmad, the problem was not reproduced easily with the described steps. I just described the scenario in which it was faced once with the hope that some hint could be found in logs. If it doesn't help and we don't find what could be fixed in this area, I think we could close the issue. If it happens again in our automation runs, I'll reopen the bz and will try to get the host's tasks the 'vdsm-client Host getAllTasks'.