Bug 916652 - RHEVM Backend : "asynchronous running task" remains after remove Vms
Summary: RHEVM Backend : "asynchronous running task" remains after remove Vms
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.1.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.4.0
Assignee: Shahar Havivi
QA Contact: Barak Dagan
URL:
Whiteboard: virt
Duplicates: 918588
Depends On: 915809
Blocks:
 
Reported: 2013-02-28 15:40 UTC by Barak Dagan
Modified: 2020-06-11 12:37 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-02-28 07:56:04 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:


Attachments:
Test logs (1.62 MB, application/x-bzip2)
2013-02-28 15:40 UTC, Barak Dagan

Description Barak Dagan 2013-02-28 15:40:22 UTC
Created attachment 703917
Test logs

Description of problem:
http://jenkins.qa.lab.tlv.redhat.com:8080/view/Core+Tools/view/3.1/job/3.1-automation_restapi_reg_vms_rhevh/73/testReport/junit/Hosts/134-Set%20the%20second%20host%20to%20maintenance,%20for%20the%20SPM%20selection/Set_the_second_host_to_maintenance__for_the_SPM_selection/

VMs are not fully removed when the REST API returns the delete acknowledgement:

DEBUG - DELETE request content is --  url:/api/vms/478e7dbc-e4b0-4b79-8a0d-0942c176643a
DEBUG - Request response time: 0.100
DEBUG - Response body for DELETE request is:  
DEBUG - Response code is valid: [200, 202, 204] 
DEBUG - SEARCH request content is --  url:https://localhost/api/vms?search=name%3D%22restvm_nic%22
DEBUG - Request response time: 0.030
DEBUG - Response body for QUERY request is: 
<vms/>

Deactivating the host then fails, which causes the automated test to fail:

DEBUG - Action request content is --  url:/api/hosts/131f8dae-805f-11e2-8b7c-001a4a169774/deactivate body:<action>
    <async>false</async>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
</action>
 
DEBUG - Response body for action request is: 
<action>
    <async>false</async>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
    <status>
        <state>failed</state>
    </status>
    <fault>
        <reason>Operation Failed</reason>
        <detail>[Cannot switch Host to Maintenance mode. Host has asynchronous running tasks,
wait for operation to complete and retry.]</detail>
    </fault>
</action>
 
2013-02-27 00:10:29,879 - MainThread - plmanagement.matrix-test-composer - ERROR - Status: Fail
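
A client that wants to tell this specific failure apart from other errors could inspect the response body rather than only the overall test status. The following is a minimal sketch, assuming the response has the shape shown in the log above; the function name `action_outcome` is hypothetical and not part of any SDK.

```python
import xml.etree.ElementTree as ET


def action_outcome(response_xml):
    """Return (state, fault_detail) from an <action> response body.

    fault_detail is None when the response carries no <fault> element.
    """
    root = ET.fromstring(response_xml)
    state = root.findtext("status/state")
    detail = root.findtext("fault/detail")
    if detail is not None:
        # The detail text may wrap across lines; collapse the whitespace.
        detail = " ".join(detail.split())
    return state, detail


# Response body copied from the log above (grace_period omitted for brevity).
SAMPLE = """<action>
    <async>false</async>
    <status>
        <state>failed</state>
    </status>
    <fault>
        <reason>Operation Failed</reason>
        <detail>[Cannot switch Host to Maintenance mode. Host has asynchronous running tasks,
wait for operation to complete and retry.]</detail>
    </fault>
</action>"""

state, detail = action_outcome(SAMPLE)
print(state)
print(detail)
```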


Version-Release number of selected component (if applicable):
SI27.1

How reproducible:
Quite frequently

Steps to Reproduce:
1. Add a few VMs on one host
2. Remove all VMs simultaneously, using the REST API (SDK)
3. Set the host to maintenance
  

Additional info:
Log Attached

Comment 1 Oded Ramraz 2013-02-28 19:30:43 UTC
might be related to https://bugzilla.redhat.com/show_bug.cgi?id=915809

Comment 3 Michael Pasternak 2013-03-03 09:50:28 UTC
(In reply to comment #0)
> Created attachment 703917 [details]
> Test logs
> 
> Description of problem:
> http://jenkins.qa.lab.tlv.redhat.com:8080/view/Core+Tools/view/3.1/job/3.1-
> automation_restapi_reg_vms_rhevh/73/testReport/junit/Hosts/134-
> Set%20the%20second%20host%20to%20maintenance,%20for%20the%20SPM%20selection/
> Set_the_second_host_to_maintenance__for_the_SPM_selection/
> 
> VMs are not fully removed, when the REST API, returns delete ack:
> 

A blocking DELETE request is not supported in the API; we support only:

1. semi-async, via the 'Expect:202-accepted' header (control is returned after the backend can-do-action check has passed)

2. async, via the ';async' matrix URL parameter (control is returned immediately)

In any case, the RESTful API does *not* promise that a DELETE will succeed; this
verification should be done by the client's code.

Comment 4 Michael Pasternak 2013-03-03 09:52:47 UTC
(In reply to comment #3)

Minor clarification:
===================

1. Both the Expect header and the async parameter generate a fully asynchronous call.
2. Passing neither of the options mentioned in #1 returns control after can-do-action.
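
For illustration, the three ways of issuing the DELETE discussed here could be built as follows. This is only a sketch: the helper `build_vm_delete` and its `mode` values are hypothetical, the requests are constructed but not sent, authentication is omitted, and the exact spelling of the matrix parameter (bare `;async`, as written in this thread) is an assumption.

```python
import urllib.request


def build_vm_delete(base_url, vm_id, mode="default"):
    """Build (but do not send) a DELETE request for /api/vms/<vm_id>.

    mode="default": no extra options; control returns after can-do-action.
    mode="expect":  adds the 'Expect: 202-accepted' header.
    mode="matrix":  appends the ';async' matrix parameter to the URL.
    """
    url = f"{base_url}/api/vms/{vm_id}"
    headers = {}
    if mode == "expect":
        headers["Expect"] = "202-accepted"
    elif mode == "matrix":
        url += ";async"  # assumed spelling, taken from the comment above
    elif mode != "default":
        raise ValueError(f"unknown mode: {mode!r}")
    return urllib.request.Request(url, headers=headers, method="DELETE")


req = build_vm_delete("https://localhost",
                      "478e7dbc-e4b0-4b79-8a0d-0942c176643a",
                      mode="expect")
print(req.get_method(), req.full_url, req.header_items())
```

Whichever variant is used, the response only acknowledges the request; the client still has to verify that the VM is actually gone.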

Comment 5 Oded Ramraz 2013-03-03 10:02:15 UTC
(In reply to comment #4)


Can we monitor the task statuses using the API? If not, please explain how we should monitor the tasks related to VM delete operations.

Comment 6 Michael Pasternak 2013-03-03 10:09:34 UTC
(In reply to comment #5)
> Can we monitor the tasks statuses using API  

Yes, you can, but only for CREATE calls, when the call is async.

> If not please explain how
> should we monitor the tasks which related to delete VM's operations.

You should poll the resources until you get a 404.
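
A rough sketch of such a polling loop is shown below. The names `wait_until_gone` and `http_status` are hypothetical, the timeout and interval values are arbitrary, and authentication and TLS configuration are left out; the status lookup is passed in as a callable so the loop can be exercised without a live engine.

```python
import time
import urllib.error
import urllib.request


def http_status(url):
    """Return the HTTP status code of a GET on url (no authentication)."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def wait_until_gone(url, fetch_status=http_status, timeout=300.0,
                    interval=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll url until it answers 404; return False if timeout expires first."""
    deadline = clock() + timeout
    while True:
        if fetch_status(url) == 404:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test would call something like `wait_until_gone("https://localhost/api/vms/<vm id>")` after each DELETE and before moving on to the host operations.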

Comment 7 Oded Ramraz 2013-03-03 18:28:04 UTC
Michal, please advise how to proceed:

1. Fix this use case specifically, the same way as in https://bugzilla.redhat.com/show_bug.cgi?id=860194
2. Open an RFE for the REST API about being able to poll async delete calls properly.

We can choose option 1, option 2, or both.

Comment 8 Michal Skrivanek 2013-03-04 09:21:37 UTC
As I understood it, 1 is not possible: there is no vdsm async task.
Only 2 would be possible IMHO, but it would require a background task framework on the engine side only... and I'm not sure that justifies the effort. The currently correct way, as per comment #6, seems sufficient to me.

Comment 9 Oded Ramraz 2013-03-05 20:30:43 UTC
I'm not sure that following comment #6 would be sufficient, since in some cases tasks remain on the host even after I get the 404 error.
We need to find a proper solution for this problem; then we can decide when to fix it.


(In reply to comment #8)
> as I understood it 1 is not possible. There is no vdsm async task
> only 2 would be possible imho, but it would require a bg task framework on
> engine side only...and I'm not sure that justifies the effort - the
> currently correct way seems to me as sufficient as per comment #6.

Comment 10 Oded Ramraz 2013-03-06 12:58:03 UTC
Simon, please advise how we should continue with solving this issue.

Comment 11 Oded Ramraz 2013-03-07 09:10:41 UTC
After a discussion with hateya, bdagan, sgrinberg, mpastern and michal.skrivanek, it was agreed to reopen this bug.

Comment 12 Haim 2013-03-18 12:27:56 UTC
*** Bug 918588 has been marked as a duplicate of this bug. ***

Comment 13 Michal Skrivanek 2013-05-09 11:41:45 UTC
I don't agree with blocking 3.2. This is not a regression; the REST API has always behaved like this, see comment #6. The issue was hidden all this time, so the integration tests expect different behavior. Now that the problem is revealed, many tests are "failing".
Anyway, a discussion is scheduled for the next build meeting on May 13th; adding CondNAK for 3.2 until then.

Comment 14 Andrew Cathrow 2013-05-09 12:20:27 UTC
(In reply to comment #13)
> don't agree with blocking 3.2. This is not a regression, the REST behaved
> like this forever, see comment #6. The issue was hidden all the time so the
> integration tests are expecting different behavior. Now the problem is
> revealed many tests are "failing".
> Anyway, discussion scheduled for next build mtg May 13th, adding CondNAK for
> 3.2 till then


ACK, setting to 3.3/3.2.z
Let's review the need for the .Z flag

Comment 19 Shahar Havivi 2013-07-09 11:54:53 UTC
There is a limitation in the REST DELETE method: it always returns 'Accepted'.
It's the user's responsibility to poll for the object state (unlike Create VM, which works with POST and has a return value, in our case the job id).

Currently we have two solutions:
1. When the tests delete a VM there are still jobs on the host, so when the tests call for host maintenance they get an exception describing the problem, in this case "host still has jobs". The tests can then retry later, until the host finishes its jobs.
2. The API can provide "GetAllHotsJobs", which the tests can poll instead of trying to move the host to maintenance while it has running jobs.
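
Option 1 could look roughly like the sketch below on the test side. Everything here is an assumption made for illustration: `deactivate_with_retry` is a hypothetical helper, the `deactivate` callable stands in for whatever the test framework uses to call the host deactivate action, and the matched text is the fault detail reported in the description of this bug.

```python
import time

# Fragment of the fault detail quoted in the bug description.
RUNNING_TASKS_MARKER = "asynchronous running tasks"


def deactivate_with_retry(deactivate, attempts=10, delay=15.0,
                          sleep=time.sleep):
    """Call deactivate() until it succeeds or attempts run out.

    deactivate must return (ok, detail). Only the "running tasks"
    failure is retried; any other failure is raised immediately.
    """
    for attempt in range(attempts):
        ok, detail = deactivate()
        if ok:
            return True
        if RUNNING_TASKS_MARKER not in detail:
            raise RuntimeError(f"deactivate failed: {detail}")
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Retrying only on the known message keeps unrelated maintenance failures visible instead of hiding them behind the retry loop.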

Comment 20 Michal Skrivanek 2013-07-10 07:42:09 UTC
How about changing it from DELETE to POST and returning the id of that job, then?

Comment 21 Shahar Havivi 2013-07-10 08:23:44 UTC
Michael,
What do you think about changing the method from delete to post (regarding deleting VM)?

Comment 22 Michael Pasternak 2013-07-10 12:54:41 UTC
(In reply to Shahar Havivi from comment #21)
> Michael,
> What do you think about changing the method from delete to post (regarding
> deleting VM)?

It won't work. Your problem is not the inability to return tasks from DELETE
(actually, the host can be running tasks that are completely unrelated to vm.delete()),
but the inconsistency of host.maintenance() behaviour, e.g.:

in some cases it moves to PREPARING_FOR_MAINTENANCE, and in others it returns
a CAN_DO_ACTION failure saying the host can't move to MAINTENANCE because it is running tasks.
This is wrong, as it leaves the host in a state where it is a potential vm-runner candidate
while the user's intention was to move it to MAINTENANCE mode.

I'd consider moving the host to PREPARING_FOR_MAINTENANCE until all tasks end,
or introducing a new state like WAITING_FOR_TASKS_TO_END, but not throwing
an error on host.maintenance().

Also, I don't think that exposing the list of jobs running on the host to the user and
making him poll them is the way to go; it may be an alternative, but not a solution.

Comment 23 Michal Skrivanek 2013-07-11 08:54:33 UTC
(In reply to Michael Pasternak from comment #22)
> i'd consider moving host to PREPARING_FOR_MAINTENANCE till all tasks ends
IMHO this sounds most reasonable.
We already have an "issue" with hosts staying in the preparing state; this will be yet another one. It's not going against anything.

Comment 24 Omer Frenkel 2013-07-11 13:45:23 UTC
(In reply to Michal Skrivanek from comment #23)
> (In reply to Michael Pasternak from comment #22)
> > i'd consider moving host to PREPARING_FOR_MAINTENANCE till all tasks ends
> IMHO this sounds most reasonable
> we already have "issue" with host staying in preparing state, this will be
> yet another one, it's not going against anything

I agree this sounds like the 'cleanest' solution for the user, but I'm afraid it will be complicated, as moving the host to 'preparing' means stopping the SPM, and this will fail if tasks are running.
Not sure how easy it is to change this, or whether it's worth the effort.

Comment 25 Eyal Edri 2013-07-16 08:34:01 UTC
Any reason this bug is on version 3.1.2 and not on 3.3?

Comment 26 Shahar Havivi 2013-08-13 12:36:06 UTC
As Omer wrote in comment #24, we cannot just change the status of the host without stopping the SPM.
One of the suggested solutions was to change the Delete VM API method to return the job id, as the Start VM method does, in order to monitor http://host:port/api/jobs and wait for the Finished status.
It looks like we can do it without changing the API: by calling the API for all jobs (http://host:port/api/jobs) you get one of the jobs with the VM's name in the description, as you can see here:

<job href="/api/jobs/9976a2ad-8320-43d1-8d40-52808c75e16e" id="9976a2ad-8320-43d1-8d40-52808c75e16e">
    <actions>
        <link href="/api/jobs/9976a2ad-8320-43d1-8d40-52808c75e16e/clear" rel="clear"/>
        <link href="/api/jobs/9976a2ad-8320-43d1-8d40-52808c75e16e/end" rel="end"/>
    </actions>
    <description>Stopping VM usb-958526</description>
    <link href="/api/jobs/9976a2ad-8320-43d1-8d40-52808c75e16e/steps" rel="steps"/>
    <status>
        <state>FINISHED</state>
    </status>
    <owner href="/api/users/fdfc627c-d875-11e0-90f0-83df133b58cc" id="fdfc627c-d875-11e0-90f0-83df133b58cc"/>
    <start_time>2013-08-13T14:00:00.469+03:00</start_time>
    <end_time>2013-08-13T14:00:00.937+03:00</end_time>
    <last_updated>2013-08-13T14:00:00.937+03:00</last_updated>
    <external>false</external>
    <auto_cleared>true</auto_cleared>
</job>

What do you think about that solution?
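
A test could apply this proposal along the lines of the sketch below. It assumes that the job list arrives as `<job>` elements shaped like the one above (the `<jobs>` wrapper element is an assumption), that matching on the VM name inside the description is good enough, and that any state other than FINISHED means the job is still pending. The helper name `unfinished_jobs` and the "STARTED" state in the sample are hypothetical.

```python
import xml.etree.ElementTree as ET


def unfinished_jobs(jobs_xml, vm_name):
    """Return ids of jobs mentioning vm_name whose state is not FINISHED."""
    root = ET.fromstring(jobs_xml)
    pending = []
    # iter() also yields the root itself, so a single <job> document works too.
    for job in root.iter("job"):
        description = job.findtext("description") or ""
        if vm_name not in description:
            continue
        if job.findtext("status/state") != "FINISHED":
            pending.append(job.get("id"))
    return pending


SAMPLE = """<jobs>
    <job id="9976a2ad-8320-43d1-8d40-52808c75e16e">
        <description>Stopping VM usb-958526</description>
        <status><state>FINISHED</state></status>
    </job>
    <job id="hypothetical-job-id">
        <description>Removing VM usb-958526</description>
        <status><state>STARTED</state></status>
    </job>
</jobs>"""

print(unfinished_jobs(SAMPLE, "usb-958526"))
```

An empty result means no matching job is pending, which also covers jobs that were already auto-cleared. Substring matching can over-match when one VM name is a prefix of another, so unique VM names in the tests would help.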

Comment 27 Michal Skrivanek 2013-10-08 08:00:25 UTC
Shahar, so it still makes sense to return the job id in the delete call, right? I mean, it would be an easy change and there won't be any problems matching the right job.

Comment 28 Michal Skrivanek 2014-01-31 08:28:24 UTC
another option is to wait until SPM is eliminated:-)

In the meantime Shahar please see if comment #27 can get in sooner than that

Comment 29 Shahar Havivi 2014-02-02 11:58:45 UTC
(In reply to Michal Skrivanek from comment #28)
> another option is to wait until SPM is eliminated:-)
> 
> In the meantime Shahar please see if comment #27 can get in sooner than that
No...
All the delete operations use the HTTP DELETE method, which is void.

Comment 30 Michal Skrivanek 2014-02-28 07:56:04 UTC
OK, then we're out of options, other than perhaps waiting for the SPM to be eliminated. Also, the proposal from comment 26 should be good enough.

