Bug 1038975

Summary: SPM is stopped and pool is disconnected while asynchronous task is scheduled
Product: Red Hat Enterprise Virtualization Manager Reporter: Nir Soffer <nsoffer>
Component: ovirt-engine Assignee: Liron Aravot <laravot>
Status: CLOSED UPSTREAM QA Contact: Aharon Canan <acanan>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.3.0 CC: acathrow, amureini, iheim, laravot, lpeer, nsoffer, Rhev-m-bugs, scohen, yeylon
Target Milestone: --- Keywords: Triaged
Target Release: 3.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: storage
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-01-02 13:01:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
Log of CI job failing none

Description Nir Soffer 2013-12-06 09:41:22 UTC
Description of problem:

After stopping the SPM and disconnecting the storage pool, and then connecting to the storage pool again, a reference to the old pool object is released when a copyImage task is scheduled on a thread. It looks like the SPM was stopped while an asynchronous task was scheduled.

Version-Release number of selected component (if applicable):
is26 + debugging patch

How reproducible:
Random

Steps to Reproduce:
1. Use this debugging patch: http://gerrit.ovirt.org/#/c/21932/
2. Run jenkins rhevm 3.3 automation coretools two hosts restapi vms nfs rest factory vdsm until it fails with "Low space error" (the error is a symptom of bug 1032925).

Actual results:

- The test fails with "Low space error" (but there is a lot of free space).
- In the log, we can see that the old pool was deleted (__del__) when a copyImage task was committed. It looks like the task thread was holding a reference to the old pool, which had recently been disconnected.
- It looks like the engine stopped the SPM and disconnected the storage pool while a copyImage task was scheduled.

Expected results:

- The copyImage task should be canceled, or the SPM stop should fail. We cannot have SPM operations scheduled or running when the SPM is stopped.
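The "SPM stop should fail" behaviour expected above could look roughly like the following. This is only a sketch to illustrate the idea: the Spm class, the SpmBusy exception and all method names are hypothetical and are not vdsm or engine code.

```python
import threading


class SpmBusy(Exception):
    """Hypothetical error raised when stop is requested while tasks are pending."""


class Spm:
    """Toy stand-in for an SPM role holder; not the vdsm implementation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = 0
        self._running = True

    def schedule(self, name):
        # Refuse new tasks once the SPM role has been stopped.
        with self._lock:
            if not self._running:
                raise RuntimeError("SPM is stopped, cannot schedule %s" % name)
            self._pending += 1

    def task_finished(self):
        with self._lock:
            self._pending -= 1

    def stop(self):
        # Refuse to stop while any task is scheduled or running.
        with self._lock:
            if self._pending:
                raise SpmBusy("%d task(s) still pending" % self._pending)
            self._running = False
```

With a guard of this kind, a stop request that arrives while a copyImage task is scheduled is rejected, and the caller can retry after the task finishes or is canceled.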

Additional info:

This may be an engine issue (stopping the SPM when it should not), a vdsm issue (allowing the SPM to stop when the stop should fail), or both.

Comment 1 Nir Soffer 2013-12-06 09:46:34 UTC
Created attachment 833513 [details]
Log of CI job failing

Comment 2 Nir Soffer 2013-12-07 23:29:56 UTC
I did not find any issue regarding incorrect stopping of the SPM in the logs.

Liron, can you verify this and confirm that engine is operating correctly?

Comment 3 Vered Volansky 2013-12-08 05:57:43 UTC
Nir, please provide logs (engine + vdsm).

Comment 4 Nir Soffer 2013-12-08 11:55:46 UTC
(In reply to Vered Volansky from comment #3)
> Nir, please provide logs (engine + vdsm).

I already did:
https://bugzilla.redhat.com/attachment.cgi?id=833513

Comment 5 Nir Soffer 2014-01-02 13:01:18 UTC
We did not find any issue regarding stopping the SPM or disconnecting from storage. The real problem was that the old pool was kept alive by the thread pool and deleted many minutes after the pool was disconnected. This issue was resolved by http://gerrit.ovirt.org/22136.
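For readers unfamiliar with how a thread pool can keep an old object alive, here is a minimal sketch of one way it can happen in CPython. It is not the code from the gerrit change above, and Pool, leaky_worker, clean_worker and deleted_while_idle are hypothetical names. The assumption is that an idle worker still holds its last task in a local variable while it blocks waiting for the next one, so anything that task references lives until another task arrives.

```python
import queue
import threading


class Pool:
    """Stand-in for a storage pool object; signals when __del__ runs."""

    def __init__(self, deleted):
        self._deleted = deleted

    def __del__(self):
        self._deleted.set()

    def copy_image(self):
        pass


def leaky_worker(tasks, done):
    while True:
        # While blocked here, "task" still refers to the previous task.
        task = tasks.get()
        if task is None:
            break
        task()
        done.set()


def clean_worker(tasks, done):
    while True:
        task = tasks.get()
        if task is None:
            break
        try:
            task()
        finally:
            task = None  # drop the reference before blocking again
        done.set()


def deleted_while_idle(worker):
    """Return True if the pool is deleted while the worker sits idle."""
    tasks, done, deleted = queue.Queue(), threading.Event(), threading.Event()
    thread = threading.Thread(target=worker, args=(tasks, done))
    thread.start()
    pool = Pool(deleted)
    tasks.put(pool.copy_image)  # the bound method references the pool
    done.wait()
    del pool  # "disconnect": the caller drops its own reference
    result = deleted.wait(0.2)
    tasks.put(None)  # shut the worker down
    thread.join()
    return result
```

Under these assumptions, deleted_while_idle(leaky_worker) reports that the pool is still alive after the caller dropped it, while deleted_while_idle(clean_worker) reports that it was deleted right away. This relies on CPython reference counting; other interpreters may delay __del__ regardless.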

Comment 6 Nir Soffer 2014-01-02 14:32:58 UTC
To make it clear, after Allon changed the close reason to UPSTREAM:

1. This bug is invalid. There was no such bug: there was no active task when the SPM was stopped and the pool was disconnected.
2. The bug was not fixed upstream, since there was nothing to fix :-)

There seems to be no reasonable close reason in this bugzilla. Hopefully someone can add an INVALID status.