Bug 1768168 - [downstream clone - 4.3.7] VM fails to be re-started with error: Failed to acquire lock: No space left on device
Summary: [downstream clone - 4.3.7] VM fails to be re-started with error: Failed to ac...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.3.5
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.3.7
Target Release: 4.3.7
Assignee: Benny Zlotnik
QA Contact: Shir Fishbain
URL:
Whiteboard:
Depends On: 1741625
Blocks:
Reported: 2019-11-03 07:27 UTC by RHV bug bot
Modified: 2020-08-03 15:31 UTC (History)
CC List: 15 users

Fixed In Version: ovirt-engine-4.3.7.2
Doc Type: No Doc Update
Doc Text:
Clone Of: 1741625
Environment:
Last Closed: 2019-12-12 10:36:35 UTC
oVirt Team: Storage
Target Upstream Version:
lsvaty: testing_plan_complete-


Attachments
Logs (1.30 MB, application/zip)
2019-11-14 16:18 UTC, Shir Fishbain
New_Logs (1.80 MB, application/zip)
2019-11-17 13:19 UTC, Shir Fishbain


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2019:4229 | 0 | None | None | None | 2019-12-12 10:36:54 UTC
oVirt gerrit | 103822 | 0 | 'None' | MERGED | core: clear domains cache when changing state | 2020-12-07 15:19:47 UTC
oVirt gerrit | 104355 | 0 | 'None' | MERGED | core: clear domains cache when changing state | 2020-12-07 15:19:47 UTC

Comment 28 RHV bug bot 2019-11-03 07:28:05 UTC
The customer just updated this:

"The command executed pre-host boot:

killall -TERM glusterd glusterfs glusterfsd

Also now attached the sos report from the run with this command execute prior to the host reboot."

I have copied the sosreports to supportshell

Regards,
Jay

(Originally by Jaysamson Pankajakshan)

Comment 40 Avihai 2019-11-03 08:19:43 UTC
Hi Benny,

Please provide a clear scenario so I can QA_ACK.

Comment 41 Benny Zlotnik 2019-11-03 08:56:33 UTC
1. You need a setup with two hosts
2. Create a VM with a lease and start it
3. On the host that does not run the VM, change
" 'acquired': domStatus.hasHostId is True, " to 'acquired': False,
in /usr/lib/python2.7/site-packages/vdsm/storage/hsm.py
4. Add a delay to getStats on the host that does not run the VM: add a 60-second sleep in /usr/lib/python2.7/site-packages/vdsm/API.py
at Global#getStats
5. Hard reset the host that is running the VM
6. Make sure there was no attempt to run the VM on the restarted host and that it was filtered out
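
The effect of steps 3 and 4 can be illustrated with a minimal Python sketch. This is not vdsm code: `domain_status` and `get_stats_delayed` are hypothetical stand-ins written for illustration, and only the 'acquired' expression is taken from the steps above. The sketch also assumes that the 60 in step 4 means seconds.

```python
import time


def domain_status(has_host_id, force_not_acquired=False):
    """Hypothetical stand-in for the status entry edited in step 3."""
    # Unmodified expression from step 3: 'acquired': domStatus.hasHostId is True
    # Modified expression from step 3:   'acquired': False
    acquired = False if force_not_acquired else (has_host_id is True)
    return {"acquired": acquired}


def get_stats_delayed(get_stats, delay_seconds=60, sleep=time.sleep):
    """Wrap a getStats-like callable so it answers only after a delay (step 4).

    The sleep function is a parameter so the delay can be replaced in tests.
    """
    def wrapper(*args, **kwargs):
        sleep(delay_seconds)  # assumed to be seconds
        return get_stats(*args, **kwargs)
    return wrapper
```

With the modification in place, the host always reports the lease as not acquired, and its stats reply arrives late, which is the condition the reproduction relies on.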

Comment 42 Avihai 2019-11-06 11:43:36 UTC
(In reply to Benny Zlotnik from comment #41)
> 1. You need a setup with two hosts
> 2. Create a VM with a lease and start it
> 3. The host that does not run the VM should have
> " 'acquired': domStatus.hasHostId is True, " changed to  'acquired': False,
> at /usr/lib/python2.7/site-packages/vdsm/storage/hsm.py 
As it's RHEL8 with python3, please replace the python path with 3.6

> 4. Add a delay at getStats on the host that does not run the VM, add a 60
> sleep to /usr/lib/python2.7/site-packages/vdsm/API.py
> at Global#getStats
> 5. Hard reset the host that is running the VM
> 6. Make sure there was no attempt to run the VM on the restarted host and
> that it was filtered out

Comment 43 Avihai 2019-11-11 15:14:44 UTC
(In reply to Avihai from comment #42)
> (In reply to Benny Zlotnik from comment #41)
> > 1. You need a setup with two hosts
> > 2. Create a VM with a lease and start it
> > 3. The host that does not run the VM should have
> > " 'acquired': domStatus.hasHostId is True, " changed to  'acquired': False,
> > at /usr/lib/python2.7/site-packages/vdsm/storage/hsm.py 
> As it's RHEL8 with python3, please replace the python path with 3.6
> 
> > 4. Add a delay at getStats on the host that does not run the VM, add a 60
> > sleep to /usr/lib/python2.7/site-packages/vdsm/API.py
> > at Global#getStats
> > 5. Hard reset the host that is running the VM
> > 6. Make sure there was no attempt to run the VM on the restarted host and
> > that it was filtered out

Disregard comment 42, as it applies to the verification flow for RHEL8/RHV4.4.
Please use the verification flow from comment 41.
https://bugzilla.redhat.com/show_bug.cgi?id=1768168#c41

Comment 46 Shir Fishbain 2019-11-14 16:18:42 UTC
Created attachment 1636178 [details]
Logs

Comment 48 Shir Fishbain 2019-11-17 13:05:37 UTC
After speaking with Benny, these are the steps to reproduce the bug:
1. You need a setup with two hosts
2. Create a VM with a lease and start the VM
3. On the host that doesn't run the VM, change the line
" 'acquired': domStatus.hasHostId is True, " to 'acquired': False,
in /usr/lib/python2.7/site-packages/vdsm/storage/hsm.py
4. Add a delay at getStats on the host that runs the VM: add a 60-second sleep to /usr/lib/python2.7/site-packages/vdsm/API.py
at Global#getStats
5. Reboot the host that is running the VM
6. Make sure there is no attempt to run the VM on the restarted host or on the other host
The status of the VM should be Unknown. The fix is verified by checking that the WARN "VM lease is not ready yet" appears in the logs.

These are the relevant lines from the engine log:
2019-11-17 14:29:53,680+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-68593) [69254924] EVENT_ID: VDS_ALERT_NO_PM_CONFIG_FENCE_OPERATION_SKIPPED(9,028), Host host_mixed_3 became non responsive. It has no power management configured. Please check the host status, manually reboot it, and click "Confirm Host Has Been Rebooted"

2019-11-17 14:29:53,688+02 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engine-Thread-68593) [69254924] Trying to release exclusive lock which does not exist, lock key: 'b2e12f99-392d-42b8-b3d8-8e371ccf8ce7VDS_FENCE'

2019-11-17 14:30:20,872+02 INFO  [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (EE-ManagedThreadFactory-engine-Thread-68607) [7851617c] Candidate host 'host_mixed_2' ('de79a846-13e5-4657-9449-a80efd46dc10') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'VM leases ready' (correlation id: null)

2019-11-17 14:30:20,874+02 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-68607) [7851617c] Validation of action 'RunVm' failed for user SYSTEM. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName host_mixed_2,$filterName VM leases ready,ACTION_TYPE_FAILED_VM_LEASE_IS_NOT_READY_FOR_HOST,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL
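
The check on the log lines above could be automated with a short Python sketch. It only matches the message shapes quoted in this comment (the 'VM leases ready' filter line and the RunVm validation failure). The function name `lease_filter_evidence` is made up for illustration, and the patterns may need adjusting for other engine versions.

```python
import re

# Matches the scheduler line quoted above and captures the host name.
LEASE_FILTER = re.compile(
    r"Candidate host '(?P<host>[^']+)' \('[^']*'\) was filtered out by "
    r"'[^']*' filter 'VM leases ready'"
)
LEASE_NOT_READY = "ACTION_TYPE_FAILED_VM_LEASE_IS_NOT_READY_FOR_HOST"


def lease_filter_evidence(log_lines):
    """Return (hosts filtered by 'VM leases ready', RunVm validation failed?)."""
    filtered_hosts = []
    validation_failed = False
    for line in log_lines:
        match = LEASE_FILTER.search(line)
        if match:
            filtered_hosts.append(match.group("host"))
        if "Validation of action 'RunVm' failed" in line and LEASE_NOT_READY in line:
            validation_failed = True
    return filtered_hosts, validation_failed
```

Feeding it the lines of the engine log (for example via `open(path)`, where the path depends on the installation) should report host_mixed_2 as filtered out for the run shown above.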

Benny, can you please ack that I hit the customer scenario?

Comment 49 Benny Zlotnik 2019-11-17 13:11:35 UTC
Looks good!

Comment 50 Shir Fishbain 2019-11-17 13:19:55 UTC
Created attachment 1637026 [details]
New_Logs

Comment 51 Shir Fishbain 2019-11-17 13:20:20 UTC
Verified 

ovirt-engine-4.3.7.2-0.1.el7.noarch
vdsm-4.30.37-1.el7ev.x86_64

Comment 53 errata-xmlrpc 2019-12-12 10:36:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4229

