Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1620044

Summary: VM has been paused due to lack of storage space - cannot resume
Product: [oVirt] vdsm
Reporter: oliver.albl
Component: General
Assignee: Benny Zlotnik <bzlotnik>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Avihai <aefrat>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.20.31
CC: aefrat, bugs, bzlotnik, christian.grundmann, fromani, michal.skrivanek, oliver.albl, tcarlin, tnisan
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1693786 (view as bug list)
Environment:
Last Closed: 2020-03-24 13:22:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1693786
Attachments:
  Logfiles (flags: none)
  Engine Logs (flags: none)
  Logfiles (flags: none)
  Logfiles (flags: none)

Description oliver.albl 2018-08-22 09:36:55 UTC
Created attachment 1477835 [details]
Logfiles

Description of problem:
I run a cluster with 10 hosts for automated testing purposes. I have a VM in state "VM has been paused due to lack of storage space". I cannot resume the VM.

Attached are the VDSM logs for the first error and the resume attempt. The affected VM is "baselcloudsprint" (151ea7eb-5358-4b4e-a644-14552e278a8e).


Version-Release number of selected component (if applicable):
oVirt 4.2.4.5-1.el7
vdsm-4.20.32-1.el7.x86_64


How reproducible:
Generate load on the hosts by automatically creating VMs, running VM workloads, and deleting VMs.


Steps to Reproduce:
1.
2.
3.

Actual results:
VM is in state "VM has been paused due to lack of storage space", resume is not possible.

Expected results:
VM available again.


Additional info:
Removed /usr/libexec/vdsm/hooks/after_vm_destroy/50_vhostmd

Running with the following vdsm configuration:
[irs]
volume_utilization_percent = 25
volume_utilization_chunk_mb = 2048

[rpc]
worker_threads = 16
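For context, the [irs] values above tune vdsm's thin-provisioning watermark. A minimal sketch of the extension decision, assuming vdsm extends a thin block volume by volume_utilization_chunk_mb once the free space in the current allocation drops below (100 - volume_utilization_percent)% of the chunk size (the function names and the exact formula here are illustrative assumptions, not vdsm's real code):

```python
# Illustrative sketch of a vdsm-style thin-volume extension threshold.
# Names and formula are assumptions for illustration, not vdsm's code.

def extension_threshold_mb(chunk_mb, utilization_percent):
    """Free space (MB) below which the volume would be extended.

    Lower utilization_percent values raise the threshold, i.e.
    the volume is extended earlier.
    """
    return chunk_mb * (100 - utilization_percent) / 100.0

def should_extend(apparent_size_mb, allocated_mb, chunk_mb,
                  utilization_percent):
    """Decide whether a thin volume needs another chunk."""
    free_mb = apparent_size_mb - allocated_mb
    return free_mb < extension_threshold_mb(chunk_mb, utilization_percent)

# With the reporter's settings (25% / 2048 MB), extension would
# trigger once less than 1536 MB of the current chunk remains free.
```

Under this reading, the reporter's settings extend earlier and in larger steps than the defaults, which matters under the heavy create/delete load described above.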

Might be related to:
https://bugzilla.redhat.com/show_bug.cgi?id=1618755
https://bugzilla.redhat.com/show_bug.cgi?id=1600507

Comment 1 Tal Nisan 2018-08-22 13:37:46 UTC
Hi Oliver,
Just to be clear, this occurs where the VM disks are getting full, right?
And do you extend the disks prior to trying to resume the VM?

Comment 2 oliver.albl 2018-08-22 13:45:24 UTC
Hi Tal, 
this occurs shortly after VM creation. The disks are thin provisioned and are not getting full.

Comment 3 Tal Nisan 2018-08-22 15:24:09 UTC
Benny, you're this week's QE contact, can you please have a look?

Comment 4 Benny Zlotnik 2018-08-23 11:32:21 UTC
Can you please upload the relevant engine log as well?

Comment 5 oliver.albl 2018-08-23 11:58:43 UTC
Created attachment 1478143 [details]
Engine Logs

Comment 6 Benny Zlotnik 2018-08-28 08:56:47 UTC
Looks like resume is stuck on acquiring the lock

2018-08-22 11:14:04,804+0200 ERROR (jsonrpc/7) [api] FINISH cont error=waiting more than 168s for _guestCpuLock (api:132)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 166, in cont
    return self.vm.cont()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1517, in cont
    self._acquireCpuLockWithTimeout()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1490, in _acquireCpuLockWithTimeout
    timeout)
RuntimeError: waiting more than 168s for _guestCpuLock
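For reference, a lock acquisition with a timeout of the kind _acquireCpuLockWithTimeout performs amounts to polling a non-blocking acquire until a deadline, since Python 2.7's threading.Lock.acquire takes no timeout argument. A hedged sketch of that pattern (not vdsm's actual implementation; the function name is invented):

```python
import threading
import time

def acquire_lock_with_timeout(lock, timeout, name="_guestCpuLock"):
    """Poll a non-blocking acquire until the deadline expires.

    Raises RuntimeError in the same shape as the traceback above.
    Illustrative only -- not vdsm's real code.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if lock.acquire(False):  # non-blocking attempt
            return
        time.sleep(0.05)         # back off briefly before retrying
    raise RuntimeError(
        "waiting more than %ss for %s" % (timeout, name))
```

The point of the pattern: if some other flow holds the lock for longer than the timeout (168s in the log above), the cont verb fails with exactly this error instead of blocking forever.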

Oliver, can you provide vdsm logs in debug level[1]?


[1] - https://www.ovirt.org/develop/developer-guide/vdsm/log-files/

Comment 7 oliver.albl 2018-08-28 09:31:07 UTC
Temporarily changed log level to debug using "vdsm-client Host setLogLevel level=DEBUG" on the affected hosts. I will provide logs as soon as we encounter the problem again.

Comment 8 oliver.albl 2018-09-10 11:09:07 UTC
The issue did not occur since I changed the log level. Before that, we saw it 1-3 times a week.

Comment 9 oliver.albl 2018-09-13 14:51:06 UTC
Created attachment 1483075 [details]
Logfiles

Today we observed the issue again. 
VM 90591853-249d-4442-b4a4-ed4cf46c3665
Time: 16:36

Comment 10 oliver.albl 2018-09-14 09:43:44 UTC
Created attachment 1483258 [details]
Logfiles

Another occurrence today - this time, the VM stopped/crashed after the pause message:

Sep 14, 2018, 9:18:33 AM "VM xxx has been paused due to no Storage space error."
Sep 14, 2018, 9:18:48 AM "VM ... is down with error. Exit message: Lost connection with qemu process".

Comment 11 oliver.albl 2018-09-20 08:32:24 UTC
Is there a way to recover VMs in that state? After a reboot, the VM reports "VM has been paused due to a storage I/O error", which can no longer be resolved!

Comment 12 Benny Zlotnik 2018-10-03 12:28:53 UTC
Francesco,

As I am not very familiar with this flow, can you provide some insight as to why guestCpuLock is stuck?

2018-08-22 11:14:04,804+0200 ERROR (jsonrpc/7) [api] FINISH cont error=waiting more than 168s for _guestCpuLock (api:132)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 166, in cont
    return self.vm.cont()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1517, in cont
    self._acquireCpuLockWithTimeout()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1490, in _acquireCpuLockWithTimeout
    timeout)
RuntimeError: waiting more than 168s for _guestCpuLock

Comment 13 Francesco Romani 2018-10-03 15:34:55 UTC
(In reply to Benny Zlotnik from comment #12)
> Francesco,
> 
> As I am not very familiar with this flow, can you provide some insight as to
> why guestCpuLock is stuck?
> 
> 2018-08-22 11:14:04,804+0200 ERROR (jsonrpc/7) [api] FINISH cont
> error=waiting more than 168s for _guestCpuLock (api:132)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in
> method
>     ret = func(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 166, in cont
>     return self.vm.cont()
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1517, in cont
>     self._acquireCpuLockWithTimeout()
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1490, in
> _acquireCpuLockWithTimeout
>     timeout)
> RuntimeError: waiting more than 168s for _guestCpuLock

We try to acquire guestCpuLock in these flows:

VM.cont
VM.pause
VM.migrate
VM.migrateCancel
VM.migrateChangeParams
When receiving _some_ libvirt events, we need to set the VCPUs as (non)running

Now, only the first two should be relevant in this case.
What's going on is that _some other flow_ was holding the lock.

It's hard to say from the logs who is holding that lock; there could be a couple of explanations. As a first step, I'll improve the logging.
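The logging improvement mentioned here could take the form of a lock wrapper that records a label for the current holder, so a timeout error can name the flow that is blocking. A hypothetical sketch (the class and its API are invented for illustration; this is not the actual vdsm patch):

```python
import threading

class NamedOwnerLock(object):
    """Lock wrapper that remembers which flow currently holds it,
    so diagnostics can report the blocking flow. Illustrative only.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None  # label of the holding flow, e.g. "VM.pause"

    def acquire(self, flow, blocking=True):
        if self._lock.acquire(blocking):
            self.owner = flow
            return True
        return False  # contended non-blocking attempt

    def release(self):
        self.owner = None
        self._lock.release()
```

With something like this in place, the "waiting more than 168s" error could include the owner label (VM.cont, VM.pause, VM.migrate, ...) instead of leaving the holder unknown.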

Comment 14 Francesco Romani 2018-10-04 07:22:30 UTC
My patches will improve the logging; they do not tackle this issue yet - not enough data. POST -> NEW

Comment 19 Sandro Bonazzola 2019-01-28 09:44:02 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 20 Benny Zlotnik 2019-02-11 08:32:40 UTC
Oliver, have you encountered this after upgrading (this should be available in 4.2.8)?
Francesco improved the logging in this area, so we can now investigate properly.

Comment 21 oliver.albl 2019-02-12 12:41:40 UTC
Benny, we are still running 4.2.7 on our hosts. We could work around the problem by setting volume_utilization_percent = 25 and volume_utilization_chunk_mb = 4096, as well as by reducing vdsm worker threads from 16 to the default.
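For clarity, the workaround described in this comment corresponds to a vdsm configuration along these lines (chunk size raised from 2048 to 4096 MB, and the earlier [rpc] worker_threads = 16 override removed so the default applies):

```ini
[irs]
volume_utilization_percent = 25
volume_utilization_chunk_mb = 4096
```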

Comment 22 oliver.albl 2019-03-11 14:57:44 UTC
We are going to upgrade to 4.2.8 by end of March 2019. I will provide new logs as soon as we have the hosts on 4.2.8.

Comment 23 Tal Nisan 2019-06-23 10:51:08 UTC
Any news Oliver?

Comment 24 oliver.albl 2019-06-24 13:52:16 UTC
We did not see this error again since we reduced vdsm worker threads from 16 to default.

Comment 26 Tal Nisan 2019-11-19 11:46:55 UTC
(In reply to oliver.albl from comment #24)
> We did not see this error again since we reduced vdsm worker threads from 16
> to default.

Benny, can you please figure out what was the cause here and create a KB for documentation purposes?

Comment 27 Michal Skrivanek 2020-03-24 13:22:04 UTC
We didn't get to properly investigate all the data, and the problem is too old. If it reappears, please feel free to reopen.