Bug 1620044
| Field | Value |
|---|---|
| Summary: | VM has been paused due to lack of storage space - cannot resume |
| Product: | [oVirt] vdsm |
| Reporter: | oliver.albl |
| Component: | General |
| Assignee: | Benny Zlotnik <bzlotnik> |
| Status: | CLOSED INSUFFICIENT_DATA |
| QA Contact: | Avihai <aefrat> |
| Severity: | high |
| Priority: | unspecified |
| Version: | 4.20.31 |
| CC: | aefrat, bugs, bzlotnik, christian.grundmann, fromani, michal.skrivanek, oliver.albl, tcarlin, tnisan |
| Target Milestone: | --- |
| Target Release: | --- |
| Hardware: | x86_64 |
| OS: | Linux |
| Last Closed: | 2020-03-24 13:22:04 UTC |
| Type: | Bug |
| oVirt Team: | Storage |
| Bug Blocks: | 1693786 |
Description (oliver.albl, 2018-08-22 09:36:55 UTC)

Hi Oliver,
Just to be clear, this occurs when the VM disks are getting full, right? And do you extend the disks prior to trying to resume the VM?

Hi Tal,
this occurs quite soon after VM creation. Disks are thin provisioned and not getting full.

Benny, you're this week's QE contact, can you please have a look?

Can you please upload the relevant engine log as well?

Created attachment 1478143 [details]
Engine Logs
Looks like resume is stuck on acquiring the lock:

```
2018-08-22 11:14:04,804+0200 ERROR (jsonrpc/7) [api] FINISH cont error=waiting more than 168s for _guestCpuLock (api:132)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 166, in cont
    return self.vm.cont()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1517, in cont
    self._acquireCpuLockWithTimeout()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1490, in _acquireCpuLockWithTimeout
    timeout)
RuntimeError: waiting more than 168s for _guestCpuLock
```
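The pattern the traceback implies can be sketched as follows. This is not vdsm's actual code: the class and method names below are invented for illustration; only the 168 s figure and the error text come from the log above.

```python
import threading

class GuestCpuLock:
    """Illustrative sketch of a per-VM CPU lock acquired with a timeout."""

    def __init__(self, timeout=168):
        self._lock = threading.Lock()
        self._timeout = timeout

    def acquire_with_timeout(self):
        # threading.Lock.acquire() gives up and returns False once the
        # timeout (in seconds) expires without obtaining the lock.
        if not self._lock.acquire(timeout=self._timeout):
            raise RuntimeError(
                "waiting more than %ss for _guestCpuLock" % self._timeout)

    def release(self):
        self._lock.release()
```

The point of the pattern: if some other flow holds the lock past the timeout, the resume attempt fails with this RuntimeError instead of blocking forever, which is exactly what the log shows.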
Oliver, can you provide vdsm logs in debug level? [1]

[1] - https://www.ovirt.org/develop/developer-guide/vdsm/log-files/

Temporarily changed the log level to debug using `vdsm-client Host setLogLevel level=DEBUG` on the affected hosts. I will provide logs as soon as we encounter the problem again.

The issue has not occurred since I changed the log level. Before that, we saw it 1-3 times a week.

Created attachment 1483075 [details]
Logfiles
Today we observed the issue again.
VM 90591853-249d-4442-b4a4-ed4cf46c3665
Time: 16:36
Created attachment 1483258 [details]
Logfiles
Another occurrence today - this time, the VM stopped/crashed after the pause message:

Sep 14, 2018, 9:18:33 AM "VM xxx has been paused due to no Storage space error."
Sep 14, 2018, 9:18:48 AM "VM ... is down with error. Exit message: Lost connection with qemu process".

Is there a way to recover VMs in that state? After a reboot the VM reports "VM has been paused due to a storage I/O error", which cannot be resolved any more!

Francesco,
As I am not very familiar with this flow, can you provide some insight as to why guestCpuLock is stuck?
```
2018-08-22 11:14:04,804+0200 ERROR (jsonrpc/7) [api] FINISH cont error=waiting more than 168s for _guestCpuLock (api:132)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 166, in cont
    return self.vm.cont()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1517, in cont
    self._acquireCpuLockWithTimeout()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1490, in _acquireCpuLockWithTimeout
    timeout)
RuntimeError: waiting more than 168s for _guestCpuLock
```
(In reply to Benny Zlotnik from comment #12)
> Francesco,
> As I am not very familiar with this flow, can you provide some insight as to
> why guestCpuLock is stuck?

We try to acquire guestCpuLock in these flows:

- VM.cont
- VM.pause
- VM.migrate
- VM.migrateCancel
- VM.migrateChangeParams
- when receiving _some_ libvirt events, where we need to set the VCPUs as (non)running

Now, only the first two should be relevant in this case. What's going on is that _some other flow_ was holding the lock. It's hard to say from the logs who was holding that lock; there could be a couple of explanations. As a first step I'll improve the logging.

My patches will improve the logging; they do not tackle this issue yet - not enough data. POST -> NEW

This bug has not been marked as a blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Oliver, have you encountered this after upgrading (this should be available in 4.2.8)? Francesco improved the logging in this area so we can properly investigate.

Benny, we are still running 4.2.7 on our hosts. We could work around the problem by changing volume_utilization_percent = 25 and volume_utilization_chunk_mb = 4096, as well as by reducing vdsm worker threads from 16 to the default. We are going to upgrade to 4.2.8 by the end of March 2019.
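The workaround above maps onto vdsm's thin-provisioning watermark settings. A sketch of what the reporter's change would look like in the host configuration; placing the options in the `[irs]` section of `/etc/vdsm/vdsm.conf` is my assumption based on vdsm's config layout, while the two values are taken verbatim from the comment:

```ini
# /etc/vdsm/vdsm.conf -- sketch of the reported workaround (assumed layout)
[irs]
# Request an extension once a thin-provisioned volume is 25% full,
# i.e. extend earlier than the stock watermark.
volume_utilization_percent = 25
# Extend the volume in 4 GiB chunks instead of the smaller default.
volume_utilization_chunk_mb = 4096
```

Larger chunks and an earlier watermark both reduce how often extension requests are issued, which plausibly reduced the lock contention seen in the tracebacks.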
I will provide new logs as soon as we have the hosts on 4.2.8.

Any news, Oliver?

We did not see this error again since we reduced vdsm worker threads from 16 to the default.

(In reply to oliver.albl from comment #24)
> We did not see this error again since we reduced vdsm worker threads from 16
> to default.

Benny, can you please figure out what the cause was here and create a KB article for documentation purposes?

We didn't get to properly investigate all the data and the problem is too old. If it reappears, please feel free to reopen.