Bug 1525955 - VDSM fails to start HE VM due to VM.getStats error right after VM.create (seen on OST)
Summary: VDSM fails to start HE VM due to VM.getStats error right after VM.create (seen on OST)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-setup
Classification: oVirt
Component: General
Version: 2.2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.1
Target Release: ---
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks: 1458709
 
Reported: 2017-12-14 13:43 UTC by Simone Tiraboschi
Modified: 2018-02-12 11:57 UTC
CC: 8 users

Fixed In Version: ovirt-hosted-engine-setup-2.2.5
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-12 11:57:23 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.2+


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1459829 0 high CLOSED [DR] - Hosted-Engine VM migration fails on: 'libvirtError: Failed to acquire lock: No space left on device' (while there... 2022-03-13 14:18:50 UTC
oVirt gerrit 85549 0 ovirt-4.2.0 ABANDONED HE: OST: try reverting a vdsm fix 2021-02-10 16:30:39 UTC
oVirt gerrit 85550 0 master ABANDONED HE: OST: try reverting a vdsm fix 2021-02-10 16:30:39 UTC
oVirt gerrit 85569 0 master ABANDONED HE: OST: try reverting a vdsm fix 2021-02-10 16:30:39 UTC
oVirt gerrit 85570 0 master MERGED virt: avoid failing on inital ServerError on VM.getStats 2021-02-10 16:30:39 UTC
oVirt gerrit 85581 0 master MERGED virt: Return None disk mapping hash when disk mapping is empty 2021-02-10 16:30:39 UTC
oVirt gerrit 85582 0 master MERGED virt: Catch NotConnectedError in guest mapping metadata sync 2021-02-10 16:30:39 UTC
oVirt gerrit 85583 0 master MERGED virt: Don't update disk mapping when VM is not Up 2021-02-10 16:30:39 UTC

Internal Links: 1459829

Description Simone Tiraboschi 2017-12-14 13:43:21 UTC
Description of problem:

Seen more than once on OST:

http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/135
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/136

It seems not to be reproducible on real HW.

Hosted-engine-setup fails with:
08:27:07 [ INFO  ] Creating VM
08:27:07 [ ERROR ] Failed to execute stage 'Closing up': Command VM.getStats with args {'vmID': '9feb6e29-2785-43ca-91db-0abc96217847'} failed:
08:27:07          (code=100, message=General Exception: ("VM '9feb6e29-2785-43ca-91db-0abc96217847' was not defined yet or was undefined",))

While on vdsm logs we see:
2017-12-14 03:27:08,133-0500 ERROR (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') The vm start process failed (vm:915)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 844, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2740, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1069, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: Failed to acquire lock: No space left on device
2017-12-14 03:27:08,137-0500 INFO  (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') Changed state to Down: Failed to acquire lock: No space left on device (code=1) (vm:1634)

and on sanlock logs:
2017-12-14 03:27:07 526 [4803]: r4 cmd_acquire 4,13,6672 invalid lockspace found -1 failed 0 name d9e93730-1801-47fe-a2a1-aea67ce99acc


Version-Release number of selected component (if applicable):
vdsm-common-4.20.9-45.gitd3f712a.el7.centos.noarch
libvirt-daemon-3.2.0-14.el7_4.5.x86_64
libvirt-lock-sanlock-3.2.0-14.el7_4.5.x86_64
sanlock-3.5.0-1.el7.x86_64
ovirt-hosted-engine-setup-2.2.2-0.0.master.20171213093337.gitca0028e.el7.centos.noarch


How reproducible:
seen on OST, not really reproducible on real HW

Steps to Reproduce:
1. deploy hosted-engine
2.
3.

Actual results:
libvirtError: Failed to acquire lock: No space left on device

Expected results:
The engine VM starts

Additional info:
all the relevant logs are available here:
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/136/artifact/exported-artifacts/test_logs/he-basic-suite-master/post-002_bootstrap.py/lago-he-basic-suite-master-host0/_var_log/

Comment 1 Martin Sivák 2017-12-14 14:12:28 UTC
Since this works on bare metal, can it be an infrastructure issue? Could the host with nested VMs providing the storage be too loaded or slow, for example? Sanlock (and the whole of oVirt) expects reasonable response times.

Comment 2 Yaniv Kaul 2017-12-17 08:42:45 UTC
(In reply to Martin Sivák from comment #1)
> Since this works on bare metal, can it be an infrastructure issue? Could
> the host with nested VMs providing the storage be too loaded or slow, for
> example? Sanlock (and the whole of oVirt) expects reasonable response times.

It can be, but:
- everything else seems to be working well in OST
- OST with Hosted Engine used to work in 4.1.
- It doesn't happen with the Ansible based deployment.

Nir, what are the next steps to debug this?

Comment 3 Simone Tiraboschi 2017-12-18 08:24:19 UTC
(In reply to Yaniv Kaul from comment #2)
> It can be, but:
> - everything else seems to be working well in OST
> - OST with Hosted Engine used to work in 4.1.
> - It doesn't happen with the Ansible based deployment.

In the old flow we bootstrap the engine VM via VDSM with shared:exclusive disk attributes.
In the new Ansible-based flow we bootstrap a local VM via virt-install directly over libvirt, so sanlock is not involved there.
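
For clarity, in the old flow the engine VM disk is passed to VM.create with the sanlock-backed shared attribute. A simplified sketch of the relevant drive-spec fragment (UUIDs taken from the logs above; the full vmParams carry many more keys):

# Fragment of the VM.create drive spec used by the old flow (simplified):
# 'shared': 'exclusive' makes vdsm protect the disk with a sanlock lease,
# which is exactly the lease that later fails to be acquired.
drive = {
    'device': 'disk',
    'domainID': 'd9e93730-1801-47fe-a2a1-aea67ce99acc',
    'imageID': 'cc06c1ea-4aea-4de1-a6b3-af3033c23d68',
    'volumeID': 'bdd50fc4-d29f-460a-ba0d-2d9dbeaff651',
    'shared': 'exclusive',
}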

 
> Nir, what are the next steps to debug this?

Comment 4 Nir Soffer 2017-12-18 22:15:22 UTC
(In reply to Yaniv Kaul from comment #2)
> Nir, what are the next steps to debug this?

This:

libvirtError: Failed to acquire lock: No space left on device
2017-12-14 03:27:08,137-0500 INFO  (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') Changed state to Down: Failed to acquire lock: No space left on device (code=1) (vm:1634)

Means that the host id was not acquired yet.

It can happen if we try to start the VM on a host before the host id is acquired.

To check whether the host id is acquired, you need to check the hosted engine
storage domain status in repoStats - it must have {"acquired": True, ...}.

I suggest checking the vdsm logs to see if this is indeed the issue.
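
For reference, a minimal sketch of such a check through the vdsm jsonrpc client (assuming the vdsm.client module and the Host.getStorageRepoStats verb, as in vdsm 4.20; adjust to your version):

# Minimal sketch: verify the host id is acquired on the HE storage domain
# before starting a VM that holds a sanlock lease on it.
from vdsm import client

HE_SD_UUID = "d9e93730-1801-47fe-a2a1-aea67ce99acc"  # from the logs above

cli = client.connect("localhost", 54321)
stats = cli.Host.getStorageRepoStats(domains=[HE_SD_UUID])
if not stats.get(HE_SD_UUID, {}).get("acquired", False):
    raise RuntimeError(
        "host id not yet acquired on %s; a VM lease on this domain "
        "cannot be acquired now" % HE_SD_UUID)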

Comment 5 Nir Soffer 2017-12-18 22:33:22 UTC
Looking in:
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/136/artifact/exported-artifacts/test_logs/he-basic-suite-master/post-002_bootstrap.py/lago-he-basic-suite-master-host0/_var_log/vdsm/vdsm.log

1. host id was acquired on the domain

2017-12-14 03:27:00,712-0500 INFO  (monitor/d9e9373) [storage.Monitor] Host id for domain d9e93730-1801-47fe-a2a1-aea67ce99acc successfully acquired (id: 1) (monitor:447)

2. host id still acquired...

2017-12-14 03:27:00,829-0500 INFO  (jsonrpc/3) [vdsm.api] FINISH repoStats return={u'd9e93730-1801-47fe-a2a1-aea67ce99acc': {'code': 0, 'actual': True, 'version': 3, 'acquired': True, 'delay
': '0.000202206', 'lastCheck': '0.1', 'valid': True}} from=::1,44714, task_id=d60d4e22-a16c-4515-ae46-43557bcf29d9 (api:52)

3. Preparing to run a vm on this domain:

2017-12-14 03:27:07,155-0500 INFO  (vm/9feb6e29) [vdsm.api] START prepareImage(sdUUID=u'd9e93730-1801-47fe-a2a1-aea67ce99acc', spUUID=u'00000000-0000-0000-0000-000000000000', imgUUID=u'cc06c1ea-4aea-4de1-a6b3-af3033c23d68', leafUUID=u'bdd50fc4-d29f-460a-ba0d-2d9dbeaff651', allowIllegal=False) from=internal, task_id=4a2a4a87-acb2-4ef3-a62c-ea8dc435a7d2 (api:46)

2017-12-14 03:27:07,174-0500 INFO  (vm/9feb6e29) [vdsm.api] FINISH prepareImage return={'info': {'path': u'/rhev/data-center/mnt/lago-he-basic-suite-master-storage:_exports_nfs__he/d9e93730-1801-47fe-a2a1-aea67ce99acc/images/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460a-ba0d-2d9dbeaff651', 'volType': 'path'}, 'path': u'/var/run/vdsm/storage/d9e93730-1801-47fe-a2a1-aea67ce99acc/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460a-ba0d-2d9dbeaff651', 'imgVolumesInfo': [{'domainID': u'd9e93730-1801-47fe-a2a1-aea67ce99acc', 'leaseOffset': 0, 'path': u'/rhev/data-center/mnt/lago-he-basic-suite-master-storage:_exports_nfs__he/d9e93730-1801-47fe-a2a1-aea67ce99acc/images/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460a-ba0d-2d9dbeaff651', 'volumeID': u'bdd50fc4-d29f-460a-ba0d-2d9dbeaff651', 'leasePath': u'/rhev/data-center/mnt/lago-he-basic-suite-master-storage:_exports_nfs__he/d9e93730-1801-47fe-a2a1-aea67ce99acc/images/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460a-ba0d-2d9dbeaff651.lease', 'imageID': u'cc06c1ea-4aea-4de1-a6b3-af3033c23d68'}]} from=internal, task_id=4a2a4a87-acb2-4ef3-a62c-ea8dc435a7d2 (api:52)
2017-12-14 03:27:07,175-0500 INFO  (vm/9feb6e29) [vds] prepared volume path: /var/run/vdsm/storage/d9e93730-1801-47fe-a2a1-aea67ce99acc/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460

4. Stopping monitoring of the domain - this is called only from hosted engine

2017-12-14 03:27:07,385-0500 INFO  (jsonrpc/0) [vdsm.api] START stopMonitoringDomain(sdUUID=u'd9e93730-1801-47fe-a2a1-aea67ce99acc', options=None) from=::1,44714, task_id=632cb6fe-a492-4ace-8128-6ddf30025434 (api:46)

5. When we stop monitoring a domain, we also release the host id...

2017-12-14 03:27:07,386-0500 INFO  (monitor/d9e9373) [storage.SANLock] Releasing host id for domain d9e93730-1801-47fe-a2a1-aea67ce99acc (id: 1) (clusterlock:325)
 
6. Starting the hosted engine vm - this vm has a lease on the domain we just stopped monitoring...

2017-12-14 03:27:07,501-0500 INFO  (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') <?xml version="1.0" encoding="utf-8"?><domain type="kvm" xmlns:ovirt-tune="http://ovirt.org/vm/tune/1.0" xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <name>HostedEngine</name>
    <uuid>9feb6e29-2785-43ca-91db-0abc96217847</uuid>
...
        <lease>
            <key>bdd50fc4-d29f-460a-ba0d-2d9dbeaff651</key>
            <lockspace>d9e93730-1801-47fe-a2a1-aea67ce99acc</lockspace>
            <target offset="0" path="/rhev/data-center/mnt/lago-he-basic-suite-master-storage:_exports_nfs__he/d9e93730-1801-47fe-a2a1-aea67ce99acc/images/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460a-ba0d-2d9dbeaff651.lease"/>
        </lease>

7. Libvirt fails to start the VM since there is no lockspace

2017-12-14 03:27:08,133-0500 ERROR (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') The vm start process failed (vm:915)
...
libvirtError: Failed to acquire lock: No space left on device


Looks like hosted engine is shooting itself in the foot; moving this back to hosted engine.
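
To make the ordering concrete, here is a hypothetical guard on the setup side (illustrative names only, not the actual hosted-engine-setup code, assuming vdsm.client-style VM.getStats and Host.stopMonitoringDomain verbs):

# Hypothetical guard illustrating the race above: step 5 (releasing the
# host id via stopMonitoringDomain) must never run before step 6/7 (lease
# acquisition in virDomainCreateWithFlags) has completed or failed for good.
def safe_stop_monitoring(cli, sd_uuid, vm_id):
    status = cli.VM.getStats(vmID=vm_id)[0]["status"]
    if status not in ("Up", "Down"):
        raise RuntimeError(
            "VM %s is still '%s'; releasing the host id on %s now would "
            "break its sanlock lease acquisition" % (vm_id, status, sd_uuid))
    cli.Host.stopMonitoringDomain(sdUUID=sd_uuid)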

Comment 6 Simone Tiraboschi 2017-12-18 22:38:48 UTC
(In reply to Nir Soffer from comment #4)
> To check whether the host id is acquired, you need to check the hosted
> engine storage domain status in repoStats - it must have
> {"acquired": True, ...}.
> 
> I suggest to check vdsm logs and see if this is indeed the issue.

Everything seems fine on that front.
Looking at http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/136/artifact/exported-artifacts/test_logs/he-basic-suite-master/post-002_bootstrap.py/lago-he-basic-suite-master-host0/_var_log/vdsm/vdsm.log

I see:

2017-12-14 03:27:00,712-0500 INFO  (monitor/d9e9373) [storage.Monitor] Host id for domain d9e93730-1801-47fe-a2a1-aea67ce99acc successfully acquired (id: 1) (monitor:447)

2017-12-14 03:27:00,829-0500 INFO  (jsonrpc/3) [vdsm.api] FINISH repoStats return={u'd9e93730-1801-47fe-a2a1-aea67ce99acc': {'code': 0, 'actual': True, 'version': 3, 'acquired': True, 'delay': '0.000202206', 'lastCheck': '0.1', 'valid': True}} from=::1,44714, task_id=d60d4e22-a16c-4515-ae46-43557bcf29d9 (api:52)


and VM.create comes only after:

2017-12-14 03:27:07,119-0500 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.create succeeded in 0.02 seconds (__init__:573)

and it fails due to:
2017-12-14 03:27:08,137-0500 INFO  (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') Changed state to Down: Failed to acquire lock: No space left on device (code=1) (vm:1634)

Comment 7 Simone Tiraboschi 2017-12-18 22:47:56 UTC
(In reply to Nir Soffer from comment #5)
> 4. Stopping monitoring the domain - this is called only from hosted engine
> 
> 2017-12-14 03:27:07,385-0500 INFO  (jsonrpc/0) [vdsm.api] START
> stopMonitoringDomain(sdUUID=u'd9e93730-1801-47fe-a2a1-aea67ce99acc',
> options=None) from=::1,44714, task_id=632cb6fe-a492-4ace-8128-6ddf30025434
> (api:46)

According to http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/136/artifact/exported-artifacts/test_logs/he-basic-suite-master/post-002_bootstrap.py/lago-he-basic-suite-master-host0/_var_log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20171214032313-aunduz.log

it failed while monitoring the engine VM here:
2017-12-14 03:27:07,138-0500 DEBUG otopi.context context._executeMethod:143 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 133, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-setup/vm/runvm.py", line 174, in _boot_from_hd
    self._create_vm()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/mixins.py", line 303, in _create_vm
    raise RuntimeError(str(e))
RuntimeError: Command VM.getStats with args {'vmID': '9feb6e29-2785-43ca-91db-0abc96217847'} failed:
(code=100, message=General Exception: ("VM '9feb6e29-2785-43ca-91db-0abc96217847' was not defined yet or was undefined",))

and so we called _stopMonitoringDomain here 
2017-12-14 03:27:07,382-0500 DEBUG otopi.plugins.gr_he_setup.storage.storage storage._stopMonitoringDomain:742 _stopMonitoringDomain

it seems that vdsm wasn't aware that it was starting the hosted-engine VM

Comment 8 Simone Tiraboschi 2017-12-18 23:07:59 UTC
(In reply to Simone Tiraboschi from comment #7)
> it seems that vdsm wasn't aware that it was starting the hosted-engine VM

OK, the issue is here:

We get

2017-12-14 03:27:07,119-0500 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.create succeeded in 0.02 seconds (__init__:573)

and so we expect that the VM has been successfully created, and we start polling its status.

But, on overloaded systems, the first request now fails with:

2017-12-14 03:27:07,135-0500 INFO  (jsonrpc/2) [api.virt] FINISH getStats return={'status': {'message': 'General Exception: ("VM \'9feb6e29-2785-43ca-91db-0abc96217847\' was not defined yet or was undefined",)', 'code': 100}} from=::1,44714 (api:52)
2017-12-14 03:27:07,135-0500 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call VM.getStats failed (error 100) in 0.00 seconds (__init__:573)

But the VM wasn't undefined due to the sanlock issue as I thought; it was simply not defined yet, although VM.create had succeeded.

So we started the cleanup procedure, which calls stopMonitoringDomain, and that caused the sanlock issue, since the VM was in fact still about to be started.
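
One possible client-side mitigation (a sketch assuming a vdsm.client-style API; not necessarily the exact fix shipped in ovirt-hosted-engine-setup-2.2.5) is to tolerate a transient error 100 right after VM.create instead of treating the first failed poll as fatal:

import time

from vdsm.client import ServerError

def wait_vm_defined(cli, vm_id, retries=10, delay=1.0):
    # Right after VM.create returns, the first VM.getStats may transiently
    # fail with code 100 ("was not defined yet or was undefined") on a
    # loaded host; retry briefly before giving up and starting cleanup.
    last_error = None
    for _ in range(retries):
        try:
            return cli.VM.getStats(vmID=vm_id)[0]
        except ServerError as e:
            last_error = e
            time.sleep(delay)
    raise last_error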

Comment 9 Nir Soffer 2017-12-18 23:55:14 UTC
(In reply to Simone Tiraboschi from comment #8)
> 2017-12-14 03:27:07,119-0500 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC
> call VM.create succeeded in 0.02 seconds (__init__:573)
> 
> and so we expect that the VM has been successfully created, and we start
> polling its status.
> But, on overloaded systems, the first request now fails with:
> 
> 2017-12-14 03:27:07,135-0500 INFO  (jsonrpc/2) [api.virt] FINISH getStats
> return={'status': {'message': 'General Exception: ("VM
> \'9feb6e29-2785-43ca-91db-0abc96217847\' was not defined yet or was
> undefined",)', 'code': 100}} from=::1,44714 (api:52)
> 2017-12-14 03:27:07,135-0500 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC
> call VM.getStats failed (error 100) in 0.00 seconds (__init__:573)

Milan, is this a correct expectation? Did we change the behavior?

Comment 10 Milan Zamazal 2017-12-19 14:17:50 UTC
getStats fails when attempting to update the disk mapping metadata. That's not intended; we are working on fixes.
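
The merged patches listed above (gerrit 85581 and 85583) indicate the direction. A simplified sketch of that guard (illustrative names, not the verbatim vdsm code):

def disk_mapping_hash(mapping):
    # Report no hash while the guest disk mapping is still empty, so a
    # caller cannot mistake "no data yet" for real mapping data.
    if not mapping:
        return None
    return hash(frozenset(mapping.items()))

def maybe_update_disk_mapping(vm):
    # Skip the metadata update while the VM is not Up yet: updating it
    # during startup raced with VM.create and made getStats fail.
    if vm.lastStatus != "Up":
        return
    vm.update_disk_mapping()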

Comment 11 Nikolai Sednev 2018-01-11 16:56:12 UTC
Not reproduced on real HW; moving to verified as it worked fine for me.
Please reopen if still happens on real hardware environment.

Comment 12 Sandro Bonazzola 2018-02-12 11:57:23 UTC
This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

