Bug 1525955 - VDSM fails to start HE VM due to VM.getStats error right after VM.create (seen on OST)
Summary: VDSM fails to start HE VM due to VM.getStats error right after VM.create (seen on OST)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-setup
Classification: oVirt
Component: General
Version: 2.2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.1
Target Release: ---
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks: 1458709
 
Reported: 2017-12-14 13:43 UTC by Simone Tiraboschi
Modified: 2018-02-12 11:57 UTC
CC: 8 users

Fixed In Version: ovirt-hosted-engine-setup-2.2.5
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-12 11:57:23 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.2+


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1459829 0 high CLOSED [DR] - Hosted-Engine VM migration fails on: 'libvirtError: Failed to acquire lock: No space left on device' (while there... 2022-03-13 14:18:50 UTC
oVirt gerrit 85549 0 ovirt-4.2.0 ABANDONED HE: OST: try reverting a vdsm fix 2021-02-10 16:30:39 UTC
oVirt gerrit 85550 0 master ABANDONED HE: OST: try reverting a vdsm fix 2021-02-10 16:30:39 UTC
oVirt gerrit 85569 0 master ABANDONED HE: OST: try reverting a vdsm fix 2021-02-10 16:30:39 UTC
oVirt gerrit 85570 0 master MERGED virt: avoid failing on inital ServerError on VM.getStats 2021-02-10 16:30:39 UTC
oVirt gerrit 85581 0 master MERGED virt: Return None disk mapping hash when disk mapping is empty 2021-02-10 16:30:39 UTC
oVirt gerrit 85582 0 master MERGED virt: Catch NotConnectedError in guest mapping metadata sync 2021-02-10 16:30:39 UTC
oVirt gerrit 85583 0 master MERGED virt: Don't update disk mapping when VM is not Up 2021-02-10 16:30:39 UTC

Internal Links: 1459829

Description Simone Tiraboschi 2017-12-14 13:43:21 UTC
Description of problem:

Seen more than once on OST:

http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/135
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/136

It seems not to be reproducible on real HW.

Hosted-engine-setup fails with:
08:27:07 [ INFO  ] Creating VM
08:27:07 [ ERROR ] Failed to execute stage 'Closing up': Command VM.getStats with args {'vmID': '9feb6e29-2785-43ca-91db-0abc96217847'} failed:
08:27:07          (code=100, message=General Exception: ("VM '9feb6e29-2785-43ca-91db-0abc96217847' was not defined yet or was undefined",))

While on vdsm logs we see:
2017-12-14 03:27:08,133-0500 ERROR (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') The vm start process failed (vm:915)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 844, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2740, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1069, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: Failed to acquire lock: No space left on device
2017-12-14 03:27:08,137-0500 INFO  (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') Changed state to Down: Failed to acquire lock: No space left on device (code=1) (vm:1634)

and on sanlock logs:
2017-12-14 03:27:07 526 [4803]: r4 cmd_acquire 4,13,6672 invalid lockspace found -1 failed 0 name d9e93730-1801-47fe-a2a1-aea67ce99acc


Version-Release number of selected component (if applicable):
vdsm-common-4.20.9-45.gitd3f712a.el7.centos.noarch
libvirt-daemon-3.2.0-14.el7_4.5.x86_64
libvirt-lock-sanlock-3.2.0-14.el7_4.5.x86_64
sanlock-3.5.0-1.el7.x86_64
ovirt-hosted-engine-setup-2.2.2-0.0.master.20171213093337.gitca0028e.el7.centos.noarch


How reproducible:
seen on OST, not really reproducible on real HW

Steps to Reproduce:
1. deploy hosted-engine
2.
3.

Actual results:
libvirtError: Failed to acquire lock: No space left on device

Expected results:
The engine VM starts

Additional info:
all the relevant logs are available here:
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/136/artifact/exported-artifacts/test_logs/he-basic-suite-master/post-002_bootstrap.py/lago-he-basic-suite-master-host0/_var_log/

Comment 1 Martin Sivák 2017-12-14 14:12:28 UTC
Since this works on bare metal, can it be an infrastructure issue? Could the host with nested VMs providing the storage be too loaded or slow, for example? Sanlock (and the whole of oVirt) expects reasonable response times.

Comment 2 Yaniv Kaul 2017-12-17 08:42:45 UTC
(In reply to Martin Sivák from comment #1)
> Since this works on bare metal, can it be an infrastructure issue? Could
> the host with nested VMs providing the storage be too loaded or slow, for
> example? Sanlock (and the whole of oVirt) expects reasonable response times.

It can be, but:
- everything else seems to be working well in OST
- OST with Hosted Engine used to work in 4.1.
- It doesn't happen with the Ansible based deployment.

Nir, what are the next steps to debug this?

Comment 3 Simone Tiraboschi 2017-12-18 08:24:19 UTC
(In reply to Yaniv Kaul from comment #2)
> It can be, but:
> - everything else seems to be working well in OST
> - OST with Hosted Engine used to work in 4.1.
> - It doesn't happen with the Ansible based deployment.

In the old flow we bootstrap the engine VM via VDSM with shared:exclusive disk attributes.
In the new Ansible-based flow we bootstrap a local VM via virt-install directly over libvirt, so sanlock is not involved there.
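
For clarity, in the old flow the engine VM disk is passed to VM.create with the sanlock-backed shared attribute. A simplified sketch of the relevant drive-spec fragment (UUIDs taken from the logs above; the full vmParams carry many more keys):

# Fragment of the VM.create drive spec used by the old flow (simplified):
# 'shared': 'exclusive' makes vdsm protect the disk with a sanlock lease,
# which is exactly the lease that later fails to be acquired.
drive = {
    'device': 'disk',
    'domainID': 'd9e93730-1801-47fe-a2a1-aea67ce99acc',
    'imageID': 'cc06c1ea-4aea-4de1-a6b3-af3033c23d68',
    'volumeID': 'bdd50fc4-d29f-460a-ba0d-2d9dbeaff651',
    'shared': 'exclusive',
}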

 
> Nir, what are the next steps to debug this?

Comment 4 Nir Soffer 2017-12-18 22:15:22 UTC
(In reply to Yaniv Kaul from comment #2)
> Nir, what are the next steps to debug this?

This:

libvirtError: Failed to acquire lock: No space left on device
2017-12-14 03:27:08,137-0500 INFO  (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') Changed state to Down: Failed to acquire lock: No space left on device (code=1) (vm:1634)

Means that the host id was not acquired yet.

It can happen if we try to start the VM on a host before the host id is acquired.

To check whether the host id is acquired, you need to check the hosted engine
storage domain status in repoStats - it must have {"acquired": True, ...}.

I suggest checking the vdsm logs to see if this is indeed the issue.
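
For reference, a minimal sketch of such a check through the vdsm jsonrpc client (assuming the vdsm.client module and the Host.getStorageRepoStats verb, as in vdsm 4.20; adjust to your version):

# Minimal sketch: verify the host id is acquired on the HE storage domain
# before starting a VM that holds a sanlock lease on it.
from vdsm import client

HE_SD_UUID = "d9e93730-1801-47fe-a2a1-aea67ce99acc"  # from the logs above

cli = client.connect("localhost", 54321)
stats = cli.Host.getStorageRepoStats(domains=[HE_SD_UUID])
if not stats.get(HE_SD_UUID, {}).get("acquired", False):
    raise RuntimeError(
        "host id not yet acquired on %s; a VM lease on this domain "
        "cannot be acquired now" % HE_SD_UUID)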

Comment 5 Nir Soffer 2017-12-18 22:33:22 UTC
Looking in:
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/136/artifact/exported-artifacts/test_logs/he-basic-suite-master/post-002_bootstrap.py/lago-he-basic-suite-master-host0/_var_log/vdsm/vdsm.log

1. host id was acquired on the domain

2017-12-14 03:27:00,712-0500 INFO  (monitor/d9e9373) [storage.Monitor] Host id for domain d9e93730-1801-47fe-a2a1-aea67ce99acc successfully acquired (id: 1) (monitor:447)

2. host id still acquired...

2017-12-14 03:27:00,829-0500 INFO  (jsonrpc/3) [vdsm.api] FINISH repoStats return={u'd9e93730-1801-47fe-a2a1-aea67ce99acc': {'code': 0, 'actual': True, 'version': 3, 'acquired': True, 'delay
': '0.000202206', 'lastCheck': '0.1', 'valid': True}} from=::1,44714, task_id=d60d4e22-a16c-4515-ae46-43557bcf29d9 (api:52)

3. Preparing to run a vm on this domain:

2017-12-14 03:27:07,155-0500 INFO  (vm/9feb6e29) [vdsm.api] START prepareImage(sdUUID=u'd9e93730-1801-47fe-a2a1-aea67ce99acc', spUUID=u'00000000-0000-0000-0000-000000000000', imgUUID=u'cc06c1ea-4aea-4de1-a6b3-af3033c23d68', leafUUID=u'bdd50fc4-d29f-460a-ba0d-2d9dbeaff651', allowIllegal=False) from=internal, task_id=4a2a4a87-acb2-4ef3-a62c-ea8dc435a7d2 (api:46)

2017-12-14 03:27:07,174-0500 INFO  (vm/9feb6e29) [vdsm.api] FINISH prepareImage return={'info': {'path': u'/rhev/data-center/mnt/lago-he-basic-suite-master-storage:_exports_nfs__he/d9e93730-1801-47fe-a2a1-aea67ce99acc/images/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460a-ba0d-2d9dbeaff651', 'volType': 'path'}, 'path': u'/var/run/vdsm/storage/d9e93730-1801-47fe-a2a1-aea67ce99acc/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460a-ba0d-2d9dbeaff651', 'imgVolumesInfo': [{'domainID': u'd9e93730-1801-47fe-a2a1-aea67ce99acc', 'leaseOffset': 0, 'path': u'/rhev/data-center/mnt/lago-he-basic-suite-master-storage:_exports_nfs__he/d9e93730-1801-47fe-a2a1-aea67ce99acc/images/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460a-ba0d-2d9dbeaff651', 'volumeID': u'bdd50fc4-d29f-460a-ba0d-2d9dbeaff651', 'leasePath': u'/rhev/data-center/mnt/lago-he-basic-suite-master-storage:_exports_nfs__he/d9e93730-1801-47fe-a2a1-aea67ce99acc/images/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460a-ba0d-2d9dbeaff651.lease', 'imageID': u'cc06c1ea-4aea-4de1-a6b3-af3033c23d68'}]} from=internal, task_id=4a2a4a87-acb2-4ef3-a62c-ea8dc435a7d2 (api:52)
2017-12-14 03:27:07,175-0500 INFO  (vm/9feb6e29) [vds] prepared volume path: /var/run/vdsm/storage/d9e93730-1801-47fe-a2a1-aea67ce99acc/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460

4. Stopping monitoring of the domain - this is called only from hosted engine

2017-12-14 03:27:07,385-0500 INFO  (jsonrpc/0) [vdsm.api] START stopMonitoringDomain(sdUUID=u'd9e93730-1801-47fe-a2a1-aea67ce99acc', options=None) from=::1,44714, task_id=632cb6fe-a492-4ace-8128-6ddf30025434 (api:46)

5. When we stop monitoring a domain, we also release the host id...

2017-12-14 03:27:07,386-0500 INFO  (monitor/d9e9373) [storage.SANLock] Releasing host id for domain d9e93730-1801-47fe-a2a1-aea67ce99acc (id: 1) (clusterlock:325)
 
6. Starting the hosted engine vm - this vm has a lease on the domain we just stopped monitoring...

2017-12-14 03:27:07,501-0500 INFO  (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') <?xml version="1.0" encoding="utf-8"?><domain type="kvm" xmlns:ovirt-tune="http://ovirt.org/vm/tune/1.0" xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <name>HostedEngine</name>
    <uuid>9feb6e29-2785-43ca-91db-0abc96217847</uuid>
...
        <lease>
            <key>bdd50fc4-d29f-460a-ba0d-2d9dbeaff651</key>
            <lockspace>d9e93730-1801-47fe-a2a1-aea67ce99acc</lockspace>
            <target offset="0" path="/rhev/data-center/mnt/lago-he-basic-suite-master-storage:_exports_nfs__he/d9e93730-1801-47fe-a2a1-aea67ce99acc/images/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460a-ba0d-2d9dbeaff651.lease"/>
        </lease>

7. Libvirt fails to start the VM since there is no lockspace

2017-12-14 03:27:08,133-0500 ERROR (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') The vm start process failed (vm:915)
...
libvirtError: Failed to acquire lock: No space left on device


Looks like hosted engine is shooting itself in the foot; moving this back to hosted engine.
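
To make the ordering concrete, here is a hypothetical guard on the setup side (illustrative names only, not the actual hosted-engine-setup code, assuming vdsm.client-style VM.getStats and Host.stopMonitoringDomain verbs):

# Hypothetical guard illustrating the race above: step 5 (releasing the
# host id via stopMonitoringDomain) must never run before step 6/7 (lease
# acquisition in virDomainCreateWithFlags) has completed or failed for good.
def safe_stop_monitoring(cli, sd_uuid, vm_id):
    status = cli.VM.getStats(vmID=vm_id)[0]["status"]
    if status not in ("Up", "Down"):
        raise RuntimeError(
            "VM %s is still '%s'; releasing the host id on %s now would "
            "break its sanlock lease acquisition" % (vm_id, status, sd_uuid))
    cli.Host.stopMonitoringDomain(sdUUID=sd_uuid)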

Comment 6 Simone Tiraboschi 2017-12-18 22:38:48 UTC
(In reply to Nir Soffer from comment #4)
> To check whether the host id is acquired, you need to check the hosted
> engine storage domain status in repoStats - it must have
> {"acquired": True, ...}.
> 
> I suggest to check vdsm logs and see if this is indeed the issue.

Everything seems fine on that front.
Looking at http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/136/artifact/exported-artifacts/test_logs/he-basic-suite-master/post-002_bootstrap.py/lago-he-basic-suite-master-host0/_var_log/vdsm/vdsm.log

I see:

2017-12-14 03:27:00,712-0500 INFO  (monitor/d9e9373) [storage.Monitor] Host id for domain d9e93730-1801-47fe-a2a1-aea67ce99acc successfully acquired (id: 1) (monitor:447)

2017-12-14 03:27:00,829-0500 INFO  (jsonrpc/3) [vdsm.api] FINISH repoStats return={u'd9e93730-1801-47fe-a2a1-aea67ce99acc': {'code': 0, 'actual': True, 'version': 3, 'acquired': True, 'delay': '0.000202206', 'lastCheck': '0.1', 'valid': True}} from=::1,44714, task_id=d60d4e22-a16c-4515-ae46-43557bcf29d9 (api:52)


and VM.create comes only after:

2017-12-14 03:27:07,119-0500 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.create succeeded in 0.02 seconds (__init__:573)

and it fails due to:
2017-12-14 03:27:08,137-0500 INFO  (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') Changed state to Down: Failed to acquire lock: No space left on device (code=1) (vm:1634)

Comment 7 Simone Tiraboschi 2017-12-18 22:47:56 UTC
(In reply to Nir Soffer from comment #5)
> 4. Stopping monitoring the domain - this is called only from hosted engine
> 
> 2017-12-14 03:27:07,385-0500 INFO  (jsonrpc/0) [vdsm.api] START
> stopMonitoringDomain(sdUUID=u'd9e93730-1801-47fe-a2a1-aea67ce99acc',
> options=None) from=::1,44714, task_id=632cb6fe-a492-4ace-8128-6ddf30025434
> (api:46)

According to http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/136/artifact/exported-artifacts/test_logs/he-basic-suite-master/post-002_bootstrap.py/lago-he-basic-suite-master-host0/_var_log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20171214032313-aunduz.log

it failed while monitoring the engine VM here:
2017-12-14 03:27:07,138-0500 DEBUG otopi.context context._executeMethod:143 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 133, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-setup/vm/runvm.py", line 174, in _boot_from_hd
    self._create_vm()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/mixins.py", line 303, in _create_vm
    raise RuntimeError(str(e))
RuntimeError: Command VM.getStats with args {'vmID': '9feb6e29-2785-43ca-91db-0abc96217847'} failed:
(code=100, message=General Exception: ("VM '9feb6e29-2785-43ca-91db-0abc96217847' was not defined yet or was undefined",))

and so we called _stopMonitoringDomain here 
2017-12-14 03:27:07,382-0500 DEBUG otopi.plugins.gr_he_setup.storage.storage storage._stopMonitoringDomain:742 _stopMonitoringDomain

it seems that vdsm wasn't aware that it was starting the hosted-engine VM

Comment 8 Simone Tiraboschi 2017-12-18 23:07:59 UTC
(In reply to Simone Tiraboschi from comment #7)
> it seems that vdsm wasn't aware that it was starting the hosted-engine VM

OK, the issue is here:

We get

2017-12-14 03:27:07,119-0500 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.create succeeded in 0.02 seconds (__init__:573)

and so we expect that the VM has been successfully created, and we start polling its status.

But, on overloaded systems, the first request now fails with:

2017-12-14 03:27:07,135-0500 INFO  (jsonrpc/2) [api.virt] FINISH getStats return={'status': {'message': 'General Exception: ("VM \'9feb6e29-2785-43ca-91db-0abc96217847\' was not defined yet or was undefined",)', 'code': 100}} from=::1,44714 (api:52)
2017-12-14 03:27:07,135-0500 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call VM.getStats failed (error 100) in 0.00 seconds (__init__:573)

But the VM wasn't undefined due to the sanlock issue as I thought; it was simply not defined yet, although VM.create had succeeded.

So we started the cleanup procedure, which calls stopMonitoringDomain, and that caused the sanlock issue, since the VM was in fact still about to be started.
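
One possible client-side mitigation (a sketch assuming a vdsm.client-style API; not necessarily the exact fix shipped in ovirt-hosted-engine-setup-2.2.5) is to tolerate a transient error 100 right after VM.create instead of treating the first failed poll as fatal:

import time

from vdsm.client import ServerError

def wait_vm_defined(cli, vm_id, retries=10, delay=1.0):
    # Right after VM.create returns, the first VM.getStats may transiently
    # fail with code 100 ("was not defined yet or was undefined") on a
    # loaded host; retry briefly before giving up and starting cleanup.
    last_error = None
    for _ in range(retries):
        try:
            return cli.VM.getStats(vmID=vm_id)[0]
        except ServerError as e:
            last_error = e
            time.sleep(delay)
    raise last_error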

Comment 9 Nir Soffer 2017-12-18 23:55:14 UTC
(In reply to Simone Tiraboschi from comment #8)
> 2017-12-14 03:27:07,119-0500 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC
> call VM.create succeeded in 0.02 seconds (__init__:573)
> 
> and so we expect that the VM has been successfully created, and we start
> polling its status.
> But, on overloaded systems, the first request now fails with:
> 
> 2017-12-14 03:27:07,135-0500 INFO  (jsonrpc/2) [api.virt] FINISH getStats
> return={'status': {'message': 'General Exception: ("VM
> \'9feb6e29-2785-43ca-91db-0abc96217847\' was not defined yet or was
> undefined",)', 'code': 100}} from=::1,44714 (api:52)
> 2017-12-14 03:27:07,135-0500 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC
> call VM.getStats failed (error 100) in 0.00 seconds (__init__:573)

Milan, is this a correct expectation? Did we change the behavior?

Comment 10 Milan Zamazal 2017-12-19 14:17:50 UTC
getStats fails when attempting to update the disk mapping metadata. That's not intended; we are working on fixes.
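
The merged patches listed above (gerrit 85581 and 85583) indicate the direction. A simplified sketch of that guard (illustrative names, not the verbatim vdsm code):

def disk_mapping_hash(mapping):
    # Report no hash while the guest disk mapping is still empty, so a
    # caller cannot mistake "no data yet" for real mapping data.
    if not mapping:
        return None
    return hash(frozenset(mapping.items()))

def maybe_update_disk_mapping(vm):
    # Skip the metadata update while the VM is not Up yet: updating it
    # during startup raced with VM.create and made getStats fail.
    if vm.lastStatus != "Up":
        return
    vm.update_disk_mapping()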

Comment 11 Nikolai Sednev 2018-01-11 16:56:12 UTC
Not reproduced on real HW; moving to verified as it worked fine for me.
Please reopen if still happens on real hardware environment.

Comment 12 Sandro Bonazzola 2018-02-12 11:57:23 UTC
This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

