Description of problem:

Seen more than once on OST:
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/135
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/136

It seems not really reproducible on real HW.

Hosted-engine-setup fails with:

08:27:07 [ INFO  ] Creating VM
08:27:07 [ ERROR ] Failed to execute stage 'Closing up': Command VM.getStats with args {'vmID': '9feb6e29-2785-43ca-91db-0abc96217847'} failed:
08:27:07 (code=100, message=General Exception: ("VM '9feb6e29-2785-43ca-91db-0abc96217847' was not defined yet or was undefined",))

While in vdsm logs we see:

2017-12-14 03:27:08,133-0500 ERROR (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') The vm start process failed (vm:915)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 844, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2740, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1069, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: Failed to acquire lock: No space left on device
2017-12-14 03:27:08,137-0500 INFO (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') Changed state to Down: Failed to acquire lock: No space left on device (code=1) (vm:1634)

and in sanlock logs:

2017-12-14 03:27:07 526 [4803]: r4 cmd_acquire 4,13,6672 invalid lockspace found -1 failed 0 name d9e93730-1801-47fe-a2a1-aea67ce99acc

Version-Release number of selected component (if applicable):
vdsm-common-4.20.9-45.gitd3f712a.el7.centos.noarch
libvirt-daemon-3.2.0-14.el7_4.5.x86_64
libvirt-lock-sanlock-3.2.0-14.el7_4.5.x86_64
sanlock-3.5.0-1.el7.x86_64
ovirt-hosted-engine-setup-2.2.2-0.0.master.20171213093337.gitca0028e.el7.centos.noarch

How reproducible:
Seen on OST, not really reproducible on real HW.

Steps to Reproduce:
1. Deploy hosted-engine

Actual results:
libvirtError: Failed to acquire lock: No space left on device

Expected results:
The engine VM starts.

Additional info:
All the relevant logs are available here:
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/136/artifact/exported-artifacts/test_logs/he-basic-suite-master/post-002_bootstrap.py/lago-he-basic-suite-master-host0/_var_log/
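As a side note, when hitting "Failed to acquire lock: No space left on device" it is worth checking whether the sanlock lockspace for the hosted-engine storage domain actually exists on the host at that moment. A minimal sketch (run as root; the domain UUID is the one from the logs above):

import subprocess

# Ask the sanlock daemon which lockspaces/resources it currently holds.
# If the domain's lockspace is missing, any lease acquire will fail with
# the same error libvirt reports here.
SD_UUID = 'd9e93730-1801-47fe-a2a1-aea67ce99acc'
out = subprocess.check_output(['sanlock', 'client', 'status']).decode()
print('lockspace present:', SD_UUID in out)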
Since this works on bare metal, can it be an infrastructure issue? Can the host with nested VMs providing the storage be too loaded or slow for example? Sanlock (and the whole oVirt) expects reasonable response time.
(In reply to Martin Sivák from comment #1)
> Since this works on bare metal, can it be an infrastructure issue? Can the
> host with nested VMs providing the storage be too loaded or slow for
> example? Sanlock (and the whole oVirt) expects reasonable response time.

It can be, but:
- everything else seems to be working well in OST
- OST with Hosted Engine used to work in 4.1
- it doesn't happen with the Ansible-based deployment

Nir, what are the next steps to debug this?
(In reply to Yaniv Kaul from comment #2)
> It can be, but:
> - everything else seems to be working well in OST
> - OST with Hosted Engine used to work in 4.1.
> - It doesn't happen with the Ansible based deployment.

In the old flow we are bootstrapping the engine VM via VDSM with shared:exclusive disk attributes, so the VM's disk gets a sanlock volume lease. In the new Ansible-based flow we are bootstrapping a local VM via virt-install with libvirt, so no sanlock is involved there.

> Nir, what are the next steps to debug this?
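For context, a minimal sketch of the kind of drive specification the old flow passes to VM.create (the real dict built by hosted-engine-setup in ovirt_hosted_engine_setup/mixins.py is larger; the fields here are illustrative, using the UUIDs from this report, and the point is only the shared attribute):

# Illustrative sketch: 'shared': 'exclusive' is what makes VDSM attach
# a sanlock volume lease to the drive, requiring the domain lockspace.
engine_vm_drive = {
    'device': 'disk',
    'iface': 'virtio',
    'format': 'raw',
    'poolID': '00000000-0000-0000-0000-000000000000',
    'domainID': 'd9e93730-1801-47fe-a2a1-aea67ce99acc',
    'imageID': 'cc06c1ea-4aea-4de1-a6b3-af3033c23d68',
    'volumeID': 'bdd50fc4-d29f-460a-ba0d-2d9dbeaff651',
    'shared': 'exclusive',  # the VM can only start if the lockspace is up
}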
(In reply to Yaniv Kaul from comment #2)
> Nir, what are the next steps to debug this?

This:

libvirtError: Failed to acquire lock: No space left on device
2017-12-14 03:27:08,137-0500 INFO (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') Changed state to Down: Failed to acquire lock: No space left on device (code=1) (vm:1634)

means that the host id was not acquired yet. It can happen if we try to start the VM on a host before the host id is acquired.

To check if the host id is acquired, you need to check the hosted engine storage domain status in repoStats - it must have {"acquired": True, ...}.

I suggest checking the vdsm logs to see if this is indeed the issue.
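A quick way to do that check from the logs, a minimal sketch that scans vdsm.log for the "FINISH repoStats" lines quoted in this report (it assumes the log line format vdsm 4.20 emits, as seen above):

import ast
import re
import sys

# Print, for each repoStats response mentioning the given domain,
# whether the host id was acquired at that point in time.
PATTERN = re.compile(r'^(?P<ts>\S+ \S+).*FINISH repoStats return=(?P<ret>\{.*\})')

def scan(log_path, domain_uuid):
    with open(log_path) as f:
        for line in f:
            m = PATTERN.match(line)
            if not m:
                continue
            try:
                stats = ast.literal_eval(m.group('ret'))
            except (ValueError, SyntaxError):
                continue  # truncated or non-literal payload
            if domain_uuid in stats:
                print(m.group('ts'), 'acquired =', stats[domain_uuid].get('acquired'))

if __name__ == '__main__':
    scan(sys.argv[1], sys.argv[2])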
Looking in:
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/136/artifact/exported-artifacts/test_logs/he-basic-suite-master/post-002_bootstrap.py/lago-he-basic-suite-master-host0/_var_log/vdsm/vdsm.log

1. Host id was acquired on the domain:

2017-12-14 03:27:00,712-0500 INFO (monitor/d9e9373) [storage.Monitor] Host id for domain d9e93730-1801-47fe-a2a1-aea67ce99acc successfully acquired (id: 1) (monitor:447)

2. Host id still acquired...

2017-12-14 03:27:00,829-0500 INFO (jsonrpc/3) [vdsm.api] FINISH repoStats return={u'd9e93730-1801-47fe-a2a1-aea67ce99acc': {'code': 0, 'actual': True, 'version': 3, 'acquired': True, 'delay': '0.000202206', 'lastCheck': '0.1', 'valid': True}} from=::1,44714, task_id=d60d4e22-a16c-4515-ae46-43557bcf29d9 (api:52)

3. Preparing to run a VM on this domain:

2017-12-14 03:27:07,155-0500 INFO (vm/9feb6e29) [vdsm.api] START prepareImage(sdUUID=u'd9e93730-1801-47fe-a2a1-aea67ce99acc', spUUID=u'00000000-0000-0000-0000-000000000000', imgUUID=u'cc06c1ea-4aea-4de1-a6b3-af3033c23d68', leafUUID=u'bdd50fc4-d29f-460a-ba0d-2d9dbeaff651', allowIllegal=False) from=internal, task_id=4a2a4a87-acb2-4ef3-a62c-ea8dc435a7d2 (api:46)
2017-12-14 03:27:07,174-0500 INFO (vm/9feb6e29) [vdsm.api] FINISH prepareImage return={'info': {'path': u'/rhev/data-center/mnt/lago-he-basic-suite-master-storage:_exports_nfs__he/d9e93730-1801-47fe-a2a1-aea67ce99acc/images/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460a-ba0d-2d9dbeaff651', 'volType': 'path'}, 'path': u'/var/run/vdsm/storage/d9e93730-1801-47fe-a2a1-aea67ce99acc/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460a-ba0d-2d9dbeaff651', 'imgVolumesInfo': [{'domainID': u'd9e93730-1801-47fe-a2a1-aea67ce99acc', 'leaseOffset': 0, 'path': u'/rhev/data-center/mnt/lago-he-basic-suite-master-storage:_exports_nfs__he/d9e93730-1801-47fe-a2a1-aea67ce99acc/images/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460a-ba0d-2d9dbeaff651', 'volumeID': u'bdd50fc4-d29f-460a-ba0d-2d9dbeaff651', 'leasePath': u'/rhev/data-center/mnt/lago-he-basic-suite-master-storage:_exports_nfs__he/d9e93730-1801-47fe-a2a1-aea67ce99acc/images/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460a-ba0d-2d9dbeaff651.lease', 'imageID': u'cc06c1ea-4aea-4de1-a6b3-af3033c23d68'}]} from=internal, task_id=4a2a4a87-acb2-4ef3-a62c-ea8dc435a7d2 (api:52)
2017-12-14 03:27:07,175-0500 INFO (vm/9feb6e29) [vds] prepared volume path: /var/run/vdsm/storage/d9e93730-1801-47fe-a2a1-aea67ce99acc/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460

4. Stopping monitoring the domain - this is called only from hosted engine:

2017-12-14 03:27:07,385-0500 INFO (jsonrpc/0) [vdsm.api] START stopMonitoringDomain(sdUUID=u'd9e93730-1801-47fe-a2a1-aea67ce99acc', options=None) from=::1,44714, task_id=632cb6fe-a492-4ace-8128-6ddf30025434 (api:46)

5. When we stop monitoring a domain, we also release the host id...

2017-12-14 03:27:07,386-0500 INFO (monitor/d9e9373) [storage.SANLock] Releasing host id for domain d9e93730-1801-47fe-a2a1-aea67ce99acc (id: 1) (clusterlock:325)

6. Starting the hosted engine VM - this VM has a lease on the domain we just stopped monitoring...

2017-12-14 03:27:07,501-0500 INFO (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') <?xml version="1.0" encoding="utf-8"?><domain type="kvm" xmlns:ovirt-tune="http://ovirt.org/vm/tune/1.0" xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <name>HostedEngine</name>
    <uuid>9feb6e29-2785-43ca-91db-0abc96217847</uuid>
    ...
    <lease>
        <key>bdd50fc4-d29f-460a-ba0d-2d9dbeaff651</key>
        <lockspace>d9e93730-1801-47fe-a2a1-aea67ce99acc</lockspace>
        <target offset="0" path="/rhev/data-center/mnt/lago-he-basic-suite-master-storage:_exports_nfs__he/d9e93730-1801-47fe-a2a1-aea67ce99acc/images/cc06c1ea-4aea-4de1-a6b3-af3033c23d68/bdd50fc4-d29f-460a-ba0d-2d9dbeaff651.lease"/>
    </lease>

7. Libvirt fails to start the VM since there is no lockspace:

2017-12-14 03:27:08,133-0500 ERROR (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') The vm start process failed (vm:915)
...
libvirtError: Failed to acquire lock: No space left on device

Looks like hosted engine is shooting itself in the foot; moving back to hosted engine.
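In other words, the cleanup path violates the invariant that the lockspace must outlive any lease acquisition in it. Schematically (a pseudocode sketch, not the actual vdsm/hosted-engine-setup code; the helper names are hypothetical):

# Buggy ordering observed in the log (03:27:07.1 - 03:27:08.1):
#
#   setup:   VM.create(...)                      # succeeds, VM starts async
#   setup:   VM.getStats(...)                    # fails: "not defined yet"
#   setup:   cleanup -> stopMonitoringDomain(sd) # releases the host id!
#   vdsm:    libvirt createWithFlags(...)        # tries to acquire the lease
#   sanlock: cmd_acquire ... invalid lockspace   # lockspace already gone
#
# A safe ordering would make sure the VM (and its lease) is really gone
# before releasing the host id:
def safe_cleanup(server, vm_id, sd_uuid):
    if vm_exists(server, vm_id):       # hypothetical helpers; the real
        destroy_vm(server, vm_id)      # code lives in hosted-engine-setup
    stop_monitoring_domain(server, sd_uuid)  # only now release the host id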
(In reply to Nir Soffer from comment #4)
> To check if the host id is acquired, you need to check the hosted engine
> storage domain status in repoStats - it must have {"acquired": True, ...}.
>
> I suggest checking the vdsm logs to see if this is indeed the issue.

Everything seems fine on that front. Looking at
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/136/artifact/exported-artifacts/test_logs/he-basic-suite-master/post-002_bootstrap.py/lago-he-basic-suite-master-host0/_var_log/vdsm/vdsm.log
I see:

2017-12-14 03:27:00,712-0500 INFO (monitor/d9e9373) [storage.Monitor] Host id for domain d9e93730-1801-47fe-a2a1-aea67ce99acc successfully acquired (id: 1) (monitor:447)
2017-12-14 03:27:00,829-0500 INFO (jsonrpc/3) [vdsm.api] FINISH repoStats return={u'd9e93730-1801-47fe-a2a1-aea67ce99acc': {'code': 0, 'actual': True, 'version': 3, 'acquired': True, 'delay': '0.000202206', 'lastCheck': '0.1', 'valid': True}} from=::1,44714, task_id=d60d4e22-a16c-4515-ae46-43557bcf29d9 (api:52)

and VM.create comes only after:

2017-12-14 03:27:07,119-0500 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.create succeeded in 0.02 seconds (__init__:573)

and it fails due to:

2017-12-14 03:27:08,137-0500 INFO (vm/9feb6e29) [virt.vm] (vmId='9feb6e29-2785-43ca-91db-0abc96217847') Changed state to Down: Failed to acquire lock: No space left on device (code=1) (vm:1634)
(In reply to Nir Soffer from comment #5)
> 4. Stopping monitoring the domain - this is called only from hosted engine
>
> 2017-12-14 03:27:07,385-0500 INFO (jsonrpc/0) [vdsm.api] START
> stopMonitoringDomain(sdUUID=u'd9e93730-1801-47fe-a2a1-aea67ce99acc',
> options=None) from=::1,44714, task_id=632cb6fe-a492-4ace-8128-6ddf30025434
> (api:46)

According to
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/136/artifact/exported-artifacts/test_logs/he-basic-suite-master/post-002_bootstrap.py/lago-he-basic-suite-master-host0/_var_log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20171214032313-aunduz.log
it failed monitoring the engine VM here:

2017-12-14 03:27:07,138-0500 DEBUG otopi.context context._executeMethod:143 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 133, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-setup/vm/runvm.py", line 174, in _boot_from_hd
    self._create_vm()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/mixins.py", line 303, in _create_vm
    raise RuntimeError(str(e))
RuntimeError: Command VM.getStats with args {'vmID': '9feb6e29-2785-43ca-91db-0abc96217847'} failed:
(code=100, message=General Exception: ("VM '9feb6e29-2785-43ca-91db-0abc96217847' was not defined yet or was undefined",))

and so we called _stopMonitoringDomain here:

2017-12-14 03:27:07,382-0500 DEBUG otopi.plugins.gr_he_setup.storage.storage storage._stopMonitoringDomain:742 _stopMonitoringDomain

It seems that vdsm wasn't aware that it was starting the hosted-engine VM.
(In reply to Simone Tiraboschi from comment #7)
> it seems that vdsm wasn't aware that it was starting the hosted-engine VM

OK, the issue is here: we get

2017-12-14 03:27:07,119-0500 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.create succeeded in 0.02 seconds (__init__:573)

so we expect that the VM has been successfully created, and we start polling for its status. But on overloaded systems the first request now fails with:

2017-12-14 03:27:07,135-0500 INFO (jsonrpc/2) [api.virt] FINISH getStats return={'status': {'message': 'General Exception: ("VM \'9feb6e29-2785-43ca-91db-0abc96217847\' was not defined yet or was undefined",)', 'code': 100}} from=::1,44714 (api:52)
2017-12-14 03:27:07,135-0500 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call VM.getStats failed (error 100) in 0.00 seconds (__init__:573)

The VM wasn't undefined due to the sanlock issue as I thought; it was simply not defined yet, even though VM.create had succeeded. So we started the cleanup procedure, which calls stopMonitoringDomain, and that caused the sanlock issue, since the VM was actually still being started.
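A sketch of a more tolerant polling loop (the real loop is in ovirt_hosted_engine_setup/mixins.py; get_vm_stats here is a hypothetical callable wrapping the VM.getStats RPC): treat "not defined yet" as transient for a grace period instead of failing immediately and triggering cleanup.

import time

GRACE = 60          # seconds to tolerate "not defined yet" after VM.create
POLL_INTERVAL = 2

def wait_for_vm(get_vm_stats, vm_id):
    """Poll VM stats, tolerating the transient 'not defined yet' error.

    get_vm_stats is a hypothetical helper that returns the stats dict
    or raises RuntimeError carrying the server's error message.
    """
    deadline = time.monotonic() + GRACE
    while True:
        try:
            return get_vm_stats(vm_id)
        except RuntimeError as e:
            transient = 'was not defined yet' in str(e)
            if transient and time.monotonic() < deadline:
                time.sleep(POLL_INTERVAL)  # VM.create returned; libvirt is
                continue                   # still defining the domain
            raise  # real failure (or grace expired): let cleanup proceed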
(In reply to Simone Tiraboschi from comment #8)
> 2017-12-14 03:27:07,119-0500 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC
> call VM.create succeeded in 0.02 seconds (__init__:573)
>
> and so we expected that the VM has successfully been created and we start
> polling about its status.
> But, on overloaded systems, the first request now fails with:
>
> 2017-12-14 03:27:07,135-0500 INFO (jsonrpc/2) [api.virt] FINISH getStats
> return={'status': {'message': 'General Exception: ("VM
> \'9feb6e29-2785-43ca-91db-0abc96217847\' was not defined yet or was
> undefined",)', 'code': 100}} from=::1,44714 (api:52)
> 2017-12-14 03:27:07,135-0500 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC
> call VM.getStats failed (error 100) in 0.00 seconds (__init__:573)

Milan, is this a correct expectation? Did we change the behavior?
getStats fails when attempting to update the disk mapping metadata. That's not intended; we are working on fixes.
Not reproduced on real HW; moving to VERIFIED as it worked fine for me. Please reopen if it still happens on a real hardware environment.
This bug is included in the oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.