Bug 1425900 - [downstream clone - 4.0.7] After backup-restore operation HE storage domain stuck in the 'Locked' state
Summary: [downstream clone - 4.0.7] After backup-restore operation HE storage domain stuck in the 'Locked' state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ovirt-4.0.7
Assignee: Simone Tiraboschi
QA Contact: Artyom
URL:
Whiteboard: PM-14
Depends On: 1409112
Blocks: 1416111
 
Reported: 2017-02-22 17:25 UTC by rhev-integ
Modified: 2019-04-28 14:00 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, attaching a live storage domain after restoring the database caused sanlock to kill the self-hosted engine virtual machine. This has now been fixed.
Clone Of: 1409112
Environment:
Last Closed: 2017-03-16 15:34:43 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0542 0 normal SHIPPED_LIVE Red Hat Virtualization Manager 4.0.7 2017-03-16 19:25:04 UTC
oVirt gerrit 69622 0 None MERGED core: Add a Guid constructor to ImportHostedEngineStorageDomainCommand 2019-11-18 18:52:32 UTC
oVirt gerrit 69729 0 None ABANDONED sp: Do not release host id when attaching domain 2019-11-18 18:52:32 UTC
oVirt gerrit 69769 0 None MERGED sp: Fix attaching of live storage domain 2019-11-18 18:52:32 UTC
oVirt gerrit 69904 0 None MERGED core: Add a Guid constructor to ImportHostedEngineStorageDomainCommand 2019-11-18 18:52:32 UTC
oVirt gerrit 70325 0 None MERGED sp: Fix attaching of live storage domain 2019-11-18 18:52:32 UTC

Description rhev-integ 2017-02-22 17:25:24 UTC
+++ This bug is an upstream to downstream clone. The original bug is: +++
+++   bug 1409112 +++
======================================================================

Created attachment 1235894 [details]
logs

Description of problem:
I ran a backup-restore operation on the HE environment; after that, I finished engine-setup on the HE VM. The engine starts the auto-import process, but the HE storage domain gets stuck in the 'Locked' state, and I can see this error in the engine log:
2016-12-29 09:38:10,175-05 INFO  [org.ovirt.engine.core.utils.osinfo.OsInfoPreferencesLoader] (ServerService Thread Pool -- 53) [] Loading file '/etc/ovirt-engine/osinfo.conf.d/00-defaults.properties'
2016-12-29 09:38:10,294-05 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (ServerService Thread Pool -- 53) [] CommandsFactory : Failed to get type information using reflection for Class  'org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand', Command Id 'c467e644-cd80-4c22-b8f5-c51b7836496b': org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand.<init>(org.ovirt.engine.core.compat.Guid)
2016-12-29 09:38:10,294-05 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (ServerService Thread Pool -- 53) [] Exception: java.lang.NoSuchMethodException: org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand.<init>(org.ovirt.engine.core.compat.Guid)
    at java.lang.Class.getConstructor0(Class.java:3082) [rt.jar:1.8.0_111]
    at java.lang.Class.getDeclaredConstructor(Class.java:2178) [rt.jar:1.8.0_111]
    at org.ovirt.engine.core.bll.CommandsFactory.createCommand(CommandsFactory.java:127) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.compensate(Backend.java:412) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.initialize(Backend.java:263) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.create(Backend.java:194) [bll.jar:]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.8.0_111]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) [rt.jar:1.8.0_111]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_111]
    at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_111]
    at org.jboss.as.ee.component.ManagedReferenceLifecycleMethodInterceptor.processInvocation(ManagedReferenceLifecycleMethodInterceptor.java:96)
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
    at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:437)
    at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.delegateInterception(Jsr299BindingsInterceptor.java:70) [wildfly-weld-7.0.0.GA-redhat-2.jar:7.0.0.GA-redhat-2]
    at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.doLifecycleInterception(Jsr299BindingsInterceptor.java:117) [wildfly-weld-7.0.0.GA-redhat-2.jar:7.0.0.GA-redhat-2]
    at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.processInvocation(Jsr299BindingsInterceptor.java:103) [wildfly-weld-7.0.0.GA-redhat-2.jar:7.0.0.GA-redhat-2]
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
    at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:437)
....

Version-Release number of selected component (if applicable):
rhevm-4.1.0-0.3.beta2.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy HE environment
2. Add the storage domain to the engine (to start the auto-import process)
3. Wait until the HE VM appears in the engine
4. Set global maintenance
5. Backup the engine: # engine-backup --mode=backup --file=engine.backup --log=engine-backup.log
6. Copy the backup file from the HE VM to the host
7. Clean the host from the HE deployment (reprovisioning)
8. Run the HE deployment again
9. Answer No to the question "Automatically execute engine-setup on the engine appliance on first boot (Yes, No)[Yes]? "
10. Log in to the HE VM and copy the backup file from the host to the HE VM
11. Run restore command: # engine-backup --mode=restore --scope=all --file=engine.backup --log=engine-restore.log  --he-remove-storage-vm --he-remove-hosts --restore-permissions --provision-all-databases
12. Run engine setup: # engine-setup --offline
13. Finish HE deployment process

Actual results:
The engine starts the auto-import process, but the HE SD gets stuck in the 'Locked' state.

Expected results:
The engine starts the auto-import process and finishes it without any errors. The HE SD and the HE VM appear in the engine with status 'Up'.

Additional info:
I tried to destroy the HE SD; in that case the HE VM died on the host.

(Originally by Artyom Lukianov)

Comment 1 rhev-integ 2017-02-22 17:25:33 UTC
The part below suggests the engine was killed during the import and then restarted. During the restart we try to recover (compensate) the import command, which is not supported, hence the exception.

	at org.ovirt.engine.core.bll.Backend.compensate(Backend.java:412) [bll.jar:]
	at org.ovirt.engine.core.bll.Backend.initialize(Backend.java:263) [bll.jar:]
	at org.ovirt.engine.core.bll.Backend.create(Backend.java:194) [bll.jar:]

The question is: why was the engine killed?

(Originally by Doron Fediuck)

Comment 3 rhev-integ 2017-02-22 17:25:40 UTC
Simone can you please investigate?

(Originally by Sandro Bonazzola)

Comment 4 rhev-integ 2017-02-22 17:25:46 UTC
I have no idea why it was killed; the only thing I can think of is some kind of race condition when the engine connects the master SD.

(Originally by Artyom Lukianov)

Comment 5 rhev-integ 2017-02-22 17:25:51 UTC
Compensation is irrelevant for now.
If we want to solve this, we need to understand what killed the engine and why.

Can you reproduce this while checking who/what is killing the engine?

(Originally by Doron Fediuck)

Comment 6 rhev-integ 2017-02-22 17:25:58 UTC
I can reproduce the situation on my environment and you can take a look at it; just let me know when you want it.

(Originally by Artyom Lukianov)

Comment 7 rhev-integ 2017-02-22 17:26:04 UTC
OK, fully reproduced.

The flow on a clean install is:

1. hosted-engine-setup creates the engine VM
2. engine-setup configures the engine inside the VM
3. hosted-engine-setup adds the first host to the engine
4. hosted-engine-setup polls until the host goes up in the engine and then configures the latest bits
5. hosted-engine-setup gracefully shuts down the engine VM; here we have a two-minute delay
6. ovirt-ha-agent restarts the engine VM
7. sooner or later, the user adds the first storage domain for regular VMs; only at this point does the datacenter reach the Up status
8. once the datacenter reaches the Up status, the auto-import process triggers; due to lock acquisition time it can last a few minutes

The flow on this backup/restore scenario instead is:

1. hosted-engine-setup creates the engine VM
2. the user connects to the engine VM and manually restores the backup with engine-backup: engine-backup can now filter out the previous engine VM and the hosted-engine storage domain on the fly; other hosts and other storage domains remain in the DB
3. the user runs engine-setup to configure and start the engine inside the VM
3.1. if other hosts and other storage domains are in the engine DB, the datacenter goes up now: in that case the auto-import process also triggers now
4. hosted-engine-setup adds the host it is running on to the engine
4.1. if other storage domains are in the engine DB (almost certainly the case, otherwise there would be no need to restore a backup) the datacenter goes up now: in that case the auto-import process also triggers now
5. hosted-engine-setup polls until the host goes up in the engine and then configures the latest bits; the auto-import process is still running
6. hosted-engine-setup gracefully shuts down the engine VM; here we have a two-minute delay, and nothing checks that the auto-import process terminated correctly
7. the auto-import process is not able to recover if it gets interrupted in the middle, hence this issue

So the point is that in this restore-based scenario, the auto-import process can trigger before hosted-engine-setup shuts down the engine VM, and thus get interrupted in the middle.

Workaround:
Run hosted-engine-setup with OVEHOSTED_VM/automateVMShutdown=bool:False and, when it asks to shut down the engine VM so that ovirt-ha-agent can restart it, wait and ensure that the auto-import process completed successfully.

Possible solutions:
A. fix the auto-import procedure to be able to recover itself after being shut down in the middle
B. have hosted-engine-setup also poll for running tasks before shutting down the engine VM; this might not be effective due to the two-minute delay

(Originally by Simone Tiraboschi)

Comment 8 rhev-integ 2017-02-22 17:26:10 UTC
Hi Simone, I tried to run hosted-engine --deploy with OVEHOSTED_VM/automateVMShutdown=bool:False, but I still have the problem.
It looks like when the engine starts the auto-import process it somehow releases the HE sanlock lock; as a result the HE VM is killed and hosted-engine-setup finishes the deployment process.
In the sanlock log I can see:
2017-01-04 10:24:14+0200 72445 [11444]: s5:r7 resource c4c300b7-82ca-46f2-be7c-2e32940addf3:SDM:/rhev/data-center/mnt/10.35.110.11:_Compute__NFS_alukiano_he__2/c4c300b7-82ca-46f2-be7c-2e32940addf3/dom_md/leases:1048576 for 2,15,13041
2017-01-04 10:24:15+0200 72446 [10581]: s5 kill 11932 sig 9 count 1
2017-01-04 10:24:15+0200 72446 [10581]: dead 11932 ci 3 count 1

From /var/log/messages:
Jan  4 10:24:10 cyan-vdsf systemd: Removed slice user-99.slice.
Jan  4 10:24:10 cyan-vdsf systemd: Stopping user-99.slice.
Jan  4 10:24:15 cyan-vdsf sanlock[10581]: 2017-01-04 10:24:15+0200 72446 [10581]: s5 kill 11932 sig 9 count 1
Jan  4 10:24:15 cyan-vdsf sanlock[10581]: 2017-01-04 10:24:15+0200 72446 [10581]: dead 11932 ci 3 count 1
Jan  4 10:24:15 cyan-vdsf kernel: ovirtmgmt: port 2(vnet0) entered disabled state
Jan  4 10:24:15 cyan-vdsf kernel: device vnet0 left promiscuous mode
Jan  4 10:24:15 cyan-vdsf kernel: ovirtmgmt: port 2(vnet0) entered disabled state
Jan  4 10:24:15 cyan-vdsf kvm: 0 guests now active
Jan  4 10:24:15 cyan-vdsf systemd-machined: Machine qemu-1-HostedEngine terminated.

(Originally by Artyom Lukianov)

Comment 9 rhev-integ 2017-02-22 17:26:17 UTC
After discussion with QE moving to high.

(Originally by Yaniv Dary)

Comment 10 rhev-integ 2017-02-22 17:26:23 UTC
Yes, with OVEHOSTED_VM/automateVMShutdown=bool:False the engine VM doesn't get shut down in the middle of the auto-import by hosted-engine-setup, but it still fails.

In that case it gets abruptly shut down because the engine image gets torn down; I think it could be related to: https://bugzilla.redhat.com/show_bug.cgi?id=1402789

In VDSM logs we see:
2017-01-04 17:18:42,721 INFO  (jsonrpc/7) [dispatcher] Run and protect: attachStorageDomain(sdUUID=u'ad0313d6-1c3f-456d-8f6d-d8a5ee94e922', spUUID=u'00000001-0001-0001-0001-000000000311', options=None) (logUtils:49)
2017-01-04 17:18:42,722 INFO  (jsonrpc/7) [storage.StoragePool] sdUUID=ad0313d6-1c3f-456d-8f6d-d8a5ee94e922 spUUID=00000001-0001-0001-0001-000000000311 (sp:914)
2017-01-04 17:18:42,722 INFO  (jsonrpc/7) [storage.SANLock] Acquiring host id for domain ad0313d6-1c3f-456d-8f6d-d8a5ee94e922 (id: 1) (clusterlock:282)
2017-01-04 17:18:42,723 INFO  (jsonrpc/7) [storage.SANLock] Acquiring Lease(name='SDM', path=u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe6/ad0313d6-1c3f-456d-8f6d-d8a5ee94e922/dom_md/leases', offset=1048576) for host id 1 (clusterlock:343)
2017-01-04 17:18:42,733 INFO  (jsonrpc/7) [storage.StoragePoolMemoryBackend] new storage pool master version 1 and domains map {u'e4eda512-d834-4727-b144-0994cb4424b3': u'Active', u'ad0313d6-1c3f-456d-8f6d-d8a5ee94e922': 'Attached'} (spbackends:450)
2017-01-04 17:18:42,733 INFO  (jsonrpc/7) [storage.StoragePool] Linking /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe6/ad0313d6-1c3f-456d-8f6d-d8a5ee94e922 to /rhev/data-center/00000001-0001-0001-0001-000000000311/ad0313d6-1c3f-456d-8f6d-d8a5ee94e922 (sp:1183)
2017-01-04 17:18:42,734 INFO  (jsonrpc/7) [storage.SANLock] Releasing Lease(name='SDM', path=u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe6/ad0313d6-1c3f-456d-8f6d-d8a5ee94e922/dom_md/leases', offset=1048576) (clusterlock:390)
2017-01-04 17:18:42,735 INFO  (jsonrpc/7) [storage.SANLock] Releasing host id for domain ad0313d6-1c3f-456d-8f6d-d8a5ee94e922 (id: 1) (clusterlock:305)
2017-01-04 17:18:42,785 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call VM.getStats succeeded in 0.00 seconds (__init__:515)
2017-01-04 17:18:43,709 INFO  (libvirt/events) [virt.vm] (vmId='bafca90e-62b4-425f-b982-f6e62e845255') underlying process disconnected (vm:672)
2017-01-04 17:18:43,709 INFO  (libvirt/events) [virt.vm] (vmId='bafca90e-62b4-425f-b982-f6e62e845255') Release VM resources (vm:4080)
2017-01-04 17:18:43,710 INFO  (libvirt/events) [virt.vm] (vmId='bafca90e-62b4-425f-b982-f6e62e845255') Stopping connection (guestagent:430)
2017-01-04 17:18:43,711 INFO  (libvirt/events) [dispatcher] Run and protect: teardownImage(sdUUID=u'ad0313d6-1c3f-456d-8f6d-d8a5ee94e922', spUUID=u'00000000-0000-0000-0000-000000000000', imgUUID=u'baeb6b20-fa65-472b-b30b-c8fb05bd2fa6', volUUID=None) (logUtils:49)
2017-01-04 17:18:44,735 INFO  (jsonrpc/7) [dispatcher] Run and protect: attachStorageDomain, Return response: None (logUtils:52)
2017-01-04 17:18:44,736 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call StorageDomain.attach succeeded in 2.02 seconds (__init__:515)
2017-01-04 17:18:44,738 INFO  (libvirt/events) [dispatcher] Run and protect: teardownImage, Return response: None (logUtils:52)
2017-01-04 17:18:44,738 INFO  (libvirt/events) [virt.vm] (vmId='bafca90e-62b4-425f-b982-f6e62e845255') Stopping connection (guestagent:430)
2017-01-04 17:18:44,738 WARN  (libvirt/events) [root] File: /var/lib/libvirt/qemu/channels/bafca90e-62b4-425f-b982-f6e62e845255.com.redhat.rhevm.vdsm already removed (utils:120)
2017-01-04 17:18:44,738 WARN  (libvirt/events) [root] File: /var/lib/libvirt/qemu/channels/bafca90e-62b4-425f-b982-f6e62e845255.org.qemu.guest_agent.0 already removed (utils:120)
2017-01-04 17:18:44,739 WARN  (libvirt/events) [root] File: /var/run/ovirt-vmconsole-console/bafca90e-62b4-425f-b982-f6e62e845255.sock already removed (utils:120)
2017-01-04 17:18:44,739 INFO  (libvirt/events) [dispatcher] Run and protect: inappropriateDevices(thiefId=u'bafca90e-62b4-425f-b982-f6e62e845255') (logUtils:49)
2017-01-04 17:18:44,741 INFO  (libvirt/events) [dispatcher] Run and protect: inappropriateDevices, Return response: None (logUtils:52)
2017-01-04 17:18:44,742 INFO  (libvirt/events) [virt.vm] (vmId='bafca90e-62b4-425f-b982-f6e62e845255') Changed state to Down: Lost connection with qemu process (code=2) (vm:1198)
2017-01-04 17:18:44,742 INFO  (libvirt/events) [virt.vm] (vmId='bafca90e-62b4-425f-b982-f6e62e845255') Stopping connection (guestagent:430)
2017-01-04 17:18:46,247 INFO  (periodic/3) [dispatcher] Run and protect: repoStats(options=None) (logUtils:49)

And then the engine VM goes down.
ad0313d6-1c3f-456d-8f6d-d8a5ee94e922 in my case was the hosted-engine storage domain uuid.

(Originally by Simone Tiraboschi)
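The sequence above becomes clearer as a sketch: the host id backs the sanlock lockspace of the hosted-engine storage domain, and the hosted-engine VM holds a volume lease inside that same lockspace, so releasing the host id at the end of attachStorageDomain gets the VM killed. The toy model below only illustrates that ordering; the class and function names are invented for the example and are not the real VDSM/sanlock API.

class ToyLockspace:
    """Toy model of a sanlock lockspace: dropping the host id kills any
    process that still holds a resource lease in the lockspace."""

    def __init__(self, sd_uuid):
        self.sd_uuid = sd_uuid
        self.host_id_acquired = False
        self.lease_holders = {}  # lease name -> pid of the holder

    def acquire_host_id(self):
        self.host_id_acquired = True

    def acquire_lease(self, name, pid):
        self.lease_holders[name] = pid

    def release_lease(self, name):
        self.lease_holders.pop(name, None)

    def release_host_id(self):
        # Without the host id sanlock cannot keep the leases alive, so it
        # kills every remaining holder (compare "s5 kill 11932 sig 9" in the
        # sanlock log of comment 8).
        for name, pid in self.lease_holders.items():
            print("sanlock: kill pid %d (lease %r in %s)" % (pid, name, self.sd_uuid))
        self.host_id_acquired = False


def attach_storage_domain(lockspace):
    # Pre-fix behaviour during attachStorageDomain on a domain VDSM is not
    # monitoring: acquire the host id and the SDM lease, link the domain into
    # the pool, then release both again.
    lockspace.acquire_host_id()
    lockspace.acquire_lease("SDM", pid=4242)   # pid of the attaching task (illustrative)
    # ... update pool metadata, create the /rhev/data-center links ...
    lockspace.release_lease("SDM")
    lockspace.release_host_id()                # <-- the HE VM is killed here


he_sd = ToyLockspace("ad0313d6-1c3f-456d-8f6d-d8a5ee94e922")
he_sd.acquire_host_id()                            # taken when the HE VM was started
he_sd.acquire_lease("HostedEngine volume", 11932)  # qemu holds its volume lease
attach_storage_domain(he_sd)

Running this prints a single "sanlock: kill" line for the HostedEngine lease holder, which matches what the sanlock and /var/log/messages excerpts in comment 8 show.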

Comment 11 rhev-integ 2017-02-22 17:26:30 UTC
Nir, could you please take a look at comment 9?

(Originally by Simone Tiraboschi)

Comment 12 rhev-integ 2017-02-22 17:26:36 UTC
In bootstrap mode, when ovirt-hosted-engine-setup launches the engine VM for the first time, on the sanlock side we have:

[root@c72he20170105h1 ~]# sanlock client status
daemon d451ed7b-78b5-454b-8e39-bca5a3a3e935.c72he20170
p -1 helper
p -1 listener
p -1 inq_lockspace
p 19105 HostedEngine
p -1 status
s 53cfc928-9bf1-4c39-a0d7-ea6399e3a27a:1:/rhev/data-center/mnt/192.168.1.115\:_Virtual_exthe6/53cfc928-9bf1-4c39-a0d7-ea6399e3a27a/dom_md/ids:0
s e4eda512-d834-4727-b144-0994cb4424b3:1:/rhev/data-center/mnt/192.168.1.115\:_Virtual_data3/e4eda512-d834-4727-b144-0994cb4424b3/dom_md/ids:0 ADD
r 53cfc928-9bf1-4c39-a0d7-ea6399e3a27a:2451100e-7192-497a-a806-fec39266aa94:/rhev/data-center/mnt/192.168.1.115\:_Virtual_exthe6/53cfc928-9bf1-4c39-a0d7-ea6399e3a27a/images/006e56d7-2215-49e2-924c-b14c10a57f16/2451100e-7192-497a-a806-fec39266aa94.lease:0:1 p 19105

In this case the auto-import procedure will kill the engine VM.


When ovirt-ha-agent becomes active and restarts the engine VM, on the sanlock side we have:
[root@c72he20170105h1 ~]# sanlock client status
daemon d451ed7b-78b5-454b-8e39-bca5a3a3e935.c72he20170
p -1 helper
p -1 listener
p 21033 
p 22060 HostedEngine
p -1 status
s e4eda512-d834-4727-b144-0994cb4424b3:1:/rhev/data-center/mnt/192.168.1.115\:_Virtual_data3/e4eda512-d834-4727-b144-0994cb4424b3/dom_md/ids:0
s hosted-engine:1:/var/run/vdsm/storage/53cfc928-9bf1-4c39-a0d7-ea6399e3a27a/c39dd257-994e-4991-b21e-76f1bf9da4ff/278956d7-cc9a-43ab-9b90-1d3616fb7f3d:0
s 53cfc928-9bf1-4c39-a0d7-ea6399e3a27a:1:/rhev/data-center/mnt/192.168.1.115\:_Virtual_exthe6/53cfc928-9bf1-4c39-a0d7-ea6399e3a27a/dom_md/ids:0
r 53cfc928-9bf1-4c39-a0d7-ea6399e3a27a:2451100e-7192-497a-a806-fec39266aa94:/rhev/data-center/mnt/192.168.1.115\:_Virtual_exthe6/53cfc928-9bf1-4c39-a0d7-ea6399e3a27a/images/006e56d7-2215-49e2-924c-b14c10a57f16/2451100e-7192-497a-a806-fec39266aa94.lease:0:2 p 22060
r e4eda512-d834-4727-b144-0994cb4424b3:SDM:/rhev/data-center/mnt/192.168.1.115\:_Virtual_data3/e4eda512-d834-4727-b144-0994cb4424b3/dom_md/leases:1048576:5 p 21033

In this case the auto-import procedure will work.

(Originally by Simone Tiraboschi)

Comment 13 rhev-integ 2017-02-22 17:26:42 UTC
A stronger workaround is to forcefully disable the auto-import procedure until we reach a stable condition.

The flow:
- Launch hosted-engine --deploy
- Say no to automatically execute engine-setup
- Connect to the engine VM
- Import the backup with 'engine-backup --mode=restore --file=engine_backup.tar.gz --log=engine_restore.log --restore-permissions --provision-all-databases --he-remove-storage-vm --he-remove-hosts'
- Disable the auto-import procedure with 'engine-config -s AutoImportHostedEngine=False'
- Run engine-setup
- Go back to hosted-engine-setup on the host and say that the engine is ready; hosted-engine-setup will add the host to the engine; once the host goes up, hosted-engine-setup will shut down the engine VM within two minutes
- ovirt-ha-agent will take care of restarting the engine VM
- once the engine VM is up again, re-enable the auto-import procedure with 'engine-config -s AutoImportHostedEngine=False; systemctl restart ovirt-engine'
- the engine should now import the hosted-engine storage domain and the engine VM

(Originally by Simone Tiraboschi)

Comment 14 rhev-integ 2017-02-22 17:26:48 UTC
Sorry, the last command should obviously be:
engine-config -s AutoImportHostedEngine=True; systemctl restart ovirt-engine

(Originally by Simone Tiraboschi)

Comment 15 rhev-integ 2017-02-22 17:26:55 UTC
Vdsm releases the host id during attachStorageDomain only if the storage domain is not monitored by vdsm. So it seems the issue is caused by an incorrect flow in the backup and restore procedure.

Before starting the engine, hosted engine must call startMonitoringDomain. After that, vdsm will not release the host id on this domain during the auto-import.

However, releasing the host id during attachStorageDomain is unneeded and unhelpful, so I removed it in the normal flow (see https://gerrit.ovirt.org/69729). Vdsm will release the host id only in failure modes, when nobody else will release it.

Simone, can you test this patch with hosted engine flows?

(Originally by Nir Soffer)

Comment 16 rhev-integ 2017-02-22 17:27:01 UTC
(In reply to Nir Soffer from comment #14)
> Before starting the engine, hosted engine must call startMonitoringDomain. After that,
> vdsm will not release the host id on this domain during the auto-import.

AFAIK we are also calling startMonitoringDomain from hosted-engine-setup; indeed, if you check Artyom's logs you see:

2016-12-29 15:45:16,626 INFO  (jsonrpc/2) [dispatcher] Run and protect: startMonitoringDomain(sdUUID='5dd9de58-807f-4a38-874a-b5ff9a9bba87', hostID=1, options=None) (logUtils:49)
2016-12-29 15:45:16,627 INFO  (jsonrpc/2) [storage.Monitor] Start monitoring 5dd9de58-807f-4a38-874a-b5ff9a9bba87 (monitor:182)
2016-12-29 15:45:16,627 INFO  (jsonrpc/2) [dispatcher] Run and protect: startMonitoringDomain, Return response: None (logUtils:52)
2016-12-29 15:45:16,627 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.startMonitoringDomain succeeded in 0.00 seconds (__init__:515)
2016-12-29 15:45:16,628 INFO  (monitor/5dd9de5) [storage.check] Start checking u'/rhev/data-center/mnt/10.35.110.11:_Compute__NFS_alukiano_he/5dd9de58-807f-4a38-874a-b5ff9a9bba87/dom_md/metadata' (check:121)
2016-12-29 15:45:16,639 INFO  (monitor/5dd9de5) [storage.Monitor] Domain 5dd9de58-807f-4a38-874a-b5ff9a9bba87 became VALID (monitor:456)
2016-12-29 15:45:16,639 INFO  (monitor/5dd9de5) [storage.SANLock] Acquiring host id for domain 5dd9de58-807f-4a38-874a-b5ff9a9bba87 (id: 1) (clusterlock:282)
2016-12-29 15:45:16,640 INFO  (event/9) [vds] vmContainerLock acquired (clientIF:186)

before:

2016-12-29 15:45:52,805 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call VM.create succeeded in 0.01 seconds (__init__:515)
2016-12-29 15:45:52,807 INFO  (vm/dfa26872) [virt.vm] (vmId='dfa26872-7212-4def-bbc2-1ff28759f6ba') VM wrapper has started (vm:1901)

Maybe the point is that host-deploy is restarting vdsm when the engine adds the host; is the monitoring supposed to persist over a VDSM restart?

> Simone, can you test this patch with hosted engine flows?

yes, sure, thanks

(Originally by Simone Tiraboschi)

Comment 17 rhev-integ 2017-02-22 17:27:07 UTC
(In reply to Simone Tiraboschi from comment #15)
> (In reply to Nir Soffer from comment #14)
> Maybe the point is that host-deploy is restarting vdsm when the engine adds
> the host; is the monitoring supposed to persist over a VDSM restart?

No, vdsm is stateless, the state is managed by engine or by hosted engine agent.
If you did not start to monitor the domain, vdsm does not know about it.

(Originally by Nir Soffer)

Comment 18 rhev-integ 2017-02-22 17:27:14 UTC
(In reply to Nir Soffer from comment #16)
> No, vdsm is stateless, the state is managed by engine or by hosted engine
> agent.
> If you did not start to monitor the domain, vdsm does not know about it.

So we have no way to fix this on the hosted-engine-setup side: we already send startMonitoringDomain, but host-deploy restarts vdsm without notifying hosted-engine-setup, and as soon as host-deploy finishes, the engine imports the hosted-engine storage domain, since the other information it needs is already in the DB the user restored.

(Originally by Simone Tiraboschi)
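Put as a sketch (the names below are made up for illustration and are not VDSM's actual internals): because VDSM is stateless, the monitoring that hosted-engine-setup started does not survive the VDSM restart performed by host-deploy, so the later attach sees an unmonitored domain and releases the host id that protects the running engine VM.

# In-memory state only: emptied whenever VDSM restarts.
monitored_domains = set()

def start_monitoring_domain(sd_uuid, host_id):
    # Called by hosted-engine-setup before the engine starts; the monitor
    # acquires and keeps the host id for this domain.
    monitored_domains.add(sd_uuid)
    print("monitor: acquire host id %d for %s" % (host_id, sd_uuid))

def attach_storage_domain(sd_uuid, host_id):
    temporarily_acquired = sd_uuid not in monitored_domains
    if temporarily_acquired:
        print("attach: acquire host id %d for %s" % (host_id, sd_uuid))
    print("attach: link %s into the pool" % sd_uuid)
    if temporarily_acquired:
        # After host-deploy restarted VDSM, monitored_domains is empty again,
        # so this branch runs and the host id backing the running HE VM's
        # lease is dropped.
        print("attach: release host id %d for %s" % (host_id, sd_uuid))

# The restore flow in this bug, compressed:
start_monitoring_domain("5dd9de58-807f-4a38-874a-b5ff9a9bba87", 1)  # hosted-engine-setup
monitored_domains.clear()                                           # host-deploy restarts VDSM
attach_storage_domain("5dd9de58-807f-4a38-874a-b5ff9a9bba87", 1)    # auto-import: releases the host id

This is why the fix ended up on the VDSM side rather than in hosted-engine-setup.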

Comment 19 rhev-integ 2017-02-22 17:27:19 UTC
Testing https://gerrit.ovirt.org/69729 shows that the patch is wrong.

During attachStorageDomain the domain remains inactive, so vdsm cannot acquire a host id and leave it acquired, because nobody would release it.

We have two ways to fix this:
- before attaching the hosted-engine storage domain, the engine can ask vdsm to start monitoring it; in this case vdsm will not release the host id.
- vdsm can check whether the host id is already acquired on the domain, and leave it acquired in that case.

(Originally by Nir Soffer)

Comment 20 rhev-integ 2017-02-22 17:27:25 UTC
Patch http://gerrit.ovirt.org/69769 implements the second solution mentioned in comment 18.

Simone, can you test this patch with hosted engine flows?

(Originally by Nir Soffer)
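A minimal sketch of that second option, matching what the merged 'sp: Fix attaching of live storage domain' patches describe: take the host id only if nobody holds it yet, and release it only if it was taken for the attach. The ToyDomain class and its method names are stand-ins, not VDSM's actual cluster-lock API.

class ToyDomain:
    # Just enough state for the sketch: which host id, if any, is held.
    def __init__(self):
        self.host_id_owner = None

    def has_host_id(self, host_id):
        return self.host_id_owner == host_id

    def acquire_host_id(self, host_id):
        self.host_id_owner = host_id

    def release_host_id(self, host_id):
        self.host_id_owner = None


def attach_storage_domain(domain, host_id):
    already_acquired = domain.has_host_id(host_id)
    if not already_acquired:
        domain.acquire_host_id(host_id)
    try:
        pass  # acquire the SDM lease, update the pool metadata, create links ...
    finally:
        if not already_acquired:
            # Only drop what the attach itself acquired. On a live domain
            # (the HE VM runs from it, so the host id is already held) the
            # host id is left alone and sanlock never kills the lease holder.
            domain.release_host_id(host_id)

With this check, the clean-install flow (host id not yet held) behaves as before, while the backup/restore flow leaves the hosted-engine lockspace untouched.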

Comment 21 rhev-integ 2017-02-22 17:27:31 UTC
I think we should move the bug to vdsm.

(Originally by Nir Soffer)

Comment 22 rhev-integ 2017-02-22 17:27:38 UTC
We have two distinct issues here, on two distinct components, that together contributed to the problem:
- the first one is VDSM releasing the host ID during the import process, which led to a reboot of the engine VM
- the second one is that the auto-import process was not able to restart if it was interrupted in the middle, as happened as an effect of the first issue

The most evident one was the second point; eventually it could be worth opening a specific bug on VDSM and setting this one as dependent on it.

(Originally by Simone Tiraboschi)

Comment 23 rhev-integ 2017-02-22 17:27:44 UTC
Verified on rhevm-4.1.0.2-0.2.el7.noarch

(Originally by Artyom Lukianov)

Comment 26 Artyom 2017-02-28 11:26:06 UTC
Verified on:
rhevm-4.0.7.3-0.1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.2-1.el7ev.noarch

1. Deploy HE environment
2. Add the storage domain to the engine (to start the auto-import process)
3. Wait until the HE VM appears in the engine
4. Set global maintenance
5. Backup the engine: # engine-backup --mode=backup --file=engine.backup --log=engine-backup.log
6. Copy the backup file from the HE VM to the host
7. Clean the host from the HE deployment (reprovisioning)
8. Run the HE deployment again
9. Answer No to the question "Automatically execute engine-setup on the engine appliance on first boot (Yes, No)[Yes]? "
10. Log in to the HE VM and copy the backup file from the host to the HE VM
11. Run restore command: # engine-backup --mode=restore --scope=all --file=engine.backup --log=engine-restore.log  --he-remove-storage-vm --he-remove-hosts --restore-permissions --provision-dwh-db --provision-db
12. Run engine setup: # engine-setup --offline
13. Finish HE deployment process

The engine is up and has the HE SD and the HE VM in the active state.

Be aware of bugs under 4.1:
https://bugzilla.redhat.com/show_bug.cgi?id=1416459
https://bugzilla.redhat.com/show_bug.cgi?id=1416466

Comment 27 Artyom 2017-03-02 08:42:23 UTC
Verified on correct version
# rpm -qa | grep hosted
ovirt-hosted-engine-setup-2.0.4.3-2.el7ev.noarch
ovirt-hosted-engine-ha-2.0.7-2.el7ev.noarch

Comment 28 Artyom 2017-03-02 14:09:22 UTC
After a number of backup-restore operations it looks like we still have the problem with the auto-import operation:
2017-03-02 05:07:40,924 INFO  [org.ovirt.engine.core.vdsbroker.ResourceManager] (ServerService Thread Pool -- 55) [] VDS '847b4fe4-671c-4635-ac0a-6801064c3c95' was added to the Resource Manager
2017-03-02 05:07:40,951 INFO  [org.ovirt.engine.core.vdsbroker.ResourceManager] (ServerService Thread Pool -- 55) [] Finished initializing ResourceManager
2017-03-02 05:07:40,981 INFO  [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (ServerService Thread Pool -- 55) [] Command [id=aaeb6ee1-0cd0-4251-8b8e-90a8f3a85a70]: Compensating DELETED_OR_UPDATED_ENTITY of org.ovirt.engine.core.common.businessentities.StorageDomainDynamic; snapshot: id=00000000-0000-0000-0000-000000000000.
2017-03-02 05:07:40,992 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (ServerService Thread Pool -- 55) [] transaction rolled back
2017-03-02 05:07:40,992 ERROR [org.ovirt.engine.core.bll.Backend] (ServerService Thread Pool -- 55) [] Failed to run compensation on startup for Command 'org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand', Command Id 'aaeb6ee1-0cd0-4251-8b8e-90a8f3a85a70': CallableStatementCallback; SQL [{call insertstorage_domain_dynamic(?, ?, ?)}]; ERROR: insert or update on table "storage_domain_dynamic" violates foreign key constraint "fk_storage_domain_dynamic_storage_domain_static"
  Detail: Key (id)=(00000000-0000-0000-0000-000000000000) is not present in table "storage_domain_static".
  Where: SQL statement "INSERT INTO storage_domain_dynamic (
        available_disk_size,
        id,
        used_disk_size
        )
    VALUES (
        v_available_disk_size,
        v_id,
        v_used_disk_size
        )"
PL/pgSQL function insertstorage_domain_dynamic(integer,uuid,integer) line 3 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: insert or update on table "storage_domain_dynamic" violates foreign key constraint "fk_storage_domain_dynamic_storage_domain_static"
  Detail: Key (id)=(00000000-0000-0000-0000-000000000000) is not present in table "storage_domain_static".
  Where: SQL statement "INSERT INTO storage_domain_dynamic (

We are missing the patch in our downstream build: https://gerrit.ovirt.org/#/c/72902

Comment 29 Artyom 2017-03-02 15:56:05 UTC
Verified with new vdsm:
vdsm-4.18.24-3.el7ev.x86_64

Ran the backup/restore process 5 times; in all 5 runs the HE VM and HE SD were active.

Comment 31 errata-xmlrpc 2017-03-16 15:34:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0542.html

