Bug 1409112

Summary: After backup-restore operation HE storage domain stuck in the 'Locked' state
Product: [oVirt] ovirt-engine
Reporter: Artyom <alukiano>
Component: BLL.HostedEngine
Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED CURRENTRELEASE
QA Contact: Artyom <alukiano>
Severity: high
Docs Contact:
Priority: medium
Version: 4.1.0
CC: alukiano, bugs, nsoffer, stirabos, ylavi
Target Milestone: ovirt-4.1.0-beta
Keywords: Regression, Triaged, ZStream
Target Release: 4.1.0.2
Flags: rule-engine: ovirt-4.1+
       rule-engine: blocker+
Hardware: x86_64
OS: Linux
Whiteboard: PM-14
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1425900
Environment:
Last Closed: 2017-02-01 14:37:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1240466, 1425900
Attachments: logs

Description Artyom 2016-12-29 15:38:30 UTC
Created attachment 1235894 [details]
logs

Description of problem:
I ran a backup-restore operation on the HE environment. After I finished engine-setup on the HE VM, the engine started the auto-import process, but the HE storage domain got stuck in the 'Locked' state and I can see the following error in the engine log:
2016-12-29 09:38:10,175-05 INFO  [org.ovirt.engine.core.utils.osinfo.OsInfoPreferencesLoader] (ServerService Thread Pool -- 53) [] Loading file '/etc/ovirt-engine/osinfo.conf.d/00-defaults.properties'
2016-12-29 09:38:10,294-05 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (ServerService Thread Pool -- 53) [] CommandsFactory : Failed to get type information using reflection for Class  'org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand', Command Id 'c467e644-cd80-4c22-b8f5-c51b7836496b': org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand.<init>(org.ovirt.engine.core.compat.Guid)
2016-12-29 09:38:10,294-05 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (ServerService Thread Pool -- 53) [] Exception: java.lang.NoSuchMethodException: org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand.<init>(org.ovirt.engine.core.compat.Guid)
    at java.lang.Class.getConstructor0(Class.java:3082) [rt.jar:1.8.0_111]
    at java.lang.Class.getDeclaredConstructor(Class.java:2178) [rt.jar:1.8.0_111]
    at org.ovirt.engine.core.bll.CommandsFactory.createCommand(CommandsFactory.java:127) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.compensate(Backend.java:412) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.initialize(Backend.java:263) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.create(Backend.java:194) [bll.jar:]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.8.0_111]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) [rt.jar:1.8.0_111]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_111]
    at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_111]
    at org.jboss.as.ee.component.ManagedReferenceLifecycleMethodInterceptor.processInvocation(ManagedReferenceLifecycleMethodInterceptor.java:96)
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
    at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:437)
    at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.delegateInterception(Jsr299BindingsInterceptor.java:70) [wildfly-weld-7.0.0.GA-redhat-2.jar:7.0.0.GA-redhat-2]
    at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.doLifecycleInterception(Jsr299BindingsInterceptor.java:117) [wildfly-weld-7.0.0.GA-redhat-2.jar:7.0.0.GA-redhat-2]
    at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.processInvocation(Jsr299BindingsInterceptor.java:103) [wildfly-weld-7.0.0.GA-redhat-2.jar:7.0.0.GA-redhat-2]
    at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
    at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:437)
....

Version-Release number of selected component (if applicable):
rhevm-4.1.0-0.3.beta2.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy HE environment
2. Add a storage domain to the engine (to start the auto-import process)
3. Wait until the HE VM appears in the engine
4. Set global maintenance
5. Backup the engine: # engine-backup --mode=backup --file=engine.backup --log=engine-backup.log
6. Copy the backup file from the HE VM to the host
7. Clean the host of the HE deployment (reprovision it)
8. Run the HE deployment again
9. Answer No on the question "Automatically execute engine-setup on the engine appliance on first boot (Yes, No)[Yes]? "
10. Log in to the HE VM and copy the backup file from the host to the HE VM
11. Run restore command: # engine-backup --mode=restore --scope=all --file=engine.backup --log=engine-restore.log  --he-remove-storage-vm --he-remove-hosts --restore-permissions --provision-all-databases
12. Run engine setup: # engine-setup --offline
13. Finish HE deployment process
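
For reference, a condensed command sequence for the backup/copy/restore part of the flow (a sketch; the host name he-host.example.com is illustrative, the file names are the ones used in the steps above):

On the old HE VM, with global maintenance set:
  # engine-backup --mode=backup --file=engine.backup --log=engine-backup.log
  # scp engine.backup root@he-host.example.com:/root/

On the new HE VM, after answering No to the automatic engine-setup question:
  # scp root@he-host.example.com:/root/engine.backup .
  # engine-backup --mode=restore --scope=all --file=engine.backup --log=engine-restore.log --he-remove-storage-vm --he-remove-hosts --restore-permissions --provision-all-databases
  # engine-setup --offline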

Actual results:
The engine starts the auto-import process, but the HE SD gets stuck in the 'Locked' state

Expected results:
The engine starts the auto-import process and finishes it without any errors. The HE SD and the HE VM must appear in the engine and must have status 'Up'

Additional info:
I tried to destroy the HE SD; in that case the HE VM died on the host.

Comment 1 Doron Fediuck 2017-01-01 08:10:44 UTC
The part below suggests the engine was killed during the import and then restarted. During the restart we try to recover the import command, which is not supported, hence you get an exception.

	at org.ovirt.engine.core.bll.Backend.compensate(Backend.java:412) [bll.jar:]
	at org.ovirt.engine.core.bll.Backend.initialize(Backend.java:263) [bll.jar:]
	at org.ovirt.engine.core.bll.Backend.create(Backend.java:194) [bll.jar:]

The question is why was the engine killed?

Comment 2 Sandro Bonazzola 2017-01-02 07:59:46 UTC
Simone, can you please investigate?

Comment 3 Artyom 2017-01-03 09:26:52 UTC
I have no idea why it was killed; the only thing I can think of is some kind of race condition when the engine connects the master SD.

Comment 4 Doron Fediuck 2017-01-03 10:20:30 UTC
Compensation is irrelevant for now.
If we want to solve this we need to understand what killed the engine and why.

Can you reproduce this while checking who/what is killing the engine?

Comment 5 Artyom 2017-01-03 11:37:29 UTC
I can reproduce the situation on my environment and you can take a look at it; just let me know when you want to.

Comment 6 Simone Tiraboschi 2017-01-03 14:09:13 UTC
OK, fully reproduced.

The flow on a clean install is:

1. hosted-engine-setup creates the engine VM
2. engine-setup configures the engine inside the VM
3. hosted-engine-setup adds the first host to the engine
4. hosted-engine-setup polls until the host goes up in the engine and then configures the last bits
5. hosted-engine-setup gracefully shuts down the engine VM; here we have a two-minute delay
6. ovirt-ha-agent restarts the engine VM
7. sooner or later the user adds the first storage domain for regular VMs; only at this point does the datacenter reach the Up status
8. once the datacenter has reached the Up status, the auto-import process triggers; due to lock acquisition time it can take a few minutes

The flow on this backup/restore scenario instead is:

1. hosted-engine-setup creates the engine VM
2. the user connects to the engine VM and manually restores the backup with engine-backup: engine-backup can now filter out the previous engine VM and the hosted-engine storage domain on the fly; other hosts and other storage domains will remain in the DB
3. the user runs engine-setup to configure and start the engine inside the VM
3.1. if other hosts and other storage domains are in the engine DB, the datacenter goes up now: in that case the auto-import process also triggers now
4. hosted-engine-setup adds the host it is running on to the engine
4.1. if other storage domains are in the engine DB (almost certainly, otherwise there would be no need to restore a backup!) the datacenter goes up now: in that case the auto-import process also triggers now
5. hosted-engine-setup polls until the host goes up in the engine and then configures the last bits; the auto-import process is still running
6. hosted-engine-setup gracefully shuts down the engine VM; here we have a two-minute delay; nothing checks that the auto-import process terminated correctly
7. the auto-import process is not able to recover itself if it gets interrupted in the middle, hence this issue

So the point is that in this restore-based scenario the auto-import process can trigger before hosted-engine-setup shuts down the engine VM, and it then gets interrupted in the middle.

Workaround:
Run hosted-engine-setup with OVEHOSTED_VM/automateVMShutdown=bool:False and, when it asks to shut down the engine VM so that it can be restarted by ovirt-ha-agent, wait and ensure that the auto-import process has successfully completed.
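
For example (a sketch; it assumes the --config-append option of hosted-engine --deploy and the usual otopi answer-file format, and the file path is illustrative):

  # cat /root/no-auto-shutdown.conf
  [environment:default]
  OVEHOSTED_VM/automateVMShutdown=bool:False
  # hosted-engine --deploy --config-append=/root/no-auto-shutdown.conf

With that option the setup waits for you to shut down the engine VM yourself, which gives you the chance to first verify in the engine that the auto-import has completed.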

Possible solutions:
A. fix the auto-import procedure so that it can recover by itself after being shut down in the middle
B. have hosted-engine-setup also poll for running tasks before shutting down the engine VM; this might not be effective due to the two-minute delay

Comment 7 Artyom 2017-01-04 08:37:56 UTC
Hi Simone, I tried to run hosted-engine --deploy with OVEHOSTED_VM/automateVMShutdown=bool:False, but I still have the problem.
It looks like when the engine starts the auto-import process it somehow releases the HE sanlock lock; as a result the HE VM is killed and HE deploy finishes the deployment process.
In the sanlock log I can see:
2017-01-04 10:24:14+0200 72445 [11444]: s5:r7 resource c4c300b7-82ca-46f2-be7c-2e32940addf3:SDM:/rhev/data-center/mnt/10.35.110.11:_Compute__NFS_alukiano_he__2/c4c300b7-82ca-46f2-be7c-2e32940addf3/dom_md/leases:1048576 for 2,15,13041
2017-01-04 10:24:15+0200 72446 [10581]: s5 kill 11932 sig 9 count 1
2017-01-04 10:24:15+0200 72446 [10581]: dead 11932 ci 3 count 1

And in /var/log/messages:
Jan  4 10:24:10 cyan-vdsf systemd: Removed slice user-99.slice.
Jan  4 10:24:10 cyan-vdsf systemd: Stopping user-99.slice.
Jan  4 10:24:15 cyan-vdsf sanlock[10581]: 2017-01-04 10:24:15+0200 72446 [10581]: s5 kill 11932 sig 9 count 1
Jan  4 10:24:15 cyan-vdsf sanlock[10581]: 2017-01-04 10:24:15+0200 72446 [10581]: dead 11932 ci 3 count 1
Jan  4 10:24:15 cyan-vdsf kernel: ovirtmgmt: port 2(vnet0) entered disabled state
Jan  4 10:24:15 cyan-vdsf kernel: device vnet0 left promiscuous mode
Jan  4 10:24:15 cyan-vdsf kernel: ovirtmgmt: port 2(vnet0) entered disabled state
Jan  4 10:24:15 cyan-vdsf kvm: 0 guests now active
Jan  4 10:24:15 cyan-vdsf systemd-machined: Machine qemu-1-HostedEngine terminated.

Comment 8 Yaniv Lavi 2017-01-04 13:31:31 UTC
After discussion with QE moving to high.

Comment 9 Simone Tiraboschi 2017-01-04 16:42:33 UTC
Yes, with OVEHOSTED_VM/automateVMShutdown=bool:False the engine VM doesn't get shut down in the middle of the auto-import by hosted-engine-setup, but it still fails.

In that case it gets abruptly shut down because the engine image gets torn down; I think it could be related to: https://bugzilla.redhat.com/show_bug.cgi?id=1402789

In VDSM logs we see:
2017-01-04 17:18:42,721 INFO  (jsonrpc/7) [dispatcher] Run and protect: attachStorageDomain(sdUUID=u'ad0313d6-1c3f-456d-8f6d-d8a5ee94e922', spUUID=u'00000001-0001-0001-0001-000000000311', options=None) (logUtils:49)
2017-01-04 17:18:42,722 INFO  (jsonrpc/7) [storage.StoragePool] sdUUID=ad0313d6-1c3f-456d-8f6d-d8a5ee94e922 spUUID=00000001-0001-0001-0001-000000000311 (sp:914)
2017-01-04 17:18:42,722 INFO  (jsonrpc/7) [storage.SANLock] Acquiring host id for domain ad0313d6-1c3f-456d-8f6d-d8a5ee94e922 (id: 1) (clusterlock:282)
2017-01-04 17:18:42,723 INFO  (jsonrpc/7) [storage.SANLock] Acquiring Lease(name='SDM', path=u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe6/ad0313d6-1c3f-456d-8f6d-d8a5ee94e922/dom_md/leases', offset=1048576) for host id 1 (clusterlock:343)
2017-01-04 17:18:42,733 INFO  (jsonrpc/7) [storage.StoragePoolMemoryBackend] new storage pool master version 1 and domains map {u'e4eda512-d834-4727-b144-0994cb4424b3': u'Active', u'ad0313d6-1c3f-456d-8f6d-d8a5ee94e922': 'Attached'} (spbackends:450)
2017-01-04 17:18:42,733 INFO  (jsonrpc/7) [storage.StoragePool] Linking /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe6/ad0313d6-1c3f-456d-8f6d-d8a5ee94e922 to /rhev/data-center/00000001-0001-0001-0001-000000000311/ad0313d6-1c3f-456d-8f6d-d8a5ee94e922 (sp:1183)
2017-01-04 17:18:42,734 INFO  (jsonrpc/7) [storage.SANLock] Releasing Lease(name='SDM', path=u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe6/ad0313d6-1c3f-456d-8f6d-d8a5ee94e922/dom_md/leases', offset=1048576) (clusterlock:390)
2017-01-04 17:18:42,735 INFO  (jsonrpc/7) [storage.SANLock] Releasing host id for domain ad0313d6-1c3f-456d-8f6d-d8a5ee94e922 (id: 1) (clusterlock:305)
2017-01-04 17:18:42,785 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call VM.getStats succeeded in 0.00 seconds (__init__:515)
2017-01-04 17:18:43,709 INFO  (libvirt/events) [virt.vm] (vmId='bafca90e-62b4-425f-b982-f6e62e845255') underlying process disconnected (vm:672)
2017-01-04 17:18:43,709 INFO  (libvirt/events) [virt.vm] (vmId='bafca90e-62b4-425f-b982-f6e62e845255') Release VM resources (vm:4080)
2017-01-04 17:18:43,710 INFO  (libvirt/events) [virt.vm] (vmId='bafca90e-62b4-425f-b982-f6e62e845255') Stopping connection (guestagent:430)
2017-01-04 17:18:43,711 INFO  (libvirt/events) [dispatcher] Run and protect: teardownImage(sdUUID=u'ad0313d6-1c3f-456d-8f6d-d8a5ee94e922', spUUID=u'00000000-0000-0000-0000-000000000000', imgUUID=u'baeb6b20-fa65-472b-b30b-c8fb05bd2fa6', volUUID=None) (logUtils:49)
2017-01-04 17:18:44,735 INFO  (jsonrpc/7) [dispatcher] Run and protect: attachStorageDomain, Return response: None (logUtils:52)
2017-01-04 17:18:44,736 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call StorageDomain.attach succeeded in 2.02 seconds (__init__:515)
2017-01-04 17:18:44,738 INFO  (libvirt/events) [dispatcher] Run and protect: teardownImage, Return response: None (logUtils:52)
2017-01-04 17:18:44,738 INFO  (libvirt/events) [virt.vm] (vmId='bafca90e-62b4-425f-b982-f6e62e845255') Stopping connection (guestagent:430)
2017-01-04 17:18:44,738 WARN  (libvirt/events) [root] File: /var/lib/libvirt/qemu/channels/bafca90e-62b4-425f-b982-f6e62e845255.com.redhat.rhevm.vdsm already removed (utils:120)
2017-01-04 17:18:44,738 WARN  (libvirt/events) [root] File: /var/lib/libvirt/qemu/channels/bafca90e-62b4-425f-b982-f6e62e845255.org.qemu.guest_agent.0 already removed (utils:120)
2017-01-04 17:18:44,739 WARN  (libvirt/events) [root] File: /var/run/ovirt-vmconsole-console/bafca90e-62b4-425f-b982-f6e62e845255.sock already removed (utils:120)
2017-01-04 17:18:44,739 INFO  (libvirt/events) [dispatcher] Run and protect: inappropriateDevices(thiefId=u'bafca90e-62b4-425f-b982-f6e62e845255') (logUtils:49)
2017-01-04 17:18:44,741 INFO  (libvirt/events) [dispatcher] Run and protect: inappropriateDevices, Return response: None (logUtils:52)
2017-01-04 17:18:44,742 INFO  (libvirt/events) [virt.vm] (vmId='bafca90e-62b4-425f-b982-f6e62e845255') Changed state to Down: Lost connection with qemu process (code=2) (vm:1198)
2017-01-04 17:18:44,742 INFO  (libvirt/events) [virt.vm] (vmId='bafca90e-62b4-425f-b982-f6e62e845255') Stopping connection (guestagent:430)
2017-01-04 17:18:46,247 INFO  (periodic/3) [dispatcher] Run and protect: repoStats(options=None) (logUtils:49)

And then the engine VM goes down.
ad0313d6-1c3f-456d-8f6d-d8a5ee94e922 in my case was the hosted-engine storage domain uuid.

Comment 10 Simone Tiraboschi 2017-01-05 09:26:14 UTC
Nir, could you please take a look at comment 9?

Comment 11 Simone Tiraboschi 2017-01-05 10:45:01 UTC
In bootstrap mode, when ovirt-hosted-engine-setup launches the engine VM for the first time, on the sanlock side we have:

[root@c72he20170105h1 ~]# sanlock client status
daemon d451ed7b-78b5-454b-8e39-bca5a3a3e935.c72he20170
p -1 helper
p -1 listener
p -1 inq_lockspace
p 19105 HostedEngine
p -1 status
s 53cfc928-9bf1-4c39-a0d7-ea6399e3a27a:1:/rhev/data-center/mnt/192.168.1.115\:_Virtual_exthe6/53cfc928-9bf1-4c39-a0d7-ea6399e3a27a/dom_md/ids:0
s e4eda512-d834-4727-b144-0994cb4424b3:1:/rhev/data-center/mnt/192.168.1.115\:_Virtual_data3/e4eda512-d834-4727-b144-0994cb4424b3/dom_md/ids:0 ADD
r 53cfc928-9bf1-4c39-a0d7-ea6399e3a27a:2451100e-7192-497a-a806-fec39266aa94:/rhev/data-center/mnt/192.168.1.115\:_Virtual_exthe6/53cfc928-9bf1-4c39-a0d7-ea6399e3a27a/images/006e56d7-2215-49e2-924c-b14c10a57f16/2451100e-7192-497a-a806-fec39266aa94.lease:0:1 p 19105

In this case the auto-import procedure will kill the engine VM.


When ovirt-ha-agent becomes active and restarts the engine VM, on the sanlock side we have:
[root@c72he20170105h1 ~]# sanlock client status
daemon d451ed7b-78b5-454b-8e39-bca5a3a3e935.c72he20170
p -1 helper
p -1 listener
p 21033 
p 22060 HostedEngine
p -1 status
s e4eda512-d834-4727-b144-0994cb4424b3:1:/rhev/data-center/mnt/192.168.1.115\:_Virtual_data3/e4eda512-d834-4727-b144-0994cb4424b3/dom_md/ids:0
s hosted-engine:1:/var/run/vdsm/storage/53cfc928-9bf1-4c39-a0d7-ea6399e3a27a/c39dd257-994e-4991-b21e-76f1bf9da4ff/278956d7-cc9a-43ab-9b90-1d3616fb7f3d:0
s 53cfc928-9bf1-4c39-a0d7-ea6399e3a27a:1:/rhev/data-center/mnt/192.168.1.115\:_Virtual_exthe6/53cfc928-9bf1-4c39-a0d7-ea6399e3a27a/dom_md/ids:0
r 53cfc928-9bf1-4c39-a0d7-ea6399e3a27a:2451100e-7192-497a-a806-fec39266aa94:/rhev/data-center/mnt/192.168.1.115\:_Virtual_exthe6/53cfc928-9bf1-4c39-a0d7-ea6399e3a27a/images/006e56d7-2215-49e2-924c-b14c10a57f16/2451100e-7192-497a-a806-fec39266aa94.lease:0:2 p 22060
r e4eda512-d834-4727-b144-0994cb4424b3:SDM:/rhev/data-center/mnt/192.168.1.115\:_Virtual_data3/e4eda512-d834-4727-b144-0994cb4424b3/dom_md/leases:1048576:5 p 21033

In this case the auto-import procedure will work.

Comment 12 Simone Tiraboschi 2017-01-05 10:52:41 UTC
A stronger workaround is to forcefully disable the auto-import procedure until we reach a stable condition.

The flow:
- Launch hosted-engine --deploy
- Say no to automatically execute engine-setup
- Connect to the engine VM
- Import the backup with 'engine-backup --mode=restore --file=engine_backup.tar.gz --log=engine_restore.log --restore-permissions --provision-all-databases --he-remove-storage-vm --he-remove-hosts'
- Disable the auto-import procedure with 'engine-config -s AutoImportHostedEngine=False'
- Run engine-setup
- Go back to hosted-engine-setup on the host and say that the engine is ready; hosted-engine-setup will add the host to the engine; once the host goes up, hosted-engine-setup will shut down the engine VM within two minutes
- ovirt-ha-agent will take care to restart the engine VM
- once the engine VM is up again, re-enable the auto import procedure with 'engine-config -s AutoImportHostedEngine=False; systemctl restart ovirt-engine'
- the engine should now import the hosted-engine storage domain and the engine VM

Comment 13 Simone Tiraboschi 2017-01-05 11:02:43 UTC
Sorry, the last command should obviously be:
engine-config -s AutoImportHostedEngine=True; systemctl restart ovirt-engine
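
Putting comments 12 and 13 together, the command sequence on the engine VM is roughly (a sketch; the file names are the ones used above):

  # engine-backup --mode=restore --file=engine_backup.tar.gz --log=engine_restore.log --restore-permissions --provision-all-databases --he-remove-storage-vm --he-remove-hosts
  # engine-config -s AutoImportHostedEngine=False
  # engine-setup

and then, once ovirt-ha-agent has restarted the engine VM and it is up again:

  # engine-config -s AutoImportHostedEngine=True
  # systemctl restart ovirt-engine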

Comment 14 Nir Soffer 2017-01-05 16:02:53 UTC
Vdsm releases the host id during attachStorageDomain only if the storage domain is
not monitored by vdsm. So it seems that the issue is caused by an incorrect step in
the backup and restore flow.

Before starting the engine, hosted engine must call startMonitoringDomain. After that,
vdsm will not release the host id on this domain during auto-import.

However, releasing the host id during attachStorageDomain is unneeded and unhelpful,
so I removed it in the normal flow (see https://gerrit.ovirt.org/69729). Vdsm will
release the host id only in failure modes, when nobody else will release it.

Simone, can you test this patch with hosted engine flows?

Comment 15 Simone Tiraboschi 2017-01-05 16:18:57 UTC
(In reply to Nir Soffer from comment #14)
> Before starting engine, hosted engine must call startMonitoringDomain. After
> that, 
> vdsm will not release the host id on this domain during auto-import.

AFAIK we are calling startMonitoringDomain also from hosted-engine-setup; indeed, if you check Artyom's logs you see:

2016-12-29 15:45:16,626 INFO  (jsonrpc/2) [dispatcher] Run and protect: startMonitoringDomain(sdUUID='5dd9de58-807f-4a38-874a-b5ff9a9bba87', hostID=1, options=None) (logUtils:49)
2016-12-29 15:45:16,627 INFO  (jsonrpc/2) [storage.Monitor] Start monitoring 5dd9de58-807f-4a38-874a-b5ff9a9bba87 (monitor:182)
2016-12-29 15:45:16,627 INFO  (jsonrpc/2) [dispatcher] Run and protect: startMonitoringDomain, Return response: None (logUtils:52)
2016-12-29 15:45:16,627 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.startMonitoringDomain succeeded in 0.00 seconds (__init__:515)
2016-12-29 15:45:16,628 INFO  (monitor/5dd9de5) [storage.check] Start checking u'/rhev/data-center/mnt/10.35.110.11:_Compute__NFS_alukiano_he/5dd9de58-807f-4a38-874a-b5ff9a9bba87/dom_md/metadata' (check:121)
2016-12-29 15:45:16,639 INFO  (monitor/5dd9de5) [storage.Monitor] Domain 5dd9de58-807f-4a38-874a-b5ff9a9bba87 became VALID (monitor:456)
2016-12-29 15:45:16,639 INFO  (monitor/5dd9de5) [storage.SANLock] Acquiring host id for domain 5dd9de58-807f-4a38-874a-b5ff9a9bba87 (id: 1) (clusterlock:282)
2016-12-29 15:45:16,640 INFO  (event/9) [vds] vmContainerLock acquired (clientIF:186)

before:

2016-12-29 15:45:52,805 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call VM.create succeeded in 0.01 seconds (__init__:515)
2016-12-29 15:45:52,807 INFO  (vm/dfa26872) [virt.vm] (vmId='dfa26872-7212-4def-bbc2-1ff28759f6ba') VM wrapper has started (vm:1901)

Maybe the point is that host-deploy is restarting vdsm when the engine adds the host; is the monitoring supposed to persist over a VDSM restart?

> Simone, can you test this patch with hosted engine flows?

yes, sure, thanks

Comment 16 Nir Soffer 2017-01-05 16:29:17 UTC
(In reply to Simone Tiraboschi from comment #15)
> (In reply to Nir Soffer from comment #14)
> Maybe the point is that host-deploy is restarting vdsm when the engine adds
> the host; is the monitoring supposed to persist over a VDSM restart?

No, vdsm is stateless, the state is managed by engine or by hosted engine agent.
If you did not start to monitor the domain, vdsm does not know about it.

Comment 17 Simone Tiraboschi 2017-01-05 16:58:18 UTC
(In reply to Nir Soffer from comment #16)
> No, vdsm is stateless, the state is managed by engine or by hosted engine
> agent.
> If you did not start to monitor the domain, vdsm does not know about it.

So we have no way to fix this on the hosted-engine-setup side: we already send startMonitoringDomain, but host-deploy restarts vdsm without notifying hosted-engine-setup and, as soon as host-deploy finishes, the engine imports the hosted-engine storage domain since the other information it needs is already in the DB the user restored.

Comment 18 Nir Soffer 2017-01-06 19:33:29 UTC
Testing https://gerrit.ovirt.org/69729 shows that the patch is wrong.

During attachStorageDomain the domain remains inactive, so vdsm cannot acquire
a host id and leave it acquired, because nobody will release it.

We have 2 ways to fix this:
- before attaching the hosted engine storage domain, the engine can ask vdsm to start
  monitoring it. In this case vdsm will not release the host id.
- vdsm can check if the host id is already acquired on the domain, and leave it
  acquired in this case.
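
As a quick manual check of the second option's precondition, one can verify on the host whether the host id (lockspace) is currently held on the hosted-engine storage domain with sanlock, as already done in comment 11 (a sketch; the domain UUID is the one from comment 9):

  # sanlock client status | grep ad0313d6-1c3f-456d-8f6d-d8a5ee94e922

If an 's <sdUUID>:1:...' lockspace line is shown, the host id is already acquired on that domain; if not, attachStorageDomain is the one acquiring it and, with the current code, also releasing it.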

Comment 19 Nir Soffer 2017-01-06 20:06:26 UTC
Patch http://gerrit.ovirt.org/69769 implements the second solution mentioned
in comment 18.

Simone, can you test this patch with hosted engine flows?

Comment 20 Nir Soffer 2017-01-15 12:02:51 UTC
I think we should move the bug to vdsm.

Comment 21 Simone Tiraboschi 2017-01-17 10:17:33 UTC
We have two distinct issues here, on two distinct components, that together contributed to the problem:
- the first is the release of the host ID during the import process on the VDSM side, which led to a reboot of the engine VM
- the second is that the auto-import process was not able to restart if interrupted in the middle, as happened as an effect of the first issue

The most evident one was the second point; eventually it could be worth opening a specific bug on VDSM and setting this one as dependent on it.

Comment 22 Artyom 2017-01-25 14:59:25 UTC
Verified on rhevm-4.1.0.2-0.2.el7.noarch