Bug 1557793

Summary: ovirt-hosted-engine-cleanup takes too much time
Product: [oVirt] ovirt-hosted-engine-setup
Component: General
Version: 2.2.13
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: low
Priority: high
Reporter: Nikolai Sednev <nsednev>
Assignee: Simone Tiraboschi <stirabos>
QA Contact: Nikolai Sednev <nsednev>
CC: bugs, dfediuck, lveyde, nsednev, ylavi
Keywords: Triaged
Flags: ylavi: ovirt-4.2+, ylavi: blocker+
Target Milestone: ovirt-4.2.4
Target Release: 2.2.22
Fixed In Version: ovirt-hosted-engine-setup-2.2.22
Doc Type: Bug Fix
Doc Text: Reducing timeout on storage operations on ovirt-hosted-engine-cleanup
Type: Bug
oVirt Team: Integration
Bug Blocks: 1581783
Last Closed: 2018-06-26 08:35:23 UTC
Attachments: sosreport from alma03

Description Nikolai Sednev 2018-03-18 17:32:31 UTC
Created attachment 1409540
sosreport from alma03

Description of problem:
ovirt-hosted-engine-cleanup takes too long (over 20 minutes) and prints errors during execution.

alma03 ~]# ovirt-hosted-engine-cleanup
 This will de-configure the host to run ovirt-hosted-engine-setup from scratch. 
Caution, this operation should be used with care.

Are you sure you want to proceed? [y/n]
y
  -=== Destroy hosted-engine VM ===- 
  -=== Stop HA services ===- 
  -=== Shutdown sanlock ===- 
shutdown force 1 wait 0
shutdown done 0
  -=== Disconnecting the hosted-engine storage domain ===- 

********************************************************************************
Stuck at this point for more than 20 minutes...

MainThread::INFO::2018-03-18 18:27:43,227::states::413::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm was unexpectedly shut down
MainThread::INFO::2018-03-18 18:27:45,336::hosted_engine::614::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor) Stopped VDSM domain monitor
MainThread::INFO::2018-03-18 18:27:45,336::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down

alma03 ~]# date
Sun Mar 18 18:43:03 IST 2018
********************************************************************************

Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/disconnect_storage_server.py", line 27, in <module>
    ha_cli.disconnect_storage_server()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 294, in disconnect_storage_server
    sserver.disconnect_storage_server()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 325, in disconnect_storage_server
    connectionParams=conList,
  File "/usr/lib/python2.7/site-packages/vdsm/client.py", line 278, in _call
    raise TimeoutError(method, kwargs, timeout)
vdsm.client.TimeoutError: Request StoragePool.disconnectStorageServer with args {'connectionParams': [{'port': '3260', 'connection': '10.35.146.129', 'iqn': 'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c00', 'user': '', 'tpgt': '1', 'password': '', 'id': '9e177df8-91db-4b8b-81af-28d56d856dba'}], 'storagepoolID': '00000000-0000-0000-0000-000000000000', 'domainType': 3} timed out after 900 seconds
  -=== De-configure VDSM networks ===- 
  -=== Stop other services ===- 
  -=== De-configure external daemons ===- 
  -=== Removing configuration files ===- 
? /etc/init/libvirtd.conf already missing
- removing /etc/libvirt/nwfilter/vdsm-no-mac-spoofing.xml
- removing /etc/ovirt-hosted-engine/answers.conf
- removing /etc/ovirt-hosted-engine/hosted-engine.conf
- removing /etc/vdsm/vdsm.conf
- removing /etc/pki/vdsm/certs/cacert.pem
- removing /etc/pki/vdsm/certs/vdsmcert.pem
- removing /etc/pki/vdsm/keys/vdsmkey.pem
- removing /etc/pki/vdsm/libvirt-spice/ca-cert.pem
- removing /etc/pki/vdsm/libvirt-spice/server-cert.pem
- removing /etc/pki/vdsm/libvirt-spice/server-key.pem
- removing /etc/pki/CA/cacert.pem
- removing /etc/pki/libvirt/clientcert.pem
- removing /etc/pki/libvirt/private/clientkey.pem
? /etc/pki/ovirt-vmconsole/*.pem already missing
- removing /var/cache/libvirt/qemu
? /var/run/ovirt-hosted-engine-ha/* already missing
You have new mail in /var/spool/mail/root
[root@alma03 ~]# 
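
The traceback above shows where the 20 minutes go: vdsm's JSON-RPC client enforces a client-side deadline on StoragePool.disconnectStorageServer, and here the call blocked until that deadline (900 seconds) expired. A minimal sketch of that request/deadline pattern, with hypothetical names (call_with_deadline, send_request, poll_response) standing in for the real vdsm internals:

    import time

    class RPCTimeoutError(Exception):
        """Raised when the response does not arrive within the deadline."""

    def call_with_deadline(send_request, poll_response, timeout):
        # Send the request, then poll for a response until `timeout`
        # seconds have elapsed; give up with an exception afterwards.
        send_request()
        deadline = time.time() + timeout
        while time.time() < deadline:
            response = poll_response()
            if response is not None:
                return response
            time.sleep(1)  # avoid busy-waiting between polls
        raise RPCTimeoutError("request timed out after %s seconds" % timeout)

The point is only the shape of the loop: after the deadline the client gives up and raises; it does not make the server-side operation finish any sooner.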


Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.7-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.13-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch
Linux 3.10.0-861.el7.x86_64 #1 SMP Wed Mar 14 10:21:01 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)

How reproducible:
100%

Steps to Reproduce:
1. Deploy SHE Node 0 over iSCSI.
2. Execute "ovirt-hosted-engine-cleanup" on the HA host (a timing sketch follows below).
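
To put a number on "too much time" when reproducing, the cleanup run can be timed from a small wrapper; a sketch using only the Python standard library, assuming the tool reads its y/n confirmation from stdin:

    import subprocess
    import time

    start = time.time()
    # Answer "y" to the confirmation prompt so the run is unattended.
    proc = subprocess.Popen(["ovirt-hosted-engine-cleanup"],
                            stdin=subprocess.PIPE)
    proc.communicate(b"y\n")
    print("cleanup finished in %.0f seconds" % (time.time() - start))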

Actual results:
Cleanup takes too long (over 20 minutes) and prints errors.

Expected results:
Cleanup should finish in far less time and without raising exceptions.

Additional info:
Sosreport from host is attached.

Comment 1 Yaniv Kaul 2018-03-19 07:33:49 UTC
It should time out. It may or may not succeed at unmounting, or whatever it's trying to do to the storage. You can't expect it to always fully succeed at cleaning up a messy configuration.

Do you know why it failed to disconnect the iSCSI connection?

Comment 2 Nikolai Sednev 2018-03-19 08:13:33 UTC
(In reply to Yaniv Kaul from comment #1)
> It should time out. It may or may not succeed at unmounting, or whatever
> it's trying to do to the storage. You can't expect it to always fully
> succeed at cleaning up a messy configuration.
> 
> Do you know why it failed to disconnect the iSCSI connection?

I have no idea why it behaved the way it did.
It took much longer with iSCSI than with NFS.
In both scenarios (NFS & iSCSI), after waiting for cleanup to finish, I could redeploy normally.
I did not state that cleanup had failed; I said that in the iSCSI flow errors were printed and it took far longer to finish.

Comment 3 Sandro Bonazzola 2018-03-19 08:23:59 UTC
Let's understand why iSCSI takes so much time.

Comment 4 Doron Fediuck 2018-03-19 09:02:15 UTC
"
timed out after 900 seconds
"

These are 15 minutes out of 20. The questions is what was done to the iscsi to cause it to stop responding?
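
One quick check for "stopped responding" is whether the iSCSI portal still accepts TCP connections at all; a hedged diagnostic sketch (the portal address and port are taken from the traceback in the description):

    import socket

    def portal_reachable(host, port=3260, timeout=5.0):
        # Try to open a TCP connection to the iSCSI portal; a refused or
        # timed-out connection suggests the target is not responding.
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
            sock.close()
            return True
        except (socket.timeout, socket.error):
            return False

    print(portal_reachable("10.35.146.129"))  # portal from the traceback

This only proves TCP reachability, not session health; a portal that accepts connections but stalls mid-operation would still pass this check.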

Comment 6 Nikolai Sednev 2018-06-04 14:31:42 UTC
Works for me on these components:
ovirt-hosted-engine-ha-2.2.13-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.22-1.el7ev.noarch
rhvm-appliance-4.2-20180601.0.el7.noarch
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Linux 3.10.0-862.3.2.el7.x86_64 #1 SMP Tue May 15 18:22:15 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

This time there was no long delay during "ovirt-hosted-engine-cleanup" (the storage disconnect now fails fast after 60 seconds instead of 900) and the deployment was cleaned up:
alma03 ~]# ovirt-hosted-engine-cleanup
 This will de-configure the host to run ovirt-hosted-engine-setup from scratch. 
Caution, this operation should be used with care.

Are you sure you want to proceed? [y/n]
y
  -=== Destroy hosted-engine VM ===- 
error: failed to get domain 'HostedEngine'
error: Domain not found: no domain with matching name 'HostedEngine'

  -=== Stop HA services ===- 
  -=== Shutdown sanlock ===- 
shutdown force 1 wait 0
shutdown done 0
  -=== Disconnecting the hosted-engine storage domain ===- 
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/disconnect_storage_server.py", line 30, in <module>
    timeout=ohostedcons.Const.STORAGE_SERVER_TIMEOUT,
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 313, in disconnect_storage_server
    sserver.disconnect_storage_server(timeout=timeout)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 325, in disconnect_storage_server
    connectionParams=conList,
  File "/usr/lib/python2.7/site-packages/vdsm/client.py", line 278, in _call
    raise TimeoutError(method, kwargs, timeout)
vdsm.client.TimeoutError: Request StoragePool.disconnectStorageServer with args {'connectionParams': [{'port': '3260', 'connection': '10.35.146.225', 'iqn': 'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c05', 'user': '', 'tpgt': '1', 'password': '', 'id': 'ae2bde39-253a-486c-9479-9046a07a0c65'}], 'storagepoolID': '00000000-0000-0000-0000-000000000000', 'domainType': 3} timed out after 60 seconds
  -=== De-configure VDSM networks ===- 
  -=== Stop other services ===- 
  -=== De-configure external daemons ===- 
  -=== Removing configuration files ===- 
? /etc/init/libvirtd.conf already missing
- removing /etc/libvirt/nwfilter/vdsm-no-mac-spoofing.xml
- removing /etc/ovirt-hosted-engine/answers.conf
- removing /etc/ovirt-hosted-engine/hosted-engine.conf
- removing /etc/vdsm/vdsm.conf
- removing /etc/pki/vdsm/certs/cacert.pem
- removing /etc/pki/vdsm/certs/vdsmcert.pem
- removing /etc/pki/vdsm/keys/vdsmkey.pem
- removing /etc/pki/vdsm/libvirt-spice/ca-cert.pem
- removing /etc/pki/vdsm/libvirt-spice/server-cert.pem
- removing /etc/pki/vdsm/libvirt-spice/server-key.pem
- removing /etc/pki/CA/cacert.pem
- removing /etc/pki/libvirt/clientcert.pem
- removing /etc/pki/libvirt/private/clientkey.pem
? /etc/pki/ovirt-vmconsole/*.pem already missing
- removing /var/cache/libvirt/qemu
? /var/run/ovirt-hosted-engine-ha/* already missing
You have new mail in /var/spool/mail/root
[root@alma03 ~]# hosted-engine --vm-status
You must run deploy first


Moving to verified.
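
The traceback in comment 6 shows the shipped change in action: disconnect_storage_server now accepts an explicit timeout fed from ohostedcons.Const.STORAGE_SERVER_TIMEOUT, so cleanup fails fast after 60 seconds instead of inheriting the 900-second default. A minimal sketch of that pattern, assuming a blocking low-level call; blocking_disconnect and the constant names below are illustrative, not the real vdsm API:

    DEFAULT_TIMEOUT = 900   # the long default deadline, in seconds
    CLEANUP_TIMEOUT = 60    # fail-fast deadline used by cleanup

    def disconnect_storage_server(blocking_disconnect, timeout=DEFAULT_TIMEOUT):
        # Pass the caller-supplied deadline down to the blocking call
        # instead of always inheriting the long default.
        return blocking_disconnect(timeout=timeout)

    # The cleanup path would then call:
    #     disconnect_storage_server(some_vdsm_call, timeout=CLEANUP_TIMEOUT)

The design trade-off: cleanup still cannot force a wedged storage server to disconnect, but it now stops waiting on it quickly and proceeds with the rest of the teardown.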

Comment 7 Sandro Bonazzola 2018-06-26 08:35:23 UTC
This bug fix is included in the oVirt 4.2.4 release, published on June 26th 2018.

Since the problem described in this bug report should be resolved in the
oVirt 4.2.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.