Bug 1584669 - engine hangs on host-deploy communication when multiple parameters are passed to [vars] section of vdsm.conf
Summary: engine hangs on host-deploy communication when multiple parameters are passed...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Host-Deploy
Version: ---
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: ovirt-4.2.4
: ---
Assignee: Ravi Nori
QA Contact: Pavel Stehlik
URL:
Whiteboard:
Depends On:
Blocks: 1577593 1582527
TreeView+ depends on / blocked
 
Reported: 2018-05-31 12:29 UTC by Martin Perina
Modified: 2018-06-05 16:15 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-05 16:15:53 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.2+
rule-engine: blocker+


Attachments (Terms of Use)
hanged process thread dump (21.21 KB, text/plain)
2018-05-31 12:30 UTC, Martin Perina
no flags Details
host-deploy log (205.75 KB, text/plain)
2018-05-31 12:31 UTC, Martin Perina
no flags Details
host-deploy log (235.81 KB, text/plain)
2018-05-31 12:37 UTC, Martin Perina
no flags Details
engine.log in debug mode (2.61 MB, text/plain)
2018-06-01 14:35 UTC, Simone Tiraboschi
no flags Details
JMX dump (1.20 MB, text/plain)
2018-06-01 16:12 UTC, Simone Tiraboschi
no flags Details

Description Martin Perina 2018-05-31 12:29:28 UTC
Description of problem:

If we add additional options to [vars] section of /etc/vdsm/vdsm.conf (example can be found at [1]), then the engine (which was upgraded from 4.1 to 4.2/master), cannot add new host, because host-deploy process hangs when it receives last option to be added to [vars] section.


[1] https://gerrit.ovirt.org/#/c/91809/1/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/hostdeploy/VdsDeployVdsmUnit.java

Version-Release number of selected component (if applicable):


How reproducible:

100%

Steps to Reproduce:
1. Run upgrade-from-release-suite-master with patch [1]
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Martin Perina 2018-05-31 12:30:26 UTC
Created attachment 1446261 [details]
hanged process thread dump

Comment 2 Martin Perina 2018-05-31 12:31:44 UTC
Created attachment 1446262 [details]
host-deploy log

Comment 3 Martin Perina 2018-05-31 12:37:42 UTC
Created attachment 1446263 [details]
host-deploy log

Comment 4 Simone Tiraboschi 2018-06-01 09:09:52 UTC
According to host-deploy logs:
 2018-05-31 07:44:57,672-0400 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       **%QEnd: CUSTOMIZATION_COMMAND
 2018-05-31 08:04:54,449-0400 DEBUG otopi.context context._executeMethod:143 method exception
 Traceback (most recent call last):
   File "/tmp/ovirt-goPsgFOlPK/pythonlib/otopi/context.py", line 133, in _executeMethod
     method['method']()
   File "/tmp/ovirt-goPsgFOlPK/otopi-plugins/otopi/dialog/cli.py", line 137, in _customize
     "\nCustomization phase, use 'install' to proceed\nCOMMAND> "
   File "/tmp/ovirt-goPsgFOlPK/otopi-plugins/otopi/dialog/cli.py", line 92, in _runCommandPrompt
     prompt=True,
   File "/tmp/ovirt-goPsgFOlPK/otopi-plugins/otopi/dialog/machine.py", line 281, in queryString
     value = self._readline(hidden)
   File "/tmp/ovirt-goPsgFOlPK/pythonlib/otopi/dialog.py", line 248, in _readline
     raise IOError(_('End of file'))
 IOError: End of file
 2018-05-31 08:04:54,455-0400 ERROR otopi.context context._executeMethod:152 Failed to execute stage 'Environment customization': End of file
 2018-05-31 08:04:54,555-0400 DEBUG otopi.plugins.otopi.debug.debug_failure.debug_failure debug_failure._notification:100 tcp connections:
 id uid local foreign state pid exe
 0: 0 0.0.0.0:22 0.0.0.0:0 LISTEN 1079 /usr/sbin/sshd
 1: 0 192.168.201.3:53984 212.69.166.138:80 CLOSE_WAIT 1418 /usr/bin/python2.7
 2: 0 192.168.201.3:22 192.168.201.1:36154 ESTABLISHED 1460 /usr/sbin/sshd
 3: 0 192.168.201.3:57190 147.32.127.196:80 CLOSE_WAIT 1418 /usr/bin/python2.7

The host sent a "*%QEnd: CUSTOMIZATION_COMMAND" at 2018-05-31 07:44:57,672-0400 and the engine never moved to the next step and so at 2018-05-31 08:04:54,449-0400 the ssh connection got closed.
host-deploy was still running.

Comment 5 Simone Tiraboschi 2018-06-01 12:08:10 UTC
It's correctly working on bare metal:

2018-06-01 13:53:08,029+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ***Q:STRING CUSTOMIZATION_COMMAND
2018-06-01 13:53:08,029+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       **%QEnd: CUSTOMIZATION_COMMAND
2018-06-01 13:53:08,056+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:RECEIVE    env-query -k VDSM_CONFIG/vars/ssl
2018-06-01 13:53:08,056+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       **%QStart: VDSM_CONFIG/vars/ssl
2018-06-01 13:53:08,056+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ###
2018-06-01 13:53:08,057+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ### Please specify value for 'VDSM_CONFIG/vars/ssl':
2018-06-01 13:53:08,057+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ### Response is VALUE VDSM_CONFIG/vars/ssl=type:value or ABORT VDSM_CONFIG/vars/ssl
2018-06-01 13:53:08,057+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ***Q:VALUE VDSM_CONFIG/vars/ssl
2018-06-01 13:53:08,057+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       **%QEnd: VDSM_CONFIG/vars/ssl
2018-06-01 13:53:08,057+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:RECEIVE    VALUE VDSM_CONFIG/vars/ssl=str:true
2018-06-01 13:53:08,058+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       **%QStart: CUSTOMIZATION_COMMAND
2018-06-01 13:53:08,058+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ###
2018-06-01 13:53:08,058+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ### Customization phase, use 'install' to proceed
2018-06-01 13:53:08,058+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ### COMMAND>
2018-06-01 13:53:08,058+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       **%QHidden: FALSE
2018-06-01 13:53:08,058+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ***Q:STRING CUSTOMIZATION_COMMAND
2018-06-01 13:53:08,058+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       **%QEnd: CUSTOMIZATION_COMMAND
2018-06-01 13:53:08,059+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:RECEIVE    env-query -k VDSM_CONFIG/vars/ssl_excludes
2018-06-01 13:53:08,059+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       **%QStart: VDSM_CONFIG/vars/ssl_excludes
2018-06-01 13:53:08,059+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ###
2018-06-01 13:53:08,059+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ### Please specify value for 'VDSM_CONFIG/vars/ssl_excludes':
2018-06-01 13:53:08,059+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ### Response is VALUE VDSM_CONFIG/vars/ssl_excludes=type:value or ABORT VDSM_CONFIG/vars/ssl_excludes
2018-06-01 13:53:08,060+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ***Q:VALUE VDSM_CONFIG/vars/ssl_excludes
2018-06-01 13:53:08,060+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       **%QEnd: VDSM_CONFIG/vars/ssl_excludes
2018-06-01 13:53:08,060+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:RECEIVE    VALUE VDSM_CONFIG/vars/ssl_excludes=str:OP_NO_TLSv1,OP_NO_TLSv1_1
2018-06-01 13:53:08,060+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       **%QStart: CUSTOMIZATION_COMMAND
2018-06-01 13:53:08,060+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ###
2018-06-01 13:53:08,061+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ### Customization phase, use 'install' to proceed
2018-06-01 13:53:08,061+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ### COMMAND>
2018-06-01 13:53:08,061+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       **%QHidden: FALSE
2018-06-01 13:53:08,061+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       ***Q:STRING CUSTOMIZATION_COMMAND
2018-06-01 13:53:08,061+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       **%QEnd: CUSTOMIZATION_COMMAND

Comment 6 Simone Tiraboschi 2018-06-01 14:35:16 UTC
Created attachment 1446705 [details]
engine.log in debug mode

Comment 7 Simone Tiraboschi 2018-06-01 14:37:56 UTC
Attaching engine.log in debug mode.

The last interaction between the host and the engine is 
2018-06-01 09:59:31,540-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [7696e2a8] Got: **%QStart: CUSTOMIZATION_COMMAND
2018-06-01 09:59:31,540-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [7696e2a8] Got: ###
2018-06-01 09:59:31,540-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [7696e2a8] Got: ### Customization phase, use 'install' to proceed
2018-06-01 09:59:31,540-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [7696e2a8] Got: ### COMMAND> 
2018-06-01 09:59:31,540-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [7696e2a8] Got: **%QHidden: FALSE
2018-06-01 09:59:31,540-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [7696e2a8] Got: ***Q:STRING CUSTOMIZATION_COMMAND
2018-06-01 09:59:31,540-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [7696e2a8] nextEvent: QueryString CUSTOMIZATION_COMMAND null
2018-06-01 09:59:31,540-04 DEBUG [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (VdsDeploy) [7696e2a8] Installation of lago-upgrade-from-release-suite-master-host-0: Event QueryString CUSTOMIZATION_COMMAND null
2018-06-01 09:59:31,541-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [7696e2a8] Got: **%QEnd: CUSTOMIZATION_COMMAND

then nothing more happens.

And the SSL session dies by timeout 20 minutes later.

2018-06-01 10:19:28,751-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHDialog] (EE-ManagedThreadFactory-engine-Thread-1) [7696e2a8] Exception: javax.naming.TimeLimitExceededException: SSH session timeout host 'root@lago-upgrade-from-release-suite-master-host-0'
	at org.ovirt.engine.core.uutils.ssh.SSHClient.executeCommand(SSHClient.java:477) [uutils.jar:]
	at org.ovirt.engine.core.uutils.ssh.SSHDialog.executeCommand(SSHDialog.java:316) [uutils.jar:]
	at org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase.execute(VdsDeployBase.java:557) [bll.jar:]

Comment 8 Simone Tiraboschi 2018-06-01 15:49:26 UTC
On a successfully execution the engine correctly moves to the next value:

2018-06-01 17:40:17,200+02 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [f9f6b461-33b5-4841-b42f-79e5542fed73] env-query VDSM_CONFIG/vars/ssl_excludes=java.lang.String:OP_NO_TLSv1,OP_NO_TLSv1_1
2018-06-01 17:40:17,201+02 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [f9f6b461-33b5-4841-b42f-79e5542fed73] Got: **%QEnd: CUSTOMIZATION_COMMAND
2018-06-01 17:40:17,202+02 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [f9f6b461-33b5-4841-b42f-79e5542fed73] Got: **%QStart: VDSM_CONFIG/vars/ssl_excludes
2018-06-01 17:40:17,203+02 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [f9f6b461-33b5-4841-b42f-79e5542fed73] Got: ###
2018-06-01 17:40:17,203+02 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [f9f6b461-33b5-4841-b42f-79e5542fed73] Got: ### Please specify value for 'VDSM_CONFIG/vars/ssl_excludes':
2018-06-01 17:40:17,203+02 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [f9f6b461-33b5-4841-b42f-79e5542fed73] Got: ### Response is VALUE VDSM_CONFIG/vars/ssl_excludes=type:value or ABORT VDSM_CONFIG/vars/ssl_excludes
2018-06-01 17:40:17,203+02 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [f9f6b461-33b5-4841-b42f-79e5542fed73] Got: ***Q:VALUE VDSM_CONFIG/vars/ssl_excludes
2018-06-01 17:40:17,203+02 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [f9f6b461-33b5-4841-b42f-79e5542fed73] nextEvent: QueryValue VDSM_CONFIG/vars/ssl_excludes=null abort=false
2018-06-01 17:40:17,203+02 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [f9f6b461-33b5-4841-b42f-79e5542fed73] Got: **%QEnd: VDSM_CONFIG/vars/ssl_excludes
2018-06-01 17:40:17,204+02 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [f9f6b461-33b5-4841-b42f-79e5542fed73] Got: **%QStart: CUSTOMIZATION_COMMAND
2018-06-01 17:40:17,204+02 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [f9f6b461-33b5-4841-b42f-79e5542fed73] Got: ###
2018-06-01 17:40:17,204+02 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [f9f6b461-33b5-4841-b42f-79e5542fed73] Got: ### Customization phase, use 'install' to proceed
2018-06-01 17:40:17,204+02 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [f9f6b461-33b5-4841-b42f-79e5542fed73] Got: ### COMMAND> 
2018-06-01 17:40:17,204+02 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [f9f6b461-33b5-4841-b42f-79e5542fed73] Got: **%QHidden: FALSE
2018-06-01 17:40:17,205+02 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [f9f6b461-33b5-4841-b42f-79e5542fed73] Got: ***Q:STRING CUSTOMIZATION_COMMAND
2018-06-01 17:40:17,205+02 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [f9f6b461-33b5-4841-b42f-79e5542fed73] nextEvent: QueryString CUSTOMIZATION_COMMAND null

but when it hangs it doesn't.

I think it's something on engine say, maybe a kind of deadlock with the LockManager ???

Please note that also https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/2773/artifact/exported-artifacts/test_logs/upgrade-from-release-suite-master/post-002_bootstrap.py/lago-upgrade-from-release-suite-master-engine/_var_log/ovirt-engine/server.log

got abruptly truncated on 
2018-06-01 09:59:28,562-04 INFO  [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[43bd9c58]-nio2-thread-1) Server version string: SSH-2.0-OpenSSH_7.4
2018-06-01 09:59:28,578-04 INFO  [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[43bd9c58]-nio2-thread-2) Kex: server->client aes128-ctr hmac-sha2-256 none
2018-06-01 09:59:28,578-04 INFO  [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[43bd9c58]-nio2-thread-2) Kex: client->server aes128-ctr hmac-sha2-256 none
2018-06-01 09:59:28,886-04 WARN  [org.apache.sshd.client.session.ClientConnectionService] (sshd-SshClient[43bd9c58]-nio2-thread-2) Unknown global request: hostkeys-00

almost about 09:59:31 when we have the latest activity on the ssh connection.

Comment 9 Simone Tiraboschi 2018-06-01 16:12:13 UTC
Created attachment 1446734 [details]
JMX dump

Comment 10 Simone Tiraboschi 2018-06-01 16:13:27 UTC
Attaching a JMX dump from an hanged engine (reproduced once also on bare metal!!!), Martin, anything relevant?

Comment 11 Simone Tiraboschi 2018-06-01 16:15:32 UTC
Moving to engine side, since we understood that it hangs there during host-deploy interaction.

Comment 12 Martin Perina 2018-06-01 16:44:00 UTC
(In reply to Simone Tiraboschi from comment #10)
> Attaching a JMX dump from an hanged engine (reproduced once also on bare
> metal!!!), Martin, anything relevant?

Thanks a lot Simone!

Ravi, could you please debug the issue on the engine side?

Comment 13 Ravi Nori 2018-06-05 16:07:58 UTC
updated the patch in comment #1 to fix the issue

Comment 14 Martin Perina 2018-06-05 16:15:53 UTC
Closing as NOTABUG, because the issue was caused by improper usage of host-deploy interface.


Note You need to log in before you can comment on or make changes to this bug.