Bug 1080594 - [host-deploy][node-upgrade] Pipe closed at the end of vdsm-upgrade
Summary: [host-deploy][node-upgrade] Pipe closed at the end of vdsm-upgrade
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.4.1
Assignee: Alon Bar-Lev
QA Contact: Pavel Stehlik
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2014-03-25 18:03 UTC by Fabian Deutsch
Modified: 2014-05-08 13:35 UTC (History)
CC: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-05-08 13:35:16 UTC
oVirt Team: ---
Embargoed:


Attachments
engine.log (102.17 KB, text/x-log)
2014-03-26 04:29 UTC, Douglas Schilling Landgraf


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 26122 0 None MERGED host-deploy: node-upgrade: apply artificial eof for vdsm-upgrade Never
oVirt gerrit 26242 0 None MERGED host-deploy: node-upgrade: apply artificial eof for vdsm-upgrade Never

Description Fabian Deutsch 2014-03-25 18:03:28 UTC
Description of problem:

This is about the bug which came up in bug 1074257.

From bug 1074257 comment 26:

Update about "Pipe closed".
========================

The upgrade happens on the node via the vdsm-upgrade command, but after the upgrade and before the reboot, the communication between ovirt-node and ovirt-engine unexpectedly closes, generating an IOException ("pipe broken").

logs from ovirt-engine UI events:
=========================================
Host localhost.localdomain installation failed. Pipe closed.
Installing Host localhost.localdomain. Step: RHEV_INSTALL.
Installing Host localhost.localdomain. Step: umount; Details: umount Succeeded .
Installing Host localhost.localdomain. Step: doUpgrade; Details:
Upgrade Succeeded. Rebooting .
Installing Host localhost.localdomain. Step: setMountPoint; Details:
Mount succeeded. .
Installing Host localhost.localdomain. Step: RHEL_INSTALL; Details:
vdsm daemon stopped for upgrade process! .
Installing Host localhost.localdomain. Executing
/usr/share/vdsm-reg/vdsm-upgrade.
Installing Host localhost.localdomain. Sending file
/usr/share/ovirt-node-iso/ovirt-node-iso-3.0.4-1.999.201403191804.el6.iso
to /data/updates/ovirt-node-image.iso.
Installing Host localhost.localdomain. Connected to host
192.168.100.133 with SSH key fingerprint:
a9:8b:21:a8:0d:e4:16:7d:4b:79:38:f3:e0:f6:92:e0.
2014-Mar-20, 22:43

/var/log/secure during the pipe closed
=========================================
Mar 22 04:40:56 localhost sshd[23184]: Accepted publickey for root from 192.168.100.228 port 33905 ssh2
Mar 22 04:40:56 localhost sshd[23184]: pam_unix(sshd:session): session opened for user root by (uid=0)
Mar 22 04:41:38 localhost sshd[23184]: channel_by_id: 0: bad id: channel free
Mar 22 04:41:38 localhost sshd[23184]: Disconnecting: Received ieof for nonexistent channel 0.
Mar 22 04:41:38 localhost sshd[23184]: pam_unix(sshd:session): session closed for user root
Mar 22 04:42:14 localhost sshd[25420]: Connection closed by 192.168.100.1
Mar 22 04:42:20 localhost sshd[12927]: Received signal 15; terminating.

Alon, can this be related to BZ#1051035?

In addition to this "Pipe closed" event, I have seen some connection timeouts from ovirt-engine to ovirt-node during the process of sending the iso, as reported in comment#20; it worked again after a retry.

On the oVirt Node side, executing /usr/share/vdsm-reg/vdsm-upgrade manually works without error.
# /usr/share/vdsm-reg/vdsm-upgrade 
<BSTRAP component='RHEL_INSTALL' status='WARN' message='vdsm daemon is already down before we stop it for upgrade.'/>
<BSTRAP component='setMountPoint' status='OK' message='Mount succeeded.'/>
<BSTRAP component='doUpgrade' status='OK' message='Upgrade Succeeded. Rebooting'/>
<BSTRAP component='umount' status='OK' message='umount Succeeded'/>
<BSTRAP component='RHEV_INSTALL' status='OK'/>
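The BSTRAP lines above form the status protocol the engine consumes over ssh. As an illustrative sketch only (the engine's actual parser is different, and `BstrapParser` is a hypothetical name), such a line can be split into its component/status/message fields like this:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: split a vdsm-upgrade BSTRAP status line into its
// fields. The engine's real parser differs; this only illustrates the
// line format shown in the output above.
public class BstrapParser {
    private static final Pattern LINE = Pattern.compile(
        "<BSTRAP component='([^']*)' status='([^']*)'(?: message='([^']*)')?/>");

    // Returns {component, status, message-or-null}, or null if the line
    // is not a BSTRAP status line.
    public static String[] parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] f = parse("<BSTRAP component='RHEV_INSTALL' status='OK'/>");
        System.out.println(f[0] + " " + f[1]); // prints: RHEV_INSTALL OK
    }
}
```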

Version-Release number of selected component (if applicable):
3.4

How reproducible:
?

Steps to Reproduce:
1. Run the upgrade through RHEV-M to upgrade RHEV-H

Actual results:


Expected results:


Additional info:

Comment 1 Fabian Deutsch 2014-03-25 18:09:44 UTC
== A ==
$ ssh -v root@192.168.122.211 "/usr/share/vdsm-reg/vdsm-upgrade ; ls -l /proc/self/fd ; cat \$(ls -l /proc/self/fd | egrep -o '/proc/[0-9]+/fd' | sed 's/fd/cmdline/')"
OpenSSH_6.2p2, OpenSSL 1.0.1e-fips 11 Feb 2013
...
debug1: Next authentication method: password
root@192.168.122.211's password: 
debug1: Enabling compression at level 6.
debug1: Authentication succeeded (password).
Authenticated to 192.168.122.211 ([192.168.122.211]:22).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions
debug1: Entering interactive session.
debug1: Sending environment.
debug1: Sending env XMODIFIERS = @im=ibus
debug1: Sending env LANG = de_DE.utf8
debug1: Sending command: /usr/share/vdsm-reg/vdsm-upgrade ; ls -l /proc/self/fd ; cat $(ls -l /proc/self/fd | egrep -o '/proc/[0-9]+/fd' | sed 's/fd/cmdline/')
<BSTRAP component='RHEL_INSTALL' status='WARN' message='vdsm daemon is already down before we stop it for upgrade.'/>
<BSTRAP component='setMountPoint' status='OK' message='Mount succeeded.'/>
<BSTRAP component='doUpgrade' status='OK' message='Upgrade Succeeded. Rebooting'/>
<BSTRAP component='umount' status='OK' message='umount Succeeded'/>
<BSTRAP component='RHEV_INSTALL' status='OK'/>
total 0
lr-x------. 1 root root 64 Mar 25 17:53 0 -> pipe:[33284]
l-wx------. 1 root root 64 Mar 25 17:53 1 -> pipe:[33285]
l-wx------. 1 root root 64 Mar 25 17:53 2 -> pipe:[33286]
lr-x------. 1 root root 64 Mar 25 17:53 3 -> /proc/13304/fd
cat: /proc/13306/cmdline: No such file or directory
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: client_input_channel_req: channel 0 rtype eow reply 0
debug1: channel 0: free: client-session, nchannels 1
Transferred: sent 3440, received 2960 bytes, in 10.2 seconds
Bytes per second: sent 338.3, received 291.1
debug1: Exit status 1
debug1: compress outgoing: raw data 322, compressed 256, factor 0.80
debug1: compress incoming: raw data 807, compressed 414, factor 0.51

Everything looks good so far.


== B ==
$ ssh -v root@192.168.122.211 "/usr/share/vdsm-reg/vdsm-upgrade ; ls -l /proc/self/fd ; sleep 10"
OpenSSH_6.2p2, OpenSSL 1.0.1e-fips 11 Feb 2013
...
debug1: Next authentication method: password
root@192.168.122.211's password: 
debug1: Enabling compression at level 6.
debug1: Authentication succeeded (password).
Authenticated to 192.168.122.211 ([192.168.122.211]:22).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions
debug1: Entering interactive session.
debug1: Sending environment.
debug1: Sending env XMODIFIERS = @im=ibus
debug1: Sending env LANG = de_DE.utf8
debug1: Sending command: /usr/share/vdsm-reg/vdsm-upgrade ; ls -l /proc/self/fd ; sleep 10
<BSTRAP component='RHEL_INSTALL' status='WARN' message='vdsm daemon is already down before we stop it for upgrade.'/>
<BSTRAP component='setMountPoint' status='OK' message='Mount succeeded.'/>
<BSTRAP component='doUpgrade' status='OK' message='Upgrade Succeeded. Rebooting'/>
<BSTRAP component='umount' status='OK' message='umount Succeeded'/>
<BSTRAP component='RHEV_INSTALL' status='OK'/>
total 0
lr-x------. 1 root root 64 Mar 25 17:56 0 -> pipe:[33300]
l-wx------. 1 root root 64 Mar 25 17:56 1 -> pipe:[33301]
l-wx------. 1 root root 64 Mar 25 17:56 2 -> pipe:[33302]
lr-x------. 1 root root 64 Mar 25 17:56 3 -> /proc/13313/fd
debug1: channel 0: free: client-session, nchannels 1
Connection to 192.168.122.211 closed by remote host.
Transferred: sent 3392, received 2800 bytes, in 16.5 seconds
Bytes per second: sent 205.9, received 169.9
debug1: Exit status -1
debug1: compress outgoing: raw data 248, compressed 211, factor 0.85
debug1: compress incoming: raw data 682, compressed 317, factor 0.46

Compare the last 6 lines to the output of the previous ssh run. In this run it is obvious that the pipe got closed / ssh got killed.

But only because I added the sleep of 10 seconds to force this.


In summary:
The default flow (== A == at the top) is okay; the ssh connection is closed correctly.
The output in the description also shows that the vdsm-upgrade completes.

The output in == B == (above) shows how a failed ssh connection looks.


My idea now is that the ssh implementation used by RHEV-M has a regression which doesn't correctly close the ssh connection (see the unusual messages in /var/log/secure in the description, which are not seen in the snippets in this comment).

Alon,

is openssh used to establish ssh connections, or a Java-based implementation?

Comment 2 Alon Bar-Lev 2014-03-25 18:18:44 UTC
The ssh implementation has not been changed since 3.2, nor has the dependency.

But from the above trace I do not understand what actually happens, as there are no timestamps.

Was there a delay of 10 seconds before sshd was killed?

I will send you an ssh test program.

Comment 3 Alon Bar-Lev 2014-03-25 18:54:52 UTC
Test program sent.

Comment 4 Fabian Deutsch 2014-03-25 18:59:53 UTC
And the output for a good run:

$ make check HOST=192.168.122.211 USER=root PASSWORD=harharhar COMMAND="/usr/share/vdsm-reg/vdsm-upgrade"
javac -cp sshd-core-0.7.0.jar:. SSHTest.java
java -cp *:. SSHTest "192.168.122.211" "root" "r" "/usr/share/vdsm-reg/vdsm-upgrade"
3 [main] INFO org.apache.sshd.common.util.SecurityUtils - BouncyCastle not registered, using the default JCE provider
144 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Session created...
210 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Server version string: SSH-2.0-OpenSSH_5.3
211 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Received SSH_MSG_KEXINIT
242 [NioProcessor-2] INFO org.apache.sshd.client.kex.DHG1 - Send SSH_MSG_KEXDH_INIT
246 [NioProcessor-2] INFO org.apache.sshd.client.kex.DHG1 - Received SSH_MSG_KEXDH_REPLY
273 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Received SSH_MSG_NEWKEYS
278 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Send SSH_MSG_SERVICE_REQUEST for ssh-userauth
321 [main] INFO org.apache.sshd.client.auth.UserAuthPassword - Send SSH_MSG_USERAUTH_REQUEST for password
347 [NioProcessor-2] INFO org.apache.sshd.client.auth.UserAuthPassword - Received SSH_MSG_USERAUTH_SUCCESS
355 [main] INFO org.apache.sshd.client.channel.ChannelExec - Send SSH_MSG_CHANNEL_OPEN on channel 0
434 [NioProcessor-2] INFO org.apache.sshd.client.channel.ChannelExec - Send SSH_MSG_CHANNEL_REQUEST exec
Tue Mar 25 19:58:22 CET 2014 STATUS1: 129
10957 [NioProcessor-2] INFO org.apache.sshd.client.channel.ChannelExec - Received SSH_MSG_CHANNEL_REQUEST on channel 0
Tue Mar 25 19:58:23 CET 2014 STATUS2: 160
Tue Mar 25 19:58:23 CET 2014 STATUS3: 178
STDOUT:
<BSTRAP component='RHEL_INSTALL' status='WARN' message='vdsm daemon is already down before we stop it for upgrade.'/>
<BSTRAP component='setMountPoint' status='OK' message='Mount succeeded.'/>
<BSTRAP component='doUpgrade' status='OK' message='Upgrade Succeeded. Rebooting'/>
<BSTRAP component='umount' status='OK' message='umount Succeeded'/>
<BSTRAP component='RHEV_INSTALL' status='OK'/>

STDERR:

GOOD: true

Comment 5 Fabian Deutsch 2014-03-25 19:01:46 UTC
I actually tried to force a "pipe closed" problem by adding a sleep, but it did not happen. Looks like a good run:

$ make check HOST=192.168.122.211 USER=root PASSWORD=r COMMAND="/usr/share/vdsm-reg/vdsm-upgrade ; sleep 10"
javac -cp sshd-core-0.7.0.jar:. SSHTest.java
java -cp *:. SSHTest "192.168.122.211" "root" "r" "/usr/share/vdsm-reg/vdsm-upgrade ; sleep 10"
2 [main] INFO org.apache.sshd.common.util.SecurityUtils - BouncyCastle not registered, using the default JCE provider
151 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Session created...
220 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Server version string: SSH-2.0-OpenSSH_5.3
224 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Received SSH_MSG_KEXINIT
254 [NioProcessor-2] INFO org.apache.sshd.client.kex.DHG1 - Send SSH_MSG_KEXDH_INIT
257 [NioProcessor-2] INFO org.apache.sshd.client.kex.DHG1 - Received SSH_MSG_KEXDH_REPLY
278 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Received SSH_MSG_NEWKEYS
283 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Send SSH_MSG_SERVICE_REQUEST for ssh-userauth
325 [main] INFO org.apache.sshd.client.auth.UserAuthPassword - Send SSH_MSG_USERAUTH_REQUEST for password
484 [NioProcessor-2] INFO org.apache.sshd.client.auth.UserAuthPassword - Received SSH_MSG_USERAUTH_SUCCESS
492 [main] INFO org.apache.sshd.client.channel.ChannelExec - Send SSH_MSG_CHANNEL_OPEN on channel 0
618 [NioProcessor-2] INFO org.apache.sshd.client.channel.ChannelExec - Send SSH_MSG_CHANNEL_REQUEST exec
Tue Mar 25 20:00:37 CET 2014 STATUS1: 129
Tue Mar 25 20:00:46 CET 2014 STATUS2: 146
Tue Mar 25 20:00:46 CET 2014 STATUS3: 146
STDOUT:
<BSTRAP component='RHEL_INSTALL' status='WARN' message='vdsm daemon is already down before we stop it for upgrade.'/>
<BSTRAP component='setMountPoint' status='OK' message='Mount succeeded.'/>
<BSTRAP component='doUpgrade' status='OK' message='Upgrade Succeeded. Rebooting'/>
<BSTRAP component='umount' status='OK' message='umount Succeeded'/>
<BSTRAP component='RHEV_INSTALL' status='OK'/>

STDERR:

GOOD: true

Comment 6 Alon Bar-Lev 2014-03-25 19:10:21 UTC
Yes it is. Why did you add the sleep 10?

This is exactly what the engine is doing.

Douglas, please reproduce. I do not know where the problem is, and I have no log for this I/O error on the engine side.

Comment 7 Douglas Schilling Landgraf 2014-03-26 04:29:37 UTC
Created attachment 878846 [details]
engine.log

Comment 8 Douglas Schilling Landgraf 2014-03-26 04:37:59 UTC
The steps to reproduce:

$ git clone git://gerrit.ovirt.org/ovirt-engine
$ git checkout remotes/origin/ovirt-engine-3.4 -b 3.4
<make the compilation from http://www.ovirt.org/OVirt_Engine_Development_Environment>
$HOME/ovirt-engine/share/ovirt-engine/services/ovirt-engine/ovirt-engine.py start

Install the ovirt-node rpm on engine:
http://209.132.179.162/ovirt/ovirt-node-iso-3.0.4-1.0.201401291204.noarch.rpm

Download the iso from below link:
http://209.132.179.162/ovirt/ovirt-node-iso-3.0.4-1.0.201401291204.iso

- Install the iso (host will be up)
- Put the host into maintenance; in the upgrade dialog, click to upgrade, select the iso -> OK

You might face a timeout sending the iso to ovirt-node; other times it just sends the iso, executes the upgrade, reboots, and sends the pipe error to the engine UI.

Please note that I have attached the engine.log, which includes a timeout error before the pipe one. Finally, Lev also faced both errors on a downstream build based on 3.4.

Comment 9 Alon Bar-Lev 2014-03-26 11:23:43 UTC
Cannot reproduce on my machine, with either ovirt-node or rhev-h.

On Lev's physical host it is reproduced, with the same version of rhev-h.

Comment 10 Fabian Deutsch 2014-03-26 11:35:25 UTC
Might there be something up with Lev's network setup?

Comment 11 Alon Bar-Lev 2014-03-26 15:23:02 UTC
OK, found the root cause.

Java's PipedInputStream behaves such that if the remote or local pipe is closed, it does not return EOF but throws an IOException instead. This makes it very hard to handle, considering there is no non-blocking InputStream/OutputStream. So it was a race that, luckily, could be reproduced reliably on one host.
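The described behavior of PipedInputStream can be demonstrated in isolation: when the thread that wrote into the pipe dies without closing its end, the reader never receives a clean EOF. A minimal, self-contained demo (not engine code):

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

// Minimal demonstration of the behavior described above: when the writing
// thread dies without closing its end of the pipe, PipedInputStream.read()
// does not return -1 (EOF) but throws an IOException instead (in OpenJDK
// the message is "Pipe broken").
public class PipeBrokenDemo {
    public static String readUntilBroken() throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out);

        Thread writer = new Thread(() -> {
            try {
                out.write("data".getBytes());
                // NOTE: out.close() is deliberately NOT called -- the thread
                // simply dies, like an abnormally terminated ssh channel.
            } catch (IOException ignored) {
            }
        });
        writer.start();
        writer.join();

        byte[] buf = new byte[4];
        in.read(buf); // already-buffered bytes are still readable
        try {
            in.read(); // no clean EOF is ever delivered here...
            return "no-exception";
        } catch (IOException e) {
            return e.getMessage(); // ...an IOException is thrown instead
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("read failed with: " + readUntilBroken());
    }
}
```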

The solution is to quit the read loop as soon as we receive the end message.

The host-deploy flow is not affected, as it already waits for a termination message within the otopi protocol.

Note for the future: we need a similar termination message for the ovirt-node-upgrade script, a message/token that is always printed at termination.
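The fix can be sketched as a read loop that synthesizes its own EOF once the protocol's termination line is seen, instead of waiting for a stream EOF that may arrive as an IOException. This is a hedged illustration only, assuming the RHEV_INSTALL line is the terminator; names like ArtificialEof and TERMINATION_TOKEN are hypothetical, not the engine's actual identifiers:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hedged sketch of the "artificial EOF" approach from the patch: stop
// reading as soon as the known termination line arrives, rather than
// relying on the stream to signal EOF (which PipedInputStream may replace
// with an IOException). Identifier names here are illustrative.
public class ArtificialEof {
    static final String TERMINATION_TOKEN = "<BSTRAP component='RHEV_INSTALL'";

    public static String readUntilToken(Reader raw) throws IOException {
        BufferedReader reader = new BufferedReader(raw);
        StringBuilder output = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            output.append(line).append('\n');
            if (line.startsWith(TERMINATION_TOKEN)) {
                break; // artificial EOF: the protocol says nothing follows
            }
        }
        return output.toString();
    }

    public static void main(String[] args) throws IOException {
        String sample =
            "<BSTRAP component='umount' status='OK' message='umount Succeeded'/>\n"
            + "<BSTRAP component='RHEV_INSTALL' status='OK'/>\n";
        System.out.print(readUntilToken(new StringReader(sample)));
    }
}
```

With this approach the caller never blocks on a read after the final status line, so a later abrupt channel teardown cannot surface as a "Pipe closed" failure.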

---

commit 31717db9daa9f10afb4c4a9adb7431981e7f90e6
Author: Alon Bar-Lev <alonbl>
Date:   Wed Mar 26 15:33:04 2014 +0200

    host-deploy: node-upgrade: apply artificial eof for vdsm-upgrade
    
    apply artificial end-of-file
    
    there is no way in java to use non blocking
    InputStream.
    
    there is no way to interrupt the blocking.
    
    and we cannot terminate thread as the connection
    pool will get messy.
    
    so we should detect eof based on data.
    
    Bug-Url: https://bugzilla.redhat.com/show_bug.cgi?id=1080594
    Change-Id: I093357933d2f10c0d90864f53977b832a58dafbd
    Signed-off-by: Alon Bar-Lev <alonbl>

Comment 12 Sandro Bonazzola 2014-05-08 13:35:16 UTC
This is an automated message

oVirt 3.4.1 has been released:
 * should fix your issue
 * should be available at your local mirror within two days.

If problems still persist, please make note of it in this bug report.

