Description of problem: This is about the bug which came up in bug 1074257. From bug 1074257 comment 26: Update about "Pipe closed". ======================== The upgrade is happening on the node via vdsm-upgrade command but after upgrade and before the reboot the communication between ovirt-node and ovirt-engine unexpectedly closes generating IOException ("pipe broken"). logs from ovirt-engine UI events: ========================================= Host localhost.localdomain installation failed. Pipe closed. Installing Host localhost.localdomain. Step: RHEV_INSTALL. Installing Host localhost.localdomain. Step: umount; Details: umount Succeeded . Installing Host localhost.localdomain. Step: doUpgrade; Details: Upgrade Succeeded. Rebooting . Installing Host localhost.localdomain. Step: setMountPoint; Details: Mount succeeded. . Installing Host localhost.localdomain. Step: RHEL_INSTALL; Details: vdsm daemon stopped for upgrade process! . Installing Host localhost.localdomain. Executing /usr/share/vdsm-reg/vdsm-upgrade. Installing Host localhost.localdomain. Sending file /usr/share/ovirt-node-iso/ovirt-node-iso-3.0.4-1.999.201403191804.el6.iso to /data/updates/ovirt-node-image.iso. Installing Host localhost.localdomain. Connected to host 192.168.100.133 with SSH key fingerprint: a9:8b:21:a8:0d:e4:16:7d:4b:79:38:f3:e0:f6:92:e0. 2014-Mar-20, 22:43 /var/log/secure during the pipe closed ========================================= Mar 22 04:40:56 localhost sshd[23184]: Accepted publickey for root from 192.168.100.228 port 33905 ssh2 Mar 22 04:40:56 localhost sshd[23184]: pam_unix(sshd:session): session opened for user root by (uid=0) Mar 22 04:41:38 localhost sshd[23184]: channel_by_id: 0: bad id: channel free Mar 22 04:41:38 localhost sshd[23184]: Disconnecting: Received ieof for nonexistent channel 0. Mar 22 04:41:38 localhost sshd[23184]: pam_unix(sshd:session): session closed for user root Mar 22 04:42:14 localhost sshd[25420]: Connection closed by 192.168.100.1 Mar 22 04:42:20 localhost sshd[12927]: Received signal 15; terminating. Alon can be related to BZ#1051035 ? Additionally to this "Pipe closed" event, I have seen some connection timeout from ovirt-engine to ovirt-node during the process of sending the iso as reported in comment#20, working back after a retry. On oVirt Node side executing manually /usr/share/vdsm-reg/vdsm-upgrade works without error. # /usr/share/vdsm-reg/vdsm-upgrade <BSTRAP component='RHEL_INSTALL' status='WARN' message='vdsm daemon is already down before we stop it for upgrade.'/> <BSTRAP component='setMountPoint' status='OK' message='Mount succeeded.'/> <BSTRAP component='doUpgrade' status='OK' message='Upgrade Succeeded. Rebooting'/> <BSTRAP component='umount' status='OK' message='umount Succeeded'/> <BSTRAP component='RHEV_INSTALL' status='OK'/> Version-Release number of selected component (if applicable): 3.4 How reproducible: ? Steps to Reproduce: 1. Run upgrade throu RHEV-M to upgrade RHEV-H 2. 3. Actual results: Expected results: Additional info:
== A == $ ssh -v root.122.211 "/usr/share/vdsm-reg/vdsm-upgrade ; ls -l /proc/self/fd ; cat \$(ls -l /proc/self/fd | egrep -o '/proc/[0-9]+/fd' | sed 's/fd/cmdline/')" OpenSSH_6.2p2, OpenSSL 1.0.1e-fips 11 Feb 2013 ... debug1: Next authentication method: password root.122.211's password: debug1: Enabling compression at level 6. debug1: Authentication succeeded (password). Authenticated to 192.168.122.211 ([192.168.122.211]:22). debug1: channel 0: new [client-session] debug1: Requesting no-more-sessions debug1: Entering interactive session. debug1: Sending environment. debug1: Sending env XMODIFIERS = @im=ibus debug1: Sending env LANG = de_DE.utf8 debug1: Sending command: /usr/share/vdsm-reg/vdsm-upgrade ; ls -l /proc/self/fd ; cat $(ls -l /proc/self/fd | egrep -o '/proc/[0-9]+/fd' | sed 's/fd/cmdline/') <BSTRAP component='RHEL_INSTALL' status='WARN' message='vdsm daemon is already down before we stop it for upgrade.'/> <BSTRAP component='setMountPoint' status='OK' message='Mount succeeded.'/> <BSTRAP component='doUpgrade' status='OK' message='Upgrade Succeeded. Rebooting'/> <BSTRAP component='umount' status='OK' message='umount Succeeded'/> <BSTRAP component='RHEV_INSTALL' status='OK'/> total 0 lr-x------. 1 root root 64 Mar 25 17:53 0 -> pipe:[33284] l-wx------. 1 root root 64 Mar 25 17:53 1 -> pipe:[33285] l-wx------. 1 root root 64 Mar 25 17:53 2 -> pipe:[33286] lr-x------. 1 root root 64 Mar 25 17:53 3 -> /proc/13304/fd cat: /proc/13306/cmdline: No such file or directory debug1: client_input_channel_req: channel 0 rtype exit-status reply 0 debug1: client_input_channel_req: channel 0 rtype eow reply 0 debug1: channel 0: free: client-session, nchannels 1 Transferred: sent 3440, received 2960 bytes, in 10.2 seconds Bytes per second: sent 338.3, received 291.1 debug1: Exit status 1 debug1: compress outgoing: raw data 322, compressed 256, factor 0.80 debug1: compress incoming: raw data 807, compressed 414, factor 0.51 Everything looks good so far. == B == $ ssh -v root.122.211 "/usr/share/vdsm-reg/vdsm-upgrade ; ls -l /proc/self/fd ; sleep 10" OpenSSH_6.2p2, OpenSSL 1.0.1e-fips 11 Feb 2013 ... debug1: Next authentication method: password root.122.211's password: debug1: Enabling compression at level 6. debug1: Authentication succeeded (password). Authenticated to 192.168.122.211 ([192.168.122.211]:22). debug1: channel 0: new [client-session] debug1: Requesting no-more-sessions debug1: Entering interactive session. debug1: Sending environment. debug1: Sending env XMODIFIERS = @im=ibus debug1: Sending env LANG = de_DE.utf8 debug1: Sending command: /usr/share/vdsm-reg/vdsm-upgrade ; ls -l /proc/self/fd ; sleep 10 <BSTRAP component='RHEL_INSTALL' status='WARN' message='vdsm daemon is already down before we stop it for upgrade.'/> <BSTRAP component='setMountPoint' status='OK' message='Mount succeeded.'/> <BSTRAP component='doUpgrade' status='OK' message='Upgrade Succeeded. Rebooting'/> <BSTRAP component='umount' status='OK' message='umount Succeeded'/> <BSTRAP component='RHEV_INSTALL' status='OK'/> total 0 lr-x------. 1 root root 64 Mar 25 17:56 0 -> pipe:[33300] l-wx------. 1 root root 64 Mar 25 17:56 1 -> pipe:[33301] l-wx------. 1 root root 64 Mar 25 17:56 2 -> pipe:[33302] lr-x------. 1 root root 64 Mar 25 17:56 3 -> /proc/13313/fd debug1: channel 0: free: client-session, nchannels 1 Connection to 192.168.122.211 closed by remote host. Transferred: sent 3392, received 2800 bytes, in 16.5 seconds Bytes per second: sent 205.9, received 169.9 debug1: Exit status -1 debug1: compress outgoing: raw data 248, compressed 211, factor 0.85 debug1: compress incoming: raw data 682, compressed 317, factor 0.46 Compare the last 6 lines to the output of the previous ssh run. You see that in this run it is obvious that the "pipe got" closed / ssh got killed. But only because I added the sleep of 10 seconds to force this. In summary: The default flow (see at the top == A ==) is okay, the ssh connection is closed correctly. The output in the description also shows that the vdsm-upgrade completes. The output in == B == (above) shows how a failed ssh connection looks. My idea is now that the ssh implementation used by RHEV-M has a regression which doesn't correctly close the ssh connection (see the unusual messages in /var/log/secure in the description, which are not seen in the snippets in this comment). Alon, is openssh used or a java based implementation to establish ssh connections?
ssh implementation was not changed since 3.2, nor dependency. But from the above trace I do not understand what actually happens as there is no timestamps. Was there delay of 10 seconds before sshd killed? I will send you ssh test program.
Test program sent.
And the output for a good run: ]$ make check HOST=192.168.122.211 USER=root PASSWORD=harharhar COMMAND="/usr/share/vdsm-reg/vdsm-upgrade" javac -cp sshd-core-0.7.0.jar:. SSHTest.java java -cp *:. SSHTest "192.168.122.211" "root" "r" "/usr/share/vdsm-reg/vdsm-upgrade" 3 [main] INFO org.apache.sshd.common.util.SecurityUtils - BouncyCastle not registered, using the default JCE provider 144 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Session created... 210 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Server version string: SSH-2.0-OpenSSH_5.3 211 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Received SSH_MSG_KEXINIT 242 [NioProcessor-2] INFO org.apache.sshd.client.kex.DHG1 - Send SSH_MSG_KEXDH_INIT 246 [NioProcessor-2] INFO org.apache.sshd.client.kex.DHG1 - Received SSH_MSG_KEXDH_REPLY 273 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Received SSH_MSG_NEWKEYS 278 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Send SSH_MSG_SERVICE_REQUEST for ssh-userauth 321 [main] INFO org.apache.sshd.client.auth.UserAuthPassword - Send SSH_MSG_USERAUTH_REQUEST for password 347 [NioProcessor-2] INFO org.apache.sshd.client.auth.UserAuthPassword - Received SSH_MSG_USERAUTH_SUCCESS 355 [main] INFO org.apache.sshd.client.channel.ChannelExec - Send SSH_MSG_CHANNEL_OPEN on channel 0 434 [NioProcessor-2] INFO org.apache.sshd.client.channel.ChannelExec - Send SSH_MSG_CHANNEL_REQUEST exec Tue Mar 25 19:58:22 CET 2014 STATUS1: 129 10957 [NioProcessor-2] INFO org.apache.sshd.client.channel.ChannelExec - Received SSH_MSG_CHANNEL_REQUEST on channel 0 Tue Mar 25 19:58:23 CET 2014 STATUS2: 160 Tue Mar 25 19:58:23 CET 2014 STATUS3: 178 STDOUT: <BSTRAP component='RHEL_INSTALL' status='WARN' message='vdsm daemon is already down before we stop it for upgrade.'/> <BSTRAP component='setMountPoint' status='OK' message='Mount succeeded.'/> <BSTRAP component='doUpgrade' status='OK' message='Upgrade Succeeded. Rebooting'/> <BSTRAP component='umount' status='OK' message='umount Succeeded'/> <BSTRAP component='RHEV_INSTALL' status='OK'/> STDERR: GOOD: true
I actually tried to force a "pipe closed" problem by adding a sleep, but it did not happen. Looks like a good run: $ make check HOST=192.168.122.211 USER=root PASSWORD=r COMMAND="/usr/share/vdsm-reg/vdsm-upgrade ; sleep 10" javac -cp sshd-core-0.7.0.jar:. SSHTest.java java -cp *:. SSHTest "192.168.122.211" "root" "r" "/usr/share/vdsm-reg/vdsm-upgrade ; sleep 10" 2 [main] INFO org.apache.sshd.common.util.SecurityUtils - BouncyCastle not registered, using the default JCE provider 151 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Session created... 220 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Server version string: SSH-2.0-OpenSSH_5.3 224 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Received SSH_MSG_KEXINIT 254 [NioProcessor-2] INFO org.apache.sshd.client.kex.DHG1 - Send SSH_MSG_KEXDH_INIT 257 [NioProcessor-2] INFO org.apache.sshd.client.kex.DHG1 - Received SSH_MSG_KEXDH_REPLY 278 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Received SSH_MSG_NEWKEYS 283 [NioProcessor-2] INFO org.apache.sshd.client.session.ClientSessionImpl - Send SSH_MSG_SERVICE_REQUEST for ssh-userauth 325 [main] INFO org.apache.sshd.client.auth.UserAuthPassword - Send SSH_MSG_USERAUTH_REQUEST for password 484 [NioProcessor-2] INFO org.apache.sshd.client.auth.UserAuthPassword - Received SSH_MSG_USERAUTH_SUCCESS 492 [main] INFO org.apache.sshd.client.channel.ChannelExec - Send SSH_MSG_CHANNEL_OPEN on channel 0 618 [NioProcessor-2] INFO org.apache.sshd.client.channel.ChannelExec - Send SSH_MSG_CHANNEL_REQUEST exec Tue Mar 25 20:00:37 CET 2014 STATUS1: 129 Tue Mar 25 20:00:46 CET 2014 STATUS2: 146 Tue Mar 25 20:00:46 CET 2014 STATUS3: 146 STDOUT: <BSTRAP component='RHEL_INSTALL' status='WARN' message='vdsm daemon is already down before we stop it for upgrade.'/> <BSTRAP component='setMountPoint' status='OK' message='Mount succeeded.'/> <BSTRAP component='doUpgrade' status='OK' message='Upgrade Succeeded. Rebooting'/> <BSTRAP component='umount' status='OK' message='umount Succeeded'/> <BSTRAP component='RHEV_INSTALL' status='OK'/> STDERR: GOOD: true
Yes it is, why have you added sleep 10? This is exactly what the engine is doing. Douglas, please reproduce. I do not know where the problem is and got no log for this io error at engine side.
Created attachment 878846 [details] engine.log
The steps to reproduce: $ git clone git://gerrit.ovirt.org/ovirt-engine $ git checkout remotes/origin/ovirt-engine-3.4 -b 3.4 <make the compilation from http://www.ovirt.org/OVirt_Engine_Development_Environment> $HOME/ovirt-engine/share/ovirt-engine/services/ovirt-engine/ovirt-engine.py start Install the ovirt-node rpm on engine: http://209.132.179.162/ovirt/ovirt-node-iso-3.0.4-1.0.201401291204.noarch.rpm Download the iso from below link: http://209.132.179.162/ovirt/ovirt-node-iso-3.0.4-1.0.201401291204.iso - Install the iso (host will be up) - Put the host into maint. on the dialog below click to upgrade, select the iso -> OK You might face timeout sending the iso to ovirt-node and others times it just send the iso, executes the upgrade, reboot and send the pipe error to engine UI. Please note that I have attached the engine.log which includes a timeout error before the pipe one. Finally, Lev also faced both error on downstream build based on 3.4.
Cannot reproduce on my machine nor ovirt-node and rhev-h On Lev's physical host it is reproduced, with same version of rhev-h.
Might there be something up with Lev's network setup?
OK, found the root cause. Java's PipedInputStream has this behavior that if remote or local pipe is closed it does not returns EOF but IOException. This makes it very hard to process considering there is no non-blocking InputStream/OutputStream. So it was a race lucky could be reproduced reliable on one host. Solution is to quit read loop as long as we get the end message. The host-deploy is not effected as it already waits for termination message within the otopi protocol. Note for future: we need similar termination message to ovirt-node-upgrade script, a message/token that is always printed in termination. --- commit 31717db9daa9f10afb4c4a9adb7431981e7f90e6 Author: Alon Bar-Lev <alonbl> Date: Wed Mar 26 15:33:04 2014 +0200 host-deploy: node-upgrade: apply artificial eof for vdsm-upgrade apply artificial end-of-file there is no way in java to use non blocking InputStream. there is no way to interrupt the blocking. and we cannot terminate thread as the connection pool will get messy. so we should detect eof based on data. Bug-Url: https://bugzilla.redhat.com/show_bug.cgi?id=1080594 Change-Id: I093357933d2f10c0d90864f53977b832a58dafbd Signed-off-by: Alon Bar-Lev <alonbl>
This is an automated message oVirt 3.4.1 has been released: * should fix your issue * should be available at your local mirror within two days. If problems still persist, please make note of it in this bug report.