Bug 1123613

Summary: Unable to upgrade RHEVH - Host slot-6 installation failed. SSH session timeout host 'root@host-ip'.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Pavel Stehlik <pstehlik>
Component: ovirt-engine
Assignee: Moti Asayag <masayag>
Status: CLOSED CURRENTRELEASE
QA Contact: Pavol Brilla <pbrilla>
Severity: low
Docs Contact:
Priority: low
Version: 3.4.1-1
CC: bazulay, cshao, ecohen, iheim, lpeer, oourfali, pstehlik, rbalakri, Rhev-m-bugs, sherold, ybronhei, ycui, yeylon
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: infra
Fixed In Version: 3.6.0-11
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-01-13 15:40:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Pavel Stehlik 2014-07-27 09:02:02 UTC
Description of problem:
 Upgrading from the previous version (6.5-20140723.0.el6ev.noarch.rpm) fails.
 Having ~1 GB of free space left on '/' is NOT enough, so a basic workaround is to erase logs under /var/log in order to free more space.

The only log message is in engine.log [1]. I did not find any other log on the node that says anything about the upgrade attempt. This issue has probably existed for a long time, but I never encountered it even when I had less space on the device, which is odd. The engine does not even start copying the ISO file to the node (the old ISO is still there, and the UI reports an SSH timeout).
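The failure mode described above can be anticipated with a simple host-side free-space check before attempting the upgrade. This is a minimal sketch, not part of RHEV-H or the engine; the 500 MB threshold is an assumption borrowed from the engine default mentioned later in this bug (comment 11):

```shell
#!/bin/sh
# Hypothetical pre-upgrade check: warn if '/' or /var/log is low on space.
# Threshold (500 MB) mirrors the engine's default, per comment 11 below.
required_kb=512000
for mnt in / /var/log; do
    # df -Pk gives POSIX-format output; column 4 is available space in KB.
    avail_kb=$(df -Pk "$mnt" | awk 'NR==2 {print $4}')
    if [ "$avail_kb" -lt "$required_kb" ]; then
        echo "WARNING: only ${avail_kb} KB free on ${mnt}; the upgrade ISO copy may fail"
    fi
done
```

Running this before step 2 of "Steps to Reproduce" would have flagged the condition that later surfaces only as an SSH timeout.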


Some system properties:
[root@slot-6 ~]# cat /etc/redhat-release
Red Hat Enterprise Virtualization Hypervisor release 6.5 (20140725.0.el6ev)
[root@slot-6 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/live-rw   1.5G  457M  1.1G  31% /
/dev/mapper/HostVG-Config
                      7.8M  2.4M  5.1M  32% /config
tmpfs                 5.9G     0  5.9G   0% /dev/shm
[root@slot-6 ~]# mount
/dev/mapper/live-rw on / type ext2 (ro,noatime)
/dev/mapper/HostVG-Config on /config type ext4 (rw,noatime)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/data/images on /var/lib/libvirt/images type none (rw,bind)
/data/core on /var/log/core type none (rw,bind)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)




[1] -
...
2014-07-27 09:48:59,559 INFO  [org.ovirt.engine.core.utils.ssh.SSHDialog] (org.ovirt.thread.pool-4-thread-17) SSH execute root.63.138 'mkdir -p '/data/updates''
2014-07-27 09:53:59,743 ERROR [org.ovirt.engine.core.bll.OVirtNodeUpgrade] (org.ovirt.thread.pool-4-thread-17) [743a0981] Timeout during node 10.34.63.138 upgrade: javax.naming.TimeLimitExceededException: SSH session timeout host 'root.63.138'
        at org.ovirt.engine.core.utils.ssh.SSHClient.executeCommand(SSHClient.java:499) [utils.jar:]
        at org.ovirt.engine.core.utils.ssh.SSHClient.sendFile(SSHClient.java:633) [utils.jar:]
        at org.ovirt.engine.core.utils.ssh.SSHDialog.sendFile(SSHDialog.java:374) [utils.jar:]

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Install the new RHEVH rpm on the engine.
2. In webadmin, set the RHEVH host to Maintenance and upgrade it from the General sub-tab.

Comment 3 cshao 2014-07-28 05:51:26 UTC
RHEV-H QE reproduced this issue under a special condition; the RHEV-H upgrade does not always fail.

Test version:
rhev-hypervisor6-6.5-20140723.0.el6ev(vdsm-4.14.11-5.el6ev.x86_64)
rhev-hypervisor6-6.5-20140725.0.el6ev(vdsm-4.14.11-5.el6ev.x86_64)
ovirt-node-3.0.1-18.el6.14.noarch
ovirt-node-3.0.1-18.el6.14.noarch
rhevm-3.4.1-0.30.el6ev.noarch


Test scenario1 (Default install-> log size=2048M)
===================================================
1. Install rhev-hypervisor6-6.5-20140725.0.rpm to engine
2. Install rhev-hypervisor6-6.5-20140723.0.el6 with default setting(log size=2048M)
3. Register to RHEVM.
4. In webadmin, Maintenance RHEVH host & upgrade to (rhev-hypervisor6-6.5-20140725.0).

Test result:
The upgrade succeeds.


Test scenario2 (Minimal install-> log size=24M)
===================================================
1. Install rhev-hypervisor6-6.5-20140723.0.el6 with minimal setting(log size=24M)
2. Register to RHEVM.
3. In webadmin, Maintenance RHEVH host & upgrade to (rhev-hypervisor6-6.5-20140725.0).

Test result:
The upgrade succeeds.


Test scenario3 ***Special Condition*** (default install, then fill the log partition manually)
===================================================
1. Install rhev-hypervisor6-6.5-20140723.0.el6 with default setting(log size=2048M)
2. Register to RHEVM and approve it.
3. dd a big file (1.6 GB) into the log partition.
4. In webadmin, Maintenance RHEVH host & upgrade to (rhev-hypervisor6-6.5-20140725.0).

Test result:
1. Step 3: a pop-up appears:
Critical, low disk space. Host dhcp-9-139.nay.redhat.com has less than 500 MB of free space left on: /var/log
2. Step 4: the RHEV-H upgrade failed:
 Host dhcp-9-139.nay.redhat.com installation failed, SSH session timeout host 'root.9.139'
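For reference, the "fill the log partition" step can be sketched as follows. The sizes and path here are scaled down and redirected to a temp directory so the sketch is safe to run anywhere; the actual reproduction wrote ~1.6 GB into the 2 GB /var/log partition, and filler.img is an illustrative name:

```shell
#!/bin/sh
# Sketch of scenario 3, step 3: exhaust the log partition with a filler file.
# Real repro: of=/var/log/filler.img bs=1M count=1600 (about 1.6 GB).
target="${TMPDIR:-/tmp}/filler.img"
dd if=/dev/zero of="$target" bs=1M count=16 status=none
ls -lh "$target"
# Per comment 4 below, the filler file must be removed again after step 3,
# before triggering the upgrade:
rm -f "$target"
```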

Comment 4 cshao 2014-07-28 07:48:46 UTC
This is not a regression, because we can reproduce this issue with the versions below.

rhev-hypervisor6-6.5-20140618.0.el6ev 
vdsm-4.14.7-3.el6ev.x86_64
ovirt-node-3.0.1-18.el6_5.10.noarch

Update to rhev-hypervisor6-6.5-20140624.0
vdsm-4.14.7-3.el6ev.x86_64
ovirt-node-3.0.1-18.el6_5.11.noarch

Here is an update for #c3 -> Test scenario3:
One more step, "Remove the dd file", needs to be added after step 3.

Thanks!

Comment 5 Fabian Deutsch 2014-07-28 10:06:20 UTC
Moving this to Engine, as the ISO is never copied from the Engine to the Node side (as noted in the description).

Comment 6 Barak 2014-07-29 09:42:43 UTC
As far as I remember we report low disk space to the engine, and warn about it.
This is not something specific to RHEVH or to this version of RHEVH.

This is not a blocker for anything.

Yaniv, please take a look into it; let's try to think of a generic solution in host-deploy.

Comment 7 Barak 2014-07-29 09:43:55 UTC
This is not an urgent issue, nor is it specific to this version.
A workaround would be to reinstall the RHEV-H.

Comment 8 Pavel Stehlik 2014-08-29 13:23:12 UTC
(In reply to Barak from comment #6)
> As far as I remember we report low disk space to the engine, and warn about
> it.
> This is not something specific to RHEVH or to this version of RHEVH.
> 
> This is not a blocker for anything.
> 
> Yaniv - please take a look into it, let's try to think of some generic
> solution on host-deply.

OK, a workaround exists and this is not a blocker; up to here I agree.

I don't agree that a general "Low disk space" warning will prevent the user from hitting this, or inform him clearly enough about the root cause of the upgrade failure. "The space has been low for weeks, and now I'm installing a 100 MB file with 1 GB of space available. What's wrong?" <= this is a real scenario.

Solution: either inform the user that once the low-space warning has occurred, he can run into this issue.
 Or, even better, tell him when the file upload fails WHY it failed: not enough space left on the device.

Currently a misleading warning about an SSH session timeout is shown instead, which is very bad.

Comment 9 Oved Ourfali 2015-08-06 07:00:47 UTC
Moti - please check if we indeed report the low disk space on this rhev-h machine to the engine.
If so, worth adding to the warning that:
"Low disk space might cause an issue upgrading this host."

Comment 10 Pavel Stehlik 2015-08-12 10:41:02 UTC
(In reply to Oved Ourfali from comment #9)
> Moti - please check if we indeed report the low disk space on this rhev-h
> machine to the engine.
> If so, worth adding to the warning that:
> "Low disk space might cause an issue upgrading this host."

Agree.
(Although I don't understand why a specific error isn't used during the upload, and why an SSH timeout is exposed instead.)

Comment 11 Moti Asayag 2015-08-18 09:41:43 UTC
(In reply to Oved Ourfali from comment #9)
> Moti - please check if we indeed report the low disk space on this rhev-h
> machine to the engine.

Yes, the monitoring process checks the storage on ovirt-node and regular hosts.
The check relies on the configuration value 'VdsLocalDisksCriticallyLowFreeSpace', whose default is 500 MB, so there should be a warning for that.
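For administrators, the threshold named above could in principle be inspected (and adjusted) on the RHEV-M machine with the engine-config tool. This is a sketch only; the key name is taken from this comment, and the exact invocation and the unit of the value (assumed to be MB, matching the 500 MB default) are assumptions:

```shell
# Assumption: run on the RHEV-M machine, as root.
# Query the current low-disk-space threshold (key name from this comment):
engine-config -g VdsLocalDisksCriticallyLowFreeSpace
# Raise it, e.g. to 1000 MB (assumed unit); the ovirt-engine service
# typically needs a restart for engine-config changes to take effect:
engine-config -s VdsLocalDisksCriticallyLowFreeSpace=1000
```

This is a configuration fragment that only runs on a RHEV-M host, so it is shown untested.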

Comment 12 Pavel Stehlik 2016-01-13 15:40:55 UTC
Closing old bugs. If this still happens, please reopen.