Bug 1074257 - RHEVH can no longer be upgraded or re-installed through RHEVM
Summary: RHEVH can no longer be upgraded or re-installed through RHEVM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.4.0
Assignee: Douglas Schilling Landgraf
QA Contact: Pavel Stehlik
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2014-03-09 09:25 UTC by Lev Veyde
Modified: 2016-11-16 16:15 UTC
CC List: 21 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: RHEV-H hosts cannot be upgraded using the same ISO already installed on the host. Consequence: Users cannot upgrade RHEV-H hosts via the RHEV-M admin page.
Clone Of:
Environment:
Last Closed:
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
RHEVM Engine log (gzipped) (161.88 KB, application/x-tar-gz)
2014-03-09 09:25 UTC, Lev Veyde
deploymentlog (18.49 KB, application/x-tar-gz)
2014-04-01 12:29 UTC, Tareq Alayan
enginelog (14.99 KB, application/x-tar-gz)
2014-04-01 12:31 UTC, Tareq Alayan


Links
System ID: oVirt gerrit 25700 (Private: 0, Priority: None, Status: None, Summary: None, Last Updated: Never)

Description Lev Veyde 2014-03-09 09:25:46 UTC
Created attachment 872383 [details]
RHEVM Engine log (gzipped)

Description of problem:

It seems that we're receiving the following error during the "host-reinstall" tests (which are supposed to reinstall the already installed RHEVH host with the ISO image from RHEVM):

20:01:42 2014-03-06 20:01:42,456 - MainThread - hosts - ERROR - Response code is not valid, expected is: [200, 201], actual is: 400
20:01:42 2014-03-06 20:01:42,630 - MainThread - plmanagement.error_fetcher - ERROR - Errors fetched from VDC(jenkins-automation-rpm-vm17.eng.lab.tlv.redhat.com): 2014-03-06 20:01:42,356 ERROR [org.ovirt.engine.core.bll.UpdateVdsCommand] (ajp-/127.0.0.1:8702-5) [107] Installation/upgrade of Host a59f1b46-5da0-4c27-94c1-4a41f898e923,cinteg26.ci.lab.tlv.redhat.com failed due to: Cannot upgrade Host. Host version is not compatible with selected ISO version. Please select an ISO with major version 6.x.

I verified that the RHEVM has the correct RHEVH RPM installed:
rpm -q rhev-hypervisor6
rhev-hypervisor6-6.5-20140305.0.el6ev.noarch

And the RHEVH also has this latest version installed:
cat /etc/redhat-release
Red Hat Enterprise Virtualization Hypervisor release 6.5 (20140305.0.el6ev)

Comment 1 Fabian Deutsch 2014-03-10 11:56:10 UTC
I don't see any obvious version mismatch between the rpm and the iso.

Comment 2 Amador Pahim 2014-03-11 13:42:09 UTC
Could you share the result of the command below?

 # rpm -ql rhev-hypervisor6

Comment 3 Lev Veyde 2014-03-11 13:49:21 UTC
(In reply to Amador Pahim from comment #2)
> Could you share the result of the command below?
> 
>  # rpm -ql rhev-hypervisor6

/usr/share/rhev-hypervisor
/usr/share/rhev-hypervisor/rhevh-6.5-20140305.0.el6ev.iso
/usr/share/rhev-hypervisor/vdsm-compatibility-6.5-20140305.0.el6ev.txt
/usr/share/rhev-hypervisor/version-6.5-20140305.0.el6ev.txt

Comment 4 Lev Veyde 2014-03-11 13:51:00 UTC
Just in case...

# cat /usr/share/rhev-hypervisor/vdsm-compatibility-6.5-20140305.0.el6ev.txt
3.4,3.3,3.2,3.1,3.0

# cat /usr/share/rhev-hypervisor/version-6.5-20140305.0.el6ev.txt
6.5,20140305.0.el6ev
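
The error in the description is about the host's major version not matching the ISO's major version. As a purely illustrative sketch (this is not the ovirt-engine's actual Java logic; the function name is made up), the kind of check that error message implies could look like this, using the contents of the version file shown above:

# Illustrative only: compare the ISO's major version (from the RPM's
# version-*.txt file, e.g. "6.5,20140305.0.el6ev") against the host's major
# OS version (e.g. 6). The engine reports "Please select an ISO with major
# version 6.x." when a comparison like this fails.
def iso_matches_host_major(version_file_contents, host_major):
    iso_version = version_file_contents.strip().split(",")[0]  # "6.5"
    iso_major = iso_version.split(".")[0]                      # "6"
    return iso_major == str(host_major)

print(iso_matches_host_major("6.5,20140305.0.el6ev", 6))  # True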

Comment 5 Fabian Deutsch 2014-03-11 19:25:37 UTC
Lev,

could you please check, when the previous RHEV-H release is installed on a machine, whether
a) RHEVM suggests the upgrade, and
b) the upgrade succeeds?

Comment 6 Lev Veyde 2014-03-11 20:01:53 UTC
(In reply to Fabian Deutsch from comment #5)
> Lev,
> 
> could you please check, when the previous RHEV-H release is installed on a
> machine, whether
> a) RHEVM suggests the upgrade, and
> b) the upgrade succeeds?

The upgrade from the previous version (Red Hat Enterprise Virtualization Hypervisor release 6.5 (20140217.0.el6ev)) fails as well.

Comment 8 Amador Pahim 2014-03-12 11:43:00 UTC
I am not able to reproduce this issue with rhevm-3.4.0-0.3.master.el6ev.noarch.
I tried to upgrade from rhev-hypervisor6-6.5-20140217.0.el6ev.noarch and then to reinstall rhev-hypervisor6-6.5-20140305.0.el6ev.noarch, and saw no issue like this.

Comment 9 Amador Pahim 2014-03-12 14:01:56 UTC
Tested with rhevm-3.4.0-0.4.master.el6ev.noarch and rhev-hypervisor6-6.5-20140311.0.el6ev.noarch. Still not reproduced. Instead, another error comes into play. During reinstall, the engine hits an exception:

2014-03-12 10:37:41,079 ERROR [org.ovirt.engine.core.bll.OVirtNodeUpgrade] (OVirtNodeUpgrade) Error during upgrade: java.io.IOException: Pipe closed


After the error, the hypervisor is restarted. When it comes back, I noticed the system was reinstalled but vdsm is down:


[root@rhevh /]# service vdsmd status
VDS daemon is not running


No success starting it:


[root@rhevh /]# service vdsmd restart
Shutting down vdsm daemon:
vdsm watchdog stop                                         [  OK  ]
vdsm: not running                                          [FAILED]
vdsm: Running run_final_hooks
vdsm stop                                                  [  OK  ]
initctl: Job is already running: libvirtd
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running run_init_hooks
vdsm: Running gencerts
vdsm: Running check_is_configured
libvirt is not configured for vdsm yet
sanlock service requires restart
Modules libvirt,sanlock are not configured
 Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 145, in <module>
    sys.exit(main())
  File "/usr/bin/vdsm-tool", line 142, in main
    return tool_command[cmd]["command"](*args[1:])
  File "/usr/lib64/python2.6/site-packages/vdsm/tool/configurator.py", line 265, in isconfigured
RuntimeError:

One of the modules is not configured to work with VDSM.
To configure the module use the following:
'vdsm-tool configure [module_name]'.

If all modules are not configured try to use:
'vdsm-tool configure --force'
(The force flag will stop the module's service and start it
afterwards automatically to load the new configuration.)

vdsm: stopped during execute check_is_configured task (task returned with error code 1).
vdsm start                                                 [FAILED]


VDSM only works again after a "configure --force":


[root@rhevh /]# vdsm-tool configure --force

Checking configuration status...


Running configure...
checking certs..
File already persisted: /etc/libvirt/libvirtd.conf
File already persisted: /etc/libvirt/qemu.conf
File already persisted: /etc/sysconfig/libvirtd
File already persisted: /etc/logrotate.d/libvirtd

Reconfiguration of libvirt is done.
/bin/sed: cannot rename /etc/logrotate.d/sedOFU2sH: Device or resource busy
/bin/mv: inter-device move failed: `/tmp/tmp.tVIZ4UY9xN' to `/etc/logrotate.d/libvirtd'; unable to remove target: Device or resource busy

Done configuring modules to VDSM.



[root@rhevh /]# service vdsmd restart
Shutting down vdsm daemon: 
vdsm watchdog stop                                         [  OK  ]
vdsm: not running                                          [FAILED]
vdsm: Running run_final_hooks
vdsm stop                                                  [  OK  ]
initctl: Job is already running: libvirtd
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running run_init_hooks
vdsm: Running gencerts
vdsm: Running check_is_configured
libvirt is already configured for vdsm
sanlock service is already configured
vdsm: Running validate_configuration
SUCCESS: ssl configured to true. No conflicts
vdsm: Running prepare_transient_repository
vdsm: Running syslog_available
vdsm: Running nwfilter
vdsm: Running dummybr
vdsm: Running load_needed_modules
vdsm: Running tune_system
vdsm: Running test_space
vdsm: Running test_lo
vdsm: Running restore_nets
vdsm: Running unified_network_persistence_upgrade
vdsm: Running upgrade_300_nets
Starting up vdsm daemon: 
vdsm start                                                 [  OK  ]
[root@rhevh /]#

Comment 22 Fabian Deutsch 2014-03-17 18:51:43 UTC
From a RHEV-H perspective this could be considered a blocker, because upgrades are, according to comment 6, also affected.

Comment 23 Kiril Nesenko 2014-03-17 18:56:31 UTC
(In reply to Fabian Deutsch from comment #22)
> From a RHEV-H perspective this could be considered as a blocker, because
> upgrades are - according to comment 6 - also affected.

Douglas found a problem with the vdsm build. I am going to rebuild vdsm tomorrow, then rebuild rhevh, and we will see.

- Kiril

Comment 26 Douglas Schilling Landgraf 2014-03-22 06:16:20 UTC
Update about "Pipe closed".
========================

The upgrade happens on the node via the vdsm-upgrade command, but after the upgrade and before the reboot the communication between ovirt-node and ovirt-engine unexpectedly closes, generating an IOException ("Pipe closed").

logs from ovirt-engine UI events:
=========================================
Host localhost.localdomain installation failed. Pipe closed.
Installing Host localhost.localdomain. Step: RHEV_INSTALL.
Installing Host localhost.localdomain. Step: umount; Details: umount Succeeded .
Installing Host localhost.localdomain. Step: doUpgrade; Details:
Upgrade Succeeded. Rebooting .
Installing Host localhost.localdomain. Step: setMountPoint; Details:
Mount succeeded. .
Installing Host localhost.localdomain. Step: RHEL_INSTALL; Details:
vdsm daemon stopped for upgrade process! .
Installing Host localhost.localdomain. Executing
/usr/share/vdsm-reg/vdsm-upgrade.
Installing Host localhost.localdomain. Sending file
/usr/share/ovirt-node-iso/ovirt-node-iso-3.0.4-1.999.201403191804.el6.iso
to /data/updates/ovirt-node-image.iso.
Installing Host localhost.localdomain. Connected to host
192.168.100.133 with SSH key fingerprint:
a9:8b:21:a8:0d:e4:16:7d:4b:79:38:f3:e0:f6:92:e0.
2014-Mar-20, 22:43

/var/log/secure during the pipe closed
=========================================
Mar 22 04:40:56 localhost sshd[23184]: Accepted publickey for root from 192.168.100.228 port 33905 ssh2
Mar 22 04:40:56 localhost sshd[23184]: pam_unix(sshd:session): session opened for user root by (uid=0)
Mar 22 04:41:38 localhost sshd[23184]: channel_by_id: 0: bad id: channel free
Mar 22 04:41:38 localhost sshd[23184]: Disconnecting: Received ieof for nonexistent channel 0.
Mar 22 04:41:38 localhost sshd[23184]: pam_unix(sshd:session): session closed for user root
Mar 22 04:42:14 localhost sshd[25420]: Connection closed by 192.168.100.1
Mar 22 04:42:20 localhost sshd[12927]: Received signal 15; terminating.

Alon, can this be related to BZ#1051035?

In addition to this "Pipe closed" event, I have seen some connection timeouts from ovirt-engine to ovirt-node during the process of sending the ISO, as reported in comment #20; it worked again after a retry.

On the oVirt Node side, executing /usr/share/vdsm-reg/vdsm-upgrade manually works without error.
# /usr/share/vdsm-reg/vdsm-upgrade 
<BSTRAP component='RHEL_INSTALL' status='WARN' message='vdsm daemon is already down before we stop it for upgrade.'/>
<BSTRAP component='setMountPoint' status='OK' message='Mount succeeded.'/>
<BSTRAP component='doUpgrade' status='OK' message='Upgrade Succeeded. Rebooting'/>
<BSTRAP component='umount' status='OK' message='umount Succeeded'/>
<BSTRAP component='RHEV_INSTALL' status='OK'/>
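
For reference, here is a minimal sketch (not the engine's actual code; the helper name is made up) of how the BSTRAP status lines above can be scanned for the final RHEV_INSTALL result, which is what comment 27 uses to decide whether the full upgrade output was received:

# Each BSTRAP line emitted by /usr/share/vdsm-reg/vdsm-upgrade is a small,
# self-contained XML element; the last one carries the overall result.
import xml.etree.ElementTree as ET

def rhev_install_ok(output):
    """Return True if any BSTRAP line reports component RHEV_INSTALL with status OK."""
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("<BSTRAP"):
            continue
        elem = ET.fromstring(line)
        if elem.get("component") == "RHEV_INSTALL" and elem.get("status") == "OK":
            return True
    return False

print(rhev_install_ok("<BSTRAP component='RHEV_INSTALL' status='OK'/>"))  # True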

Comment 27 Alon Bar-Lev 2014-03-22 06:40:08 UTC
(In reply to Douglas Schilling Landgraf from comment #26)
> Alon can be related to BZ#1051035 ?

Yes, but this is already fixed in master, 3.4 and 3.3.z.

But it is easy to verify... if the engine receives the RHEV_INSTALL status OK, then it is unrelated, as the entire data is received.

I am also unsure that discussing two different issues in one bug is wise; this bug was about the inability to upgrade the node, as no ISO was shown on the engine side.

what you need to do is:
1. open a bug per issue.
2. figure out whether the sshd process is terminated when the vdsm-upgrade script ends; just hard-code a sleep of 600 seconds in the node or something (see the sketch after this list).
3. once the sshd process is terminated, see if the tcp session is terminated.
4. if the tcp session is terminated, check the engine behavior at this point.
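
A trivial sketch of the hard-coded delay suggested in step 2 (test-only and purely illustrative): append something like the following at the end of /usr/share/vdsm-reg/vdsm-upgrade so the process, and with it the sshd child and TCP session, stays alive long enough to observe which side tears the connection down first.

# Test-only diagnostic: keep vdsm-upgrade alive after its work is done.
import time

time.sleep(600)  # hold the SSH session open for 10 minutes before exiting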

Comment 29 Douglas Schilling Landgraf 2014-03-23 16:48:51 UTC
(In reply to Alon Bar-Lev from comment #27)
> (In reply to Douglas Schilling Landgraf from comment #26)
> > Alon can be related to BZ#1051035 ?
> 
> yes, but this already fixed in master, 3.4 and 3.3.z.

I saw the patch, but I still see the "Pipe closed" error on the engine's upstream 3.4 branch during the upgrade.

> 
> but it is easy to verify... if engine receive the RHEV_INSTALL status OK,
> then it is unrelated as the entire data is recieved.
> 
Agreed.

> I am also unsure that discussing two different issues at one bug is wise,
> this bug was about inability to upgrade the node, as no iso was shown in
> engine side.

Agreed. Lev, regarding your original report, neither I nor Amador is able to reproduce it.
Are you still seeing it with the latest bits? (Looks like not, based on your comment #19.) If not, let's close this bug and, as Alon suggested, open a bug per issue.

> 
> what you need to do is:
> 1. open a bug per issue.
> 2. figure out if the sshd process is terminated or not when vdsm-upgrade
> script ends, just hard code node sleep to 600 seconds or something.
> 3. once the sshd process is terminated see if tcp session is terminated.
> 4. if tcp session is terminated checkout the engine behavior at this point.

Sure. I will double check.

Comment 30 Lev Veyde 2014-03-24 14:11:38 UTC
(In reply to Douglas Schilling Landgraf from comment #29)
> (In reply to Alon Bar-Lev from comment #27)
> > (In reply to Douglas Schilling Landgraf from comment #26)
> > > Alon can be related to BZ#1051035 ?
> > 
> > yes, but this already fixed in master, 3.4 and 3.3.z.
> 
> I saw the patch but still see "Pipe closed" error on engine's 3.4 branch
> upstream during the upgrade.
> 
> > 
> > but it is easy to verify... if engine receive the RHEV_INSTALL status OK,
> > then it is unrelated as the entire data is recieved.
> > 
> Agreed.
> 
> > I am also unsure that discussing two different issues at one bug is wise,
> > this bug was about inability to upgrade the node, as no iso was shown in
> > engine side.
> 
> Agreed. Lev, about your original report I am not able to reproduce or Amador.
> Are you still seeing it in the last bits? (Looks like not, based on your
> comment#19). If not, let's close this bug and as alon suggested open bug per
> issue.
> 
> > 
> > what you need to do is:
> > 1. open a bug per issue.
> > 2. figure out if the sshd process is terminated or not when vdsm-upgrade
> > script ends, just hard code node sleep to 600 seconds or something.
> > 3. once the sshd process is terminated see if tcp session is terminated.
> > 4. if tcp session is terminated checkout the engine behavior at this point.
> 
> Sure. I will double check.

With the last version of RHEVH that I checked I only got to the closed SSH connection issue, so I am not sure whether the original issue still exists, as this issue may come earlier in the flow than the original one.

Comment 31 Douglas Schilling Landgraf 2014-03-25 06:09:14 UTC
(In reply to Alon Bar-Lev from comment #27)
> 
> what you need to do is:
> 1. open a bug per issue.
> 2. figure out if the sshd process is terminated or not when vdsm-upgrade
> script ends, just hard code node sleep to 600 seconds or something.
> 3. once the sshd process is terminated see if tcp session is terminated.
> 4. if tcp session is terminated checkout the engine behavior at this point.

I have changed vdsm-upgrade for tests only from:

if install.ovirt_boot_setup(reboot="Y") 
to
if install.ovirt_boot_setup(reboot="N") 

and included os.system("reboot") in vdsm-upgrade only when the script finishes, and I don't see the "Pipe closed" error anymore. It seems to be a sync issue. Fabian, any suggestion, besides http://gerrit.ovirt.org/#/c/25967/ ? Is it time to open a bug on the ovirt-node side?

Also, double-check that the ovirt-node ISO is copied correctly to /data/updates, as I already shared previously.
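
A rough sketch of that test-only reordering (assuming the general structure of /usr/share/vdsm-reg/vdsm-upgrade; the wrapper function below is illustrative): the reboot is deferred until after the script has emitted its final status, so the SSH channel back to the engine is not cut mid-stream.

import os

def run_upgrade(install):
    # Previously: install.ovirt_boot_setup(reboot="Y") rebooted as part of the
    # setup call, which could close the SSH pipe before the engine had read the
    # remaining BSTRAP output.
    if install.ovirt_boot_setup(reboot="N"):
        print("<BSTRAP component='doUpgrade' status='OK' "
              "message='Upgrade Succeeded. Rebooting'/>")
    # Test-only: reboot after all output has been flushed back to the engine.
    os.system("reboot")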

Comment 32 Fabian Deutsch 2014-03-25 16:35:21 UTC
Hey Douglas,

thanks for investigating this so much.

I can not reproduce this with plain ssh:

Provision a Node host with Node for 3.4rc2, then from another host:

$ scp ovirt-node-iso-3.0.4-TestDay.vdsm.el6.iso root.122.211:/data/updates/ovirt-node-image.iso

$ ssh root.122.211 "/usr/share/vdsm-reg/vdsm-upgrade"

<BSTRAP component='RHEL_INSTALL' status='WARN' message='vdsm daemon is already down before we stop it for upgrade.'/>
<BSTRAP component='setMountPoint' status='OK' message='Mount succeeded.'/>
<BSTRAP component='doUpgrade' status='OK' message='Upgrade Succeeded. Rebooting'/>
<BSTRAP component='umount' status='OK' message='umount Succeeded'/>
<BSTRAP component='RHEV_INSTALL' status='OK'/>

$

No "Pipe closed" error when triggering vdsm-upgrade manually via ssh. There is also nothing unusual in /var/log/secure.

Can someone confirm this?

Alon, do you maybe know if there is something special about the Engine's ssh client?

Comment 33 Fabian Deutsch 2014-03-25 16:36:52 UTC
Lev,

can you tell if upstream 3.4 is also affected by this?

Comment 34 Fabian Deutsch 2014-03-25 16:37:48 UTC
(In reply to Lev Veyde from comment #30)
...
> With last version of RHEVH that I checked I only got to the closed SSH
> connection issue. Thus not sure if the original issue still exists, as this
> issue may come before the original one in the flow.

Can you also tell here what versions of RHEV-H and RHEV-M you checked?

Comment 35 Alon Bar-Lev 2014-03-25 16:53:11 UTC
(In reply to Fabian Deutsch from comment #32)
> 
> No pipe closed error when triggering the vdsm-upgrade manually via ssh.
> There are also no unusual things in /var/log/secure.
> 
> Can someone confirm this?
> 
> Alon, do you maybe know if there is something special about Engines ssh
> client?

There is nothing special: if the process ends, the session terminates and the client is happy.

It has worked so far; there is no reason it won't keep working.

Please open a separate bug, as this bug is resolved and is being abused.

Comment 36 Fabian Deutsch 2014-03-25 16:54:11 UTC
Alon,

where do you see that this bug is solved?

Comment 37 Alon Bar-Lev 2014-03-25 17:04:10 UTC
(In reply to Fabian Deutsch from comment #36)
> Alon,
> 
> where do you see that this bug is solved?

This bug is all about:
"""
a59f1b46-5da0-4c27-94c1-4a41f898e923,cinteg26.ci.lab.tlv.redhat.com failed due to: Cannot upgrade Host. Host version is not compatible with selected ISO version. Please select an ISO with major version 6.x.
"""

This was resolved.

Comment 38 Fabian Deutsch 2014-03-25 17:11:59 UTC
Based on a dialog on IRC:

The original bug is solved, because the upgrade could be triggered in later trials.
The fact that the upgrade ran indicates that the error mentioned in the description is gone.

The remaining comments were abusing this bug, because they are about a different issue.

Lev, could you please verify that the original bug as described in the description is really gone?

Comment 39 Lev Veyde 2014-03-26 08:44:50 UTC
(In reply to Fabian Deutsch from comment #33)
> Lev,
> 
> can you tell if upstream 3.4 is also affected by this?

I don't know - only tested downstream RHEVH.

(In reply to Fabian Deutsch from comment #38)
> Based on a dialog on IRC:
> 
> The original bug is solved, because the upgrade could be triggered in latter
> trials.
> And that the upgrade was run indiciates that the error which is mentioned in
> the description is gone.
> 
> The remaining comments were abusing this bug, because they are about a
> different issue.
> 
> Lev, could you please verify that the original bug as described in the
> description is really gone?

I no longer see it in manual tests, but it still appears in the automatic ones (I am checking that).

Comment 40 Eyal Edri 2014-03-30 09:45:16 UTC
Tareq, can you verify whether this fails in rhev-h testing in QE or not?

Comment 41 Douglas Schilling Landgraf 2014-03-30 13:53:25 UTC
(In reply to Lev Veyde from comment #39)
> (In reply to Fabian Deutsch from comment #33)
> > Lev,
> > 
> > can you tell if upstream 3.4 is also affected by this?
> 
> I don't know - only tested downstream RHEVH.
> 
> (In reply to Fabian Deutsch from comment #38)
> > Based on a dialog on IRC:
> > 
> > The original bug is solved, because the upgrade could be triggered in latter
> > trials.
> > And that the upgrade was run indiciates that the error which is mentioned in
> > the description is gone.
> > 
> > The remaining comments were abusing this bug, because they are about a
> > different issue.
> > 
> > Lev, could you please verify that the original bug as described in the
> > description is really gone?
> 
> I no longer see it in manual tests, but it still appear in automatic one
> (checking that).

OK, I am moving this to QA for now for a double check. Let us know in case you find something in your automatic tests.

(In reply to Eyal Edri from comment #40)
> tareq - can you veify this fails on rhev-h testing in qe or not?

Hi Tareq,

In case you see the below message during your tests:

2014-03-12 10:37:41,079 ERROR [org.ovirt.engine.core.bll.OVirtNodeUpgrade] (OVirtNodeUpgrade) Error during upgrade: java.io.IOException: Pipe closed

Please go to https://bugzilla.redhat.com/show_bug.cgi?id=1080594 (it is in POST and needs to be backported to 3.4 downstream).

Thanks

Comment 42 Tareq Alayan 2014-04-01 12:27:57 UTC
I have a running engine, rhevm-3.4.0-0.12.beta2.el6ev.noarch, with an up-and-running rhevh (Red Hat Enterprise Virtualization Hypervisor release 6.5 (20140313.1.el6ev)).

I tried to upgrade to Red Hat Enterprise Virtualization Hypervisor release 6.5 (20140326.0.el6ev).

Result: installation failed.

However, the rhevh version now is Red Hat Enterprise Virtualization Hypervisor release 6.5 (20140326.0.el6ev), and when I tried to activate the host it stayed in an unresponsive state.

Logs attached.

Comment 43 Tareq Alayan 2014-04-01 12:29:44 UTC
Created attachment 881325 [details]
deploymentlog

Comment 44 Tareq Alayan 2014-04-01 12:31:14 UTC
Created attachment 881326 [details]
enginelog

Comment 45 Alon Bar-Lev 2014-04-01 12:35:27 UTC
If you were able to see the ISO image in the upgrade dialog, this bug is resolved.

You may be experiencing bug #1080594 or another issue.

Comment 46 Tareq Alayan 2014-04-01 13:08:10 UTC
Moving to VERIFIED since the new ISO image is installed.

Comment 48 Lev Veyde 2014-04-08 13:54:28 UTC
Closing the bug, as what we see is potentially due to another bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1082612

Comment 49 Zac Dover 2014-05-06 05:50:44 UTC
Chris,

Does this one need a release note?

Thanks in advance.

Zac

Comment 50 Itamar Heim 2014-06-12 14:10:32 UTC
Closing as part of 3.4.0

