Bug 1074257
| Summary: | RHEVH can no longer be upgraded or re-installed through RHEVM | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Lev Veyde <lveyde> |
| Component: | ovirt-engine | Assignee: | Douglas Schilling Landgraf <dougsland> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Pavel Stehlik <pstehlik> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.3.0 | CC: | acathrow, alonbl, asegundo, bazulay, cpelland, danken, dfediuck, dougsland, eedri, fdeutsch, gklein, iheim, knesenko, lpeer, lveyde, obasan, Rhev-m-bugs, talayan, ybronhei, yeylon, zdover |
| Target Milestone: | --- | Keywords: | AutomationBlocker |
| Target Release: | 3.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | infra | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: RHEV-H hosts cannot be upgraded using the same ISO that is already installed on the host. Consequence: Users cannot upgrade RHEV-H hosts via the RHEV-M admin page. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
I don't see any obvious version mismatch between the rpm and the contents of the ISO. Could you share the output of the command below?

# rpm -ql rhev-hypervisor6

(In reply to Amador Pahim from comment #2)
> Could you share the output of the command below?
>
> # rpm -ql rhev-hypervisor6

/usr/share/rhev-hypervisor
/usr/share/rhev-hypervisor/rhevh-6.5-20140305.0.el6ev.iso
/usr/share/rhev-hypervisor/vdsm-compatibility-6.5-20140305.0.el6ev.txt
/usr/share/rhev-hypervisor/version-6.5-20140305.0.el6ev.txt

Just in case...

# cat /usr/share/rhev-hypervisor/vdsm-compatibility-6.5-20140305.0.el6ev.txt
3.4,3.3,3.2,3.1,3.0
# cat /usr/share/rhev-hypervisor/version-6.5-20140305.0.el6ev.txt
6.5,20140305.0.el6ev

Lev,

could you please check whether, when the previous RHEV-H release is installed on a machine,
a) RHEVM suggests the upgrade
b) the upgrade succeeds?

(In reply to Fabian Deutsch from comment #5)
> Lev,
>
> could you please check whether, when the previous RHEV-H release is installed
> on a machine,
> a) RHEVM suggests the upgrade
> b) the upgrade succeeds?

The upgrade from the previous version (Red Hat Enterprise Virtualization Hypervisor release 6.5 (20140217.0.el6ev)) fails as well.

I am not able to reproduce this issue with rhevm-3.4.0-0.3.master.el6ev.noarch. I tried to upgrade from rhev-hypervisor6-6.5-20140217.0.el6ev.noarch and then to reinstall rhev-hypervisor6-6.5-20140305.0.el6ev.noarch. No issues like this.

Tested with rhevm-3.4.0-0.4.master.el6ev.noarch and rhev-hypervisor6-6.5-20140311.0.el6ev.noarch. Still not reproduced. Instead, another error comes into play: during the reinstall, the engine throws an exception:
2014-03-12 10:37:41,079 ERROR [org.ovirt.engine.core.bll.OVirtNodeUpgrade] (OVirtNodeUpgrade) Error during upgrade: java.io.IOException: Pipe closed
After the error, the hypervisor is restarted. When it came back, I noticed that the system had been reinstalled but vdsm was down:
[root@rhevh /]# service vdsmd status
VDS daemon is not running
Restarting it fails:
[root@rhevh /]# service vdsmd restart
Shutting down vdsm daemon:
vdsm watchdog stop [ OK ]
vdsm: not running [FAILED]
vdsm: Running run_final_hooks
vdsm stop [ OK ]
initctl: Job is already running: libvirtd
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running run_init_hooks
vdsm: Running gencerts
vdsm: Running check_is_configured
libvirt is not configured for vdsm yet
sanlock service requires restart
Modules libvirt,sanlock are not configured
Traceback (most recent call last):
File "/usr/bin/vdsm-tool", line 145, in <module>
sys.exit(main())
File "/usr/bin/vdsm-tool", line 142, in main
return tool_command[cmd]["command"](*args[1:])
File "/usr/lib64/python2.6/site-packages/vdsm/tool/configurator.py", line 265, in isconfigured
RuntimeError:
One of the modules is not configured to work with VDSM.
To configure the module use the following:
'vdsm-tool configure [module_name]'.
If all modules are not configured try to use:
'vdsm-tool configure --force'
(The force flag will stop the module's service and start it
afterwards automatically to load the new configuration.)
vdsm: stopped during execute check_is_configured task (task returned with error code 1).
vdsm start [FAILED]
VDSM works again only after running "vdsm-tool configure --force":
[root@rhevh /]# vdsm-tool configure --force
Checking configuration status...
Running configure...
checking certs..
File already persisted: /etc/libvirt/libvirtd.conf
File already persisted: /etc/libvirt/qemu.conf
File already persisted: /etc/sysconfig/libvirtd
File already persisted: /etc/logrotate.d/libvirtd
Reconfiguration of libvirt is done.
/bin/sed: cannot rename /etc/logrotate.d/sedOFU2sH: Device or resource busy
/bin/mv: inter-device move failed: `/tmp/tmp.tVIZ4UY9xN' to `/etc/logrotate.d/libvirtd'; unable to remove target: Device or resource busy
Done configuring modules to VDSM.
[root@rhevh /]# service vdsmd restart
Shutting down vdsm daemon:
vdsm watchdog stop [ OK ]
vdsm: not running [FAILED]
vdsm: Running run_final_hooks
vdsm stop [ OK ]
initctl: Job is already running: libvirtd
vdsm: Running mkdirs
vdsm: Running configure_coredump
vdsm: Running run_init_hooks
vdsm: Running gencerts
vdsm: Running check_is_configured
libvirt is already configured for vdsm
sanlock service is already configured
vdsm: Running validate_configuration
SUCCESS: ssl configured to true. No conflicts
vdsm: Running prepare_transient_repository
vdsm: Running syslog_available
vdsm: Running nwfilter
vdsm: Running dummybr
vdsm: Running load_needed_modules
vdsm: Running tune_system
vdsm: Running test_space
vdsm: Running test_lo
vdsm: Running restore_nets
vdsm: Running unified_network_persistence_upgrade
vdsm: Running upgrade_300_nets
Starting up vdsm daemon:
vdsm start [ OK ]
[root@rhevh /]#
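The first restart fails in check_is_configured, and the init script names the unconfigured modules on a single line ("Modules libvirt,sanlock are not configured"). As a minimal sketch, assuming only the plural line format shown above (the helper names are hypothetical and not part of vdsm), an automated test could extract those module names from captured `service vdsmd restart` output and decide whether the `vdsm-tool configure --force` workaround used here applies. The sketch only builds the command; it does not run it.

```python
import re
from typing import List

# Hypothetical parser for the init-script line seen above, e.g.
# "Modules libvirt,sanlock are not configured" (only this plural form is handled).
_UNCONFIGURED_RE = re.compile(r"^Modules (\S+) are not configured$")


def unconfigured_modules(vdsmd_output: str) -> List[str]:
    """Return the module names reported as not configured, or []."""
    for line in vdsmd_output.splitlines():
        match = _UNCONFIGURED_RE.match(line.strip())
        if match:
            return match.group(1).split(",")
    return []


def workaround_command(vdsmd_output: str) -> List[str]:
    """Return the workaround command from this bug, or [] if not needed."""
    if unconfigured_modules(vdsmd_output):
        return ["vdsm-tool", "configure", "--force"]
    return []


if __name__ == "__main__":
    sample = (
        "vdsm: Running check_is_configured\n"
        "libvirt is not configured for vdsm yet\n"
        "sanlock service requires restart\n"
        "Modules libvirt,sanlock are not configured\n"
    )
    print(unconfigured_modules(sample))  # ['libvirt', 'sanlock']
```

Returning the command as an argument list, rather than executing it, keeps the decision testable and leaves it to the harness whether a forced reconfiguration (which restarts the affected services) is acceptable.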
From a RHEV-H perspective, this could be considered a blocker, because upgrades are, according to comment 6, also affected.

(In reply to Fabian Deutsch from comment #22)
> From a RHEV-H perspective, this could be considered a blocker, because
> upgrades are, according to comment 6, also affected.

Douglas found a problem with the vdsm build. I am going to rebuild vdsm tomorrow, then rebuild rhevh and see.
- Kiril

Update about "Pipe closed".
========================
The upgrade runs on the node via the vdsm-upgrade command, but after the upgrade and before the reboot, the connection between ovirt-node and ovirt-engine closes unexpectedly, raising an IOException ("Pipe closed").
Logs from ovirt-engine UI events (newest first):
=========================================
Host localhost.localdomain installation failed. Pipe closed.
Installing Host localhost.localdomain. Step: RHEV_INSTALL.
Installing Host localhost.localdomain. Step: umount; Details: umount Succeeded .
Installing Host localhost.localdomain. Step: doUpgrade; Details: Upgrade Succeeded. Rebooting .
Installing Host localhost.localdomain. Step: setMountPoint; Details: Mount succeeded. .
Installing Host localhost.localdomain. Step: RHEL_INSTALL; Details: vdsm daemon stopped for upgrade process! .
Installing Host localhost.localdomain. Executing /usr/share/vdsm-reg/vdsm-upgrade.
Installing Host localhost.localdomain. Sending file /usr/share/ovirt-node-iso/ovirt-node-iso-3.0.4-1.999.201403191804.el6.iso to /data/updates/ovirt-node-image.iso.
Installing Host localhost.localdomain. Connected to host 192.168.100.133 with SSH key fingerprint: a9:8b:21:a8:0d:e4:16:7d:4b:79:38:f3:e0:f6:92:e0.
2014-Mar-20, 22:43
/var/log/secure during the "Pipe closed" event
=========================================
Mar 22 04:40:56 localhost sshd[23184]: Accepted publickey for root from 192.168.100.228 port 33905 ssh2
Mar 22 04:40:56 localhost sshd[23184]: pam_unix(sshd:session): session opened for user root by (uid=0)
Mar 22 04:41:38 localhost sshd[23184]: channel_by_id: 0: bad id: channel free
Mar 22 04:41:38 localhost sshd[23184]: Disconnecting: Received ieof for nonexistent channel 0.
Mar 22 04:41:38 localhost sshd[23184]: pam_unix(sshd:session): session closed for user root
Mar 22 04:42:14 localhost sshd[25420]: Connection closed by 192.168.100.1
Mar 22 04:42:20 localhost sshd[12927]: Received signal 15; terminating.
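Read in order, this excerpt shows the engine's SSH session (sshd[23184]) disconnecting with "Received ieof for nonexistent channel 0" and closing at 04:41:38, about 40 seconds before the main sshd receives signal 15 at 04:42:20. That suggests the session was torn down before the reboot rather than by it. A minimal Python sketch that makes this ordering explicit, assuming the standard syslog prefix shown above (the helper names are hypothetical):

```python
import re
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple

# Matches lines like:
# "Mar 22 04:41:38 localhost sshd[23184]: pam_unix(sshd:session): session closed ..."
_SSHD_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ sshd\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)

Event = Tuple[datetime, str]


def events_by_pid(log_text: str) -> Dict[int, List[Event]]:
    """Group sshd messages by process ID, keeping their timestamps."""
    grouped: Dict[int, List[Event]] = defaultdict(list)
    for line in log_text.splitlines():
        m = _SSHD_RE.match(line.strip())
        if m is None:
            continue
        # Syslog omits the year; pin a dummy leap year so only ordering matters.
        ts = datetime.strptime("2000 " + m.group("ts"), "%Y %b %d %H:%M:%S")
        grouped[int(m.group("pid"))].append((ts, m.group("msg")))
    return dict(grouped)


def session_closed_before_sigterm(log_text: str, session_pid: int) -> bool:
    """True if session_pid logged 'session closed' before any sshd got signal 15."""
    events = events_by_pid(log_text)
    closed = [ts for ts, msg in events.get(session_pid, [])
              if "session closed" in msg]
    sigterm = [ts for evs in events.values() for ts, msg in evs
               if msg.startswith("Received signal 15")]
    return bool(closed) and bool(sigterm) and min(closed) < min(sigterm)
```

Applied to the excerpt above with session_pid=23184, the check returns True. This is consistent with the "sync issue" hypothesis, but does not by itself show which side closed the channel first.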
Alon, could this be related to BZ#1051035?
In addition to this "Pipe closed" event, I have seen some connection timeouts from ovirt-engine to ovirt-node while the ISO was being sent, as reported in comment #20; it works again after a retry.
On the oVirt Node side, running /usr/share/vdsm-reg/vdsm-upgrade manually works without error:
# /usr/share/vdsm-reg/vdsm-upgrade
<BSTRAP component='RHEL_INSTALL' status='WARN' message='vdsm daemon is already down before we stop it for upgrade.'/>
<BSTRAP component='setMountPoint' status='OK' message='Mount succeeded.'/>
<BSTRAP component='doUpgrade' status='OK' message='Upgrade Succeeded. Rebooting'/>
<BSTRAP component='umount' status='OK' message='umount Succeeded'/>
<BSTRAP component='RHEV_INSTALL' status='OK'/>
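Each line of this output is a self-contained `<BSTRAP .../>` XML element. Alon's test in the next comment is whether the engine receives the RHEV_INSTALL status OK: if it does, the whole output arrived. A minimal Python sketch of that check, assuming one element per line as shown above. This is an illustration only, not ovirt-engine's actual (Java) parser, and the helper names are hypothetical:

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_bstrap(output: str) -> List[Dict[str, str]]:
    """Parse each '<BSTRAP .../>' line into a dict of its XML attributes."""
    records = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("<BSTRAP"):
            records.append(dict(ET.fromstring(line).attrib))
    return records


def non_ok_steps(output: str) -> List[str]:
    """Components whose status is anything other than OK (e.g. WARN)."""
    return [r.get("component", "?") for r in parse_bstrap(output)
            if r.get("status") != "OK"]


def received_rhev_install_ok(output: str) -> bool:
    """True if a RHEV_INSTALL record with status OK was received."""
    return any(r.get("component") == "RHEV_INSTALL" and r.get("status") == "OK"
               for r in parse_bstrap(output))
```

Run against the manual output above, `received_rhev_install_ok` would be True and `non_ok_steps` would list only RHEL_INSTALL (the WARN about vdsm already being down). A truncated stream that stops before the last line would fail the check, which is the distinction Alon draws between this issue and BZ#1051035.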
(In reply to Douglas Schilling Landgraf from comment #26)
> Alon, could this be related to BZ#1051035?

Yes, but this is already fixed in master, 3.4 and 3.3.z.

But it is easy to verify: if the engine receives the RHEV_INSTALL status OK, then it is unrelated, as the entire data was received.

I am also unsure that discussing two different issues in one bug is wise; this bug was about the inability to upgrade the node, as no ISO was shown on the engine side.

What you need to do is:
1. Open a bug per issue.
2. Figure out whether the sshd process is terminated when the vdsm-upgrade script ends; just hard-code a sleep of 600 seconds or so on the node.
3. Once the sshd process is terminated, see whether the TCP session is terminated.
4. If the TCP session is terminated, check the engine behavior at that point.

(In reply to Alon Bar-Lev from comment #27)
> (In reply to Douglas Schilling Landgraf from comment #26)
> > Alon, could this be related to BZ#1051035?
>
> Yes, but this is already fixed in master, 3.4 and 3.3.z.

I saw the patch, but I still see the "Pipe closed" error on the engine's upstream 3.4 branch during the upgrade.

> But it is easy to verify: if the engine receives the RHEV_INSTALL status OK,
> then it is unrelated, as the entire data was received.

Agreed.

> I am also unsure that discussing two different issues in one bug is wise;
> this bug was about the inability to upgrade the node, as no ISO was shown on
> the engine side.

Agreed. Lev, neither Amador nor I can reproduce your original report. Are you still seeing it with the latest bits? (Looks like not, based on your comment #19.) If not, let's close this bug and, as Alon suggested, open a bug per issue.

> What you need to do is:
> 1. Open a bug per issue.
> 2. Figure out whether the sshd process is terminated when the vdsm-upgrade
> script ends; just hard-code a sleep of 600 seconds or so on the node.
> 3. Once the sshd process is terminated, see whether the TCP session is terminated.
> 4. If the TCP session is terminated, check the engine behavior at that point.

Sure.
I will double check.

(In reply to Douglas Schilling Landgraf from comment #29)
...
> Agreed. Lev, neither Amador nor I can reproduce your original report.
> Are you still seeing it with the latest bits? (Looks like not, based on your
> comment #19.) If not, let's close this bug and, as Alon suggested, open a bug
> per issue.
...

With the last version of RHEVH that I checked, I only got as far as the closed SSH connection issue. Thus I am not sure whether the original issue still exists, as this issue may occur before the original one in the flow.

(In reply to Alon Bar-Lev from comment #27)
>
> What you need to do is:
> 1. Open a bug per issue.
> 2. Figure out whether the sshd process is terminated when the vdsm-upgrade
> script ends; just hard-code a sleep of 600 seconds or so on the node.
> 3. Once the sshd process is terminated, see whether the TCP session is terminated.
> 4. If the TCP session is terminated, check the engine behavior at that point.
I have changed vdsm-upgrade, for testing only, from:

if install.ovirt_boot_setup(reboot="Y")

to:

if install.ovirt_boot_setup(reboot="N")

and added os.system("reboot") to vdsm-upgrade so that the reboot happens only when the script finishes, and I don't see any "Pipe closed" error anymore. It seems to be a sync issue. Fabian, any suggestions besides http://gerrit.ovirt.org/#/c/25967/ ? Is it time to open a bug on the ovirt-node side? Also, double check that the ovirt-node.iso is copied correctly to /data/updates, as I already shared previously.

Hey Douglas,

thanks for investigating this so much.

I cannot reproduce this with plain ssh. Provision a Node with Node for 3.4rc2, then, from another host:

$ scp ovirt-node-iso-3.0.4-TestDay.vdsm.el6.iso root.122.211:/data/updates/ovirt-node-image.iso
$ ssh root.122.211 "/usr/share/vdsm-reg/vdsm-upgrade"
<BSTRAP component='RHEL_INSTALL' status='WARN' message='vdsm daemon is already down before we stop it for upgrade.'/>
<BSTRAP component='setMountPoint' status='OK' message='Mount succeeded.'/>
<BSTRAP component='doUpgrade' status='OK' message='Upgrade Succeeded. Rebooting'/>
<BSTRAP component='umount' status='OK' message='umount Succeeded'/>
<BSTRAP component='RHEV_INSTALL' status='OK'/>
$

No "Pipe closed" error when triggering vdsm-upgrade manually via ssh. There is also nothing unusual in /var/log/secure.

Can someone confirm this?

Alon, do you maybe know if there is something special about the Engine's ssh client?

Lev,

can you tell if upstream 3.4 is also affected by this?

(In reply to Lev Veyde from comment #30)
...
> With the last version of RHEVH that I checked, I only got as far as the closed
> SSH connection issue. Thus I am not sure whether the original issue still
> exists, as this issue may occur before the original one in the flow.

Can you also tell here which versions of RHEV-H and RHEV-M you checked?

(In reply to Fabian Deutsch from comment #32)
>
> No "Pipe closed" error when triggering vdsm-upgrade manually via ssh.
> There is also nothing unusual in /var/log/secure.
>
> Can someone confirm this?
>
> Alon, do you maybe know if there is something special about the Engine's ssh
> client?

There is nothing special: if the process ends, the session terminates and the client is happy. It has worked so far; there is no reason it won't keep working. Please open a separate bug for this, as this bug is resolved and is being abused.

Alon,

where do you see that this bug is solved?

(In reply to Fabian Deutsch from comment #36)
> Alon,
>
> where do you see that this bug is solved?

This bug is all about:
"""
a59f1b46-5da0-4c27-94c1-4a41f898e923,cinteg26.ci.lab.tlv.redhat.com failed due to: Cannot upgrade Host. Host version is not compatible with selected ISO version. Please select an ISO with major version 6.x.
"""
This was resolved.

Based on a dialog on IRC:

The original bug is solved, because the upgrade could be triggered in later trials.
The fact that the upgrade was run indicates that the error mentioned in the description is gone.

The remaining comments were abusing this bug, because they are about a different issue.

Lev, could you please verify that the original bug as described in the description is really gone?

(In reply to Fabian Deutsch from comment #33)
> Lev,
>
> can you tell if upstream 3.4 is also affected by this?

I don't know; I only tested downstream RHEVH.

(In reply to Fabian Deutsch from comment #38)
> Based on a dialog on IRC:
>
> The original bug is solved, because the upgrade could be triggered in later
> trials.
> The fact that the upgrade was run indicates that the error mentioned in
> the description is gone.
>
> The remaining comments were abusing this bug, because they are about a
> different issue.
>
> Lev, could you please verify that the original bug as described in the
> description is really gone?

I no longer see it in manual tests, but it still appears in the automated ones (checking that).

Tareq, can you verify whether this fails in RHEV-H testing in QE?
(In reply to Lev Veyde from comment #39)
...
> I no longer see it in manual tests, but it still appears in the automated ones
> (checking that).

Ok, I am moving this to QA for now for a double check. Let us know in case you find something in your automated tests.

(In reply to Eyal Edri from comment #40)
> Tareq, can you verify whether this fails in RHEV-H testing in QE?

Hi Tareq,

In case you see the message below during your tests:

2014-03-12 10:37:41,079 ERROR [org.ovirt.engine.core.bll.OVirtNodeUpgrade] (OVirtNodeUpgrade) Error during upgrade: java.io.IOException: Pipe closed

please go to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1080594
It's in POST and needs to be backported to 3.4 downstream.

Thanks

I have a running engine (rhevm-3.4.0-0.12.beta2.el6ev.noarch) with an up-and-running RHEV-H (Red Hat Enterprise Virtualization Hypervisor release 6.5 (20140313.1.el6ev)). I tried to upgrade to Red Hat Enterprise Virtualization Hypervisor release 6.5 (20140326.0.el6ev).

Result: installation failed. However, the RHEV-H version is now Red Hat Enterprise Virtualization Hypervisor release 6.5 (20140326.0.el6ev), and when I tried to activate the host, it stayed in the unresponsive state. Logs attached.

Created attachment 881325 [details]
deploymentlog
Created attachment 881326 [details]
enginelog
If you were able to see the ISO image in the upgrade dialog, this bug is resolved. You may be experiencing bug#1080594.

Moving to VERIFIED since the new ISO image is installed.

I still see the issue: http://jenkins-ci.eng.lab.tlv.redhat.com/job/rhevm_3.4_automation_coretools_rhevh_restapi_hosts_nfs_rest_factory/126/testReport/Hosts/019-Reinstall%20host/Reinstall_host/

Closing the bug, as what we see is potentially due to another bug: https://bugzilla.redhat.com/show_bug.cgi?id=1082612

Chris,

Does this one need a release note?

Thanks in advance.
Zac

Closing as part of 3.4.0
Created attachment 872383 [details]
RHEVM Engine log (gzipped)

Description of problem:
It seems that we're receiving the following error during the "host-reinstall" tests (which are supposed to reinstall the already-installed RHEVH host with the ISO image from RHEVM):

20:01:42 2014-03-06 20:01:42,456 - MainThread - hosts - ERROR - Response code is not valid, expected is: [200, 201], actual is: 400
20:01:42 2014-03-06 20:01:42,630 - MainThread - plmanagement.error_fetcher - ERROR - Errors fetched from VDC(jenkins-automation-rpm-vm17.eng.lab.tlv.redhat.com): 2014-03-06 20:01:42,356 ERROR [org.ovirt.engine.core.bll.UpdateVdsCommand] (ajp-/127.0.0.1:8702-5) [107] Installation/upgrade of Host a59f1b46-5da0-4c27-94c1-4a41f898e923,cinteg26.ci.lab.tlv.redhat.com failed due to: Cannot upgrade Host. Host version is not compatible with selected ISO version. Please select an ISO with major version 6.x.

I verified that RHEVM has the correct RHEVH RPM installed:

rpm -q rhev-hypervisor6
rhev-hypervisor6-6.5-20140305.0.el6ev.noarch

And the RHEVH host also has this latest version installed:

cat /etc/redhat-release
Red Hat Enterprise Virtualization Hypervisor release 6.5 (20140305.0.el6ev)
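The rejection says the ISO must have the same major version (6.x) as the host. The data collected in this bug suggests the check should pass: the version file shipped in the rhev-hypervisor6 RPM reads `6.5,20140305.0.el6ev`, and the host's /etc/redhat-release reports release 6.5 (20140305.0.el6ev). That matches comment #2's observation that there is no obvious version mismatch. A minimal Python sketch of such a major-version comparison, assuming the file formats shown in this bug. The helper names are hypothetical, and this is not the engine's implementation; the error above is logged by the Java class org.ovirt.engine.core.bll.UpdateVdsCommand.

```python
import re
from typing import List, Tuple


def parse_iso_version_file(text: str) -> Tuple[str, str]:
    """Split e.g. '6.5,20140305.0.el6ev' into ('6.5', '20140305.0.el6ev')."""
    version, release = text.strip().split(",", 1)
    return version, release


def parse_redhat_release(text: str) -> Tuple[str, str]:
    """Extract ('6.5', '20140305.0.el6ev') from a RHEV-H /etc/redhat-release line."""
    m = re.search(r"release (\d+(?:\.\d+)*) \(([^)]+)\)", text)
    if m is None:
        raise ValueError("unrecognized release string: %r" % text)
    return m.group(1), m.group(2)


def major(version: str) -> int:
    """Major component of a dotted version string, e.g. '6.5' -> 6."""
    return int(version.split(".", 1)[0])


def iso_is_compatible(host_release: str, iso_version_file: str) -> bool:
    """Hypothetical rule: host and ISO must share the same major version."""
    host_version, _ = parse_redhat_release(host_release)
    iso_version, _ = parse_iso_version_file(iso_version_file)
    return major(host_version) == major(iso_version)


def supported_cluster_levels(compat_file: str) -> List[str]:
    """Parse a vdsm-compatibility file such as '3.4,3.3,3.2,3.1,3.0'."""
    return [lvl.strip() for lvl in compat_file.strip().split(",") if lvl.strip()]
```

With the values from this bug, `iso_is_compatible` returns True, so a plain major-version rule would not explain the rejection. That is consistent with the later conclusion that the original error was gone once the upgrade could actually be triggered.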