Description of problem:
Set up two RHEV-H hosts with Hosted Engine successfully. Tried to upgrade one of the two RHEV-H hosts via Hosted Engine, but it failed on step: RHEV_INSTALL.

Version-Release number of selected component (if applicable):
# rpm -qa ovirt-node ovirt-hosted-engine-setup ovirt-hosted-engine-ha vdsm kernel ovirt-node-plugin-hosted-engine
ovirt-hosted-engine-setup-1.2.5.3-1.el7ev.noarch
vdsm-4.16.24-2.el7ev.x86_64
ovirt-hosted-engine-ha-1.2.6-2.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.2.0-18.0.el7ev.noarch
kernel-3.10.0-229.11.1.el7.x86_64
ovirt-node-3.2.3-18.el7.noarch

# cat /etc/rhev-hypervisor-release
Red Hat Enterprise Virtualization Hypervisor release 7.1 (20150813.0.el7ev)

How reproducible:
100%

Steps to Reproduce:
1. Set up HE on the first RHEV-H successfully.
   - nfs storage
   - em1
2. Set up an additional HE host on the second RHEV-H successfully.
3. Both RHEV-H hosts are UP in Hosted Engine.
4. Download RHEV-H 7.1 20150813.0.el7ev into RHEV-M so that the RHEV-H 7.1 ISO is listed on the Install page of the upgrade dialog.
5. Put the second RHEV-H into maintenance.
6. Click 'Upgrade' in the RHEV-M portal - Hosts tab.
7. Select the RHEV-H ISO you want to upgrade to.

Actual results:
Upgrade failed.

Expected results:
Upgrade of RHEV-H via Hosted Engine succeeds.

Additional info:
1.
[root@dhcp-11-107 updates]# ll
total 244736
-rw-r--r--. 1 root root 250609664 Aug 17 11:51 ovirt-node-image.iso

2.
<snip>
2015-08-17 07:51:09,812 INFO [org.ovirt.engine.core.bll.InstallerMessages] (org.ovirt.thread.pool-7-thread-15) [55a55548] Installation dhcp-11-107.nay.redhat.com: Sending file /usr/share/rhev-hypervisor/rhevh-7.1-20150813.0.el7ev.iso to /data/updates/ovirt-node-image.iso
2015-08-17 07:51:10,210 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-15) [55a55548] Correlation ID: 55a55548, Call Stack: null, Custom Event ID: -1, Message: Installing Host hosted_engine_2. Sending file /usr/share/rhev-hypervisor/rhevh-7.1-20150813.0.el7ev.iso to /data/updates/ovirt-node-image.iso.
2015-08-17 07:51:10,211 INFO [org.ovirt.engine.core.uutils.ssh.SSHDialog] (org.ovirt.thread.pool-7-thread-15) SSH execute root.redhat.com 'mkdir -p '/data/updates''
2015-08-17 07:51:23,340 INFO [org.ovirt.engine.core.bll.InstallerMessages] (org.ovirt.thread.pool-7-thread-15) [55a55548] Installation dhcp-11-107.nay.redhat.com: Executing /usr/share/vdsm-reg/vdsm-upgrade
2015-08-17 07:51:23,444 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-15) [55a55548] Correlation ID: 55a55548, Call Stack: null, Custom Event ID: -1, Message: Installing Host hosted_engine_2. Executing /usr/share/vdsm-reg/vdsm-upgrade.
2015-08-17 07:51:23,444 INFO [org.ovirt.engine.core.uutils.ssh.SSHDialog] (org.ovirt.thread.pool-7-thread-15) SSH execute root.redhat.com '/usr/share/vdsm-reg/vdsm-upgrade'
2015-08-17 07:51:23,574 INFO [org.ovirt.engine.core.bll.OVirtNodeUpgrade] (OVirtNodeUpgrade) update from host dhcp-11-107.nay.redhat.com: <BSTRAP component="ovirt-node-upgrade" status="OK" message="ovirt-node-upgrade.UpgradeTool: INFO Temporary Directory is: /data/tmpHVhJJ6 "/>
2015-08-17 07:51:23,575 INFO [org.ovirt.engine.core.bll.InstallerMessages] (OVirtNodeUpgrade) Installation dhcp-11-107.nay.redhat.com: Step: ovirt-node-upgrade; Details: ovirt-node-upgrade.UpgradeTool: INFO Temporary Directory is: /data/tmpHVhJJ6
2015-08-17 07:51:23,581 ERROR [org.ovirt.engine.core.uutils.ssh.SSHDialog] (org.ovirt.thread.pool-7-thread-15) SSH error running command root.redhat.com:'/usr/share/vdsm-reg/vdsm-upgrade': java.io.IOException: Command returned failure code 1 during SSH session 'root.redhat.com'
        at org.ovirt.engine.core.uutils.ssh.SSHClient.executeCommand(SSHClient.java:527) [uutils.jar:]
        at org.ovirt.engine.core.uutils.ssh.SSHDialog.executeCommand(SSHDialog.java:318) [uutils.jar:]
        at org.ovirt.engine.core.bll.OVirtNodeUpgrade.execute(OVirtNodeUpgrade.java:215) [bll.jar:]
...
2015-08-17 07:51:26,049 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (OVirtNodeUpgrade) Correlation ID: 55a55548, Call Stack: null, Custom Event ID: -1, Message: Installing Host hosted_engine_2. Step: ovirt-node-upgrade; Details: RuntimeError: Previous upgrade completed, you must reboot .
2015-08-17 07:51:26,049 INFO [org.ovirt.engine.core.bll.OVirtNodeUpgrade] (OVirtNodeUpgrade) update from host dhcp-11-107.nay.redhat.com: <BSTRAP component="ovirt-node-upgrade" status="FAIL" message="Upgraded Failed"/>
</snip>
Created attachment 1063789 [details] rhevh_var_log
Created attachment 1063791 [details] sosreport_rhevh
Created attachment 1063804 [details] engine.log
Created attachment 1063806 [details] hosted-deploy log
Created attachment 1063807 [details] screenshot.png
From the vdsm-upgrade log I can see that vdsm could not be stopped, which caused the upgrade failure:

2015-08-17 10:27:10,658 - INFO - ovirt-node-upgrade - Running pre-upgrade hooks
2015-08-17 10:27:10,658 - INFO - ovirt-node-upgrade - Running: 01-vdsm
2015-08-17 10:27:10,658 - DEBUG - ovirt-node-upgrade - ('/usr/libexec/ovirt-node/hooks/pre-upgrade/01-vdsm',)
2015-08-17 10:27:16,551 - DEBUG - ovirt-node-upgrade - [u'/usr/libexec/ovirt-node/hooks/pre-upgrade/01-vdsm: Stopping vdsmd to upgrade']
2015-08-17 10:27:16,551 - DEBUG - ovirt-node-upgrade - Failed to stop vdsdm: Error: ServiceOperationError: _systemctlStop failed
Job for vdsmd.service canceled.
2015-08-17 10:27:16,551 - ERROR - ovirt-node-upgrade - Error: Upgrade Failed: Command Failed: '('/usr/libexec/ovirt-node/hooks/pre-upgrade/01-vdsm',)' [u'/usr/libexec/ovirt-node/hooks/pre-upgrade/01-vdsm: Stopping vdsmd to upgrade']
Traceback (most recent call last):
  File "/usr/sbin/ovirt-node-upgrade", line 365, in run
    self._run_hooks("pre-upgrade")
  File "/usr/sbin/ovirt-node-upgrade", line 197, in _run_hooks
    self._system(hook)
  File "/usr/sbin/ovirt-node-upgrade", line 145, in _system
    raise RuntimeError("Command Failed: '%s' %s" % (command, output))
RuntimeError: Command Failed: '('/usr/libexec/ovirt-node/hooks/pre-upgrade/01-vdsm',)' [u'/usr/libexec/ovirt-node/hooks/pre-upgrade/01-vdsm: Stopping vdsmd to upgrade']
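The failing command can be re-run in isolation to confirm it is the vdsm stop (and not the rest of the upgrade tool) that breaks - a minimal sketch, using only the hook path shown in the log above:

# /usr/libexec/ovirt-node/hooks/pre-upgrade/01-vdsm; echo "hook exit status: $?"

Judging by the hook's own log output, it does little more than stop vdsmd, so the underlying failure should be the same "Job for vdsmd.service canceled." seen when stopping the service by hand (see the next comment).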
Additional: To reproduce this issue, you must set up hosted-engine on RHEV-H, then upgrade this RHEV-H via Hosted Engine.
Tried to stop vdsm manually:

#1 - checking vdsm status:
------------------------------
# /bin/systemctl status vdsmd.service
vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled)
   Active: active (running) since Mon 2015-08-17 10:30:12 UTC; 3h 59min ago

#2 - Stopping vdsm manually
------------------------------
# /bin/systemctl stop vdsmd.service
Redirecting to /bin/systemctl stop vdsmd.service
Job for vdsmd.service canceled.

#3 - Checking status
------------------------------
# bin/systemctl status -l vdsmd.service
vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled)
   Active: deactivating (stop-sigterm) since Mon 2015-08-17 14:29:32 UTC; 5s ago
  Process: 8775 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
 Main PID: 8887 (vdsm)
   CGroup: /system.slice/vdsmd.service
           ├─ 8887 /usr/bin/python /usr/share/vdsm/vdsm
           ├─29047 /usr/libexec/ioprocess --read-pipe-fd 29 --write-pipe-fd 20 --max-threads 10 --max-queued-requests 10
           ├─29053 /usr/libexec/ioprocess --read-pipe-fd 54 --write-pipe-fd 50 --max-threads 10 --max-queued-requests 10
           ├─29056 /usr/libexec/ioprocess --read-pipe-fd 40 --write-pipe-fd 34 --max-threads 10 --max-queued-requests 10
           ├─29067 /usr/libexec/ioprocess --read-pipe-fd 50 --write-pipe-fd 47 --max-threads 10 --max-queued-requests 10
           └─29069 /usr/libexec/ioprocess --read-pipe-fd 64 --write-pipe-fd 60 --max-threads 10 --max-queued-requests 10

Aug 17 10:30:12 dhcp-11-107.nay.redhat.com python[8887]: DIGEST-MD5 ask_user_info()
Aug 17 10:30:12 dhcp-11-107.nay.redhat.com python[8887]: DIGEST-MD5 make_client_response()
Aug 17 10:30:12 dhcp-11-107.nay.redhat.com python[8887]: DIGEST-MD5 client step 3
Aug 17 14:29:32 dhcp-11-107.nay.redhat.com systemd[1]: Stopping Virtual Desktop Server Manager...
Aug 17 14:29:32 dhcp-11-107.nay.redhat.com vdsm[8887]: vdsm IOProcessClient ERROR IOProcess failure
                                                       Traceback (most recent call last):
                                                         File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in _communicate
                                                       Exception: FD closed
Aug 17 14:29:32 dhcp-11-107.nay.redhat.com vdsm[8887]: vdsm IOProcessClient ERROR IOProcess failure
                                                       Traceback (most recent call last):
                                                         File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in _communicate
                                                       Exception: FD closed
Aug 17 14:29:32 dhcp-11-107.nay.redhat.com vdsm[8887]: vdsm IOProcessClient ERROR IOProcess failure
                                                       Traceback (most recent call last):
                                                         File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in _communicate
                                                       Exception: FD closed
Aug 17 14:29:32 dhcp-11-107.nay.redhat.com vdsm[8887]: vdsm IOProcessClient ERROR IOProcess failure
                                                       Traceback (most recent call last):
                                                         File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in _communicate
                                                       Exception: FD closed
Aug 17 14:29:32 dhcp-11-107.nay.redhat.com vdsm[8887]: vdsm IOProcessClient ERROR IOProcess failure
                                                       Traceback (most recent call last):
                                                         File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in _communicate
                                                       Exception: FD closed
Aug 17 14:29:33 dhcp-11-107.nay.redhat.com systemd[1]: Starting Virtual Desktop Server Manager...

Dan, could you please review this one? Looks like an error in ioprocess and vdsm. Thanks!
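The "Starting Virtual Desktop Server Manager..." line immediately after the stop attempt suggests another unit is pulling vdsmd straight back in, which would cancel the stop job. A hedged diagnostic sketch to confirm that on an affected host (not taken from the original report, plain systemd tooling only):

# /bin/systemctl stop vdsmd.service &
# /bin/systemctl list-jobs
(look for a queued start job for vdsmd.service while the stop is still pending)
# journalctl -b -u vdsmd.service -u ovirt-ha-agent.service -u ovirt-ha-broker.service | tail -n 50
(shows which unit requested vdsmd to start again)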
Lowering the priority because we can reproduce this issue on the second RHEV-H host with HE, but the third and fourth RHEV-H hosts with HE can be upgraded via hosted-engine successfully. I am sure I performed the same steps on all of these RHEV-H hosts.

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.5/html/Installation_Guide/Upgrading_the_Self-Hosted_Engine.html

How reproducible:
30%
The ioprocess traceback you see is related to BZ#1189200, which needs to be fixed but is only a redundant log message that we need to remove; it is not related to any problem with stopping the vdsm service.
Still encountering this issue in rhev-hypervisor6-6.7-20150813.0.
Douglas, can you provide the vdsm log for comment 11?
I have provided a patch solving BZ#1189200:
https://gerrit.ovirt.org/#/c/45038/

You can try to reproduce the issue with it. But I strongly suggest you keep investigating the issue and do not assume that BZ#1189200 is a blocker for this bug, as there is nothing in that issue that should prevent vdsm from stopping as a service.
(In reply to Yeela Kaplan from comment #17)
> I have provided a patch solving BZ#1189200:
> https://gerrit.ovirt.org/#/c/45038/
> 
> You can try to reproduce the issue with it. But I strongly suggest you keep
> investigating the issue and do not assume that BZ#1189200 is a blocker for
> this bug, as there is nothing in that issue that should prevent vdsm from
> stopping as a service.

The logs are pointing to it. If there is another vdsm issue hiding behind it, your fix will help us discover it. If you can provide a VDSM downstream build with your fix, we can include it in a scratch build of RHEV-H and re-test it.
The ioprocess traceback you see in the vdsm log is just noise. Please try to reproduce the issue so we can find the real problem hiding behind it.
(In reply to Yeela Kaplan from comment #19)
> The ioprocess traceback you see in the vdsm log is just noise.
> Please try to reproduce the issue so we can find the real problem hiding
> behind it.

Ying, would you mind providing a reproducer machine again so the VDSM folks can investigate it?
Should this issue be resolved by the following commit?
https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;h=c6f3e6eba71c1166281f4ef7e9be1ad7000f4e77

Please properly assign and ack the bug.
Douglas,

After investigating further and connecting to the hosts Ying provided, vdsm is killed only on host RHEV-H-2, meaning the stop job is cancelled and vdsm gets a kill signal.

We tried a different approach: stopping the HA services (ovirt-ha-broker, ovirt-ha-agent) first. That fixed the problem on this machine, meaning vdsm then stopped normally with the TERM signal.

We can still see the IOProcess error in the log, meaning it is still there when vdsm stops normally. That log is not related and is just noise: the ioprocess children get the same signals the main vdsm process gets, causing them to close their fds, and the Python bindings then raise an "FD closed" exception, which is expected behavior.

I guess we should ask the HA guys why their services are keeping vdsm from stopping...
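For the record, the sequence that made vdsm stop cleanly on the affected host was roughly the following (a hedged sketch of the workaround described above; ovirt-ha-agent and ovirt-ha-broker are the standard hosted-engine HA units):

# systemctl stop ovirt-ha-agent.service
# systemctl stop ovirt-ha-broker.service
# systemctl stop vdsmd.service
# systemctl status vdsmd.service
(vdsmd should now be inactive instead of having its stop job cancelled)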
Will this bug hit us if we fix this for the next images and not in the current one? Is there a workaround? If there is, what is it?
Continuing from comment 24: removing the BZ#1189200 dependency.
(In reply to Yeela Kaplan from comment #26)
> Continuing from comment 24: removing the BZ#1189200 dependency.

Yeela, if the ioprocess patch only suppresses the error messages from vdsm and does not impact anything else, it would be nice to include it in 3.5.4, as we are going to rebuild RHEV-H anyway. These error messages are pretty bad/confusing IMO.

Yeela/Yaniv, what do you think?
(In reply to Yaniv Dary from comment #25)
> Will this bug hit us if we fix this for the next images and not in the
> current one?

Yes, because we need to stop vdsm in the current version of Node in order to proceed with the upgrade.

> Is there a workaround? If there is, what is it?

Probably asking users to go to the unsupported shell (via the F2 key) and stop vdsm manually before upgrading the ISO via the oVirt web admin.

Ying, does this also happen during a CD-ROM/USB upgrade? Maybe we could use that as a workaround?
The fix will only be available in el7.2, not even on vdsm master, as it depends on an el7 bug.
(In reply to Douglas Schilling Landgraf from comment #28)
> (In reply to Yaniv Dary from comment #25)
> > Will this bug hit us if we fix this for the next images and not in the
> > current one?
> 
> Yes, because we need to stop vdsm in the current version of Node in order to
> proceed with the upgrade.
> 
> > Is there a workaround? If there is, what is it?
> 
> Probably asking users to go to the unsupported shell (via the F2 key) and
> stop vdsm manually before upgrading the ISO via the oVirt web admin.
> 
> Ying, does this also happen during a CD-ROM/USB upgrade? Maybe we could use
> that as a workaround?

Ying, answering myself after talking with Yaniv: a CD-ROM/USB upgrade would be painful for remote servers, so we can ignore this approach for now.
(In reply to Douglas Schilling Landgraf from comment #30)
> (In reply to Douglas Schilling Landgraf from comment #28)
> > (In reply to Yaniv Dary from comment #25)
> > > Will this bug hit us if we fix this for the next images and not in the
> > > current one?
> > 
> > Yes, because we need to stop vdsm in the current version of Node in order
> > to proceed with the upgrade.
> > 
> > > Is there a workaround? If there is, what is it?
> > 
> > Probably asking users to go to the unsupported shell (via the F2 key) and
> > stop vdsm manually before upgrading the ISO via the oVirt web admin.
> > 
> > Ying, does this also happen during a CD-ROM/USB upgrade? Maybe we could
> > use that as a workaround?
> 
> Ying, answering myself after talking with Yaniv: a CD-ROM/USB upgrade would
> be painful for remote servers, so we can ignore this approach for now.

It should not be ignored; it is a workaround option. But without manual steps for remote hosts, this is a blocker.
Hi,

Just to mention that I could not reproduce this report; my steps are below.

Phase 1:
----------
#1 Installed rhev-hypervisor6-6.7-20150813.0.iso
#2 Configured hostname/network via the TUI (eth0)
#3 Configured Hosted Engine via the TUI:
   - provided a RHEL 6.7 ISO for installation of RHEV-M 3.5 (3.5.4.2-1.3.el6ev)
   - nfs storage
   - installed RHEV-M 3.5 in the VM
   - finished the configuration of Hosted Engine; everything is working
#4 Installed the RPM rhev-hypervisor6-6.7-20150813.0.el6ev.noarch.rpm into RHEV-M

Phase 2:
------------
#5 After the first hosted engine host is UP, installed a second machine with rhev-hypervisor6-6.7-20150813.0
#6 Configured hostname/network so that all nodes and the engine can communicate
#7 In the Hosted Engine tab of the TUI, selected "Start additional host setup" and during the process provided the same NFS storage to include this RHEV-H in the existing Hosted Engine instance.

After hosted-engine setup, everything should be working and the two hosts should be UP in RHEV-M.

Phase 3:
---------
#8 The two hosts are UP in RHEV-M; select the one that was added last and put it into maintenance.
#9 Right-click the host, select Upgrade and the ISO to upgrade to.
#10 The upgrade completed without any issue; the host rebooted and later came back UP.
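As a side note on step #4, a minimal sketch of how the node image becomes available to the engine (assuming the image is shipped as the rhev-hypervisor RPM, as above; the exact package file name differs per build):

On the RHEV-M machine:
# rpm -ivh rhev-hypervisor6-6.7-20150813.0.el6ev.noarch.rpm
# ls /usr/share/rhev-hypervisor/

The engine offers the ISOs found under /usr/share/rhev-hypervisor/ in the host Upgrade dialog and copies the selected one to /data/updates/ovirt-node-image.iso on the node, as seen in the engine.log excerpt in the description.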
Douglas: The failing test (#11) uses systemctl (systemd) and the test where you could not reproduce uses SysV (RHEL 6 based image, #33).

Could it be that systemd refused to stop vdsm? If so, then we need to know why. Hosted engine depends on VDSM, but systemd should be smart enough to kill it too.
(In reply to Martin Sivák from comment #36)
> Douglas: The failing test (#11) uses systemctl (systemd) and the test where
> you could not reproduce uses SysV (RHEL 6 based image, #33).

Good catch, Martin. I have tried both el6 and el7 systems. Comment #11 was on the bogus machine QE provided.

> Could it be that systemd refused to stop vdsm? If so, then we need to know
> why. Hosted engine depends on VDSM, but systemd should be smart enough to
> kill it too.

Agreed. This report is on my radar; I will give it a new try today. Thanks!
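One way to check the systemd angle from comment 36 on an affected el7 node (a hedged diagnostic sketch; it assumes the HA services ship their unit files under /usr/lib/systemd/system/):

# systemctl list-dependencies --reverse vdsmd.service
(shows which units depend on, or get pulled in with, vdsmd)
# grep -E 'Requires|BindsTo|PartOf|After' /usr/lib/systemd/system/ovirt-ha-agent.service /usr/lib/systemd/system/ovirt-ha-broker.service

If one of the HA units declares a hard dependency on vdsmd and gets restarted while vdsmd is being stopped, systemd could queue a conflicting start job for vdsmd and cancel the stop job, which would match the "Job for vdsmd.service canceled." output above.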
Any updates on this issue?
Will this bz be ready for 3.5.5?
No, it is targeted for 3.5.6
Martin/Martin - I'm not sure who's responsible for some interactions in this component -- when a host is put into maintenance, does it also put that host into maintenance for hosted engine? It seems like this may not be happening. It's something we could do from ovirt-node-upgrade on the node side, but it sounds like it should already be done. What's the expected flow?
Hi Ryan,

Yes, putting a host into maintenance should also put its hosted engine into maintenance (local mode). This bug might also be related to https://gerrit.ovirt.org/#/c/45842/ so it might already be fixed. Do we have any recent test results with a new enough hosted engine (3.6)?
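For reference, the hosted-engine local maintenance state can be inspected and set directly on the node with the standard hosted-engine CLI (a minimal sketch; whether the engine triggers this automatically when the host is put into maintenance is exactly the open question here):

# hosted-engine --vm-status
# hosted-engine --set-maintenance --mode=local
(and --mode=none to leave maintenance again after the upgrade)

If local maintenance were in effect, the HA agent would stop managing the engine VM on this host, which is presumably what would let the pre-upgrade hook stop vdsmd cleanly.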
We do have 3.6 builds on 7.2 available. Do you know what the NVR of the package with the fix would be? Ying: has this been tested on 3.6/7.2?
(In reply to Ryan Barry from comment #48)
> We do have 3.6 builds on 7.2 available. Do you know what the NVR of the
> package with the fix would be?
> 
> Ying: has this been tested on 3.6/7.2?

Since the first RHEV-H 7.2 build for 3.6.0, we have been blocked by critical bugs for a long time - a series of bugs: bug 1260470, bug 1267437, bug 1260548, bug 1260551, bug 1260559, bug 1270203 and bug 1267470. We CANNOT test this bug now due to these critical bugs on our node.
Raising priority because it's blocking testing. We are possibly also seeing this in 3.5.z in bug 1271707
*** Bug 1271707 has been marked as a duplicate of this bug. ***
Yeela, what is the bug you are referencing in comment 29?
I am referring to BZ#1189200 in comment 29, but the traceback seen in the vdsm log on vdsm restart is unrelated to this bug... It is just log noise that will be removed, not a real bug.
So, to me the problem described in this bug (IIUIC) is not Node specific.

IIUIC the problem is that the node is not brought into the right maintenance mode on el7 hosts (comment 36), and using a manual workaround (comment 38) fixes the issue.

Simone, what is the expected flow on RHEL-H hosts? Is there any user intervention needed?

And with that in mind, what does this mean for Node? Can the whole process be automated?
(In reply to Fabian Deutsch from comment #56)
> IIUIC the problem is that the node is not brought into the right maintenance
> mode on el7 hosts (comment 36), and using a manual workaround (comment 38)
> fixes the issue.

Note that this issue occurred on RHEV-H 6.7 (el6) as well; see comment 14.
This is 3.5.z only.
Martin, can you fill in the doc text?
To verify this bug I need to start from RHEV-H 3.5.6 and upgrade to a greater version.
(In reply to Artyom from comment #62)
> To verify this bug I need to start from RHEV-H 3.5.6 and upgrade to a
> greater version.

Fabian, is there a way you can think of that might help us simulate this?
The bug is in upgrading a host which is involved in HE. I don't see how we can shortcut this.
Upgrade succeeded via the engine from:
Red Hat Enterprise Virtualization Hypervisor release 6.7 (20151028.0.el6ev)
ovirt-hosted-engine-ha-1.2.8-1.el6ev.noarch
to:
Red Hat Enterprise Virtualization Hypervisor release 6.7 (20151029.0.el6ev)
ovirt-hosted-engine-ha-1.2.8-1.el6ev.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-2529.html