Description of problem:

I wanted to upgrade my SHE 3.6 to 4.0 but it failed (although, surprisingly, the engine is 4.0 now). While reading the upgrade guide I became skeptical about the flow:

- you want 4.0, but you are adding the 'rhel-7-server-rhev-mgmt-agent-rpms' channel, which serves the latest packages, i.e. 4.2
- the appliance is OK: rhevm-appliance-4.0.20170307.0-1.el7ev.noarch

Besides the failure itself, my concern is:

- how do we treat our upgrade documentation for older versions when it mentions a channel (e.g. 'rhel-7-server-rhev-mgmt-agent-rpms') that no longer carries only the older version but also the latest one? Are we thus telling the customer to upgrade to the latest rpms from such a channel, or are we letting customers experiment here?

IMO the failure is related to the channel carrying 4.0, 4.1 and 4.2 rpms. Am I wrong?

FYI, upgrading just 'ovirt-hosted-engine-setup' to the latest version in this channel pulls in other requirements, namely 'ansible', which does not exist in 4.0... So the host system is now a mix of 3.6 and 4.2, because we recommend upgrading 'ovirt-hosted-engine-setup' and 'rhevm-appliance' first and _only_ _then_ the rest of the host.

The failure itself:

---%>---
|- [ ERROR ] Failed to execute stage 'Closing up': Command '/bin/firewall-cmd' failed to execute
|- [ INFO ] Stage: Clean up
|-          Log file is located at /var/log/ovirt-engine/setup/ovirt-engine-setup-20180823135310-ucjbzp.log
|- [ INFO ] Generating answer file '/var/lib/ovirt-engine/setup/answers/20180823135533-setup.conf'
|- [ INFO ] Stage: Pre-termination
|- [ INFO ] Stage: Termination
|- [ ERROR ] Execution of setup failed
|- HE_APPLIANCE_ENGINE_SETUP_FAIL
[ ERROR ] Engine setup failed on the appliance
[ ERROR ] Failed to execute stage 'Closing up': Engine setup failed on the appliance
          Please check its log on the appliance.
[ INFO ] Stage: Clean up
[ INFO ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20180823115533.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ ERROR ] Hosted Engine upgrade failed: you can use --rollback-upgrade option to recover the engine VM disk from a backup.
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20180823104447-4o09te.log
---%<---

And the log inside the engine:

---%>---
2018-08-23 13:53:12 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:921 execute-output: ('/bin/systemctl', 'status', 'firewalld.service') stdout:
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2018-08-23 11:51:42 CEST; 2h 1min ago
     Docs: man:firewalld(1)
 Main PID: 538 (firewalld)
   CGroup: /system.slice/firewalld.service
           └─538 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid

Aug 23 11:51:38 localhost.localdomain systemd[1]: Starting firewalld - dynamic firewall daemon...
Aug 23 11:51:42 localhost.localdomain systemd[1]: Started firewalld - dynamic firewall daemon.
Aug 23 11:51:45 localhost.localdomain firewalld[538]: WARNING: '/usr/sbin/ip6tables-restore -n' failed:
Aug 23 11:51:45 localhost.localdomain firewalld[538]: WARNING: '/usr/sbin/iptables-restore -n' failed:
Aug 23 11:51:45 localhost.localdomain firewalld[538]: ERROR: '/usr/sbin/ebtables-restore --noflush' failed:
Aug 23 11:51:45 localhost.localdomain firewalld[538]: ERROR: COMMAND_FAILED
Aug 23 11:51:45 localhost.localdomain firewalld[538]: ERROR: INVALID_ZONE
Aug 23 11:51:46 localhost.localdomain firewalld[538]: ERROR: INVALID_ZONE
...
2018-08-23 13:55:33 DEBUG otopi.plugins.otopi.network.firewalld plugin.execute:926 execute-output: ('/bin/firewall-cmd', '--reload') stderr:
ESC[91mError: COMMAND_FAILEDESC[00m

2018-08-23 13:55:33 DEBUG otopi.context context._executeMethod:142 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/otopi/plugins/otopi/network/firewalld.py", line 324, in _closeup
    '--reload'
  File "/usr/lib/python2.7/site-packages/otopi/plugin.py", line 931, in execute
    command=args[0],
RuntimeError: Command '/bin/firewall-cmd' failed to execute
2018-08-23 13:55:33 ERROR otopi.context context._executeMethod:151 Failed to execute stage 'Closing up':
---%<---

Huh, why did it fail to execute?

---%>---
# which firewall-cmd
/usr/bin/firewall-cmd
# ls -li /bin/firewall-cmd /usr/bin/firewall-cmd
1250549 -rwxr-xr-x. 1 root root 105358 Feb 10 2017 /bin/firewall-cmd
1250549 -rwxr-xr-x. 1 root root 105358 Feb 10 2017 /usr/bin/firewall-cmd

# journalctl | grep firewall
Aug 23 11:51:38 localhost.localdomain systemd[1]: Starting firewalld - dynamic firewall daemon...
Aug 23 11:51:42 localhost.localdomain systemd[1]: Started firewalld - dynamic firewall daemon.
Aug 23 11:51:45 localhost.localdomain firewalld[538]: WARNING: '/usr/sbin/ip6tables-restore -n' failed:
Aug 23 11:51:45 localhost.localdomain firewalld[538]: WARNING: '/usr/sbin/iptables-restore -n' failed:
Aug 23 11:51:45 localhost.localdomain firewalld[538]: ERROR: '/usr/sbin/ebtables-restore --noflush' failed:
Aug 23 11:51:45 localhost.localdomain firewalld[538]: ERROR: COMMAND_FAILED
Aug 23 11:51:45 localhost.localdomain firewalld[538]: ERROR: INVALID_ZONE
Aug 23 11:51:45 localhost.localdomain NetworkManager[557]: <warn>  [1535017905.6136] firewall: [0x7f5e40ca0b60,change:"eth0"]: complete: request failed (INVALID_ZONE)
Aug 23 11:51:46 localhost.localdomain firewalld[538]: ERROR: INVALID_ZONE
Aug 23 11:51:46 localhost.localdomain NetworkManager[557]: <warn>  [1535017906.6607] firewall: [0x7f5e40ca2b50,change:"eth0"]: complete: request failed (INVALID_ZONE)
Aug 23 13:55:33 she-test-01.rhev.lab.eng.brq.redhat.com firewalld[538]: WARNING: '/usr/sbin/ip6tables-restore -n' failed:
Aug 23 13:55:33 she-test-01.rhev.lab.eng.brq.redhat.com firewalld[538]: WARNING: '/usr/sbin/iptables-restore -n' failed:
Aug 23 13:55:33 she-test-01.rhev.lab.eng.brq.redhat.com firewalld[538]: ERROR: '/usr/sbin/ebtables-restore --noflush' failed:
Aug 23 13:55:33 she-test-01.rhev.lab.eng.brq.redhat.com firewalld[538]: ERROR: COMMAND_FAILED
---%<---

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.2.25-1.el7ev.noarch

How reproducible:
Tried only once so far.

Steps to Reproduce:
1. have a 3.6 SHE (I had EL 7.4 on the host)
2. upgrade ovirt-hosted-engine-setup and rhevm-appliance as written in the upgrade guide
3. hosted-engine --upgrade-appliance

Actual results:
The setup failed, but the engine is 4.0 now.

Expected results:
Either it should work fine, or it should be documented not to mix various rpm versions, or it should roll back.

Additional info:
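For reference, a minimal sketch of the commands behind the reproduce steps above (the repo IDs are the ones that show up later in this bug; the exact channels to enable come from the upgrade guide and may differ per subscription):

---%>---
# enable the channels the guide points at (assumed repo IDs)
subscription-manager repos --enable=rhel-7-server-rhv-4-mgmt-agent-rpms \
                           --enable=rhel-7-server-ansible-2-rpms
# upgrade only the setup tool and the appliance first, as the guide says
yum update ovirt-hosted-engine-setup rhevm-appliance
# then run the appliance upgrade flow
hosted-engine --upgrade-appliance
---%<---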
(In reply to Jiri Belka from comment #0)
> IMO the failure is related to the channel carrying 4.0, 4.1 and 4.2 rpms.
> Am I wrong?

No, it's not directly related to that. We discussed this more than once and unfortunately we can't do much about it due to the repository design. Upstream we have 4.0, 4.1 and 4.2 repositories, and each of them contains both the engine rpms and the host ones. Downstream we instead have a host channel and an engine channel; for the engine channel we have 4.0, 4.1, 4.2 and so on, while for the host/agent channel we have just a single 3.x and a single 4.x channel. Because of that, the latest (whatever it is...) ovirt-hosted-engine-setup is supposed to keep the 3.6/el6 -> 4.0/el7 upgrade capability.

> Expected results:
> Either it should work fine, or it should be documented not to mix various
> rpm versions, or it should roll back.

We have a manual rollback command:
  hosted-engine --rollback-upgrade
The user is supposed to run it manually if needed; see the sketch below.
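A minimal sketch of that manual recovery path, assuming the backup taken by --upgrade-appliance is still in place (nothing here beyond the two commands already mentioned in this bug):

---%>---
# check the current hosted-engine state first
hosted-engine --vm-status
# restore the engine VM disk from the backup created during the upgrade attempt
hosted-engine --rollback-upgrade
---%<---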
(In reply to Jiri Belka from comment #0)
> Huh, why did it fail to execute?
>
> ---%>---
> # which firewall-cmd
> /usr/bin/firewall-cmd
> # ls -li /bin/firewall-cmd /usr/bin/firewall-cmd
> 1250549 -rwxr-xr-x. 1 root root 105358 Feb 10 2017 /bin/firewall-cmd
> 1250549 -rwxr-xr-x. 1 root root 105358 Feb 10 2017 /usr/bin/firewall-cmd
>
> # journalctl | grep firewall
> Aug 23 11:51:38 localhost.localdomain systemd[1]: Starting firewalld -
> dynamic firewall daemon...
> Aug 23 11:51:42 localhost.localdomain systemd[1]: Started firewalld -
> dynamic firewall daemon.
> Aug 23 11:51:45 localhost.localdomain firewalld[538]: WARNING:
> '/usr/sbin/ip6tables-restore -n' failed:
> Aug 23 11:51:45 localhost.localdomain firewalld[538]: WARNING:
> '/usr/sbin/iptables-restore -n' failed:
> Aug 23 11:51:45 localhost.localdomain firewalld[538]: ERROR:
> '/usr/sbin/ebtables-restore --noflush' failed:
> Aug 23 11:51:45 localhost.localdomain firewalld[538]: ERROR: COMMAND_FAILED
> Aug 23 11:51:45 localhost.localdomain firewalld[538]: ERROR: INVALID_ZONE
> Aug 23 11:51:45 localhost.localdomain NetworkManager[557]: <warn>
> [1535017905.6136] firewall: [0x7f5e40ca0b60,change:"eth0"]: complete:
> request failed (INVALID_ZONE)
> Aug 23 11:51:46 localhost.localdomain firewalld[538]: ERROR: INVALID_ZONE
> Aug 23 11:51:46 localhost.localdomain NetworkManager[557]: <warn>
> [1535017906.6607] firewall: [0x7f5e40ca2b50,change:"eth0"]: complete:
> request failed (INVALID_ZONE)
> Aug 23 13:55:33 she-test-01.rhev.lab.eng.brq.redhat.com firewalld[538]:
> WARNING: '/usr/sbin/ip6tables-restore -n' failed:
> Aug 23 13:55:33 she-test-01.rhev.lab.eng.brq.redhat.com firewalld[538]:
> WARNING: '/usr/sbin/iptables-restore -n' failed:
> Aug 23 13:55:33 she-test-01.rhev.lab.eng.brq.redhat.com firewalld[538]:
> ERROR: '/usr/sbin/ebtables-restore --noflush' failed:
> Aug 23 13:55:33 she-test-01.rhev.lab.eng.brq.redhat.com firewalld[538]:
> ERROR: COMMAND_FAILED
> ---%<---

We had reports of this in the past too, here is one:
https://bugzilla.redhat.com/1494985
but we also had others. Unfortunately we never got a systematic reproducer.
I think it is/was something inside firewalld.

Jiri, could you please retry on the same env and report if and how it's reproducible?
# rpm -q redhat-release-server vdsm ovirt-hosted-engine-setup ovirt-hosted-engine-ha libvirt-daemon qemu-kvm-rhev
redhat-release-server-7.4-18.el7.x86_64
vdsm-4.17.45-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.7.4-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.10-2.el7ev.noarch
libvirt-daemon-3.2.0-14.el7_4.3.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.8.x86_64

# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : 10-37-140-183.rhev.lab.eng.brq.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 103fa4f0
local_conf_timestamp               : 0
Host timestamp                     : 59706

Now proceeding with the upgrade...

# yum repolist -v | grep -i repo-id
Repo-id      : rhel-7-server-ansible-2-rpms
Repo-id      : rhel-7-server-rhv-4-mgmt-agent-rpms
Repo-id      : rhel-7-server-rpms

^^ ansible is needed because it is required by ovirt-hosted-engine-setup in 4.2 (rhel-7-server-rhv-4-mgmt-agent-rpms serves the latest version, thus the 4.2 one).

yum update ovirt-hosted-engine-setup rhevm-appliance
 ovirt-engine-sdk-python    noarch  3.6.9.1-1.el7ev           rhel-7-server-rhv-4-mgmt-agent-rpms  484 k
     replacing  rhevm-sdk-python.noarch 3.6.9.1-1.el7ev
...
Updating:
 ovirt-hosted-engine-setup  noarch  2.2.25-1.el7ev            rhel-7-server-rhv-4-mgmt-agent-rpms  401 k
 rhevm-appliance            noarch  1:4.0.20170307.0-1.el7ev  rhel-7-server-rhv-4-mgmt-agent-rpms  1.5 G
...
 ansible                    noarch  2.6.3-1.el7ae             rhel-7-server-ansible-2-rpms          10 M
...
 ovirt-host                 x86_64  4.2.3-1.el7ev             rhel-7-server-rhv-4-mgmt-agent-rpms  8.7 k
 ovirt-host-dependencies    x86_64  4.2.3-1.el7ev             rhel-7-server-rhv-4-mgmt-agent-rpms  8.6 k
...
 otopi                      noarch  1.7.8-1.el7ev             rhel-7-server-rhv-4-mgmt-agent-rpms  166 k
 ovirt-host-deploy          noarch  1.7.4-1.el7ev             rhel-7-server-rhv-4-mgmt-agent-rpms   96 k
 ovirt-hosted-engine-ha     noarch  2.2.16-1.el7ev            rhel-7-server-rhv-4-mgmt-agent-rpms  316 k
 ovirt-setup-lib            noarch  1.1.4-1.el7ev             rhel-7-server-rhv-4-mgmt-agent-rpms   19 k
...

# hosted-engine --upgrade-appliance
...
|- [ INFO ] Creating/refreshing Engine 'internal' domain database schema
|- [ INFO ] Generating post install configuration file '/etc/ovirt-engine-setup.conf.d/20-setup-ovirt-post.conf'
|- [ INFO ] Stage: Transaction commit
|- [ INFO ] Stage: Closing up
|- [ ERROR ] Failed to execute stage 'Closing up': Command '/bin/firewall-cmd' failed to execute
|- [ INFO ] Stage: Clean up
|-          Log file is located at /var/log/ovirt-engine/setup/ovirt-engine-setup-20180824105359-xeuw5a.log
|- [ INFO ] Generating answer file '/var/lib/ovirt-engine/setup/answers/20180824105613-setup.conf'
|- [ INFO ] Stage: Pre-termination
|- [ INFO ] Stage: Termination
|- [ ERROR ] Execution of setup failed
|- HE_APPLIANCE_ENGINE_SETUP_FAIL
[ ERROR ] Engine setup failed on the appliance
[ ERROR ] Failed to execute stage 'Closing up': Engine setup failed on the appliance
          Please check its log on the appliance.
[ INFO ] Stage: Clean up
[ INFO ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20180824085612.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ ERROR ] Hosted Engine upgrade failed: you can use --rollback-upgrade option to recover the engine VM disk from a backup.
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20180824074344-zzvphn.log
This is definitely due to:

2018-08-23 13:55:33 DEBUG otopi.plugins.otopi.network.firewalld plugin.execute:926 execute-output: ('/bin/firewall-cmd', '--reload') stderr:
ESC[91mError: COMMAND_FAILEDESC[00m

Attaching the firewalld logs at debug level.

Eric, could you please take a look?
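For the record, a sketch of how such debug-level firewalld logs can be collected on RHEL 7 (the sysconfig path and debug level here are assumptions, not necessarily what was used for the attachment):

---%>---
# /etc/sysconfig/firewalld normally carries a FIREWALLD_ARGS= line;
# bumping it to full debug and restarting makes firewalld write details
# to /var/log/firewalld
sed -i "s|^#\?FIREWALLD_ARGS=.*|FIREWALLD_ARGS='--debug=10'|" /etc/sysconfig/firewalld
systemctl restart firewalld
tail -f /var/log/firewalld
---%<---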
Created attachment 1478440 [details] firewalld logs
Created attachment 1478441 [details] firewalld.conf
After some struggling with the OVA file, I got an engine VM based on this OVA version up, and here are the data:

- rpms

rhevm-4.0.7.4-0.1.el7ev.noarch
firewalld-0.4.3.2-8.1.el7_3.2.noarch
redhat-release-server-7.3-7.el7.x86_64
iptables-1.4.21-17.el7.x86_64

- firewall-cmd check

# firewall-cmd --list-all
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0
  sources:
  services: dhcpv6-client ssh
  ports:
  protocols:
  masquerade: no
  forward-ports:
  sourceports:
  icmp-blocks:
  rich rules:

- trying engine-setup manually (engine-setup --offline)

...
          Firewall manager                        : firewalld
...
2018-08-24 07:44:55 INFO otopi.plugins.ovirt_engine_common.base.core.misc misc._terminate:156 Execution of setup completed successfully

^^ finished without any issue

- firewalld recheck

# firewall-cmd --list-all
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0
  sources:
  services: dhcpv6-client ovirt-fence-kdump-listener ovirt-http ovirt-https ovirt-imageio-proxy ovirt-postgres ovirt-vmconsole-proxy ovirt-websocket-proxy ssh
  ports:
  protocols:
  masquerade: no
  forward-ports:
  sourceports:
  icmp-blocks:
  rich rules:
Thus rhevm-appliance-4.0.20170307.0-1.el7ev.ova itself does work fine via `engine-setup --offline'. The problem must be related to how the engine VM based on this OVA file is deployed from the HE host.

# grep -iE 'firewall|iptables' /var/lib/ovirt-engine/setup/answers/20180824074455-setup.conf
OVESETUP_CONFIG/firewallManager=str:firewalld
OVESETUP_CONFIG/firewallChangesReview=none:None
OVESETUP_CONFIG/updateFirewall=bool:True
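As a side note, if one wanted to rule the firewalld close-up step out entirely, engine-setup can be told not to touch the firewall at all; a hypothetical sketch reusing the same key that appears in the answer file above (the file name is illustrative):

---%>---
cat > /root/no-firewall.conf <<'EOF'
[environment:default]
OVESETUP_CONFIG/updateFirewall=bool:False
EOF
engine-setup --offline --config-append=/root/no-firewall.conf
---%<---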
I tried to run engine-setup on this OVA file (image) with a cloud-init ISO with the following content, and it worked fine:
https://paste.fedoraproject.org/paste/NVmZZeDNAZ6IVQOum7G~Gw/raw
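The paste itself is not reproduced here; as a rough sketch only, a NoCloud seed ISO for such a test is typically built like this (the user-data/meta-data contents below are illustrative assumptions, not the actual paste):

---%>---
cat > user-data <<'EOF'
#cloud-config
# illustrative only -- the real user-data lives in the paste linked above
chpasswd: { expire: False }
EOF
cat > meta-data <<'EOF'
instance-id: she-test
local-hostname: she-test-01
EOF
genisoimage -output seed.iso -volid cidata -joliet -rock user-data meta-data
---%<---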
(In reply to Simone Tiraboschi from comment #6)
> This is definitely due to:
>
> 2018-08-23 13:55:33 DEBUG otopi.plugins.otopi.network.firewalld
> plugin.execute:926 execute-output: ('/bin/firewall-cmd', '--reload') stderr:
> ESC[91mError: COMMAND_FAILEDESC[00m
>
> Attaching the firewalld logs at debug level.
>
> Eric, could you please take a look?

From the errors I can't tell much. One thing I notice is that this is a pretty old version of firewalld - so old that the "--wait" option for iptables-restore is not being used (see bug 1446162). If anything else on the system happens to be holding the iptables lock, then iptables-restore will fail.

If you can try to reproduce with this setting in /etc/firewalld/firewalld.conf, it would really help:

  IndividualCalls=yes

With IndividualCalls=yes, iptables will be called directly instead of using iptables-restore. The individual calls _will_ use the "-w" option.
(In reply to Eric Garver from comment #12)
> With IndividualCalls=yes, iptables will be called directly instead of using
> iptables-restore. The individual calls _will_ use the "-w" option.

Thanks, we are going to try that.
I took the diff and patched it onto the 3.6 SHE host; the upgrade then proceeded successfully.
(In reply to Jiri Belka from comment #14)
> I took the diff and patched it onto the 3.6 SHE host; the upgrade then
> proceeded successfully.

Please realize that using IndividualCalls=yes has a performance impact when applying rules. That's why iptables-restore is used in later firewalld versions.
(In reply to Eric Garver from comment #15)
> Please realize that using IndividualCalls=yes has a performance impact when
> applying rules. That's why iptables-restore is used in later firewalld
> versions.

Thanks.
That upgrade code was developed to let the user perform an upgrade of their RHV 3.6 environment, where the RHV Manager was running on a RHEL 6 based VM, as easily and as smoothly as possible.

The user cannot run a direct RHV Manager upgrade from 3.6 to the current release (4.2); they have to pass through 4.0 and 4.1. Upgrades from 4.0 to 4.1 and from 4.1 to 4.2 can be performed basically in place, while 3.6/el6 -> 4.0/el7 is much more complex due to the OS change.

The RHV Manager 4.0 appliance is still shipped on a RHEL 7.2 base, and we are not planning to rebuild and retest it on RHEL 7.5 or 7.6.

So, Eric, basically you are suggesting to use IndividualCalls=yes only during the setup process to bypass this issue, and then restore the initial configuration once the user reaches RHEL 7.5 on the target setup - correct?
(In reply to Simone Tiraboschi from comment #16)
> (In reply to Eric Garver from comment #15)
> > Please realize that using IndividualCalls=yes has a performance impact
> > when applying rules. That's why iptables-restore is used in later
> > firewalld versions.
>
> Thanks.
> That upgrade code was developed to let the user perform an upgrade of their
> RHV 3.6 environment, where the RHV Manager was running on a RHEL 6 based
> VM, as easily and as smoothly as possible.
>
> The user cannot run a direct RHV Manager upgrade from 3.6 to the current
> release (4.2); they have to pass through 4.0 and 4.1. Upgrades from 4.0 to
> 4.1 and from 4.1 to 4.2 can be performed basically in place, while 3.6/el6
> -> 4.0/el7 is much more complex due to the OS change.
>
> The RHV Manager 4.0 appliance is still shipped on a RHEL 7.2 base, and we
> are not planning to rebuild and retest it on RHEL 7.5 or 7.6.
>
> So, Eric, basically you are suggesting to use IndividualCalls=yes only
> during the setup process to bypass this issue, and then restore the initial
> configuration once the user reaches RHEL 7.5 on the target setup - correct?

Yes. It's a good idea to restore the original value of IndividualCalls=no once you've upgraded to a RHEL base that supports the iptables-restore "-w" option.
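For completeness, a sketch of reverting the workaround afterwards (assuming the key was set as in the earlier sketch):

---%>---
sed -i 's/^IndividualCalls=.*/IndividualCalls=no/' /etc/firewalld/firewalld.conf
systemctl restart firewalld
firewall-cmd --state
---%<---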
OK, with ovirt-hosted-engine-setup-2.2.27-1.el7ev.noarch:

...
[ INFO ] Engine is still not reachable, waiting...
[ INFO ] Engine replied: DB Up!Welcome to Health Status!
[ INFO ] Connecting to Engine
[ INFO ] Connecting to Engine
[ INFO ] Connecting to Engine
[ INFO ] Connecting to Engine
[ INFO ] Connecting to Engine
[ ERROR ] Failed to execute stage 'Closing up': Cannot connect to Engine API on she-test-01.rhev.lab.eng.brq.redhat.com
[ INFO ] Stage: Clean up
[ INFO ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20180927183139.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ ERROR ] Hosted Engine upgrade failed: you can use --rollback-upgrade option to recover the engine VM disk from a backup.
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20180927172108-ck35jq.log

# hosted-engine --vm-status

!! Cluster is in GLOBAL MAINTENANCE mode !!

--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : 10-37-140-183.example.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3000
stopped                            : False
Local maintenance                  : False
crc32                              : b5a6a960
local_conf_timestamp               : 278243
Host timestamp                     : 278243
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=278243 (Sun Sep 30 22:09:54 2018)
        host-id=1
        score=3000
        vm_conf_refresh_time=278243 (Sun Sep 30 22:09:54 2018)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False


!! Cluster is in GLOBAL MAINTENANCE mode !!
You have new mail in /var/spool/mail/root

If I am able to reproduce the above ERROR about the connection to the Engine API, I'll open a separate BZ. But the upgrade as a whole worked.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3482