+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1579103 +++
======================================================================

Description of problem:

After upgrading 3 out of 3 hypervisors, all of them failed to start ovirt-ha-{agent,broker}. The components are restarted by systemd and keep triggering abrtd analyses of Python dumps.

The root cause is that, for some reason, after the upgrade the logs:

'/var/log/ovirt-hosted-engine-ha/broker.log'
'/var/log/ovirt-hosted-engine-ha/agent.log'

change their ownership from vdsm:kvm to root:root, making it impossible for the hosted-engine daemons to open them and filling the system log with scary messages like:

mag 17 02:32:50 m1.goodfellow.local systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
mag 17 02:32:50 m1.goodfellow.local systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...
mag 17 02:32:51 m1.goodfellow.local python[1624]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
mag 17 02:32:51 m1.goodfellow.local abrt-server[1629]: Not saving repeating crash in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
mag 17 02:32:51 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service: main process exited, code=exited, status=1/FAILURE
mag 17 02:32:51 m1.goodfellow.local systemd[1]: Unit ovirt-ha-broker.service entered failed state.
mag 17 02:32:51 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service failed.
[...]
mag 17 02:32:53 m1.goodfellow.local systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
mag 17 02:32:53 m1.goodfellow.local systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...
mag 17 02:32:54 m1.goodfellow.local python[1828]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
mag 17 02:32:54 m1.goodfellow.local abrt-server[1835]: Not saving repeating crash in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
mag 17 02:32:54 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service: main process exited, code=exited, status=1/FAILURE
mag 17 02:32:54 m1.goodfellow.local systemd[1]: Unit ovirt-ha-broker.service entered failed state.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service failed.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service holdoff time over, scheduling restart.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: start request repeated too quickly for ovirt-ha-broker.service
mag 17 02:32:54 m1.goodfellow.local systemd[1]: Failed to start oVirt Hosted Engine High Availability Communications Broker.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: Unit ovirt-ha-broker.service entered failed state.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service failed.
[...]
mag 17 02:33:36 m1.goodfellow.local vdsm[6511]: WARN Not ready yet, ignoring event '|net|host_conn|no_id' args={}
mag 17 02:33:36 m1.goodfellow.local vdsm[6511]: WARN Not ready yet, ignoring event '|net|host_conn|no_id' args={}
mag 17 02:33:36 m1.goodfellow.local vdsm[6511]: WARN File: /var/run/vdsm/trackedInterfaces/ovirtmgmt already removed
mag 17 02:33:36 m1.goodfellow.local vdsm[6511]: WARN Not ready yet, ignoring event '|net|host_conn|no_id' args={}
mag 17 02:33:36 m1.goodfellow.local vdsm[6511]: WARN File: /var/run/vdsm/trackedInterfaces/eno2 already removed
mag 17 02:33:37 m1.goodfellow.local vdsm[6511]: WARN MOM not available.
mag 17 02:33:37 m1.goodfellow.local vdsm[6511]: WARN MOM not available, KSM stats will be missing.
mag 17 02:33:37 m1.goodfellow.local vdsm[6511]: ERROR failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished?
[...]
mag 17 02:33:49 m1.goodfellow.local systemd[1]: ovirt-ha-agent.service holdoff time over, scheduling restart.
mag 17 02:33:49 m1.goodfellow.local systemd[1]: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
mag 17 02:33:49 m1.goodfellow.local systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
mag 17 02:33:49 m1.goodfellow.local systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...
mag 17 02:33:49 m1.goodfellow.local systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
mag 17 02:33:49 m1.goodfellow.local systemd[1]: Starting oVirt Hosted Engine High Availability Monitoring Agent...
mag 17 02:33:50 m1.goodfellow.local python[9792]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent'
mag 17 02:33:50 m1.goodfellow.local python[9791]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
mag 17 02:33:50 m1.goodfellow.local abrt-server[9846]: Not saving repeating crash in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent'
mag 17 02:33:50 m1.goodfellow.local systemd[1]: ovirt-ha-agent.service: main process exited, code=exited, status=1/FAILURE
mag 17 02:33:50 m1.goodfellow.local systemd[1]: Unit ovirt-ha-agent.service entered failed state.
mag 17 02:33:50 m1.goodfellow.local systemd[1]: ovirt-ha-agent.service failed.
mag 17 02:33:51 m1.goodfellow.local kernel: warning: `turbostat' uses 32-bit capabilities (legacy support in use)
mag 17 02:33:52 m1.goodfellow.local vdsm[6511]: ERROR failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished?

For an average end-user, reading "Is the Hosted Engine setup finished?" is very scary, because it transmits a sense of insecurity about the hosted-engine-ha cluster, which is the core of the infrastructure and the way to manage it. The message by itself is also not that clear: you have to go into the abrtd crash folder and read the backtrace to get meaningful hints about what is going on.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.11-1.el7ev.noarch
redhat-release-virtualization-host-4.2-3.0.el7.x86_64

How reproducible:
Always (experienced in 3 out of 3 nodes, starting from different releases of RHV-H 4.1)

Steps to Reproduce:
1. Upgrade RHV-H to 4.2.3 from RHV-H 4.1
2. Reboot the node

Actual results:
The ovirt-ha-agent and ovirt-ha-broker services do not start, reducing the number of members of the hosted-engine cluster.

Expected results:
Painless upgrade, with no change in the permissions of the log files.

Additional info:
These are the backtraces abrtd collected for both daemons:

[root@m1 abrt]# cat Python-2018-05-17-02\:34\:17-14299/backtrace
__init__.py:925:_open:IOError: [Errno 13] Permission denied: '/var/log/ovirt-hosted-engine-ha/broker.log'

Traceback (most recent call last):
  File "/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker", line 25, in <module>
    broker.Broker().run()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 44, in run
    self._initialize_logging()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 70, in _initialize_logging
    disable_existing_loggers=False)
  File "/usr/lib64/python2.7/logging/config.py", line 78, in fileConfig
    handlers = _install_handlers(cp, formatters)
  File "/usr/lib64/python2.7/logging/config.py", line 156, in _install_handlers
    h = klass(*args)
  File "/usr/lib64/python2.7/logging/handlers.py", line 169, in __init__
    BaseRotatingHandler.__init__(self, filename, 'a', encoding, delay)
  File "/usr/lib64/python2.7/logging/handlers.py", line 64, in __init__
    logging.FileHandler.__init__(self, filename, mode, encoding, delay)
  File "/usr/lib64/python2.7/logging/__init__.py", line 902, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib64/python2.7/logging/__init__.py", line 925, in _open
    stream = open(self.baseFilename, self.mode)
IOError: [Errno 13] Permission denied: '/var/log/ovirt-hosted-engine-ha/broker.log'

Local variables in innermost frame:
self: <logging.handlers.TimedRotatingFileHandler object at 0x7fcd4d7d8e10>

[root@m1 abrt]# cat Python-2018-05-17-02\:34\:17-14300/backtrace
__init__.py:925:_open:IOError: [Errno 13] Permission denied: '/var/log/ovirt-hosted-engine-ha/agent.log'

Traceback (most recent call last):
  File "/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent", line 25, in <module>
    agent.Agent().run()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 66, in run
    self._initialize_logging()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 95, in _initialize_logging
    disable_existing_loggers=False)
  File "/usr/lib64/python2.7/logging/config.py", line 78, in fileConfig
    handlers = _install_handlers(cp, formatters)
  File "/usr/lib64/python2.7/logging/config.py", line 156, in _install_handlers
    h = klass(*args)
  File "/usr/lib64/python2.7/logging/handlers.py", line 169, in __init__
    BaseRotatingHandler.__init__(self, filename, 'a', encoding, delay)
  File "/usr/lib64/python2.7/logging/handlers.py", line 64, in __init__
    logging.FileHandler.__init__(self, filename, mode, encoding, delay)
  File "/usr/lib64/python2.7/logging/__init__.py", line 902, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib64/python2.7/logging/__init__.py", line 925, in _open
    stream = open(self.baseFilename, self.mode)
IOError: [Errno 13] Permission denied: '/var/log/ovirt-hosted-engine-ha/agent.log'

Local variables in innermost frame:
self: <logging.handlers.TimedRotatingFileHandler object at 0x7fd9477a0890>

(Originally by Andrea Perotti)
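The backtraces above boil down to one condition: the daemon user can no longer open its own log for append. A small standalone sketch of the preflight check an administrator could run before starting the daemons — `check_log_access` and its `vdsm`/`kvm` defaults are illustrative, not part of ovirt-hosted-engine-ha:

```python
import grp
import os
import pwd
import stat


def check_log_access(path, user="vdsm", group="kvm"):
    """Return a list of problems that would stop `user` from appending
    to `path` (the same thing logging's FileHandler trips over)."""
    problems = []
    st = os.stat(path)
    owner = pwd.getpwuid(st.st_uid).pw_name
    grp_name = grp.getgrgid(st.st_gid).gr_name
    if owner != user:
        problems.append("owner is %s, expected %s" % (owner, user))
    if grp_name != group:
        problems.append("group is %s, expected %s" % (grp_name, group))
    if not st.st_mode & stat.S_IWUSR:
        problems.append("mode %03o gives the owner no write bit"
                        % stat.S_IMODE(st.st_mode))
    return problems
```

On an affected host, `check_log_access('/var/log/ovirt-hosted-engine-ha/broker.log')` would report both the root:root ownership and, in the later 000-mode variant, the missing owner write bit.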
Hi, can you please attach /var/log/imgbased.log? (Originally by Yuval Turgeman)
Created attachment 1440337 [details] Archive containing imgbased.log for all 3 hypervisors (Originally by Andrea Perotti)
Deployed SHE over NFS on a pair of hosts with these components:
ovirt-hosted-engine-ha-2.1.11-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.4.2-1.el7ev.noarch
ovirt-engine-setup-4.1.11.2-0.1.el7.noarch
rhvm-appliance-20180419.0-1.x86_64.rhevm.ova
Red Hat Enterprise Linux release 7.5-0.14
Linux 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

Upgraded the engine to the latest components:
ovirt-engine-setup-4.2.4.2-0.1.el7_3.noarch
Linux 3.10.0-862.6.1.el7.x86_64 #1 SMP Mon Jun 4 15:33:25 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)

Upgraded the hosts from the engine UI using "Installation->Check for Upgrade" and then "Installation->Upgrade", one by one; the engine then restarted the hosts after the upgrade.

[root@alma04 ~]# yum list all
Available Packages
redhat-virtualization-host-image-update.noarch 4.2-20180605.0.el7_5 rhvh-qe

Components after the upgrade:
redhat-virtualization-host-image-update-4.2-20180605.0.el7_5.noarch.rpm
ovirt-hosted-engine-ha-2.2.13-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.22-1.el7ev.noarch
vdsm-4.20.29-1.el7ev.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.3.x86_64
libvirt-3.9.0-14.el7_5.5.x86_64
sanlock-3.6.0-1.el7.x86_64

The result was as follows: the upgraded RHVH was not counted as an ha-capable host in the UI ("Hosted Engine HA: Not Active"). ha-agent was running, but ha-broker was in failed status.

[root@alma04 ~]# systemctl status ovirt-ha-agent
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2018-06-14 09:55:51 IDT; 2s ago
 Main PID: 34023 (ovirt-ha-agent)
    Tasks: 1
   Memory: 20.2M
   CGroup: /system.slice/ovirt-ha-agent.service
           └─34023 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent

Jun 14 09:55:51 alma04.qa.lab.tlv.redhat.com systemd[1]: Started oVirt Hosted Engine High Availability Monitori...ent.
Jun 14 09:55:51 alma04.qa.lab.tlv.redhat.com systemd[1]: Starting oVirt Hosted Engine High Availability Monitor...t...
Jun 14 09:55:52 alma04.qa.lab.tlv.redhat.com python[34023]: detected unhandled Python exception in '/usr/share/o...nt'
Hint: Some lines were ellipsized, use -l to show in full.

[root@alma04 ~]# systemctl status ovirt-ha-broker
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Thu 2018-06-14 09:56:16 IDT; 914ms ago
  Process: 35178 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker (code=exited, status=1/FAILURE)
 Main PID: 35178 (code=exited, status=1/FAILURE)

Jun 14 09:56:16 alma04.qa.lab.tlv.redhat.com systemd[1]: Unit ovirt-ha-broker.service entered failed state.
Jun 14 09:56:16 alma04.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-broker.service failed.
Jun 14 09:56:16 alma04.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-broker.service holdoff time over, scheduling ...art.
Jun 14 09:56:16 alma04.qa.lab.tlv.redhat.com systemd[1]: start request repeated too quickly for ovirt-ha-broker...vice
Jun 14 09:56:16 alma04.qa.lab.tlv.redhat.com systemd[1]: Failed to start oVirt Hosted Engine High Availability ...ker.
Jun 14 09:56:16 alma04.qa.lab.tlv.redhat.com systemd[1]: Unit ovirt-ha-broker.service entered failed state.
Jun 14 09:56:16 alma04.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-broker.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

[root@alma04 ~]# systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Thu 2018-06-14 09:59:22 IDT; 6s ago
  Process: 39568 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=1/FAILURE)
 Main PID: 39568 (code=exited, status=1/FAILURE)
    Tasks: 0
   Memory: 0B
   CGroup: /system.slice/ovirt-ha-agent.service

Jun 14 09:59:22 alma04.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-agent.service failed.

[root@alma04 ~]# systemctl status ovirt-ha-broker -l
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Thu 2018-06-14 09:59:40 IDT; 902ms ago
  Process: 40811 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker (code=exited, status=1/FAILURE)
 Main PID: 40811 (code=exited, status=1/FAILURE)

Jun 14 09:59:40 alma04.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-broker.service: main process exited, code=exited, status=1/FAILURE
Jun 14 09:59:40 alma04.qa.lab.tlv.redhat.com systemd[1]: Unit ovirt-ha-broker.service entered failed state.
Jun 14 09:59:40 alma04.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-broker.service failed.
Jun 14 09:59:40 alma04.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-broker.service holdoff time over, scheduling restart.
Jun 14 09:59:40 alma04.qa.lab.tlv.redhat.com systemd[1]: start request repeated too quickly for ovirt-ha-broker.service
Jun 14 09:59:40 alma04.qa.lab.tlv.redhat.com systemd[1]: Failed to start oVirt Hosted Engine High Availability Communications Broker.
Jun 14 09:59:40 alma04.qa.lab.tlv.redhat.com systemd[1]: Unit ovirt-ha-broker.service entered failed state.
Jun 14 09:59:40 alma04.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-broker.service failed.

The initial issue of this bug was resolved as follows:

[root@alma04 ~]# ll -lsha /var/log/ovirt-hosted-engine-ha/broker.log
19M ----------. 1 vdsm kvm 19M Jun 14 09:42 /var/log/ovirt-hosted-engine-ha/broker.log
[root@alma04 ~]# ll -lsha /var/log/ovirt-hosted-engine-ha/agent.log
3.1M ----------. 1 vdsm kvm 3.1M Jun 14 09:39 /var/log/ovirt-hosted-engine-ha/agent.log

Moving this bug to verified; will open a separate bug on the new issues found.
Further to the previous comment, please see https://bugzilla.redhat.com/show_bug.cgi?id=1591150 for more details.
(In reply to Nikolai Sednev from comment #7)
> The initial issue of this bug was resolved as follows:
> alma04 ~]# ll -lsha /var/log/ovirt-hosted-engine-ha/broker.log
> 19M ----------. 1 vdsm kvm 19M Jun 14 09:42 /var/log/ovirt-hosted-engine-ha/broker.log
> [root@alma04 ~]# ll -lsha /var/log/ovirt-hosted-engine-ha/agent.log
> 3.1M ----------. 1 vdsm kvm 3.1M Jun 14 09:39 /var/log/ovirt-hosted-engine-ha/agent.log

According to this, the issue is still the same in my opinion: 000 doesn't look like a good permission set for appending to that log file.
chmod 644 /var/log/ovirt-hosted-engine-ha/*.log seems a valid workaround.
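A slightly fuller version of that workaround might also restore the expected ownership before fixing the mode. This is only a sketch for repairing an already-broken host, not the actual fix shipped in the product; the `fix_ha_log_perms` helper name and its directory argument are illustrative:

```shell
# Restore ownership and mode on the HA daemon logs so the daemons can
# reopen them. Defaults to the real log directory; an explicit argument
# is accepted for testing.
fix_ha_log_perms() {
    dir="${1:-/var/log/ovirt-hosted-engine-ha}"
    # chown is skipped silently on machines without a vdsm user
    chown vdsm:kvm "$dir"/*.log 2>/dev/null || true
    chmod 644 "$dir"/*.log
}
```

After running it on an affected host, the daemons still need a restart to pick the files up, e.g. `systemctl restart ovirt-ha-broker ovirt-ha-agent`.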
The uid/gid were fixed from root:root back to the normal vdsm:kvm, but the permission bits are wrong ("----------.").
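For reference, "----------." in the ls output above corresponds to numeric mode 000 (the trailing dot only marks an SELinux security context, not a permission bit), which Python's stat module can confirm:

```python
import stat

# A regular file with mode 000 renders as "----------": no read, write,
# or execute bit for owner, group, or others.
print(stat.filemode(stat.S_IFREG | 0o000))  # -> ----------
# The workaround mode 644 renders as the familiar "-rw-r--r--".
print(stat.filemode(stat.S_IFREG | 0o644))  # -> -rw-r--r--
```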
*** Bug 1591150 has been marked as a duplicate of this bug. ***
On 4.1 RHVH I saw:

[root@alma03 ~]# ll -lsha /var/log/ovirt-hosted-engine-ha/agent.log
12K -rw-r--r--. 1 root root 12K Jun 18 17:01 /var/log/ovirt-hosted-engine-ha/agent.log
[root@alma03 ~]# ll -lsha /var/log/ovirt-hosted-engine-ha/broker.log
32K -rw-r--r--. 1 root root 32K Jun 18 17:01 /var/log/ovirt-hosted-engine-ha/broker.log

Components were as follows:
ovirt-hosted-engine-ha-2.1.11-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.4.2-1.el7ev.noarch

Available Packages
redhat-virtualization-host-image-update.noarch 4.2-20180615.0.el7_5 rhvh-qe

Then I tried to upgrade the 4.1 RHVH from the UI to RHVH 4.2 and rebooted it. I saw this in the UI:

Failed to upgrade Host alma03 (User: admin@internal-authz). 6/18/18 5:35:06 PM
Failed to upgrade Host alma03 (User: admin@internal-authz). 6/18/18 5:35:04 PM
Host alma03 was switched to Maintenance Mode. 6/18/18 5:04:58 PM
Host alma03 upgrade was started (User: admin@internal-authz). 6/18/18 5:04:58 PM

ha-agent fails to start on the host, and the permissions are still incorrect (-rw-r-----. 1 vdsm kvm, instead of -rw-r--r--.):

[root@alma03 ~]# systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Mon 2018-06-18 18:12:32 IDT; 9s ago
  Process: 30019 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=157)
 Main PID: 30019 (code=exited, status=157)

Jun 18 18:12:32 alma03.qa.lab.tlv.redhat.com systemd[1]: Unit ovirt-ha-agent.service entered failed state.
Jun 18 18:12:32 alma03.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-agent.service failed.

[root@alma03 ~]# systemctl status ovirt-ha-broker -l
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-06-18 18:12:26 IDT; 22s ago
 Main PID: 29910 (ovirt-ha-broker)
    Tasks: 1
   Memory: 18.2M
   CGroup: /system.slice/ovirt-ha-broker.service
           └─29910 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker

Jun 18 18:12:26 alma03.qa.lab.tlv.redhat.com systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
Jun 18 18:12:26 alma03.qa.lab.tlv.redhat.com systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...

[root@alma03 ~]# ll -lsha /var/log/ovirt-hosted-engine-ha/agent.log
260K -rw-r-----. 1 vdsm kvm 253K Jun 18 18:13 /var/log/ovirt-hosted-engine-ha/agent.log
[root@alma03 ~]# ll -lsha /var/log/ovirt-hosted-engine-ha/broker.log
524K -rw-r-----. 1 vdsm kvm 517K Jun 18 18:12 /var/log/ovirt-hosted-engine-ha/broker.log

Components on the host after the upgrade:
ovirt-hosted-engine-ha-2.2.14-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.22-1.el7ev.noarch
Linux 3.10.0-862.3.2.el7.x86_64 #1 SMP Tue May 15 18:22:15 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 7.5
Created attachment 1452671 [details] sosreport from alma03
Created attachment 1452675 [details] engine logs
Engine version: ovirt-engine-setup-4.2.4.4-0.1.el7_3.noarch
Moving to Ido since Simone is not available this week
Hi Ido, how did this move to ON_QA from POST? QE still doesn't have the new build, so we can't test it. Moving back to POST for proper process.
Note to QE: For testing the fix imgbased-1.0.20 is needed. A new RHV-H build including this package will be done today.
Tested with these components on the engine:
ovirt-engine-setup-4.2.4.5-0.1.el7_3.noarch
Linux 3.10.0-862.6.3.el7.x86_64 #1 SMP Fri Jun 15 17:57:37 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)

1. Deployed SHE on 4.1 RHVH. Components on the host:
ovirt-hosted-engine-ha-2.1.11-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.4.2-1.el7ev.noarch
Linux 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 7.5-0.14
2. Added an additional 4.1 RHVH as an ha-host.
3. Checked for an upgrade for one of the hosts via the UI.
4. Upgraded the 4.1 RHVH from the UI to 4.2 RHVH (redhat-virtualization-host-4.2-20180622.0) and the host did not recover:

Failed to upgrade Host alma03.qa.lab.tlv.redhat.com (User: admin@internal-authz). 6/25/18 3:03:31 PM
Failed to upgrade Host alma03.qa.lab.tlv.redhat.com (User: admin@internal-authz). 6/25/18 3:03:28 PM
Host alma03.qa.lab.tlv.redhat.com was switched to Maintenance Mode. 6/25/18 2:33:24 PM
Host alma03.qa.lab.tlv.redhat.com upgrade was started (User: admin@internal-authz). 6/25/18 2:33:24 PM
Check for available updates on host alma03.qa.lab.tlv.redhat.com was completed successfully with message 'found updates for packages redhat-virtualization-host-image-update'. 6/25/18 2:30:54 PM
Host alma03.qa.lab.tlv.redhat.com has available updates: redhat-virtualization-host-image-update. 6/25/18 2:30:54 PM
Started to check for available updates on host alma03.qa.lab.tlv.redhat.com. 6/25/18 2:30:36 PM
Status of host alma03.qa.lab.tlv.redhat.com was set to Up. 6/25/18 2:26:50 PM

The host appeared as "Install Failed" in the UI and like this in the CLI:

--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : alma03
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : False
Local maintenance                  : True
crc32                              : 8546a3cf
local_conf_timestamp               : 5994
Host timestamp                     : 5979
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=5979 (Mon Jun 25 14:59:27 2018)
    host-id=2
    score=0
    vm_conf_refresh_time=5994 (Mon Jun 25 14:59:42 2018)
    conf_on_shared_storage=True
    maintenance=True
    state=LocalMaintenance
    stopped=False

5. I ran "hosted-engine --set-maintenance --mode=none" on the host to release it from local maintenance.
6. The host remained in local maintenance, regardless of step 5.
[root@alma03 ~]# systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-06-25 14:25:47 IDT; 51min ago
 Main PID: 19189 (ovirt-ha-agent)
    Tasks: 2
   Memory: 36.6M
   CGroup: /system.slice/ovirt-ha-agent.service
           └─19189 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon

Jun 25 14:25:47 alma03.qa.lab.tlv.redhat.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
Jun 25 14:25:47 alma03.qa.lab.tlv.redhat.com systemd[1]: Starting oVirt Hosted Engine High Availability Monitoring Agent...

[root@alma03 ~]# systemctl status ovirt-ha-broker -l
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-06-25 14:25:47 IDT; 51min ago
 Main PID: 19188 (ovirt-ha-broker)
    Tasks: 7
   Memory: 44.5M
   CGroup: /system.slice/ovirt-ha-broker.service
           └─19188 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon

Jun 25 14:25:47 alma03.qa.lab.tlv.redhat.com systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
Jun 25 14:25:47 alma03.qa.lab.tlv.redhat.com systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...
Jun 25 14:25:49 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[19188]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.listener.ConnectionHandler ERROR Error handling request, data: 'set-storage-domain FilesystemBackend dom_type=nfs3 sd_uuid=4fb972d3-5659-4a0f-a564-65e83fc81f72'
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 166, in handle
    data)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 299, in _dispatch
    .set_storage_domain(client, sd_type, **options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 66, in set_storage_domain
    self._backends[client].connect()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 462, in connect
    self._dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 107, in get_domain_path
    " in {1}".format(sd_uuid, parent))
BackendFailureException: path to storage domain 4fb972d3-5659-4a0f-a564-65e83fc81f72 not found in /rhev/data-center/mnt

Jun 25 15:00:08 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[19188]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.submonitor_base.SubmonitorBase ERROR Error executing submonitor mgmt-bridge, args {'use_ssl': 'true', 'bridge_name': 'ovirtmgmt', 'address': '0'}
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitor_base.py", line 115, in _worker
    self.action(self._options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors/mgmt_bridge.py", line 42, in action
    logger=self._log
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 187, in connect_vdsm_json_rpc
    requestQueue=requestQueue,
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 248, in connect
    responseQueue)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 234, in _create
    lazy_start=False)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/stompreactor.py", line 629, in StandAloneRpcClient
    client = StompClient(utils.create_connected_socket(host, port, sslctx),
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 947, in create_connected_socket
    sock = sslctx.wrapSocket(sock)
  File "/usr/lib/python2.7/site-packages/vdsm/sslutils.py", line 101, in wrapSocket
    ca_certs=self.ca_certs)
  File "/usr/lib64/python2.7/ssl.py", line 936, in wrap_socket
    ciphers=ciphers)
  File "/usr/lib64/python2.7/ssl.py", line 547, in __init__
    self._context.load_verify_locations(ca_certs)
IOError: [Errno 2] No such file or directory
Nikolai, did the host reboot after the upgrade?
(In reply to Yuval Turgeman from comment #24)
> Nikolai, did the host reboot after the upgrade?

During the upgrade, the host could not be restarted from the UI, as I had not configured remote power management. I restarted it manually; after that the host seemed to be upgrading something, then it rebooted on its own again and came up with a black screen, not letting me log in for a few minutes (15-20), after which I got the login prompt.

Now the host has these components:
ovirt-hosted-engine-setup-2.2.22-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.14-1.el7ev.noarch
Linux 3.10.0-862.6.3.el7.x86_64 #1 SMP Fri Jun 15 17:57:37 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 7.5

[root@alma03 ~]# ll -lsha /var/log/ovirt-hosted-engine-ha/agent.log
368K -rw-r-----. 1 vdsm kvm 363K Jun 25 16:18 /var/log/ovirt-hosted-engine-ha/agent.log
[root@alma03 ~]# ll -lsha /var/log/ovirt-hosted-engine-ha/broker.log
724K -rw-r-----. 1 vdsm kvm 718K Jun 25 16:18 /var/log/ovirt-hosted-engine-ha/broker.log

I tried to activate the host from the UI and it activated successfully:

Check for available updates on host alma03.qa.lab.tlv.redhat.com was completed successfully with message 'no updates found.'. 6/25/18 4:20:48 PM
Started to check for available updates on host alma03.qa.lab.tlv.redhat.com. 6/25/18 4:20:29 PM
Successfully refreshed the capabilities of host alma03.qa.lab.tlv.redhat.com. 6/25/18 4:20:25 PM
Status of host alma03.qa.lab.tlv.redhat.com was set to Up. 6/25/18 4:19:19 PM
No faulty multipath paths on host alma03.qa.lab.tlv.redhat.com 6/25/18 4:19:19 PM
Activation of host alma03.qa.lab.tlv.redhat.com initiated by admin@internal-authz. 6/25/18 4:19:15 PM
Successfully refreshed the capabilities of host alma03.qa.lab.tlv.redhat.com. 6/25/18 4:13:56 PM
Since both the agent and the broker are running correctly after the upgrade, moving this bug to verified. Please also see https://bugzilla.redhat.com/show_bug.cgi?id=1534197#c57.
See also https://bugzilla.redhat.com/show_bug.cgi?id=1594836, I've opened it to document the restart issue there.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2066