Description of problem:
After upgrading 3 out of 3 hypervisors, all of them failed to start ovirt-ha-{agent,broker}. The components are restarted by systemd and keep triggering abrtd, which analyzes the resulting Python dumps. The root cause is that, for some reason, after the upgrade the logs:
'/var/log/ovirt-hosted-engine-ha/broker.log'
'/var/log/ovirt-hosted-engine-ha/agent.log'
change their ownership from vdsm:kvm to root:root, making it impossible for the hosted-engine daemons to open them and filling the system log with scary messages like:

mag 17 02:32:50 m1.goodfellow.local systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
mag 17 02:32:50 m1.goodfellow.local systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...
mag 17 02:32:51 m1.goodfellow.local python[1624]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
mag 17 02:32:51 m1.goodfellow.local abrt-server[1629]: Not saving repeating crash in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
mag 17 02:32:51 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service: main process exited, code=exited, status=1/FAILURE
mag 17 02:32:51 m1.goodfellow.local systemd[1]: Unit ovirt-ha-broker.service entered failed state.
mag 17 02:32:51 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service failed.
[...]
mag 17 02:32:53 m1.goodfellow.local systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
mag 17 02:32:53 m1.goodfellow.local systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...
mag 17 02:32:54 m1.goodfellow.local python[1828]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
mag 17 02:32:54 m1.goodfellow.local abrt-server[1835]: Not saving repeating crash in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
mag 17 02:32:54 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service: main process exited, code=exited, status=1/FAILURE
mag 17 02:32:54 m1.goodfellow.local systemd[1]: Unit ovirt-ha-broker.service entered failed state.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service failed.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service holdoff time over, scheduling restart.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: start request repeated too quickly for ovirt-ha-broker.service
mag 17 02:32:54 m1.goodfellow.local systemd[1]: Failed to start oVirt Hosted Engine High Availability Communications Broker.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: Unit ovirt-ha-broker.service entered failed state.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service failed.
[...]
mag 17 02:33:36 m1.goodfellow.local vdsm[6511]: WARN Not ready yet, ignoring event '|net|host_conn|no_id' args={}
mag 17 02:33:36 m1.goodfellow.local vdsm[6511]: WARN Not ready yet, ignoring event '|net|host_conn|no_id' args={}
mag 17 02:33:36 m1.goodfellow.local vdsm[6511]: WARN File: /var/run/vdsm/trackedInterfaces/ovirtmgmt already removed
mag 17 02:33:36 m1.goodfellow.local vdsm[6511]: WARN Not ready yet, ignoring event '|net|host_conn|no_id' args={}
mag 17 02:33:36 m1.goodfellow.local vdsm[6511]: WARN File: /var/run/vdsm/trackedInterfaces/eno2 already removed
mag 17 02:33:37 m1.goodfellow.local vdsm[6511]: WARN MOM not available.
mag 17 02:33:37 m1.goodfellow.local vdsm[6511]: WARN MOM not available, KSM stats will be missing.
mag 17 02:33:37 m1.goodfellow.local vdsm[6511]: ERROR failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory' Is the Hosted Engine setup finished?
[...]
mag 17 02:33:49 m1.goodfellow.local systemd[1]: ovirt-ha-agent.service holdoff time over, scheduling restart.
mag 17 02:33:49 m1.goodfellow.local systemd[1]: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
mag 17 02:33:49 m1.goodfellow.local systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
mag 17 02:33:49 m1.goodfellow.local systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...
mag 17 02:33:49 m1.goodfellow.local systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
mag 17 02:33:49 m1.goodfellow.local systemd[1]: Starting oVirt Hosted Engine High Availability Monitoring Agent...
mag 17 02:33:50 m1.goodfellow.local python[9792]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent'
mag 17 02:33:50 m1.goodfellow.local python[9791]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
mag 17 02:33:50 m1.goodfellow.local abrt-server[9846]: Not saving repeating crash in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent'
mag 17 02:33:50 m1.goodfellow.local systemd[1]: ovirt-ha-agent.service: main process exited, code=exited, status=1/FAILURE
mag 17 02:33:50 m1.goodfellow.local systemd[1]: Unit ovirt-ha-agent.service entered failed state.
mag 17 02:33:50 m1.goodfellow.local systemd[1]: ovirt-ha-agent.service failed.
mag 17 02:33:51 m1.goodfellow.local kernel: warning: `turbostat' uses 32-bit capabilities (legacy support in use)
mag 17 02:33:52 m1.goodfellow.local vdsm[6511]: ERROR failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory' Is the Hosted Engine setup finished?

For an average end-user, reading "Is the Hosted Engine setup finished?"
is very scary, because it conveys a sense of insecurity about the hosted-engine HA cluster, which is the core of the infrastructure and the way the infrastructure is managed. The message itself is not very clear either: you have to go into the abrtd crash folder and read the backtrace to get any meaningful hint about what is going on.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.11-1.el7ev.noarch
redhat-release-virtualization-host-4.2-3.0.el7.x86_64

How reproducible:
Always (experienced in 3 out of 3 nodes, starting from different releases of RHV-H 4.1)

Steps to Reproduce:
1. upgrade RHV-H to 4.2.3 from RHV-H 4.1
2. reboot the node

Actual results:
The ovirt-ha-agent and ovirt-ha-broker services do not start, reducing the number of members of the hosted-engine cluster.

Expected results:
Painless upgrade, no change in the ownership of the log files.

Additional info:
These are the backtraces abrtd collected for both daemons:

[root@m1 abrt]# cat Python-2018-05-17-02\:34\:17-14299/backtrace
__init__.py:925:_open:IOError: [Errno 13] Permission denied: '/var/log/ovirt-hosted-engine-ha/broker.log'
Traceback (most recent call last):
  File "/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker", line 25, in <module>
    broker.Broker().run()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 44, in run
    self._initialize_logging()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 70, in _initialize_logging
    disable_existing_loggers=False)
  File "/usr/lib64/python2.7/logging/config.py", line 78, in fileConfig
    handlers = _install_handlers(cp, formatters)
  File "/usr/lib64/python2.7/logging/config.py", line 156, in _install_handlers
    h = klass(*args)
  File "/usr/lib64/python2.7/logging/handlers.py", line 169, in __init__
    BaseRotatingHandler.__init__(self, filename, 'a', encoding, delay)
  File "/usr/lib64/python2.7/logging/handlers.py", line 64, in __init__
    logging.FileHandler.__init__(self, filename, mode, encoding, delay)
  File "/usr/lib64/python2.7/logging/__init__.py", line 902, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib64/python2.7/logging/__init__.py", line 925, in _open
    stream = open(self.baseFilename, self.mode)
IOError: [Errno 13] Permission denied: '/var/log/ovirt-hosted-engine-ha/broker.log'
Local variables in innermost frame:
self: <logging.handlers.TimedRotatingFileHandler object at 0x7fcd4d7d8e10>

[root@m1 abrt]# cat Python-2018-05-17-02\:34\:17-14300/backtrace
__init__.py:925:_open:IOError: [Errno 13] Permission denied: '/var/log/ovirt-hosted-engine-ha/agent.log'
Traceback (most recent call last):
  File "/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent", line 25, in <module>
    agent.Agent().run()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 66, in run
    self._initialize_logging()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 95, in _initialize_logging
    disable_existing_loggers=False)
  File "/usr/lib64/python2.7/logging/config.py", line 78, in fileConfig
    handlers = _install_handlers(cp, formatters)
  File "/usr/lib64/python2.7/logging/config.py", line 156, in _install_handlers
    h = klass(*args)
  File "/usr/lib64/python2.7/logging/handlers.py", line 169, in __init__
    BaseRotatingHandler.__init__(self, filename, 'a', encoding, delay)
  File "/usr/lib64/python2.7/logging/handlers.py", line 64, in __init__
    logging.FileHandler.__init__(self, filename, mode, encoding, delay)
  File "/usr/lib64/python2.7/logging/__init__.py", line 902, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib64/python2.7/logging/__init__.py", line 925, in _open
    stream = open(self.baseFilename, self.mode)
IOError: [Errno 13] Permission denied: '/var/log/ovirt-hosted-engine-ha/agent.log'
Local variables in innermost frame:
self: <logging.handlers.TimedRotatingFileHandler object at 0x7fd9477a0890>
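Both backtraces show the daemons dying inside logging setup before they can report anything useful. A minimal preflight check along these lines would surface the ownership mismatch directly (hypothetical helper, not part of ovirt-hosted-engine-ha; the expected uid/gid would be those of vdsm:kvm on a real host):

```python
import os


def log_ownership_ok(path, expected_uid, expected_gid=None):
    """Return True if `path` exists and is owned by the expected uid
    (and gid, when given). A rough preflight for the daemons' log
    files; hypothetical helper for illustration only."""
    try:
        st = os.stat(path)
    except OSError:
        # Missing file: the daemon would create it, so nothing to check.
        return False
    if st.st_uid != expected_uid:
        return False
    if expected_gid is not None and st.st_gid != expected_gid:
        return False
    return True
```

On an affected host, `log_ownership_ok('/var/log/ovirt-hosted-engine-ha/broker.log', pwd.getpwnam('vdsm').pw_uid)` would return False, pointing straight at the root:root ownership instead of the opaque "Is the Hosted Engine setup finished?" message.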
Hi, can you please attach /var/log/imgbased.log?
Created attachment 1440337 [details] Archive containing imgbased.log for all 3 hypervisors
This just happened to me too. I am upgrading RHEV from 4.1 to 4.2, and the first hypervisor hit this problem when it was rebooted by the manager. I chowned the two log files to vdsm:kvm and it auto-magically recovered. Will this happen again when the log files rotate? I won't upgrade the other 2 hypervisors until I am sure about this problem.
Hello,
I can confirm that after changing the ownership, in my environment composed of two nodes, log rotation hasn't had any problem. It is now 15 days since upgrading both RHV-H nodes from latest 4.1 to 4.2 (as of 23/05), e.g.:

[root@rhevora3 ovirt-hosted-engine-ha]# ll
total 29176
-rw-r--r--. 1 vdsm kvm 1861815 Jun 7 16:34 agent.log
-rw-r--r--. 1 vdsm kvm 2110007 May 31 19:18 agent.log.2018-05-30
-rw-r--r--. 1 vdsm kvm 2109216 Jun 1 19:18 agent.log.2018-05-31
-rw-r--r--. 1 vdsm kvm 2109085 Jun 2 19:18 agent.log.2018-06-01
-rw-r--r--. 1 vdsm kvm 2106519 Jun 3 19:18 agent.log.2018-06-02
-rw-r--r--. 1 vdsm kvm 2107300 Jun 4 19:19 agent.log.2018-06-03
-rw-r--r--. 1 vdsm kvm 2106155 Jun 5 19:19 agent.log.2018-06-04
-rw-r--r--. 1 vdsm kvm 2104958 Jun 6 19:19 agent.log.2018-06-05
-rw-r--r--. 1 vdsm kvm 1499470 Jun 7 16:34 broker.log
-rw-r--r--. 1 vdsm kvm 1644019 May 31 19:12 broker.log.2018-05-30
-rw-r--r--. 1 vdsm kvm 1655812 Jun 1 19:12 broker.log.2018-05-31
-rw-r--r--. 1 vdsm kvm 1661890 Jun 2 19:11 broker.log.2018-06-01
-rw-r--r--. 1 vdsm kvm 1670210 Jun 3 19:12 broker.log.2018-06-02
-rw-r--r--. 1 vdsm kvm 1673656 Jun 4 19:12 broker.log.2018-06-03
-rw-r--r--. 1 vdsm kvm 1674546 Jun 5 19:12 broker.log.2018-06-04
-rw-r--r--. 1 vdsm kvm 1684896 Jun 6 19:12 broker.log.2018-06-05
[root@rhevora3 ovirt-hosted-engine-ha]#
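The workaround described in the comments above (chown the two log files back to vdsm:kvm and let systemd restart the services) can be sketched as a small helper. This is a hypothetical script, not an official tool; the paths and the vdsm:kvm owner are taken from this report, and on a real host it must run as root:

```python
import os


def fix_log_ownership(paths, uid, gid):
    """Restore ownership on the given log files and return the list of
    files actually fixed. Missing files are skipped, since the daemons
    recreate them on startup."""
    fixed = []
    for path in paths:
        if os.path.exists(path):
            os.chown(path, uid, gid)
            fixed.append(path)
    return fixed


# On an affected hypervisor, as root, one would run something like:
#   import pwd, grp
#   fix_log_ownership(
#       ["/var/log/ovirt-hosted-engine-ha/broker.log",
#        "/var/log/ovirt-hosted-engine-ha/agent.log"],
#       pwd.getpwnam("vdsm").pw_uid,
#       grp.getgrnam("kvm").gr_gid)
# and then restart ovirt-ha-broker and ovirt-ha-agent via systemctl.
```

Per the comment above, once ownership is restored the rotated files inherit the correct vdsm:kvm owner, so the fix only needs to be applied once.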
Hello,
I don't know if this build is already expected to contain the fix, as the target appears to be 4.2.4, but I have updated one RHEV 4.1 environment, based on 2 RHV-H nodes originally at version rhvh-4.1-0.20171101.0+1, with a self-hosted engine. After following the typical upgrade flow for the self-hosted engine, first to latest 4.1 (4.1.11.2-0.1.el7) and then to 4.2 (4.2.3.8-0.1.el7), I updated the RHV-H hosts to version 4.2-20180531.0.el7_5. For both hosts the permissions were OK, as expected. I had another problem on only one of the 2 hosts, which dropped me to a dracut shell when booting the 4.2 image, but I'm going to open a dedicated case and possibly a Bugzilla for it.
Gianluca
Works for me on ovirt-hosted-engine-ha-2.2.14-1.el7ev.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:1049
sync2jira