Bug 1579103 - RHV-H 4.2.3: hosted-engine agent fails to start after upgrade due to Permission denied: '/var/log/ovirt-hosted-engine-ha/broker.log' '/var/log/ovirt-hosted-engine-ha/agent.log'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 4.2.3
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.3.0
Target Release: ---
Assignee: Martin Sivák
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks: 1585028 1627790
 
Reported: 2018-05-17 01:27 UTC by Andrea Perotti
Modified: 2021-12-10 16:10 UTC
CC List: 11 users

Fixed In Version: v2.3.0-0.1.beta
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1585028 (view as bug list)
Environment:
Last Closed: 2019-05-08 12:31:57 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:


Attachments
Archive containing imgbased.log for all 3 hypervisors (72.48 KB, application/x-xz)
2018-05-22 22:24 UTC, Andrea Perotti


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3485761 0 None None None 2018-06-14 12:13:46 UTC
Red Hat Product Errata RHEA-2019:1049 0 None None None 2019-05-08 12:32:03 UTC
oVirt gerrit 91592 0 master MERGED packaging: set ghost directive for log files 2021-01-01 03:43:53 UTC
oVirt gerrit 91706 0 v2.2.z MERGED packaging: set ghost directive for log files 2021-01-01 03:43:56 UTC
oVirt gerrit 92246 0 master MERGED packaging: explicitly set permissions on log files 2021-01-01 03:44:31 UTC
oVirt gerrit 92250 0 v2.2.z MERGED packaging: explicitly set permissions on log files 2021-01-01 03:43:57 UTC
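
The merged gerrit patches above fix this at the packaging level: the log files get a ghost directive and explicit permissions in the spec file. A minimal sketch of what the %files entries presumably look like after the fix (illustrative only, not the exact patch; the 0644 mode is taken from the listing in comment 7):

%ghost %attr(0644, vdsm, kvm) /var/log/ovirt-hosted-engine-ha/agent.log
%ghost %attr(0644, vdsm, kvm) /var/log/ovirt-hosted-engine-ha/broker.log

%ghost lets RPM track the files without shipping or overwriting their content, and %attr makes the expected owner/group/mode explicit, so an upgrade no longer leaves them owned by root:root.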

Description Andrea Perotti 2018-05-17 01:27:02 UTC
Description of problem:
After upgrading 3 out of 3 hypervisors, all of them failed to start ovirt-ha-{agent,broker}.
The services are restarted by systemd in a loop and keep triggering abrtd, which analyzes the resulting Python crash dumps.

The root cause is that, for some reason, after the upgrade the ownership of the log files:

'/var/log/ovirt-hosted-engine-ha/broker.log'
'/var/log/ovirt-hosted-engine-ha/agent.log'

changes from vdsm:kvm to root:root, making it impossible for the hosted-engine daemons to open them and filling the system log with alarming messages like:

mag 17 02:32:50 m1.goodfellow.local systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
mag 17 02:32:50 m1.goodfellow.local systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...
mag 17 02:32:51 m1.goodfellow.local python[1624]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
mag 17 02:32:51 m1.goodfellow.local abrt-server[1629]: Not saving repeating crash in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
mag 17 02:32:51 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service: main process exited, code=exited, status=1/FAILURE
mag 17 02:32:51 m1.goodfellow.local systemd[1]: Unit ovirt-ha-broker.service entered failed state.
mag 17 02:32:51 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service failed.
[...]
mag 17 02:32:53 m1.goodfellow.local systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
mag 17 02:32:53 m1.goodfellow.local systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...
mag 17 02:32:54 m1.goodfellow.local python[1828]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
mag 17 02:32:54 m1.goodfellow.local abrt-server[1835]: Not saving repeating crash in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
mag 17 02:32:54 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service: main process exited, code=exited, status=1/FAILURE
mag 17 02:32:54 m1.goodfellow.local systemd[1]: Unit ovirt-ha-broker.service entered failed state.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service failed.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service holdoff time over, scheduling restart.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: start request repeated too quickly for ovirt-ha-broker.service
mag 17 02:32:54 m1.goodfellow.local systemd[1]: Failed to start oVirt Hosted Engine High Availability Communications Broker.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: Unit ovirt-ha-broker.service entered failed state.
mag 17 02:32:54 m1.goodfellow.local systemd[1]: ovirt-ha-broker.service failed.
[...]
mag 17 02:33:36 m1.goodfellow.local vdsm[6511]: WARN Not ready yet, ignoring event '|net|host_conn|no_id' args={}
mag 17 02:33:36 m1.goodfellow.local vdsm[6511]: WARN Not ready yet, ignoring event '|net|host_conn|no_id' args={}
mag 17 02:33:36 m1.goodfellow.local vdsm[6511]: WARN File: /var/run/vdsm/trackedInterfaces/ovirtmgmt already removed
mag 17 02:33:36 m1.goodfellow.local vdsm[6511]: WARN Not ready yet, ignoring event '|net|host_conn|no_id' args={}
mag 17 02:33:36 m1.goodfellow.local vdsm[6511]: WARN File: /var/run/vdsm/trackedInterfaces/eno2 already removed
mag 17 02:33:37 m1.goodfellow.local vdsm[6511]: WARN MOM not available.
mag 17 02:33:37 m1.goodfellow.local vdsm[6511]: WARN MOM not available, KSM stats will be missing.
mag 17 02:33:37 m1.goodfellow.local vdsm[6511]: ERROR failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished?
[...]
mag 17 02:33:49 m1.goodfellow.local systemd[1]: ovirt-ha-agent.service holdoff time over, scheduling restart.
mag 17 02:33:49 m1.goodfellow.local systemd[1]: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
mag 17 02:33:49 m1.goodfellow.local systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
mag 17 02:33:49 m1.goodfellow.local systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...
mag 17 02:33:49 m1.goodfellow.local systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
mag 17 02:33:49 m1.goodfellow.local systemd[1]: Starting oVirt Hosted Engine High Availability Monitoring Agent...
mag 17 02:33:50 m1.goodfellow.local python[9792]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent'
mag 17 02:33:50 m1.goodfellow.local python[9791]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
mag 17 02:33:50 m1.goodfellow.local abrt-server[9846]: Not saving repeating crash in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent'
mag 17 02:33:50 m1.goodfellow.local systemd[1]: ovirt-ha-agent.service: main process exited, code=exited, status=1/FAILURE
mag 17 02:33:50 m1.goodfellow.local systemd[1]: Unit ovirt-ha-agent.service entered failed state.
mag 17 02:33:50 m1.goodfellow.local systemd[1]: ovirt-ha-agent.service failed.
mag 17 02:33:51 m1.goodfellow.local kernel: warning: `turbostat' uses 32-bit capabilities (legacy support in use)
mag 17 02:33:52 m1.goodfellow.local vdsm[6511]: ERROR failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished?


For an average end user, reading "Is the Hosted Engine setup finished?" is very scary, because it conveys a sense of insecurity about the hosted-engine HA cluster, which is the core of the infrastructure and the way it is managed.

The message by itself is also not very clear: you have to go into the abrtd crash folder and read the backtrace to get meaningful hints about what is going on.
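
A quick way to confirm the root cause and locate the crash details on an affected node (a sketch; the abrt dump directory name will differ per crash):

# ls -l /var/log/ovirt-hosted-engine-ha/   # agent.log/broker.log show up as root:root instead of vdsm:kvm
# abrt-cli list                            # lists the collected crash dumps
# cat /var/spool/abrt/Python-*/backtrace   # contains the Errno 13 traceback shown below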

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.11-1.el7ev.noarch
redhat-release-virtualization-host-4.2-3.0.el7.x86_64

How reproducible:
Always (experienced in 3 out of 3 nodes, starting from different releases of RHV-H 4.1)

Steps to Reproduce:
1. upgrade RHV-H to 4.2.3 from RHV-H 4.1
2. reboot the node

Actual results:
The ovirt-ha-agent and ovirt-ha-broker services do not start, reducing the number of active members of the hosted-engine cluster.

Expected results:
A painless upgrade, with no change in the ownership of the log files.


Additional info:

These are the backtraces abrtd collected for both daemons:


[root@m1 abrt]# cat Python-2018-05-17-02\:34\:17-14299/backtrace 
__init__.py:925:_open:IOError: [Errno 13] Permission denied: '/var/log/ovirt-hosted-engine-ha/broker.log'

Traceback (most recent call last):
  File "/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker", line 25, in <module>
    broker.Broker().run()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 44, in run
    self._initialize_logging()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 70, in _initialize_logging
    disable_existing_loggers=False)
  File "/usr/lib64/python2.7/logging/config.py", line 78, in fileConfig
    handlers = _install_handlers(cp, formatters)
  File "/usr/lib64/python2.7/logging/config.py", line 156, in _install_handlers
    h = klass(*args)
  File "/usr/lib64/python2.7/logging/handlers.py", line 169, in __init__
    BaseRotatingHandler.__init__(self, filename, 'a', encoding, delay)
  File "/usr/lib64/python2.7/logging/handlers.py", line 64, in __init__
    logging.FileHandler.__init__(self, filename, mode, encoding, delay)
  File "/usr/lib64/python2.7/logging/__init__.py", line 902, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib64/python2.7/logging/__init__.py", line 925, in _open
    stream = open(self.baseFilename, self.mode)
IOError: [Errno 13] Permission denied: '/var/log/ovirt-hosted-engine-ha/broker.log'

Local variables in innermost frame:
self: <logging.handlers.TimedRotatingFileHandler object at 0x7fcd4d7d8e10>
[root@m1 abrt]# cat Python-2018-05-17-02\:34\:17-14300/backtrace 
__init__.py:925:_open:IOError: [Errno 13] Permission denied: '/var/log/ovirt-hosted-engine-ha/agent.log'

Traceback (most recent call last):
  File "/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent", line 25, in <module>
    agent.Agent().run()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 66, in run
    self._initialize_logging()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 95, in _initialize_logging
    disable_existing_loggers=False)
  File "/usr/lib64/python2.7/logging/config.py", line 78, in fileConfig
    handlers = _install_handlers(cp, formatters)
  File "/usr/lib64/python2.7/logging/config.py", line 156, in _install_handlers
    h = klass(*args)
  File "/usr/lib64/python2.7/logging/handlers.py", line 169, in __init__
    BaseRotatingHandler.__init__(self, filename, 'a', encoding, delay)
  File "/usr/lib64/python2.7/logging/handlers.py", line 64, in __init__
    logging.FileHandler.__init__(self, filename, mode, encoding, delay)
  File "/usr/lib64/python2.7/logging/__init__.py", line 902, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib64/python2.7/logging/__init__.py", line 925, in _open
    stream = open(self.baseFilename, self.mode)
IOError: [Errno 13] Permission denied: '/var/log/ovirt-hosted-engine-ha/agent.log'

Local variables in innermost frame:
self: <logging.handlers.TimedRotatingFileHandler object at 0x7fd9477a0890>
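
The failure can also be reproduced outside the daemons with a one-liner (a quick check, assuming the standard vdsm service account; output indicative):

# sudo -u vdsm sh -c 'echo >> /var/log/ovirt-hosted-engine-ha/broker.log'
sh: /var/log/ovirt-hosted-engine-ha/broker.log: Permission denied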

Comment 2 Yuval Turgeman 2018-05-21 08:33:40 UTC
Hi, can you please attach /var/log/imgbased.log ?

Comment 3 Andrea Perotti 2018-05-22 22:24:47 UTC
Created attachment 1440337 [details]
Archive containing imgbased.log for all 3 hypervisors

Comment 6 Denice 2018-06-04 01:20:18 UTC
This just happened to me too. I am upgrading RHEV from 4.1 to 4.2, and the first hypervisor hit this problem when it was rebooted by the manager. I chowned the two log files to vdsm:kvm and it auto-magically recovered. Will this happen again when the log files rotate? I won't upgrade the other 2 hypervisors until I am sure about this problem.
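
For reference, the manual workaround is along these lines (the reset-failed step is an assumption, only needed if systemd already hit its start-rate limit as in the logs above):

# chown vdsm:kvm /var/log/ovirt-hosted-engine-ha/agent.log /var/log/ovirt-hosted-engine-ha/broker.log
# systemctl reset-failed ovirt-ha-broker ovirt-ha-agent
# systemctl restart ovirt-ha-broker ovirt-ha-agent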

Comment 7 Gianluca Cecchi 2018-06-07 14:37:43 UTC
Hello,
I can confirm that after changing ownership, in my environment composed of two nodes, log rotation has not had any problems, now 15 days after upgrading both RHV-H nodes from the latest 4.1 to 4.2 (as of 23/05).

e.g.:

[root@rhevora3 ovirt-hosted-engine-ha]# ll
total 29176
-rw-r--r--. 1 vdsm kvm 1861815 Jun  7 16:34 agent.log
-rw-r--r--. 1 vdsm kvm 2110007 May 31 19:18 agent.log.2018-05-30
-rw-r--r--. 1 vdsm kvm 2109216 Jun  1 19:18 agent.log.2018-05-31
-rw-r--r--. 1 vdsm kvm 2109085 Jun  2 19:18 agent.log.2018-06-01
-rw-r--r--. 1 vdsm kvm 2106519 Jun  3 19:18 agent.log.2018-06-02
-rw-r--r--. 1 vdsm kvm 2107300 Jun  4 19:19 agent.log.2018-06-03
-rw-r--r--. 1 vdsm kvm 2106155 Jun  5 19:19 agent.log.2018-06-04
-rw-r--r--. 1 vdsm kvm 2104958 Jun  6 19:19 agent.log.2018-06-05
-rw-r--r--. 1 vdsm kvm 1499470 Jun  7 16:34 broker.log
-rw-r--r--. 1 vdsm kvm 1644019 May 31 19:12 broker.log.2018-05-30
-rw-r--r--. 1 vdsm kvm 1655812 Jun  1 19:12 broker.log.2018-05-31
-rw-r--r--. 1 vdsm kvm 1661890 Jun  2 19:11 broker.log.2018-06-01
-rw-r--r--. 1 vdsm kvm 1670210 Jun  3 19:12 broker.log.2018-06-02
-rw-r--r--. 1 vdsm kvm 1673656 Jun  4 19:12 broker.log.2018-06-03
-rw-r--r--. 1 vdsm kvm 1674546 Jun  5 19:12 broker.log.2018-06-04
-rw-r--r--. 1 vdsm kvm 1684896 Jun  6 19:12 broker.log.2018-06-05
[root@rhevora3 ovirt-hosted-engine-ha]#

Comment 8 Gianluca Cecchi 2018-06-18 05:58:15 UTC
Hello,
I don't know if this build is already expected to contain the fix, as the target appears to be 4.2.4, but I have updated one RHEV 4.1 environment, based on 2 RHV-H nodes originally at version rhvh-4.1-0.20171101.0+1, with a self-hosted engine.
After following the typical upgrade flow for the self-hosted engine, first to the latest 4.1 (4.1.11.2-0.1.el7) and then to 4.2 (4.2.3.8-0.1.el7), I updated the RHV-H hosts to version 4.2-20180531.0.el7_5.
For both hosts the permissions were OK, as expected.

I had another problem on only one of the 2 hosts, dropping me to a dracut shell when booting the 4.2 image, but I'm going to open a dedicated case, and possibly a Bugzilla, for it.

Gianluca

Comment 12 Nikolai Sednev 2018-08-26 06:31:18 UTC
Works for me on ovirt-hosted-engine-ha-2.2.14-1.el7ev.

Comment 14 errata-xmlrpc 2019-05-08 12:31:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1049

Comment 15 Daniel Gur 2019-08-28 13:12:53 UTC
sync2jira

Comment 16 Daniel Gur 2019-08-28 13:17:05 UTC
sync2jira

