Bug 1911910

Summary: ovirt-ha-broker - Failing to start Post upgrade to streams

Product: [oVirt] ovirt-engine
Component: Python Library
Version: 4.4.4.1
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Reporter: penguinpages <jeremey.wise>
Assignee: Asaf Rachmani <arachman>
QA Contact: meital avital <mavital>
Docs Contact:
CC: ahadas, bugs
Target Milestone: ovirt-4.4.5
Target Release: ---
Flags: pm-rhel: ovirt-4.4+
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-01-18 09:37:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Attachments: Logs from hosted-engine broker (flags: none)

Description penguinpages 2021-01-01 14:44:56 UTC
Description of problem:
Three-node HCI cluster on CentOS 8. After converting to CentOS Stream, the hosted engine no longer boots. All three nodes show the same cycling events with the Python library error below.


Version-Release number of selected component (if applicable):


How reproducible:
100%


Steps to Reproduce:
1. Upgrade from CentOS 8 to CentOS Stream
2. Update packages and reboot
3. HCI Gluster volumes mount, but the engine no longer starts

Actual results:
[root@thor ~]# systemctl enable ovirt-ha-agent
[root@thor ~]# systemctl start ovirt-ha-agent
[root@thor ~]#  systemctl status ovirt-ha-agent
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Fri 2021-01-01 09:33:34 EST; 6s ago
  Process: 427064 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=157)
 Main PID: 427064 (code=exited, status=157)
    Tasks: 0 (limit: 1235320)
   Memory: 0B
   CGroup: /system.slice/ovirt-ha-agent.service
[root@thor ~]#
[root@thor ~]#
[root@thor ~]#  systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Fri 2021-01-01 09:34:06 EST; 3s ago
  Process: 427376 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=157)
 Main PID: 427376 (code=exited, status=157)
    Tasks: 0 (limit: 1235320)
   Memory: 0B
   CGroup: /system.slice/ovirt-ha-agent.service
[root@thor ~]#


Expected results:
Nodes boot the hosted engine, which then launches the VMs.

Additional info:
Jan  1 09:35:28 thor journal[428169]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.broker.Broker ERROR Failed initializing the broker: [Errno 107] Transport endpoint is not connected: '/rhev/data-center/mnt/glusterSD/thorst.penguinpages.local:_engine/3afc47ba-afb9-413f-8de5-8d9a2f45ecde/ha_agent/hosted-engine.metadata'
Jan  1 09:35:28 thor journal[428169]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.broker.Broker ERROR Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
    self._storage_broker_instance = self._get_storage_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
    return storage_broker.StorageBroker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 97, in __init__
    self._backend.connect()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 408, in connect
    self._check_symlinks(self._storage_path, volume.path, service_link)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 105, in _check_symlinks
    os.unlink(service_link)
OSError: [Errno 107] Transport endpoint is not connected: '/rhev/data-center/mnt/glusterSD/thorst.penguinpages.local:_engine/3afc47ba-afb9-413f-8de5-8d9a2f45ecde/ha_agent/hosted-engine.metadata'
Jan  1 09:35:28 thor journal[428169]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.broker.Broker ERROR Trying to restart the broker
Jan  1 09:35:28 thor platform-python[428169]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
Jan  1 09:35:28 thor systemd[1]: ovirt-ha-broker.service: Main process exited, code=exited, status=1/FAILURE
Jan  1 09:35:28 thor systemd[1]: ovirt-ha-broker.service: Failed with result 'exit-code'.
Jan  1 09:35:28 thor abrt-server[428202]: Deleting problem directory Python3-2021-01-01-09:35:28-428169 (dup of Python3-2020-09-18-14:25:13-1363)
Jan  1 09:35:28 thor systemd[1]: ovirt-ha-broker.service: Service RestartSec=100ms expired, scheduling restart.
Jan  1 09:35:28 thor systemd[1]: ovirt-ha-broker.service: Scheduled restart job, restart counter is at 8235.
Jan  1 09:35:28 thor systemd[1]: Stopped oVirt Hosted Engine High Availability Communications Broker.
Jan  1 09:35:28 thor systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
Jan  1 09:35:28 thor abrt-server[428202]: /bin/sh: reporter-systemd-journal: command not found
Jan  1 09:35:29 thor systemd[1]: ovirt-ha-agent.service: Service RestartSec=10s expired, scheduling restart.
Jan  1 09:35:29 thor systemd[1]: ovirt-ha-agent.service: Scheduled restart job, restart counter is at 3962.
Jan  1 09:35:29 thor systemd[1]: Stopped oVirt Hosted Engine High Availability Monitoring Agent.
Jan  1 09:35:29 thor systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
Jan  1 09:35:30 thor journal[428219]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Failed to start necessary monitors
Jan  1 09:35:30 thor journal[428219]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 85, in start_monitor
    response = self._proxy.start_monitor(type, options)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
    verbose=self.__verbose
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
    http_conn = self.send_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
    self.send_content(connection, request_body)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
    connection.endheaders(request_body)
  File "/usr/lib64/python3.6/http/client.py", line 1264, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1040, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.6/http/client.py", line 978, in send
    self.connect()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect
    self.sock.connect(base64.b16decode(self.host))
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 561, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 91, in start_monitor
    ).format(t=type, o=options, e=e)
ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'addr': '172.16.100.1', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}]
Jan  1 09:35:30 thor journal[428219]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Trying to restart agent
Jan  1 09:35:30 thor systemd[1]: ovirt-ha-agent.service: Main process exited, code=exited, status=157/n/a
Jan  1 09:35:30 thor systemd[1]: ovirt-ha-agent.service: Failed with result 'exit-code'.
Jan  1 09:35:30 thor systemd[1]: Started Session c8243 of user root.
Jan  1 09:35:30 thor systemd[1]: session-c8243.scope: Succeeded.
Jan  1 09:35:31 thor vdsm[7851]: WARN Failed to retrieve Hosted Engine HA info, is Hosted Engine setup finished?
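[Editorial note] The two tracebacks above reduce to two distinct errno values: the broker dies with ENOTCONN (107, a stale Gluster FUSE endpoint) while cleaning up a symlink, and the agent then hits ENOENT (2) because the broker's unix socket was never created. Per the agent traceback, unixrpc carries the broker's socket path in the XML-RPC host field as a base16 string and decodes it at connect time with base64.b16decode. A minimal sketch of that failure mode (the socket path here is illustrative, not taken from this host):

```python
import base64
import errno
import os
import socket
import tempfile

# unixrpc-style encoding (seen in the agent traceback): the broker's unix
# socket path rides in the XML-RPC URL "host" field as base16 and is
# decoded again at connect time with base64.b16decode.
sock_path = os.path.join(tempfile.mkdtemp(), "no-such", "broker.socket")
host = base64.b16encode(sock_path.encode()).decode("ascii")
assert base64.b16decode(host).decode() == sock_path

# Connecting to a socket path that was never bound reproduces the agent's
# error: FileNotFoundError, i.e. OSError with errno ENOENT (2).
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    s.connect(base64.b16decode(host).decode())
except FileNotFoundError as e:
    print("ENOENT:", e.errno == errno.ENOENT)
finally:
    s.close()
```

In other words, the ENOENT the agent reports is a symptom, not the cause: the broker crash-loops on the stale mount, so its socket never appears.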

Comment 1 RHEL Program Management 2021-01-03 08:48:28 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 2 penguinpages 2021-01-04 03:52:54 UTC
Created attachment 1744219 [details]
Logs from hosted-engine broker

Something in the update broke how the Gluster mount point for the engine is referenced.


Just trying to start back up my VMs :)


I am about 90% certain nothing is wrong with the volumes. Something that was patched broke how HA starts the engine services.

###
[root@medusa ~]# mount |grep gluster
/dev/mapper/vdo_0900 on /gluster_bricks/gv0 type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota,_netdev,x-systemd.requires=vdo.service)
/dev/mapper/gluster_vg_sdb-gluster_lv_engine on /gluster_bricks/engine type xfs (rw,noatime,nodiratime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota,_netdev,x-systemd.requires=vdo.service)
/dev/mapper/gluster_vg_sdb-gluster_lv_vmstore on /gluster_bricks/vmstore type xfs (rw,noatime,nodiratime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=512,noquota,_netdev,x-systemd.requires=vdo.service)
/dev/mapper/gluster_vg_sdb-gluster_lv_data on /gluster_bricks/data type xfs (rw,noatime,nodiratime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=512,noquota,_netdev,x-systemd.requires=vdo.service)
thorst.penguinpages.local:/engine on /rhev/data-center/mnt/glusterSD/thorst.penguinpages.local:_engine type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072,_netdev,x-systemd.device-timeout=0)
medusast.penguinpages.local:/data on /media/data type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072,_netdev)
medusast.penguinpages.local:/engine on /media/engine type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072,_netdev)
medusast.penguinpages.local:/vmstore on /media/vmstore type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072,_netdev)
medusast.penguinpages.local:/gv0 on /media/gv0 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072,_netdev)
[root@medusa ~]# df -h
Filesystem                                     Size  Used Avail Use% Mounted on
devtmpfs                                        32G     0   32G   0% /dev
tmpfs                                           32G   16K   32G   1% /dev/shm
tmpfs                                           32G  1.2G   31G   4% /run
tmpfs                                           32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/cl-root                             50G  9.4G   41G  19% /
/dev/mapper/cl-home                            163G  1.2G  162G   1% /home
/dev/sda2                                      976M  252M  658M  28% /boot
/dev/sda1                                      599M  6.9M  592M   2% /boot/efi
/dev/mapper/vdo_0900                           4.0T   29G  4.0T   1% /gluster_bricks/gv0
/dev/mapper/gluster_vg_sdb-gluster_lv_engine   100G  9.0G   91G   9% /gluster_bricks/engine
/dev/mapper/gluster_vg_sdb-gluster_lv_vmstore 1000G  249G  752G  25% /gluster_bricks/vmstore
/dev/mapper/gluster_vg_sdb-gluster_lv_data    1000G  769G  232G  77% /gluster_bricks/data
thorst.penguinpages.local:/engine              100G   10G   90G  10% /rhev/data-center/mnt/glusterSD/thorst.penguinpages.local:_engine
tmpfs                                          6.3G     0  6.3G   0% /run/user/0
medusast.penguinpages.local:/data             1000G  779G  222G  78% /media/data
medusast.penguinpages.local:/engine            100G   10G   90G  10% /media/engine
medusast.penguinpages.local:/vmstore          1000G  259G  742G  26% /media/vmstore
medusast.penguinpages.local:/gv0               4.0T   70G  4.0T   2% /media/gv0
[root@medusa ~]# vdostats --human-readable
Device                    Size      Used Available Use% Space saving%
/dev/mapper/vdo_sdb     476.9G    397.3G     79.7G  83%           72%
/dev/mapper/vdo_0900    931.5G      4.8G    926.7G   0%           68%
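[Editorial note] The broker's ENOTCONN above is the classic symptom of a stale Gluster FUSE endpoint: the mount still appears in `mount` and `df` output, but any syscall on it fails. A hedged sketch for sweeping mountpoints (`stale_mounts` is a hypothetical helper, not part of ovirt-hosted-engine-ha; the glob matches the paths in this report):

```python
import errno
import glob
import os

def stale_mounts(paths):
    """Return (mountpoint, errno) pairs where even stat() fails.

    A wedged FUSE mount typically raises OSError with errno ENOTCONN (107,
    "Transport endpoint is not connected").
    """
    bad = []
    for mp in paths:
        try:
            os.stat(mp)
        except OSError as e:
            # For a stale endpoint the usual remedy is a lazy unmount
            # (`umount -l <mountpoint>`) followed by a remount.
            bad.append((mp, e.errno))
    return bad

# Glob matching the storage-domain mounts shown in this report:
print(stale_mounts(glob.glob("/rhev/data-center/mnt/glusterSD/*")))
```

An empty list would support the reporter's claim that the bricks themselves are healthy and the fault is on the client side.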

Comment 3 penguinpages 2021-01-12 21:01:02 UTC

Questions:

1) Is there any way to manually start VMs while the "engine" is offline?
2) I think this ticket is noting that a fix is coming in 4.4+?

[root@odin ~]# rpm -qa |grep ovirt
ovirt-imageio-client-2.1.1-1.el8.x86_64
ovirt-vmconsole-1.0.9-1.el8.noarch
ovirt-openvswitch-2.11-0.2020061801.el8.noarch
ovirt-provider-ovn-driver-1.2.33-1.el8.noarch
ovirt-engine-appliance-4.4-20201221110111.1.el8.x86_64
ovirt-host-dependencies-4.4.1-4.el8.x86_64
ovirt-python-openvswitch-2.11-0.2020061801.el8.noarch
ovirt-hosted-engine-ha-2.4.5-1.el8.noarch
ovirt-ansible-collection-1.2.4-1.el8.noarch
ovirt-openvswitch-ovn-common-2.11-0.2020061801.el8.noarch
ovirt-release44-4.4.4-1.el8.noarch
ovirt-openvswitch-ovn-2.11-0.2020061801.el8.noarch
ovirt-openvswitch-ovn-host-2.11-0.2020061801.el8.noarch
python3-ovirt-setup-lib-1.3.2-1.el8.noarch
ovirt-vmconsole-host-1.0.9-1.el8.noarch
ovirt-imageio-common-2.1.1-1.el8.x86_64
ovirt-imageio-daemon-2.1.1-1.el8.x86_64
python3-ovirt-engine-sdk4-4.4.8-1.el8.x86_64


As noted, I am on 4.4-2. I just need to start my VMs again, and after three weeks I now have to decide whether to give up debugging and wipe, because there is no ETA on a fix.

Comment 4 penguinpages 2021-01-15 21:06:52 UTC

Learning a bit more: I think the root cause is package conflicts.


###########
[root@thor ~]# dnf update
Last metadata expiration check: 2:39:36 ago on Fri 15 Jan 2021 01:17:50 PM EST.
Error:
 Problem 1: package ovirt-host-4.4.1-4.el8.x86_64 requires cockpit-dashboard, but none of the providers can be installed
  - package cockpit-bridge-234-1.el8.x86_64 conflicts with cockpit-dashboard < 233 provided by cockpit-dashboard-217-1.el8.noarch
  - cannot install the best update candidate for package ovirt-host-4.4.1-4.el8.x86_64
  - cannot install the best update candidate for package cockpit-bridge-217-1.el8.x86_64
 Problem 2: problem with installed package ovirt-host-4.4.1-4.el8.x86_64
  - package ovirt-host-4.4.1-4.el8.x86_64 requires cockpit-dashboard, but none of the providers can be installed
  - package cockpit-system-234-1.el8.noarch obsoletes cockpit-dashboard provided by cockpit-dashboard-217-1.el8.noarch
  - cannot install the best update candidate for package cockpit-dashboard-217-1.el8.noarch
 Problem 3: package ovirt-hosted-engine-setup-2.4.9-1.el8.noarch requires ovirt-host >= 4.4.0, but none of the providers can be installed
  - package ovirt-host-4.4.1-4.el8.x86_64 requires cockpit-dashboard, but none of the providers can be installed
  - package ovirt-host-4.4.1-1.el8.x86_64 requires cockpit-dashboard, but none of the providers can be installed
  - package ovirt-host-4.4.1-2.el8.x86_64 requires cockpit-dashboard, but none of the providers can be installed
  - package ovirt-host-4.4.1-3.el8.x86_64 requires cockpit-dashboard, but none of the providers can be installed
  - package cockpit-system-234-1.el8.noarch obsoletes cockpit-dashboard provided by cockpit-dashboard-217-1.el8.noarch
  - cannot install the best update candidate for package ovirt-hosted-engine-setup-2.4.9-1.el8.noarch
  - cannot install the best update candidate for package cockpit-system-217-1.el8.noarch
(try to add '--allowerasing' to command line to replace conflicting packages or '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
[root@thor ~]# dnf update --allowerasing
Last metadata expiration check: 2:39:45 ago on Fri 15 Jan 2021 01:17:50 PM EST.
Dependencies resolved.
=========================================================================================================================================================================================================================================
 Package                                                             Architecture                                     Version                                                 Repository                                            Size
=========================================================================================================================================================================================================================================
Upgrading:
 cockpit-bridge                                                      x86_64                                           234-1.el8                                               baseos                                               597 k
 cockpit-system                                                      noarch                                           234-1.el8                                               baseos                                               3.1 M
     replacing  cockpit-dashboard.noarch 217-1.el8
Removing dependent packages:
 cockpit-ovirt-dashboard                                             noarch                                           0.14.17-1.el8                                           @ovirt-4.4                                            16 M
 ovirt-host                                                          x86_64                                           4.4.1-4.el8                                             @ovirt-4.4                                            11 k
 ovirt-hosted-engine-setup                                           noarch                                           2.4.9-1.el8                                             @ovirt-4.4                                           1.3 M

Transaction Summary
=========================================================================================================================================================================================================================================
Upgrade  2 Packages
Remove   3 Packages

#####################

[root@odin ~]# systemctl status ovirt-ha-agent.service
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
Condition: start condition failed at Fri 2021-01-15 16:02:53 EST; 7s ago
           └─ ConditionFileNotEmpty=/etc/ovirt-hosted-engine/hosted-engine.conf was not met
[root@odin ~]#





If you do force the upgrade, the never-ending cycle of logged errors goes away.
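[Editorial note] The `ConditionFileNotEmpty=/etc/ovirt-hosted-engine/hosted-engine.conf` line in the status output above means systemd refuses to even start the agent because that file is missing or empty, which is consistent with the forced `--allowerasing` upgrade having removed ovirt-hosted-engine-setup. A quick check mirroring systemd's condition (hypothetical helper name):

```python
import os

def condition_file_not_empty(path):
    """Mirror systemd's ConditionFileNotEmpty: satisfied only if `path`
    exists, is a regular file, and has a size greater than zero."""
    try:
        return os.path.isfile(path) and os.path.getsize(path) > 0
    except OSError:
        return False

# The file the ovirt-ha-agent unit conditions on:
print(condition_file_not_empty("/etc/ovirt-hosted-engine/hosted-engine.conf"))
```

If this returns False on a host that previously ran the agent, the config was lost during the upgrade rather than the service itself being broken.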

Comment 5 Sandro Bonazzola 2021-01-18 09:37:57 UTC
Closing as a duplicate of bug #1917011.
Please wait before switching to CentOS Stream; compatibility with it is still in tech preview.

*** This bug has been marked as a duplicate of bug 1917011 ***