+++ This bug was initially created as a clone of Bug #2036613 +++

Description of problem:
In previous oVirt versions, the cluster upgrade via the UI performed a sequential upgrade of each node: update node -> reboot -> finished -> next node -> etc. Now I've noticed that multiple hosts are upgraded at the same time, which isn't acceptable in some environments. In addition, the cluster was put back online before the last node was even put into maintenance and upgraded:

EVENT_ID: ANSIBLE_RUNNER_EVENT_NOTIFICATION(559), Ansible Runner.. Set cluster upgrade status to finished.

This occurred before the last node was put into maintenance mode.

How reproducible:
Every time

Steps to Reproduce:
1. Upgrade cluster via UI

--- Additional comment from RHEL Program Management on 2022-01-05 10:24:25 UTC ---

The documentation text flag should only be set after the 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

--- Additional comment from Martin Perina on 2022-01-05 10:25:39 UTC ---

Could you please provide engine logs so we can investigate the sequence of actions? And which version are you using?

--- Additional comment from Jean-Louis Dupond on 2022-01-05 10:57:40 UTC ---

Hi,

This is on the current oVirt version (4.4.9.5).

Just before the upgrade:
2021-11-03T10:26:16+0100 SUBDEBUG Upgrade: ansible-2.9.27-2.el8.noarch
2021-11-03T10:27:25+0100 SUBDEBUG Upgrade: ovirt-ansible-collection-1.6.5-1.el8.noarch
2021-11-03T10:27:49+0100 SUBDEBUG Upgraded: ovirt-ansible-collection-1.6.2-1.el8.noarch
2021-11-03T10:27:50+0100 SUBDEBUG Upgraded: ansible-2.9.25-1.el8.noarch

It's hard to upload the complete oVirt engine log, as it contains tons of sensitive data, but I've added the output of grep ANSIBLE ovirt-engine.log.
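The serial flow expected in the problem description can be sketched with a minimal Python example (the helper names `upgrade_host` and `wait_until_up` are illustrative assumptions, not the actual cluster_upgrade role code): each host is only touched after the previous one is fully back up.

```python
def upgrade_cluster_serially(hosts, upgrade_host, wait_until_up):
    """Upgrade hosts strictly one at a time (illustrative sketch).

    The next host is only started after the previous one has been
    upgraded, rebooted, and confirmed back up.
    """
    done = []
    for host in hosts:
        upgrade_host(host)   # maintenance -> update -> reboot
        wait_until_up(host)  # block here before moving to the next host
        done.append(host)
    return done
```

The bug described above is precisely that this blocking step effectively did not block, so several hosts entered the upgrade flow concurrently.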
--- Additional comment from Jean-Louis Dupond on 2022-01-05 10:58:17 UTC ---

--- Additional comment from Jean-Louis Dupond on 2022-01-05 11:00:17 UTC ---

As you can see in the logs, it starts upgrading 001, finishes, then starts 002, finishes, but then it suddenly starts upgrading 003 and 004 at the same time. When those are finished, it sets the cluster upgrade status to finished and then starts upgrading 005 ...

--- Additional comment from Martin Perina on 2022-01-05 11:50:26 UTC ---

Sorry, but that's not enough information to investigate. Could you please attach the complete engine.log and mark the attachment as private? That will allow only Red Hat employees to access the log.

--- Additional comment from Jean-Louis Dupond on 2022-01-11 07:51:53 UTC ---

Info provided via mail, as I can't upload the attachment as private.

--- Additional comment from Martin Necas on 2022-02-15 13:32:55 UTC ---

Hi, this is a bit of a tricky issue, because in the logs you sent it looks like the upgrade started but immediately ended. Did you have "check for upgrade" enabled? Also, when I tested this on my environment I could not reproduce the issue, and we have not made any big changes in the cluster_upgrade role between 1.6.2 and 1.6.5 (only some Gluster changes).

--- Additional comment from Jean-Louis Dupond on 2022-03-14 10:41:31 UTC ---

Hi Martin,

Yes, "check for upgrade" was enabled during the cluster upgrade. I did another upgrade of the cluster today, and the same issue occurred again.

--- Additional comment from Jean-Louis Dupond on 2022-03-14 13:15:14 UTC ---

I did some more investigation, and it looks like a race condition here.
The upgrade of the host is started:

2022-03-14 11:22:13,033 - runner_service.services.playbook - DEBUG - cb_event_handler event_data={'uuid': '00505697-8380-8e82-18e6-0000000002c1', 'counter': 702, 'stdout': '\r\nTASK [ovirt.ovirt.cluster_upgrade : Upgrade host] ******************************', 'start_line': 693, 'end_line': 695, 'runner_ident': '9230d23a-a380-11ec-8536-005056978380', 'event': 'playbook_on_task_start', 'pid': 7689, 'created': '2022-03-14T10:22:13.032398', 'parent_uuid': '00505697-8380-8e82-18e6-000000000005', 'event_data': {'playbook': 'ovirt-cluster-upgrade.yml', 'playbook_uuid': '48039698-615b-43fe-b3ff-27284188194a', 'play': 'oVirt cluster upgrade wizard target', 'play_uuid': '00505697-8380-8e82-18e6-000000000005', 'play_pattern': 'localhost', 'task': 'Upgrade host', 'task_uuid': '00505697-8380-8e82-18e6-0000000002c1', 'task_action': 'ovirt_host', 'task_args': '', 'task_path': '/usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/cluster_upgrade/tasks/upgrade.yml:76', 'role': 'cluster_upgrade', 'name': 'ovirt.ovirt.cluster_upgrade : Upgrade host', 'is_conditional': False, 'uuid': '00505697-8380-8e82-18e6-0000000002c1'}}

The upgrade command is executed:

2022-03-14 11:22:13,637+01 INFO [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Lock Acquired to object 'EngineLock:{exclusiveLocks='', sharedLocks='[d497efe5-2344-4d58-8985-7b053d3c35a3=POOL]'}'
2022-03-14 11:22:13,640+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [] EVENT_ID: HOST_UPGRADE_STARTED(840), Host ovn001 upgrade was started (User: admin@internal).
2022-03-14 11:22:13,751+01 INFO [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Running command: MaintenanceNumberOfVdssCommand internal: true.
Entities affected : ID: 7f6bd5e0-59c1-42e3-84e4-40cef0dc684d Type: VDSAction group MANIPULATE_HOST with role type ADMIN
2022-03-14 11:22:13,755+01 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] START, SetVdsStatusVDSCommand(HostName = ovn001, SetVdsStatusVDSCommandParameters:{hostId='7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', status='PreparingForMaintenance', nonOperationalReason='NONE', stopSpmFailureLogged='true', maintenanceReason=''}), log id: 38d5d176
2022-03-14 11:22:13,755+01 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] VDS 'ovn001' is spm and moved from up calling resetIrs.
2022-03-14 11:22:13,756+01 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] START, ResetIrsVDSCommand( ResetIrsVDSCommandParameters:{storagePoolId='d497efe5-2344-4d58-8985-7b053d3c35a3', ignoreFailoverLimit='false', vdsId='7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', ignoreStopFailed='false'}), log id: 190620f0
2022-03-14 11:22:13,761+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] START, SpmStopVDSCommand(HostName = ovn001, SpmStopVDSCommandParameters:{hostId='7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', storagePoolId='d497efe5-2344-4d58-8985-7b053d3c35a3'}), log id: 414d8653
2022-03-14 11:22:13,767+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] SpmStopVDSCommand::Stopping SPM on vds 'ovn001', pool id 'd497efe5-2344-4d58-8985-7b053d3c35a3'
2022-03-14 11:22:16,991+01 INFO [org.ovirt.engine.core.bll.EndExternalStepCommand] (default task-1)
[f0ac5a1f-1a3b-4141-9ec0-6c1c1f407573] Running command: EndExternalStepCommand internal: false. Entities affected : ID: aaa00000-0000-0000-0000-123456789aaa Type: SystemAction group INJECT_EXTERNAL_TASKS with role type ADMIN
2022-03-14 11:22:17,174+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] FINISH, SpmStopVDSCommand, return: , log id: 414d8653
2022-03-14 11:22:17,182+01 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] FINISH, ResetIrsVDSCommand, return: , log id: 190620f0
2022-03-14 11:22:17,185+01 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Clearing domains data for host ovn001
2022-03-14 11:22:17,196+01 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] FINISH, SetVdsStatusVDSCommand, return: , log id: 38d5d176
2022-03-14 11:22:17,202+01 INFO [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[d497efe5-2344-4d58-8985-7b053d3c35a3=POOL]'}'
2022-03-14 11:22:17,261+01 INFO [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Running command: MaintenanceVdsCommand internal: true. Entities affected : ID: 7f6bd5e0-59c1-42e3-84e4-40cef0dc684d Type: VDS

As you can see, the SetVdsStatusVDSCommand only finishes 4 seconds after the start of the upgrade call.
But in https://github.com/oVirt/ovirt-ansible-collection/blob/78a8362f953b08dc1b1553d7bb4b50696b3e19e6/plugins/modules/ovirt_host.py#L596 we have a loop which checks the status of the host every 3 seconds. This means the command is started at 11:22:13, and at 11:22:16 it checks the state of the host. But at that time the host is still in hoststate.UP, so Ansible thinks it's fine to continue. Which is exactly what we see in the ansible-runner logs:

2022-03-14 11:22:16,693 - runner_service.services.playbook - DEBUG - cb_event_handler event_data={'uuid': '4a3f1375-541f-49c5-a7f6-a0deb4cb23c3', 'counter': 704, 'stdout': 'changed: [localhost]', 'start_line': 695, 'end_line': 696, 'runner_ident': '9230d23a-a380-11ec-8536-005056978380', 'event': 'runner_on_ok', 'pid': 7689, 'created': '2022-03-14T10:22:16.692617', 'parent_uuid': '00505697-8380-8e82-18e6-0000000002c1', 'event_data': {'playbook': 'ovirt-cluster-upgrade.yml', 'playbook_uuid': '48039698-615b-43fe-b3ff-27284188194a', 'play': 'oVirt cluster upgrade wizard target', 'play_uuid': '00505697-8380-8e82-18e6-000000000005', 'play_pattern': 'localhost', 'task': 'Upgrade host', 'task_uuid': '00505697-8380-8e82-18e6-0000000002c1', 'task_action': 'ovirt_host', 'task_args': '', 'task_path': '/usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/cluster_upgrade/tasks/upgrade.yml:76', 'role': 'cluster_upgrade', 'host': 'localhost', 'remote_addr': '127.0.0.1', 'res': {'changed': True, 'id': '7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', 'host': {'href': '/ovirt-engine/api/hosts/7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', 'comment': '', 'id': '7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', 'name': 'ovn001', ...... 'spm': {'priority': 5, 'status': 'spm'}, 'ssh': {'fingerprint': 'SHA256:j3mXtCrNCckmuewn+qK8OJo309P+VLjKukb9yfyUAi8', ...... , 'status': 'up', .....
, 'diff': {'after': {}, 'before': {}}, 'invocation': {'module_args': {'name': 'ovn001', 'state': 'upgraded', 'check_upgrade': True, 'reboot_after_upgrade': True, 'timeout': 3600, 'wait': True, 'poll_interval': 3, 'fetch_nested': False, 'nested_attributes': [], 'public_key': False, 'force': False, 'activate': True, 'id': None, 'comment': None, 'cluster': None, 'address': None, 'ssh_port': None, 'password': None, 'kdump_integration': None, 'spm_priority': None, 'override_iptables': None, 'reboot_after_installation': None, 'override_display': None, 'kernel_params': None, 'hosted_engine': None, 'power_management_enabled': None, 'iscsi': None, 'vgpu_placement': None}}, '_ansible_no_log': False}, 'start': '2022-03-14T10:22:13.033105', 'end': '2022-03-14T10:22:16.692385', 'duration': 3.65928, 'event_loop': None, 'uuid': '4a3f1375-541f-49c5-a7f6-a0deb4cb23c3'}}

--- Additional comment from Jean-Louis Dupond on 2022-04-14 11:10:40 UTC ---

This still happens on 4.5.0, and it looks more and more like a race condition. The host status is checked too early, which causes it to still have status UP instead of MAINTENANCE. So the Ansible playbook thinks it completed, while it hasn't actually started yet. I noticed this occurs more often when the host being upgraded is the SPM, most likely because switching the SPM adds additional latency to the move-to-maintenance flow.
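The race described above can be sketched with a small, self-contained Python example (simplified state names and hypothetical helper functions; this is not the actual ovirt_host module code). A naive wait loop accepts the very first sample that matches its condition, so if the host has not yet left UP when the first 3-second poll fires, the loop succeeds immediately. A race-safe variant first requires the host to leave UP (i.e. the move to maintenance has visibly begun) before evaluating the success condition:

```python
# Simplified sequence of host-state samples during an upgrade: the host
# is still 'up' when the first poll fires, only later does it move
# through maintenance/installing and come back 'up' after the reboot.
STATES = ["up", "up", "preparing_for_maintenance", "maintenance",
          "installing", "up"]

def naive_wait(states, condition):
    """Return on the first sample matching the condition, even if the
    maintenance flow has not started yet (the racy behavior)."""
    for i, state in enumerate(states):
        if condition(state):
            return i, state
    return None

def race_safe_wait(states, condition):
    """Only evaluate the condition once the host has left 'up', i.e.
    the transition to maintenance has actually begun."""
    started = False
    for i, state in enumerate(states):
        if not started:
            if state == "up":
                continue  # upgrade flow hasn't visibly started yet
            started = True
        if condition(state):
            return i, state
    return None
```

With `condition=lambda s: s == "up"`, the naive loop returns on the very first sample (the host never left UP), while the race-safe variant only succeeds on the last sample, after the full maintenance/upgrade cycle.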
I think that the selection of the hosts is related to the UI issue which should be fixed in [1]. Scott, am I right?

[1] https://github.com/oVirt/ovirt-engine-ui-extensions/pull/63
(In reply to Martin Necas from comment #2)
> Think that the selection of the hosts is related to the ui issue which
> should be fixed in [1].
> Scott am I right?
>
> [1] https://github.com/oVirt/ovirt-engine-ui-extensions/pull/63

Correct. The wizard did not send the host names correctly. That problem was fixed in the referenced PR.
I could not reproduce the bug (I tried the cluster upgrade a few times). Feel free to reopen if it happens again.

Verified in ovirt-ansible-collection-2.2.1-1.el8ev.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV Engine and Host Common Packages update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:6394