Description of problem:

In previous oVirt versions, the cluster upgrade via the UI performed a sequential upgrade of each node: update node -> reboot -> finished -> next node -> etc. Now I've noticed that multiple hosts are upgraded at the same time, which is not a good situation in some environments. In addition, the cluster was put back online before the last node was even put into maintenance and upgraded:

EVENT_ID: ANSIBLE_RUNNER_EVENT_NOTIFICATION(559), Ansible Runner.. Set cluster upgrade status to finished.

This occurred before the last node was put into maintenance mode.

How reproducible:
Every time

Steps to Reproduce:
1. Upgrade cluster via UI
Could you please provide engine logs, so we can investigate the sequence of actions? And which version are you using?
Hi,

This is on the current oVirt version (4.4.9.5). Just before the upgrade:

2021-11-03T10:26:16+0100 SUBDEBUG Upgrade: ansible-2.9.27-2.el8.noarch
2021-11-03T10:27:25+0100 SUBDEBUG Upgrade: ovirt-ansible-collection-1.6.5-1.el8.noarch
2021-11-03T10:27:49+0100 SUBDEBUG Upgraded: ovirt-ansible-collection-1.6.2-1.el8.noarch
2021-11-03T10:27:50+0100 SUBDEBUG Upgraded: ansible-2.9.25-1.el8.noarch

It's hard to upload the complete oVirt engine log, as it contains tons of sensitive data, but I've attached the output of grep ANSIBLE ovirt-engine.log.
Created attachment 1849013 [details] engine log ANSIBLE
As you can see in the logs, it upgrades 001 and finishes, then upgrades 002 and finishes, but then it suddenly starts upgrading 003 and 004 at the same time. When those are finished, it sets the cluster upgrade status to finished, and only then starts upgrading 005 ...
Sorry, but that's not enough information to investigate. Could you please attach the complete engine.log and mark the attachment as private? That way only Red Hat employees will be able to access the log.
Info provided via mail as I can't upload attachment as private.
Hi, this is a bit of a tricky issue, because in the logs you sent it looks like the upgrade started and then immediately ended. Did you have 'check for upgrade' enabled? Also, when I tested this on my environment I could not reproduce the issue, and we have not made any major changes in the cluster_upgrade role between 1.6.2 and 1.6.5 (only some Gluster changes).
Hi Martin, yes, 'check for upgrade' was enabled during the cluster upgrade. I did another upgrade of the cluster today, and the same issue occurred again.
I did some more investigation, and it looks like a race condition here.

Upgrade host is started:

2022-03-14 11:22:13,033 - runner_service.services.playbook - DEBUG - cb_event_handler event_data={'uuid': '00505697-8380-8e82-18e6-0000000002c1', 'counter': 702, 'stdout': '\r\nTASK [ovirt.ovirt.cluster_upgrade : Upgrade host] ******************************', 'start_line': 693, 'end_line': 695, 'runner_ident': '9230d23a-a380-11ec-8536-005056978380', 'event': 'playbook_on_task_start', 'pid': 7689, 'created': '2022-03-14T10:22:13.032398', 'parent_uuid': '00505697-8380-8e82-18e6-000000000005', 'event_data': {'playbook': 'ovirt-cluster-upgrade.yml', 'playbook_uuid': '48039698-615b-43fe-b3ff-27284188194a', 'play': 'oVirt cluster upgrade wizard target', 'play_uuid': '00505697-8380-8e82-18e6-000000000005', 'play_pattern': 'localhost', 'task': 'Upgrade host', 'task_uuid': '00505697-8380-8e82-18e6-0000000002c1', 'task_action': 'ovirt_host', 'task_args': '', 'task_path': '/usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/cluster_upgrade/tasks/upgrade.yml:76', 'role': 'cluster_upgrade', 'name': 'ovirt.ovirt.cluster_upgrade : Upgrade host', 'is_conditional': False, 'uuid': '00505697-8380-8e82-18e6-0000000002c1'}}

The upgrade command is executed:

2022-03-14 11:22:13,637+01 INFO [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Lock Acquired to object 'EngineLock:{exclusiveLocks='', sharedLocks='[d497efe5-2344-4d58-8985-7b053d3c35a3=POOL]'}'
2022-03-14 11:22:13,640+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [] EVENT_ID: HOST_UPGRADE_STARTED(840), Host ovn001 upgrade was started (User: admin@internal).
2022-03-14 11:22:13,751+01 INFO [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Running command: MaintenanceNumberOfVdssCommand internal: true. Entities affected : ID: 7f6bd5e0-59c1-42e3-84e4-40cef0dc684d Type: VDSAction group MANIPULATE_HOST with role type ADMIN
2022-03-14 11:22:13,755+01 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] START, SetVdsStatusVDSCommand(HostName = ovn001, SetVdsStatusVDSCommandParameters:{hostId='7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', status='PreparingForMaintenance', nonOperationalReason='NONE', stopSpmFailureLogged='true', maintenanceReason=''}), log id: 38d5d176
2022-03-14 11:22:13,755+01 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] VDS 'ovn001' is spm and moved from up calling resetIrs.
2022-03-14 11:22:13,756+01 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] START, ResetIrsVDSCommand( ResetIrsVDSCommandParameters:{storagePoolId='d497efe5-2344-4d58-8985-7b053d3c35a3', ignoreFailoverLimit='false', vdsId='7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', ignoreStopFailed='false'}), log id: 190620f0
2022-03-14 11:22:13,761+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] START, SpmStopVDSCommand(HostName = ovn001, SpmStopVDSCommandParameters:{hostId='7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', storagePoolId='d497efe5-2344-4d58-8985-7b053d3c35a3'}), log id: 414d8653
2022-03-14 11:22:13,767+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] SpmStopVDSCommand::Stopping SPM on vds 'ovn001', pool id 'd497efe5-2344-4d58-8985-7b053d3c35a3'
2022-03-14 11:22:16,991+01 INFO [org.ovirt.engine.core.bll.EndExternalStepCommand] (default task-1) [f0ac5a1f-1a3b-4141-9ec0-6c1c1f407573] Running command: EndExternalStepCommand internal: false. Entities affected : ID: aaa00000-0000-0000-0000-123456789aaa Type: SystemAction group INJECT_EXTERNAL_TASKS with role type ADMIN
2022-03-14 11:22:17,174+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] FINISH, SpmStopVDSCommand, return: , log id: 414d8653
2022-03-14 11:22:17,182+01 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] FINISH, ResetIrsVDSCommand, return: , log id: 190620f0
2022-03-14 11:22:17,185+01 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Clearing domains data for host ovn001
2022-03-14 11:22:17,196+01 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] FINISH, SetVdsStatusVDSCommand, return: , log id: 38d5d176
2022-03-14 11:22:17,202+01 INFO [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[d497efe5-2344-4d58-8985-7b053d3c35a3=POOL]'}'
2022-03-14 11:22:17,261+01 INFO [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Running command: MaintenanceVdsCommand internal: true. Entities affected : ID: 7f6bd5e0-59c1-42e3-84e4-40cef0dc684d Type: VDS

As you can see, the SetVdsStatusVDSCommand only finishes about 4 seconds after the start of the upgrade call. But in https://github.com/oVirt/ovirt-ansible-collection/blob/78a8362f953b08dc1b1553d7bb4b50696b3e19e6/plugins/modules/ovirt_host.py#L596 there is a loop which checks the status of the host every 3 seconds. This means the command is started at 11:22:13, and at 11:22:16 it checks the state of the host. At that time the host is still in hoststate.UP, and Ansible thinks it's fine to continue ...
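To make the timing concrete, here is a minimal, hypothetical sketch of that wait pattern. This is not the actual ovirt_host.py code; the status constants and helper names are made up for illustration. With a 3-second poll interval and an engine that needs roughly 4 seconds to move the host out of UP, the very first poll already satisfies a "host is UP again" condition:

# Hypothetical, simplified illustration of the wait pattern described above;
# the real logic lives in plugins/modules/ovirt_host.py in ovirt-ansible-collection.
import time

UP = "up"
MAINTENANCE = "maintenance"

def wait_for(get_status, condition, poll_interval=3, timeout=3600):
    # Poll the host status until `condition` is true or the timeout expires.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition(get_status()):
            return True
        time.sleep(poll_interval)
    return False

def upgrade_host(host_service, get_status):
    # `host_service.upgrade` is a stand-in for the asynchronous upgrade call on
    # the engine side; the engine still has to move the host through
    # PreparingForMaintenance (plus an SPM hand-over) before anything happens.
    host_service.upgrade(reboot=True)
    # Race: the first poll fires after ~3 s, the engine needs ~4 s to take the
    # host out of UP, so a "host is back UP" condition passes immediately and
    # the module returns before the upgrade has even started.
    return wait_for(get_status, lambda status: status == UP)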
Which is exactly what we see in the ansible runner logs: 2022-03-14 11:22:16,693 - runner_service.services.playbook - DEBUG - cb_event_handler event_data={'uuid': '4a3f1375-541f-49c5-a7f6-a0deb4cb23c3', 'counter': 704, 'stdout': 'changed: [localhost]', 'start_line': 695, 'end_line': 696, 'runner_ident': '9230d23a-a380-11ec-8536-005056978380', 'event': 'runner_on_ok', 'pid': 7689, 'created': '2022-03-14T10:22:16.692617', 'parent_uuid': '00505697-8380-8e82-18e6-0000000002c1', 'event_data': {'playbook': 'ovirt-cluster-upgrade.yml', 'playbook_uuid': '48039698-615b-43fe-b3ff-27284188194a', 'play': 'oVirt cluster upgrade wizard target', 'play_uuid': '00505697-8380-8e82-18e6-000000000005', 'play_pattern': 'localhost', 'task': 'Upgrade host', 'task_uuid': '00505697-8380-8e82-18e6-0000000002c1', 'task_action': 'ovirt_host', 'task_args': '', 'task_path': '/usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/cluster_upgrade/tasks/upgrade.yml:76', 'role': 'cluster_upgrade', 'host': 'localhost', 'remote_addr': '127.0.0.1', 'res': {'changed': True, 'id': '7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', 'host': {'href': '/ovirt-engine/api/hosts/7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', 'comment': '', 'id': '7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', 'name': 'ovn001', ...... 'spm': {'priority': 5, 'status': 'spm'}, 'ssh': {'fingerprint': 'SHA256:j3mXtCrNCckmuewn+qK8OJo309P+VLjKukb9yfyUAi8', ...... , 'status': 'up', ..... , 'diff': {'after': {}, 'before': {}}, 'invocation': {'module_args': {'name': 'ovn001', 'state': 'upgraded', 'check_upgrade': True, 'reboot_after_upgrade': True, 'timeout': 3600, 'wait': True, 'poll_interval': 3, 'fetch_nested': False, 'nested_attributes': [], 'public_key': False, 'force': False, 'activate': True, 'id': None, 'comment': None, 'cluster': None, 'address': None, 'ssh_port': None, 'password': None, 'kdump_integration': None, 'spm_priority': None, 'override_iptables': None, 'reboot_after_installation': None, 'override_display': None, 'kernel_params': None, 'hosted_engine': None, 'power_management_enabled': None, 'iscsi': None, 'vgpu_placement': None}}, '_ansible_no_log': False}, 'start': '2022-03-14T10:22:13.033105', 'end': '2022-03-14T10:22:16.692385', 'duration': 3.65928, 'event_loop': None, 'uuid': '4a3f1375-541f-49c5-a7f6-a0deb4cb23c3'}}
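The reported duration of 3.65928 s lines up almost exactly with the module's 3-second poll_interval. For anyone digging through similar output, a small helper along these lines makes the timing easy to spot; the field names are taken from the events pasted above, and it assumes you have dumped the events as JSON objects, one per line (the debug log above prints Python dict reprs, so adjust accordingly):

# Tiny helper for scanning ansible-runner events and printing how long each
# "Upgrade host" task took. Field names match the events pasted above.
import json
import sys

def print_upgrade_durations(path):
    with open(path) as events:
        for line in events:
            event = json.loads(line)
            data = event.get("event_data", {})
            if event.get("event") == "runner_on_ok" and data.get("task") == "Upgrade host":
                print(data.get("task"), data.get("start"), data.get("end"),
                      f'{data.get("duration", 0):.2f}s')

if __name__ == "__main__":
    print_upgrade_durations(sys.argv[1])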
This still happens on 4.5.0, and it looks more and more like a race condition. The host status is checked too early, so it still reports UP instead of MAINTENANCE, and the Ansible playbook thinks the upgrade completed while it actually hasn't started yet. I noticed this occurs more often when the host being upgraded is the SPM, most likely because switching the SPM adds extra latency to the move-to-maintenance flow.
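One way to avoid this kind of race, sketched below purely as an illustration and not as the actual fix that went into the collection, is a two-phase wait: first wait for the host to leave UP at all (proving the maintenance/upgrade flow really started), and only then wait for it to come back UP after the reboot:

# Hypothetical two-phase wait that avoids the race described above.
import time

UP = "up"

def wait_until(get_status, predicate, poll_interval=3, timeout=3600):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate(get_status()):
            return True
        time.sleep(poll_interval)
    return False

def wait_for_upgrade(get_status):
    # Phase 1: the upgrade has really begun once the host is no longer UP.
    if not wait_until(get_status, lambda status: status != UP, timeout=600):
        raise TimeoutError("host never left UP; upgrade apparently did not start")
    # Phase 2: the upgrade is done once the host is UP again after the reboot.
    if not wait_until(get_status, lambda status: status == UP):
        raise TimeoutError("host did not come back UP after the upgrade")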
4.4.10 (4.6 CL) > 4.5.2 (4.7 CL), 3 hosts in engine.
Update cluster from UI passed - not marked as done prematurely.

Tested on:
Software Version: 4.5.2-0.3.el8ev
This bugzilla is included in oVirt 4.5.2 release, published on August 10th 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.