Description of problem:

A VM sometimes switches its state from 'Up' to 'PoweringUp' right after a full backup is initiated. This causes a second full-backup attempt to fail with:

Operation Failed: [Cannot backup VM. The Virtual Machine should be in Up status.]

Looking at the engine log, I can see the VM quickly switch its state from 'Up' to 'PoweringUp' (VM ID: a370e333-f55e-47c6-a2b0-5d90fe2ed7e5):

2020-11-01 21:02:01,501+02 INFO [org.ovirt.engine.core.bll.StartVmBackupCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-79) [backups_create_2b474e01-b643-4069] Ready to start image transfers using backup URLs
2020-11-01 21:02:04,358+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-1) [] VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'Up' --> 'PoweringUp'

Most of the time, when the same test passes, I don't see this behavior; the VM does not switch its state from 'Up' to 'PoweringUp'. See [1] for example.

[1]:

2020-11-01 21:05:53,959+02 INFO [org.ovirt.engine.core.bll.StartVmBackupCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-37) [backups_create_92ef6d0c-0fbc-4311] Ready to start image transfers using backup URLs
2020-11-01 21:06:03,843+02 INFO [org.ovirt.engine.core.bll.StopVmBackupCommand] (default task-1) [backups_syncAction_8b995808-d7d7-41e] Running command: StopVmBackupCommand internal: false. Entities affected : ID: 388d36c1-acc5-4b0d-88cb-dea70151f334 Type: VMAction group BACKUP_DISK with role type ADMIN, ID: 02f04186-e061-4e62-9007-91f8945e8512 Type: DiskAction group BACKUP_DISK with role type ADMIN
2020-11-01 21:06:03,843+02 INFO [org.ovirt.engine.core.bll.StopVmBackupCommand] (default task-1) [backups_syncAction_8b995808-d7d7-41e] Stopping VmBackup '51643e02-185a-4922-b59f-270c92f28ece'
2020-11-01 21:06:03,845+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.StopVmBackupVDSCommand] (default task-1) [backups_syncAction_8b995808-d7d7-41e] START, StopVmBackupVDSCommand(HostName = host_mixed_3, VmBackupVDSParameters:{hostId='8d3eae07-2717-4d4c-84fd-112fcce3e6c2', backupId='51643e02-185a-4922-b59f-270c92f28ece'}), log id: 2513fa92
2020-11-01 21:06:03,858+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.StopVmBackupVDSCommand] (default task-1) [backups_syncAction_8b995808-d7d7-41e] FINISH, StopVmBackupVDSCommand, return: , log id: 2513fa92
2020-11-01 21:06:03,870+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [backups_syncAction_8b995808-d7d7-41e] EVENT_ID: VM_BACKUP_FINALIZED(10,794), Backup <UNKNOWN> for VM vm_TestCase27319_0121044587 finalized (User: admin@internal-authz).

Version-Release number of selected component (if applicable):
rhv-4.4.3-12

How reproducible:
Usually this test passes fine, but in rare cases the issue is reproduced by the automation. It was first noticed on rhv-4.4.3-10 and has now also been seen on rhv-4.4.3-12.

Steps to Reproduce:
1. Create a VM from a template (with a real OS on it) and start it
2. Make a full backup of its disks + finalize
3. Repeat step 2 once more

Actual results:
The second backup fails to initiate because the VM state switched from 'Up' to 'PoweringUp' after the first backup attempt.

Expected results:
The second backup should complete successfully.
Additional info:
Logs attached (always look for VM ID: a370e333-f55e-47c6-a2b0-5d90fe2ed7e5 'vm_TestCase27319_0121002844'):
- ovirt-engine-logs%20(8).tar.gz: includes all of the logs from the relevant VDSM host (messages + libvirt + vdsm)
- hypervisor-lynx24_vdsm.log: VDSM log of this particular test case execution
- engine.log: engine log

Please note, there are no logs from the VM itself (qemu + messages), as the automation does not fetch them. This issue wasn't reproduced manually, so the attached logs are all we have.
Adding some additional info:

- Not marking this BZ as a regression because reproducing it is very difficult and its occurrence is rare.

- The impact of this issue is as follows: the user fails to make an additional (2nd) backup during the time gap in which the VM state is 'PoweringUp'. As soon as the state switches back to 'Up', the user is able to complete the second backup request. According to the logs, the VM went from 'PoweringUp' --> 'Up' right after the failed backup attempt, so the VM was in the 'PoweringUp' state for about 15 seconds [1]. I cannot say for sure what caused the VM to fall back to the 'Up' state; this could be the failed backup attempt or some other trigger.

- How reproducible: very rare. I can't state a number, as I simply don't have the statistics. I can only state that I saw this once before, on rhv-4.4.3-10.

[1]:
2020-11-01 21:02:04,358+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-1) [] VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'Up' --> 'PoweringUp'
. . .
2020-11-01 21:02:19,589+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-87) [] VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'PoweringUp' --> 'Up'
I encountered the same issue again while executing automation on GLUSTERFS, FC and NFS storage. The ERROR occurred on GLUSTERFS, while on the other storage domains the test succeeded.

The VM is started:
2020-11-05 06:17:31,323+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-23) [] VM '46db8892-f629-42f9-8f27-b0078f4b0bf2'(vm_TestCase27319_0506163526) moved from 'PoweringUp' --> 'Up'

Starting a backup:
2020-11-05 06:17:40,345+02 INFO [org.ovirt.engine.core.bll.StartVmBackupCommand] (default task-2) [backups_create_dbc4f26b-4074-4101] Running command: StartVmBackupCommand internal: false. Entities affected : ID: 46db8892-f629-42f9-8f27-b0078f4b0bf2 Type: VMAction group BACKUP_DISK with role type ADMIN, ID: 97c5cecb-6d07-4689-a78a-c25c4c9407d1 Type: DiskAction group BACKUP_DISK with role type ADMIN

The VM is switching from 'Up' to 'PoweringUp':
2020-11-05 06:17:51,732+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-51) [] VM '46db8892-f629-42f9-8f27-b0078f4b0bf2'(vm_TestCase27319_0506163526) moved from 'Up' --> 'PoweringUp'

And causing a second backup attempt to fail:
2020-11-05 06:18:01,866+02 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-2) [] Operation Failed: [Cannot backup VM. The Virtual Machine should be in Up status.]

This time there were also EXCEPTIONS when trying to delete the backup checkpoints, but it is unclear at this time whether they are related to this bug. Attaching the logs.

RHV-4.4-tier1 #105 TestCase27319
Created attachment 1728302 [details] second time reproduction
The issue might be in the VmAnalyzer -> moving to virt team
These statuses are reported by the host:

2020-11-01 21:01:23,553+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] ... VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'WaitForLaunch' --> 'PoweringUp'
2020-11-01 21:01:50,862+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] ... VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'PoweringUp' --> 'Up'
2020-11-01 21:02:04,358+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] ... VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'Up' --> 'PoweringUp'
2020-11-01 21:02:19,589+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] ... VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'PoweringUp' --> 'Up'

Not long after starting the backup:

2020-11-01 21:02:01,305+0200 INFO (jsonrpc/6) [virt.vm] (vmId='a370e333-f55e-47c6-a2b0-5d90fe2ed7e5') Starting backup for backup_id: '3afef71e-502e-4ec0-bc56-8e99577b4294', backup xml: <?xml version='1.0' encoding='utf-8'?>

Without the guest agent, VDSM reports the VM as PoweringUp for 60 seconds. So I suspect that for some reason VDSM lost the connection to the guest agent around 21:02:04 and, since 60 seconds had not yet passed since starting the VM, VDSM reported it as PoweringUp.

This scenario of taking a backup of a VM less than a minute after starting it is not that interesting IMHO. We can just wait a minute in the automation before triggering the backup, thus reducing the severity.

Tomas, do you think it is worth checking what made VDSM switch the VM status back to PoweringUp?
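The workaround suggested above can be sketched for the automation as a polling helper that only proceeds once the VM has been reported 'Up' continuously for the full 60-second grace period. This is a minimal sketch; `get_vm_status` is a hypothetical callable (not part of any real SDK) standing in for however the automation fetches the engine-reported status string:

```python
import time

def wait_until_stably_up(get_vm_status, grace=60, poll=5, timeout=600):
    """Wait until the VM has reported 'Up' continuously for `grace` seconds.

    `get_vm_status` is a hypothetical callable returning the engine-reported
    status string ('WaitForLaunch', 'PoweringUp', 'Up', ...). Any status
    other than 'Up' resets the grace-period clock, which covers exactly the
    Up --> PoweringUp flip seen in this bug. Returns True once the VM has
    been stably Up, or False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    up_since = None
    while time.monotonic() < deadline:
        if get_vm_status() == 'Up':
            if up_since is None:
                up_since = time.monotonic()
            if time.monotonic() - up_since >= grace:
                return True
        else:
            up_since = None  # a flip back to PoweringUp resets the clock
        time.sleep(poll)
    return False
```

The test would then call `wait_until_stably_up(...)` after starting the VM and before triggering the first backup; resetting the clock on every non-Up sample is what makes this robust against the mid-run flip, rather than a plain one-shot `sleep(60)`.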
(In reply to Arik from comment #7)
> Tomas, do you think it is worth checking what made VDSM switch the VM
> status back to PoweringUp?

Talked offline - this could be related to recent changes we've made and is not expected to be too complicated to resolve.
(In reply to Arik from comment #7)
> We can just wait a minute in the automation before triggering the backup,
> thus reducing the severity.

This is not an automation issue; it can happen in a real system if a backup is started while the VM is within the 60-second window in which its state is flipping. Of course, a real backup application must handle this case by retrying the operation, so the severity is limited. For the automation we control the environment and start the VM ourselves, so we can add retries if the VM has been running for less than 60 seconds.
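The retry approach described above can be sketched as follows. This is only an illustrative sketch: `start_backup` is a hypothetical callable wrapping whatever the backup application uses to issue the engine's start-backup request, assumed to raise an exception whose message contains the validation text seen in the logs:

```python
import time

def start_backup_with_retry(start_backup, retries=12, delay=10):
    """Retry a backup-start operation while the VM flips Up/PoweringUp.

    `start_backup` is a hypothetical callable issuing the start-backup
    request. The 'should be in Up status' failure is transient (per this
    bug, the VM falls back to 'Up' within the 60-second grace period), so
    only that error is retried; any other failure is re-raised immediately.
    """
    last_error = None
    for _ in range(retries):
        try:
            return start_backup()
        except Exception as e:
            if 'should be in Up status' not in str(e):
                raise  # not the transient status error -- give up now
            last_error = e
            time.sleep(delay)
    raise last_error  # still failing after all retries
```

With the defaults (12 attempts, 10 seconds apart) the retry window comfortably covers the 60-second PoweringUp period; matching on the message text is just a stand-in for whatever structured error code the real client exposes.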
(In reply to Nir Soffer from comment #9)
> This is not an automation issue; it can happen in a real system if a backup
> is started while the VM is within the 60-second window in which its state
> is flipping.

Well, right, I didn't claim it's an automation issue, but it's something that would not be that interesting to invest time in compared to other items in our backlog (as most production VMs are installed with qemu-guest-agent and thus reach the Up state much faster), if we didn't suspect it's a result of a recent change in VDSM that is relatively easy to solve (see comment 8). Anyway, I'd prefer our automation to test flows that are more likely to happen in the real world, and not to take backups of VMs that are powering up.
Verified on rhv-release-4.4.4-6.

Since this issue was originally hard to catch, I ran the same scenario described in my description over and over again, 6 times in a row. The issue wasn't reproduced. Considering this verified.
This bugzilla is included in oVirt 4.4.4 release, published on December 21st 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.