Bug 1893656 - [CBT] VM state switches from 'up' to 'powering up' right after full backup - causing a second immediate backup attempt to fail
Summary: [CBT] VM state switches from 'up' to 'powering up' right after full backup - causing a second immediate backup attempt to fail
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.30.50
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.4.4
Target Release: 4.40.39
Assignee: Tomáš Golembiovský
QA Contact: Ilan Zuckerman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-02 10:15 UTC by Ilan Zuckerman
Modified: 2021-11-04 19:28 UTC
CC List: 7 users

Fixed In Version: vdsm-4.40.39
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-12 16:23:54 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.4+
pm-rhel: planning_ack+
ahadas: devel_ack+
pm-rhel: testing_ack+


Attachments
second time reproduction (461.89 KB, application/zip)
2020-11-11 11:49 UTC, Ilan Zuckerman


Links
oVirt gerrit 112317 (master, MERGED): vm: remember UP state after power up (last updated 2021-01-05 08:07:13 UTC)

Description Ilan Zuckerman 2020-11-02 10:15:54 UTC
Description of problem:
VM sometimes switches its state from "up" to "PoweringUp" right after a full backup is initiated.
This causes a second 'full backup' attempt to fail with:

Operation Failed: [Cannot backup VM. The Virtual Machine should be in Up status.]

Looking at the engine log I can see that the VM quickly switches its state from 'up' to 'PoweringUp' (VM ID: a370e333-f55e-47c6-a2b0-5d90fe2ed7e5):

2020-11-01 21:02:01,501+02 INFO  [org.ovirt.engine.core.bll.StartVmBackupCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-79) [backups_create_2b474e01-b643-4069] Ready to start image transfers using backup URLs
2020-11-01 21:02:04,358+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-1) [] VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'Up' --> 'PoweringUp'


Most of the time, when the same test passes, I don't see this kind of behavior: the VM does not switch its state from 'up' to 'PoweringUp'. See [1] for an example.

[1]:
2020-11-01 21:05:53,959+02 INFO  [org.ovirt.engine.core.bll.StartVmBackupCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-37) [backups_create_92ef6d0c-0fbc-4311] Ready to start image transfers using backup URLs
2020-11-01 21:06:03,843+02 INFO  [org.ovirt.engine.core.bll.StopVmBackupCommand] (default task-1) [backups_syncAction_8b995808-d7d7-41e] Running command: StopVmBackupCommand internal: false. Entities affected :  ID: 388d36c1-acc5-4b0d-88cb-dea70151f334 Type: VMAction group BACKUP_DISK with role type ADMIN,  ID: 02f04186-e061-4e62-9007-91f8945e8512 Type: DiskAction group BACKUP_DISK with role type ADMIN
2020-11-01 21:06:03,843+02 INFO  [org.ovirt.engine.core.bll.StopVmBackupCommand] (default task-1) [backups_syncAction_8b995808-d7d7-41e] Stopping VmBackup '51643e02-185a-4922-b59f-270c92f28ece'
2020-11-01 21:06:03,845+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.StopVmBackupVDSCommand] (default task-1) [backups_syncAction_8b995808-d7d7-41e] START, StopVmBackupVDSCommand(HostName = host_mixed_3, VmBackupVDSParameters:{hostId='8d3eae07-2717-4d4c-84fd-112fcce3e6c2', backupId='51643e02-185a-4922-b59f-270c92f28ece'}), log id: 2513fa92
2020-11-01 21:06:03,858+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.StopVmBackupVDSCommand] (default task-1) [backups_syncAction_8b995808-d7d7-41e] FINISH, StopVmBackupVDSCommand, return: , log id: 2513fa92
2020-11-01 21:06:03,870+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [backups_syncAction_8b995808-d7d7-41e] EVENT_ID: VM_BACKUP_FINALIZED(10,794), Backup <UNKNOWN> for VM vm_TestCase27319_0121044587 finalized (User: admin@internal-authz).


Version-Release number of selected component (if applicable):
rhv-4.4.3-12

How reproducible:
Usually this test passes fine, but in rare cases the issue is reproduced by the automation.
The issue was first noticed on rhv-4.4.3-10 and has now also been seen on rhv-4.4.3-12.

Steps to Reproduce:
1. Create a VM from template (with real OS on it) and start it
2. Make a full backup of its disks + finalize
3. repeat step 2 once more
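
For reference, a minimal sketch of how these steps can be driven from a script, assuming the ovirt-engine-sdk-python 4.4 backup API; the engine URL, credentials and VM name below are placeholders:

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details - adjust to your engine.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=vm_TestCase27319')[0]   # placeholder VM name
vm_service = vms_service.vm_service(vm.id)
backups_service = vm_service.backups_service()

def run_full_backup():
    # Full backup of all attached disks (no from_checkpoint_id given).
    disks = [types.Disk(id=att.disk.id)
             for att in vm_service.disk_attachments_service().list()]
    backup = backups_service.add(types.Backup(disks=disks))
    backup_service = backups_service.backup_service(backup.id)
    # Wait until the backup phase is READY (image transfers would run here),
    # then finalize it. A real script would also handle a FAILED phase and
    # poll until finalization completes.
    while backup_service.get().phase != types.BackupPhase.READY:
        time.sleep(1)
    backup_service.finalize()

run_full_backup()   # step 2: first full backup + finalize
run_full_backup()   # step 3: immediate second attempt - the one that may fail
connection.close()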

Actual results:
The second backup fails to start because the VM state switched from 'up' to 'PoweringUp' after the first backup attempt.

Expected results:
Second backup should complete successfully. 

Additional info:
Logs attached (Always look for VM ID:
a370e333-f55e-47c6-a2b0-5d90fe2ed7e5 'vm_TestCase27319_0121002844'):

ovirt-engine-logs%20(8).tar.gz : Includes all of the logs from relevant VDSM host (messages + libvirt + vdsm)
hypervisor-lynx24_vdsm.log	: VDSM log of this particular test case execution
engine.log : engine log

Please note, there are no logs from the VM itself (qemu + messages), as the automation does not fetch them. The issue was not reproduced manually, so the attached logs are all we have.

Comment 3 Ilan Zuckerman 2020-11-02 12:28:01 UTC
Adding some additional info:

- Not marking this BZ as a regression because it is very difficult to reproduce and its occurrence is rare.

- The impact of this issue is as follows:
The user cannot start an additional (2nd) backup during the time gap in which the VM state is 'powering up'. As soon as the state switches back to 'up', the user is able to complete the second backup request.
According to the logs, the VM went from 'PoweringUp' --> 'Up' right after the failed backup attempt,
so the VM was in the 'powering up' state for about 15 seconds [1].
I cannot say for sure what caused the VM to switch back to the 'up' state; it could be the failed backup attempt or some other trigger.

- How reproducible: very rare. I can't state a number, as I simply don't have the statistics. I can only say that I saw this once before, on rhv-4.4.3-10.

[1]:

2020-11-01 21:02:04,358+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-1) [] VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'Up' --> 'PoweringUp'
.
.
.
2020-11-01 21:02:19,589+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-87) [] VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'PoweringUp' --> 'Up'

Comment 4 Ilan Zuckerman 2020-11-11 11:48:48 UTC
I encountered the same issue again while executing the automation on GLUSTERFS, FC, and NFS storage.
The ERROR occurred on GLUSTERFS, while on the other storage domains the test succeeded.

The VM is started:

2020-11-05 06:17:31,323+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-23) [] VM '46db8892-f629-42f9-8f27-b0078f4b0bf2'(vm_TestCase27319_0506163526) moved from 'PoweringUp' --> 'Up'


Starting a backup:

2020-11-05 06:17:40,345+02 INFO  [org.ovirt.engine.core.bll.StartVmBackupCommand] (default task-2) [backups_create_dbc4f26b-4074-4101] Running command: StartVmBackupCommand internal: false. Entities affected :  ID: 46db8892-f629-42f9-8f27-b0078f4b0bf2 Type: VMAction group BACKUP_DISK with role type ADMIN,  ID: 97c5cecb-6d07-4689-a78a-c25c4c9407d1 Type: DiskAction group BACKUP_DISK with role type ADMIN


VM is switching from 'up' to 'powering up':

2020-11-05 06:17:51,732+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-51) [] VM '46db8892-f629-42f9-8f27-b0078f4b0bf2'(vm_TestCase27319_0506163526) moved from 'Up' --> 'PoweringUp'


And causing a second backup attempt to fail:

2020-11-05 06:18:01,866+02 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-2) [] Operation Failed: [Cannot backup VM. The Virtual Machine should be in Up status.]


This time there were also EXCEPTIONS when trying to delete the backup checkpoints, but I am not relating them to this bug at this time.
Attaching the logs.
RHV-4.4-tier1 #105 
TestCase27319

Comment 5 Ilan Zuckerman 2020-11-11 11:49:25 UTC
Created attachment 1728302 [details]
second time reproduction

Comment 6 Eyal Shenitzky 2020-11-11 11:49:50 UTC
The issue might be in the VmAnalyzer -> moving to virt team

Comment 7 Arik 2020-11-11 21:03:02 UTC
These statuses are reported by the host:
2020-11-01 21:01:23,553+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] ... VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'WaitForLaunch' --> 'PoweringUp'
2020-11-01 21:01:50,862+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] ... VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'PoweringUp' --> 'Up'
2020-11-01 21:02:04,358+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] ... VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'Up' --> 'PoweringUp'
2020-11-01 21:02:19,589+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] ... VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'PoweringUp' --> 'Up'

Not long after starting the backup:
2020-11-01 21:02:01,305+0200 INFO  (jsonrpc/6) [virt.vm] (vmId='a370e333-f55e-47c6-a2b0-5d90fe2ed7e5') Starting backup for backup_id: '3afef71e-502e-4ec0-bc56-8e99577b4294', backup xml: <?xml version='1.0' encoding='utf-8'?>

Without the guest agent, VDSM reports the VM as PoweringUp for 60 sec.
So I suspect that for some reason VDSM lost the connection to the guest agent around 21:02:04, and since 60 sec had not yet passed since the VM was started, VDSM reported it as PoweringUp.
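
As a rough illustration of this 60-second window, here is a conceptual sketch in plain Python (this is not VDSM's actual code; the class, attribute names and constant are invented, and it only mirrors the behavior described above plus the "remember UP state" idea from the linked gerrit change):

POWERING_UP_WINDOW = 60  # seconds; mirrors the 60 sec window mentioned above

class VmStatusSketch:
    def __init__(self, started_at):
        self.started_at = started_at
        self.reached_up = False  # the "remember UP state" part of the fix

    def status(self, guest_agent_connected, now):
        if guest_agent_connected:
            self.reached_up = True
            return 'Up'
        # No guest agent: inside the first 60 seconds the VM used to be
        # reported as PoweringUp even if it had already been reported Up
        # (the bug); remembering reached_up prevents that fallback.
        if (now - self.started_at) < POWERING_UP_WINDOW and not self.reached_up:
            return 'PoweringUp'
        return 'Up'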

This scenario of taking a backup of a VM less than a minute after starting it is not that interesting IMHO.
We can just wait a minute in the automation before triggering the backup, thus reducing the severity.

Tomas, do you think it is worth checking what made VDSM switch the VM status back to PoweringUp?

Comment 8 Arik 2020-11-12 14:47:18 UTC
(In reply to Arik from comment #7)
> Tomas, do you think it is worth checking what made VDSM switch the VM status
> back to PoweringUp?

Talked offline - this could be related to recent changes we've made and is not expected to be too complicated to resolve

Comment 9 Nir Soffer 2020-11-16 09:49:57 UTC
(In reply to Arik from comment #7)
> We can just wait a minute in the automation before triggering the backup,
> thus reducing the severity.

This is not an automation issue; it can happen on a real system if a backup
is started while the VM is in the 60-second window during which it is flipping
its state.

Of course, a real backup application must handle this case by retrying
the operation, so the severity is limited.

For automation, we control the environment and start the VM, so we can
add retries if the VM has been running for less than 60 seconds.
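
For example, a retry wrapper along these lines would cover both cases. This is only a sketch, assuming the ovirt-engine-sdk-python 4.4 backup API; the retry count and delay are made-up values:

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

def start_backup_with_retry(vm_service, disks, retries=6, delay=10):
    # Retry starting a backup while the VM briefly reports PoweringUp.
    backups_service = vm_service.backups_service()
    last_error = None
    for _ in range(retries):
        if vm_service.get().status == types.VmStatus.UP:
            try:
                return backups_service.add(types.Backup(disks=disks))
            except sdk.Error as error:
                # Typically "Cannot backup VM. The Virtual Machine should be in Up status."
                last_error = error
        time.sleep(delay)
    raise last_error or RuntimeError('VM did not settle in Up state; backup not started')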

Comment 10 Arik 2020-11-16 11:24:16 UTC
(In reply to Nir Soffer from comment #9)
> This is not an automation issue; it can happen on a real system if a backup
> is started while the VM is in the 60-second window during which it is flipping
> its state.

Well, right, I didn't claim it's an automation issue, but it is something that would not be interesting enough to invest time in compared to other items in our backlog (most production VMs have qemu-guest-agent installed and thus reach the Up state much faster), if we didn't suspect it is the result of a recent change in VDSM that is relatively easy to solve (see comment 8).
Anyway, I'd prefer our automation to test flows that are more likely to happen in the real world, and not to take backups of VMs that are powering up.

Comment 11 Ilan Zuckerman 2020-12-28 13:59:39 UTC
Verified on rhv-release-4.4.4-6

Since this issue was originally hard to catch, I ran the same scenario described in my description repeatedly, 6 times in a row.
The issue was not reproduced. Considering this verified.

Comment 12 Sandro Bonazzola 2021-01-12 16:23:54 UTC
This bugzilla is included in oVirt 4.4.4 release, published on December 21st 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

