Bug 1893656

Summary: [CBT] VM state switches from 'up' to 'powering up' right after full backup - causing a second immediate backup attempt to fail

| Field | Value |
|---|---|
| Product | [oVirt] vdsm |
| Component | Core |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | unspecified |
| Version | 4.30.50 |
| Target Milestone | ovirt-4.4.4 |
| Target Release | 4.40.39 |
| Fixed In Version | vdsm-4.40.39 |
| Hardware | x86_64 |
| OS | Linux |
| Reporter | Ilan Zuckerman <izuckerm> |
| Assignee | Tomáš Golembiovský <tgolembi> |
| QA Contact | Ilan Zuckerman <izuckerm> |
| CC | ahadas, bugs, eshenitz, nsoffer, sfishbai, tgolembi, tnisan |
| Keywords | Automation |
| Flags | pm-rhel: ovirt-4.4+, pm-rhel: planning_ack+, ahadas: devel_ack+, pm-rhel: testing_ack+ |
| oVirt Team | Virt |
| Type | Bug |
| Last Closed | 2021-01-12 16:23:54 UTC |
Description (Ilan Zuckerman, 2020-11-02 10:15:54 UTC)
Adding some additional info:

- Not marking this BZ as a regression, because it is very difficult to reproduce and its occurrence is rare.
- The impact of this issue is as follows: the user fails to start an additional (second) backup during the time gap in which the VM state is 'powering up'. As soon as the state switches back to 'up', the user is able to complete the second backup request. According to the logs, the VM went from 'PoweringUp' back to 'Up' right after the failed backup attempt, so the VM was in the 'powering up' state for about 15 seconds [1]. I cannot say for sure what caused the VM to fall back to the 'up' state; this could be the failed backup attempt, or some other trigger.
- How reproducible: very rare. I can't give a number, as I simply don't have the statistics; I can only state that I saw this once before, in rhv-4.4.3-10.

[1]:

    2020-11-01 21:02:04,358+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-1) [] VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'Up' --> 'PoweringUp'
    ...
    2020-11-01 21:02:19,589+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-87) [] VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'PoweringUp' --> 'Up'

I encountered the same issue again while executing automation on GLUSTERFS, FC, and NFS storage. The ERROR occurred on GLUSTERFS, while on the other SDs the backup succeeded.
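The impact described above, where a second backup only succeeds once the VM reports 'up' again, suggests a simple client-side mitigation: poll the VM status before issuing the next backup request. Below is a minimal sketch; `get_status` is a stand-in for however the client reads the VM state (e.g. via the oVirt REST API or SDK), and the timeout values are illustrative assumptions, not taken from this report.

```python
import time

def wait_for_status(get_status, wanted="up", timeout=120.0, interval=2.0):
    """Poll get_status() until it returns `wanted` or `timeout` expires.

    get_status: zero-argument callable returning the current VM state as a
    lowercase string (e.g. "up", "powering_up").  Returns True if the wanted
    state was reached, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() == wanted:
            return True
        time.sleep(interval)
    return False
```

A backup client would call this before issuing the second StartVmBackup request, since the engine rejects backups for any VM that is not in Up status.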
The VM is started:

    2020-11-05 06:17:31,323+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-23) [] VM '46db8892-f629-42f9-8f27-b0078f4b0bf2'(vm_TestCase27319_0506163526) moved from 'PoweringUp' --> 'Up'

Starting a backup:

    2020-11-05 06:17:40,345+02 INFO [org.ovirt.engine.core.bll.StartVmBackupCommand] (default task-2) [backups_create_dbc4f26b-4074-4101] Running command: StartVmBackupCommand internal: false. Entities affected : ID: 46db8892-f629-42f9-8f27-b0078f4b0bf2 Type: VMAction group BACKUP_DISK with role type ADMIN, ID: 97c5cecb-6d07-4689-a78a-c25c4c9407d1 Type: DiskAction group BACKUP_DISK with role type ADMIN

The VM is switching from 'up' to 'powering up':

    2020-11-05 06:17:51,732+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-51) [] VM '46db8892-f629-42f9-8f27-b0078f4b0bf2'(vm_TestCase27319_0506163526) moved from 'Up' --> 'PoweringUp'

causing a second backup attempt to fail:

    2020-11-05 06:18:01,866+02 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-2) [] Operation Failed: [Cannot backup VM. The Virtual Machine should be in Up status.]

This time there were also EXCEPTIONS when trying to delete the backup checkpoints, but they do not appear to be related to this bug at this time. Attaching the logs.

RHV-4.4-tier1 #105 TestCase27319

Created attachment 1728302 [details]: second time reproduction
The issue might be in the VmAnalyzer -> moving to the virt team.

These statuses are reported by the host:

    2020-11-01 21:01:23,553+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] ... VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'WaitForLaunch' --> 'PoweringUp'
    2020-11-01 21:01:50,862+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] ... VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'PoweringUp' --> 'Up'
    2020-11-01 21:02:04,358+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] ... VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'Up' --> 'PoweringUp'
    2020-11-01 21:02:19,589+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] ... VM 'a370e333-f55e-47c6-a2b0-5d90fe2ed7e5'(vm_TestCase27319_0121002844) moved from 'PoweringUp' --> 'Up'

Not long after starting the backup:

    2020-11-01 21:02:01,305+0200 INFO (jsonrpc/6) [virt.vm] (vmId='a370e333-f55e-47c6-a2b0-5d90fe2ed7e5') Starting backup for backup_id: '3afef71e-502e-4ec0-bc56-8e99577b4294', backup xml: <?xml version='1.0' encoding='utf-8'?>

Without the guest agent, VDSM reports the VM as PoweringUp for 60 seconds. So I suspect that for some reason VDSM lost the connection to the guest agent around 21:02:04, and since 60 seconds had not yet passed since starting the VM, VDSM reported it as PoweringUp. This scenario of taking a backup of a VM less than a minute after starting it is not that interesting IMHO. We can just wait a minute in the automation before triggering the backup, thus reducing the severity. Tomas, do you think it is worth checking what made VDSM switch the VM status back to PoweringUp?

(In reply to Arik from comment #7)
> Tomas, do you think it is worth checking what made VDSM switch the VM status back to PoweringUp?
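The heuristic described above, where VDSM reports PoweringUp until either the guest agent connects or 60 seconds pass, can be sketched roughly as follows. This is a simplified illustration of the behavior as described in the comment, not actual VDSM code; the function and parameter names are made up.

```python
POWERING_UP_GRACE = 60  # seconds VDSM waits for the guest agent (per the comment above)

def reported_status(uptime_seconds, guest_agent_connected):
    """Illustrative status derivation for a running VM.

    With no guest agent connection, the VM is reported as 'PoweringUp' for
    the first 60 seconds of uptime; losing the agent connection inside that
    window would make the status flip back from 'Up' to 'PoweringUp', which
    matches the transitions seen in the logs.
    """
    if guest_agent_connected:
        return "Up"
    if uptime_seconds < POWERING_UP_GRACE:
        return "PoweringUp"
    return "Up"
```

Under this model, a VM that started 40 seconds ago and briefly drops its agent connection is re-reported as PoweringUp, which is exactly the state the engine rejects backups in.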
Talked offline: this could be related to recent changes we've made, and is not expected to be too complicated to resolve.

(In reply to Arik from comment #7)
> We can just wait a minute in the automation before triggering the backup, thus reducing the severity.

This is not an automation issue; it can happen in a real system if a backup is started while the VM is in the 60-second window in which its state is flipping. Of course, a real backup application must handle this case by retrying the operation, so the severity is limited. For automation we control the environment and start the VM, so we can add retries if the VM has been running for less than 60 seconds.

(In reply to Nir Soffer from comment #9)
> This is not an automation issue; it can happen in a real system if a backup is started while the VM is in the 60-second window in which its state is flipping.

Well, right, I didn't claim it's an automation issue, but it's something that would not be that interesting to invest time in compared to other items in our backlog (as most production VMs are installed with qemu-guest-agent and thus reach the Up state much faster), if we didn't suspect it's a result of a recent change in VDSM which is relatively easy to solve (see comment 8). Anyway, I'd prefer our automation to test flows that are more likely to happen in the real world, and not to take backups of VMs that are powering up.

Verified on rhv-release-4.4.4-6. Since this issue was originally hard to catch, I ran the same scenario described in my description over and over again, 6 times in a row. The issue was not reproduced. Considering this as verified.

This bugzilla is included in the oVirt 4.4.4 release, published on December 21st 2020. Since the problem described in this bug report should be resolved in the oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
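Nir's point that a real backup application must handle this case by retrying can be sketched as a small retry wrapper around the backup-start call. This is a generic illustration, not oVirt SDK code: `start_backup` stands in for whatever call issues the StartVmBackup request, and `BackupNotReady` is a hypothetical exception type mapped from the engine's "The Virtual Machine should be in Up status" error.

```python
import time

class BackupNotReady(Exception):
    """Raised when the engine refuses a backup because the VM is not Up."""

def start_backup_with_retry(start_backup, attempts=5, delay=15.0):
    """Call start_backup(), retrying while the engine reports the VM as
    not being in Up status.

    start_backup: zero-argument callable that raises BackupNotReady on the
    'Cannot backup VM. The Virtual Machine should be in Up status.' error,
    and returns a backup handle on success.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return start_backup()
        except BackupNotReady as e:
            last_error = e
            time.sleep(delay)  # wait out the PoweringUp window before retrying
    raise last_error
```

With the defaults above, the client keeps retrying for over a minute, which comfortably covers the 60-second PoweringUp window discussed in this bug.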