Bug 2029391 - VM status flipping between Paused and Running
Summary: VM status flipping between Paused and Running
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.9.4
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.13.0
Assignee: ffossemo
QA Contact: zhe peng
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-12-06 11:46 UTC by Guohua Ouyang
Modified: 2023-05-18 02:56 UTC (History)
11 users (show)

Fixed In Version: hco-bundle-registry-container-v4.13.0.rhel9-1124
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-05-18 02:55:41 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
vm status is flipping while there is an IO error (251.92 KB, image/gif)
2021-12-06 11:46 UTC, Guohua Ouyang
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github kubevirt kubevirt pull 8657 0 None open vm-cycle: use exponential backoff when pause due to IO error 2022-11-02 09:40:58 UTC
Red Hat Issue Tracker CNV-15198 0 None None None 2022-10-27 11:49:29 UTC
Red Hat Product Errata RHSA-2023:3205 0 None None None 2023-05-18 02:56:30 UTC

Description Guohua Ouyang 2021-12-06 11:46:56 UTC
Created attachment 1844894 [details]
vm status is flipping while there is an IO error

Description of problem:
Once a VM is Paused due to IO error, the VM status is flipping between Paused and Running.

Version-Release number of selected component (if applicable):
OCP 4.9.4 and master

How reproducible:
VM in a situation which has IO error with its volume

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Tal Nisan 2021-12-06 12:41:44 UTC
Guohua, can you check the VM Yaml for printable status to see if the status is toggling in the backend as well? If so this is not a bug

Comment 3 Guohua Ouyang 2021-12-07 01:14:05 UTC
(In reply to Tal Nisan from comment #2)
> Guohua, can you check the VM Yaml for printable status to see if the status
> is toggling in the backend as well? If so this is not a bug

The status is toggling in the VM yaml as well, move the bug to virt.

$ for i in {1..50}; do oc get vm rhel8-inadequate-goat -o jsonpath='{.status.printableStatus}' && echo ''; done 
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Running
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Running
Paused
Paused
Paused
Running
Paused

Comment 4 David Vossel 2021-12-07 21:50:48 UTC
Here is roughly what I believe is happening.

1. IO error causes domain to pause
2. Virt-handler detects domain is paused and adds paused condition
3. during SyncVMI() in virt-launcher, virt-launcher sees the VMI is paused but also sees that no user explicitly asked for the domain to be paused
4. virt-launcher unpauses the domain in SyncVMI()
5. virt-handler briefly see's domain is running again and removes paused condition.
6. go to step 1


When an IO error occurs and causes the domain to pause, we need to define what the recovery behavior is after that. If we attempt to auto-recover the VMI like we are today, then this loop will likely be the result. An alternative is to leave the VMI paused and require user intervention.


Below are virt-launcher logs that show this loop in action.

{"component":"virt-launcher","kind":"","level":"info","msg":"Domain unpaused.","name":"rhel8-inadequate-goat","namespace":"moran","pos":"manager.go:836","timestamp":"2021-12-07T21:31:04.075202Z","uid":"b70be1a2-dce8-4cbc-88bb-975a6ce3759b"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"rhel8-inadequate-goat","namespace":"moran","pos":"server.go:190","timestamp":"2021-12-07T21:31:04.082563Z","uid":"b70be1a2-dce8-4cbc-88bb-975a6ce3759b"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Paused(3):IOError(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04.088456Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-07T21:31:04.134650Z"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 4 with reason 0 received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.292311Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Domain unpaused.","name":"rhel8-inadequate-goat","namespace":"moran","pos":"manager.go:836","timestamp":"2021-12-07T21:31:04.303848Z","uid":"b70be1a2-dce8-4cbc-88bb-975a6ce3759b"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 3 with reason 2 received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.312421Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"rhel8-inadequate-goat","namespace":"moran","pos":"server.go:190","timestamp":"2021-12-07T21:31:04.318488Z","uid":"b70be1a2-dce8-4cbc-88bb-975a6ce3759b"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Paused(3):IOError(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04.331304Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-07T21:31:04.346578Z"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Paused(3):IOError(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04.362494Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-07T21:31:04.376045Z"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 4 with reason 0 received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.502867Z"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 3 with reason 2 received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.515245Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Domain unpaused.","name":"rhel8-inadequate-goat","namespace":"moran","pos":"manager.go:836","timestamp":"2021-12-07T21:31:04.524333Z","uid":"b70be1a2-dce8-4cbc-88bb-975a6ce3759b"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Running(1):Unknown(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04.526013Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"rhel8-inadequate-goat","namespace":"moran","pos":"server.go:190","timestamp":"2021-12-07T21:31:04.528873Z","uid":"b70be1a2-dce8-4cbc-88bb-975a6ce3759b"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-07T21:31:04.534102Z"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Paused(3):IOError(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04.554236Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-07T21:31:04.576413Z"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 4 with reason 0 received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.622782Z"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Running(1):Unknown(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04.639922Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-07T21:31:04.644242Z"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 3 with reason 2 received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.666535Z"}

Comment 5 sgott 2021-12-07 22:21:01 UTC
Alex, can you lend some insight into how this was intended to behave from storage's PoV? Was the intention that the user should un-pause manually? Is KubeVirt missing a que it should be using to know it's OK to un-pause again?

Comment 6 Alexander Wels 2021-12-08 13:02:57 UTC
As long as the cause of the IOError has not been resolved, I would expect the VM to remain in this loop. The VM tries to unpause, immediately gets the IOError again, and goes paused again. That behaviour would not change if we don't automatically unpause periodically and have the user do it manually. Maybe the virt controller should have a small period of time after it unpauses the VM before it updates the status to be running again, that would give the sync a chance to see if the VM is paused again due to the IOError. I don't think it is unreasonable behavior for the VM to try and unpause itself periodically because maybe the reason for the IOError resolved itself, and it is possible to continue. I think the bug is about the flip flop between running and paused due to this.

Comment 7 sgott 2021-12-13 14:04:16 UTC
Postponing to the next release as there's not a clear path forward. Any change we make will require design work.

Comment 8 Roman Mohr 2022-02-09 13:30:56 UTC
(In reply to David Vossel from comment #4)
> Here is roughly what I believe is happening.
> 
> 1. IO error causes domain to pause
> 2. Virt-handler detects domain is paused and adds paused condition
> 3. during SyncVMI() in virt-launcher, virt-launcher sees the VMI is paused
> but also sees that no user explicitly asked for the domain to be paused
> 4. virt-launcher unpauses the domain in SyncVMI()
> 5. virt-handler briefly see's domain is running again and removes paused
> condition.
> 6. go to step 1
> 
> 
> When an IO error occurs and causes the domain to pause, we need to define
> what the recovery behavior is after that. If we attempt to auto-recover the
> VMI like we are today, then this loop will likely be the result. An
> alternative is to leave the VMI paused and require user intervention.
> 
> 
> Below are virt-launcher logs that show this loop in action.
> 
> {"component":"virt-launcher","kind":"","level":"info","msg":"Domain
> unpaused.","name":"rhel8-inadequate-goat","namespace":"moran","pos":"manager.
> go:836","timestamp":"2021-12-07T21:31:04.075202Z","uid":"b70be1a2-dce8-4cbc-
> 88bb-975a6ce3759b"}
> {"component":"virt-launcher","kind":"","level":"info","msg":"Synced
> vmi","name":"rhel8-inadequate-goat","namespace":"moran","pos":"server.go:
> 190","timestamp":"2021-12-07T21:31:04.082563Z","uid":"b70be1a2-dce8-4cbc-
> 88bb-975a6ce3759b"}
> {"component":"virt-launcher","level":"info","msg":"kubevirt domain status:
> Paused(3):IOError(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04.
> 088456Z"}
> {"component":"virt-launcher","level":"info","msg":"Domain name event:
> moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-
> 07T21:31:04.134650Z"}
> {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 4
> with reason 0
> received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.292311Z"}
> {"component":"virt-launcher","kind":"","level":"info","msg":"Domain
> unpaused.","name":"rhel8-inadequate-goat","namespace":"moran","pos":"manager.
> go:836","timestamp":"2021-12-07T21:31:04.303848Z","uid":"b70be1a2-dce8-4cbc-
> 88bb-975a6ce3759b"}
> {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 3
> with reason 2
> received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.312421Z"}
> {"component":"virt-launcher","kind":"","level":"info","msg":"Synced
> vmi","name":"rhel8-inadequate-goat","namespace":"moran","pos":"server.go:
> 190","timestamp":"2021-12-07T21:31:04.318488Z","uid":"b70be1a2-dce8-4cbc-
> 88bb-975a6ce3759b"}
> {"component":"virt-launcher","level":"info","msg":"kubevirt domain status:
> Paused(3):IOError(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04.
> 331304Z"}
> {"component":"virt-launcher","level":"info","msg":"Domain name event:
> moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-
> 07T21:31:04.346578Z"}
> {"component":"virt-launcher","level":"info","msg":"kubevirt domain status:
> Paused(3):IOError(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04.
> 362494Z"}
> {"component":"virt-launcher","level":"info","msg":"Domain name event:
> moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-
> 07T21:31:04.376045Z"}
> {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 4
> with reason 0
> received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.502867Z"}
> {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 3
> with reason 2
> received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.515245Z"}
> {"component":"virt-launcher","kind":"","level":"info","msg":"Domain
> unpaused.","name":"rhel8-inadequate-goat","namespace":"moran","pos":"manager.
> go:836","timestamp":"2021-12-07T21:31:04.524333Z","uid":"b70be1a2-dce8-4cbc-
> 88bb-975a6ce3759b"}
> {"component":"virt-launcher","level":"info","msg":"kubevirt domain status:
> Running(1):Unknown(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:
> 04.526013Z"}
> {"component":"virt-launcher","kind":"","level":"info","msg":"Synced
> vmi","name":"rhel8-inadequate-goat","namespace":"moran","pos":"server.go:
> 190","timestamp":"2021-12-07T21:31:04.528873Z","uid":"b70be1a2-dce8-4cbc-
> 88bb-975a6ce3759b"}
> {"component":"virt-launcher","level":"info","msg":"Domain name event:
> moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-
> 07T21:31:04.534102Z"}
> {"component":"virt-launcher","level":"info","msg":"kubevirt domain status:
> Paused(3):IOError(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04.
> 554236Z"}
> {"component":"virt-launcher","level":"info","msg":"Domain name event:
> moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-
> 07T21:31:04.576413Z"}
> {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 4
> with reason 0
> received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.622782Z"}
> {"component":"virt-launcher","level":"info","msg":"kubevirt domain status:
> Running(1):Unknown(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:
> 04.639922Z"}
> {"component":"virt-launcher","level":"info","msg":"Domain name event:
> moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-
> 07T21:31:04.644242Z"}
> {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 3
> with reason 2
> received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.666535Z"}

Looks like this flips multiple times per second on a single VM. We should inhibit updates from Paused due to IOError when we get Running again. Something like waiting X seconds before we switch the condition again after we experienced an IO error.

Lets' see that we fix this in 4.11. It has the potential of having quite some impact on the apisever if the whole storage has an outage.

Comment 11 Antonio Cardace 2022-11-18 16:09:07 UTC
Deferring to 4.13 due to capacity.

Comment 14 zhe peng 2023-03-29 03:30:08 UTC
verify with build:CNV-v4.13.0.rhel9-1884

step:
1. create a vm and start
2. trigger IO error 
3. check vm status
$ for i in {1..50}; do oc get vm vm-fedora -o jsonpath='{.status.printableStatus}' && echo ''; done
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused
Paused

also check UI, there is no flipping.
move to verified.

Comment 17 errata-xmlrpc 2023-05-18 02:55:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3205


Note You need to log in before you can comment on or make changes to this bug.