Created attachment 1844894 [details] vm status is flipping while there is an IO error Description of problem: Once a VM is Paused due to IO error, the VM status is flipping between Paused and Running. Version-Release number of selected component (if applicable): OCP 4.9.4 and master How reproducible: VM in a situation which has IO error with its volume Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
Guohua, can you check the VM Yaml for printable status to see if the status is toggling in the backend as well? If so this is not a bug
(In reply to Tal Nisan from comment #2) > Guohua, can you check the VM Yaml for printable status to see if the status > is toggling in the backend as well? If so this is not a bug The status is toggling in the VM yaml as well, move the bug to virt. $ for i in {1..50}; do oc get vm rhel8-inadequate-goat -o jsonpath='{.status.printableStatus}' && echo ''; done Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Running Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Running Paused Paused Paused Running Paused
Here is roughly what I believe is happening. 1. IO error causes domain to pause 2. Virt-handler detects domain is paused and adds paused condition 3. during SyncVMI() in virt-launcher, virt-launcher sees the VMI is paused but also sees that no user explicitly asked for the domain to be paused 4. virt-launcher unpauses the domain in SyncVMI() 5. virt-handler briefly see's domain is running again and removes paused condition. 6. go to step 1 When an IO error occurs and causes the domain to pause, we need to define what the recovery behavior is after that. If we attempt to auto-recover the VMI like we are today, then this loop will likely be the result. An alternative is to leave the VMI paused and require user intervention. Below are virt-launcher logs that show this loop in action. {"component":"virt-launcher","kind":"","level":"info","msg":"Domain unpaused.","name":"rhel8-inadequate-goat","namespace":"moran","pos":"manager.go:836","timestamp":"2021-12-07T21:31:04.075202Z","uid":"b70be1a2-dce8-4cbc-88bb-975a6ce3759b"} {"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"rhel8-inadequate-goat","namespace":"moran","pos":"server.go:190","timestamp":"2021-12-07T21:31:04.082563Z","uid":"b70be1a2-dce8-4cbc-88bb-975a6ce3759b"} {"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Paused(3):IOError(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04.088456Z"} {"component":"virt-launcher","level":"info","msg":"Domain name event: moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-07T21:31:04.134650Z"} {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 4 with reason 0 received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.292311Z"} {"component":"virt-launcher","kind":"","level":"info","msg":"Domain unpaused.","name":"rhel8-inadequate-goat","namespace":"moran","pos":"manager.go:836","timestamp":"2021-12-07T21:31:04.303848Z","uid":"b70be1a2-dce8-4cbc-88bb-975a6ce3759b"} {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 3 with reason 2 received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.312421Z"} {"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"rhel8-inadequate-goat","namespace":"moran","pos":"server.go:190","timestamp":"2021-12-07T21:31:04.318488Z","uid":"b70be1a2-dce8-4cbc-88bb-975a6ce3759b"} {"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Paused(3):IOError(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04.331304Z"} {"component":"virt-launcher","level":"info","msg":"Domain name event: moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-07T21:31:04.346578Z"} {"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Paused(3):IOError(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04.362494Z"} {"component":"virt-launcher","level":"info","msg":"Domain name event: moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-07T21:31:04.376045Z"} {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 4 with reason 0 received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.502867Z"} {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 3 with reason 2 received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.515245Z"} {"component":"virt-launcher","kind":"","level":"info","msg":"Domain unpaused.","name":"rhel8-inadequate-goat","namespace":"moran","pos":"manager.go:836","timestamp":"2021-12-07T21:31:04.524333Z","uid":"b70be1a2-dce8-4cbc-88bb-975a6ce3759b"} {"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Running(1):Unknown(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04.526013Z"} {"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"rhel8-inadequate-goat","namespace":"moran","pos":"server.go:190","timestamp":"2021-12-07T21:31:04.528873Z","uid":"b70be1a2-dce8-4cbc-88bb-975a6ce3759b"} {"component":"virt-launcher","level":"info","msg":"Domain name event: moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-07T21:31:04.534102Z"} {"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Paused(3):IOError(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04.554236Z"} {"component":"virt-launcher","level":"info","msg":"Domain name event: moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-07T21:31:04.576413Z"} {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 4 with reason 0 received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.622782Z"} {"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Running(1):Unknown(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04.639922Z"} {"component":"virt-launcher","level":"info","msg":"Domain name event: moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12-07T21:31:04.644242Z"} {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 3 with reason 2 received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.666535Z"}
Alex, can you lend some insight into how this was intended to behave from storage's PoV? Was the intention that the user should un-pause manually? Is KubeVirt missing a que it should be using to know it's OK to un-pause again?
As long as the cause of the IOError has not been resolved, I would expect the VM to remain in this loop. The VM tries to unpause, immediately gets the IOError again, and goes paused again. That behaviour would not change if we don't automatically unpause periodically and have the user do it manually. Maybe the virt controller should have a small period of time after it unpauses the VM before it updates the status to be running again, that would give the sync a chance to see if the VM is paused again due to the IOError. I don't think it is unreasonable behavior for the VM to try and unpause itself periodically because maybe the reason for the IOError resolved itself, and it is possible to continue. I think the bug is about the flip flop between running and paused due to this.
Postponing to the next release as there's not a clear path forward. Any change we make will require design work.
(In reply to David Vossel from comment #4) > Here is roughly what I believe is happening. > > 1. IO error causes domain to pause > 2. Virt-handler detects domain is paused and adds paused condition > 3. during SyncVMI() in virt-launcher, virt-launcher sees the VMI is paused > but also sees that no user explicitly asked for the domain to be paused > 4. virt-launcher unpauses the domain in SyncVMI() > 5. virt-handler briefly see's domain is running again and removes paused > condition. > 6. go to step 1 > > > When an IO error occurs and causes the domain to pause, we need to define > what the recovery behavior is after that. If we attempt to auto-recover the > VMI like we are today, then this loop will likely be the result. An > alternative is to leave the VMI paused and require user intervention. > > > Below are virt-launcher logs that show this loop in action. > > {"component":"virt-launcher","kind":"","level":"info","msg":"Domain > unpaused.","name":"rhel8-inadequate-goat","namespace":"moran","pos":"manager. > go:836","timestamp":"2021-12-07T21:31:04.075202Z","uid":"b70be1a2-dce8-4cbc- > 88bb-975a6ce3759b"} > {"component":"virt-launcher","kind":"","level":"info","msg":"Synced > vmi","name":"rhel8-inadequate-goat","namespace":"moran","pos":"server.go: > 190","timestamp":"2021-12-07T21:31:04.082563Z","uid":"b70be1a2-dce8-4cbc- > 88bb-975a6ce3759b"} > {"component":"virt-launcher","level":"info","msg":"kubevirt domain status: > Paused(3):IOError(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04. > 088456Z"} > {"component":"virt-launcher","level":"info","msg":"Domain name event: > moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12- > 07T21:31:04.134650Z"} > {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 4 > with reason 0 > received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.292311Z"} > {"component":"virt-launcher","kind":"","level":"info","msg":"Domain > unpaused.","name":"rhel8-inadequate-goat","namespace":"moran","pos":"manager. > go:836","timestamp":"2021-12-07T21:31:04.303848Z","uid":"b70be1a2-dce8-4cbc- > 88bb-975a6ce3759b"} > {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 3 > with reason 2 > received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.312421Z"} > {"component":"virt-launcher","kind":"","level":"info","msg":"Synced > vmi","name":"rhel8-inadequate-goat","namespace":"moran","pos":"server.go: > 190","timestamp":"2021-12-07T21:31:04.318488Z","uid":"b70be1a2-dce8-4cbc- > 88bb-975a6ce3759b"} > {"component":"virt-launcher","level":"info","msg":"kubevirt domain status: > Paused(3):IOError(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04. > 331304Z"} > {"component":"virt-launcher","level":"info","msg":"Domain name event: > moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12- > 07T21:31:04.346578Z"} > {"component":"virt-launcher","level":"info","msg":"kubevirt domain status: > Paused(3):IOError(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04. > 362494Z"} > {"component":"virt-launcher","level":"info","msg":"Domain name event: > moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12- > 07T21:31:04.376045Z"} > {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 4 > with reason 0 > received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.502867Z"} > {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 3 > with reason 2 > received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.515245Z"} > {"component":"virt-launcher","kind":"","level":"info","msg":"Domain > unpaused.","name":"rhel8-inadequate-goat","namespace":"moran","pos":"manager. > go:836","timestamp":"2021-12-07T21:31:04.524333Z","uid":"b70be1a2-dce8-4cbc- > 88bb-975a6ce3759b"} > {"component":"virt-launcher","level":"info","msg":"kubevirt domain status: > Running(1):Unknown(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31: > 04.526013Z"} > {"component":"virt-launcher","kind":"","level":"info","msg":"Synced > vmi","name":"rhel8-inadequate-goat","namespace":"moran","pos":"server.go: > 190","timestamp":"2021-12-07T21:31:04.528873Z","uid":"b70be1a2-dce8-4cbc- > 88bb-975a6ce3759b"} > {"component":"virt-launcher","level":"info","msg":"Domain name event: > moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12- > 07T21:31:04.534102Z"} > {"component":"virt-launcher","level":"info","msg":"kubevirt domain status: > Paused(3):IOError(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31:04. > 554236Z"} > {"component":"virt-launcher","level":"info","msg":"Domain name event: > moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12- > 07T21:31:04.576413Z"} > {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 4 > with reason 0 > received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.622782Z"} > {"component":"virt-launcher","level":"info","msg":"kubevirt domain status: > Running(1):Unknown(5)","pos":"client.go:262","timestamp":"2021-12-07T21:31: > 04.639922Z"} > {"component":"virt-launcher","level":"info","msg":"Domain name event: > moran_rhel8-inadequate-goat","pos":"client.go:386","timestamp":"2021-12- > 07T21:31:04.644242Z"} > {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 3 > with reason 2 > received","pos":"client.go:411","timestamp":"2021-12-07T21:31:04.666535Z"} Looks like this flips multiple times per second on a single VM. We should inhibit updates from Paused due to IOError when we get Running again. Something like waiting X seconds before we switch the condition again after we experienced an IO error. Lets' see that we fix this in 4.11. It has the potential of having quite some impact on the apisever if the whole storage has an outage.
Deferring to 4.13 due to capacity.
verify with build:CNV-v4.13.0.rhel9-1884 step: 1. create a vm and start 2. trigger IO error 3. check vm status $ for i in {1..50}; do oc get vm vm-fedora -o jsonpath='{.status.printableStatus}' && echo ''; done Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused Paused also check UI, there is no flipping. move to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:3205