Bug 1963275 - migration controller null pointer dereference
Summary: migration controller null pointer dereference
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.5.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 2.6.6
Assignee: David Vossel
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-21 21:50 UTC by David Vossel
Modified: 2021-08-10 17:34 UTC (History)

Fixed In Version: virt-operator-container-v2.6.6-3 hco-bundle-registry-container-v2.6.6-24
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-10 17:33:37 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubevirt kubevirt pull 5709 | 0 | None | open | [release-0.34] Fixes migration controller null pointer dereference | 2021-05-26 11:22:46 UTC
Red Hat Product Errata | RHSA-2021:3119 | 0 | None | None | None | 2021-08-10 17:34:27 UTC

Description David Vossel 2021-05-21 21:50:56 UTC
Description of problem:

If a migration's target pod fails before the handoff to virt-handler on the target node occurs, the virt-controller's migration controller can get into a crash loop.

This is caused by a null pointer dereference that only occurs if the migration transitions to a failed state before the handoff to virt-handler occurs.
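To illustrate the failure mode, here is a minimal Go sketch of the kind of nil guard that avoids this class of crash. It is not the actual KubeVirt code or the upstream fix: the `VMI` and `MigrationState` types and the `markMigrationFailed` helper are hypothetical stand-ins, and the sketch assumes the migration state pointer is only populated once the handoff has happened.

```go
package main

import "fmt"

// MigrationState is a hypothetical stand-in for per-VMI migration status.
// Assumption: it is only populated once the handoff to virt-handler occurs.
type MigrationState struct {
	TargetNode string
	Failed     bool
}

// VMI is a hypothetical, heavily simplified virtual machine instance.
type VMI struct {
	Name           string
	MigrationState *MigrationState // nil until the handoff occurs
}

// markMigrationFailed records a failed migration on the VMI.
// It returns false when there is no migration state to update, instead of
// dereferencing a nil pointer and panicking.
func markMigrationFailed(vmi *VMI) bool {
	if vmi == nil || vmi.MigrationState == nil {
		return false
	}
	vmi.MigrationState.Failed = true
	return true
}

func main() {
	// Target pod failed before the handoff: no migration state exists yet.
	early := &VMI{Name: "vmi-a"}
	// Target pod failed after the handoff: migration state is present.
	late := &VMI{Name: "vmi-b", MigrationState: &MigrationState{TargetNode: "node2"}}

	fmt.Println(markMigrationFailed(early)) // guard skips the update
	fmt.Println(markMigrationFailed(late))  // state is marked failed
}
```

Without the guard, the first call would panic on the nil pointer, and a controller that re-processes the same failed migration after every restart would panic again each time, which matches the crash loop described here.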


Version-Release number of selected component (if applicable):

This was observed in a 2.5.6 cluster

How reproducible:

It's unknown how likely this scenario is to occur in the wild. It can likely be triggered manually though. 


Steps to Reproduce:
1. Post a migration object for a VMI.
2. Immediately delete the target pod as soon as it appears.
3. virt-controller might get into a crash loop.

Actual results:

virt-controller pods begin to crash loop.

Expected results:

The migration object fails and virt-controller continues to behave normally.


Additional info:

In production, this can likely be mitigated by force deleting failed migration objects from the cluster if the crash loops occur.

Comment 1 David Vossel 2021-05-21 21:51:22 UTC
There is a PR posted upstream related to this https://github.com/kubevirt/kubevirt/pull/5694

Comment 3 sgott 2021-06-09 12:30:48 UTC
David,

Have you been able to actually reproduce this, or are the steps to reproduce in the description theoretical?

Comment 4 David Vossel 2021-06-09 13:00:57 UTC
> David,
> 
> Have you been able to actually reproduce this, or are the steps to reproduce
> in the description theoretical?

I've never reproduced this. 

The crash is caused by the target pod failing before the handoff to virt-handler can occur. Theoretically it can be reproduced by deleting the target pod immediately once it is posted to the cluster before our migration controller can perform the handoff, but attempting to trigger this will be a race between virt-controller and the pod deletion.


Just so everyone is aware, we know where the crash occurs: production logs link the crash loop directly to a specific line in the code that tries to dereference the null pointer, and the posted PRs address that line. So this isn't a blind fix.

Comment 5 zhe peng 2021-07-20 06:44:16 UTC
Verified with builds:
hco: 2.6.6-35
virt-operator-container-v2.6.6-5

Steps:
1. Create a VM and start it.
2. Start a migration.
3. Immediately delete the target pod as it appears.
4. Check the virt-controller status in openshift-cnv.

No crash loop occurs.
The migration is marked as failed, the VM is still running on the source node, and virt-controller is in Running status.
A subsequent live migration works.

Tested with both Linux and Windows VMs.

Moving to verified.

Comment 10 errata-xmlrpc 2021-08-10 17:33:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.6 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3119

