1963275 – migration controller null pointer dereference

Summary: migration controller null pointer dereference

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Virtualization
Sub Component:
Version:	2.5.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	2.6.6
Assignee:	David Vossel
QA Contact:	Israel Pinto
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-05-21 21:50 UTC by David Vossel
Modified:	2024-10-01 18:18 UTC
CC List:	4 users
Fixed In Version:	virt-operator-container-v2.6.6-3 hco-bundle-registry-container-v2.6.6-24
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-08-10 17:33:37 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	kubevirt kubevirt pull 5709	0	None	open	[release-0.34] Fixes migration controller null pointer dereference	2021-05-26 11:22:46 UTC
Red Hat Product Errata	RHSA-2021:3119	0	None	None	None	2021-08-10 17:34:27 UTC

Description David Vossel 2021-05-21 21:50:56 UTC

Description of problem:

If a migration's target pod fails before the handoff to virt-handler on the target node occurs, the virt-controller's migration controller can get into a crash loop.

This is caused by a null pointer dereference that only occurs if the migration transitions to a failed state before the handoff to virt-handler occurs.
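To illustrate this class of bug, here is a minimal Go sketch of a guard that checks for missing migration state before reading it. It is not the upstream fix. The types `VMI`, `Migration`, and `MigrationState` and the function names are hypothetical simplifications. The sketch assumes the dereferenced pointer is state that only gets populated at handoff, which this report does not confirm.

```go
package main

import "fmt"

// MigrationState is a hypothetical stand-in for state that is only
// populated once the handoff to virt-handler has happened.
type MigrationState struct {
	MigrationUID string
	Completed    bool
}

// VMI is a hypothetical, simplified virtual machine instance.
type VMI struct {
	Name           string
	MigrationState *MigrationState // assumed to be nil until the handoff occurs
}

// Migration is a hypothetical, simplified migration object.
type Migration struct {
	UID    string
	Failed bool
}

// ownsMigrationState reports whether the VMI carries state for this
// migration. The nil checks run first, so a migration that failed before
// the handoff never causes a nil pointer dereference.
func ownsMigrationState(vmi *VMI, m *Migration) bool {
	return vmi != nil &&
		vmi.MigrationState != nil &&
		vmi.MigrationState.MigrationUID == m.UID
}

// handleFailedMigration returns a short description of the action taken.
func handleFailedMigration(vmi *VMI, m *Migration) string {
	if !m.Failed {
		return "nothing to do"
	}
	if !ownsMigrationState(vmi, m) {
		// Failure happened before the handoff: there is no state to update.
		return "failed before handoff; no VMI state to update"
	}
	vmi.MigrationState.Completed = true
	return "failed after handoff; VMI state marked completed"
}

func main() {
	m := &Migration{UID: "mig-1", Failed: true}

	// Target pod failed before the handoff: MigrationState is still nil.
	fmt.Println(handleFailedMigration(&VMI{Name: "vmi-a"}, m))

	// Target pod failed after the handoff: MigrationState is populated.
	handedOff := &VMI{Name: "vmi-b", MigrationState: &MigrationState{MigrationUID: "mig-1"}}
	fmt.Println(handleFailedMigration(handedOff, m))
}
```

Without the nil check in `ownsMigrationState`, the first call would panic. In a controller that re-processes the same failed object on every restart, that panic would recur on each reconcile, which matches the crash loop described here.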


Version-Release number of selected component (if applicable):

This was observed in a 2.5.6 cluster

How reproducible:

It's unknown how likely this scenario is to occur in the wild. It can likely be triggered manually though. 


Steps to Reproduce:
1. post a migration object for a vmi
2. immediately delete the target pod right as it appears.
3. virt-controller might get into a crash loop. 

Actual results:

virt-controller pods begin to crash loop.

Expected results:

migration object fails and virt-controller continues to behave normally


Additional info:

In production, this can likely be mitigated by force deleting failed migration objects from the cluster if the crash loops occur.

Comment 1 David Vossel 2021-05-21 21:51:22 UTC

There is a related PR posted upstream: https://github.com/kubevirt/kubevirt/pull/5694

Comment 3 sgott 2021-06-09 12:30:48 UTC

David,

Have you been able to actually reproduce this, or are the steps to reproduce in the description theoretical?

Comment 4 David Vossel 2021-06-09 13:00:57 UTC

> David,
> 
> Have you been able to actually reproduce this, or are the steps to reproduce
> in the description theoretical?

I've never reproduced this. 

The crash is caused by the target pod failing before the handoff to virt-handler can occur. Theoretically it can be reproduced by deleting the target pod immediately once it is posted to the cluster before our migration controller can perform the handoff, but attempting to trigger this will be a race between virt-controller and the pod deletion.


Just so everyone is aware, we know where the crash is occurring based on production logs that link the crash loop directly to a specific line in the code that tries to dereference a null pointer. The posted PR addresses that line, so this isn't a blind fix.

Comment 5 zhe peng 2021-07-20 06:44:16 UTC

Verified with build:
hco: 2.6.6-35
virt-operator-container-v2.6.6-5

Steps:
1. Create a VM and start it.
2. Start a migration.
3. Immediately delete the target pod as it appears.
4. Check the virt-controller status in openshift-cnv.

No crash loop occurs.
The migration is marked as failed, the VM is still running on the source node, and virt-controller is in running status.
A second live migration succeeds.

Tested with both Linux and Windows VMs.

Moving to verified.

Comment 10 errata-xmlrpc 2021-08-10 17:33:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.6 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3119