Description of problem:
If a migration's target pod fails before the handoff to virt-handler on the target node occurs, virt-controller's migration controller can get into a crash loop. This is caused by a null pointer dereference that only occurs if the migration transitions to a failed state before the handoff to virt-handler occurs.

Version-Release number of selected component (if applicable):
This was observed in a 2.5.6 cluster.

How reproducible:
It's unknown how likely this scenario is to occur in the wild. It can likely be triggered manually, though.

Steps to Reproduce:
1. Post a migration object for a VMI.
2. Immediately delete the target pod as soon as it appears.
3. virt-controller might get into a crash loop.

Actual results:
virt-controller pods begin to crash loop.

Expected results:
The migration object fails and virt-controller continues to behave normally.

Additional info:
In production, this can likely be mitigated by force deleting failed migration objects from the cluster if the crash loops occur.
There is a PR posted upstream related to this: https://github.com/kubevirt/kubevirt/pull/5694
David, have you been able to actually reproduce this, or are the steps to reproduce in the description theoretical?
> David,
>
> Have you been able to actually reproduce this, or is the steps to reproduce
> in the description theoretical?

I've never reproduced this. The crash is caused by the target pod failing before the handoff to virt-handler can occur. Theoretically it can be reproduced by deleting the target pod immediately after it is posted to the cluster, before our migration controller can perform the handoff, but any attempt to trigger this will be a race between virt-controller and the pod deletion.

Just so everyone is aware, we know where the crash is occurring: production logs link the crash loop directly to a specific line in the code that tries to dereference the null pointer that the posted PRs address. So this isn't a blind fix.
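For anyone following along without the code open, here is a rough Go sketch of the general failure mode and the kind of nil guard that avoids it. This is not the KubeVirt code or the content of the upstream PR. All types and names (`Migration`, `VMI`, `MigrationState`, `targetNodeForCleanup`) are simplified, hypothetical stand-ins, and which pointer is actually nil in the real controller is an assumption made only for illustration.

```go
package main

import "fmt"

// All types here are simplified, hypothetical stand-ins, not KubeVirt's API.

type MigrationPhase string

const (
	MigrationScheduling MigrationPhase = "Scheduling"
	MigrationFailed     MigrationPhase = "Failed"
)

// MigrationState stands in for state that is only recorded at handoff.
type MigrationState struct {
	TargetNode string
}

// VMI stands in for the virtual machine instance being migrated.
type VMI struct {
	// Assumption: this stays nil until the handoff to virt-handler happens.
	MigrationState *MigrationState
}

// Migration stands in for the migration object posted to the cluster.
type Migration struct {
	Phase MigrationPhase
}

// targetNodeForCleanup returns the target node to clean up after a failed
// migration. It returns false when the migration has not failed or when it
// failed before any handoff state was recorded.
func targetNodeForCleanup(m *Migration, vmi *VMI) (string, bool) {
	if m == nil || m.Phase != MigrationFailed {
		return "", false
	}
	// Guard: without this check, reading vmi.MigrationState.TargetNode
	// panics when the migration failed before the handoff.
	if vmi == nil || vmi.MigrationState == nil {
		return "", false
	}
	return vmi.MigrationState.TargetNode, true
}

func main() {
	// Target pod failed before handoff: no migration state was recorded.
	failedEarly := &Migration{Phase: MigrationFailed}
	node, ok := targetNodeForCleanup(failedEarly, &VMI{})
	fmt.Printf("node=%q ok=%v\n", node, ok)
}
```

The point of the sketch is only that a failed phase does not imply the handoff state exists, so the failure path has to check for nil before reading it.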
Verified with build: HCO 2.6.6-35, virt-operator-container-v2.6.6-5

Steps:
1. Create a VM and start it.
2. Start a migration.
3. Immediately delete the target pod as it appears.
4. Check the virt-controller status in openshift-cnv.

Results:
No crash loop occurs. The migration is failed. The VM is still running on the source node. virt-controller is in Running status. Running a live migration again works. Tested with both Linux and Windows VMs.

Moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 2.6.6 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3119