Description of problem:
During live migration (triggered by node drain), the source virt-launcher pod status is sometimes Failed instead of Completed.

Version-Release number of selected component (if applicable):
virt-operator 4.8.0-24

How reproducible:
Cannot determine an exact rate, but it happens in roughly 1 out of 5-10 attempts. Did not manage to reproduce by manually live migrating the VM.

Steps to Reproduce:
1. Create a VM
2. Drain the node on which the VMI is running

Actual results:
The source virt-launcher pod status is sometimes Failed instead of Completed. The migration itself is successful and the source pod is deleted.

Expected results:
Source pod status should be Completed.

Additional info:
virt-launcher pod log attached (also attaching the log of a pod which reached Completed status, for reference)

====================
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  creationTimestamp: null
  labels:
    kubevirt.io/vm: vm-nodemaintenance-99818
  name: vm-nodemaintenance-99818-1616679201-4999232
spec:
  running: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        kubevirt.io/domain: vm-nodemaintenance-99818-1616679201-4999232
        kubevirt.io/vm: vm-nodemaintenance-99818-1616679201-4999232
    spec:
      domain:
        cpu:
          cores: 1
          model: Haswell-noTSX
        devices:
          disks:
          - disk:
              bus: virtio
            name: containerdisk
          - disk:
              bus: virtio
            name: cloudinitdisk
          interfaces:
          - masquerade: {}
            name: default
          rng: {}
        machine:
          type: ''
        resources:
          requests:
            memory: 1024Mi
      evictionStrategy: LiveMigrate
      networks:
      - name: default
        pod: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - containerDisk:
          image: quay.io/openshift-cnv/qe-cnv-tests-fedora:33
        name: containerdisk
      - cloudInitNoCloud:
          userData: |
            #cloud-config
            chpasswd: { expire: False }
            password: fedora
        name: cloudinitdisk
====================
Name:         vm-nodemaintenance-99818-1616679201-4999232
Namespace:    virt-migration-and-maintenance-test-node-maintenance
Labels:       kubevirt.io/domain=vm-nodemaintenance-99818-1616679201-4999232
              kubevirt.io/migrationTargetNodeName=ruty-4-8-24-3-k6xf5-worker-0-9x6br
              kubevirt.io/nodeName=ruty-4-8-24-3-k6xf5-worker-0-9x6br
              kubevirt.io/vm=vm-nodemaintenance-99818-1616679201-4999232
Annotations:  kubevirt.io/latest-observed-api-version: v1
              kubevirt.io/storage-observed-api-version: v1alpha3
API Version:  kubevirt.io/v1
Kind:         VirtualMachineInstance
Metadata:
  Creation Timestamp:  2021-03-25T13:33:22Z
  Finalizers:
    foregroundDeleteVirtualMachine
  Generation:  24
  Managed Fields:
    API Version:  kubevirt.io/v1alpha3
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubevirt.io/latest-observed-api-version:
          f:kubevirt.io/storage-observed-api-version:
        f:labels:
          .:
          f:kubevirt.io/domain:
          f:kubevirt.io/migrationTargetNodeName:
          f:kubevirt.io/vm:
        f:ownerReferences:
      f:spec:
        .:
        f:domain:
          .:
          f:cpu:
            .:
            f:cores:
            f:model:
          f:devices:
            .:
            f:disks:
            f:interfaces:
            f:rng:
          f:firmware:
            .:
            f:uuid:
          f:machine:
            .:
            f:type:
          f:resources:
            .:
            f:requests:
              .:
              f:memory:
        f:evictionStrategy:
        f:networks:
        f:terminationGracePeriodSeconds:
        f:volumes:
      f:status:
        .:
        f:activePods:
          .:
          f:28d14edb-ee12-45ef-a0ab-625dc1372006:
        f:conditions:
        f:guestOSInfo:
        f:launcherContainerImageVersion:
        f:migrationState:
          .:
          f:migrationUid:
          f:sourceNode:
          f:targetNode:
          f:targetPod:
        f:qosClass:
    Manager:      virt-controller
    Operation:    Update
    Time:         2021-03-25T13:34:47Z
    API Version:  kubevirt.io/v1alpha3
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:kubevirt.io/nodeName:
      f:status:
        f:interfaces:
        f:migrationMethod:
        f:migrationState:
          f:completed:
          f:endTimestamp:
          f:mode:
          f:startTimestamp:
          f:targetDirectMigrationNodePorts:
            .:
            f:39555:
            f:41675:
            f:46845:
          f:targetNodeAddress:
          f:targetNodeDomainDetected:
        f:nodeName:
        f:phase:
        f:volumeStatus:
    Manager:    virt-handler
    Operation:  Update
    Time:       2021-03-25T13:34:48Z
  Owner References:
    API Version:           kubevirt.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  VirtualMachine
    Name:                  vm-nodemaintenance-99818-1616679201-4999232
    UID:                   fbfb8c03-fe94-483e-ad25-723ee9f4b260
  Resource Version:  1305931
  UID:               088f74d8-6962-40b8-a222-6ecd68335ff9
Spec:
  Domain:
    Cpu:
      Cores:  1
      Model:  Haswell-noTSX
    Devices:
      Disks:
        Disk:
          Bus:  virtio
        Name:   containerdisk
        Disk:
          Bus:  virtio
        Name:   cloudinitdisk
      Interfaces:
        Mac Address:  02:6a:39:00:00:17
        Masquerade:
        Name:  default
      Rng:
    Features:
      Acpi:
        Enabled:  true
    Firmware:
      Uuid:  82d8bbda-5217-5edb-ad24-ccf36dc8efb4
    Machine:
      Type:  pc-q35-rhel8.3.0
    Resources:
      Requests:
        Cpu:     100m
        Memory:  1Gi
  Eviction Strategy:  LiveMigrate
  Networks:
    Name:  default
    Pod:
  Termination Grace Period Seconds:  30
  Volumes:
    Container Disk:
      Image:              quay.io/openshift-cnv/qe-cnv-tests-fedora:33
      Image Pull Policy:  IfNotPresent
    Name:                 containerdisk
    Cloud Init No Cloud:
      User Data:  #cloud-config
                  chpasswd: { expire: False }
                  password: fedora
    Name:  cloudinitdisk
Status:
  Active Pods:
    28d14edb-ee12-45ef-a0ab-625dc1372006:  ruty-4-8-24-3-k6xf5-worker-0-9x6br
  Conditions:
    Last Probe Time:       <nil>
    Last Transition Time:  <nil>
    Status:                True
    Type:                  LiveMigratable
    Last Probe Time:       2021-03-25T13:33:59Z
    Last Transition Time:  <nil>
    Status:                True
    Type:                  AgentConnected
    Last Probe Time:       2021-03-25T13:33:59Z
    Last Transition Time:  <nil>
    Status:                True
    Type:                  AgentVersionNotSupported
    Last Probe Time:       <nil>
    Last Transition Time:  2021-03-25T13:34:33Z
    Status:                True
    Type:                  Ready
  Guest OS Info:
  Interfaces:
    Interface Name:  eth0
    Ip Address:      10.128.2.69
    Ip Addresses:
      10.128.2.69
    Mac:   02:6a:39:00:00:17
    Name:  default
  Launcher Container Image Version:  registry.redhat.io/container-native-virtualization/virt-launcher@sha256:ec861cba276b8986b373d900e3297773262f4bc9f21dbc794a9c2e153ad14947
  Migration Method:                  BlockMigration
  Migration State:
    Completed:        true
    End Timestamp:    2021-03-25T13:34:47Z
    Migration UID:    00209959-b03e-4e8b-a388-1e21a56816d7
    Mode:             PreCopy
    Source Node:      ruty-4-8-24-3-k6xf5-worker-0-r27hd
    Start Timestamp:  2021-03-25T13:34:33Z
    Target Direct Migration Node Ports:
      39555:  0
      41675:  49153
      46845:  49152
    Target Node:                  ruty-4-8-24-3-k6xf5-worker-0-9x6br
    Target Node Address:          10.128.2.8
    Target Node Domain Detected:  true
    Target Pod:                   virt-launcher-vm-nodemaintenance-99818-1616679201-4999232-zkm6w
  Node Name:  ruty-4-8-24-3-k6xf5-worker-0-9x6br
  Phase:      Running
  Qos Class:  Burstable
  Volume Status:
    Name:    cloudinitdisk
    Target:  vdb
    Name:    containerdisk
    Target:  vda
Events:
  Type    Reason            Age                    From                         Message
  ----    ------            ----                   ----                         -------
  Normal  SuccessfulCreate  6m12s                  disruptionbudget-controller  Created PodDisruptionBudget kubevirt-disruption-budget-fwsq4
  Normal  SuccessfulCreate  6m12s                  virtualmachine-controller    Created virtual machine pod virt-launcher-vm-nodemaintenance-99818-1616679201-4999232-4m4pc
  Normal  Started           6m5s                   virt-handler                 VirtualMachineInstance started.
  Normal  SuccessfulCreate  5m9s                   disruptionbudget-controller  Created Migration kubevirt-evacuation-5dkd9
  Normal  Created           5m1s (x19 over 6m5s)   virt-handler                 VirtualMachineInstance defined.
  Normal  PreparingTarget   5m1s (x2 over 5m1s)    virt-handler                 VirtualMachineInstance Migration Target Prepared.
  Normal  PreparingTarget   5m1s                   virt-handler                 Migration Target is listening at 10.128.2.8, on ports: 39555,46845,41675
  Normal  Migrating         4m47s (x5 over 5m1s)   virt-handler                 VirtualMachineInstance is migrating.
  Normal  Created           2m39s (x8 over 4m47s)  virt-handler                 VirtualMachineInstance defined.
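For reference, the reproduction steps above boil down to roughly the following CLI flow (a minimal sketch; the manifest file name is a placeholder, <source-node> is whichever node the VMI lands on, and it assumes the launcher pods carry the kubevirt.io/domain label from the template, as they do here):

  # Create and start the VM from the manifest above (spec.running is false)
  $ oc apply -f vm-nodemaintenance.yaml
  $ virtctl start vm-nodemaintenance-99818-1616679201-4999232

  # Find the node hosting the VMI, then drain it to trigger the eviction migration
  $ oc get vmi vm-nodemaintenance-99818-1616679201-4999232 -o jsonpath='{.status.nodeName}'
  $ oc adm drain <source-node> --ignore-daemonsets --delete-emptydir-data --force

  # Watch the launcher pods; the source pod should end up Completed, not Failed
  $ oc get pods -l kubevirt.io/domain=vm-nodemaintenance-99818-1616679201-4999232 -w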
Created attachment 1766322: Failed launcher pod log
Created attachment 1766323: Completed launcher pod log
Treating this as high priority because it can impact test automation.
Vladik, this BZ seems somewhat surprising. Have we noticed this behavior before?
I think a possible issue is that we didn't manage to stop the cmd server properly and it's still running. Then, when the domain disappears because of a successful migration, we just fail.

{"component":"virt-launcher","level":"info","msg":"stopping cmd server","pos":"server.go:446","timestamp":"2021-03-25T13:34:48.104441Z"}
{"component":"virt-launcher","level":"error","msg":"timeout on stopping the cmd server, continuing anyway.","pos":"server.go:457","timestamp":"2021-03-25T13:34:49.104607Z"}
{"component":"virt-launcher","level":"info","msg":"Exiting...","pos":"virt-launcher.go:471","timestamp":"2021-03-25T13:34:49.104768Z"}
{"component":"virt-launcher","level":"error","msg":"dirty virt-launcher shutdown","pos":"virt-launcher.go:532","reason":"wait: no child processes","timestamp":"2021-03-25T13:34:49.108344Z"}
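To spot this exit on other runs, the tail of the source launcher log and the compute container's exit code are enough (a sketch; the pod name is the failed source pod from this run's events, and it assumes the main launcher container is named "compute", as in standard virt-launcher pods):

  $ oc logs virt-launcher-vm-nodemaintenance-99818-1616679201-4999232-4m4pc -c compute | tail -n 5
  $ oc get pod virt-launcher-vm-nodemaintenance-99818-1616679201-4999232-4m4pc \
      -o jsonpath='{.status.containerStatuses[?(@.name=="compute")].state.terminated.exitCode}'

A "dirty virt-launcher shutdown" line followed by a non-zero exit code is the signature of this race.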
@rnetser just so we accurately prioritize this, what is the impact on the VMI workload when the source pod exits with failure instead of Completed?

Are we saying that the migration completes successfully and the VMI is ultimately healthy, but the source pod exits in failure? Or does this impact the VMI workload's availability somehow as well?
(In reply to David Vossel from comment #6)
> @rnetser just so we accurately prioritize this, what is the impact on the
> VMI workload when the source pod exits with failure instead of completed?
>
> Are we saying that the migration completes successfully and the VMI is
> ultimately healthy, but the source pod exits in failure? or does this
> impact the VMI workload's availability somehow as well?

@dvossel
The source pod sometimes (could not find a deterministic scenario) exits with failure.
The migration completes successfully; a process running in the VMI prior to the migration is still running after the migration.
During migration the SSH connection is broken but can be re-established once the migration is done.
I do wonder whether the (lack of) effect on the VMI is consistent; I will run this flow a few more times and update.
(In reply to Ruth Netser from comment #8)
> (In reply to David Vossel from comment #6)
> > @rnetser just so we accurately prioritize this, what is the impact on
> > the VMI workload when the source pod exits with failure instead of
> > completed?
> >
> > Are we saying that the migration completes successfully and the VMI is
> > ultimately healthy, but the source pod exits in failure? or does this
> > impact the VMI workload's availability somehow as well?
>
> @dvossel
> The source pod sometimes (could not find a deterministic scenario) exits
> with failure.
> The migration completes successfully; a process running in the VMI prior
> to the migration is still running after the migration.
> During migration the SSH connection is broken but can be re-established
> once the migration is done.
> I do wonder whether the (lack of) effect on the VMI is consistent; I will
> run this flow a few more times and update.

Based on the VMI logs, this looks like it's simply a race condition in our virt-launcher shutdown flow that exits non-zero when it should be exiting successfully with a zero exit code... so a false negative: we're indicating there's a failure of some sort when there is none.
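To double-check that this is purely a reporting problem on a given run, you can compare the VMI's recorded migration state against the leftover source pod phase (a sketch using the status fields shown in the describe output above; the label selector assumes the launcher pods inherit kubevirt.io/domain):

  # migrationState.completed should print "true" even when the source pod
  # phase below shows Failed
  $ oc get vmi vm-nodemaintenance-99818-1616679201-4999232 \
      -o jsonpath='{.status.migrationState.completed}{"\n"}'
  $ oc get pods -l kubevirt.io/domain=vm-nodemaintenance-99818-1616679201-4999232 \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'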
Posted a PR to fix this race condition: https://github.com/kubevirt/kubevirt/pull/6058.
The PR was merged downstream (D/S) in CNV 4.9.
Verified on:
Client Version: 4.9.0-202109010444.p0.git.eab618a.assembly.stream-eab618a
Server Version: 4.9.0-fc.0
Kubernetes Version: v1.22.0-rc.0+f77e483
CNV: kubevirt-hyperconverged-operator.v4.9.0
cnv-tests revision 1e497215380d114fd1b8d4f7cf9662e1db37ad83

Test command:
pipenv run pytest -x -s -o log_cli=false --count=10 --jira --bugzilla --tc-file=tests/global_config.py --tc-format=python --tc=nodes_cpu_architecture:INTEL --tc=server_url:50.22.150.165 --storage-class-matrix=ocs-storagecluster-ceph-rbd tests/compute/virt/migration_and_maintenance/test_node_maintenance.py -k test_node_drain_using_console_fedora

NOTE: tested 10 iterations.

Result:
_______________________________________________________________________________
10 of 10 completed, 10 Pass, 0 Fail, 0 Skip, 0 XPass, 0 XFail, 0 Error, 0 ReRun
_______________________________________________________________________________
-----------------------------------------------------------------------------------
50 deselected, 10 passed, 0 skipped, 0 failed, 0 error 0 xfail 0 xpass
-----------------------------------------------------------------------------------

============================== warnings summary ==============================
tests/compute/virt/migration_and_maintenance/test_node_maintenance.py: 51 warnings
  /home/cnv-qe-jenkins/.local/share/virtualenvs/cnv-tests-Y3fdlEPc/lib/python3.6/site-packages/ocp_resources/utils.py:126: DeprecationWarning: TimeoutSampler() exception and exception_msg are now deprecated. Please update to use exceptions_dict by Oct 12, 2021

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========== 10 passed, 50 deselected, 51 warnings in 2424.44s (0:40:24) ===========
Backport in progress here: https://github.com/kubevirt/kubevirt/pull/6333.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4104