Description of problem:
During live migration (triggered by node drain), the source virt-launcher pod status is sometimes Failed instead of Completed.

Version-Release number of selected component (if applicable):
virt-operator 4.8.0-24

How reproducible:
Cannot determine an exact rate, but it happens in roughly 1 out of 5-10 attempts. Did not manage to reproduce by manually live migrating the VM.

Steps to Reproduce:
1. Create a VM
2. Drain the node on which the VMI is running

Actual results:
The source virt-launcher pod status is sometimes Failed instead of Completed. The migration itself is successful and the source pod is deleted.

Expected results:
Source pod status should be Completed.

Additional info:
virt-launcher pod log attached (also attaching the log of a pod which reached Completed status, for reference)

====================
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  creationTimestamp: null
  labels:
    kubevirt.io/vm: vm-nodemaintenance-99818
  name: vm-nodemaintenance-99818-1616679201-4999232
spec:
  running: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        kubevirt.io/domain: vm-nodemaintenance-99818-1616679201-4999232
        kubevirt.io/vm: vm-nodemaintenance-99818-1616679201-4999232
    spec:
      domain:
        cpu:
          cores: 1
          model: Haswell-noTSX
        devices:
          disks:
          - disk:
              bus: virtio
            name: containerdisk
          - disk:
              bus: virtio
            name: cloudinitdisk
          interfaces:
          - masquerade: {}
            name: default
          rng: {}
        machine:
          type: ''
        resources:
          requests:
            memory: 1024Mi
      evictionStrategy: LiveMigrate
      networks:
      - name: default
        pod: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - containerDisk:
          image: quay.io/openshift-cnv/qe-cnv-tests-fedora:33
        name: containerdisk
      - cloudInitNoCloud:
          userData: |
            #cloud-config
            chpasswd: { expire: False }
            password: fedora
        name: cloudinitdisk
====================
Name:         vm-nodemaintenance-99818-1616679201-4999232
Namespace:    virt-migration-and-maintenance-test-node-maintenance
Labels:       kubevirt.io/domain=vm-nodemaintenance-99818-1616679201-4999232
              kubevirt.io/migrationTargetNodeName=ruty-4-8-24-3-k6xf5-worker-0-9x6br
              kubevirt.io/nodeName=ruty-4-8-24-3-k6xf5-worker-0-9x6br
              kubevirt.io/vm=vm-nodemaintenance-99818-1616679201-4999232
Annotations:  kubevirt.io/latest-observed-api-version: v1
              kubevirt.io/storage-observed-api-version: v1alpha3
API Version:  kubevirt.io/v1
Kind:         VirtualMachineInstance
Metadata:
  Creation Timestamp:  2021-03-25T13:33:22Z
  Finalizers:
    foregroundDeleteVirtualMachine
  Generation:  24
  Managed Fields:
    API Version:  kubevirt.io/v1alpha3
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubevirt.io/latest-observed-api-version:
          f:kubevirt.io/storage-observed-api-version:
        f:labels:
          .:
          f:kubevirt.io/domain:
          f:kubevirt.io/migrationTargetNodeName:
          f:kubevirt.io/vm:
        f:ownerReferences:
      f:spec:
        .:
        f:domain:
          .:
          f:cpu:
            .:
            f:cores:
            f:model:
          f:devices:
            .:
            f:disks:
            f:interfaces:
            f:rng:
          f:firmware:
            .:
            f:uuid:
          f:machine:
            .:
            f:type:
          f:resources:
            .:
            f:requests:
              .:
              f:memory:
        f:evictionStrategy:
        f:networks:
        f:terminationGracePeriodSeconds:
        f:volumes:
      f:status:
        .:
        f:activePods:
          .:
          f:28d14edb-ee12-45ef-a0ab-625dc1372006:
        f:conditions:
        f:guestOSInfo:
        f:launcherContainerImageVersion:
        f:migrationState:
          .:
          f:migrationUid:
          f:sourceNode:
          f:targetNode:
          f:targetPod:
        f:qosClass:
    Manager:      virt-controller
    Operation:    Update
    Time:         2021-03-25T13:34:47Z
    API Version:  kubevirt.io/v1alpha3
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:kubevirt.io/nodeName:
      f:status:
        f:interfaces:
        f:migrationMethod:
        f:migrationState:
          f:completed:
          f:endTimestamp:
          f:mode:
          f:startTimestamp:
          f:targetDirectMigrationNodePorts:
            .:
            f:39555:
            f:41675:
            f:46845:
          f:targetNodeAddress:
          f:targetNodeDomainDetected:
        f:nodeName:
        f:phase:
        f:volumeStatus:
    Manager:    virt-handler
    Operation:  Update
    Time:       2021-03-25T13:34:48Z
  Owner References:
    API Version:           kubevirt.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  VirtualMachine
    Name:                  vm-nodemaintenance-99818-1616679201-4999232
    UID:                   fbfb8c03-fe94-483e-ad25-723ee9f4b260
  Resource Version:  1305931
  UID:               088f74d8-6962-40b8-a222-6ecd68335ff9
Spec:
  Domain:
    Cpu:
      Cores:  1
      Model:  Haswell-noTSX
    Devices:
      Disks:
        Disk:
          Bus:  virtio
        Name:   containerdisk
        Disk:
          Bus:  virtio
        Name:   cloudinitdisk
      Interfaces:
        Mac Address:  02:6a:39:00:00:17
        Masquerade:
        Name:  default
      Rng:
    Features:
      Acpi:
        Enabled:  true
    Firmware:
      Uuid:  82d8bbda-5217-5edb-ad24-ccf36dc8efb4
    Machine:
      Type:  pc-q35-rhel8.3.0
    Resources:
      Requests:
        Cpu:     100m
        Memory:  1Gi
  Eviction Strategy:  LiveMigrate
  Networks:
    Name:  default
    Pod:
  Termination Grace Period Seconds:  30
  Volumes:
    Container Disk:
      Image:              quay.io/openshift-cnv/qe-cnv-tests-fedora:33
      Image Pull Policy:  IfNotPresent
    Name:                 containerdisk
    Cloud Init No Cloud:
      User Data:  #cloud-config
                  chpasswd: { expire: False }
                  password: fedora
    Name:  cloudinitdisk
Status:
  Active Pods:
    28d14edb-ee12-45ef-a0ab-625dc1372006:  ruty-4-8-24-3-k6xf5-worker-0-9x6br
  Conditions:
    Last Probe Time:       <nil>
    Last Transition Time:  <nil>
    Status:                True
    Type:                  LiveMigratable
    Last Probe Time:       2021-03-25T13:33:59Z
    Last Transition Time:  <nil>
    Status:                True
    Type:                  AgentConnected
    Last Probe Time:       2021-03-25T13:33:59Z
    Last Transition Time:  <nil>
    Status:                True
    Type:                  AgentVersionNotSupported
    Last Probe Time:       <nil>
    Last Transition Time:  2021-03-25T13:34:33Z
    Status:                True
    Type:                  Ready
  Guest OS Info:
  Interfaces:
    Interface Name:  eth0
    Ip Address:      10.128.2.69
    Ip Addresses:
      10.128.2.69
    Mac:   02:6a:39:00:00:17
    Name:  default
  Launcher Container Image Version:  registry.redhat.io/container-native-virtualization/virt-launcher@sha256:ec861cba276b8986b373d900e3297773262f4bc9f21dbc794a9c2e153ad14947
  Migration Method:                  BlockMigration
  Migration State:
    Completed:        true
    End Timestamp:    2021-03-25T13:34:47Z
    Migration UID:    00209959-b03e-4e8b-a388-1e21a56816d7
    Mode:             PreCopy
    Source Node:      ruty-4-8-24-3-k6xf5-worker-0-r27hd
    Start Timestamp:  2021-03-25T13:34:33Z
    Target Direct Migration Node Ports:
      39555:  0
      41675:  49153
      46845:  49152
    Target Node:                  ruty-4-8-24-3-k6xf5-worker-0-9x6br
    Target Node Address:          10.128.2.8
    Target Node Domain Detected:  true
    Target Pod:                   virt-launcher-vm-nodemaintenance-99818-1616679201-4999232-zkm6w
  Node Name:  ruty-4-8-24-3-k6xf5-worker-0-9x6br
  Phase:      Running
  Qos Class:  Burstable
  Volume Status:
    Name:    cloudinitdisk
    Target:  vdb
    Name:    containerdisk
    Target:  vda
Events:
  Type    Reason            Age                    From                         Message
  ----    ------            ----                   ----                         -------
  Normal  SuccessfulCreate  6m12s                  disruptionbudget-controller  Created PodDisruptionBudget kubevirt-disruption-budget-fwsq4
  Normal  SuccessfulCreate  6m12s                  virtualmachine-controller    Created virtual machine pod virt-launcher-vm-nodemaintenance-99818-1616679201-4999232-4m4pc
  Normal  Started           6m5s                   virt-handler                 VirtualMachineInstance started.
  Normal  SuccessfulCreate  5m9s                   disruptionbudget-controller  Created Migration kubevirt-evacuation-5dkd9
  Normal  Created           5m1s (x19 over 6m5s)   virt-handler                 VirtualMachineInstance defined.
  Normal  PreparingTarget   5m1s (x2 over 5m1s)    virt-handler                 VirtualMachineInstance Migration Target Prepared.
  Normal  PreparingTarget   5m1s                   virt-handler                 Migration Target is listening at 10.128.2.8, on ports: 39555,46845,41675
  Normal  Migrating         4m47s (x5 over 5m1s)   virt-handler                 VirtualMachineInstance is migrating.
  Normal  Created           2m39s (x8 over 4m47s)  virt-handler                 VirtualMachineInstance defined.
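For reference, the reproduction steps above boil down to roughly the following CLI flow (a minimal sketch; the manifest file name is a placeholder, <source-node> is whichever node the VMI lands on, and it assumes the launcher pods carry the kubevirt.io/domain label from the template, as they do here):

  # Create and start the VM from the manifest above (spec.running is false)
  $ oc apply -f vm-nodemaintenance.yaml
  $ virtctl start vm-nodemaintenance-99818-1616679201-4999232

  # Find the node hosting the VMI, then drain it to trigger the eviction migration
  $ oc get vmi vm-nodemaintenance-99818-1616679201-4999232 -o jsonpath='{.status.nodeName}'
  $ oc adm drain <source-node> --ignore-daemonsets --delete-emptydir-data --force

  # Watch the launcher pods; the source pod should end up Completed, not Failed
  $ oc get pods -l kubevirt.io/domain=vm-nodemaintenance-99818-1616679201-4999232 -w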
Created attachment 1766322: Failed launcher pod log
Created attachment 1766323: Completed launcher pod log
Treating this as high priority because it can impact test automation.
Vladik, this BZ seems somewhat surprising. Have we noticed this behavior before?
I think a possible issue is that we didn't manage to stop the cmd server properly and it's still running. Then, when the domain disappears because of a successful migration, we just fail.

{"component":"virt-launcher","level":"info","msg":"stopping cmd server","pos":"server.go:446","timestamp":"2021-03-25T13:34:48.104441Z"}
{"component":"virt-launcher","level":"error","msg":"timeout on stopping the cmd server, continuing anyway.","pos":"server.go:457","timestamp":"2021-03-25T13:34:49.104607Z"}
{"component":"virt-launcher","level":"info","msg":"Exiting...","pos":"virt-launcher.go:471","timestamp":"2021-03-25T13:34:49.104768Z"}
{"component":"virt-launcher","level":"error","msg":"dirty virt-launcher shutdown","pos":"virt-launcher.go:532","reason":"wait: no child processes","timestamp":"2021-03-25T13:34:49.108344Z"}
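To spot this exit on other runs, the tail of the source launcher log and the compute container's exit code are enough (a sketch; the pod name is the failed source pod from this run's events, and it assumes the main launcher container is named "compute", as in standard virt-launcher pods):

  $ oc logs virt-launcher-vm-nodemaintenance-99818-1616679201-4999232-4m4pc -c compute | tail -n 5
  $ oc get pod virt-launcher-vm-nodemaintenance-99818-1616679201-4999232-4m4pc \
      -o jsonpath='{.status.containerStatuses[?(@.name=="compute")].state.terminated.exitCode}'

A "dirty virt-launcher shutdown" line followed by a non-zero exit code is the signature of this race.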
@rnetser just so we accurately prioritize this, what is the impact on the VMI workload when the source pod exits with failure instead of Completed?

Are we saying that the migration completes successfully and the VMI is ultimately healthy, but the source pod exits in failure? Or does this impact the VMI workload's availability somehow as well?
(In reply to David Vossel from comment #6)
> @rnetser just so we accurately prioritize this, what is the impact on the
> VMI workload when the source pod exits with failure instead of completed?
>
> Are we saying that the migration completes successfully and the VMI is
> ultimately healthy, but the source pod exits in failure? or does this
> impact the VMI workload's availability somehow as well?

@dvossel
The source pod sometimes (could not find a deterministic scenario) exits with failure.
The migration completes successfully; a process running in the VMI prior to the migration is still running after the migration.
During migration the SSH connection is broken but can be re-established once the migration is done.
I do wonder whether the (lack of) effect on the VMI is consistent; I will run this flow a few more times and update.
(In reply to Ruth Netser from comment #8)
> (In reply to David Vossel from comment #6)
> > @rnetser just so we accurately prioritize this, what is the impact on
> > the VMI workload when the source pod exits with failure instead of
> > completed?
> >
> > Are we saying that the migration completes successfully and the VMI is
> > ultimately healthy, but the source pod exits in failure? or does this
> > impact the VMI workload's availability somehow as well?
>
> @dvossel
> The source pod sometimes (could not find a deterministic scenario) exits
> with failure.
> The migration completes successfully; a process running in the VMI prior
> to the migration is still running after the migration.
> During migration the SSH connection is broken but can be re-established
> once the migration is done.
> I do wonder whether the (lack of) effect on the VMI is consistent; I will
> run this flow a few more times and update.

Based on the VMI logs, this looks like it's simply a race condition in our virt-launcher shutdown flow that exits non-zero when it should be exiting successfully with a zero exit code... so a false negative: we're indicating there's a failure of some sort when there is none.
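To double-check that this is purely a reporting problem on a given run, you can compare the VMI's recorded migration state against the leftover source pod phase (a sketch using the status fields shown in the describe output above; the label selector assumes the launcher pods inherit kubevirt.io/domain):

  # migrationState.completed should print "true" even when the source pod
  # phase below shows Failed
  $ oc get vmi vm-nodemaintenance-99818-1616679201-4999232 \
      -o jsonpath='{.status.migrationState.completed}{"\n"}'
  $ oc get pods -l kubevirt.io/domain=vm-nodemaintenance-99818-1616679201-4999232 \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'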
Posted a PR to fix this race condition: https://github.com/kubevirt/kubevirt/pull/6058.
The PR was merged downstream (D/S) in CNV 4.9.
Verified on:
Client Version: 4.9.0-202109010444.p0.git.eab618a.assembly.stream-eab618a
Server Version: 4.9.0-fc.0
Kubernetes Version: v1.22.0-rc.0+f77e483
CNV: kubevirt-hyperconverged-operator.v4.9.0
cnv-tests revision 1e497215380d114fd1b8d4f7cf9662e1db37ad83

Test command:
pipenv run pytest -x -s -o log_cli=false --count=10 --jira --bugzilla --tc-file=tests/global_config.py --tc-format=python --tc=nodes_cpu_architecture:INTEL --tc=server_url:50.22.150.165 --storage-class-matrix=ocs-storagecluster-ceph-rbd tests/compute/virt/migration_and_maintenance/test_node_maintenance.py -k test_node_drain_using_console_fedora

NOTE: tested 10 iterations.

Result:
_______________________________________________________________________________
10 of 10 completed, 10 Pass, 0 Fail, 0 Skip, 0 XPass, 0 XFail, 0 Error, 0 ReRun
_______________________________________________________________________________
-----------------------------------------------------------------------------------
50 deselected, 10 passed, 0 skipped, 0 failed, 0 error 0 xfail 0 xpass
-----------------------------------------------------------------------------------

============================== warnings summary ==============================
tests/compute/virt/migration_and_maintenance/test_node_maintenance.py: 51 warnings
  /home/cnv-qe-jenkins/.local/share/virtualenvs/cnv-tests-Y3fdlEPc/lib/python3.6/site-packages/ocp_resources/utils.py:126: DeprecationWarning: TimeoutSampler() exception and exception_msg are now deprecated. Please update to use exceptions_dict by Oct 12, 2021

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========== 10 passed, 50 deselected, 51 warnings in 2424.44s (0:40:24) ===========
Backport in progress here: https://github.com/kubevirt/kubevirt/pull/6333.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4104