Bug 2124406

Summary: Memory dump hp-volume pod sometimes stays in Pending
Product: Container Native Virtualization (CNV)
Component: Storage
Version: 4.12.0
Status: NEW
Severity: low
Priority: low
Target Milestone: ---
Target Release: 4.15.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Reporter: Yan Du <yadu>
Assignee: skagan
QA Contact: Yan Du <yadu>
CC: akalenyu, alitke, jpeimer, ngavrilo, skagan

Description Yan Du 2022-09-06 03:08:08 UTC
Description of problem:
We happened to find that sometimes, after a VM restart, the VM gets scheduled to another node. When we then trigger the memory dump again, the hp-volume pod stays in Pending status, and a new memory dump cannot be started because the previous one has not finished.

Version-Release number of selected component (if applicable):
CNV-v4.12.0-450


How reproducible:
Sometimes

Steps to Reproduce:
1. Create a VM
2. Do a memory dump: $ virtctl memory-dump get vm-fedora-datavolume --claim-name=memoryvolume --create-claim
3. Restart the VM - sometimes the VM gets scheduled to another node
4. Do a memory dump again: $ virtctl memory-dump get vm-fedora-datavolume


Actual results:
$ oc get pod -n default
NAME                                       READY   STATUS    RESTARTS   AGE
hp-volume-w4nz8                            0/1     Pending   0          19h
virt-launcher-vm-fedora-datavolume-qjzxj   1/1     Running   0          19h


Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  118m (x2088 over 19h)  default-scheduler  0/6 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 5 node(s) didn't match Pod's node affinity/selector, 5 node(s) had volume node affinity conflict. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  14m                    default-scheduler  0/6 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 5 node(s) didn't match Pod's node affinity/selector, 5 node(s) had volume node affinity conflict. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  14m                    default-scheduler  0/6 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 5 node(s) didn't match Pod's node affinity/selector, 5 node(s) had volume node affinity conflict. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  9m52s                  default-scheduler  0/6 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 1 node(s) were unschedulable, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 5 node(s) didn't match Pod's node affinity/selector, 5 node(s) had volume node affinity conflict. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  8m28s                  default-scheduler  0/6 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 5 node(s) didn't match Pod's node affinity/selector, 5 node(s) had volume node affinity conflict. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
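
For reference, the conflict can be confirmed manually with something like the following (illustrative commands only; the VM, claim and namespace names are the ones used in the steps above and may differ in other setups):

$ oc get vmi vm-fedora-datavolume -n default -o jsonpath='{.status.nodeName}'
$ oc get pvc memoryvolume -n default -o jsonpath='{.spec.volumeName}'
$ oc get pv <pv-name-from-previous-command> -o jsonpath='{.spec.nodeAffinity}'

If the PV backing the dump claim has node affinity pinning it to the node the VM originally ran on, the hp-volume attachment pod can never be scheduled next to the restarted VM.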


Expected results:
Maybe we could have a friendlier warning explaining why the memory dump failed in this situation?
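
For example (purely hypothetical wording, not an existing KubeVirt message), the volumeStatus entry for the dump claim could report something like:

    - name: memoryvolume
      phase: AttachmentFailed
      message: attachment pod cannot be scheduled: volume node affinity conflict

rather than leaving the user to dig through the attachment pod's scheduling events.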


Additional info:

Comment 2 Adam Litke 2022-09-07 13:18:51 UTC
Shelly, it would be nice if we could somehow detect and handle this situation automatically rather than requiring a manual workaround.  I realize this is difficult because we don't want to leak storage details into kubevirt.  Maybe some sort of timeout, and then we discard the old PVC?
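
(For reference, a rough sketch of the manual workaround in question, assuming the memory-dump remove subcommand is available in this virtctl version; names are the ones used in this report:

$ virtctl memory-dump remove vm-fedora-datavolume
$ oc delete pvc memoryvolume -n default
$ virtctl memory-dump get vm-fedora-datavolume --claim-name=memoryvolume --create-claim

i.e. dissociate the stuck dump PVC from the VM, delete it, and request a fresh dump with a new claim.)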

Comment 3 skagan 2022-09-12 11:36:19 UTC
Regarding the warning, the error seems to happen in the hotplug phase of the PVC for the memory dump. Yan, did you look at the volume status in the VMI to see if anything is shown there? Handling such a case is more relevant to hotplug.
But regardless, currently the memory dump command just triggers the memory dump process and exits. Do we want it to wait until it completes? And if it does not complete within a defined period of time, do we want to return an error and disassociate the PVC? And if we also created the PVC in the process, should we delete it as well?
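
For illustration, if the command (or a caller) did want to wait, something like the following could poll the dump phase with a deadline, assuming the VM status exposes it under status.memoryDumpRequest.phase as in current KubeVirt:

$ timeout 300 bash -c 'until [ "$(oc get vm vm-fedora-datavolume -o jsonpath={.status.memoryDumpRequest.phase})" = "Completed" ]; do sleep 5; done' || echo "memory dump did not complete in time"

On timeout the PVC could then be disassociated (and deleted, if the command created it).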

Comment 4 Yan Du 2022-09-14 09:50:26 UTC
The vmi volume status is as below:

$ oc get vmi -o yaml
apiVersion: v1
items:
- apiVersion: kubevirt.io/v1
  kind: VirtualMachineInstance
  metadata:
    annotations:
      kubevirt.io/latest-observed-api-version: v1
      kubevirt.io/storage-observed-api-version: v1alpha3
    creationTimestamp: "2022-09-14T09:38:42Z"
    finalizers:
    - kubevirt.io/virtualMachineControllerFinalize
    - foregroundDeleteVirtualMachine
    generation: 13
    labels:
      kubevirt.io/nodeName: c01-yadu412-kjc7h-worker-0-n26rk
      kubevirt.io/vm: vm-datavolume
    name: vm-fedora-datavolume
    namespace: default
    ownerReferences:
    - apiVersion: kubevirt.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: VirtualMachine
      name: vm-fedora-datavolume
      uid: a692cd87-03a5-4cbb-b414-73877f5f9528
    resourceVersion: "240123"
    uid: 2150b759-0a27-4ec2-8bb8-a4d248e6023b
  spec:
    domain:
      cpu:
        cores: 1
        model: host-model
        sockets: 1
        threads: 1
      devices:
        disks:
        - disk:
            bus: virtio
          name: datavolumevolume
        interfaces:
        - masquerade: {}
          name: default
      features:
        acpi:
          enabled: true
      firmware:
        uuid: e69d93b8-45ca-5bd6-b02e-bf134bb338de
      machine:
        type: pc-q35-rhel8.6.0
      resources:
        requests:
          memory: 1024M
    networks:
    - name: default
      pod: {}
    terminationGracePeriodSeconds: 0
    volumes:
    - dataVolume:
        name: fedora-dv
      name: datavolumevolume
    - memoryDump:
        claimName: pvc1
        hotpluggable: true
      name: pvc1
  status:
    activePods:
      9557121e-81c7-466d-8293-c7114cb1e791: c01-yadu412-kjc7h-worker-0-n26rk
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2022-09-14T09:38:54Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: null
      message: 'cannot migrate VMI: PVC fedora-dv is not shared, live migration requires
        that all PVCs must be shared (using ReadWriteMany access mode)'
      reason: DisksNotLiveMigratable
      status: "False"
      type: LiveMigratable
    - lastProbeTime: "2022-09-14T09:39:11Z"
      lastTransitionTime: null
      status: "True"
      type: AgentConnected
    guestOSInfo:
      id: fedora
      kernelRelease: 5.12.11-300.fc34.x86_64
      kernelVersion: '#1 SMP Wed Jun 16 15:47:58 UTC 2021'
      name: Fedora
      prettyName: Fedora 34 (Cloud Edition)
      version: "34"
      versionId: "34"
    interfaces:
    - infoSource: domain, guest-agent
      interfaceName: eth0
      ipAddress: 10.128.2.44
      ipAddresses:
      - 10.128.2.44
      mac: 52:54:00:82:3d:b6
      name: default
      queueCount: 1
    launcherContainerImageVersion: registry.redhat.io/container-native-virtualization/virt-launcher@sha256:35bdecc535e077fe19ec3fcdfc4e30d895acd806f330c9cb8435c1e1b0da7c00
    migrationMethod: BlockMigration
    migrationTransport: Unix
    nodeName: c01-yadu412-kjc7h-worker-0-n26rk
    phase: Running
    phaseTransitionTimestamps:
    - phase: Pending
      phaseTransitionTimestamp: "2022-09-14T09:38:42Z"
    - phase: Scheduling
      phaseTransitionTimestamp: "2022-09-14T09:38:43Z"
    - phase: Scheduled
      phaseTransitionTimestamp: "2022-09-14T09:38:54Z"
    - phase: Running
      phaseTransitionTimestamp: "2022-09-14T09:38:57Z"
    qosClass: Burstable
    runtimeUser: 107
    virtualMachineRevisionName: revision-start-vm-a692cd87-03a5-4cbb-b414-73877f5f9528-2
    volumeStatus:
    - name: datavolumevolume
      persistentVolumeClaimInfo:
        accessModes:
        - ReadWriteOnce
        capacity:
          storage: 10Gi
        filesystemOverhead: "0.055"
        requests:
          storage: 10Gi
        volumeMode: Filesystem
      target: vda
    - hotplugVolume:
        attachPodName: hp-volume-j69qt
      memoryDumpVolume:
        claimName: pvc1
      message: Created hotplug attachment pod hp-volume-j69qt, for volume pvc1
      name: pvc1
      persistentVolumeClaimInfo:
        accessModes:
        - ReadWriteOnce
        capacity:
          storage: 149Gi
        filesystemOverhead: "0.055"
        requests:
          storage: "1191182336"
        volumeMode: Filesystem
      phase: AttachedToNode
      reason: SuccessfulCreate
      target: ""
kind: List
metadata:
  resourceVersion: ""



$ oc describe vmi 
----------8<--------------------
  Volume Status:
    Name:  datavolumevolume
    Persistent Volume Claim Info:
      Access Modes:
        ReadWriteOnce
      Capacity:
        Storage:            10Gi
      Filesystem Overhead:  0.055
      Requests:
        Storage:    10Gi
      Volume Mode:  Filesystem
    Target:         vda
    Hotplug Volume:
      Attach Pod Name:  hp-volume-j69qt
    Memory Dump Volume:
      Claim Name:  pvc1
    Message:       Created hotplug attachment pod hp-volume-j69qt, for volume pvc1
    Name:          pvc1
    Persistent Volume Claim Info:
      Access Modes:
        ReadWriteOnce
      Capacity:
        Storage:            149Gi
      Filesystem Overhead:  0.055
      Requests:
        Storage:    1191182336
      Volume Mode:  Filesystem
    Phase:          AttachedToNode
    Reason:         SuccessfulCreate
    Target:         
Events:
  Type    Reason            Age                    From                       Message
  ----    ------            ----                   ----                       -------
  Normal  SuccessfulCreate  9m18s                  virtualmachine-controller  Created virtual machine pod virt-launcher-vm-fedora-datavolume-wwh6r
  Normal  Created           9m3s                   virt-handler               VirtualMachineInstance defined.
  Normal  Started           9m3s                   virt-handler               VirtualMachineInstance started.
  Normal  SuccessfulCreate  8m54s                  virtualmachine-controller  Created attachment pod hp-volume-j69qt
  Normal  SuccessfulCreate  8m49s (x5 over 8m54s)  virtualmachine-controller  Created hotplug attachment pod hp-volume-j69qt, for volume pvc1

Comment 5 skagan 2022-09-14 10:54:28 UTC
OK, I see, so the hotplug status doesn't show any issue there. In that regard we need to look into it. Regarding the memory dump behavior, waiting for @alitke's response.

Comment 6 Alex Kalenyuk 2022-09-19 12:56:27 UTC
Summarizing the grooming discussion:
This will happen regardless of memory dump whenever the main disk is not topology constrained but some hotplugged volume is,
for example, Ceph for the main disk and HPP for the hotplugged disk. It might make sense to open a separate bug for this.
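
For illustration, an HPP-style PV is typically pinned to a single node via node affinity, roughly like this (sketch only, not taken from this cluster):

  spec:
    nodeAffinity:
      required:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - c01-yadu412-kjc7h-worker-0-n26rk

so once the VM restarts on a different node, the attachment pod that needs this PV can no longer be scheduled next to it.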

Maybe we want to set the hotplug volume status to failed when we detect such a situation, as a short-term fix,
but we should still decide whether the underlying issue here is hotplug-related, or whether we should focus on a friendlier virtctl memory-dump interaction.
@alitke

Comment 7 Adam Litke 2022-11-23 18:32:41 UTC
This is definitely a generic hotplug issue but in the specific case of memory dump I think we have an opportunity to improve the user experience.  When a user wants to trigger a new memory dump and we are in this situation (a dump PVC that cannot be attached), we can simply remove the old PVC and create a new one.  This is safe because the user already told us that they want to replace the existing memory dump with a new one.  I do think we will also encounter a similar error with VM export and we need to look into how to handle it.

Comment 8 skagan 2022-11-24 12:08:39 UTC
I think in order to do that we need to at least make the hotplug process fail or show some error, because otherwise I don't think we can know that the PVC cannot be attached; I don't think putting a timeout on that is right. If we can identify such an error, I think it will then be possible to delete the current PVC and create a new one.
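
For illustration, the signal does already exist on the attachment pod itself; something like the following (using the Pending pod name from this report) surfaces the scheduling failure the controller could react to:

$ oc get events -n default --field-selector involvedObject.name=hp-volume-w4nz8,reason=FailedScheduling

so one option is to propagate that FailedScheduling reason into the hotplug volume status rather than relying on a timeout.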