Bug 2203786

Summary: On BM, hpp-pool pods are stuck in CrashLoopBackOff
Product: Container Native Virtualization (CNV)
Component: Storage
Version: 4.13.0
Target Release: 4.13.4
Hardware: Unspecified
OS: Unspecified
Status: ASSIGNED
Severity: high
Priority: high
Type: Bug
Reporter: awax
Assignee: Alexander Wels <awels>
QA Contact: Natalie Gavrielov <ngavrilo>
CC: akalenyu, alitke, jpeimer, yadu
Flags: alitke: needinfo? (awax)

Description awax 2023-05-15 08:56:20 UTC
Description of problem:
On BMs (this instance was seen on bm03-cnvqe2-rdu2), the hpp-pool pod gets stuck in a CrashLoopBackOff state.
It appears to be Ceph-related: HPP is backed by OCS.
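To narrow down whether the Ceph backend or the mounter itself is failing, checks along these lines can help (commands are a suggestion; the PVC name is taken from the pod spec below):

$ oc get cephcluster -n openshift-storage
$ oc describe pvc hpp-pool-29ab9406 -n openshift-cnv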

Version-Release number of selected component (if applicable):
$ oc get csv -A | grep kubevirt
openshift-cnv                                           kubevirt-hyperconverged-operator.v4.13.0          OpenShift Virtualization                         4.13.0                kubevirt-hyperconverged-operator.v4.12.3          Succeeded
$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-rc.5   True        False         11d     Cluster version is 4.13.0-rc.5



How reproducible:
I'm not sure what triggers this issue. After running the network team's test suite, which creates network components and VMs, the hpp-pool pods on some BMs end up in this state.


Steps to Reproduce:
Unknown; see "How reproducible" above.


Actual results:
The hpp-pool pods get stuck in a CrashLoopBackOff state:
$ oc get pods -n openshift-cnv | grep hpp
openshift-cnv                                           hpp-pool-29ab9406-755647446d-44jfk                                              0/1     Terminating        10                43h
openshift-cnv                                           hpp-pool-29ab9406-755647446d-d6rn7                                              0/1     CrashLoopBackOff   497 (4m5s ago)    42h
openshift-cnv                                           hpp-pool-4356e54b-7df67db896-8vq5t                                              0/1     Terminating        3                 43h
openshift-cnv                                           hpp-pool-4356e54b-7df67db896-ntqpr                                              0/1     CrashLoopBackOff   502 (3m22s ago)   42h
openshift-cnv                                           hpp-pool-7dfd761c-cf499b659-9mdk7                                               1/1     Running            0                 42h
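The restart reason can be pulled from the previous container attempt and the pod events (commands are a suggestion, using one of the crash-looping pods listed above):

$ oc logs -n openshift-cnv hpp-pool-29ab9406-755647446d-d6rn7 -c mounter --previous
$ oc describe pod -n openshift-cnv hpp-pool-29ab9406-755647446d-d6rn7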




$ oc get pods hpp-pool-29ab9406-755647446d-d6rn7 -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.128.2.5/23"],"mac_address":"0a:58:0a:80:02:05","gateway_ips":["10.128.2.1"],"ip_address":"10.128.2.5/23","gateway_ip":"10.128.2.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "ovn-kubernetes",
          "interface": "eth0",
          "ips": [
              "10.128.2.5"
          ],
          "mac": "0a:58:0a:80:02:05",
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: hostpath-provisioner-csi
  creationTimestamp: "2023-05-13T14:24:31Z"
  generateName: hpp-pool-29ab9406-755647446d-
  labels:
    hpp-pool: hpp-csi-pvc-block-hpp
    pod-template-hash: 755647446d
  name: hpp-pool-29ab9406-755647446d-d6rn7
  namespace: openshift-cnv
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: hpp-pool-29ab9406-755647446d
    uid: 6d6089af-1e72-4602-9f67-c212bcb1dac8
  resourceVersion: "22166040"
  uid: a5162c1e-babc-455e-a071-262b81d48c8a
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - cnv-qe-infra-19.cnvqe2.lab.eng.rdu2.redhat.com
  containers:
  - command:
    - /usr/bin/mounter
    - --storagePoolPath
    - /dev/data
    - --mountPath
    - /var/hpp-csi-pvc-block/csi
    - --hostPath
    - /host
    image: registry.redhat.io/container-native-virtualization/hostpath-provisioner-operator-rhel9@sha256:045ad111f8d3fe28b8cf77df49a264922c9fa4cc46759ed98ef044077225a23e
    imagePullPolicy: IfNotPresent
    name: mounter
    resources:
      requests:
        cpu: 10m
        memory: 100Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      privileged: true
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeDevices:
    - devicePath: /dev/data
      name: data
    volumeMounts:
    - mountPath: /host
      mountPropagation: Bidirectional
      name: host-root
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-ql72g
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: hostpath-provisioner-admin-csi-dockercfg-xn7tq
  nodeName: cnv-qe-infra-19.cnvqe2.lab.eng.rdu2.redhat.com
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: hostpath-provisioner-admin-csi
  serviceAccountName: hostpath-provisioner-admin-csi
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: hpp-pool-29ab9406
  - hostPath:
      path: /
      type: Directory
    name: host-root
  - name: kube-api-access-ql72g
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-05-13T14:45:11Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-05-13T14:45:11Z"
    message: 'containers with unready status: [mounter]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-05-13T14:45:11Z"
    message: 'containers with unready status: [mounter]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-05-13T14:45:09Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://5c71c577ce6c36921126314719346663f5cf9c072264d408d362bf45857219f9
    image: registry.redhat.io/container-native-virtualization/hostpath-provisioner-operator-rhel9@sha256:045ad111f8d3fe28b8cf77df49a264922c9fa4cc46759ed98ef044077225a23e
    imageID: registry.redhat.io/container-native-virtualization/hostpath-provisioner-operator-rhel9@sha256:045ad111f8d3fe28b8cf77df49a264922c9fa4cc46759ed98ef044077225a23e
    lastState:
      terminated:
        containerID: cri-o://5c71c577ce6c36921126314719346663f5cf9c072264d408d362bf45857219f9
        exitCode: 2
        finishedAt: "2023-05-15T08:29:59Z"
        reason: Error
        startedAt: "2023-05-15T08:29:59Z"
    name: mounter
    ready: false
    restartCount: 494
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=mounter pod=hpp-pool-29ab9406-755647446d-d6rn7_openshift-cnv(a5162c1e-babc-455e-a071-262b81d48c8a)
        reason: CrashLoopBackOff
  hostIP: 10.1.156.19
  phase: Running
  podIP: 10.128.2.5
  podIPs:
  - ip: 10.128.2.5
  qosClass: Burstable
  startTime: "2023-05-13T14:45:11Z"
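
The pod spec above shows the mounter consuming the hpp-pool PVC as a raw block device at /dev/data and, given --mountPath /var/hpp-csi-pvc-block/csi with the node root mounted at /host, presumably mounting it at that path on the node. The device and mount state can therefore also be checked directly on the affected node (a sketch, assuming a debug shell on the node is available):

$ oc debug node/cnv-qe-infra-19.cnvqe2.lab.eng.rdu2.redhat.com
# chroot /host
# lsblk
# findmnt /var/hpp-csi-pvc-block/csi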


Expected results:
The hpp-pool pods reach and remain in a Running state.

Additional info:
Workaround (W/A): force-delete the PVC and the hpp-pool pods.
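In practice this amounts to something like the following (a sketch; names are taken from the listings above, and it assumes the hostpath-provisioner-operator recreates the pool PVC and deployment on its next reconcile):

$ oc delete pod -n openshift-cnv hpp-pool-29ab9406-755647446d-44jfk hpp-pool-29ab9406-755647446d-d6rn7 --force --grace-period=0
$ oc delete pvc -n openshift-cnv hpp-pool-29ab9406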


Additional info from the cluster:
$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS    RESTARTS       AGE
csi-addons-controller-manager-6976d48f69-fmpct                    2/2     Running   9 (42h ago)    42h
csi-cephfsplugin-7gqcc                                            2/2     Running   6              11d
csi-cephfsplugin-pgg6z                                            2/2     Running   4              11d
csi-cephfsplugin-provisioner-cc76c4b9-vmpk6                       5/5     Running   0              42h
csi-cephfsplugin-provisioner-cc76c4b9-xp9rt                       5/5     Running   0              42h
csi-cephfsplugin-q4r8n                                            2/2     Running   4              11d
csi-rbdplugin-j8465                                               3/3     Running   9              11d
csi-rbdplugin-jl4jf                                               3/3     Running   6              11d
csi-rbdplugin-provisioner-8558756f4f-fvtb2                        6/6     Running   0              42h
csi-rbdplugin-provisioner-8558756f4f-kxgpp                        6/6     Running   0              42h
csi-rbdplugin-wgjml                                               3/3     Running   6              11d
noobaa-operator-645c48c4c5-6gx4w                                  1/1     Running   0              42h
ocs-metrics-exporter-774f4b58cc-5ngc5                             1/1     Running   0              42h
ocs-operator-5b5d98d58d-zl7zq                                     1/1     Running   11 (41h ago)   42h
odf-console-78bb5b66-4mnfb                                        1/1     Running   0              42h
odf-operator-controller-manager-7db8d4fd4c-ltzkd                  2/2     Running   0              42h
rook-ceph-crashcollector-03d7e1289c5164e19d0d22d6856ffdae-9b4nt   1/1     Running   0              42h
rook-ceph-crashcollector-374253a427dc62aef82d81f5fc14643e-44bqw   1/1     Running   0              42h
rook-ceph-crashcollector-c903e190df41042ede88f92c4aa10277-n5jbj   1/1     Running   0              42h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-666b46d6k42f8   2/2     Running   0              42h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-84bb79d6hz5dp   2/2     Running   0              42h
rook-ceph-mgr-a-7fd8968d84-p2sx4                                  2/2     Running   0              42h
rook-ceph-mon-d-54b48b9549-rf69w                                  2/2     Running   0              42h
rook-ceph-mon-e-cc8d486-94tff                                     2/2     Running   0              42h
rook-ceph-mon-g-66d7d99bd7-44gjd                                  2/2     Running   0              42h
rook-ceph-operator-5b595585d7-kpnsd                               1/1     Running   8 (42h ago)    42h
rook-ceph-osd-0-7987b8c66c-89rws                                  2/2     Running   0              42h
rook-ceph-osd-1-7956cc5998-6ghk2                                  2/2     Running   0              42h
rook-ceph-osd-2-6f6cfb658f-kdcmp                                  2/2     Running   0              42h

$ oc get pods -A | grep hostpath
openshift-cnv                                           hostpath-provisioner-csi-lzvq6                                                  4/4     Running            4                 5d1h
openshift-cnv                                           hostpath-provisioner-csi-s69jh                                                  4/4     Running            8                 5d1h
openshift-cnv                                           hostpath-provisioner-csi-td8hj                                                  4/4     Running            4                 5d1h
openshift-cnv                                           hostpath-provisioner-operator-77f6f799d5-5dtlz                                  1/1     Running            1 (42h ago)       42h

$ oc get pods -A | grep hpp
openshift-cnv                                           hpp-pool-29ab9406-755647446d-44jfk                                              0/1     Terminating        10                43h
openshift-cnv                                           hpp-pool-29ab9406-755647446d-d6rn7                                              0/1     CrashLoopBackOff   497 (4m5s ago)    42h
openshift-cnv                                           hpp-pool-4356e54b-7df67db896-8vq5t                                              0/1     Terminating        3                 43h
openshift-cnv                                           hpp-pool-4356e54b-7df67db896-ntqpr                                              0/1     CrashLoopBackOff   502 (3m22s ago)   42h
openshift-cnv                                           hpp-pool-7dfd761c-cf499b659-9mdk7                                               1/1     Running            0                 42h

Comment 2 Adam Litke 2023-08-09 17:54:58 UTC
Is this still reproducing?  I wonder if it was just an intermittent environmental issue.