Description of problem:
On BMs (this one was seen on bm03-cnvqe2-rdu2), the hpp-pool pod gets stuck in a CrashLoopBackOff state. It seems to be Ceph related; HPP is backed by OCS.

Version-Release number of selected component (if applicable):
$ oc get csv -A | grep kubevirt
openshift-cnv   kubevirt-hyperconverged-operator.v4.13.0   OpenShift Virtualization   4.13.0   kubevirt-hyperconverged-operator.v4.12.3   Succeeded

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-rc.5   True        False         11d     Cluster version is 4.13.0-rc.5

How reproducible:
I'm not sure what triggers this issue. After running the network team's test suite, which creates network components and VMs, the hpp-pool pods on some BMs end up in this state.

Steps to Reproduce:
1.
2.
3.

Actual results:
The hpp-pool pods get stuck in a CrashLoopBackOff state.

$ oc get pods -n openshift-cnv | grep hpp
openshift-cnv   hpp-pool-29ab9406-755647446d-44jfk   0/1   Terminating        10                43h
openshift-cnv   hpp-pool-29ab9406-755647446d-d6rn7   0/1   CrashLoopBackOff   497 (4m5s ago)    42h
openshift-cnv   hpp-pool-4356e54b-7df67db896-8vq5t   0/1   Terminating        3                 43h
openshift-cnv   hpp-pool-4356e54b-7df67db896-ntqpr   0/1   CrashLoopBackOff   502 (3m22s ago)   42h
openshift-cnv   hpp-pool-7dfd761c-cf499b659-9mdk7    1/1   Running            0                 42h

$ oc get pods hpp-pool-29ab9406-755647446d-d6rn7 -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.128.2.5/23"],"mac_address":"0a:58:0a:80:02:05","gateway_ips":["10.128.2.1"],"ip_address":"10.128.2.5/23","gateway_ip":"10.128.2.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "ovn-kubernetes",
          "interface": "eth0",
          "ips": [
              "10.128.2.5"
          ],
          "mac": "0a:58:0a:80:02:05",
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: hostpath-provisioner-csi
  creationTimestamp: "2023-05-13T14:24:31Z"
  generateName: hpp-pool-29ab9406-755647446d-
  labels:
    hpp-pool: hpp-csi-pvc-block-hpp
    pod-template-hash: 755647446d
  name: hpp-pool-29ab9406-755647446d-d6rn7
  namespace: openshift-cnv
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: hpp-pool-29ab9406-755647446d
    uid: 6d6089af-1e72-4602-9f67-c212bcb1dac8
  resourceVersion: "22166040"
  uid: a5162c1e-babc-455e-a071-262b81d48c8a
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - cnv-qe-infra-19.cnvqe2.lab.eng.rdu2.redhat.com
  containers:
  - command:
    - /usr/bin/mounter
    - --storagePoolPath
    - /dev/data
    - --mountPath
    - /var/hpp-csi-pvc-block/csi
    - --hostPath
    - /host
    image: registry.redhat.io/container-native-virtualization/hostpath-provisioner-operator-rhel9@sha256:045ad111f8d3fe28b8cf77df49a264922c9fa4cc46759ed98ef044077225a23e
    imagePullPolicy: IfNotPresent
    name: mounter
    resources:
      requests:
        cpu: 10m
        memory: 100Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      privileged: true
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeDevices:
    - devicePath: /dev/data
      name: data
    volumeMounts:
    - mountPath: /host
      mountPropagation: Bidirectional
      name: host-root
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-ql72g
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: hostpath-provisioner-admin-csi-dockercfg-xn7tq
  nodeName: cnv-qe-infra-19.cnvqe2.lab.eng.rdu2.redhat.com
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: hostpath-provisioner-admin-csi
  serviceAccountName: hostpath-provisioner-admin-csi
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: hpp-pool-29ab9406
  - hostPath:
      path: /
      type: Directory
    name: host-root
  - name: kube-api-access-ql72g
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-05-13T14:45:11Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-05-13T14:45:11Z"
    message: 'containers with unready status: [mounter]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-05-13T14:45:11Z"
    message: 'containers with unready status: [mounter]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-05-13T14:45:09Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://5c71c577ce6c36921126314719346663f5cf9c072264d408d362bf45857219f9
    image: registry.redhat.io/container-native-virtualization/hostpath-provisioner-operator-rhel9@sha256:045ad111f8d3fe28b8cf77df49a264922c9fa4cc46759ed98ef044077225a23e
    imageID: registry.redhat.io/container-native-virtualization/hostpath-provisioner-operator-rhel9@sha256:045ad111f8d3fe28b8cf77df49a264922c9fa4cc46759ed98ef044077225a23e
    lastState:
      terminated:
        containerID: cri-o://5c71c577ce6c36921126314719346663f5cf9c072264d408d362bf45857219f9
        exitCode: 2
        finishedAt: "2023-05-15T08:29:59Z"
        reason: Error
        startedAt: "2023-05-15T08:29:59Z"
    name: mounter
    ready: false
    restartCount: 494
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=mounter pod=hpp-pool-29ab9406-755647446d-d6rn7_openshift-cnv(a5162c1e-babc-455e-a071-262b81d48c8a)
        reason: CrashLoopBackOff
  hostIP: 10.1.156.19
  phase: Running
  podIP: 10.128.2.5
  podIPs:
  - ip: 10.128.2.5
  qosClass: Burstable
  startTime: "2023-05-13T14:45:11Z"

Expected results:
The hpp-pool pods stay Running (1/1).

Additional info:
W/A: force delete the PVC + hpp-pool pods.
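
A sketch of the force-delete W/A, using the stuck pod and backing PVC names from the listing above (substitute the actual names on the affected BM):

# force delete the stuck Terminating / CrashLoopBackOff pods
$ oc delete pod hpp-pool-29ab9406-755647446d-44jfk hpp-pool-29ab9406-755647446d-d6rn7 -n openshift-cnv --grace-period=0 --force
# delete the pool's backing PVC so it can be re-provisioned
$ oc delete pvc hpp-pool-29ab9406 -n openshift-cnv

I'm not sure deleting the PVC is strictly required in every case; force-deleting the stuck pods alone may be enough for the hostpath-provisioner-operator to reconcile the pool.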
Additional info from the cluster:

$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS    RESTARTS      AGE
csi-addons-controller-manager-6976d48f69-fmpct                    2/2     Running   9 (42h ago)   42h
csi-cephfsplugin-7gqcc                                            2/2     Running   6             11d
csi-cephfsplugin-pgg6z                                            2/2     Running   4             11d
csi-cephfsplugin-provisioner-cc76c4b9-vmpk6                       5/5     Running   0             42h
csi-cephfsplugin-provisioner-cc76c4b9-xp9rt                       5/5     Running   0             42h
csi-cephfsplugin-q4r8n                                            2/2     Running   4             11d
csi-rbdplugin-j8465                                               3/3     Running   9             11d
csi-rbdplugin-jl4jf                                               3/3     Running   6             11d
csi-rbdplugin-provisioner-8558756f4f-fvtb2                        6/6     Running   0             42h
csi-rbdplugin-provisioner-8558756f4f-kxgpp                        6/6     Running   0             42h
csi-rbdplugin-wgjml                                               3/3     Running   6             11d
noobaa-operator-645c48c4c5-6gx4w                                  1/1     Running   0             42h
ocs-metrics-exporter-774f4b58cc-5ngc5                             1/1     Running   0             42h
ocs-operator-5b5d98d58d-zl7zq                                     1/1     Running   11 (41h ago)  42h
odf-console-78bb5b66-4mnfb                                        1/1     Running   0             42h
odf-operator-controller-manager-7db8d4fd4c-ltzkd                  2/2     Running   0             42h
rook-ceph-crashcollector-03d7e1289c5164e19d0d22d6856ffdae-9b4nt   1/1     Running   0             42h
rook-ceph-crashcollector-374253a427dc62aef82d81f5fc14643e-44bqw   1/1     Running   0             42h
rook-ceph-crashcollector-c903e190df41042ede88f92c4aa10277-n5jbj   1/1     Running   0             42h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-666b46d6k42f8   2/2     Running   0             42h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-84bb79d6hz5dp   2/2     Running   0             42h
rook-ceph-mgr-a-7fd8968d84-p2sx4                                  2/2     Running   0             42h
rook-ceph-mon-d-54b48b9549-rf69w                                  2/2     Running   0             42h
rook-ceph-mon-e-cc8d486-94tff                                     2/2     Running   0             42h
rook-ceph-mon-g-66d7d99bd7-44gjd                                  2/2     Running   0             42h
rook-ceph-operator-5b595585d7-kpnsd                               1/1     Running   8 (42h ago)   42h
rook-ceph-osd-0-7987b8c66c-89rws                                  2/2     Running   0             42h
rook-ceph-osd-1-7956cc5998-6ghk2                                  2/2     Running   0             42h
rook-ceph-osd-2-6f6cfb658f-kdcmp                                  2/2     Running   0             42h

$ oc get pods -A | grep hostpath
openshift-cnv   hostpath-provisioner-csi-lzvq6                   4/4   Running   4             5d1h
openshift-cnv   hostpath-provisioner-csi-s69jh                   4/4   Running   8             5d1h
openshift-cnv   hostpath-provisioner-csi-td8hj                   4/4   Running   4             5d1h
openshift-cnv   hostpath-provisioner-operator-77f6f799d5-5dtlz   1/1   Running   1 (42h ago)   42h

$ oc get pods -A | grep hpp
openshift-cnv   hpp-pool-29ab9406-755647446d-44jfk   0/1   Terminating        10                43h
openshift-cnv   hpp-pool-29ab9406-755647446d-d6rn7   0/1   CrashLoopBackOff   497 (4m5s ago)    42h
openshift-cnv   hpp-pool-4356e54b-7df67db896-8vq5t   0/1   Terminating        3                 43h
openshift-cnv   hpp-pool-4356e54b-7df67db896-ntqpr   0/1   CrashLoopBackOff   502 (3m22s ago)   42h
openshift-cnv   hpp-pool-7dfd761c-cf499b659-9mdk7    1/1   Running            0                 42h
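
In case it helps when this reproduces again, a few data points that seem worth capturing before applying the W/A (names below are from this cluster; adjust as needed):

# Ceph cluster health as reported by ODF
$ oc get cephcluster -n openshift-storage
# logs of the crashing mounter container from its previous run (it exits with code 2)
$ oc logs hpp-pool-29ab9406-755647446d-d6rn7 -n openshift-cnv -c mounter --previous
# state/events of the pool's backing PVC on OCS
$ oc describe pvc hpp-pool-29ab9406 -n openshift-cnv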
Is this still reproducing? I wonder if it was just an intermittent environmental issue.