Bug 2203786 - On BM, hpp-pool pods are stuck in CrashLoopBackOff
Summary: On BM, hpp-pool pods are stuck in CrashLoopBackOff
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.13.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.13.5
Assignee: Alexander Wels
QA Contact: Natalie Gavrielov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-15 08:56 UTC by awax
Modified: 2024-01-18 04:25 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-19 10:32:14 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker CNV-28805 0 None None None 2023-05-15 08:59:28 UTC

Description awax 2023-05-15 08:56:20 UTC
Description of problem:
On bare-metal clusters (this one seen on bm03-cnvqe2-rdu2), the hpp-pool pod gets stuck in a CrashLoopBackOff state.
It seems to be Ceph related; HPP is backed by OCS.

Version-Release number of selected component (if applicable):
$ oc get csv -A | grep kubevirt
openshift-cnv                                           kubevirt-hyperconverged-operator.v4.13.0          OpenShift Virtualization                         4.13.0                kubevirt-hyperconverged-operator.v4.12.3          Succeeded
$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-rc.5   True        False         11d     Cluster version is 4.13.0-rc.5



How reproducible:
I'm not sure what triggers this issue. After running the network team's test suite, which creates network components and VMs, the hpp-pool pods on some bare-metal nodes get into this state.


Steps to Reproduce:
1.
2.
3.


Actual results:
The hpp-pool pods get stuck in a CrashLoopBackOff state.
$ oc get pods -n openshift-cnv | grep hpp
openshift-cnv                                           hpp-pool-29ab9406-755647446d-44jfk                                              0/1     Terminating        10                43h
openshift-cnv                                           hpp-pool-29ab9406-755647446d-d6rn7                                              0/1     CrashLoopBackOff   497 (4m5s ago)    42h
openshift-cnv                                           hpp-pool-4356e54b-7df67db896-8vq5t                                              0/1     Terminating        3                 43h
openshift-cnv                                           hpp-pool-4356e54b-7df67db896-ntqpr                                              0/1     CrashLoopBackOff   502 (3m22s ago)   42h
openshift-cnv                                           hpp-pool-7dfd761c-cf499b659-9mdk7                                               1/1     Running            0                 42h




$ oc get pods hpp-pool-29ab9406-755647446d-d6rn7 -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.128.2.5/23"],"mac_address":"0a:58:0a:80:02:05","gateway_ips":["10.128.2.1"],"ip_address":"10.128.2.5/23","gateway_ip":"10.128.2.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "ovn-kubernetes",
          "interface": "eth0",
          "ips": [
              "10.128.2.5"
          ],
          "mac": "0a:58:0a:80:02:05",
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: hostpath-provisioner-csi
  creationTimestamp: "2023-05-13T14:24:31Z"
  generateName: hpp-pool-29ab9406-755647446d-
  labels:
    hpp-pool: hpp-csi-pvc-block-hpp
    pod-template-hash: 755647446d
  name: hpp-pool-29ab9406-755647446d-d6rn7
  namespace: openshift-cnv
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: hpp-pool-29ab9406-755647446d
    uid: 6d6089af-1e72-4602-9f67-c212bcb1dac8
  resourceVersion: "22166040"
  uid: a5162c1e-babc-455e-a071-262b81d48c8a
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - cnv-qe-infra-19.cnvqe2.lab.eng.rdu2.redhat.com
  containers:
  - command:
    - /usr/bin/mounter
    - --storagePoolPath
    - /dev/data
    - --mountPath
    - /var/hpp-csi-pvc-block/csi
    - --hostPath
    - /host
    image: registry.redhat.io/container-native-virtualization/hostpath-provisioner-operator-rhel9@sha256:045ad111f8d3fe28b8cf77df49a264922c9fa4cc46759ed98ef044077225a23e
    imagePullPolicy: IfNotPresent
    name: mounter
    resources:
      requests:
        cpu: 10m
        memory: 100Mi
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      privileged: true
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeDevices:
    - devicePath: /dev/data
      name: data
    volumeMounts:
    - mountPath: /host
      mountPropagation: Bidirectional
      name: host-root
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-ql72g
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: hostpath-provisioner-admin-csi-dockercfg-xn7tq
  nodeName: cnv-qe-infra-19.cnvqe2.lab.eng.rdu2.redhat.com
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: hostpath-provisioner-admin-csi
  serviceAccountName: hostpath-provisioner-admin-csi
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: hpp-pool-29ab9406
  - hostPath:
      path: /
      type: Directory
    name: host-root
  - name: kube-api-access-ql72g
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-05-13T14:45:11Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-05-13T14:45:11Z"
    message: 'containers with unready status: [mounter]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-05-13T14:45:11Z"
    message: 'containers with unready status: [mounter]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-05-13T14:45:09Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://5c71c577ce6c36921126314719346663f5cf9c072264d408d362bf45857219f9
    image: registry.redhat.io/container-native-virtualization/hostpath-provisioner-operator-rhel9@sha256:045ad111f8d3fe28b8cf77df49a264922c9fa4cc46759ed98ef044077225a23e
    imageID: registry.redhat.io/container-native-virtualization/hostpath-provisioner-operator-rhel9@sha256:045ad111f8d3fe28b8cf77df49a264922c9fa4cc46759ed98ef044077225a23e
    lastState:
      terminated:
        containerID: cri-o://5c71c577ce6c36921126314719346663f5cf9c072264d408d362bf45857219f9
        exitCode: 2
        finishedAt: "2023-05-15T08:29:59Z"
        reason: Error
        startedAt: "2023-05-15T08:29:59Z"
    name: mounter
    ready: false
    restartCount: 494
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=mounter pod=hpp-pool-29ab9406-755647446d-d6rn7_openshift-cnv(a5162c1e-babc-455e-a071-262b81d48c8a)
        reason: CrashLoopBackOff
  hostIP: 10.1.156.19
  phase: Running
  podIP: 10.128.2.5
  podIPs:
  - ip: 10.128.2.5
  qosClass: Burstable
  startTime: "2023-05-13T14:45:11Z"


Expected results:


Additional info:
Workaround: force delete the PVC and the hpp-pool pods.


Additional info from the cluster:
$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS    RESTARTS       AGE
csi-addons-controller-manager-6976d48f69-fmpct                    2/2     Running   9 (42h ago)    42h
csi-cephfsplugin-7gqcc                                            2/2     Running   6              11d
csi-cephfsplugin-pgg6z                                            2/2     Running   4              11d
csi-cephfsplugin-provisioner-cc76c4b9-vmpk6                       5/5     Running   0              42h
csi-cephfsplugin-provisioner-cc76c4b9-xp9rt                       5/5     Running   0              42h
csi-cephfsplugin-q4r8n                                            2/2     Running   4              11d
csi-rbdplugin-j8465                                               3/3     Running   9              11d
csi-rbdplugin-jl4jf                                               3/3     Running   6              11d
csi-rbdplugin-provisioner-8558756f4f-fvtb2                        6/6     Running   0              42h
csi-rbdplugin-provisioner-8558756f4f-kxgpp                        6/6     Running   0              42h
csi-rbdplugin-wgjml                                               3/3     Running   6              11d
noobaa-operator-645c48c4c5-6gx4w                                  1/1     Running   0              42h
ocs-metrics-exporter-774f4b58cc-5ngc5                             1/1     Running   0              42h
ocs-operator-5b5d98d58d-zl7zq                                     1/1     Running   11 (41h ago)   42h
odf-console-78bb5b66-4mnfb                                        1/1     Running   0              42h
odf-operator-controller-manager-7db8d4fd4c-ltzkd                  2/2     Running   0              42h
rook-ceph-crashcollector-03d7e1289c5164e19d0d22d6856ffdae-9b4nt   1/1     Running   0              42h
rook-ceph-crashcollector-374253a427dc62aef82d81f5fc14643e-44bqw   1/1     Running   0              42h
rook-ceph-crashcollector-c903e190df41042ede88f92c4aa10277-n5jbj   1/1     Running   0              42h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-666b46d6k42f8   2/2     Running   0              42h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-84bb79d6hz5dp   2/2     Running   0              42h
rook-ceph-mgr-a-7fd8968d84-p2sx4                                  2/2     Running   0              42h
rook-ceph-mon-d-54b48b9549-rf69w                                  2/2     Running   0              42h
rook-ceph-mon-e-cc8d486-94tff                                     2/2     Running   0              42h
rook-ceph-mon-g-66d7d99bd7-44gjd                                  2/2     Running   0              42h
rook-ceph-operator-5b595585d7-kpnsd                               1/1     Running   8 (42h ago)    42h
rook-ceph-osd-0-7987b8c66c-89rws                                  2/2     Running   0              42h
rook-ceph-osd-1-7956cc5998-6ghk2                                  2/2     Running   0              42h
rook-ceph-osd-2-6f6cfb658f-kdcmp                                  2/2     Running   0              42h

$ oc get pods -A | grep hostpath
openshift-cnv                                           hostpath-provisioner-csi-lzvq6                                                  4/4     Running            4                 5d1h
openshift-cnv                                           hostpath-provisioner-csi-s69jh                                                  4/4     Running            8                 5d1h
openshift-cnv                                           hostpath-provisioner-csi-td8hj                                                  4/4     Running            4                 5d1h
openshift-cnv                                           hostpath-provisioner-operator-77f6f799d5-5dtlz                                  1/1     Running            1 (42h ago)       42h

$ oc get pods -A | grep hpp
openshift-cnv                                           hpp-pool-29ab9406-755647446d-44jfk                                              0/1     Terminating        10                43h
openshift-cnv                                           hpp-pool-29ab9406-755647446d-d6rn7                                              0/1     CrashLoopBackOff   497 (4m5s ago)    42h
openshift-cnv                                           hpp-pool-4356e54b-7df67db896-8vq5t                                              0/1     Terminating        3                 43h
openshift-cnv                                           hpp-pool-4356e54b-7df67db896-ntqpr                                              0/1     CrashLoopBackOff   502 (3m22s ago)   42h
openshift-cnv                                           hpp-pool-7dfd761c-cf499b659-9mdk7                                               1/1     Running            0                 42h

Comment 2 Adam Litke 2023-08-09 17:54:58 UTC
Is this still reproducing?  I wonder if it was just an intermittent environmental issue.

Comment 3 dalia 2023-09-19 10:33:04 UTC
Can't be reproduced.

Comment 4 Joel golden 2023-09-19 23:12:55 UTC
I can reproduce this error with NFS backed storage.

[root@api-int ~]# oc debug hpp-pool-66a3ae7d-7b586fb698-znstp 
Starting pod/hpp-pool-66a3ae7d-7b586fb698-znstp-debug, command was: /usr/bin/mounter --storagePoolPath /source --mountPath /var/hpvolumes/csi --hostPath /host
Pod IP: 10.128.1.19
If you don't see a command prompt, try pressing enter.
sh-5.1# /usr/bin/mounter --storagePoolPath /source --mountPath /var/hpvolumes/csi --hostPath /host
{"level":"info","ts":1695162977.3244886,"logger":"mounter","msg":"Go Version: go1.19.10"}
{"level":"info","ts":1695162977.3245575,"logger":"mounter","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1695162977.370991,"logger":"mounter","msg":"Found mount info","source path on host":"hostname.domain.net:/mnt/ovirt/openshift/nfs/vols/pvc-68b78950-9bbc-4d55-96c7-b27c5c66bbfb"}
{"level":"info","ts":1695162977.371048,"logger":"mounter","msg":"Target path","path":"/var/hpvolumes/csi"}
{"level":"info","ts":1695162977.3710966,"logger":"mounter","msg":"host path","path":"/host"}
panic: stat hostname.domain.net:/mnt/ovirt/openshift/nfs/vols/pvc-68b78950-9bbc-4d55-96c7-b27c5c66bbfb: no such file or directory

###
sh-5.1# stat /source
  File: /source
  Size: 3         	Blocks: 1          IO Block: 131072 directory
Device: 400046h/4194374d	Inode: 34          Links: 2
Access: (0777/drwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-09-19 19:04:47.270279314 +0000
Modify: 2023-09-19 21:49:05.032855479 +0000
Change: 2023-09-19 21:49:05.032855479 +0000
 Birth: -

###
sh-5.1# mountpoint /source/
/source/ is a mountpoint

###
sh-5.1# echo "write-test" > /source/test 
sh-5.1# cat /source/test
write-test

###
[root@api-int ~]# oc describe pod hpp-pool-66a3ae7d-7b586fb698-znstp 
Name:             hpp-pool-66a3ae7d-7b586fb698-znstp
Namespace:        openshift-cnv
Priority:         0
Service Account:  hostpath-provisioner-admin-csi
Node:             api-int.os-prd.domain.net.0.168.192.in-addr.arpa/192.168.0.26
Start Time:       Tue, 19 Sep 2023 19:04:55 +0000
Labels:           hpp-pool=local-hpp
                  pod-template-hash=7b586fb698
Annotations:      k8s.ovn.org/pod-networks:
                    {"default":{"ip_addresses":["10.128.0.160/23"],"mac_address":"0a:58:0a:80:00:a0","gateway_ips":["10.128.0.1"],"ip_address":"10.128.0.160/2...
                  k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "ovn-kubernetes",
                        "interface": "eth0",
                        "ips": [
                            "10.128.0.160"
                        ],
                        "mac": "0a:58:0a:80:00:a0",
                        "default": true,
                        "dns": {}
                    }]
                  openshift.io/scc: hostpath-provisioner-csi
Status:           Running
IP:               10.128.0.160
IPs:
  IP:           10.128.0.160
Controlled By:  ReplicaSet/hpp-pool-66a3ae7d-7b586fb698
Containers:
  mounter:
    Container ID:  cri-o://62a93cb7d3465ec9322b40fa6cd028e12f4a36978a3af686c35e99c8d24381cc
    Image:         registry.redhat.io/container-native-virtualization/hostpath-provisioner-operator-rhel9@sha256:e5fa0aa2d6a48dd2b5e14b9d3741c144b371845c3dbee0dd3a440a1d5fa6d777
    Image ID:      registry.redhat.io/container-native-virtualization/hostpath-provisioner-operator-rhel9@sha256:e5fa0aa2d6a48dd2b5e14b9d3741c144b371845c3dbee0dd3a440a1d5fa6d777
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/mounter
      --storagePoolPath
      /source
      --mountPath
      /var/hpvolumes/csi
      --hostPath
      /host
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 19 Sep 2023 23:05:52 +0000
      Finished:     Tue, 19 Sep 2023 23:05:52 +0000
    Ready:          False
    Restart Count:  52
    Requests:
      cpu:        10m
      memory:     100Mi
    Environment:  <none>
    Mounts:
      /host from host-root (rw)
      /source from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tvh7z (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  hpp-pool-66a3ae7d
    ReadOnly:   false
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  Directory
  kube-api-access-tvh7z:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Normal   Pulled   25m (x49 over 4h5m)    kubelet  Container image "registry.redhat.io/container-native-virtualization/hostpath-provisioner-operator-rhel9@sha256:e5fa0aa2d6a48dd2b5e14b9d3741c144b371845c3dbee0dd3a440a1d5fa6d777" already present on machine
  Warning  BackOff  21s (x1118 over 4h5m)  kubelet  Back-off restarting failed container mounter in pod hpp-pool-66a3ae7d-7b586fb698-znstp_openshift-cnv(f18608e2-05c3-4c7b-80b7-08a41bb10e65)
[root@api-int ~]#

Comment 5 Red Hat Bugzilla 2024-01-18 04:25:20 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

