Description of problem (please be as detailed as possible and provide log snippets):

After upgrading ODF 4.15 to ODF 4.16 with multus [dropping the holder design], pods hit FailedMount on PVCs [ceph-fs and ceph-rbd]:

Events:
  Type     Reason                  Age                 From                     Message
  ----     ------                  ----                ----                     -------
  Normal   SuccessfulAttachVolume  3m12s               attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-dee915f1-948e-46bc-83d9-4e0eece2fee7"
  Warning  FailedMount             60s (x9 over 3m9s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-dee915f1-948e-46bc-83d9-4e0eece2fee7" : rpc error: code = Internal desc = rbd: map failed with error failed to get stat for /var/lib/kubelet/plugins/openshift-storage.rbd.csi.ceph.com/openshift-storage.net.ns stat /var/lib/kubelet/plugins/openshift-storage.rbd.csi.ceph.com/openshift-storage.net.ns: no such file or directory, rbd error output:

Version of all relevant components (if applicable):
ODF Version: odf-operator.v4.16.0-90.stable
OCP Version: 4.16.0-0.nightly-2024-04-30-053518
Provider: BM

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy OCP 4.16 on BM
2. Install LSO 4.14
3. Install ODF 4.15.1 [GA'ed]
4. Create NADs for public-net and cluster-net (a hedged example NAD is sketched at the end of this comment)
5. Run the multus validation tool
6. Create a StorageCluster with multus (a hedged network-spec sketch follows the NAD example below)
7. Upgrade ODF 4.15.1 to ODF 4.16.0
8. Drop the holder design
9. Verify the StorageCluster is in Ready state and ceph status is HEALTH_OK
10. Create a PVC:
```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-700476fd152b469f8c2c7267b994928
  namespace: namespace-test-63f6360ea1ba4433a2a9950eb
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
  storageClassName: ocs-storagecluster-ceph-rbd
```
11. Create a pod that mounts the PVC:
```
apiVersion: v1
kind: Pod
metadata:
  name: pod-test-rbd-9904a4df491f4829b53789a051c
  namespace: namespace-test-63f6360ea1ba4433a2a9950eb
spec:
  containers:
    - image: quay.io/ocsci/nginx:fio
      name: web-server
      volumeMounts:
        - mountPath: /var/lib/www/html
          name: mypvc
  nodeName: argo006.ceph.redhat.com
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: pvc-test-700476fd152b469f8c2c7267b994928
        readOnly: false
```
12. Check the pod status:

$ oc get pods -n namespace-test-63f6360ea1ba4433a2a9950eb
NAME                                       READY   STATUS              RESTARTS   AGE
pod-test-rbd-9904a4df491f4829b53789a051c   0/1     ContainerCreating   0          2m49s

$ oc describe pods -n namespace-test-63f6360ea1ba4433a2a9950eb
Name:             pod-test-rbd-9904a4df491f4829b53789a051c
Namespace:        namespace-test-63f6360ea1ba4433a2a9950eb
Priority:         0
Service Account:  default
Node:             argo006.ceph.redhat.com/10.8.128.206
Start Time:       Thu, 02 May 2024 12:31:58 +0300
Labels:           <none>
Annotations:      k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.129.2.183/23"],"mac_address":"0a:58:0a:81:02:b7","gateway_ips":["10.129.2.1"],"routes":[{"dest":"10.128.0....
                  openshift.io/scc: anyuid
Status:           Pending
IP:
IPs:              <none>
Containers:
  web-server:
    Container ID:
    Image:          quay.io/ocsci/nginx:fio
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/www/html from mypvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tx5bn (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  mypvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-test-700476fd152b469f8c2c7267b994928
    ReadOnly:   false
  kube-api-access-tx5bn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                 From                     Message
  ----     ------                  ----                ----                     -------
  Normal   SuccessfulAttachVolume  3m12s               attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-dee915f1-948e-46bc-83d9-4e0eece2fee7"
  Warning  FailedMount             60s (x9 over 3m9s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-dee915f1-948e-46bc-83d9-4e0eece2fee7" : rpc error: code = Internal desc = rbd: map failed with error failed to get stat for /var/lib/kubelet/plugins/openshift-storage.rbd.csi.ceph.com/openshift-storage.net.ns stat /var/lib/kubelet/plugins/openshift-storage.rbd.csi.ceph.com/openshift-storage.net.ns: no such file or directory, rbd error output:

Actual results:

Expected results:

Additional info:
https://docs.google.com/document/d/1CFvmSun2rbIpol0rmNht1AfXkEz7WouqB_XSiVrlT0c/edit
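As referenced from step 4, a minimal sketch of the two NetworkAttachmentDefinitions. This is not the exact config used in this run: the CNI type (macvlan), master interface names (eth1/eth2), namespace, and IP ranges are assumptions and must match the actual host networking on the BM nodes.

```
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: public-net
  # assumption: NADs live in the namespace referenced by the StorageCluster selectors
  namespace: openshift-storage
spec:
  # assumption: macvlan over a dedicated host NIC (eth1) with whereabouts IPAM and an example range
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.20.0/24"
      }
    }
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: cluster-net
  namespace: openshift-storage
spec:
  # assumption: a second NIC (eth2) carries the Ceph cluster/replication traffic
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth2",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.30.0/24"
      }
    }
```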
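And, for step 6, a minimal sketch of the multus-related portion of the StorageCluster, assuming the NAD names and namespace from the sketch above; the rest of the spec (storageDeviceSets, resources, etc.) is omitted.

```
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  network:
    provider: multus
    selectors:
      # assumption: <namespace>/<NAD name> pointing at the NADs sketched above
      public: openshift-storage/public-net
      cluster: openshift-storage/cluster-net
  # ... storageDeviceSets and the rest of the StorageCluster spec omitted
```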
Bug fixed.

A. Deploy OCP 4.16 (4.16.0-0.nightly-2024-05-15-001800)
B. Install LSO 4.14 (local-storage-operator.v4.14.0-202311031050)
C. Install the ODF 4.15.2 operator [odf-operator.v4.15.2-rhodf]
D. Create NADs
E. Run the multus validation tool [success]
F. Create a StorageCluster with multus
G. Check the StorageCluster status and ceph status
H. Upgrade ODF 4.15.2 to ODF 4.16.0 [4.16.0-101]
I. Drop the holder design:
   1. Edit the public-net NAD [add routes] (a hedged routes sketch follows this comment)
   2. Install the NMState Operator via OperatorHub
   3. Create an instance of the NMState operator (example below)
   5. Reset all OSD and MDS pods
   6. Check connectivity between OSDs
   7. Stop managing holder pods [CSI_DISABLE_HOLDER_PODS = "true"]
   8. Verify that the csi-*plugin-* pods restart and the csi-*plugin-holder-* pods remain running.
      a. Reset the rook-ceph operator pod [bug https://bugzilla.redhat.com/show_bug.cgi?id=2278184]
         $ oc delete pods rook-ceph-operator-7bb7bdb698-rpgn7
         pod "rook-ceph-operator-7bb7bdb698-rpgn7" deleted
   9. When all CSI pods return to the Running state, check that the CSI pods are using the correct host networking configuration
   10. Cordon and drain all the worker nodes and delete all csi-*plugin-holder* pods on each node (example commands below)
   11. Delete the csi-*plugin-holder* daemonsets.
   12. Verify the StorageCluster is in Ready state and ceph status is HEALTH_OK
   13. Run the acceptance suite: https://url.corp.redhat.com/59cf316

For more details: https://docs.google.com/document/d/1iGFQrFQHI3tGxirKFBV-TvThjO-S9NSIAPP-NHCJ1oA/edit
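As a rough illustration of step I.1, one way the public-net NAD's whereabouts IPAM section could be extended with routes so that host-networked clients can reach the Ceph public network. The destination CIDR is a placeholder assumption; the exact route set must come from the holder-removal documentation for the environment.

```
"ipam": {
  "type": "whereabouts",
  "range": "192.168.20.0/24",
  "routes": [
    { "dst": "192.168.252.0/22" }
  ]
}
```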
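For step I.3, a minimal sketch of the NMState operator instance (the singleton resource is conventionally named nmstate):

```
apiVersion: nmstate.io/v1
kind: NMState
metadata:
  name: nmstate
```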
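For steps I.10 and I.11, hedged example commands; the node name is a placeholder, and the holder pod and daemonset names should be discovered with `oc get` rather than assumed.

```
# Per worker node: cordon, drain, delete that node's holder pods, then uncordon
$ oc adm cordon <node-name>
$ oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data
$ oc -n openshift-storage get pods -o wide | grep plugin-holder | grep <node-name>
$ oc -n openshift-storage delete pod <csi-rbdplugin-holder-pod> <csi-cephfsplugin-holder-pod>
$ oc adm uncordon <node-name>

# Once every node has been processed, remove the holder daemonsets (step I.11)
$ oc -n openshift-storage get daemonset | grep plugin-holder
$ oc -n openshift-storage delete daemonset <csi-rbdplugin-holder-ds> <csi-cephfsplugin-holder-ds>
```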
I forgot to replace the configmap name in my procedure: I used ocs-operator-config, but it should be rook-ceph-operator-config, based on this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2278184
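For clarity, a sketch of what step I.7 looks like with the corrected configmap name; the openshift-storage namespace is assumed for an ODF deployment, and the key should be verified against the current holder-removal docs.

```
$ oc -n openshift-storage patch configmap rook-ceph-operator-config \
    --type merge -p '{"data":{"CSI_DISABLE_HOLDER_PODS":"true"}}'
```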
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591