Bug 2278606

Summary: After upgrading ODF 4.15 to ODF 4.16 with Multus (dropping the holder design), pods fail to mount PVCs (CephFS and Ceph RBD)

Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: rook
Version: 4.16
Target Release: ODF 4.16.0
Severity: urgent
Priority: unspecified
Status: CLOSED ERRATA
Keywords: TestBlocker
Reporter: Oded <oviner>
Assignee: Blaine Gardner <brgardne>
QA Contact: Oded <oviner>
CC: brgardne, hnallurv, mrajanna, odf-bz-bot, sapillai, tnielsen
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.16.0-96
Doc Type: No Doc Update
Last Closed: 2024-07-17 13:21:47 UTC
Type: Bug

Description Oded 2024-05-02 09:48:47 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

After upgrading ODF 4.15 to ODF 4.16 with Multus (dropping the holder design), pods fail to mount PVCs (CephFS and Ceph RBD).

Events:
  Type     Reason                  Age                 From                     Message
  ----     ------                  ----                ----                     -------
  Normal   SuccessfulAttachVolume  3m12s               attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-dee915f1-948e-46bc-83d9-4e0eece2fee7"
  Warning  FailedMount             60s (x9 over 3m9s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-dee915f1-948e-46bc-83d9-4e0eece2fee7" : rpc error: code = Internal desc = rbd: map failed with error failed to get stat for /var/lib/kubelet/plugins/openshift-storage.rbd.csi.ceph.com/openshift-storage.net.ns stat /var/lib/kubelet/plugins/openshift-storage.rbd.csi.ceph.com/openshift-storage.net.ns: no such file or directory, rbd error output:



Version of all relevant components (if applicable):
ODF Version: odf-operator.v4.16.0-90.stable
OCP Version: 4.16.0-0.nightly-2024-04-30-053518
Provider: BM

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy OCP 4.16 on BM
2. Install LSO 4.14
3. Install ODF 4.15.1 [GA'ed]
4. Create NADs for public-net and cluster-net (a hedged example NAD is sketched after step 12 below)
5. Run the Multus validation tool
6. Create the StorageCluster with Multus
7. Upgrade ODF 4.15.1 to ODF 4.16.0
8. Drop the holder design
9. Verify the StorageCluster is in Ready state and Ceph status is "HEALTH_OK"
10. Create a PVC:
```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-700476fd152b469f8c2c7267b994928
  namespace: namespace-test-63f6360ea1ba4433a2a9950eb
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
  storageClassName: ocs-storagecluster-ceph-rbd
```

11. Create a pod that mounts the PVC:
```
apiVersion: v1
kind: Pod
metadata:
  name: pod-test-rbd-9904a4df491f4829b53789a051c
  namespace: namespace-test-63f6360ea1ba4433a2a9950eb
spec:
  containers:
  - image: quay.io/ocsci/nginx:fio
    name: web-server
    volumeMounts:
    - mountPath: /var/lib/www/html
      name: mypvc
  nodeName: argo006.ceph.redhat.com
  volumes:
  - name: mypvc
    persistentVolumeClaim:
      claimName: pvc-test-700476fd152b469f8c2c7267b994928
      readOnly: false
```
12. Check pod status:
$ oc get pods -n namespace-test-63f6360ea1ba4433a2a9950eb 
NAME                                       READY   STATUS              RESTARTS   AGE
pod-test-rbd-9904a4df491f4829b53789a051c   0/1     ContainerCreating   0          2m49s

$ oc describe pods -n namespace-test-63f6360ea1ba4433a2a9950eb 
Name:             pod-test-rbd-9904a4df491f4829b53789a051c
Namespace:        namespace-test-63f6360ea1ba4433a2a9950eb
Priority:         0
Service Account:  default
Node:             argo006.ceph.redhat.com/10.8.128.206
Start Time:       Thu, 02 May 2024 12:31:58 +0300
Labels:           <none>
Annotations:      k8s.ovn.org/pod-networks:
                    {"default":{"ip_addresses":["10.129.2.183/23"],"mac_address":"0a:58:0a:81:02:b7","gateway_ips":["10.129.2.1"],"routes":[{"dest":"10.128.0....
                  openshift.io/scc: anyuid
Status:           Pending
IP:               
IPs:              <none>
Containers:
  web-server:
    Container ID:   
    Image:          quay.io/ocsci/nginx:fio
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/www/html from mypvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tx5bn (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  mypvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-test-700476fd152b469f8c2c7267b994928
    ReadOnly:   false
  kube-api-access-tx5bn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                 From                     Message
  ----     ------                  ----                ----                     -------
  Normal   SuccessfulAttachVolume  3m12s               attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-dee915f1-948e-46bc-83d9-4e0eece2fee7"
  Warning  FailedMount             60s (x9 over 3m9s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-dee915f1-948e-46bc-83d9-4e0eece2fee7" : rpc error: code = Internal desc = rbd: map failed with error failed to get stat for /var/lib/kubelet/plugins/openshift-storage.rbd.csi.ceph.com/openshift-storage.net.ns stat /var/lib/kubelet/plugins/openshift-storage.rbd.csi.ceph.com/openshift-storage.net.ns: no such file or directory, rbd error output:
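
For reference, a minimal sketch of what the public-net NAD created in step 4 might look like (macvlan with whereabouts IPAM, as commonly used for the ODF public network); the namespace, master interface, and IP range here are assumptions, not the values from this cluster:
```
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: public-net
  namespace: openshift-storage    # assumed namespace; adjust to the actual deployment
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth0",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.20.0/24"
      }
    }
```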



Actual results:


Expected results:


Additional info:

https://docs.google.com/document/d/1CFvmSun2rbIpol0rmNht1AfXkEz7WouqB_XSiVrlT0c/edit

Comment 13 Oded 2024-05-17 12:23:09 UTC
Bug fixed
A. Deploy OCP 4.16 (4.16.0-0.nightly-2024-05-15-001800)
B. Install LSO 4.14 (local-storage-operator.v4.14.0-202311031050)
C. Install the ODF 4.15.2 operator [odf-operator.v4.15.2-rhodf]
D. Create NADs
E. Run the Multus validation tool [success]
F. Create the StorageCluster with Multus
G. Check StorageCluster status and Ceph status
H. Upgrade ODF 4.15.2 to ODF 4.16.0 [4.16.0-101]
I. Drop the holder design:
1. Edit the public-net NAD [add routes]
2. Install the NMState Operator via OperatorHub
3. Create an instance of the NMState operator (a hedged example CR is sketched after step 13 below)
5. Restart all OSD and MDS pods
6. Check connectivity between OSDs
7. Stop managing the holder pods [set CSI_DISABLE_HOLDER_PODS = "true"]
8. Verify that the csi-*plugin-* pods restart and the csi-*plugin-holder-* pods remain running
a. Restart the rook-ceph-operator pod [bug https://bugzilla.redhat.com/show_bug.cgi?id=2278184]
$ oc delete pods rook-ceph-operator-7bb7bdb698-rpgn7 
pod "rook-ceph-operator-7bb7bdb698-rpgn7" deleted
9. When all CSI pods return to the Running state, check that they are using the correct host networking configuration
10. Cordon and drain all the worker nodes, then delete all csi-*plugin-holder* pods on each node
11. Delete the csi-*plugin-holder* DaemonSets
12. Verify the StorageCluster is in Ready state and Ceph status is "HEALTH_OK"
13. Run the acceptance suite: https://url.corp.redhat.com/59cf316
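
For step 3, a minimal sketch of the NMState instance CR, assuming the conventional instance name used with the NMState operator (not taken from this cluster):
```
apiVersion: nmstate.io/v1
kind: NMState
metadata:
  name: nmstate    # conventional instance name; assumed
```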

For more details:
https://docs.google.com/document/d/1iGFQrFQHI3tGxirKFBV-TvThjO-S9NSIAPP-NHCJ1oA/edit

Comment 14 Oded 2024-05-17 12:33:28 UTC
I forgot to replace the ConfigMap name in my procedure:
I used ocs-operator-config, but I need to use rook-ceph-operator-config, based on bug https://bugzilla.redhat.com/show_bug.cgi?id=2278184. A hedged sketch of the corrected setting is shown below.
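
A minimal sketch of the corrected change, assuming the rook-ceph-operator-config ConfigMap lives in the openshift-storage namespace (the usual ODF namespace); only the CSI_DISABLE_HOLDER_PODS key from step 7 of the procedure is shown, and any other keys in the real ConfigMap would be left untouched:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-operator-config
  namespace: openshift-storage    # assumed ODF namespace
data:
  CSI_DISABLE_HOLDER_PODS: "true"    # stop managing the csi-*plugin-holder-* pods
```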

Comment 17 errata-xmlrpc 2024-07-17 13:21:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591