Bug 2185573

Summary: [Longevity] rbd pvc mount to a pod failed with error: "rbd: map failed: (108) Cannot send after transport endpoint shutdown"
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Prasad Desala <tdesala>
Component: ceph Assignee: Ilya Dryomov <idryomov>
ceph sub component: RBD QA Contact: Prasad Desala <tdesala>
Status: NEW --- Docs Contact:
Severity: high    
Priority: unspecified CC: bniver, hnallurv, idryomov, muagarwa, odf-bz-bot, sheggodu, sostapov, ypadia
Version: 4.12   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---

Description Prasad Desala 2023-04-10 10:51:34 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
==================================================================================
Mounting an RBD PVC to a pod failed with the error below while running the Stage4 test script developed for ODF Longevity testing. The script executes concurrent PVC clone, snapshot, and expand operations.

```
rbd error output: rbd: sysfs write failed
rbd: map failed: (108) Cannot send after transport endpoint shutdown

```

Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Normal   Scheduled               5m4s                 default-scheduler        Successfully assigned stage-4-cycle-12-concurrent-operation/pod-test-rbd-5fa7cd7c079d45579522c712c82 to compute-5
  Normal   SuccessfulAttachVolume  5m4s                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-13e9667f-abbb-4652-bd8f-6b8e70c62c5c"
  Warning  FailedMount             58s (x2 over 3m1s)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[mypvc kube-api-access-kh2jq]: timed out waiting for the condition
  Warning  FailedMount             50s (x10 over 5m1s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-13e9667f-abbb-4652-bd8f-6b8e70c62c5c" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 108) occurred while running rbd args: [--id csi-rbd-node -m 172.30.50.148:6789,172.30.78.51:6789,172.30.65.120:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-ef418668-5bad-4741-8cf3-95c03098b9a8 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed
rbd: map failed: (108) Cannot send after transport endpoint shutdown
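
For triage, the numeric code in the rbd output can be pulled out and mapped to its symbolic errno name. The following is a minimal sketch, not part of ocs-ci or ceph-csi: the function name `parse_rbd_map_error` and the regular expression are assumptions made for illustration, and the symbolic name depends on the platform where the script runs (on Linux, errno 108 is ESHUTDOWN).

```python
import errno
import re

# Assumed pattern: matches the "(<errno>) <message>" tail of the rbd CLI
# failure line as it appears in the output quoted above.
RBD_ERROR_RE = re.compile(r"rbd: map failed: \((\d+)\) (.+)")


def parse_rbd_map_error(output):
    """Return (errno, symbolic name, message) from rbd map output, or None."""
    match = RBD_ERROR_RE.search(output)
    if match is None:
        return None
    code = int(match.group(1))
    # errno.errorcode is platform specific; fall back to a placeholder
    # when the local platform does not define this number.
    name = errno.errorcode.get(code, "UNKNOWN")
    return code, name, match.group(2).strip()


sample = (
    "rbd error output: rbd: sysfs write failed\n"
    "rbd: map failed: (108) Cannot send after transport endpoint shutdown\n"
)
print(parse_rbd_map_error(sample))
```

The same function can be applied to the `Message` column of the FailedMount events, since the rbd error line is appended to the kubelet message unchanged.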


ocs-ci timestamps logs:
========================

15:39:08 - ThreadPoolExecutor-2980_0 - ocs_ci.helpers.helpers - INFO  - Creating new Pod pod-test-rbd-5fa7cd7c079d45579522c712c82 for test

15:39:08 - ThreadPoolExecutor-2980_0 - ocs_ci.utility.templating - INFO  - apiVersion: v1
kind: Pod
metadata:
  name: pod-test-rbd-5fa7cd7c079d45579522c712c82
  namespace: stage-4-cycle-12-concurrent-operation
spec:
  containers:
  - image: quay.io/ocsci/nginx:latest
    name: web-server
    volumeMounts:
    - mountPath: /var/lib/www/html
      name: mypvc
  volumes:
  - name: mypvc
    persistentVolumeClaim:
      claimName: clone-pvc-test-d620d47ac72d48-064609de89
      readOnly: false

15:44:13 - ThreadPoolExecutor-2980_0 - ocs_ci.ocs.ocp - WARNING  - Description of the resource(s) we were waiting for:
Name:             pod-test-rbd-5fa7cd7c079d45579522c712c82
Namespace:        stage-4-cycle-12-concurrent-operation
Priority:         0
Service Account:  default
Node:             compute-5/10.1.114.73
Start Time:       Sat, 08 Apr 2023 15:39:09 +0300
Labels:           <none>
Annotations:      k8s.ovn.org/pod-networks:
                    {"default":{"ip_addresses":["10.128.4.251/23"],"mac_address":"0a:58:0a:80:04:fb","gateway_ips":["10.128.4.1"],"ip_address":"10.128.4.251/2...
                  openshift.io/scc: privileged
Status:           Pending
IP:
IPs:              <none>
Containers:
  web-server:
    Container ID:
    Image:          quay.io/ocsci/nginx:latest
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/www/html from mypvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kh2jq (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  mypvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  clone-pvc-test-d620d47ac72d48-064609de89
    ReadOnly:   false
  kube-api-access-kh2jq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Normal   Scheduled               5m4s                 default-scheduler        Successfully assigned stage-4-cycle-12-concurrent-operation/pod-test-rbd-5fa7cd7c079d45579522c712c82 to compute-5
  Normal   SuccessfulAttachVolume  5m4s                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-13e9667f-abbb-4652-bd8f-6b8e70c62c5c"
  Warning  FailedMount             58s (x2 over 3m1s)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[mypvc kube-api-access-kh2jq]: timed out waiting for the condition
  Warning  FailedMount             50s (x10 over 5m1s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-13e9667f-abbb-4652-bd8f-6b8e70c62c5c" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 108) occurred while running rbd args: [--id csi-rbd-node -m 172.30.50.148:6789,172.30.78.51:6789,172.30.65.120:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-ef418668-5bad-4741-8cf3-95c03098b9a8 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed
rbd: map failed: (108) Cannot send after transport endpoint shutdown

15:44:13 - ThreadPoolExecutor-2980_0 - ocs_ci.ocs.ocp - ERROR  - Wait for Pod resource pod-test-rbd-5fa7cd7c079d45579522c712c82 at column STATUS to reach desired condition Running failed, last actual status was ContainerCreating


Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Reporting at first occurrence 


Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
====================
1) Run Stage4 (https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/longevity/test_stage4.py) with the run time set to 4 days

Summary of the steps:
1. PVC, POD Creation + fill data up to 25% of mount point space
2. Start concurrent PVC operations:
   a) Clone - Creation, Deletion
   b) Snapshot - Creation, Restoration, Deletion
   c) Expansion of original PVCs
3. PVC, POD deletion
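
The concurrency pattern in step 2 can be sketched roughly as follows. This is an illustration only and not the actual test_stage4.py code: `clone_pvc`, `snapshot_pvc`, `expand_pvc`, and `run_concurrent_operations` are hypothetical stand-ins that return strings instead of talking to a cluster. The point is that clone, snapshot, and expand requests for every PVC are in flight at the same time.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


# Hypothetical placeholders for the real clone/snapshot/expand helpers.
def clone_pvc(pvc):
    return f"clone-{pvc}"


def snapshot_pvc(pvc):
    return f"snapshot-{pvc}"


def expand_pvc(pvc):
    return f"expanded-{pvc}"


def run_concurrent_operations(pvcs, max_workers=8):
    """Submit every operation for every PVC at once and collect the outcomes."""
    operations = (clone_pvc, snapshot_pvc, expand_pvc)
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the operation and PVC it belongs to.
        futures = {
            executor.submit(op, pvc): (op.__name__, pvc)
            for pvc in pvcs
            for op in operations
        }
        for future in as_completed(futures):
            name, pvc = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:
                # Record the failure and keep going so one bad PVC
                # does not hide the outcome of the others.
                failures.append((name, pvc, exc))
    return results, failures


results, failures = run_concurrent_operations(["pvc-a", "pvc-b"])
print(sorted(results), failures)
```

Collecting failures instead of raising on the first one keeps the per-PVC context (operation name and PVC) that is needed when a single mount or map call fails in the middle of a long run.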

Actual results:
================
rbd pvc mount failed with error: "rbd: map failed: (108) Cannot send after transport endpoint shutdown"

Expected results:
=================
RBD PVC should mount to a pod successfully without any issues/errors.

Additional info: