Bug 2073029 - When 2 MONs are down, the app pods with cephfs PVC go into error state
Summary: When 2 MONs are down, the app pods with cephfs PVC go into error state
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Ohad
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-04-07 13:50 UTC by Rachael
Modified: 2023-08-09 17:00 UTC (History)
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-29 02:42:05 UTC
Embargoed:


Attachments (Terms of Use)

Description Rachael 2022-04-07 13:50:32 UTC
Description of problem:
-----------------------

On the provider cluster, 2 of the 3 MON deployments were scaled down. After a while, all application pods on the consumer clusters that were using CephFS PVCs went into CreateContainerError with the following error message:

  Warning  Failed     23m                   kubelet  Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name k8s_fedora_fedorapod-cephfs-rwo-ifeqq-1-z9cbm_fedora-pods_2ffb62e1-0a79-4bb6-b5e0-1771e5cc8dbc_2 for id 48b15792cde8f948f7cd3aec2d4d6ece1956b572058ca9e28cd3669dee9f2a5f: name is reserved


The pods in the cluster using RBD PVCs were not affected and remained up and running.

Once the MON deployments were scaled back up, the pods returned to the 1/1 Running state.


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.0                      NooBaa Operator               4.10.0                                                      Succeeded
ocs-operator.v4.10.0                      OpenShift Container Storage   4.10.0                                                      Succeeded
ocs-osd-deployer.v2.0.0                   OCS OSD Deployer              2.0.0                                                       Succeeded
odf-operator.v4.10.0                      OpenShift Data Foundation     4.10.0 (full_version=4.10.0-219)                            Succeeded
ose-prometheus-operator.4.8.0             Prometheus Operator           4.8.0                                                       Succeeded
route-monitor-operator.v0.1.408-c2256a2   Route Monitor Operator        0.1.408-c2256a2   route-monitor-operator.v0.1.406-54ff884   Succeeded


How reproducible: 
-----------------

1/1


Steps to Reproduce:
-------------------

1. Create application pods using cephfs and RBD PVCs on the consumer clusters
2. Scale down 2 MON deployments on the provider cluster
3. Check the status of the application pods on the consumer cluster
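The steps above can be sketched with oc commands as follows. This is a minimal sketch only: the MON deployment names (rook-ceph-mon-a, rook-ceph-mon-b), the storage namespace (openshift-storage), and the application namespace (fedora-pods) are assumptions and may differ on a given provider/consumer cluster.

```shell
# Step 2 (provider cluster): scale down 2 of the 3 MON deployments.
# Deployment names and namespace are assumptions; list them first with:
#   oc get deployments -n openshift-storage | grep rook-ceph-mon
oc scale deployment rook-ceph-mon-a --replicas=0 -n openshift-storage
oc scale deployment rook-ceph-mon-b --replicas=0 -n openshift-storage

# Step 3 (consumer cluster): check the application pods using CephFS PVCs.
oc get pods -n fedora-pods
oc describe pod <cephfs-app-pod> -n fedora-pods   # shows the CreateContainerError event

# Recovery: scale the MON deployments back up and watch the pods return
# to the 1/1 Running state.
oc scale deployment rook-ceph-mon-a --replicas=1 -n openshift-storage
oc scale deployment rook-ceph-mon-b --replicas=1 -n openshift-storage
```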


Actual results:
---------------
The pods using cephfs PVC are in CreateContainerError with the following error:

  Warning  Failed     23m                   kubelet  Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name k8s_fedora_fedorapod-cephfs-rwo-ifeqq-1-z9cbm_fedora-pods_2ffb62e1-0a79-4bb6-b5e0-1771e5cc8dbc_2 for id 48b15792cde8f948f7cd3aec2d4d6ece1956b572058ca9e28cd3669dee9f2a5f: name is reserved


Expected results:
-----------------
The application pods should not go into Error state

Comment 6 Mudit Agarwal 2022-04-15 07:24:09 UTC
Orit, I guess the above comment is incomplete. Please check

Comment 8 Greg Farnum 2022-04-29 02:42:05 UTC
This is a consequence of the MDS being unable to ping the monitors, and so timing itself out. Changing that behavior would induce negative behavior in other edge cases I think are more common, so this probably isn’t a good idea to amend.

