Description of problem:
-----------------------
On the Provider cluster, 2/3 MON deployments were scaled down. After a while it was observed that all the application pods on the consumer clusters that were using cephfs PVCs went into CreateContainerError with the following error message:

  Warning  Failed  23m  kubelet  Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name k8s_fedora_fedorapod-cephfs-rwo-ifeqq-1-z9cbm_fedora-pods_2ffb62e1-0a79-4bb6-b5e0-1771e5cc8dbc_2 for id 48b15792cde8f948f7cd3aec2d4d6ece1956b572058ca9e28cd3669dee9f2a5f: name is reserved

The pods in the cluster using RBD PVCs were not affected and remained up and running. Once the MON deployments were scaled back up, the pods returned to the 1/1 Running state.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
NAME                                      DISPLAY                       VERSION                            REPLACES                                  PHASE
mcg-operator.v4.10.0                      NooBaa Operator               4.10.0                                                                       Succeeded
ocs-operator.v4.10.0                      OpenShift Container Storage   4.10.0                                                                       Succeeded
ocs-osd-deployer.v2.0.0                   OCS OSD Deployer              2.0.0                                                                        Succeeded
odf-operator.v4.10.0                      OpenShift Data Foundation     4.10.0 (full_version=4.10.0-219)                                             Succeeded
ose-prometheus-operator.4.8.0             Prometheus Operator           4.8.0                                                                        Succeeded
route-monitor-operator.v0.1.408-c2256a2   Route Monitor Operator        0.1.408-c2256a2                    route-monitor-operator.v0.1.406-54ff884   Succeeded

How reproducible:
-----------------
1/1

Steps to Reproduce:
-------------------
1. Create application pods using cephfs and RBD PVCs on the consumer clusters
2. Scale down 2 MON deployments on the provider cluster (example commands are sketched after this report)
3. Check the status of the application pods on the consumer cluster

Actual results:
---------------
The pods using cephfs PVCs are in CreateContainerError with the following error:

  Warning  Failed  23m  kubelet  Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name k8s_fedora_fedorapod-cephfs-rwo-ifeqq-1-z9cbm_fedora-pods_2ffb62e1-0a79-4bb6-b5e0-1771e5cc8dbc_2 for id 48b15792cde8f948f7cd3aec2d4d6ece1956b572058ca9e28cd3669dee9f2a5f: name is reserved

Expected results:
-----------------
The application pods should not go into an Error state.
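For reference, a minimal sketch of the reproduction steps from the CLI. The MON deployment names (rook-ceph-mon-a/-b) and the openshift-storage namespace are assumed Rook/OCS defaults and may differ on the provider cluster; the fedora-pods namespace comes from the error message above.

  # On the provider cluster: scale down 2 of the 3 MON deployments.
  # Deployment names and namespace are assumed defaults; adjust to match the cluster.
  oc -n openshift-storage scale deployment rook-ceph-mon-a --replicas=0
  oc -n openshift-storage scale deployment rook-ceph-mon-b --replicas=0

  # On the consumer cluster: watch the application pods that use cephfs PVCs.
  oc -n fedora-pods get pods -w
  oc -n fedora-pods describe pod <pod-name>

  # Recovery: scale the MON deployments back up on the provider cluster.
  oc -n openshift-storage scale deployment rook-ceph-mon-a --replicas=1
  oc -n openshift-storage scale deployment rook-ceph-mon-b --replicas=1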
Orit, I guess the above comment is incomplete. Please check.
This is a consequence of the MDS being unable to ping the monitors, and so timing itself out. Changing that behavior would induce negative behavior in other edge cases that I think are more common, so amending it probably isn't a good idea.
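For anyone reproducing this, the degraded MON quorum and MDS state can be observed from the Ceph toolbox on the provider cluster while the MONs are scaled down. This is only a sketch; it assumes the standard rook-ceph-tools deployment (label app=rook-ceph-tools) in the openshift-storage namespace:

  # Locate the toolbox pod (label and namespace are assumed Rook/OCS defaults).
  TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)

  # Overall cluster health, including MONs out of quorum.
  oc -n openshift-storage rsh "$TOOLS_POD" ceph status

  # MDS state for the cephfs filesystem.
  oc -n openshift-storage rsh "$TOOLS_POD" ceph fs status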