Bug 2258357
| Summary: | Storagecluster is in warning state on IBM Power cluster due to "no active mgr" error | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Aaruni Aggarwal <aaaggarw> |
| Component: | ocs-operator | Assignee: | Malay Kumar parida <mparida> |
| Status: | CLOSED ERRATA | QA Contact: | Neha Berry <nberry> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.15 | CC: | akandath, bniver, dosypenk, ebenahar, mparida, muagarwa, nberry, ngowda, nojha, odf-bz-bot, sapillai, sostapov, tnielsen, uchapaga |
| Target Milestone: | --- | ||
| Target Release: | ODF 4.15.0 | ||
| Hardware: | ppc64le | ||
| OS: | Linux | ||
| Whiteboard: | verification-blocked | ||
| Fixed In Version: | 4.15.0-149 | Doc Type: | No Doc Update |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2024-03-19 15:31:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Aaruni Aggarwal
2024-01-14 17:42:09 UTC
[root@rdr-odf15-bastion-0 ~]# oc get csv -A
NAMESPACE                              NAME                                          DISPLAY                       VERSION               REPLACES   PHASE
openshift-local-storage                local-storage-operator.v4.15.0-202311280332   Local Storage                 4.15.0-202311280332              Succeeded
openshift-operator-lifecycle-manager   packageserver                                 Package Server                0.0.1-snapshot                   Succeeded
openshift-storage                      mcg-operator.v4.15.0-112.stable               NooBaa Operator               4.15.0-112.stable                Succeeded
openshift-storage                      ocs-operator.v4.15.0-112.stable               OpenShift Container Storage   4.15.0-112.stable                Succeeded
openshift-storage                      odf-csi-addons-operator.v4.15.0-112.stable    CSI Addons                    4.15.0-112.stable                Succeeded
openshift-storage                      odf-operator.v4.15.0-112.stable               OpenShift Data Foundation     4.15.0-112.stable                Succeeded

pods:
[root@rdr-odf15-bastion-0 ~]# oc get pods
NAME                                                              READY   STATUS             RESTARTS          AGE
csi-addons-controller-manager-6cdd5c677d-gvkn4                    2/2     Running            0                 22m
csi-cephfsplugin-699vv                                            2/2     Running            1 (4d7h ago)      4d7h
csi-cephfsplugin-glxq8                                            2/2     Running            0                 4d7h
csi-cephfsplugin-hhgdh                                            2/2     Running            0                 4d7h
csi-cephfsplugin-provisioner-7ff459f4bb-8hjkg                     6/6     Running            0                 4d7h
csi-cephfsplugin-provisioner-7ff459f4bb-tkxpx                     6/6     Running            0                 3d8h
csi-nfsplugin-9k84b                                               2/2     Running            0                 3d18h
csi-nfsplugin-nbh49                                               2/2     Running            0                 3d18h
csi-nfsplugin-provisioner-bb6658447-2fjtn                         5/5     Running            0                 3d8h
csi-nfsplugin-provisioner-bb6658447-j7jlg                         5/5     Running            0                 3d18h
csi-nfsplugin-wgmvg                                               2/2     Running            0                 3d18h
csi-rbdplugin-hqskw                                               3/3     Running            1 (4d7h ago)      4d7h
csi-rbdplugin-provisioner-954997cc9-nkvfm                         6/6     Running            1 (4d7h ago)      4d7h
csi-rbdplugin-provisioner-954997cc9-wsxdg                         6/6     Running            0                 4d7h
csi-rbdplugin-s9g8n                                               3/3     Running            1 (4d7h ago)      4d7h
csi-rbdplugin-tlck4                                               3/3     Running            0                 4d7h
noobaa-core-0                                                     1/1     Running            0                 3d8h
noobaa-db-pg-0                                                    1/1     Running            0                 4d7h
noobaa-endpoint-687f58577c-nrk96                                  1/1     Running            0                 3d8h
noobaa-operator-55b99765cb-t2nwk                                  2/2     Running            0                 3d8h
ocs-metrics-exporter-5f4dfffd66-xx8fc                             1/1     Running            1 (3d11h ago)     4d7h
ocs-operator-5d54997db6-zlrwk                                     1/1     Running            0                 3d8h
odf-console-75f584d89f-bm8sr                                      1/1     Running            0                 3d18h
odf-operator-controller-manager-644d59cdb7-972k4                  2/2     Running            0                 3d8h
rook-ceph-crashcollector-worker-0-5f94f5cd4f-w2gbn                1/1     Running            0                 3d7h
rook-ceph-crashcollector-worker-1-7946896c88-5c5qk                1/1     Running            0                 4d7h
rook-ceph-crashcollector-worker-2-6d7b7d78f7-tpmxj                1/1     Running            0                 4d7h
rook-ceph-exporter-worker-0-7cb9d57575-zhclg                      1/1     Running            0                 3d7h
rook-ceph-exporter-worker-1-66448cf466-jfmtz                      1/1     Running            0                 4d7h
rook-ceph-exporter-worker-2-58f846979f-zcxqp                      1/1     Running            0                 4d7h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7d577d46x5rsk   2/2     Running            10 (3d7h ago)     4d7h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5676bf49hs6w4   2/2     Running            10 (3d7h ago)     3d8h
rook-ceph-mgr-a-75f4489fbc-4nlmz                                  2/3     CrashLoopBackOff   640 (2m17s ago)   2d9h
rook-ceph-mgr-b-55b64b69cb-whw5n                                  2/3     CrashLoopBackOff   647 (2m58s ago)   2d10h
rook-ceph-mon-a-5ff9f59fb5-rhzqm                                  2/2     Running            0                 4d7h
rook-ceph-mon-b-699455764d-dnw7w                                  2/2     Running            0                 3d7h
rook-ceph-mon-c-57bbc6dd8-hnv4f                                   2/2     Running            1 (3d17h ago)     4d7h
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-6cbc7dc485-gjzbb       2/2     Running            0                 3d18h
rook-ceph-operator-55dbc47d88-v52l5                               1/1     Running            0                 2d10h
rook-ceph-osd-0-55448d6c6f-sf2gs                                  2/2     Running            0                 4d7h
rook-ceph-osd-1-7c45d956c5-cpvfq                                  2/2     Running            0                 4d7h
rook-ceph-osd-2-5555cb59bd-xtfft                                  2/2     Running            0                 3d8h
rook-ceph-osd-prepare-4eb1d1b99ad103ae56db9ccc002f991f-lvs5v      0/1     Completed          0                 4d7h
rook-ceph-osd-prepare-7c3335f82580fa264d344511caad5380-6xhrh      0/1     Completed          0                 4d7h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6565c6dxxxj9   2/2     Running            0                 4d7h
rook-ceph-tools-7997d9b857-cdlrj                                  1/1     Running            0                 4d7h
ux-backend-server-76fb4547d9-lkssp                                2/2     Running            0                 3d8h

cephcluster:
[root@rdr-odf15-bastion-0 ~]# oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE    PHASE   MESSAGE                        HEALTH        EXTERNAL   FSID
ocs-storagecluster-cephcluster   /var/lib/rook     3          4d7h   Ready   Cluster created successfully   HEALTH_WARN              90280706-e0e0-45b9-adaa-24d1265e430a

ceph health:
[root@rdr-odf15-bastion-0 ~]# oc rsh rook-ceph-tools-7997d9b857-cdlrj
sh-5.1$
sh-5.1$ ceph -s
cluster:
id: 90280706-e0e0-45b9-adaa-24d1265e430a
health: HEALTH_WARN
no active mgr
services:
mon: 3 daemons, quorum a,b,c (age 3d)
mgr: no daemons active (since 3m)
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 3d), 3 in (since 3d)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 13 pools, 217 pgs
objects: 18.25k objects, 61 GiB
usage: 145 GiB used, 1.3 TiB / 1.5 TiB avail
pgs: 217 active+clean
sh-5.1$
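
From the toolbox, a few mon-side commands can narrow down the mgr state further. These are standard Ceph CLI commands suggested here as a possible next step (not output captured from this cluster); they only need the mons to be in quorum, so they still work while no mgr is active:

ceph mgr stat        # mgrmap summary: name of the active mgr (if any) and whether one is available
ceph mgr dump        # full mgrmap: active/standby daemons and enabled modules
ceph health detail   # expands HEALTH_WARN into the individual health checks (e.g. MGR_DOWN)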
mgr pod description:
[root@rdr-odf15-bastion-0 odf15]# oc describe pod rook-ceph-mgr-a-75f4489fbc-4nlmz
Name: rook-ceph-mgr-a-75f4489fbc-4nlmz
Namespace: openshift-storage
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: rook-ceph-mgr
Node: worker-0/10.20.181.177
Start Time: Fri, 12 Jan 2024 03:19:41 -0500
Labels: app=rook-ceph-mgr
app.kubernetes.io/component=cephclusters.ceph.rook.io
app.kubernetes.io/created-by=rook-ceph-operator
app.kubernetes.io/instance=a
app.kubernetes.io/managed-by=rook-ceph-operator
app.kubernetes.io/name=ceph-mgr
app.kubernetes.io/part-of=ocs-storagecluster-cephcluster
ceph_daemon_id=a
ceph_daemon_type=mgr
instance=a
mgr=a
mgr_role=active
odf-resource-profile=
pod-template-hash=75f4489fbc
rook.io/operator-namespace=openshift-storage
rook_cluster=openshift-storage
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["10.128.2.215/23"],"mac_address":"0a:58:0a:80:02:d7","gateway_ips":["10.128.2.1"],"routes":[{"dest":"10.128.0....
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.128.2.215"
],
"mac": "0a:58:0a:80:02:d7",
"default": true,
"dns": {}
}]
openshift.io/scc: rook-ceph
prometheus.io/port: 9283
prometheus.io/scrape: true
Status: Running
IP: 10.128.2.215
IPs:
IP: 10.128.2.215
Controlled By: ReplicaSet/rook-ceph-mgr-a-75f4489fbc
Init Containers:
chown-container-data-dir:
Container ID: cri-o://58e1ee1a574d93893d179e0e70d4b5ecf14fc94a7c4ba656a514ca7f83fa4b33
Image: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:1b1870ca13fc52d3c1a6c603e471e230a90cba94baaef9cf56c02b6c7dac35ca
Image ID: aaa7a66552761bae51c79e4ece5beaef01fcbb5977d2f90edc1f6e253f321dc6
Port: <none>
Host Port: <none>
Command:
chown
Args:
--verbose
--recursive
ceph:ceph
/var/log/ceph
/var/lib/ceph/crash
/run/ceph
/var/lib/ceph/mgr/ceph-a
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 12 Jan 2024 03:19:42 -0500
Finished: Fri, 12 Jan 2024 03:19:42 -0500
Ready: True
Restart Count: 0
Limits:
cpu: 1
memory: 1536Mi
Requests:
cpu: 1
memory: 1536Mi
Environment: <none>
Mounts:
/etc/ceph from rook-config-override (ro)
/etc/ceph/keyring-store/ from rook-ceph-mgr-a-keyring (ro)
/run/ceph from ceph-daemons-sock-dir (rw)
/var/lib/ceph/crash from rook-ceph-crash (rw)
/var/lib/ceph/mgr/ceph-a from ceph-daemon-data (rw)
/var/log/ceph from rook-ceph-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vjjzk (ro)
Containers:
mgr:
Container ID: cri-o://ae14e31e879c270f3a02f898a84a9101cc02328673eed16ea876ec9b582f09b6
Image: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:1b1870ca13fc52d3c1a6c603e471e230a90cba94baaef9cf56c02b6c7dac35ca
Image ID: aaa7a66552761bae51c79e4ece5beaef01fcbb5977d2f90edc1f6e253f321dc6
Ports: 6800/TCP, 9283/TCP, 7000/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Command:
ceph-mgr
Args:
--fsid=90280706-e0e0-45b9-adaa-24d1265e430a
--keyring=/etc/ceph/keyring-store/keyring
--default-log-to-stderr=true
--default-err-to-stderr=true
--default-mon-cluster-log-to-stderr=true
--default-log-stderr-prefix=debug
--default-log-to-file=false
--default-mon-cluster-log-to-file=false
--mon-host=$(ROOK_CEPH_MON_HOST)
--mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS)
--id=a
--setuser=ceph
--setgroup=ceph
--client-mount-uid=0
--client-mount-gid=0
--foreground
--public-addr=$(ROOK_POD_IP)
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Sun, 14 Jan 2024 12:58:25 -0500
Finished: Sun, 14 Jan 2024 12:58:38 -0500
Ready: False
Restart Count: 643
Limits:
cpu: 1
memory: 1536Mi
Requests:
cpu: 1
memory: 1536Mi
Liveness: exec [env -i sh -c
outp="$(ceph --admin-daemon /run/ceph/ceph-mgr.a.asok status 2>&1)"
rc=$?
if [ $rc -ne 0 ]; then
echo "ceph daemon health check failed with the following output:"
echo "$outp" | sed -e 's/^/> /g'
exit $rc
fi
] delay=10s timeout=5s period=10s #success=1 #failure=3
Startup: exec [env -i sh -c
outp="$(ceph --admin-daemon /run/ceph/ceph-mgr.a.asok status 2>&1)"
rc=$?
if [ $rc -ne 0 ]; then
echo "ceph daemon health check failed with the following output:"
echo "$outp" | sed -e 's/^/> /g'
exit $rc
fi
] delay=10s timeout=5s period=10s #success=1 #failure=6
Environment:
CONTAINER_IMAGE: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:1b1870ca13fc52d3c1a6c603e471e230a90cba94baaef9cf56c02b6c7dac35ca
POD_NAME: rook-ceph-mgr-a-75f4489fbc-4nlmz (v1:metadata.name)
POD_NAMESPACE: openshift-storage (v1:metadata.namespace)
NODE_NAME: (v1:spec.nodeName)
POD_MEMORY_LIMIT: 1610612736 (limits.memory)
POD_MEMORY_REQUEST: 1610612736 (requests.memory)
POD_CPU_LIMIT: 1 (limits.cpu)
POD_CPU_REQUEST: 1 (requests.cpu)
CEPH_USE_RANDOM_NONCE: true
ROOK_MSGR2: msgr2_true_encryption_false_compression_false
ROOK_CEPH_MON_HOST: <set to the key 'mon_host' in secret 'rook-ceph-config'> Optional: false
ROOK_CEPH_MON_INITIAL_MEMBERS: <set to the key 'mon_initial_members' in secret 'rook-ceph-config'> Optional: false
ROOK_OPERATOR_NAMESPACE: openshift-storage
ROOK_CEPH_CLUSTER_CRD_VERSION: v1
ROOK_CEPH_CLUSTER_CRD_NAME: ocs-storagecluster-cephcluster
CEPH_ARGS: --mon-host $(ROOK_CEPH_MON_HOST) --keyring /etc/ceph/keyring-store/keyring
ROOK_POD_IP: (v1:status.podIP)
Mounts:
/etc/ceph from rook-config-override (ro)
/etc/ceph/keyring-store/ from rook-ceph-mgr-a-keyring (ro)
/run/ceph from ceph-daemons-sock-dir (rw)
/var/lib/ceph/crash from rook-ceph-crash (rw)
/var/lib/ceph/mgr/ceph-a from ceph-daemon-data (rw)
/var/log/ceph from rook-ceph-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vjjzk (ro)
watch-active:
Container ID: cri-o://a8b26e14cf3c41aa6913bb040f02af19a9bb583c816675acbee8c0adf45f4049
Image: registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:ce2269b3f9c04c8f7b17f58e99d81b0f2e5fb8dd071e5d96b3db685a27547758
Image ID: 3ee82db33f3d26fd1d29d70ebbf812a83de53da2c2b97e3519ae6b1dda9fb4bb
Port: <none>
Host Port: <none>
Args:
ceph
mgr
watch-active
State: Running
Started: Fri, 12 Jan 2024 03:19:42 -0500
Ready: True
Restart Count: 0
Environment:
ROOK_CLUSTER_ID: edc44670-097b-47ca-b834-b59740ea6eb9
ROOK_CLUSTER_NAME: ocs-storagecluster-cephcluster
ROOK_PRIVATE_IP: (v1:status.podIP)
ROOK_PUBLIC_IP: (v1:status.podIP)
POD_NAMESPACE: openshift-storage
ROOK_MON_ENDPOINTS: <set to the key 'data' of config map 'rook-ceph-mon-endpoints'> Optional: false
ROOK_CEPH_USERNAME: <set to the key 'ceph-username' in secret 'rook-ceph-mon'> Optional: false
ROOK_CEPH_CONFIG_OVERRIDE: /etc/rook/config/override.conf
ROOK_DASHBOARD_ENABLED: false
ROOK_MONITORING_ENABLED: true
ROOK_UPDATE_INTERVAL: 15s
ROOK_DAEMON_NAME: a
ROOK_CEPH_VERSION: ceph version 17.2.6-167 quincy
Mounts:
/var/lib/rook-ceph-mon from ceph-admin-secret (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vjjzk (ro)
log-collector:
Container ID: cri-o://3cff290b6e61ff96488307bd859699b54994c2bdd0335f98647e0ffe387132bd
Image: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:1b1870ca13fc52d3c1a6c603e471e230a90cba94baaef9cf56c02b6c7dac35ca
Image ID: aaa7a66552761bae51c79e4ece5beaef01fcbb5977d2f90edc1f6e253f321dc6
Port: <none>
Host Port: <none>
Command:
/bin/bash
-x
-e
-m
-c
CEPH_CLIENT_ID=ceph-mgr.a
PERIODICITY=daily
LOG_ROTATE_CEPH_FILE=/etc/logrotate.d/ceph
LOG_MAX_SIZE=524M
ROTATE=7
# edit the logrotate file to only rotate a specific daemon log
# otherwise we will logrotate log files without reloading certain daemons
# this might happen when multiple daemons run on the same machine
sed -i "s|*.log|$CEPH_CLIENT_ID.log|" "$LOG_ROTATE_CEPH_FILE"
# replace default daily with given user input
sed --in-place "s/daily/$PERIODICITY/g" "$LOG_ROTATE_CEPH_FILE"
# replace rotate count, default 7 for all ceph daemons other than rbd-mirror
sed --in-place "s/rotate 7/rotate $ROTATE/g" "$LOG_ROTATE_CEPH_FILE"
if [ "$LOG_MAX_SIZE" != "0" ]; then
# adding maxsize $LOG_MAX_SIZE at the 4th line of the logrotate config file with 4 spaces to maintain indentation
sed --in-place "4i \ \ \ \ maxsize $LOG_MAX_SIZE" "$LOG_ROTATE_CEPH_FILE"
fi
while true; do
# we don't force the logrotate but we let the logrotate binary handle the rotation based on user's input for periodicity and size
logrotate --verbose "$LOG_ROTATE_CEPH_FILE"
sleep 15m
done
State: Running
Started: Fri, 12 Jan 2024 03:19:42 -0500
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/etc/ceph from rook-config-override (ro)
/run/ceph from ceph-daemons-sock-dir (rw)
/var/lib/ceph/crash from rook-ceph-crash (rw)
/var/log/ceph from rook-ceph-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vjjzk (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
rook-config-override:
Type: Projected (a volume that contains injected data from multiple sources)
ConfigMapName: rook-config-override
ConfigMapOptional: <nil>
rook-ceph-mgr-a-keyring:
Type: Secret (a volume populated by a Secret)
SecretName: rook-ceph-mgr-a-keyring
Optional: false
ceph-daemons-sock-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/rook/exporter
HostPathType: DirectoryOrCreate
rook-ceph-log:
Type: HostPath (bare host directory volume)
Path: /var/lib/rook/openshift-storage/log
HostPathType:
rook-ceph-crash:
Type: HostPath (bare host directory volume)
Path: /var/lib/rook/openshift-storage/crash
HostPathType:
ceph-daemon-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
ceph-admin-secret:
Type: Secret (a volume populated by a Secret)
SecretName: rook-ceph-mon
Optional: false
kube-api-access-vjjzk:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 5s
node.ocs.openshift.io/storage=true:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2d9h default-scheduler 0/6 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 2 Insufficient memory, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }.
Normal Scheduled 2d9h default-scheduler Successfully assigned openshift-storage/rook-ceph-mgr-a-75f4489fbc-4nlmz to worker-0
Normal AddedInterface 2d9h multus Add eth0 [10.128.2.215/23] from ovn-kubernetes
Normal Pulled 2d9h kubelet Container image "registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:1b1870ca13fc52d3c1a6c603e471e230a90cba94baaef9cf56c02b6c7dac35ca" already present on machine
Normal Created 2d9h kubelet Created container log-collector
Normal Started 2d9h kubelet Started container chown-container-data-dir
Normal Started 2d9h kubelet Started container log-collector
Normal Created 2d9h kubelet Created container chown-container-data-dir
Normal Pulled 2d9h kubelet Container image "registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:1b1870ca13fc52d3c1a6c603e471e230a90cba94baaef9cf56c02b6c7dac35ca" already present on machine
Normal Pulled 2d9h kubelet Container image "registry.redhat.io/odf4/rook-ceph-rhel9-operator@sha256:ce2269b3f9c04c8f7b17f58e99d81b0f2e5fb8dd071e5d96b3db685a27547758" already present on machine
Normal Created 2d9h kubelet Created container watch-active
Normal Started 2d9h kubelet Started container watch-active
Normal Started 2d9h (x3 over 2d9h) kubelet Started container mgr
Normal Created 2d9h (x4 over 2d9h) kubelet Created container mgr
Warning Unhealthy 2d7h kubelet Startup probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of 8074de53ec11674b6efe1eeda00e0f589223ebf9d72feda5dd4a5b4b3305cb85 is running failed: container process not found
Normal Pulled 3h59m (x600 over 2d9h) kubelet Container image "registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:1b1870ca13fc52d3c1a6c603e471e230a90cba94baaef9cf56c02b6c7dac35ca" already present on machine
Warning Unhealthy 49m (x15 over 2d9h) kubelet Startup probe failed:
Warning BackOff 4m38s (x16047 over 2d9h) kubelet Back-off restarting failed container mgr in pod rook-ceph-mgr-a-75f4489fbc-4nlmz_openshift-storage(b39ba50a-6928-4213-9ab3-d10868e5a139)
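
The mgr container's exit code 137 shown above can mean either an OOM kill or the kubelet killing the container after failed probes. A quick way to tell the two apart on later restarts (a sketch against this pod and node name; the jsonpath filter is standard oc/kubectl syntax):

oc -n openshift-storage get pod rook-ceph-mgr-a-75f4489fbc-4nlmz \
  -o jsonpath='{.status.containerStatuses[?(@.name=="mgr")].lastState.terminated.reason}{"\n"}'   # "OOMKilled" vs "Error"
oc describe node worker-0 | grep -A 8 "Allocated resources"   # memory requests/limits already committed on the node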
must-gather logs: https://drive.google.com/file/d/1NMCgHPwz3_W-tGOaG4DvuCZg1hUtpo0o/view?usp=sharing

Moving this to Rook for initial analysis.
At first glance, it looks like the cluster does not have enough resource to schedule a ceph-mgr pod.
Error: "0/6 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 2 Insufficient memory, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }"
Please try adding more resources to the cluster.
(In reply to umanga from comment #5)
> Moving this to Rook for initial analysis.
>
> At first glance, it looks like the cluster does not have enough resource to
> schedule a ceph-mgr pod.
> Error: "0/6 nodes are available: 1 node(s) didn't match pod anti-affinity
> rules, 2 Insufficient memory, 3 node(s) had untolerated taint
> {node-role.kubernetes.io/master: }"
>
> Please try adding more resources to the cluster.

The mgr pods are failing due to insufficient memory, hence the "no active mgr" error. In 4.15 we have resource profiles (Lean, Balanced, Performance) that allocate different CPU and memory depending on the selected profile, and it looks like the mgr resources in this cluster were created with the Balanced profile, that is, CPU: 1 and memory: 1.5Gi. A possible resolution is to change the resource profile to `Performance`.

A couple of questions for the ocs-operator team:
1. Is the `Balanced` profile the default profile?
2. The memory limit/request for the `mgr` daemon in the `Performance` profile is set to 2G, but in 4.14 the limit/request was 3G. Why the downgrade?

Moving it to OCS operator.

Hi Santosh,
1. Yes, if nothing is specified the values of the Balanced profile are used.
2. The value of 3Gi was too high according to Travis, and we now have 2 mgrs as of 4.15. We looked at Rook upstream values and Ceph recommendations to tune the numbers for the different profiles, so we found it appropriate to tone down the mgr resources.

After looking at the must-gather I would say this issue is indeed due to an insufficient amount of resources being available. I see NFS is enabled on this cluster, which is consuming a lot of resources, so the mgr-a pod cannot be scheduled. To get out of this situation, either add more CPU/memory to the worker nodes, or select the Lean profile, which reduces the amount of resources the main Ceph daemons (osd, mds, mon, and mgr) consume; keep in mind this might result in reduced performance. Prima facie it does not look like a bug in the code or a blocker. The BZ can be kept open for discussion, but moving it to 4.16 as dev freeze for 4.15 has arrived.

Even after changing the profile to lean, the mgr pods are in CrashLoopBackOff state.
[root@rdr-odf15-bastion-0 ~]# oc get pods
NAME READY STATUS RESTARTS AGE
csi-addons-controller-manager-6cdd5c677d-gvkn4 2/2 Running 0 10d
csi-cephfsplugin-699vv 2/2 Running 1 (14d ago) 14d
csi-cephfsplugin-glxq8 2/2 Running 0 14d
csi-cephfsplugin-hhgdh 2/2 Running 0 14d
csi-cephfsplugin-provisioner-7ff459f4bb-8hjkg 6/6 Running 0 14d
csi-cephfsplugin-provisioner-7ff459f4bb-tkxpx 6/6 Running 0 14d
csi-nfsplugin-9k84b 2/2 Running 0 14d
csi-nfsplugin-nbh49 2/2 Running 0 14d
csi-nfsplugin-provisioner-bb6658447-2fjtn 5/5 Running 0 14d
csi-nfsplugin-provisioner-bb6658447-j7jlg 5/5 Running 0 14d
csi-nfsplugin-wgmvg 2/2 Running 0 14d
csi-rbdplugin-hqskw 3/3 Running 1 (14d ago) 14d
csi-rbdplugin-provisioner-954997cc9-nkvfm 6/6 Running 1 (14d ago) 14d
csi-rbdplugin-provisioner-954997cc9-wsxdg 6/6 Running 0 14d
csi-rbdplugin-s9g8n 3/3 Running 1 (14d ago) 14d
csi-rbdplugin-tlck4 3/3 Running 0 14d
noobaa-core-0 1/1 Running 0 14d
noobaa-db-pg-0 1/1 Running 0 14d
noobaa-endpoint-687f58577c-nrk96 1/1 Running 0 13d
noobaa-operator-55b99765cb-t2nwk 2/2 Running 0 14d
ocs-metrics-exporter-5f4dfffd66-xx8fc 1/1 Running 1 (14d ago) 14d
ocs-operator-5d54997db6-zlrwk 1/1 Running 0 14d
odf-console-75f584d89f-bm8sr 1/1 Running 0 14d
odf-operator-controller-manager-644d59cdb7-972k4 2/2 Running 0 13d
rook-ceph-crashcollector-worker-0-7447cfc595-dsgl4 1/1 Running 0 8m5s
rook-ceph-crashcollector-worker-1-7946896c88-5c5qk 1/1 Running 0 14d
rook-ceph-crashcollector-worker-2-5b48694bbc-wbg74 1/1 Running 0 7m51s
rook-ceph-exporter-worker-0-548958bd97-5r8rm 1/1 Running 0 8m2s
rook-ceph-exporter-worker-1-66448cf466-jfmtz 1/1 Running 0 14d
rook-ceph-exporter-worker-2-7b4dfb965c-rxjrw 1/1 Running 0 7m48s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5b48cc48bm5sd 2/2 Running 0 8m5s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5fd4d6b845qvn 2/2 Running 0 7m45s
rook-ceph-mgr-a-587dc88f9-7p54p 2/3 CrashLoopBackOff 4 (91s ago) 6m29s
rook-ceph-mgr-b-64f5759d48-wchlw 2/3 CrashLoopBackOff 4 (53s ago) 5m26s
rook-ceph-mon-a-64ffc75bfc-9wn6w 2/2 Running 0 7m4s
rook-ceph-mon-b-8698599fb-59jkk 2/2 Running 0 8m5s
rook-ceph-mon-c-f49d8c656-lmpx5 2/2 Running 0 7m39s
rook-ceph-nfs-ocs-storagecluster-cephnfs-a-6cbc7dc485-gjzbb 2/2 Running 0 14d
rook-ceph-operator-55dbc47d88-lhlq6 1/1 Running 0 6m30s
rook-ceph-osd-0-54f866d6d9-2dslg 2/2 Running 0 2m6s
rook-ceph-osd-1-774d9bc67-px7pn 2/2 Running 0 4m45s
rook-ceph-osd-2-864b455f84-l8cvp 2/2 Running 0 3m18s
rook-ceph-osd-prepare-4eb1d1b99ad103ae56db9ccc002f991f-lvs5v 0/1 Completed 0 14d
rook-ceph-osd-prepare-7c3335f82580fa264d344511caad5380-6xhrh 0/1 Completed 0 14d
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-695447fcgxkc 2/2 Running 0 7m52s
rook-ceph-tools-7997d9b857-cdlrj 1/1 Running 0 14d
ux-backend-server-76fb4547d9-lkssp 2/2 Running 0 14d
[root@rdr-odf15-bastion-0 ~]#
[root@rdr-odf15-bastion-0 ~]# oc get cephcluster
NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE HEALTH EXTERNAL FSID
ocs-storagecluster-cephcluster /var/lib/rook 3 14d Ready Cluster created successfully HEALTH_WARN 90280706-e0e0-45b9-adaa-24d1265e430a
[root@rdr-odf15-bastion-0 ~]#
[root@rdr-odf15-bastion-0 ~]# oc rsh rook-ceph-tools-7997d9b857-cdlrj
sh-5.1$
sh-5.1$ ceph -s
cluster:
id: 90280706-e0e0-45b9-adaa-24d1265e430a
health: HEALTH_WARN
no active mgr
services:
mon: 3 daemons, quorum a,b,c (age 8m)
mgr: no daemons active (since 51s)
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 3m), 3 in (since 13d)
rgw: 2 daemons active (2 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 13 pools, 217 pgs
objects: 28.35k objects, 102 GiB
usage: 267 GiB used, 1.2 TiB / 1.5 TiB avail
pgs: 217 active+clean
sh-5.1$
[root@rdr-odf15-bastion-0 ~]# oc get storagecluster -o yaml |grep resourceProfile
resourceProfile: lean
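
For reference, a minimal sketch of how the profile switch described above can be applied, assuming the field is spec.resourceProfile on the ocs-storagecluster StorageCluster (as the grep output above suggests) and that the accepted values are lean, balanced, and performance:

oc -n openshift-storage patch storagecluster ocs-storagecluster \
  --type merge -p '{"spec":{"resourceProfile":"lean"}}'
# afterwards, confirm the new requests/limits rolled out to the mgr deployment
oc -n openshift-storage get deploy rook-ceph-mgr-a \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="mgr")].resources}{"\n"}'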
I did take a look at the setup. This does not seem like a problem caused by lack of resources. I tried changing between resource profiles, and I even asked Aaruni to increase the memory available on each node. Every time, the mgr pods would start running but would suddenly be OOMKilled and then go into CrashLoopBackOff. Another fact is that no other pods fail due to a lack of resources; only the mgr always fails. There is something else wrong with the mgrs. One prominent error message I could see in the mgr logs is this:

debug 2024-01-30T13:33:35.846+0000 7fff6e5ab940 -1 client.0 error registering admin socket command: (17) File exists

Moving it back to the Rook team for analysis of the mgr failure.

This is happening on IBM Z as well.

If the mgr is not starting and is getting OOM killed, this is a blocker for 4.15. When lowering the memory limits for the mgr there was some risk of this issue, and we don't have a good way to validate how many resources are really needed. The mgr is a very burstable daemon, with operations that may suddenly need more memory and then lower again. With two mgrs, the bursting is even more pronounced because the standby mgr will be quite idle while the active mgr gets all the load. My recommendation for the mgr is to double the memory allowed for the limits compared to the requests:
1. This will allow the active mgr to burst when needed, but not waste resources across the cluster for the standby mgr, which we know doesn't require all that memory.
2. This will give the active mgr the same memory limits that we had in previous releases, so we won't risk a regression from an OOM kill.

Lean:
- Requests: 1Gi
- Limits: 2Gi
Balanced:
- Requests: 1.5Gi
- Limits: 3Gi
Performance:
- Requests: 2Gi
- Limits: 4Gi

The QoS class assigned to the mgr pod is already "Burstable", as seen in the mgr spec of comment 3, so we will not be lowering the QoS class of the mgr by making this change. Anyway, if the mgr is evicted because of limited resources on the node, the mgr can move to another node as long as one is available.

Malay/Travis, can we please check if this is the same as https://bugzilla.redhat.com/show_bug.cgi?id=2244873

That looks different since it was seen in 4.14. This issue with the mgr memory limits would only apply to 4.15. A comment has been added to the other BZ about a possible related fix.

Aaruni, can you please try to reproduce this on another setup? I was doing some experimentation with the existing setup we had: I increased the mgr resources a bit and the problem was resolved, but when I lowered the limits again I was not able to reproduce the problem at all. Reproducing the same issue on another cluster would help a lot.

Moving it to ASSIGNED state as it FAILED_QA

Moving it to ASSIGNED state as it FAILED_QA

By mistake I moved the wrong BZ to Assigned, hence moving it back to POST. Apologies for the confusion.

*** Bug 2264051 has been marked as a duplicate of this bug. ***

Narayanaswamy G from the IBM Power team is working on verification of this BZ and will update the progress/status soon. Keeping the bug in my name for now. Thanks.

From NarayanSwamy - We are not able to verify it because the latest build is not working as expected: the storagecluster is stuck in the Progressing state. We tried with the older build 150 as well, which sees the same issue. https://bugzilla.redhat.com/show_bug.cgi?id=2262067

Moving to Verified based on comment#27. Thanks, NarayanSwamy.
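
As a stop-gap before a fixed build, the memory recommendation above (limits roughly double the requests, e.g. 1.5Gi requests / 3Gi limits for Balanced) could in principle be tried as a per-daemon override. A hedged sketch, assuming the StorageCluster spec still exposes a resources map keyed by daemon name (spec.resources.mgr) as in earlier ODF releases, and noting that the operator may reconcile such overrides:

oc -n openshift-storage patch storagecluster ocs-storagecluster --type merge \
  -p '{"spec":{"resources":{"mgr":{"requests":{"cpu":"1","memory":"1536Mi"},"limits":{"cpu":"1","memory":"3Gi"}}}}}'
# illustrative only; the supported path is the fixed build (4.15.0-149), which carries the revised defaults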
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383