Bug 1896810
| Field | Value |
|---|---|
| Summary | [Tracker for BZ #1967164] Silence crash warning in osd removal job. |
| Product | [Red Hat Storage] Red Hat OpenShift Data Foundation |
| Component | ceph |
| ceph sub component | Ceph-MGR |
| Reporter | Itzhak <ikave> |
| Assignee | Neha Ojha <nojha> |
| QA Contact | Elad <ebenahar> |
| Status | CLOSED NOTABUG |
| Severity | high |
| Priority | high |
| CC | amagrawa, bniver, brgardne, ebenahar, edonnell, muagarwa, nberry, nojha, odf-bz-bot, oviner, owasserm, pdhange, pdhiran, rzarzyns, sdudhgao, shan, tnielsen |
| Version | 4.6 |
| Keywords | AutomationBackLog |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | 4.7.0-272.ci |
| Doc Type | Known Issue |
| Story Points | --- |
| Clones | 1967164 (view as bug list) |
| Bug Blocks | 1882359, 1967164 |
| Last Closed | 2023-11-29 23:01:49 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |

Doc Text:

.Ceph status is `HEALTH_WARN` after disk replacement

After disk replacement, the warning `1 daemons have recently crashed` is shown even though all OSD pods are up and running, which leaves the Ceph status at `HEALTH_WARN` instead of `HEALTH_OK`. To work around this issue, `rsh` into the `ceph-tools` pod and silence the warning; Ceph health then returns to `HEALTH_OK`.
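For reference, the documented workaround boils down to archiving the crash record from the toolbox pod. A minimal sketch, assuming the usual `rook-ceph-tools` pod label (pod names and labels may differ per deployment):

$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1)
$ oc rsh -n openshift-storage $TOOLS_POD
# Inside the toolbox pod: list the recorded crashes, archive them, and re-check health.
sh-4.4$ ceph crash ls
sh-4.4$ ceph crash archive-all
sh-4.4$ ceph health
HEALTH_OK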
Description (Itzhak, 2020-11-11 15:27:07 UTC)
What do you see in the logs? Where's the OSD crash dump? If we wish Ceph engineering to look at it, let's provide them with the real details here.

The Ceph health warning occurs after deleting the backing volume from the platform side. After reattaching a new volume and performing all the relevant steps, all three OSDs are up and running, but we still have the warning about the old OSD crash. Here are the test logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/nberry-n5-cp/nberry-n5-cp_20201105T121923/logs/failed_testcase_ocs_logs_1604930171/test_recovery_from_volume_deletion_ocs_logs/

Needinfo answered in comment #3.

1. Any idea how we can propagate this issue to the user? As is, it requires a support case. Is there an alert?
2. This is on a non-LSO VMware, so less likely to be a real HW issue?

(In reply to Yaniv Kaul from comment #6)
> 1. Any idea how we can propagate this issue to the user? As is, it requires a support case. Is there an alert?

https://bugzilla.redhat.com/show_bug.cgi?id=1682967 issues a health warning when there are too many repairs done by an OSD. The aim is to help identify and warn about things like a bad disk, controller, etc.

> 2. This is on a non-LSO VMware, so less likely to be a real HW issue?

The error message is well explained in https://bugzilla.redhat.com/show_bug.cgi?id=1856430#c7. Is there a way to confirm that there aren't any issues in the underlying layer?

(In reply to Neha Ojha from comment #7)
> Is there a way to confirm that there aren't any issues in the underlying layer?

NEEDINFO on reporter.

Based on a discussion in the "OCS leads meeting", this seems to be the right way for the product to behave. If the missing part is to add more timeout to our tests, let's do that and this can be closed as NOT A BUG.

Moving to 4.7 to get more information from Itzhak.

Following the discussion we had, I realized that there is a good chance that the osd-removal job should take care of removing the OSD and making sure Ceph is in health OK. Servesha, is this assumption correct? For now, moving the bug to Rook and let's keep it open.

Elad, agreed, the osd-removal job should take care of acknowledging the crash and silencing it once solved. Rohan, please get familiar with https://docs.ceph.com/en/latest/rados/operations/health-checks/#recent-crash and implement the logic in the removal job from Rook. Thanks.

@Elad sorry for the late reply, I was on PTO. Your assumption is right: a job should take care of Ceph's health. The query is also addressed in comment #11. Hence clearing the needinfo.

@Neha For now it sounds fair to add it as a KNOWN issue, IMO. As a resolution, we can advise customers to contact support, assuming some of them might want to apply the workaround.

Are you sure customers know what the ceph tools pod is and how to rsh to it? The workaround is fine for support, not for end users.

@Yaniv Right.
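For context on the `recent-crash` health check linked above: RECENT_CRASH is raised for new, unarchived crashes from the recent past (the window is the mgr option `mgr/crash/warn_recent_interval`, two weeks by default), and it clears once the crash is archived, not deleted. A rough sketch of the manual steps the removal job would automate (the crash ID is a placeholder):

sh-4.4$ ceph crash ls-new                # crashes that still raise RECENT_CRASH
sh-4.4$ ceph crash archive <crash-id>    # acknowledge a single crash (placeholder ID)
sh-4.4$ ceph crash archive-all           # or acknowledge everything at once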
As per the discussion, we will mention it as a known issue in the docs until the osd-removal job is able to handle this scenario. Customers who want to get Ceph health back to HEALTH_OK (by silencing the OSD crash warning) will have to contact customer support in that case.

(In reply to Servesha from comment #17)
> As per the discussion, we will mention it as a known issue in the docs until the osd-removal job is able to handle this scenario.

Exactly why it should not be in the docs. It's an edge case.

Pulkit, please change the BZ title to reflect the actual fix.

Pulkit, please backport https://github.com/rook/rook/pull/7001 to https://github.com/openshift/rook/, use `cherry-pick -x`. Thanks.

I tried to test the BZ again. I deleted a disk and created a new one. I followed the procedure here https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_devices/index#replacing-operational-or-failed-storage-devices-on-clusters-backed-by-local-storage-devices_rhocs except for the part of the ocs-osd-removal job name, which changed to "ocs-osd-removal-job". The device replacement process finished successfully, but I still have the OSD crash warning at the end of the process.

Here are the logs of the "ocs-osd-removal-job":

2021-03-03 14:40:33.724042 I | rookcmd: starting Rook 4.7-103.a0622de60.release_4.7 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=1'
2021-03-03 14:40:33.724161 I | rookcmd: flag values: --help=false, --log-flush-frequency=5s, --log-level=DEBUG, --operator-image=, --osd-ids=1, --service-account=
2021-03-03 14:40:33.724171 I | op-mon: parsing mon endpoints: a=172.30.18.55:6789,c=172.30.250.240:6789,d=172.30.153.229:6789
2021-03-03 14:40:33.735707 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2021-03-03 14:40:33.735997 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2021-03-03 14:40:33.736105 D | cephosd: config file @ /etc/ceph/ceph.conf:
[global]
fsid = ab6c2054-6b8f-4d27-822a-1036cec016f7
mon initial members = a c d
mon host = [v2:172.30.18.55:3300,v1:172.30.18.55:6789],[v2:172.30.250.240:3300,v1:172.30.250.240:6789],[v2:172.30.153.229:3300,v1:172.30.153.229:6789]
mon_osd_full_ratio = .85
mon_osd_backfillfull_ratio = .8
mon_osd_nearfull_ratio = .75
mon_max_pg_per_osd = 600
[osd]
osd_memory_target_cgroup_limit_ratio = 0.5
[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring
2021-03-03 14:40:33.736303 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/271193002
2021-03-03 14:40:34.073687 I | cephosd: validating status of osd.1
2021-03-03 14:40:34.073712 I | cephosd: osd.1 is marked 'DOWN'. Removing it
2021-03-03 14:40:34.073810 D | exec: Running command: ceph osd find 1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/101348609
2021-03-03 14:40:34.380136 D | exec: Running command: ceph osd out osd.1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/581847660
2021-03-03 14:40:35.146811 D | exec: marked out osd.1.
2021-03-03 14:40:35.156777 I | cephosd: removing the OSD deployment "rook-ceph-osd-1"
2021-03-03 14:40:35.156806 D | op-k8sutil: removing rook-ceph-osd-1 deployment if it exists
2021-03-03 14:40:35.156810 I | op-k8sutil: removing deployment rook-ceph-osd-1 if it exists
2021-03-03 14:40:35.166034 I | op-k8sutil: Removed deployment rook-ceph-osd-1
2021-03-03 14:40:35.170123 I | op-k8sutil: "rook-ceph-osd-1" still found. waiting...
2021-03-03 14:40:37.176310 I | op-k8sutil: confirmed rook-ceph-osd-1 does not exist
2021-03-03 14:40:37.185731 I | cephosd: removing the osd prepare job "rook-ceph-osd-prepare-ocs-deviceset-2-data-0kvd87"
2021-03-03 14:40:37.193762 I | cephosd: removing the OSD PVC "ocs-deviceset-2-data-0kvd87"
2021-03-03 14:40:37.199140 D | exec: Running command: ceph osd purge osd.1 --force --yes-i-really-mean-it --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/044786907
2021-03-03 14:40:37.536172 D | exec: purged osd.1
2021-03-03 14:40:37.536419 D | exec: Running command: ceph osd crush rm compute-2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/401456254
2021-03-03 14:40:38.545793 D | exec: removed item id -3 name 'compute-2' from crush map
2021-03-03 14:40:38.546037 D | exec: Running command: ceph crash ls --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/706879941
2021-03-03 14:40:38.878203 I | cephosd: no ceph crash to silence
2021-03-03 14:40:38.878230 I | cephosd: completed removal of OSD 1

Additional info: I tested it with a vSphere LSO cluster.

Versions:

OCP version:
Client Version: 4.6.0-0.nightly-2021-01-12-112514
Server Version: 4.7.0-0.nightly-2021-03-01-085007
Kubernetes Version: v1.20.0+5fbfd19

OCS version:
ocs-operator.v4.7.0-278.ci   OpenShift Container Storage   4.7.0-278.ci   Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-03-01-085007   True        False         26h     Cluster version is 4.7.0-0.nightly-2021-03-01-085007

Rook version:
rook: 4.7-103.a0622de60.release_4.7
go: go1.15.5

Ceph version:
ceph version 14.2.11-123.el8cp (f02fa4f00c2417b1bc86e6ec7711756454e70716) nautilus (stable)

Pulkit, you might need to wait for the crash to be created. The OSD crash report usually appears 5-15 minutes after the volume deletion. We need to consider it.

5 to 15 minutes seems like a huge difference; even 5 minutes looks too long. I'd expect a few seconds.
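Since the crash report can show up only several minutes after the disk is pulled, one way test automation could avoid the "no ceph crash to silence" outcome is to wait for a crash entry before creating the removal job. A minimal sketch; the `rook-ceph-tools` label, the 30-second poll interval, and the ~20-minute budget are assumptions, not part of any documented procedure:

TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1)
for i in $(seq 1 40); do
  # ceph crash ls-new lists only crashes that have not been archived yet.
  crashes=$(oc rsh -n openshift-storage "$TOOLS_POD" ceph crash ls-new --format json)
  if [ "$crashes" != "[]" ]; then
    echo "crash recorded, proceeding with the ocs-osd-removal job"
    break
  fi
  sleep 30
done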
Yes, maybe it's also something that needs to be fixed. Otherwise, there is no point in running the "ocs-osd-removal" job for 5 minutes (or more), or in telling the user that the OSD crash can appear after 5 minutes (or more).

Mudit, why is this considered a blocker?

Raz has marked it as a blocker, but I guess we can re-evaluate. Raz?

Moving to 4.8 since it's going to require a change to the approach to really change the behavior. Comments coming on further details of the discussion...

Seb, Travis, and I had a discussion about this bug in our triage today. Notes below.

When an OSD is purged from the Ceph cluster, we should *not* remove the crashes from the crash log, because users may still wish to keep the information for data evaluation (for example, a postmortem). What we *should* do is clear errors for a given OSD when that OSD is purged, so that the Ceph cluster can get back to a healthy state. If Ceph performs this work, then cephadm will also benefit.

There could be a race condition where an OSD is removed just after an OSD crashes. The OSD crash that happened before removal is still a valid crash that some users may wish to keep a record of. That crash should still be reported to Ceph. However, we think it should be a Ceph feature to clear errors for incoming crashes reported for OSDs that have been purged from the Ceph cluster. Ceph should still accept the incoming crash dump and log it (for postmortems) but not report an error based on the crash, since it is for an OSD that no longer exists. This also means the same "fix" will apply to cephadm clusters as well as Rook.

Future work on this bug intended for OCS 4.8 will have to involve some collaboration between Ceph and Rook (and possibly cephadm) to make sure we are not removing evidence of errors while still allowing Ceph clusters to report healthy when OSDs are replaced.

Sorry for the late response. Pulkit, I didn't run any workloads.

@pkundra Any update on this one?

Any update on this one?

Since this is dependent on Ceph, it is not possible to include this in 4.8. https://bugzilla.redhat.com/show_bug.cgi?id=1967164 is targeted for 5.1.

This issue was reproduced on ODF 4.9.
Setup:
OCP Version: 4.9.0-0.nightly-2021-11-26-225521
ODF Version: 4.9.0-249.ci
LSO Version: local-storage-operator.4.9.0-202111151318
Ceph Version:
sh-4.4$ ceph versions
{
"mon": {
"ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 3
},
"mgr": {
"ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 1
},
"osd": {
"ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 3
},
"mds": {
"ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 2
},
"rgw": {
"ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 1
},
"overall": {
"ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 10
}
}
Test Procedure:
1.Check ceph status:
sh-4.4$ ceph status
cluster:
id: e6ae853a-3595-4738-a15e-6cb4a470fc3b
health: HEALTH_OK
2.Identify the OSD that needs to be replaced [OSD-0 COMPUTE-2]
$ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
rook-ceph-osd-0-85d6df8dbc-7m4v9 2/2 Running 0 5m3s 10.131.0.76 compute-2 <none> <none>
rook-ceph-osd-1-5c6465f8d-hrnrp 2/2 Running 0 15m 10.129.2.28 compute-1 <none> <none>
rook-ceph-osd-2-868859c6c8-ck2wq 2/2 Running 0 15m 10.128.2.20 compute-0 <none> <none>
3.Delete the disk from compute-2 via vCenter:
OSD-0 moves to CLBO (CrashLoopBackOff).
4.Scale down the OSD deployment for the OSD to be replaced:
$ osd_id_to_remove=0
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
deployment.apps/rook-ceph-osd-0 scaled
5.Verify that the rook-ceph-osd pod is terminated.
$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
NAME READY STATUS RESTARTS AGE
rook-ceph-osd-0-85d6df8dbc-7m4v9 0/2 Terminating 4 9m19s
$ oc delete -n openshift-storage pod rook-ceph-osd-0-85d6df8dbc-7m4v9 --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-0-85d6df8dbc-7m4v9" force deleted
$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
No resources found in openshift-storage namespace.
6.Remove the old OSD from the cluster so that a new OSD can be added.
$ oc delete -n openshift-storage job ocs-osd-removal-job
Error from server (NotFound): jobs.batch "ocs-osd-removal-job" not found
7.Change to the openshift-storage project.
$ oc project openshift-storage
Already on project "openshift-storage" on server "https://api.oviner5-lso28.qe.rh-ocs.com:6443".
8.Remove the old OSD from the cluster.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created
9.Verify that the OSD is removed successfully
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME READY STATUS RESTARTS AGE
ocs-osd-removal-job--1-blwm4 0/1 Completed 0 16s
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
2021-11-28 12:37:35.309851 I | op-flags: failed to set flag "logtostderr". no such flag -logtostderr
2021-11-28 12:37:35.310175 I | rookcmd: starting Rook 4.9-215.c3f67c6.release_4.9 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=0'
2021-11-28 12:37:35.310184 I | rookcmd: flag values: --help=false, --log-level=DEBUG, --operator-image=, --osd-ids=0, --preserve-pvc=false, --service-account=
2021-11-28 12:37:35.310192 I | op-mon: parsing mon endpoints: c=172.30.100.150:6789,a=172.30.150.178:6789,b=172.30.231.249:6789
2021-11-28 12:37:35.325316 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2021-11-28 12:37:35.325522 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2021-11-28 12:37:35.325837 D | cephclient: config file @ /etc/ceph/ceph.conf: [global]
fsid = e6ae853a-3595-4738-a15e-6cb4a470fc3b
mon initial members = c a b
mon host = [v2:172.30.100.150:3300,v1:172.30.100.150:6789],[v2:172.30.150.178:3300,v1:172.30.150.178:6789],[v2:172.30.231.249:3300,v1:172.30.231.249:6789]
bdev_flock_retry = 20
mon_osd_full_ratio = .85
mon_osd_backfillfull_ratio = .8
mon_osd_nearfull_ratio = .75
mon_max_pg_per_osd = 600
mon_pg_warn_max_object_skew = 0
mon_data_avail_warn = 15
[osd]
osd_memory_target_cgroup_limit_ratio = 0.5
[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring
2021-11-28 12:37:35.325902 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:35.689019 I | cephosd: validating status of osd.0
2021-11-28 12:37:35.689049 I | cephosd: osd.0 is marked 'DOWN'. Removing it
2021-11-28 12:37:35.689069 D | exec: Running command: ceph osd find 0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:36.008324 D | exec: Running command: ceph osd out osd.0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:37.294074 I | cephosd: removing the OSD deployment "rook-ceph-osd-0"
2021-11-28 12:37:37.294103 D | op-k8sutil: removing rook-ceph-osd-0 deployment if it exists
2021-11-28 12:37:37.294108 I | op-k8sutil: removing deployment rook-ceph-osd-0 if it exists
2021-11-28 12:37:37.307019 I | op-k8sutil: Removed deployment rook-ceph-osd-0
2021-11-28 12:37:37.311880 I | op-k8sutil: "rook-ceph-osd-0" still found. waiting...
2021-11-28 12:37:39.322341 I | op-k8sutil: confirmed rook-ceph-osd-0 does not exist
2021-11-28 12:37:39.330629 I | cephosd: removing the osd prepare job "rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-2pp2x2"
2021-11-28 12:37:39.340824 I | cephosd: removing the OSD PVC "ocs-deviceset-localblock-0-data-2pp2x2"
2021-11-28 12:37:39.352494 D | exec: Running command: ceph osd purge osd.0 --force --yes-i-really-mean-it --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:39.717103 D | exec: Running command: ceph osd crush rm compute-2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:40.735126 D | exec: Running command: ceph crash ls --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:41.065961 I | cephosd: no ceph crash to silence
2021-11-28 12:37:41.066007 I | cephosd: completed removal of OSD 0
10.Delete ocs-osd-removal-job
$ oc delete -n openshift-storage job ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted
11.Find the persistent volume (PV) that needs to be deleted using the following command:
$ oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
local-pv-48294c53 100Gi RWO Delete Released openshift-storage/ocs-deviceset-localblock-0-data-2pp2x2 localblock 14m compute-2
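Per the replacement procedure linked earlier, the released PV found above is then deleted before the new device is added; a minimal sketch using the PV name from this run (this is the documented manual step, not something the removal job does):

$ oc delete pv local-pv-48294c53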
12.Physically add a new device to the node via vcenter
13.Verify that there is a new OSD running.
$ oc get -n openshift-storage pods -l app=rook-ceph-osd
NAME READY STATUS RESTARTS AGE
rook-ceph-osd-0-6ccd74f8d6-xgvsk 2/2 Running 0 2m5s
rook-ceph-osd-1-5c6465f8d-hrnrp 2/2 Running 0 30m
rook-ceph-osd-2-868859c6c8-ck2wq 2/2 Running 0 30m
14.Check Ceph status:
sh-4.4$ ceph status
cluster:
id: e6ae853a-3595-4738-a15e-6cb4a470fc3b
health: HEALTH_WARN
1 daemons have recently crashed
services:
mon: 3 daemons, quorum a,b,c (age 35m)
mgr: a(active, since 34m)
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 6m), 3 in (since 6m)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 11 pools, 177 pgs
objects: 460 objects, 130 MiB
usage: 370 MiB used, 300 GiB / 300 GiB avail
pgs: 177 active+clean
io:
client: 2.6 KiB/s rd, 10 KiB/s wr, 3 op/s rd, 2 op/s wr
sh-4.4$ ceph crash ls
ID ENTITY NEW
2021-11-28T12:32:42.249604Z_9d7761a5-b6a6-4502-9eac-8944a42bb48f osd.0 *
sh-4.4$ ceph crash info 2021-11-28T12:32:42.249604Z_9d7761a5-b6a6-4502-9eac-8944a42bb48f
{
"assert_condition": "abort",
"assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/blk/kernel/KernelDevice.cc",
"assert_func": "void KernelDevice::_aio_thread()",
"assert_line": 600,
"assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/blk/kernel/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f38349c3700 time 2021-11-28T12:32:42.246661+0000\n/builddir/build/BUILD/ceph-16.2.0/src/blk/kernel/KernelDevice.cc: 600: ceph_abort_msg(\"Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!\")\n",
"assert_thread_name": "bstore_aio",
"backtrace": [
"/lib64/libpthread.so.0(+0x12c20) [0x7f3841a4cc20]",
"gsignal()",
"abort()",
"(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x55d25730da8b]",
"(KernelDevice::_aio_thread()+0x1254) [0x55d257e507e4]",
"(KernelDevice::AioCompletionThread::entry()+0x11) [0x55d257e5bae1]",
"/lib64/libpthread.so.0(+0x817a) [0x7f3841a4217a]",
"clone()"
],
"ceph_version": "16.2.0-146.el8cp",
"crash_id": "2021-11-28T12:32:42.249604Z_9d7761a5-b6a6-4502-9eac-8944a42bb48f",
"entity_name": "osd.0",
"io_error": true,
"io_error_code": -5,
"io_error_devname": "sdb",
"io_error_length": 4096,
"io_error_offset": 21028864,
"io_error_optype": 8,
"io_error_path": "/var/lib/ceph/osd/ceph-0/block",
"os_id": "rhel",
"os_name": "Red Hat Enterprise Linux",
"os_version": "8.5 (Ootpa)",
"os_version_id": "8.5",
"process_name": "ceph-osd",
"stack_sig": "b8dcbaf37e069edf8c664d423b4d383080e2b0044c722f73720098c980e72912",
"timestamp": "2021-11-28T12:32:42.249604Z",
"utsname_hostname": "rook-ceph-osd-0-85d6df8dbc-7m4v9",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-305.28.1.el8_4.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Mon Nov 8 07:45:47 EST 2021"
}
15.Archive the crash list to silence the warning (the crash record is kept, not deleted):
sh-4.4$ ceph crash archive-all
16.Check Ceph status
sh-4.4$ ceph health
HEALTH_OK
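As an alternative to `ceph crash archive-all`, a single crash can be acknowledged by ID (here, the ID from step 14), which leaves any other crash records still raising the warning; in this run there was only the one crash, so health returns to OK either way:

sh-4.4$ ceph crash archive 2021-11-28T12:32:42.249604Z_9d7761a5-b6a6-4502-9eac-8944a42bb48f
sh-4.4$ ceph health
HEALTH_OK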
https://bugzilla.redhat.com/show_bug.cgi?id=1967164 is targeted for RHCS 5.2.

Neha, is there any update on this one? Our device failure tests still contain a workaround that tends to break from time to time. It means we either fail those tests or miss coverage for some important scenarios.