Created attachment 1728371 [details]
Ceph health detail

Description of problem (please be detailed as possible and provide log snippets):
When running the test 'test_recovery_from_volume_deletion', all the OSDs are up and running at the end of the test, but we are still left with the warning "1 daemons have recently crashed".

Version of all relevant components (if applicable):
vSphere, OCP 4.6, OCS 4.6, non-LSO.

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, but only in the case of volume deletion.

Is there any workaround available to the best of your knowledge?
Yes. We can manually rsh into the ceph-tools pod and silence the warning, after which Ceph health returns to HEALTH_OK.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Can this issue reproducible?
Yes

Can this issue reproduce from the UI?
No

If this is a regression, please provide more details to justify this:
In OCP 4.5, Ceph health was HEALTH_OK at the end of the test.

Steps to Reproduce:
1. Run the PR validation job on https://github.com/red-hat-storage/ocs-ci/pull/3259/ with the configuration: vSphere, OCP 4.6, OCS 4.6, non-LSO.

Actual results:
At the end of the test, Ceph health is HEALTH_WARN, with the warning "1 daemons have recently crashed".

Expected results:
At the end of the test, Ceph health should be HEALTH_OK.

Additional info:
This is the validation job I ran: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/14459/

Here is a snippet of the console output (https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/14459/consoleFull):

13:57:15 - MainThread - tests.manage.z_cluster.nodes.test_disk_failures - INFO - Deleting [vsanDatastore] 66242d5f-cafa-91c3-8164-e4434bd7df48/nberry-n5-cp-knlhx-dynamic-pvc-af9ee095-9a12-4713-a9db-0441819071b1.vmdk from the platform side
13:57:53 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get Node compute-0 -o yaml
13:58:03 - MainThread - ocs_ci.utility.vsphere - INFO - Detaching Disk with identifier: [vsanDatastore] 66242d5f-cafa-91c3-8164-e4434bd7df48/nberry-n5-cp-knlhx-dynamic-pvc-af9ee095-9a12-4713-a9db-0441819071b1.vmdk from compute-0 and remove from datastore=True
13:58:18 - MainThread - tests.manage.z_cluster.nodes.test_disk_failures - INFO - Scaling down OSD deployment rook-ceph-osd-0 to 0
13:58:18 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig scale --replicas=0 deployment/rook-ceph-osd-0
13:58:18 - MainThread - tests.manage.z_cluster.nodes.test_disk_failures - INFO - Waiting for OSD pod rook-ceph-osd-0-6b6df4b7b5-czp8k to get deleted

When I rsh into the ceph-tools pod and run "ceph health detail", I see the old OSD crash. The crash occurred at 13:58:08 - 5 seconds after the command:
"Detaching Disk with identifier: [vsanDatastore] 66242d5f-cafa-91c3-8164-e4434bd7df48/nberry-n5-cp-knlhx-dynamic-pvc-af9ee095-9a12-4713-a9db-0441819071b1.vmdk from compute-0 and remove from datastore=True"
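For reference, the manual workaround mentioned above looks roughly like this (a sketch; the tools-pod label and namespace are the OCS defaults and may differ, and <crash-id> is a placeholder for the ID reported by "ceph crash ls"):

# Open a shell in the rook-ceph-tools pod (OCS default label/namespace assumed)
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
oc -n openshift-storage rsh "$TOOLS_POD"

# Inside the tools pod: list the recorded crashes, then archive the stale OSD crash
ceph crash ls
ceph crash archive <crash-id>   # or: ceph crash archive-all
ceph health                     # should report HEALTH_OK again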
What do you see in the logs? Where's the OSD crash dump? If we wish Ceph engineering to look at it, let's provide them with the real details here.
The Ceph health warning occurs after deleting the backing volume from the platform side. After reattaching a new volume and performing all the relevant steps, all 3 OSDs are up and running, but we still have the warning about the old OSD crash. Here are the test logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/nberry-n5-cp/nberry-n5-cp_20201105T121923/logs/failed_testcase_ocs_logs_1604930171/test_recovery_from_volume_deletion_ocs_logs/
Needinfo answered in comment #3
1. Any idea how we can propagate this issue to the user? As is, it requires a support case. Is there an alert?

2. This is on a non-LSO VMware, so less likely to be a real HW issue?
(In reply to Yaniv Kaul from comment #6)
> 1. Any idea how we can propagate this issue to the user? As is, it requires
> a support case.
> Is there an alert?

https://bugzilla.redhat.com/show_bug.cgi?id=1682967 issues a health warning when there are too many repairs done by an OSD. The aim is to help identify and warn about things like bad disk, controller, etc.

> 2. This is on a non-LSO VMware, so less likely to be a real HW issue?

The error message is well explained in https://bugzilla.redhat.com/show_bug.cgi?id=1856430#c7. Is there a way to confirm that there aren't any issues in the underlying layer?
(In reply to Neha Ojha from comment #7)
> (In reply to Yaniv Kaul from comment #6)
> > 1. Any idea how we can propagate this issue to the user? As is, it requires
> > a support case.
> > Is there an alert?
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1682967 issues a health warning
> when there are too many repairs done by an OSD. The aim is to help identify
> and warn about things like bad disk, controller, etc.
>
> > 2. This is on a non-LSO VMware, so less likely to be a real HW issue?
>
> The error message is well explained in
> https://bugzilla.redhat.com/show_bug.cgi?id=1856430#c7. Is there a way to
> confirm that there aren't any issues in the underlying layer?

NEEDINFO on reporter.
Based on a discussion in the "OCS leads meeting", this seems to be the right way for the product to behave. If the missing part is only a longer timeout in our tests, let's add that, and this can be closed as NOT A BUG. Moving to 4.7 to get more information from Itzhak.
Following the discussion we had, I realized there is a good chance that the osd-removal-job should take care of removing the OSD and making sure Ceph is in HEALTH_OK. Servesha, is this assumption correct? For now, I am moving the bug to Rook; let's keep it open.
Elad, agreed the osd-removal-job should take care of acknowledging the crash and silencing it once solved. Rohan, please get familiar with https://docs.ceph.com/en/latest/rados/operations/health-checks/#recent-crash and implement the logic in the removal job from Rook. Thanks.
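For context, the RECENT_CRASH health check described in that doc is cleared by acknowledging (archiving) the crash, not by deleting it. A sketch of the manual equivalent of what the removal job needs to automate (the <crash-id> is a placeholder):

ceph health detail            # shows RECENT_CRASH / "daemons have recently crashed"
ceph crash ls-new             # lists crashes that have not been acknowledged yet
ceph crash archive <crash-id> # acknowledge one crash
ceph crash archive-all        # or acknowledge everything at once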
@Elad sorry for the late reply; I was on PTO. Your assumption is right - a job should take care of Ceph's health. The query is also addressed in comment #11. Hence, clearing the needinfo...
@Neha For now it sounds fair to add it as a KNOWN issue, IMO. As a resolution, we can advise customers to contact support, assuming some of them might want to apply the workaround for it.
Are you sure customers know what the ceph tools pod is and how to rsh to it? The workaround is fine for support, not for end users.
@Yaniv Right. As per the discussion, we will mention it as a known issue in the docs, till the osd-removal job is able to handle this scenario. The customers who are willing to get ceph health to HEALTH_OK (by silencing the OSD crash warning) will have to contact customer support in that case.
(In reply to Servesha from comment #17)
> @Yaniv Right. As per the discussion, we will mention it as a known issue in
> the docs, till the osd-removal job is able to handle this scenario. The
> customers who are willing to get ceph health to HEALTH_OK (by silencing the
> OSD crash warning) will have to contact customer support in that case.

Exactly why it should not be in the docs. It's an edge case.
Pulkit, please change the BZ title to reflect the actual fix.
Pulkit, please backport https://github.com/rook/rook/pull/7001 to https://github.com/openshift/rook/, using `cherry-pick -x`. Thanks
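For reference, a typical backport with `-x` (so the downstream commit records the upstream SHA) looks roughly like this. The remote names and target branch are assumptions, and the commit SHA is a placeholder for the merged upstream commit of PR 7001:

# Assumed remote layout: 'upstream' = rook/rook, 'downstream' = openshift/rook
git fetch upstream
git checkout -b backport-pr7001 downstream/<target-release-branch>
git cherry-pick -x <upstream-commit-sha>   # -x appends "(cherry picked from commit ...)"
git push origin backport-pr7001            # then open the downstream PR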
I tried to test the BZ again. I deleted a disk and created a new one. I followed the procedure here https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_devices/index#replacing-operational-or-failed-storage-devices-on-clusters-backed-by-local-storage-devices_rhocs except for the part of the ocs-osd-removal job name, which changed to "ocs-osd-removal-job".

The device replacement process finished successfully, but I still have the OSD crash warning at the end of the process.

Here are the logs of the "ocs-osd-removal-job":

2021-03-03 14:40:33.724042 I | rookcmd: starting Rook 4.7-103.a0622de60.release_4.7 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=1'
2021-03-03 14:40:33.724161 I | rookcmd: flag values: --help=false, --log-flush-frequency=5s, --log-level=DEBUG, --operator-image=, --osd-ids=1, --service-account=
2021-03-03 14:40:33.724171 I | op-mon: parsing mon endpoints: a=172.30.18.55:6789,c=172.30.250.240:6789,d=172.30.153.229:6789
2021-03-03 14:40:33.735707 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2021-03-03 14:40:33.735997 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2021-03-03 14:40:33.736105 D | cephosd: config file @ /etc/ceph/ceph.conf:
[global]
fsid = ab6c2054-6b8f-4d27-822a-1036cec016f7
mon initial members = a c d
mon host = [v2:172.30.18.55:3300,v1:172.30.18.55:6789],[v2:172.30.250.240:3300,v1:172.30.250.240:6789],[v2:172.30.153.229:3300,v1:172.30.153.229:6789]
mon_osd_full_ratio = .85
mon_osd_backfillfull_ratio = .8
mon_osd_nearfull_ratio = .75
mon_max_pg_per_osd = 600

[osd]
osd_memory_target_cgroup_limit_ratio = 0.5

[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring

2021-03-03 14:40:33.736303 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/271193002
2021-03-03 14:40:34.073687 I | cephosd: validating status of osd.1
2021-03-03 14:40:34.073712 I | cephosd: osd.1 is marked 'DOWN'. Removing it
2021-03-03 14:40:34.073810 D | exec: Running command: ceph osd find 1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/101348609
2021-03-03 14:40:34.380136 D | exec: Running command: ceph osd out osd.1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/581847660
2021-03-03 14:40:35.146811 D | exec: marked out osd.1.
2021-03-03 14:40:35.156777 I | cephosd: removing the OSD deployment "rook-ceph-osd-1"
2021-03-03 14:40:35.156806 D | op-k8sutil: removing rook-ceph-osd-1 deployment if it exists
2021-03-03 14:40:35.156810 I | op-k8sutil: removing deployment rook-ceph-osd-1 if it exists
2021-03-03 14:40:35.166034 I | op-k8sutil: Removed deployment rook-ceph-osd-1
2021-03-03 14:40:35.170123 I | op-k8sutil: "rook-ceph-osd-1" still found. waiting...
2021-03-03 14:40:37.176310 I | op-k8sutil: confirmed rook-ceph-osd-1 does not exist
2021-03-03 14:40:37.185731 I | cephosd: removing the osd prepare job "rook-ceph-osd-prepare-ocs-deviceset-2-data-0kvd87"
2021-03-03 14:40:37.193762 I | cephosd: removing the OSD PVC "ocs-deviceset-2-data-0kvd87"
2021-03-03 14:40:37.199140 D | exec: Running command: ceph osd purge osd.1 --force --yes-i-really-mean-it --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/044786907
2021-03-03 14:40:37.536172 D | exec: purged osd.1
2021-03-03 14:40:37.536419 D | exec: Running command: ceph osd crush rm compute-2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/401456254
2021-03-03 14:40:38.545793 D | exec: removed item id -3 name 'compute-2' from crush map
2021-03-03 14:40:38.546037 D | exec: Running command: ceph crash ls --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/706879941
2021-03-03 14:40:38.878203 I | cephosd: no ceph crash to silence
2021-03-03 14:40:38.878230 I | cephosd: completed removal of OSD 1
Additional info:

I tested it with a vSphere LSO cluster.

Versions:

OCP version:
Client Version: 4.6.0-0.nightly-2021-01-12-112514
Server Version: 4.7.0-0.nightly-2021-03-01-085007
Kubernetes Version: v1.20.0+5fbfd19

OCS version:
ocs-operator.v4.7.0-278.ci   OpenShift Container Storage   4.7.0-278.ci   Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-03-01-085007   True        False         26h     Cluster version is 4.7.0-0.nightly-2021-03-01-085007

Rook version:
rook: 4.7-103.a0622de60.release_4.7
go: go1.15.5

Ceph version:
ceph version 14.2.11-123.el8cp (f02fa4f00c2417b1bc86e6ec7711756454e70716) nautilus (stable)
Pulkit, you might need to wait for the crash to be created.
The OSD crash warning usually appears 5-15 minutes after the volume deletion. We need to take that into account.
5 to 15 minutes seems like a huge range; even 5 minutes looks too long. I'd expect a few seconds.
Yes, maybe that is also something that needs to be fixed. Otherwise, there is no point in running the "ocs-osd-removal" job for 5 minutes (or more), or telling the user that the OSD crash can appear after 5 minutes (or more).
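Until that is addressed, one way the tests (or an admin) can cope is to wait for the crash report to actually show up before expecting the removal job to silence it. A rough polling sketch, assuming jq is available in the tools pod and that the crash for the replaced OSD is reported within ~15 minutes (osd.0 is a hypothetical ID):

# Poll for up to 15 minutes for a new crash entry belonging to the replaced OSD
OSD_ENTITY="osd.0"   # hypothetical: the OSD being replaced
for i in $(seq 1 90); do
  if ceph crash ls-new --format json \
       | jq -e --arg e "$OSD_ENTITY" 'map(select(.entity_name == $e)) | length > 0' >/dev/null; then
    echo "crash for $OSD_ENTITY has been reported"
    break
  fi
  sleep 10
done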
Mudit, why is this considered a blocker?
Raz has marked it as a blocker, but I guess we can re-evaluate. Raz?
Moving to 4.8 since it's going to require a change to the approach to really change the behavior. Comments coming on further details of the discussion...
Seb, Travis, and I had a discussion about this bug in our triage today. Notes below.

When an OSD is purged from the Ceph cluster, we should *not* remove the crashes from the crash log because users may still wish to keep the information for data evaluation (for example, a postmortem). What we *should* do is clear errors for a given OSD when that OSD is purged so that the Ceph cluster can get back to a healthy state. If Ceph performs this work, then cephadm will also benefit.

There could be a race condition where an OSD is removed just after an OSD crashes. The OSD crash that happened before removal is still a valid crash that some users may wish to keep a record of. That crash should still be reported to Ceph. However, we think it should be a Ceph feature to clear errors for incoming crashes reported for OSDs that have been purged from the Ceph cluster. Ceph should still accept the incoming crash dump and log it (for postmortems) but not report an error based on the crash since it is for an OSD that no longer exists. This will also mean the same "fix" will apply to cephadm clusters as well as Rook.

Future work on this bug intended for OCS 4.8 will have to involve some collaboration between Ceph and Rook (and possibly cephadm) to make sure we are not removing evidence of errors while still allowing Ceph clusters to report healthy when OSDs are replaced.
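To make the intended distinction concrete (clear the health error without deleting the evidence), the relevant operation is archiving, not removing, the crash records of the purged OSD. A manual sketch of what the removal job (or Ceph itself) would effectively do, assuming jq is available and with osd.1 as a hypothetical purged OSD:

PURGED="osd.1"   # hypothetical ID of the OSD that was just purged

# Archive (acknowledge) only the crashes that belong to the purged OSD.
# 'ceph crash archive' keeps the metadata visible via 'ceph crash ls'/'info' for
# postmortems; 'ceph crash rm' would delete it, which is what we want to avoid.
ceph crash ls-new --format json \
  | jq -r --arg e "$PURGED" '.[] | select(.entity_name == $e) | .crash_id' \
  | xargs -r -n1 ceph crash archive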
Sorry for the late response. Pulkit, I didn't run any workloads.
@pkundra Any update on this one?
Any update on this one?
Since this is dependent on Ceph, not possible to include this in 4.8
https://bugzilla.redhat.com/show_bug.cgi?id=1967164 is targeted for 5.1
This issue was reproduced on ODF 4.9.

Setup:
OCP Version: 4.9.0-0.nightly-2021-11-26-225521
ODF Version: 4.9.0-249.ci
LSO Version: local-storage-operator.4.9.0-202111151318

Ceph Version:
sh-4.4$ ceph versions
{
    "mon": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 10
    }
}

Test Procedure:

1. Check Ceph status:
sh-4.4$ ceph status
  cluster:
    id:     e6ae853a-3595-4738-a15e-6cb4a470fc3b
    health: HEALTH_OK

2. Identify the OSD that needs to be replaced [OSD-0 on compute-2]:
$ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
NAME                               READY   STATUS    RESTARTS   AGE    IP            NODE        NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-85d6df8dbc-7m4v9   2/2     Running   0          5m3s   10.131.0.76   compute-2   <none>           <none>
rook-ceph-osd-1-5c6465f8d-hrnrp    2/2     Running   0          15m    10.129.2.28   compute-1   <none>           <none>
rook-ceph-osd-2-868859c6c8-ck2wq   2/2     Running   0          15m    10.128.2.20   compute-0   <none>           <none>

3. Delete the disk via vCenter from compute-2: OSD-0 moves to CLBO.

4. Scale down the OSD deployment for the OSD to be replaced:
$ osd_id_to_remove=0
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
deployment.apps/rook-ceph-osd-0 scaled

5. Verify that the rook-ceph-osd pod is terminated:
$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
NAME                               READY   STATUS        RESTARTS   AGE
rook-ceph-osd-0-85d6df8dbc-7m4v9   0/2     Terminating   4          9m19s

$ oc delete -n openshift-storage pod rook-ceph-osd-0-85d6df8dbc-7m4v9 --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-0-85d6df8dbc-7m4v9" force deleted

$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
No resources found in openshift-storage namespace.

6. Remove the old OSD from the cluster so that a new OSD can be added:
$ oc delete -n openshift-storage job ocs-osd-removal-job
Error from server (NotFound): jobs.batch "ocs-osd-removal-job" not found

7. Change to the openshift-storage project:
$ oc project openshift-storage
Already on project "openshift-storage" on server "https://api.oviner5-lso28.qe.rh-ocs.com:6443".

8. Remove the old OSD from the cluster:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} | oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

9. Verify that the OSD is removed successfully:
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                           READY   STATUS      RESTARTS   AGE
ocs-osd-removal-job--1-blwm4   0/1     Completed   0          16s

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
2021-11-28 12:37:35.309851 I | op-flags: failed to set flag "logtostderr". no such flag -logtostderr
2021-11-28 12:37:35.310175 I | rookcmd: starting Rook 4.9-215.c3f67c6.release_4.9 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=0'
2021-11-28 12:37:35.310184 I | rookcmd: flag values: --help=false, --log-level=DEBUG, --operator-image=, --osd-ids=0, --preserve-pvc=false, --service-account=
2021-11-28 12:37:35.310192 I | op-mon: parsing mon endpoints: c=172.30.100.150:6789,a=172.30.150.178:6789,b=172.30.231.249:6789
2021-11-28 12:37:35.325316 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2021-11-28 12:37:35.325522 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2021-11-28 12:37:35.325837 D | cephclient: config file @ /etc/ceph/ceph.conf:
[global]
fsid = e6ae853a-3595-4738-a15e-6cb4a470fc3b
mon initial members = c a b
mon host = [v2:172.30.100.150:3300,v1:172.30.100.150:6789],[v2:172.30.150.178:3300,v1:172.30.150.178:6789],[v2:172.30.231.249:3300,v1:172.30.231.249:6789]
bdev_flock_retry = 20
mon_osd_full_ratio = .85
mon_osd_backfillfull_ratio = .8
mon_osd_nearfull_ratio = .75
mon_max_pg_per_osd = 600
mon_pg_warn_max_object_skew = 0
mon_data_avail_warn = 15

[osd]
osd_memory_target_cgroup_limit_ratio = 0.5

[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring

2021-11-28 12:37:35.325902 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:35.689019 I | cephosd: validating status of osd.0
2021-11-28 12:37:35.689049 I | cephosd: osd.0 is marked 'DOWN'. Removing it
2021-11-28 12:37:35.689069 D | exec: Running command: ceph osd find 0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:36.008324 D | exec: Running command: ceph osd out osd.0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:37.294074 I | cephosd: removing the OSD deployment "rook-ceph-osd-0"
2021-11-28 12:37:37.294103 D | op-k8sutil: removing rook-ceph-osd-0 deployment if it exists
2021-11-28 12:37:37.294108 I | op-k8sutil: removing deployment rook-ceph-osd-0 if it exists
2021-11-28 12:37:37.307019 I | op-k8sutil: Removed deployment rook-ceph-osd-0
2021-11-28 12:37:37.311880 I | op-k8sutil: "rook-ceph-osd-0" still found. waiting...
2021-11-28 12:37:39.322341 I | op-k8sutil: confirmed rook-ceph-osd-0 does not exist
2021-11-28 12:37:39.330629 I | cephosd: removing the osd prepare job "rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-2pp2x2"
2021-11-28 12:37:39.340824 I | cephosd: removing the OSD PVC "ocs-deviceset-localblock-0-data-2pp2x2"
2021-11-28 12:37:39.352494 D | exec: Running command: ceph osd purge osd.0 --force --yes-i-really-mean-it --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:39.717103 D | exec: Running command: ceph osd crush rm compute-2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:40.735126 D | exec: Running command: ceph crash ls --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:41.065961 I | cephosd: no ceph crash to silence
2021-11-28 12:37:41.066007 I | cephosd: completed removal of OSD 0

10. Delete ocs-osd-removal-job:
$ oc delete -n openshift-storage job ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted

11. Find the persistent volume (PV) that needs to be deleted:
$ oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
local-pv-48294c53   100Gi   RWO   Delete   Released   openshift-storage/ocs-deviceset-localblock-0-data-2pp2x2   localblock   14m   compute-2

12. Physically add a new device to the node via vCenter.

13. Verify that there is a new OSD running:
$ oc get -n openshift-storage pods -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-6ccd74f8d6-xgvsk   2/2     Running   0          2m5s
rook-ceph-osd-1-5c6465f8d-hrnrp    2/2     Running   0          30m
rook-ceph-osd-2-868859c6c8-ck2wq   2/2     Running   0          30m

14. Check Ceph status:
sh-4.4$ ceph status
  cluster:
    id:     e6ae853a-3595-4738-a15e-6cb4a470fc3b
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 35m)
    mgr: a(active, since 34m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 6m), 3 in (since 6m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 460 objects, 130 MiB
    usage:   370 MiB used, 300 GiB / 300 GiB avail
    pgs:     177 active+clean

  io:
    client: 2.6 KiB/s rd, 10 KiB/s wr, 3 op/s rd, 2 op/s wr

sh-4.4$ ceph crash ls
ID                                                                ENTITY  NEW
2021-11-28T12:32:42.249604Z_9d7761a5-b6a6-4502-9eac-8944a42bb48f  osd.0   *
sh-4.4$ ceph crash info 2021-11-28T12:32:42.249604Z_9d7761a5-b6a6-4502-9eac-8944a42bb48f
{
    "assert_condition": "abort",
    "assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/blk/kernel/KernelDevice.cc",
    "assert_func": "void KernelDevice::_aio_thread()",
    "assert_line": 600,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/blk/kernel/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f38349c3700 time 2021-11-28T12:32:42.246661+0000\n/builddir/build/BUILD/ceph-16.2.0/src/blk/kernel/KernelDevice.cc: 600: ceph_abort_msg(\"Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!\")\n",
    "assert_thread_name": "bstore_aio",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12c20) [0x7f3841a4cc20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x55d25730da8b]",
        "(KernelDevice::_aio_thread()+0x1254) [0x55d257e507e4]",
        "(KernelDevice::AioCompletionThread::entry()+0x11) [0x55d257e5bae1]",
        "/lib64/libpthread.so.0(+0x817a) [0x7f3841a4217a]",
        "clone()"
    ],
    "ceph_version": "16.2.0-146.el8cp",
    "crash_id": "2021-11-28T12:32:42.249604Z_9d7761a5-b6a6-4502-9eac-8944a42bb48f",
    "entity_name": "osd.0",
    "io_error": true,
    "io_error_code": -5,
    "io_error_devname": "sdb",
    "io_error_length": 4096,
    "io_error_offset": 21028864,
    "io_error_optype": 8,
    "io_error_path": "/var/lib/ceph/osd/ceph-0/block",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.5 (Ootpa)",
    "os_version_id": "8.5",
    "process_name": "ceph-osd",
    "stack_sig": "b8dcbaf37e069edf8c664d423b4d383080e2b0044c722f73720098c980e72912",
    "timestamp": "2021-11-28T12:32:42.249604Z",
    "utsname_hostname": "rook-ceph-osd-0-85d6df8dbc-7m4v9",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-305.28.1.el8_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Mon Nov 8 07:45:47 EST 2021"
}

15. Archive (silence) the crash list:
sh-4.4$ ceph crash archive-all

16. Check Ceph status:
sh-4.4$ ceph health
HEALTH_OK
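Note that "ceph crash archive-all" silences every outstanding crash, not just the one for the replaced OSD. If other crashes might be present, it is safer to archive only the specific entry (the ID below is the one from the output above):

# Archive only the crash that belongs to the replaced OSD; the record stays
# available via 'ceph crash ls' / 'ceph crash info' for later analysis.
ceph crash archive 2021-11-28T12:32:42.249604Z_9d7761a5-b6a6-4502-9eac-8944a42bb48f
ceph health   # should now report HEALTH_OK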
https://bugzilla.redhat.com/show_bug.cgi?id=1967164 is targeted for RHCS5.2
Neha, is there any update on this one? Our device failure tests still contain a workaround that tends to break from time to time. It means we either fail those tests or miss coverage for some important scenarios.