Bug 1896810
| Field | Value |
|---|---|
| Summary | [Tracker for BZ #1967164] Silence crash warning in osd removal job. |
| Product | [Red Hat Storage] Red Hat OpenShift Data Foundation |
| Component | ceph |
| ceph sub component | Ceph-MGR |
| Reporter | Itzhak <ikave> |
| Assignee | Neha Ojha <nojha> |
| QA Contact | Elad <ebenahar> |
| Status | CLOSED NOTABUG |
| Severity | high |
| Priority | high |
| CC | amagrawa, bniver, brgardne, ebenahar, edonnell, muagarwa, nberry, nojha, odf-bz-bot, oviner, owasserm, pdhange, pdhiran, rzarzyns, sdudhgao, shan, tnielsen |
| Version | 4.6 |
| Keywords | AutomationBackLog |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | 4.7.0-272.ci |
| Doc Type | Known Issue |
| Story Points | --- |
| Clones | 1967164 (view as bug list) |
| Bug Blocks | 1882359, 1967164 |
| Last Closed | 2023-11-29 23:01:49 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |

Doc Text:

.Ceph status is `HEALTH_WARN` after disk replacement

After disk replacement, the warning `1 daemons have recently crashed` is shown even though all OSD pods are up and running, which leaves the Ceph status at `HEALTH_WARN` instead of `HEALTH_OK`. To work around this issue, `rsh` into the `ceph-tools` pod and silence the warning; Ceph health then returns to `HEALTH_OK`.
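For reference, the documented workaround boils down to archiving the crash record from the toolbox pod. A minimal sketch, assuming the usual `rook-ceph-tools` pod label (pod names and labels may differ per deployment):

$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1)
$ oc rsh -n openshift-storage $TOOLS_POD
# Inside the toolbox pod: list the recorded crashes, archive them, and re-check health.
sh-4.4$ ceph crash ls
sh-4.4$ ceph crash archive-all
sh-4.4$ ceph health
HEALTH_OK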
Description (Itzhak, 2020-11-11 15:27:07 UTC)
What do you see in the logs? Where's the OSD crash dump? If we wish Ceph engineering to look at it, let's provide them with the real details here.

The Ceph health warning occurs after deleting the backing volume from the platform side. After reattaching a new volume and performing all the relevant steps, all three OSDs are up and running, but we still have the warning about the old OSD crash. Here are the test logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/nberry-n5-cp/nberry-n5-cp_20201105T121923/logs/failed_testcase_ocs_logs_1604930171/test_recovery_from_volume_deletion_ocs_logs/

Needinfo answered in comment #3.

1. Any idea how we can propagate this issue to the user? As is, it requires a support case. Is there an alert?
2. This is on a non-LSO VMware, so less likely to be a real HW issue?

(In reply to Yaniv Kaul from comment #6)
> 1. Any idea how we can propagate this issue to the user? As is, it requires a support case. Is there an alert?

https://bugzilla.redhat.com/show_bug.cgi?id=1682967 issues a health warning when there are too many repairs done by an OSD. The aim is to help identify and warn about things like a bad disk, controller, etc.

> 2. This is on a non-LSO VMware, so less likely to be a real HW issue?

The error message is well explained in https://bugzilla.redhat.com/show_bug.cgi?id=1856430#c7. Is there a way to confirm that there aren't any issues in the underlying layer?

(In reply to Neha Ojha from comment #7)
> Is there a way to confirm that there aren't any issues in the underlying layer?

NEEDINFO on reporter.

Based on a discussion in the "OCS leads meeting", this seems to be the right way for the product to behave. If the missing part is to add more timeout to our tests, let's do that and this can be closed as NOT A BUG.

Moving to 4.7 to get more information from Itzhak.

Following the discussion we had, I realized that there is a good chance that the osd-removal job should take care of removing the OSD and making sure Ceph is in health OK. Servesha, is this assumption correct? For now, moving the bug to Rook and let's keep it open.

Elad, agreed, the osd-removal job should take care of acknowledging the crash and silencing it once solved. Rohan, please get familiar with https://docs.ceph.com/en/latest/rados/operations/health-checks/#recent-crash and implement the logic in the removal job from Rook. Thanks.

@Elad sorry for the late reply, I was on PTO. Your assumption is right: a job should take care of Ceph's health. The query is also addressed in comment #11. Hence clearing the needinfo.

@Neha For now it sounds fair to add it as a KNOWN issue, IMO. As a resolution, we can advise customers to contact support, assuming some of them might want to apply the workaround.

Are you sure customers know what the ceph tools pod is and how to rsh to it? The workaround is fine for support, not for end users.

@Yaniv Right.
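For context on the `recent-crash` health check linked above: RECENT_CRASH is raised for new, unarchived crashes from the recent past (the window is the mgr option `mgr/crash/warn_recent_interval`, two weeks by default), and it clears once the crash is archived, not deleted. A rough sketch of the manual steps the removal job would automate (the crash ID is a placeholder):

sh-4.4$ ceph crash ls-new                # crashes that still raise RECENT_CRASH
sh-4.4$ ceph crash archive <crash-id>    # acknowledge a single crash (placeholder ID)
sh-4.4$ ceph crash archive-all           # or acknowledge everything at once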
As per the discussion, we will mention it as a known issue in the docs until the osd-removal job is able to handle this scenario. Customers who want to get Ceph health back to HEALTH_OK (by silencing the OSD crash warning) will have to contact customer support in that case.

(In reply to Servesha from comment #17)
> As per the discussion, we will mention it as a known issue in the docs until the osd-removal job is able to handle this scenario.

Exactly why it should not be in the docs. It's an edge case.

Pulkit, please change the BZ title to reflect the actual fix.

Pulkit, please backport https://github.com/rook/rook/pull/7001 to https://github.com/openshift/rook/, use `cherry-pick -x`. Thanks.

I tried to test the BZ again. I deleted a disk and created a new one. I followed the procedure here https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_devices/index#replacing-operational-or-failed-storage-devices-on-clusters-backed-by-local-storage-devices_rhocs except for the part of the ocs-osd-removal job name, which changed to "ocs-osd-removal-job". The device replacement process finished successfully, but I still have the OSD crash warning at the end of the process.

Here are the logs of the "ocs-osd-removal-job":

2021-03-03 14:40:33.724042 I | rookcmd: starting Rook 4.7-103.a0622de60.release_4.7 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=1'
2021-03-03 14:40:33.724161 I | rookcmd: flag values: --help=false, --log-flush-frequency=5s, --log-level=DEBUG, --operator-image=, --osd-ids=1, --service-account=
2021-03-03 14:40:33.724171 I | op-mon: parsing mon endpoints: a=172.30.18.55:6789,c=172.30.250.240:6789,d=172.30.153.229:6789
2021-03-03 14:40:33.735707 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2021-03-03 14:40:33.735997 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2021-03-03 14:40:33.736105 D | cephosd: config file @ /etc/ceph/ceph.conf:
[global]
fsid = ab6c2054-6b8f-4d27-822a-1036cec016f7
mon initial members = a c d
mon host = [v2:172.30.18.55:3300,v1:172.30.18.55:6789],[v2:172.30.250.240:3300,v1:172.30.250.240:6789],[v2:172.30.153.229:3300,v1:172.30.153.229:6789]
mon_osd_full_ratio = .85
mon_osd_backfillfull_ratio = .8
mon_osd_nearfull_ratio = .75
mon_max_pg_per_osd = 600
[osd]
osd_memory_target_cgroup_limit_ratio = 0.5
[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring
2021-03-03 14:40:33.736303 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/271193002
2021-03-03 14:40:34.073687 I | cephosd: validating status of osd.1
2021-03-03 14:40:34.073712 I | cephosd: osd.1 is marked 'DOWN'. Removing it
2021-03-03 14:40:34.073810 D | exec: Running command: ceph osd find 1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/101348609
2021-03-03 14:40:34.380136 D | exec: Running command: ceph osd out osd.1 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/581847660
2021-03-03 14:40:35.146811 D | exec: marked out osd.1.
2021-03-03 14:40:35.156777 I | cephosd: removing the OSD deployment "rook-ceph-osd-1"
2021-03-03 14:40:35.156806 D | op-k8sutil: removing rook-ceph-osd-1 deployment if it exists
2021-03-03 14:40:35.156810 I | op-k8sutil: removing deployment rook-ceph-osd-1 if it exists
2021-03-03 14:40:35.166034 I | op-k8sutil: Removed deployment rook-ceph-osd-1
2021-03-03 14:40:35.170123 I | op-k8sutil: "rook-ceph-osd-1" still found. waiting...
2021-03-03 14:40:37.176310 I | op-k8sutil: confirmed rook-ceph-osd-1 does not exist
2021-03-03 14:40:37.185731 I | cephosd: removing the osd prepare job "rook-ceph-osd-prepare-ocs-deviceset-2-data-0kvd87"
2021-03-03 14:40:37.193762 I | cephosd: removing the OSD PVC "ocs-deviceset-2-data-0kvd87"
2021-03-03 14:40:37.199140 D | exec: Running command: ceph osd purge osd.1 --force --yes-i-really-mean-it --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/044786907
2021-03-03 14:40:37.536172 D | exec: purged osd.1
2021-03-03 14:40:37.536419 D | exec: Running command: ceph osd crush rm compute-2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/401456254
2021-03-03 14:40:38.545793 D | exec: removed item id -3 name 'compute-2' from crush map
2021-03-03 14:40:38.546037 D | exec: Running command: ceph crash ls --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/706879941
2021-03-03 14:40:38.878203 I | cephosd: no ceph crash to silence
2021-03-03 14:40:38.878230 I | cephosd: completed removal of OSD 1

Additional info: I tested it with a vSphere LSO cluster.

Versions:

OCP version:
Client Version: 4.6.0-0.nightly-2021-01-12-112514
Server Version: 4.7.0-0.nightly-2021-03-01-085007
Kubernetes Version: v1.20.0+5fbfd19

OCS version:
ocs-operator.v4.7.0-278.ci   OpenShift Container Storage   4.7.0-278.ci   Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-03-01-085007   True        False         26h     Cluster version is 4.7.0-0.nightly-2021-03-01-085007

Rook version:
rook: 4.7-103.a0622de60.release_4.7
go: go1.15.5

Ceph version:
ceph version 14.2.11-123.el8cp (f02fa4f00c2417b1bc86e6ec7711756454e70716) nautilus (stable)

Pulkit, you might need to wait for the crash to be created. The OSD crash report usually appears 5-15 minutes after the volume deletion. We need to consider it.

5 to 15 minutes seems like a huge difference; even 5 minutes looks too long. I'd expect a few seconds.
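Since the crash report can show up only several minutes after the disk is pulled, one way test automation could avoid the "no ceph crash to silence" outcome is to wait for a crash entry before creating the removal job. A minimal sketch; the `rook-ceph-tools` label, the 30-second poll interval, and the ~20-minute budget are assumptions, not part of any documented procedure:

TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1)
for i in $(seq 1 40); do
  # ceph crash ls-new lists only crashes that have not been archived yet.
  crashes=$(oc rsh -n openshift-storage "$TOOLS_POD" ceph crash ls-new --format json)
  if [ "$crashes" != "[]" ]; then
    echo "crash recorded, proceeding with the ocs-osd-removal job"
    break
  fi
  sleep 30
done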
Yes, maybe it's also something that needs to be fixed. Otherwise, there is no point in running the "ocs-osd-removal" job for 5 minutes (or more), or in telling the user that the OSD crash can appear after 5 minutes (or more).

Mudit, why is this considered a blocker?

Raz has marked it as a blocker, but I guess we can re-evaluate. Raz?

Moving to 4.8 since it's going to require a change to the approach to really change the behavior. Comments coming on further details of the discussion...

Seb, Travis, and I had a discussion about this bug in our triage today. Notes below.

When an OSD is purged from the Ceph cluster, we should *not* remove the crashes from the crash log, because users may still wish to keep the information for data evaluation (for example, a postmortem). What we *should* do is clear errors for a given OSD when that OSD is purged, so that the Ceph cluster can get back to a healthy state. If Ceph performs this work, then cephadm will also benefit.

There could be a race condition where an OSD is removed just after an OSD crashes. The OSD crash that happened before removal is still a valid crash that some users may wish to keep a record of. That crash should still be reported to Ceph. However, we think it should be a Ceph feature to clear errors for incoming crashes reported for OSDs that have been purged from the Ceph cluster. Ceph should still accept the incoming crash dump and log it (for postmortems) but not report an error based on the crash, since it is for an OSD that no longer exists. This also means the same "fix" will apply to cephadm clusters as well as Rook.

Future work on this bug intended for OCS 4.8 will have to involve some collaboration between Ceph and Rook (and possibly cephadm) to make sure we are not removing evidence of errors while still allowing Ceph clusters to report healthy when OSDs are replaced.

Sorry for the late response. Pulkit, I didn't run any workloads.

@pkundra Any update on this one?

Any update on this one?

Since this is dependent on Ceph, it is not possible to include this in 4.8. https://bugzilla.redhat.com/show_bug.cgi?id=1967164 is targeted for 5.1.

This issue was reproduced on ODF 4.9.
Setup:
OCP Version: 4.9.0-0.nightly-2021-11-26-225521
ODF Version: 4.9.0-249.ci
LSO Version: local-storage-operator.4.9.0-202111151318
Ceph Version:
sh-4.4$ ceph versions
{
"mon": {
"ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 3
},
"mgr": {
"ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 1
},
"osd": {
"ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 3
},
"mds": {
"ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 2
},
"rgw": {
"ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 1
},
"overall": {
"ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 10
}
}
Test Procedure:
1.Check ceph status:
sh-4.4$ ceph status
cluster:
id: e6ae853a-3595-4738-a15e-6cb4a470fc3b
health: HEALTH_OK
2.Identify the OSD that needs to be replaced [OSD-0 COMPUTE-2]
$ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
rook-ceph-osd-0-85d6df8dbc-7m4v9 2/2 Running 0 5m3s 10.131.0.76 compute-2 <none> <none>
rook-ceph-osd-1-5c6465f8d-hrnrp 2/2 Running 0 15m 10.129.2.28 compute-1 <none> <none>
rook-ceph-osd-2-868859c6c8-ck2wq 2/2 Running 0 15m 10.128.2.20 compute-0 <none> <none>
3.Delete the disk from compute-2 via vCenter:
OSD-0 moves to CLBO (CrashLoopBackOff).
4.Scale down the OSD deployment for the OSD to be replaced:
$ osd_id_to_remove=0
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
deployment.apps/rook-ceph-osd-0 scaled
5.Verify that the rook-ceph-osd pod is terminated.
$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
NAME READY STATUS RESTARTS AGE
rook-ceph-osd-0-85d6df8dbc-7m4v9 0/2 Terminating 4 9m19s
$ oc delete -n openshift-storage pod rook-ceph-osd-0-85d6df8dbc-7m4v9 --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-0-85d6df8dbc-7m4v9" force deleted
$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
No resources found in openshift-storage namespace.
6.Remove the old OSD from the cluster so that a new OSD can be added.
$ oc delete -n openshift-storage job ocs-osd-removal-job
Error from server (NotFound): jobs.batch "ocs-osd-removal-job" not found
7.Change to the openshift-storage project.
$ oc project openshift-storage
Already on project "openshift-storage" on server "https://api.oviner5-lso28.qe.rh-ocs.com:6443".
8.Remove the old OSD from the cluster.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created
9.Verify that the OSD is removed successfully
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME READY STATUS RESTARTS AGE
ocs-osd-removal-job--1-blwm4 0/1 Completed 0 16s
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
2021-11-28 12:37:35.309851 I | op-flags: failed to set flag "logtostderr". no such flag -logtostderr
2021-11-28 12:37:35.310175 I | rookcmd: starting Rook 4.9-215.c3f67c6.release_4.9 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=0'
2021-11-28 12:37:35.310184 I | rookcmd: flag values: --help=false, --log-level=DEBUG, --operator-image=, --osd-ids=0, --preserve-pvc=false, --service-account=
2021-11-28 12:37:35.310192 I | op-mon: parsing mon endpoints: c=172.30.100.150:6789,a=172.30.150.178:6789,b=172.30.231.249:6789
2021-11-28 12:37:35.325316 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2021-11-28 12:37:35.325522 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2021-11-28 12:37:35.325837 D | cephclient: config file @ /etc/ceph/ceph.conf: [global]
fsid = e6ae853a-3595-4738-a15e-6cb4a470fc3b
mon initial members = c a b
mon host = [v2:172.30.100.150:3300,v1:172.30.100.150:6789],[v2:172.30.150.178:3300,v1:172.30.150.178:6789],[v2:172.30.231.249:3300,v1:172.30.231.249:6789]
bdev_flock_retry = 20
mon_osd_full_ratio = .85
mon_osd_backfillfull_ratio = .8
mon_osd_nearfull_ratio = .75
mon_max_pg_per_osd = 600
mon_pg_warn_max_object_skew = 0
mon_data_avail_warn = 15
[osd]
osd_memory_target_cgroup_limit_ratio = 0.5
[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring
2021-11-28 12:37:35.325902 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:35.689019 I | cephosd: validating status of osd.0
2021-11-28 12:37:35.689049 I | cephosd: osd.0 is marked 'DOWN'. Removing it
2021-11-28 12:37:35.689069 D | exec: Running command: ceph osd find 0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:36.008324 D | exec: Running command: ceph osd out osd.0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:37.294074 I | cephosd: removing the OSD deployment "rook-ceph-osd-0"
2021-11-28 12:37:37.294103 D | op-k8sutil: removing rook-ceph-osd-0 deployment if it exists
2021-11-28 12:37:37.294108 I | op-k8sutil: removing deployment rook-ceph-osd-0 if it exists
2021-11-28 12:37:37.307019 I | op-k8sutil: Removed deployment rook-ceph-osd-0
2021-11-28 12:37:37.311880 I | op-k8sutil: "rook-ceph-osd-0" still found. waiting...
2021-11-28 12:37:39.322341 I | op-k8sutil: confirmed rook-ceph-osd-0 does not exist
2021-11-28 12:37:39.330629 I | cephosd: removing the osd prepare job "rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-2pp2x2"
2021-11-28 12:37:39.340824 I | cephosd: removing the OSD PVC "ocs-deviceset-localblock-0-data-2pp2x2"
2021-11-28 12:37:39.352494 D | exec: Running command: ceph osd purge osd.0 --force --yes-i-really-mean-it --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:39.717103 D | exec: Running command: ceph osd crush rm compute-2 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:40.735126 D | exec: Running command: ceph crash ls --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-11-28 12:37:41.065961 I | cephosd: no ceph crash to silence
2021-11-28 12:37:41.066007 I | cephosd: completed removal of OSD 0
10.Delete ocs-osd-removal-job
$ oc delete -n openshift-storage job ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted
11.Find the persistent volume (PV) that needs to be deleted using the following command:
$ oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
local-pv-48294c53 100Gi RWO Delete Released openshift-storage/ocs-deviceset-localblock-0-data-2pp2x2 localblock 14m compute-2
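Per the replacement procedure linked earlier, the released PV found above is then deleted before the new device is added; a minimal sketch using the PV name from this run (this is the documented manual step, not something the removal job does):

$ oc delete pv local-pv-48294c53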
12.Physically add a new device to the node via vcenter
13.Verify that there is a new OSD running.
$ oc get -n openshift-storage pods -l app=rook-ceph-osd
NAME READY STATUS RESTARTS AGE
rook-ceph-osd-0-6ccd74f8d6-xgvsk 2/2 Running 0 2m5s
rook-ceph-osd-1-5c6465f8d-hrnrp 2/2 Running 0 30m
rook-ceph-osd-2-868859c6c8-ck2wq 2/2 Running 0 30m
14.Check Ceph status:
sh-4.4$ ceph status
cluster:
id: e6ae853a-3595-4738-a15e-6cb4a470fc3b
health: HEALTH_WARN
1 daemons have recently crashed
services:
mon: 3 daemons, quorum a,b,c (age 35m)
mgr: a(active, since 34m)
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 6m), 3 in (since 6m)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 11 pools, 177 pgs
objects: 460 objects, 130 MiB
usage: 370 MiB used, 300 GiB / 300 GiB avail
pgs: 177 active+clean
io:
client: 2.6 KiB/s rd, 10 KiB/s wr, 3 op/s rd, 2 op/s wr
sh-4.4$ ceph crash ls
ID ENTITY NEW
2021-11-28T12:32:42.249604Z_9d7761a5-b6a6-4502-9eac-8944a42bb48f osd.0 *
sh-4.4$ ceph crash info 2021-11-28T12:32:42.249604Z_9d7761a5-b6a6-4502-9eac-8944a42bb48f
{
"assert_condition": "abort",
"assert_file": "/builddir/build/BUILD/ceph-16.2.0/src/blk/kernel/KernelDevice.cc",
"assert_func": "void KernelDevice::_aio_thread()",
"assert_line": 600,
"assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/blk/kernel/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f38349c3700 time 2021-11-28T12:32:42.246661+0000\n/builddir/build/BUILD/ceph-16.2.0/src/blk/kernel/KernelDevice.cc: 600: ceph_abort_msg(\"Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!\")\n",
"assert_thread_name": "bstore_aio",
"backtrace": [
"/lib64/libpthread.so.0(+0x12c20) [0x7f3841a4cc20]",
"gsignal()",
"abort()",
"(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x55d25730da8b]",
"(KernelDevice::_aio_thread()+0x1254) [0x55d257e507e4]",
"(KernelDevice::AioCompletionThread::entry()+0x11) [0x55d257e5bae1]",
"/lib64/libpthread.so.0(+0x817a) [0x7f3841a4217a]",
"clone()"
],
"ceph_version": "16.2.0-146.el8cp",
"crash_id": "2021-11-28T12:32:42.249604Z_9d7761a5-b6a6-4502-9eac-8944a42bb48f",
"entity_name": "osd.0",
"io_error": true,
"io_error_code": -5,
"io_error_devname": "sdb",
"io_error_length": 4096,
"io_error_offset": 21028864,
"io_error_optype": 8,
"io_error_path": "/var/lib/ceph/osd/ceph-0/block",
"os_id": "rhel",
"os_name": "Red Hat Enterprise Linux",
"os_version": "8.5 (Ootpa)",
"os_version_id": "8.5",
"process_name": "ceph-osd",
"stack_sig": "b8dcbaf37e069edf8c664d423b4d383080e2b0044c722f73720098c980e72912",
"timestamp": "2021-11-28T12:32:42.249604Z",
"utsname_hostname": "rook-ceph-osd-0-85d6df8dbc-7m4v9",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-305.28.1.el8_4.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Mon Nov 8 07:45:47 EST 2021"
}
15.Archive the crash list to silence the warning (the crash record is kept, not deleted):
sh-4.4$ ceph crash archive-all
16.Check Ceph status
sh-4.4$ ceph health
HEALTH_OK
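As an alternative to `ceph crash archive-all`, a single crash can be acknowledged by ID (here, the ID from step 14), which leaves any other crash records still raising the warning; in this run there was only the one crash, so health returns to OK either way:

sh-4.4$ ceph crash archive 2021-11-28T12:32:42.249604Z_9d7761a5-b6a6-4502-9eac-8944a42bb48f
sh-4.4$ ceph health
HEALTH_OK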
https://bugzilla.redhat.com/show_bug.cgi?id=1967164 is targeted for RHCS 5.2.

Neha, is there any update on this one? Our device failure tests still contain a workaround that tends to break from time to time. It means we either fail those tests or miss coverage for some important scenarios.