osd removal job fails to complete

Description of problem:

Customer lost 2 of 3 OSDs during an upgrade, and an attempt was being made to recover them. The disks were replaced with new disks on the AWS nodes, and the old local-storage links were cleaned up on the nodes. The deployment for the OSD was successfully scaled down and then removed using:

# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 | oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

However, the job hangs and never completes. We waited 15 minutes (just to be sure), then tried again with the force parameter, with the same result:

# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -

The job logs show it hanging at the same spot every time:

ceph osd purge osd.0 --force --yes-i-really-mean-it

Running the same command directly on the rook-ceph-operator pod after marking the OSD out also hangs.

===============================================

ID   CLASS  WEIGHT   TYPE NAME                        STATUS  REWEIGHT  PRI-AFF
 -1         5.85928  root default
 -8         1.95309      rack rack0
 -7         1.95309          host ip-10-249-131-228
  0    ssd  1.95309              osd.0                down     1.00000  1.00000
 -4         1.95309      rack rack1
 -3         1.95309          host ip-10-249-132-205
  1    ssd  1.95309              osd.1                down     1.00000  1.00000
-12         1.95309      rack rack2
-11         1.95309          host ip-10-249-128-22
  2    ssd  1.95309              osd.2                up       1.00000  1.00000

  cluster:
    id:     b8f3821a-8f56-4237-82b2-d541ca333bed
    health: HEALTH_WARN
            2 clients failing to respond to capability release
            1 MDSs report slow metadata IOs
            1 MDSs report slow requests
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 273 pgs inactive
            Degraded data redundancy: 831580/1247370 objects degraded (66.667%), 232 pgs degraded, 273 pgs undersized
            10 pgs not deep-scrubbed in time
            2151 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 7d)
    mgr: a(active, since 7d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 1 up (since 5d), 3 in (since 12M)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 273 pgs
    objects: 415.79k objects, 1.4 TiB
    usage:   4.1 TiB used, 1.7 TiB / 5.9 TiB avail
    pgs:     100.000% pgs not active
             831580/1247370 objects degraded (66.667%)
             232 undersized+degraded+peered
             41  undersized+peered

Version of all relevant components (if applicable):
ocs-operator.v4.10.7

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
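For reference, a minimal sketch of how the stuck job can be inspected and re-run, assuming the job name ocs-osd-removal-job shown above (standard oc commands, not output captured from the customer cluster):

oc get jobs -n openshift-storage                          # COMPLETIONS stays at 0/1 while the job hangs
oc logs -n openshift-storage job/ocs-osd-removal-job -f   # follow the removal pod's log in real time
oc delete job ocs-osd-removal-job -n openshift-storage    # the previous job must be deleted before re-creating one with the same name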
Customer request: mirror this Bugzilla to IBM LTC for visibility and, if required, add the ibm_conf keyword.
Removal job logs from the 2nd run on osd-0 (the deployment "rook-ceph-osd-0" had already been removed during the first removal job run):

2022-10-29 16:08:11.828027 I | rookcmd: starting Rook 4.6-117.805c8bf.release_4.6 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=0'
2022-10-29 16:08:11.828115 I | rookcmd: flag values: --help=false, --log-flush-frequency=5s, --log-level=DEBUG, --operator-image=, --osd-ids=0, --service-account=
2022-10-29 16:08:11.828124 I | op-mon: parsing mon endpoints: a=172.30.138.46:6789,b=172.30.143.114:6789,c=172.30.177.93:6789
2022-10-29 16:08:11.829963 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2022-10-29 16:08:11.830017 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2022-10-29 16:08:11.830093 D | cephosd: config file @ /etc/ceph/ceph.conf:
[global]
fsid = b8f3821a-8f56-4237-82b2-d541ca333bed
mon initial members = a b c
mon host = [v2:172.30.138.46:3300,v1:172.30.138.46:6789],[v2:172.30.143.114:3300,v1:172.30.143.114:6789],[v2:172.30.177.93:3300,v1:172.30.177.93:6789]

[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring

2022-10-29 16:08:11.830213 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/062267478
2022-10-29 16:08:12.222533 D | cephosd: validating status of osd.0
2022-10-29 16:08:12.222554 D | cephosd: osd.0 is marked 'DOWN'. Removing it
2022-10-29 16:08:12.222626 D | exec: Running command: ceph osd find 0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/471435709
2022-10-29 16:08:12.578307 D | exec: Running command: ceph osd out osd.0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/616678392
2022-10-29 16:08:12.942387 D | exec: osd.0 is already out.
2022-10-29 16:08:12.950437 E | cephosd: failed to fetch the deployment "rook-ceph-osd-0". deployments.apps "rook-ceph-osd-0" not found
2022-10-29 16:08:12.950553 D | exec: Running command: ceph osd purge osd.0 --force --yes-i-really-mean-it --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out
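To reproduce the hang outside the job, a minimal sketch would be to run the same purge interactively from the rook-ceph toolbox (this assumes the toolbox deployment rook-ceph-tools is available in openshift-storage; osd.0 and the 15s connect timeout are taken from the log above):

oc rsh -n openshift-storage deploy/rook-ceph-tools
  ceph osd dump --connect-timeout=15                    # sanity check: the mons should answer this quickly
  ceph osd purge osd.0 --force --yes-i-really-mean-it   # the exact command the removal job is stuck on

If osd dump returns promptly but the purge still hangs, that would point at the Ceph/mon side rather than the removal job itself.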
Hello Team, I'm not certain this is a bug at this time. When I joined the call this morning, the removal job had completed after 8 hours:

ocs-osd-removal-0   1/1   8h   43h

I think there are underlying API/etcd issues at the OCP layer. The customer hit several issues during an OCP upgrade and started rebooting machines as a workaround. I've never seen a removal job take 8 hours to complete (it normally takes no more than a few seconds for one OSD). I advised the customer to open a case with the OpenShift team so they can look at the health of the platform. Let's keep this open for now, but I don't think it is a bug.
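As a starting point for that platform health check, a minimal sketch using standard OpenShift commands (nothing here is taken from the customer cluster) might look like:

oc get clusteroperators                      # degraded or stuck operators indicate platform-level problems
oc get nodes                                 # NotReady/SchedulingDisabled nodes left over from the upgrade reboots
oc get pods -n openshift-etcd -l app=etcd    # all etcd member pods should be Running
oc adm top nodes                             # rough check for resource pressure on the control plane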
@khover did you have a chance to look at the cluster? Is the cluster in HEALTH_OK state now? If yes, I think we can close this, as it would not be a bug.
Kevin, any update?
This looks like an environment issue; moving it out of 4.12 while the investigation continues.