Bug 2138575

Summary: osd removal job fails to complete
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: khover
Component: rook
Assignee: Travis Nielsen <tnielsen>
Status: CLOSED NOTABUG
QA Contact: Neha Berry <nberry>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.11
CC: hnallurv, madam, ocs-bugs, odf-bz-bot, paarora, tnielsen, vumrao
Target Milestone: ---
Flags: paarora: needinfo? (khover)
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-11-07 16:43:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description khover 2022-10-29 19:45:17 UTC
osd removal job fails to complete

The customer lost 2 of 3 OSDs during an upgrade.

An attempt was being made to recover the OSDs.

The disks were replaced with new disks on the AWS nodes.

The old local-storage links were cleaned up on the nodes.

The deployment for the OSD was successfully scaled down, and then the OSD was removed using:

# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created
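
For context, the scale-down step before creating the removal job typically looks like the following (a sketch only; the deployment name rook-ceph-osd-0 comes from the logs in comment 3, and the app=rook-ceph-osd label is the usual Rook OSD label but should be verified in the cluster):

# oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
# oc get deployment -n openshift-storage -l app=rook-ceph-osd

Adjust the deployment name and namespace for the OSD being removed.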

However, the job hangs and never completes.

We waited for 15 minutes (just to be sure).

We tried again with the force parameter, with the same result:

oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 -p FORCE_OSD_REMOVAL=true |oc create -n openshift-storage -f -

The job logs show it hanging at the same spot every time:

ceph osd purge osd.0 --force --yes-i-really-mean-it
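
The hang can be confirmed by tailing the removal job logs (standard oc usage; the job name is the one created above):

# oc logs -n openshift-storage job/ocs-osd-removal-job -f

The purge command above is the last line printed before the job stalls.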


We also ran the same command directly in the rook-ceph-operator pod after marking the OSD out, and the command hangs there as well.
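
For reference, a sketch of that manual attempt from inside the operator pod, reusing the cluster name, config, and keyring paths that appear in the job log in comment 3 (verify the paths in your environment; the ceph commands are run inside the pod shell):

# oc rsh -n openshift-storage deployment/rook-ceph-operator
# ceph osd out osd.0 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
# ceph osd purge osd.0 --force --yes-i-really-mean-it --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring

The purge command mirrors the exec line in the job log, and it is this purge that hangs.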

===============================================

ID   CLASS  WEIGHT   TYPE NAME                       STATUS  REWEIGHT  PRI-AFF
 -1         5.85928  root default                                             
 -8         1.95309      rack rack0                                           
 -7         1.95309          host ip-10-249-131-228                           
  0    ssd  1.95309              osd.0                 down   1.00000  1.00000
 -4         1.95309      rack rack1                                           
 -3         1.95309          host ip-10-249-132-205                           
  1    ssd  1.95309              osd.1                 down   1.00000  1.00000
-12         1.95309      rack rack2                                           
-11         1.95309          host ip-10-249-128-22                            
  2    ssd  1.95309              osd.2                   up   1.00000  1.00000


  cluster:
    id:     b8f3821a-8f56-4237-82b2-d541ca333bed
    health: HEALTH_WARN
            2 clients failing to respond to capability release
            1 MDSs report slow metadata IOs
            1 MDSs report slow requests
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 273 pgs inactive
            Degraded data redundancy: 831580/1247370 objects degraded (66.667%), 232 pgs degraded, 273 pgs undersized
            10 pgs not deep-scrubbed in time
            2151 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum a,b,c (age 7d)
    mgr: a(active, since 7d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 1 up (since 5d), 3 in (since 12M)
 
  data:
    volumes: 1/1 healthy
    pools:   11 pools, 273 pgs
    objects: 415.79k objects, 1.4 TiB
    usage:   4.1 TiB used, 1.7 TiB / 5.9 TiB avail
    pgs:     100.000% pgs not active
             831580/1247370 objects degraded (66.667%)
             232 undersized+degraded+peered
             41  undersized+peered

Version of all relevant components (if applicable):

ocs-operator.v4.10.7
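
(For reference, the installed operator version is typically read from the CSV list in the storage namespace, e.g.:)

# oc get csv -n openshift-storage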

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 2 khover 2022-10-29 20:22:21 UTC
Customer request:

Mirror this Bugzilla to IBM LTC for visibility and, if required, add the ibm_conf keyword.

Comment 3 khover 2022-10-29 20:33:56 UTC
Removal logs from the 2nd run on osd-0 (the deployment "rook-ceph-osd-0" had already been removed during the first removal job run).


2022-10-29 16:08:11.828027 I | rookcmd: starting Rook 4.6-117.805c8bf.release_4.6 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=0'
2022-10-29 16:08:11.828115 I | rookcmd: flag values: --help=false, --log-flush-frequency=5s, --log-level=DEBUG, --operator-image=, --osd-ids=0, --service-account=
2022-10-29 16:08:11.828124 I | op-mon: parsing mon endpoints: a=172.30.138.46:6789,b=172.30.143.114:6789,c=172.30.177.93:6789
2022-10-29 16:08:11.829963 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2022-10-29 16:08:11.830017 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2022-10-29 16:08:11.830093 D | cephosd: config file @ /etc/ceph/ceph.conf: [global]
fsid                = b8f3821a-8f56-4237-82b2-d541ca333bed
mon initial members = a b c
mon host            = [v2:172.30.138.46:3300,v1:172.30.138.46:6789],[v2:172.30.143.114:3300,v1:172.30.143.114:6789],[v2:172.30.177.93:3300,v1:172.30.177.93:6789]

[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring

2022-10-29 16:08:11.830213 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/062267478
2022-10-29 16:08:12.222533 D | cephosd: validating status of osd.0
2022-10-29 16:08:12.222554 D | cephosd: osd.0 is marked 'DOWN'. Removing it
2022-10-29 16:08:12.222626 D | exec: Running command: ceph osd find 0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/471435709
2022-10-29 16:08:12.578307 D | exec: Running command: ceph osd out osd.0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/616678392
2022-10-29 16:08:12.942387 D | exec: osd.0 is already out. 
2022-10-29 16:08:12.950437 E | cephosd: failed to fetch the deployment "rook-ceph-osd-0". deployments.apps "rook-ceph-osd-0" not found
2022-10-29 16:08:12.950553 D | exec: Running command: ceph osd purge osd.0 --force --yes-i-really-mean-it --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out
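
While the job is stuck at this point, its status and pod can be checked with standard oc commands (a quick sketch; names taken from the output earlier in this bug):

# oc get job -n openshift-storage ocs-osd-removal-job
# oc get pods -n openshift-storage -l job-name=ocs-osd-removal-job

The job-name label is added automatically by Kubernetes to pods created by a Job.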

Comment 5 khover 2022-10-31 14:40:14 UTC
Hello Team,

I'm not certain this is a bug at this time.

When I joined the call this morning, the removal job had completed after 8 hours:


ocs-osd-removal-0                                        1/1           8h         43h
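
Once the job reports 1/1 completions like this, the usual follow-up (a sketch; verify the job name against the output above) is to confirm that osd.0 no longer appears in the ceph osd tree output and then delete the completed job so a future removal can be created under the same name:

# oc delete job -n openshift-storage ocs-osd-removal-0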

I think there are underlying API/etcd issues at the OCP layer.

The customer hit several issues during an OCP upgrade and started rebooting machines as a workaround.

I've never seen a removal job take 8 hours to complete (I've never seen it take more than a few seconds for one OSD).

I advised the customer to open a case with the OpenShift team so they can look at the health of the platform.

Let's keep this open for now, but I don't think it is a bug.

Comment 13 Parth Arora 2022-11-02 08:56:43 UTC
@khover, did you have a chance to look at the cluster?
Is the cluster in HEALTH_OK state now?

If yes, then I think we can close this, as it would not be a bug.

Comment 17 Travis Nielsen 2022-11-07 16:23:56 UTC
Kevin, any update?

Comment 18 Travis Nielsen 2022-11-07 16:37:34 UTC
Looks like an environment issue, moving out of 4.12 during investigation.