osd removal job fails to complete

Description of problem:

Customer lost 2 of 3 OSDs during an upgrade, and an attempt was being made to recover them. The disks were replaced with new disks on the AWS nodes, and the old local-storage links were cleaned up on the nodes. The deployment for the OSD was successfully scaled down and then removed using:

# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 | oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

However, the job hangs and never completes. We waited 15 minutes (just to be sure), then tried again with the force parameter, with the same result:

# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -

The job logs show it hanging at the same spot every time:

ceph osd purge osd.0 --force --yes-i-really-mean-it

Running the same command directly on the rook-ceph-operator pod after marking the OSD out also hangs.

===============================================

ID   CLASS  WEIGHT   TYPE NAME                        STATUS  REWEIGHT  PRI-AFF
 -1         5.85928  root default
 -8         1.95309      rack rack0
 -7         1.95309          host ip-10-249-131-228
  0    ssd  1.95309              osd.0                down     1.00000  1.00000
 -4         1.95309      rack rack1
 -3         1.95309          host ip-10-249-132-205
  1    ssd  1.95309              osd.1                down     1.00000  1.00000
-12         1.95309      rack rack2
-11         1.95309          host ip-10-249-128-22
  2    ssd  1.95309              osd.2                up       1.00000  1.00000

  cluster:
    id:     b8f3821a-8f56-4237-82b2-d541ca333bed
    health: HEALTH_WARN
            2 clients failing to respond to capability release
            1 MDSs report slow metadata IOs
            1 MDSs report slow requests
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 273 pgs inactive
            Degraded data redundancy: 831580/1247370 objects degraded (66.667%), 232 pgs degraded, 273 pgs undersized
            10 pgs not deep-scrubbed in time
            2151 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 7d)
    mgr: a(active, since 7d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 1 up (since 5d), 3 in (since 12M)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 273 pgs
    objects: 415.79k objects, 1.4 TiB
    usage:   4.1 TiB used, 1.7 TiB / 5.9 TiB avail
    pgs:     100.000% pgs not active
             831580/1247370 objects degraded (66.667%)
             232 undersized+degraded+peered
             41  undersized+peered

Version of all relevant components (if applicable):
ocs-operator.v4.10.7

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
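For reference, a minimal sketch of how the stuck job can be inspected and re-run, assuming the job name ocs-osd-removal-job shown above (standard oc commands, not output captured from the customer cluster):

oc get jobs -n openshift-storage                          # COMPLETIONS stays at 0/1 while the job hangs
oc logs -n openshift-storage job/ocs-osd-removal-job -f   # follow the removal pod's log in real time
oc delete job ocs-osd-removal-job -n openshift-storage    # the previous job must be deleted before re-creating one with the same name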
Customer request: mirror this Bugzilla to IBM LTC for visibility and, if required, add the ibm_conf keyword.
Removal job logs from the 2nd run on osd-0 (the deployment "rook-ceph-osd-0" had already been removed during the first removal job run):

2022-10-29 16:08:11.828027 I | rookcmd: starting Rook 4.6-117.805c8bf.release_4.6 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=0'
2022-10-29 16:08:11.828115 I | rookcmd: flag values: --help=false, --log-flush-frequency=5s, --log-level=DEBUG, --operator-image=, --osd-ids=0, --service-account=
2022-10-29 16:08:11.828124 I | op-mon: parsing mon endpoints: a=172.30.138.46:6789,b=172.30.143.114:6789,c=172.30.177.93:6789
2022-10-29 16:08:11.829963 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2022-10-29 16:08:11.830017 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2022-10-29 16:08:11.830093 D | cephosd: config file @ /etc/ceph/ceph.conf:
[global]
fsid = b8f3821a-8f56-4237-82b2-d541ca333bed
mon initial members = a b c
mon host = [v2:172.30.138.46:3300,v1:172.30.138.46:6789],[v2:172.30.143.114:3300,v1:172.30.143.114:6789],[v2:172.30.177.93:3300,v1:172.30.177.93:6789]

[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring

2022-10-29 16:08:11.830213 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/062267478
2022-10-29 16:08:12.222533 D | cephosd: validating status of osd.0
2022-10-29 16:08:12.222554 D | cephosd: osd.0 is marked 'DOWN'. Removing it
2022-10-29 16:08:12.222626 D | exec: Running command: ceph osd find 0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/471435709
2022-10-29 16:08:12.578307 D | exec: Running command: ceph osd out osd.0 --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/616678392
2022-10-29 16:08:12.942387 D | exec: osd.0 is already out.
2022-10-29 16:08:12.950437 E | cephosd: failed to fetch the deployment "rook-ceph-osd-0". deployments.apps "rook-ceph-osd-0" not found
2022-10-29 16:08:12.950553 D | exec: Running command: ceph osd purge osd.0 --force --yes-i-really-mean-it --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out
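To reproduce the hang outside the job, a minimal sketch would be to run the same purge interactively from the rook-ceph toolbox (this assumes the toolbox deployment rook-ceph-tools is available in openshift-storage; osd.0 and the 15s connect timeout are taken from the log above):

oc rsh -n openshift-storage deploy/rook-ceph-tools
  ceph osd dump --connect-timeout=15                    # sanity check: the mons should answer this quickly
  ceph osd purge osd.0 --force --yes-i-really-mean-it   # the exact command the removal job is stuck on

If osd dump returns promptly but the purge still hangs, that would point at the Ceph/mon side rather than the removal job itself.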
Hello Team, I'm not certain this is a bug at this time. When I joined the call this morning, the removal job had completed after 8 hours:

ocs-osd-removal-0   1/1   8h   43h

I think there are underlying API/etcd issues at the OCP layer. The customer hit several issues during an OCP upgrade and started rebooting machines as a workaround. I've never seen a removal job take 8 hours to complete (it normally takes no more than a few seconds for one OSD). I advised the customer to open a case with the OpenShift team so they can look at the health of the platform. Let's keep this open for now, but I don't think it is a bug.
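As a starting point for that platform health check, a minimal sketch using standard OpenShift commands (nothing here is taken from the customer cluster) might look like:

oc get clusteroperators                      # degraded or stuck operators indicate platform-level problems
oc get nodes                                 # NotReady/SchedulingDisabled nodes left over from the upgrade reboots
oc get pods -n openshift-etcd -l app=etcd    # all etcd member pods should be Running
oc adm top nodes                             # rough check for resource pressure on the control plane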
@khover did you have a chance to look at the cluster? Is the cluster in HEALTH_OK state now? If yes, I think we can close this, as it would not be a bug.
Kevin, any update?
This looks like an environment issue; moving it out of 4.12 while the investigation continues.