Bug 2047318
| Summary: | [GSS] pgs stay in active+degraded state on new ceph cluster | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Samuel Blais-Dowdy <sblaisdo> |
| Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED NOTABUG | QA Contact: | Elad <ebenahar> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.8 | CC: | madam, ocs-bugs, sabose |
| Target Milestone: | --- | Flags: | sblaisdo: needinfo+ |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-02-08 20:23:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Samuel Blais-Dowdy
2022-01-27 15:34:47 UTC
One of the OSDs is not up. Since must-gather did not run, is there a way to get the Rook operator logs and attach them to the BZ? To troubleshoot why the OSD is down, please also gather:
- "ceph osd tree" in the toolbox, which will show which osd is down, if it's not obvious from a crashing pod
- Logs for the down osd pod
- OSD pod description for the osd that is down

--- $ ceph osd tree ---
sh-4.4$ ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                           STATUS  REWEIGHT  PRI-AFF
 -1         2.00000  root default
 -5         2.00000      region us-east-2
 -4         1.00000          zone us-east-2b
 -3         1.00000              host default-0-data-0pv72z
  0    ssd  1.00000                  osd.0               up      1.00000   1.00000
-10         1.00000          zone us-east-2c
 -9         1.00000              host default-2-data-0cd28h
  2    ssd  1.00000                  osd.2               up      1.00000   1.00000
  1         0                osd.1                       down    0         1.00000

--- OSD down pod description/logs ---
As the pod does not exist, and was never scheduled, this information is unfortunately unavailable. If the pods were grouped in a DaemonSet or Deployment, I believe we would receive better updates and information, and we would clearly see 2/3 pods running, hinting at an issue.

Samuel, is there an OSD deployment that is unschedulable? Since osd.1 exists, its OSD prepare job completed successfully, and likely the OSD deployment is pending for some reason. But without a rook operator log, or a description of the rook-ceph-osd-1 deployment, it's difficult to diagnose. Getting the must-gather access will really help troubleshoot.

I dug into the osd.1 prepare jobs (attached the logs), and it seems the job actually failed to execute properly. There are some Python tracebacks, even though it reported as Completed Successfully. ceph-volume is failing just to list whether there are any OSDs already on the volume, with several stacks like this:
2022-01-26 23:10:36.953704 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/default-1-data-0f8266 --format json
2022-01-26 23:10:37.200585 E | cephosd: . Traceback (most recent call last):
File "/usr/sbin/ceph-volume", line 11, in <module>
load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 39, in __init__
self.main(self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
return f(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 151, in main
terminal.dispatch(self.mapper, subcommand_args)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
terminal.dispatch(self.mapper, self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main
self.list(args)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list
report = self.generate(args.device)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate
whoami = oj[dev]['whoami']
KeyError: 'whoami'
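The traceback ends at `whoami = oj[dev]['whoami']` in `ceph_volume/devices/raw/list.py`: the corrupted PV has a bluestore label entry without the `whoami` (OSD id) key, so the plain dict lookup raises. A minimal reproduction sketch, with hypothetical stand-in data rather than real `ceph-volume` output:

```python
# Minimal reproduction of the KeyError above. The function mirrors the
# report loop in ceph_volume/devices/raw/list.py, but the label dicts
# are hypothetical stand-ins for real device labels.

def generate(oj):
    """Build a report from per-device bluestore labels."""
    report = {}
    for dev in oj.keys():
        # This is the failing line: a corrupted/blank device yields a
        # label entry that lacks the 'whoami' (OSD id) key.
        whoami = oj[dev]['whoami']
        report[dev] = {'osd_id': whoami}
    return report

# A healthy device label includes 'whoami'; the corrupted PV's does not.
healthy = {'/dev/sdb': {'whoami': '0', 'osd_uuid': 'abc'}}
corrupted = {'/mnt/default-1-data-0f8266': {'osd_uuid': 'abc'}}

print(generate(healthy))
try:
    generate(corrupted)
except KeyError as e:
    print('KeyError:', e)  # same KeyError: 'whoami' as in the log
```

This is why `raw list` crashes just from inspecting the volume: the failure is in reading the on-disk label, before any OSD logic runs.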
The PV seems corrupted. This is just a new cluster, right? Can you wipe it and allow a new OSD to be created? You should just need to run the job template to purge OSD 1. Are you familiar with that job template?
Yes, this was a new cluster (when the addon was deployed). I am not familiar with that job template. Any links to an SOP?

Here are instructions on running the osd removal job: https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.8/html/deploying_and_managing_openshift_container_storage_using_red_hat_openstack_platform/replacing-storage-devices_osp

Thanks for the link. I followed the procedure and performed the following steps:
1. Delete the pvc `default-1-data-0f8266`
$ oc delete -n openshift-storage pvc default-1-data-0f8266
2. Remove finalizer (should be added to docs):
$ oc patch -n openshift-storage pvc default-1-data-0f8266 -p '{"metadata":{"finalizers":null}}'
3. Can't delete the associated PV as we don't have sufficient backplane permissions (get,list,watch):
$ oc auth can-i --list -n openshift-storage | grep persistentvolume
persistentvolumeclaims [] [] [get list watch create delete deletecollection patch update]
persistentvolumeclaims/status [] [] [get list watch]
persistentvolumes [] [] [get list watch]
4. Run the job template for OSD.1:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=1 | oc create -n openshift-storage -f -
5. The job keeps spinning up pods that fail. Inspecting the logs:
$ oc logs -n openshift-storage pod/ocs-osd-removal-job--1-97pmd
2022-02-03 19:58:24.508518 I | rookcmd: starting Rook v4.8.5-1 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=1'
2022-02-03 19:58:24.508617 I | rookcmd: flag values: --help=false, --log-flush-frequency=5s, --log-level=DEBUG, --operator-image=, --osd-ids=1, --service-account=
2022-02-03 19:58:24.508625 I | op-mon: parsing mon endpoints: c=172.30.208.123:6789,a=172.30.211.71:6789,b=172.30.144.228:6789
2022-02-03 19:58:24.523431 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2022-02-03 19:58:24.523551 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2022-02-03 19:58:24.523625 D | cephclient: config file @ /etc/ceph/ceph.conf: [global]
fsid = 6c31c42c-6328-48f7-8c08-b7877b484c8c
mon initial members = a b c
mon host = [v2:172.30.211.71:3300,v1:172.30.211.71:6789],[v2:172.30.144.228:3300,v1:172.30.144.228:6789],[v2:172.30.208.123:3300,v1:172.30.208.123:6789]
bdev_flock_retry = 20
mon_osd_full_ratio = .85
mon_osd_backfillfull_ratio = .8
mon_osd_nearfull_ratio = .75
mon_max_pg_per_osd = 600
mon_pg_warn_max_object_skew = 0
[osd]
osd_memory_target_cgroup_limit_ratio = 0.5
[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring
2022-02-03 19:58:24.523760 D | exec: Running command: ceph osd dump --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/433535554
2022-02-03 19:58:24.825215 I | cephosd: validating status of osd.1
failed to get osd status for osd 1: not found osd.1 in OSDDump
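The "not found osd.1 in OSDDump" failure amounts to the removal job parsing `ceph osd dump --format json` and not finding the requested id in the `osds` array. A sketch of that check, using a trimmed hypothetical dump (and an illustrative helper, not Rook's actual function):

```python
# Sketch of the validation behind "not found osd.1 in OSDDump".
# The JSON below is a trimmed, hypothetical `ceph osd dump --format json`
# output for a cluster where osd.1 has already been purged.
import json

osd_dump_json = '''
{"osds": [{"osd": 0, "up": 1, "in": 1},
          {"osd": 2, "up": 1, "in": 1}]}
'''

def find_osd_status(dump, osd_id):
    """Return the dump entry for osd_id, or None if it is absent."""
    for entry in dump.get('osds', []):
        if entry['osd'] == osd_id:
            return entry
    return None

dump = json.loads(osd_dump_json)
if find_osd_status(dump, 1) is None:
    # Matches the job's error: osd.1 is already gone from Ceph.
    print('failed to get osd status for osd 1: not found osd.1 in OSDDump')
```

So the removal job failing here is actually a sign the purge already happened, which is what the next comment concludes.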
It seems osd.1 is already removed from Ceph:

2022-02-03 19:58:24.825215 I | cephosd: validating status of osd.1
failed to get osd status for osd 1: not found osd.1 in OSDDump

Did the osd prepare job and OSD get deleted? If the OSD is purged from Ceph now, the operator should be able to create a new OSD to replace osd.1. If the old osd.1 was purged and you're not seeing a new OSD be created automatically, try restarting the operator. Note that Ceph will re-use OSD IDs, so the new OSD will likely also be called osd.1.

I removed the prepare job manually:
$ oc delete -n openshift-storage jobs/rook-ceph-osd-prepare-default-1-data-xxxx
Now waiting on the operator to do its magic. It keeps producing this log:
2022-02-03 21:12:33.616357 I | clusterdisruption-controller: all "zone" failure domains: [us-east-2b us-east-2c]. osd is down in failure domain: "". active node drains: false. pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized Count:73} {StateName:active+undersized+degraded Count:23}]"
2022-02-03 21:12:33.617460 I | op-k8sutil: returning version v1.22.3 instead of v1.22.3+e790d7f
2022-02-03 21:12:33.618209 I | op-k8sutil: returning version v1.22.3 instead of v1.22.3+e790d7f
2022-02-03 21:13:03.625052 I | op-k8sutil: returning version v1.22.3 instead of v1.22.3+e790d7f
2022-02-03 21:13:03.625759 I | op-k8sutil: returning version v1.22.3 instead of v1.22.3+e790d7f
Both `$ ceph status` and `$ ceph osd status` confirm that osd.1 is gone.
Those messages from the disruption controller are expected until the new OSD is up. Please try restarting the operator (delete the rook operator pod) to trigger a new reconcile for the OSDs.

Manually deleting the operator pod triggered a reconcile, and:
- rook-ceph-osd-prepare-default-1-data-0bbp5l--1-86bwv job ran
- rook-ceph-osd-1-565c85fd5-8rk6c pod was created
--- $ ceph status ---
sh-4.4$ ceph status
cluster:
id: 6c31c42c-6328-48f7-8c08-b7877b484c8c
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 7d)
mgr: a(active, since 7d)
mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
osd: 3 osds: 3 up (since 62s), 3 in (since 62s)
data:
pools: 3 pools, 96 pgs
objects: 34 objects, 43 KiB
usage: 3.0 GiB used, 3.0 TiB / 3 TiB avail
pgs: 96 active+clean
io:
client: 853 B/s rd, 1 op/s rd, 0 op/s wr
--- $ ceph osd status ---
sh-4.4$ ceph osd status
+----+----------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
| id | host | used | avail | wr ops | wr data | rd ops | rd data | state |
+----+----------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
| 0 | ip-10-177-160-100.us-east-2.compute.internal | 1024M | 1022G | 0 | 0 | 0 | 0 | exists,up |
| 1 | ip-10-177-160-54.us-east-2.compute.internal | 1024M | 1022G | 0 | 0 | 0 | 0 | exists,up |
| 2 | ip-10-177-160-169.us-east-2.compute.internal | 1024M | 1022G | 0 | 0 | 2 | 106 | exists,up |
+----+----------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
--- Note ---
It seems we did not need to remove the PV after all. Do you think we will ever need to delete PVs with backplane access in the future? I created this follow-up ticket, if you could please comment: https://issues.redhat.com/browse/MTSRE-430
Attaching logs from the new operator pod as 'rook-ceph-operator-new-logs.txt'.
Thank you, I think we were able to resolve the issue.
--- Open questions to investigate ---

1. Will we be required to delete Persistent Volumes?
2. Why did we have to manually delete the rook-ceph-operator pod to trigger a reconciliation?
3. Why did OCS deploy with a corrupted osd.1, and why was the operator unable to reconcile and fix it?
4. Why did the rook-ceph-osd-prepare-default-1-data job succeed when it really deployed a corrupted osd.1?
5. Why did the rook-ceph-operator not detect the missing osd.1 pod?

Great to see the cluster is healthy again with all three OSDs.

(In reply to Samuel Blais-Dowdy from comment #17)

> 1. Will we be required to delete Persistent Volumes?

You shouldn't need to delete PVs directly; deleting the PVC should be sufficient. PVs generally have a deletion policy of Reclaim, but it's defined by the storage class, which you may not have control of.

> 2. Why did we have to manually delete the rook-ceph-operator pod to trigger a reconciliation?

The Rook operator watches for many types of events to automatically trigger a reconcile. For example, if the CephCluster CR is updated, or if any deployment in the Rook namespace is deleted, a new reconcile is started automatically. But Rook does not trigger a reconcile when a job is deleted, which is what happened in this case.

> 3. Why did OCS deploy with a corrupted osd.1, and why was the operator unable to reconcile and fix it?

I've not seen reports of a corrupt OSD immediately after deployment like this. It seems like a rare condition, but if you see it again or multiple times we should investigate further. Sometimes the underlying storage just goes bad.

> 4. Why did the rook-ceph-osd-prepare-default-1-data job succeed when it really deployed a corrupted osd.1?

The OSD did fail because of the corrupt disk. But even if the job had failed, Rook wouldn't be able to fix it automatically; Rook expects manual intervention when the underlying storage is bad.

> 5. Why did the rook-ceph-operator not detect the missing osd.1 pod?

The operator log did show it was waiting for the other OSD to start. The operator just couldn't recover automatically, so it was stuck.

One thing that bothers me is that the osd.1 prepare job was in a failed state (Python traceback), but reported Completed/Success:
2022-01-26 23:10:36.953704 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/default-1-data-0f8266 --format json
2022-01-26 23:10:37.200585 E | cephosd: . Traceback (most recent call last):
File "/usr/sbin/ceph-volume", line 11, in <module>
load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 39, in __init__
self.main(self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
return f(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 151, in main
terminal.dispatch(self.mapper, subcommand_args)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
terminal.dispatch(self.mapper, self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main
self.list(args)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list
report = self.generate(args.device)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate
whoami = oj[dev]['whoami']
KeyError: 'whoami'
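The Completed status despite the traceback is consistent with how Job/pod status works: Kubernetes only looks at the main container process's exit code. If a wrapper tolerates a failing subprocess and itself exits 0, the Job reports success. A minimal sketch of that mechanism (a hypothetical wrapper, not Rook's actual code):

```python
import subprocess
import sys

# Hypothetical wrapper: run a child process that dies with the same
# KeyError as above, but tolerate the failure. A Kubernetes Job only
# sees the wrapper's own exit code, so the pod still reports Completed.
child = subprocess.run(
    [sys.executable, '-c', "raise KeyError('whoami')"],
    capture_output=True, text=True,
)
print('child exit code:', child.returncode)           # non-zero
print('traceback in stderr:', 'KeyError' in child.stderr)  # True
# The wrapper ignores child.returncode and finishes normally, so the
# container exits 0 and the Job shows Completed/Success.
```

This suggests the prepare job's wrapper treated the `raw list` failure as non-fatal, which would explain the Success status.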
I created a post-mortem meeting and invited ocs-dedicated. Please let me know if I should invite anyone else (or use a different email), and feel free to request a different time. Thank you.

Please invite me as well Samuel

Can we close this issue now, or are we waiting for the post-mortem?

We can close it. And track any action items that arise from the post-mortem in Jira.

Thank you Travis!