Bug 2082089

Summary: [GSS] OSD pods stuck in CLBO
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: rook
Version: 4.8
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Hardware: Unspecified
OS: Unspecified
Reporter: amansan <amanzane>
Assignee: Blaine Gardner <brgardne>
QA Contact: Neha Berry <nberry>
CC: brgardne, hnallurv, madam, ocs-bugs, odf-bz-bot, tnielsen
Type: Bug
Last Closed: 2022-05-26 14:38:39 UTC

Comment 3 Blaine Gardner 2022-05-09 19:34:36 UTC
From the must-gathers, I see that OSDs 21 and 26 are both failing with errors contacting the Ceph mons. This is often an authentication issue.

> 2022-03-22T14:05:57.075404842+01:00 debug 2022-03-22 13:05:57.074 7fc654dcc700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
> 2022-03-22T14:05:57.076798257+01:00 debug 2022-03-22 13:05:57.075 7fc6555cd700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
> 2022-03-22T14:05:57.076968465+01:00 debug 2022-03-22 13:05:57.075 7fc6545cb700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
> 2022-03-22T14:05:57.077797726+01:00 failed to fetch mon config (--no-mon-config to skip)

`ceph auth ls` is missing an entry for osd.26, but there is an entry for osd.21.
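
For example, this can be checked from the rook-ceph toolbox (assuming the default openshift-storage namespace and toolbox deployment name):

  oc -n openshift-storage rsh deploy/rook-ceph-tools
  ceph auth get osd.21   # returns the key and caps for the OSD that still has an entry
  ceph auth get osd.26   # fails with ENOENT because the entry is missing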

On Mar 22, Valentina reported: "I have the suspicion that these 2 disks are from a storage class that was deleted to give the disks directly to OCS (the PVC was deleted, then the PV, and finally the SC)."

Given my first impressions above about auth issues, and given Valentina's statement, I suspect the underlying disks were wiped without using the proper OSD removal process in OCS, leaving stale auth entries that are no longer valid.

@amanzane , for the failing OSDs, please use the documented process linked below to remove them. I will also propose a few modifications to the process, since we are not replacing failed disks but instead wiping existing ones. If new OSDs do not come online after this process, we will continue debugging.

Before step 7 of the process, please collect another must-gather. It may be helpful if the process does not resolve the issue.
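
For reference, the must-gather can be collected with a command like the following (the image tag shown is for 4.8 and may need adjusting for your cluster):

  oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.8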

Before step 9, use `sgdisk --zap-all` to wipe the disk associated with the PV. This removes the old Ceph OSD information from the disk.
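
For example, from a debug shell on the node that hosts the disk (the node and device names below are placeholders; substitute the actual values for your PV):

  oc debug node/<node-name>
  chroot /host
  sgdisk --zap-all /dev/<device>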

Since we are not replacing a failed disk, you can ignore step 10. I believe the Local Storage Operator (LSO) should start re-adding the disk after step 9.

I encourage you to run the process for one OSD at a time rather than for both OSDs simultaneously, to minimize the risk of manual errors.

OSD removal process: https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.8/html/replacing_devices/openshift_container_storage_deployed_using_local_storage_devices#replacing-operational-or-failed-storage-devices-on-clusters-backed-by-local-storage-devices_rhocs

If new OSDs fail to come online, or if different OSDs begin failing, please capture a new must-gather at that point and include both it and the must-gather taken before step 7 in your next update.


- - - - - - - - - -


On another note, likely unrelated to the bug itself: I see in the customer case the note copied below. There is no supported field in `storageClassDeviceSets` named `replica`. Additionally, from the must-gathers I have, this field is not present as part of the CephCluster, so I am confused about why it is being discussed in the customer case.

> + I asked the customer about this change
> 
> [...]
>    storageDeviceSets:
>     - config: {}
>       count: 41 <---------------------- 
>       dataPVCTemplate:
>         metadata: {}
>         spec:
>           accessModes:
>           - ReadWriteOnce
>           resources:
>             requests:
>               storage: "1"
>           storageClassName: scadif
>           volumeMode: Block
>         status: {}
>       name: ocs-deviceset-scadif
>       placement: {}
>       preparePlacement: {}
>       replica: 1 <------------------------ should be 3
>       resources: {}
>     version: 4.8.0
> [...]
> 
> As the DeviceSets are configured for replica 3, someone had to change these two options, so I asked why they were changed.
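
For context, a sketch of how those two fields conventionally relate in the StorageCluster spec (values below are illustrative, not taken from this cluster): the total number of OSDs created is count x replica, so a 3-way replicated device set would look roughly like:

  storageDeviceSets:
  - name: ocs-deviceset-scadif
    count: 14      # illustrative; count x replica gives the total OSD count
    replica: 3     # the usual default, spreading OSDs across failure domains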