Bug 2089398
| Summary: | [GSS] OSD pods CLBO after upgrade to 4.10 from 4.9. | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | khover |
| Component: | rook | Assignee: | Sébastien Han <shan> |
| Status: | CLOSED ERRATA | QA Contact: | Neha Berry <nberry> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10 | CC: | agantony, assingh, hnallurv, kramdoss, madam, mduasope, muagarwa, ocs-bugs, odf-bz-bot, olakra, pbalogh, robertodocampo, shan, tnielsen |
| Target Milestone: | --- | Flags: | khover: needinfo- |
| Target Release: | ODF 4.10.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | OpenShift Data Foundation clusters that were upgraded from version 4.3 to 4.10 run Logical Volume Manager (LVM) based Object Storage Devices (OSDs). After the upgrade, these clusters crashed because an invalid argument was passed to the LVM-based OSDs. With this update, the rook operator reconciles the correct arguments for the LVM-based OSDs and the invalid argument is no longer added, so the LVM-based OSDs run successfully after upgrading to version 4.10.3. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-06-14 12:26:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
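As context for the affected configuration called out in the Doc Text, here is a hypothetical way to check whether a cluster runs LVM-backed OSDs and is therefore in the affected group; the openshift-storage namespace, the app=rook-ceph-osd label, and the container index are assumptions based on a typical ODF deployment, not something stated in this bug.

```sh
# Sketch only: list each OSD deployment and the value of its
# ROOK_LV_BACKED_PV environment variable; "true" indicates LVM-backed OSDs.
oc -n openshift-storage get deployments -l app=rook-ceph-osd \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[0].env[?(@.name=="ROOK_LV_BACKED_PV")].value}{"\n"}{end}'
```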
Will track with 2089397, this looks like a mistaken dup.

*** This bug has been marked as a duplicate of bug 2089397 ***

Actually, let's use this BZ to track the 4.10 backport. All OSDs on PVs using lvm mode would be affected by this. Few clusters would be affected, but it is a blocking issue for OSDs on those clusters. The fix is very simple and low risk; we should include it in the next 4.10.z.

Fix is merged upstream, we just need to open the backport PR for 4.10.z.

Hi Travis, Sébastien,
Are there any steps that the customer needs to fix their cluster? If yes, please provide them and I will pass them along.

Hi Kevan,
Unfortunately, there are none: if you edit the osd deployment spec it will be reverted by the rook operator. The only fix is to wait for the next z-stream, but I understand that it might not be realistic. I'm not sure if we can provide a hotfix or not. Mudit, thoughts? When is 4.10.4 planned? Thanks

4.10.4 is almost a month away, let me check if we can include it in 4.10.3, which is going to be released on 7th June.

(In reply to Mudit Agarwal from comment #7)
> 4.10.4 is almost a month away, let me check if we can include it in 4.10.3,
> which is going to be released on 7th June.

Thanks, it's a really small change with low scope.

QE is asking for verification steps, please update the same.

Kevan, please confirm the initial version installed of the cluster, thanks.

Hi Sébastien,
From the cu: the OCP cluster and the (old brand) OCS cluster were installed on 4.3.x versions.

Thanks Kevan. For QE to repro:

* install a 4.3 cluster
* upgrade all the way up to 4.10
* the OSDs should run normally after the upgrade

Hi Sébastien,
A hotfix may be needed for the cu based on the following variables (unless I'm missing something).
The fix is editing the osd deployment to remove /rook/rook; in the current state the ocs operator is scaled to 0 so that it does not reconcile and stomp on the edit.
Upgrading to 4.10.x to get the fix may not be possible if ocs-operator is 0/1?
I'm not aware of any workaround for that scenario.

(In reply to khover from comment #22)
> Hi Sébastien,
>
> A hotfix may be needed for the cu based on the following variables (unless
> I'm missing something).
>
> The fix is editing the osd deployment to remove /rook/rook; in the current
> state the ocs operator is scaled to 0 so that it does not reconcile and
> stomp on the edit.
>
> Upgrading to 4.10.x to get the fix may not be possible if ocs-operator is
> 0/1?
>
> I'm not aware of any workaround for that scenario.

Hi Sébastien,
Disregard, all operators are 1/1. Removing my needinfo.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.10.3 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5023

To make it pass and resume the service, you can edit the rook-ceph-osd deployment:

* OSD startup line
* add -> "--lv-backed-pv", "true"

The daemon should successfully start after this.

(In reply to Sebastian Han from comment #36)
> To make it pass and resume the service, you can edit the rook-ceph-osd
> deployment:
>
> * OSD startup line
> * add -> "--lv-backed-pv", "true"
>
> The daemon should successfully start after this.

Hey Sebastian,
Just to be clear, is this correct?
containers:
- args:
  - ceph
  - osd
  - start
  - -- lv-backed-pv=true
  - --foreground
  - --id
  - "1"
  - --fsid
  - 42e1ae07-9402-4cc9-b1a4-a1fe127e6ebc
  - --cluster
  - ceph
  - --setuser

Changing the envvar below... the osd starts successfully:

- name: ROOK_LV_BACKED_PV
  value: "true"
Hello Team,
Please help me understand where and how in the osd deployment this needs to be fixed.
I need to pass this to the customer and I have two versions.
1.
> * OSD startup line
> * add -> "--lv-backed-pv", "true"
2.
Changing the envvar below... the osd starts successfully
- name: ROOK_LV_BACKED_PV
  value: "true"
Comment from the cu: After changing the envvar in the osd deployments, the cluster starts and after a while the status becomes operational.

First of all... thanks a lot!!! I'm not very confident about the side effects on my cluster since the last upgrades. Comparing my development cluster (installed months ago) and my production cluster (installed years ago), I see that the newer cluster is not using LVM for ODF. Can we discuss whether it would be more convenient for me to delete the current ODF cluster (saving the data first) and do a fresh installation that does not use LVM?

(In reply to khover from comment #39)
> Hello Team,
>
> Please help me understand where and how in the osd deployment this needs to
> be fixed.
>
> I need to pass this to the customer and I have two versions.
>
> 1.
>
> > * OSD startup line
> > * add -> "--lv-backed-pv", "true"
>
> 2.
>
> Changing the envvar below... the osd starts successfully
>
> - name: ROOK_LV_BACKED_PV
>   value: "true"

The env variant of the fix is easier, go with this one.
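For reference, a minimal sketch of the env-variable workaround described above, assuming the default openshift-storage namespace and OSD deployments named rook-ceph-osd-0/1/2 (adjust to your cluster). As noted earlier in the thread, the operator can revert manual edits when it reconciles, so pausing the operators until the fixed 4.10.z is installed may be necessary; the operator deployment names below are assumptions.

```sh
# Optional: pause reconciliation so the edit is not reverted
# (deployment names are assumptions; verify them on your cluster).
oc -n openshift-storage scale deployment ocs-operator rook-ceph-operator --replicas=0

# Set ROOK_LV_BACKED_PV=true on each CrashLooping OSD deployment.
for d in rook-ceph-osd-0 rook-ceph-osd-1 rook-ceph-osd-2; do
  oc -n openshift-storage set env deployment/"$d" ROOK_LV_BACKED_PV=true
done

# Watch the OSD pods restart.
oc -n openshift-storage get pods -l app=rook-ceph-osd -w
```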
Description of problem:

OSD pods CLBO after upgrade to 4.10 from 4.9.

rook-ceph-osd-0-7c5b8797dc-jpk4w   1/2   CrashLoopBackOff   29 (3m18s ago)   95m
rook-ceph-osd-1-676cbfb684-fcccr   1/2   CrashLoopBackOff   28 (5s ago)      84m
rook-ceph-osd-2-89bb9dbd9-p56b2    1/2   CrashLoopBackOff   11 (4m25s ago)   36m

I edited each of these 3 deployments (exactly the 3 deployments that I see in CrashLoopBackOff state) and removed the "/rook/rook" entry from the args:

rook-ceph-osd-0   1/1   1   1   46h
rook-ceph-osd-1   1/1   1   1   46h
rook-ceph-osd-2   1/1   1   1   7h7m

containers:
- args:
  - /rook/rook   <-- I removed this line
  - ceph
  - osd
  - start
  - --
  - --foreground
  - --id
  - "1"
  - --fsid
  - 42e1ae07-9402-4cc9-b1a4-a1fe127e6ebc
  - --cluster
  - ceph
  - --setuser
  - ceph
  - --setgroup
  - ceph
  - --crush-location=root=default host=xxxocpocsxxxs02 rack=rack2
  - --log-to-stderr=true
  - --err-to-stderr=true
  - --mon-cluster-log-to-stderr=true
  - '--log-stderr-prefix=debug '
  - --default-log-to-file=false
  - --default-mon-cluster-log-to-file=false
  - --ms-learn-addr-from-peer=false
  command:
  - /rook/rook

After that, the osd runs fine and ceph is available.

PS: The broken state is easy to reproduce. If I delete one of the mentioned deployments (oc delete deployment rook-ceph-osd-1, for example), the operator starts the reconciliation process and breaks my cluster again.

Version-Release number of selected component (if applicable):

NAME                   DISPLAY                     VERSION   REPLACES              PHASE
odf-operator.v4.10.2   OpenShift Data Foundation   4.10.2    odf-operator.v4.9.6   Succeeded

How reproducible:
Customer deletes the deployment and the issue is reproduced.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
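For completeness, a rough and hypothetical sketch of the args-removal workaround the reporter describes in the description above, assuming the openshift-storage namespace, that the OSD container is the first container in the deployment, and that "/rook/rook" is its first args entry. As described in the comments, the operator reconciles the change away unless it is scaled down, so this would only be a stop-gap until the 4.10.3 fix is applied.

```sh
# Pause reconciliation first (operator deployment names are assumptions).
oc -n openshift-storage scale deployment ocs-operator rook-ceph-operator --replicas=0

# Drop the first args entry ("/rook/rook") from each CrashLooping OSD deployment.
# Confirm the container and args indexes first, e.g.:
#   oc -n openshift-storage get deployment rook-ceph-osd-0 -o yaml
for d in rook-ceph-osd-0 rook-ceph-osd-1 rook-ceph-osd-2; do
  oc -n openshift-storage patch deployment "$d" --type=json \
    -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/args/0"}]'
done

# Once the cluster is upgraded to 4.10.3, scale the operators back up so
# normal reconciliation resumes.
oc -n openshift-storage scale deployment ocs-operator rook-ceph-operator --replicas=1
```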