Bug 2089398

Summary: [GSS]OSD pods CLBO after upgrade to 4.10 from 4.9.
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: khover
Component: rook
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: Neha Berry <nberry>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.10
CC: agantony, assingh, hnallurv, kramdoss, madam, mduasope, muagarwa, ocs-bugs, odf-bz-bot, olakra, pbalogh, robertodocampo, shan, tnielsen
Target Milestone: ---
Flags: khover: needinfo-
Target Release: ODF 4.10.3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
OpenShift Data Foundation clusters that have been upgraded all the way from version 4.3 to 4.10 run Logical Volume Manager (LVM) based Object Storage Devices (OSDs). Previously, these upgraded clusters crashed due to an invalid argument passed to the LVM-based OSDs. With this update, the rook operator reconciles the correct arguments for the LVM-based OSDs and the invalid argument is no longer added, so the LVM-based OSDs run successfully after upgrading to version 4.10.3.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-06-14 12:26:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description khover 2022-05-23 15:00:22 UTC
Description of problem:

OSD pods CLBO after upgrade to 4.10 from 4.9.

rook-ceph-osd-0-7c5b8797dc-jpk4w                                  1/2     CrashLoopBackOff    29 (3m18s ago)   95m
rook-ceph-osd-1-676cbfb684-fcccr                                  1/2     CrashLoopBackOff    28 (5s ago)      84m
rook-ceph-osd-2-89bb9dbd9-p56b2                                   1/2     CrashLoopBackOff    11 (4m25s ago)   36m

I edited each of these 3 deployments (exactly the 3 deployments that I see in CrashLoopBackOff state) and removed "/rook/rook" from the args:
rook-ceph-osd-0                                      1/1     1            1           46h
rook-ceph-osd-1                                      1/1     1            1           46h
rook-ceph-osd-2                                      1/1     1            1           7h7m



      containers:
      - args:
        - /rook/rook <-- I Removed this line
        - ceph
        - osd
        - start
        - --
        - --foreground
        - --id
        - "1"
        - --fsid
        - 42e1ae07-9402-4cc9-b1a4-a1fe127e6ebc
        - --cluster
        - ceph
        - --setuser
        - ceph
        - --setgroup
        - ceph
        - --crush-location=root=default host=xxxocpocsxxxs02 rack=rack2
        - --log-to-stderr=true
        - --err-to-stderr=true
        - --mon-cluster-log-to-stderr=true
        - '--log-stderr-prefix=debug '
        - --default-log-to-file=false
        - --default-mon-cluster-log-to-file=false
        - --ms-learn-addr-from-peer=false
        command:
        - /rook/rook


After that, the OSD runs fine and Ceph is available.


PS: The broken state is easy to reproduce. If I delete one of the edited deployments (oc delete deployment rook-ceph-osd-1, for example), the operator starts the reconciliation process and breaks my cluster again.
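
For reference, a minimal shell sketch of that reproduction (the openshift-storage namespace and the app=rook-ceph-osd label are assumptions):

oc -n openshift-storage delete deployment rook-ceph-osd-1
# the rook operator recreates the deployment with the extra /rook/rook argument
oc -n openshift-storage get pods -l app=rook-ceph-osd -w
# the recreated OSD pod goes back into CrashLoopBackOff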


Version-Release number of selected component (if applicable):
NAME                              DISPLAY                            VERSION    REPLACES                           PHASE

odf-operator.v4.10.2              OpenShift Data Foundation          4.10.2     odf-operator.v4.9.6                Succeeded

How reproducible:

The customer deletes the deployment and the issue is reproduced.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Travis Nielsen 2022-05-23 15:18:35 UTC
Will track with 2089397, this looks like a mistaken dup

*** This bug has been marked as a duplicate of bug 2089397 ***

Comment 3 Travis Nielsen 2022-05-23 15:23:08 UTC
Actually, let's use this BZ to track the 4.10 backport. All OSDs on PVs using lvm mode would be affected by this. Only a few clusters would be affected, but it is a blocking issue for OSDs on those clusters.

The fix is very simple and low risk; we should include it in the next 4.10.z.
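
A quick way to check whether a given OSD deployment carries the invalid argument (a sketch only; the openshift-storage namespace is an assumption, and rook-ceph-osd-0 is one of the deployment names from the description):

oc -n openshift-storage get deployment rook-ceph-osd-0 -o yaml | grep -B2 -A2 '/rook/rook'
# affected deployments show /rook/rook both as the container command and,
# incorrectly, as the first entry of the container args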

Comment 4 Travis Nielsen 2022-05-23 15:57:57 UTC
Fix is merged upstream, we just need to open the backport PR for 4.10.z

Comment 5 khover 2022-05-23 17:09:15 UTC
Hi Travis, Sébastien,

Are there any steps the customer needs to take to fix their cluster?

If yes, please provide them and I will pass them along.

Comment 6 Sébastien Han 2022-05-24 08:21:18 UTC
Hi Kevan,

Unfortunately, there are none; if you edit the OSD deployment spec, it will be reverted by the rook operator.
The only fix is to wait for the next z-stream, but I understand that might not be realistic.

I'm not sure if we can provide a hotfix or not.
Mudit, thoughts? When is 4.10.4 planned?

Thanks

Comment 7 Mudit Agarwal 2022-05-24 09:58:47 UTC
4.10.4 is almost a month away, let me check if we can include it in 4.10.3 which is going to be released on 7th June.

Comment 8 Sébastien Han 2022-05-24 10:00:07 UTC
(In reply to Mudit Agarwal from comment #7)
> 4.10.4 is almost a month away, let me check if we can include it in 4.10.3
> which is going to be released on 7th June.

Thanks, it's a really small change with low scope.

Comment 9 Mudit Agarwal 2022-05-24 10:04:21 UTC
QE is asking for verification steps; please provide them.

Comment 10 Sébastien Han 2022-05-24 12:40:53 UTC
Kevan, please confirm the initial version the cluster was installed with. Thanks.

Comment 11 khover 2022-05-24 15:12:46 UTC
Hi Sébastien,

From the cu:

The OCP cluster and the (old branding) OCS cluster were installed at 4.3.x versions.

Comment 12 Sébastien Han 2022-05-25 07:19:48 UTC
Thanks Kevan,

For QE to repro:

* install a 4.3 cluster
* upgrade all the way up to 4.10
* the OSDs should run normally after the upgrade
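
A minimal verification sketch after the upgrade (the openshift-storage namespace and the app=rook-ceph-osd label are assumptions):

oc -n openshift-storage get pods -l app=rook-ceph-osd
# all OSD pods should be Running, with no CrashLoopBackOff restarts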

Comment 22 khover 2022-05-26 14:55:56 UTC
Hi Sébastien,

A hotfix may be needed for the customer based on the following variables (unless I'm missing something).


The fix is to edit the OSD deployment to remove /rook/rook and then scale the ocs-operator to 0 so it does not reconcile and stomp on the edit.


Upgrading to 4.10.x to get the fix may not be possible if the ocs-operator is 0/1?


I'm not aware of any workaround for that scenario.
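
For reference, a minimal sketch of pausing reconciliation so a manual edit to the OSD deployments is not reverted (the openshift-storage namespace and the operator deployment names are assumptions; the operators must be scaled back up afterwards):

oc -n openshift-storage scale deployment ocs-operator --replicas=0
oc -n openshift-storage scale deployment rook-ceph-operator --replicas=0
# edit the rook-ceph-osd deployments, then restore reconciliation later:
oc -n openshift-storage scale deployment ocs-operator --replicas=1
oc -n openshift-storage scale deployment rook-ceph-operator --replicas=1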

Comment 23 khover 2022-05-26 15:48:51 UTC
(In reply to khover from comment #22)
> Hi Sébastien,
> 
> A hotfix may be needed for the customer based on the following variables
> (unless I'm missing something).
> 
> 
> The fix is to edit the OSD deployment to remove /rook/rook and then scale
> the ocs-operator to 0 so it does not reconcile and stomp on the edit.
> 
> 
> Upgrading to 4.10.x to get the fix may not be possible if the ocs-operator
> is 0/1?
> 
> 
> I'm not aware of any workaround for that scenario.

Hi Sébastien,

Disregard, all operators are 1/1

Comment 25 Sébastien Han 2022-06-01 08:23:09 UTC
Removing my needinfo.

Comment 34 errata-xmlrpc 2022-06-14 12:26:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.10.3 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5023

Comment 36 Sébastien Han 2022-06-15 13:40:30 UTC
To make it pass and resume the service, you can edit the rook-ceph-osd deployment:

* OSD startup line
* add -> "--lv-backed-pv", "true"

The daemon should successfully start after this.

Comment 37 khover 2022-06-15 13:53:42 UTC
(In reply to Sébastien Han from comment #36)
> To make it pass and resume the service, you can edit the rook-ceph-osd
> deployment:
> 
> * OSD startup line
> * add -> "--lv-backed-pv", "true"
> 
> The daemon should successfully start after this.

Hey Sébastien,

Just to be clear, is this correct?

containers:
      - args:
        - ceph
        - osd
        - start
        - -- lv-backed-pv=true
        - --foreground
        - --id
        - "1"
        - --fsid
        - 42e1ae07-9402-4cc9-b1a4-a1fe127e6ebc
        - --cluster
        - ceph
        - --setuser

Comment 38 Roberto Docampo Suarez 2022-06-15 13:56:39 UTC
Changing the envvar below... the osd starts successfully

        - name: ROOK_LV_BACKED_PV
          value: "true"
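
For context, a minimal sketch of where that entry sits in the OSD deployment container spec (the container name and surrounding fields are assumptions; only the ROOK_LV_BACKED_PV entry comes from this comment):

      containers:
      - name: osd            # assumed container name
        env:
        - name: ROOK_LV_BACKED_PV
          value: "true"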

Comment 39 khover 2022-06-15 14:21:42 UTC
Hello Team,

Please help me understand where and how in the osd deployment this needs to be fixed.

I need to pass this to the customer and I have two versions.

1.

> * OSD startup line
> * add -> "--lv-backed-pv", "true"

2.

Changing the envvar below... the osd starts successfully

        - name: ROOK_LV_BACKED_PV
          value: "true"

Comment 40 khover 2022-06-15 14:24:37 UTC
Comment from the customer:


After changing the env var in the OSD deployments, the cluster starts and after a while the status becomes operational.

First of all... thanks a lot!!!

I'm not very confident about the side effects on my cluster after the last upgrades. Comparing my development cluster (installed months ago) with my production cluster (installed years ago), I see that the newer cluster is not using LVM for ODF.

Can we discuss whether it would be more convenient to delete the current ODF cluster (saving the data first) and do a fresh installation that does not use LVM?

Comment 41 Sébastien Han 2022-06-15 14:25:45 UTC
(In reply to khover from comment #39)
> Hello Team,
> 
> Please help me understand where and how in the osd deployment this needs to
> be fixed.
> 
> I need to pass this to the customer and I have two versions.
> 
> 1.
> 
> > * OSD startup line
> > * add -> "--lv-backed-pv", "true"
> 
> 2.
> 
> Changing the envvar below... the osd starts successfully
> 
>         - name: ROOK_LV_BACKED_PV
>           value: "true"

The env variant of the fix is easier, go with this one.
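
A minimal sketch of applying the env variant to each crash-looping OSD deployment (the openshift-storage namespace is an assumption; per comment 6, the rook operator may revert such edits until the 4.10.3 fix is installed):

oc -n openshift-storage set env deployment/rook-ceph-osd-0 ROOK_LV_BACKED_PV=true
oc -n openshift-storage set env deployment/rook-ceph-osd-1 ROOK_LV_BACKED_PV=true
oc -n openshift-storage set env deployment/rook-ceph-osd-2 ROOK_LV_BACKED_PV=true
# then check that the OSD pods come back up:
oc -n openshift-storage get pods -l app=rook-ceph-osd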