Created attachment 1882431 [details]
osd describe
Description of problem:
OSD pods are in CrashLoopBackOff (CLBO) after upgrading from 4.9 to 4.10.
rook-ceph-osd-0-7c5b8797dc-jpk4w 1/2 CrashLoopBackOff 29 (3m18s ago) 95m
rook-ceph-osd-1-676cbfb684-fcccr 1/2 CrashLoopBackOff 28 (5s ago) 84m
rook-ceph-osd-2-89bb9dbd9-p56b2 1/2 CrashLoopBackOff 11 (4m25s ago) 36m
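To collect the same data as the attached describe output, something like the following can be used (a sketch only; the openshift-storage namespace and the "osd" container name are assumptions based on a default ODF install):

oc -n openshift-storage describe pod rook-ceph-osd-0-7c5b8797dc-jpk4w
oc -n openshift-storage logs rook-ceph-osd-0-7c5b8797dc-jpk4w -c osd --previous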
I edited each of these 3 deployments (exactly the 3 deployments that I see in CrashLoopBackOff state) and removed the "/rook/rook" entry from the args.
rook-ceph-osd-0 1/1 1 1 46h
rook-ceph-osd-1 1/1 1 1 46h
rook-ceph-osd-2 1/1 1 1 7h7m
containers:
- args:
- /rook/rook   # <-- I removed this line
- ceph
- osd
- start
- --
- --foreground
- --id
- "1"
- --fsid
- 42e1ae07-9402-4cc9-b1a4-a1fe127e6ebc
- --cluster
- ceph
- --setuser
- ceph
- --setgroup
- ceph
- --crush-location=root=default host=xxxocpocsxxxs02 rack=rack2
- --log-to-stderr=true
- --err-to-stderr=true
- --mon-cluster-log-to-stderr=true
- '--log-stderr-prefix=debug '
- --default-log-to-file=false
- --default-mon-cluster-log-to-file=false
- --ms-learn-addr-from-peer=false
command:
- /rook/rook
After that, the OSD runs fine and Ceph is available (the command field already invokes /rook/rook, so the duplicated path at the top of args appears to be what breaks OSD startup).
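For reference, the same workaround can be applied without opening an editor by dropping the first args entry with a JSON patch (a sketch only; the openshift-storage namespace and container index 0 are assumptions, and the operator will revert this change on the next reconciliation):

oc -n openshift-storage patch deployment rook-ceph-osd-1 --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/args/0"}]'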
PS: The broken state is easy to reproduce. If I delete one of the edited deployments (oc delete deployment rook-ceph-osd-1, for example), the operator starts the reconciliation process and breaks my cluster again.
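If the manual edit needs to survive until a fixed build is available, one option is to pause reconciliation by scaling the operator down (a sketch, assuming the rook-ceph-operator deployment lives in openshift-storage; scale it back up once the fix is installed):

oc -n openshift-storage scale deployment rook-ceph-operator --replicas=0
# later, to resume reconciliation:
oc -n openshift-storage scale deployment rook-ceph-operator --replicas=1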
Version-Release number of selected component (if applicable):
NAME DISPLAY VERSION REPLACES PHASE
odf-operator.v4.10.2 OpenShift Data Foundation 4.10.2 odf-operator.v4.9.6 Succeeded
How reproducible:
Consistently: the customer deletes one of the edited deployments and the issue is reproduced.
Steps to Reproduce:
1. Upgrade ODF from 4.9 to 4.10.2.
2. Wait for the rook-ceph operator to reconcile the rook-ceph-osd deployments.
3. Check the rook-ceph-osd pods (or delete one of the manually edited deployments to trigger reconciliation again).
Actual results:
The rook-ceph-osd pods go into CrashLoopBackOff after the upgrade and Ceph becomes unavailable.
Expected results:
The rook-ceph-osd pods run normally after the upgrade.
Additional info:
Moving to VERIFIED based on regression testing of ODF upgrade using 4.11.0-113
ocs-ci results for OCS4-11-Downstream-OCP4-11-AWS-UPI-Proxy-3AZ-RHCOS-3M-3W-upgrade-ocs-auto (BUILD ID: 4.11.0-113 RUN ID: 1658223369)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:6156