Bug 1943596 - [Tracker for BZ #1944611][Arbiter] When zone=a is powered off and powered on, 3 mon pods (zone=b,c) go into CLBO after node power off and 2 OSDs (zone=a) go into CLBO after node power on
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Greg Farnum
QA Contact: Pratik Surve
URL:
Whiteboard:
Depends On:
Blocks: 1944611
 
Reported: 2021-03-26 14:44 UTC by Pratik Surve
Modified: 2021-05-19 09:21 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:20:51 UTC
Embargoed:


Attachments


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHSA-2021:2041  0        None      None    None     2021-05-19 09:21:36 UTC

Description Pratik Surve 2021-03-26 14:44:39 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
When the nodes in zone=a are powered off and then powered back on, 3 mon pods (zone=b,c) go into CrashLoopBackOff (CLBO) after the node power off, and 2 OSDs (zone=a) go into CLBO after the node power on.

Zone=c is the arbiter zone, and it runs on the master node.
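
A minimal sketch of how the crashing pods can be inspected, assuming the default openshift-storage namespace and the usual Rook container names ("mon", "osd"); the pod names are taken from the listing under "Actual results":

# Show pod status and restart counts
oc -n openshift-storage get pods -o wide

# Describe a crashing mon pod and pull the log of its previous (crashed) container
oc -n openshift-storage describe pod rook-ceph-mon-d-545775c64f-ch6xb
oc -n openshift-storage logs rook-ceph-mon-d-545775c64f-ch6xb -c mon --previous

# Same for a crashing OSD pod
oc -n openshift-storage logs rook-ceph-osd-1-75bb8cb584-6h9x6 -c osd --previous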

Version of all relevant components (if applicable):

OCP version:- 4.7.0-0.nightly-2021-03-25-225737
OCS version:- ocs-operator.v4.7.0-322.ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy OCP 4.7 over BM UPI
2. Deploy OCS with arbiter mode enabled
3. Write some data using app pods
4. Power off the nodes in zone=a and keep them powered off for a long time
5. Check the status of the mon pods
6. Power the nodes back on
7. Check the status of the OSD pods (example status-check commands are sketched below)
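
For steps 5 and 7, a minimal sketch of the status checks, assuming the default openshift-storage namespace and the standard Rook pod labels (app=rook-ceph-mon, app=rook-ceph-osd):

# Check mon pod status and placement across zones
oc -n openshift-storage get pods -l app=rook-ceph-mon -o wide

# Check OSD pod status after the nodes come back
oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide

# Or watch everything, as in the output captured under "Actual results"
watch "oc -n openshift-storage get pods -o wide"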


Actual results:
oc get pods -o wide                                                                                                                                                               localhost.localdomain: Fri Mar 26 20:05:07 2021

NAME                                                              READY   STATUS             RESTARTS   AGE     IP             NODE                      NOMINATED NODE   READINESS GATES
csi-cephfsplugin-2l657                                            3/3     Running            0          7h14m   10.8.128.205   argo005.ceph.redhat.com   <none>           <none>
csi-cephfsplugin-72wz4                                            3/3     Running            0          7h14m   10.8.128.209   argo009.ceph.redhat.com   <none>           <none>
csi-cephfsplugin-94hf6                                            3/3     Running            0          7h14m   10.8.128.207   argo007.ceph.redhat.com   <none>           <none>
csi-cephfsplugin-hl79c                                            3/3     Running            0          7h14m   10.8.128.206   argo006.ceph.redhat.com   <none>           <none>
csi-cephfsplugin-provisioner-5d8dd85f69-9glbq                     6/6     Running            0          7h14m   10.131.0.17    argo008.ceph.redhat.com   <none>           <none>
csi-cephfsplugin-provisioner-5d8dd85f69-z4pv5                     6/6     Running            0          3h2m    10.128.2.22    argo007.ceph.redhat.com   <none>           <none>
csi-cephfsplugin-t7x8b                                            3/3     Running            0          7h14m   10.8.128.208   argo008.ceph.redhat.com   <none>           <none>
csi-rbdplugin-9f8vf                                               3/3     Running            0          7h14m   10.8.128.207   argo007.ceph.redhat.com   <none>           <none>
csi-rbdplugin-9xsj9                                               3/3     Running            0          7h14m   10.8.128.209   argo009.ceph.redhat.com   <none>           <none>
csi-rbdplugin-bjbnp                                               3/3     Running            0          7h14m   10.8.128.205   argo005.ceph.redhat.com   <none>           <none>
csi-rbdplugin-ljlcn                                               3/3     Running            0          7h14m   10.8.128.206   argo006.ceph.redhat.com   <none>           <none>
csi-rbdplugin-mk7bw                                               3/3     Running            0          7h14m   10.8.128.208   argo008.ceph.redhat.com   <none>           <none>
csi-rbdplugin-provisioner-7b8b6fc678-5c4nc                        6/6     Running            0          3h3m    10.131.0.28    argo008.ceph.redhat.com   <none>           <none>
csi-rbdplugin-provisioner-7b8b6fc678-8s2sd                        6/6     Running            0          7h14m   10.128.2.14    argo007.ceph.redhat.com   <none>           <none>
noobaa-core-0                                                     1/1     Running            0          33m     10.131.0.35    argo008.ceph.redhat.com   <none>           <none>
noobaa-db-pg-0                                                    1/1     Running            0          7h11m   10.128.2.21    argo007.ceph.redhat.com   <none>           <none>
noobaa-endpoint-78486d6bdb-c6qlx                                  1/1     Running            0          3h3m    10.131.0.27    argo008.ceph.redhat.com   <none>           <none>
noobaa-operator-84f54fccbc-95m6z                                  1/1     Running            0          3h2m    10.131.0.34    argo008.ceph.redhat.com   <none>           <none>
ocs-metrics-exporter-75f96b94b7-7grz9                             1/1     Running            0          3h2m    10.131.0.33    argo008.ceph.redhat.com   <none>           <none>
ocs-operator-c95fddb85-fwd96                                      1/1     Running            0          3h2m    10.131.0.30    argo008.ceph.redhat.com   <none>           <none>
rook-ceph-crashcollector-argo004.ceph.redhat.com-687c49477xd89g   1/1     Running            0          7h13m   10.129.0.19    argo004.ceph.redhat.com   <none>           <none>
rook-ceph-crashcollector-argo005.ceph.redhat.com-646ccbc98rwjmv   1/1     Running            0          32m     10.129.2.11    argo005.ceph.redhat.com   <none>           <none>
rook-ceph-crashcollector-argo006.ceph.redhat.com-775cff6d6ljz7j   1/1     Running            0          33m     10.131.2.12    argo006.ceph.redhat.com   <none>           <none>
rook-ceph-crashcollector-argo007.ceph.redhat.com-5cdf8dd85kgcvm   1/1     Running            0          7h13m   10.128.2.17    argo007.ceph.redhat.com   <none>           <none>
rook-ceph-crashcollector-argo008.ceph.redhat.com-5b66bdbdb9fqkv   1/1     Running            0          7h13m   10.131.0.22    argo008.ceph.redhat.com   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-cc497cc927wh9   2/2     Running            0          3h2m    10.131.2.11    argo006.ceph.redhat.com   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-866744f66d8tc   2/2     Running            0          7h11m   10.128.2.20    argo007.ceph.redhat.com   <none>           <none>
rook-ceph-mgr-a-54649df9d8-dl68s                                  2/2     Running            2          3h2m    10.131.0.32    argo008.ceph.redhat.com   <none>           <none>
rook-ceph-mon-a-768bd6ddf4-qbnd2                                  2/2     Running            0          3h3m    10.131.2.9     argo006.ceph.redhat.com   <none>           <none>
rook-ceph-mon-b-84864978bf-scg6d                                  2/2     Running            0          3h2m    10.129.2.12    argo005.ceph.redhat.com   <none>           <none>
rook-ceph-mon-c-585dcfdb86-xhkjn                                  2/2     Running            23         7h13m   10.131.0.19    argo008.ceph.redhat.com   <none>           <none>
rook-ceph-mon-d-545775c64f-ch6xb                                  1/2     CrashLoopBackOff   23         7h13m   10.128.2.16    argo007.ceph.redhat.com   <none>           <none>
rook-ceph-mon-e-5f68b597c4-cm28h                                  2/2     Running            27         7h13m   10.129.0.18    argo004.ceph.redhat.com   <none>           <none>
rook-ceph-operator-6b87cc65f9-jpvmc                               1/1     Running            0          7h29m   10.130.2.16    argo009.ceph.redhat.com   <none>           <none>
rook-ceph-osd-0-b54d97cd6-zzgbc                                   2/2     Running            1          7h12m   10.131.0.21    argo008.ceph.redhat.com   <none>           <none>
rook-ceph-osd-1-75bb8cb584-6h9x6                                  2/2     CrashLoopBackOff            13         3h2m    10.129.2.10    argo005.ceph.redhat.com   <none>           <none>
rook-ceph-osd-2-5848bcd794-grs2v                                  2/2     Running            1          7h12m   10.128.2.19    argo007.ceph.redhat.com   <none>           <none>
rook-ceph-osd-3-597f9c66df-q8pct                                  1/2     CrashLoopBackOff   11         3h2m    10.131.2.10    argo006.ceph.redhat.com   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-localblock-1-data-0l24799sx   0/1     Completed          0          7h12m   10.131.0.20    argo008.ceph.redhat.com   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-localblock-2-data-04p5w4d47   0/1     Completed          0          7h12m   10.128.2.18    argo007.ceph.redhat.com   <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-568cc9d9g4cd   2/2     Running            0          3h2m    10.131.2.13    argo006.ceph.redhat.com   <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-8cd59754n7sq   2/2     Running            0          7h10m   10.131.0.23    argo008.ceph.redhat.com   <none>           <none>
rook-ceph-tools-5cb7b9df75-pmhdv                                  1/1     Running            0          7h5m    10.8.128.208   argo008.ceph.redhat.com   <none>           <none>

Expected results:
No pod should be in CLBO; the mon and OSD pods should recover after the power off and power on (a sketch of recovery checks follows below).
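
A minimal sketch of how recovery could be verified from the Ceph toolbox, assuming the default openshift-storage namespace and the rook-ceph-tools deployment shown in the pod listing:

# Open a shell in the toolbox pod
oc -n openshift-storage rsh deploy/rook-ceph-tools

# Inside the toolbox: confirm mon quorum and OSD health
ceph status
ceph mon stat
ceph osd tree

# List any recorded daemon crashes
ceph crash ls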


Additional info:

As per BZ https://bugzilla.redhat.com/show_bug.cgi?id=1939617, no new mons were created, which is the expected behavior.

Comment 4 Sébastien Han 2021-03-29 14:42:57 UTC
Crashing monitors are not a Rook issue; I'm moving this to "ceph" for crash investigation.
Thanks.

Comment 5 Greg Farnum 2021-03-29 17:01:03 UTC
Issue identified, generating a fix.

Comment 6 Mudit Agarwal 2021-03-30 11:04:03 UTC
Opened a BZ to track the Ceph changes: https://bugzilla.redhat.com/show_bug.cgi?id=1944611

Comment 9 Mudit Agarwal 2021-04-01 05:49:07 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1944611 is in MODIFIED state.
Will move to ON_QA once we have the container build with the fix.

Comment 10 Travis Nielsen 2021-04-07 18:37:47 UTC
This fix was in rc3

Comment 15 errata-xmlrpc 2021-05-19 09:20:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

