Bug 1943275 - OSD pods re-spun after "add capacity" on cluster with KMS
Summary: OSD pods re-spun after "add capacity" on cluster with KMS
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Sébastien Han
QA Contact: Oded
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-25 17:17 UTC by Oded
Modified: 2021-06-01 08:48 UTC
CC: 7 users

Fixed In Version: 4.7.0-324.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:20:51 UTC
Embargoed:


Attachments
rook-ceph-op logs (67.96 KB, text/plain)
2021-03-26 10:00 UTC, Oded


Links
System ID Private Priority Status Summary Last Updated
Github openshift rook pull 204 0 None closed Bug 1943275: ceph: avoid restarting all encrypted osd on cluster growth 2021-03-29 12:09:17 UTC
Github rook rook pull 7489 0 None open ceph: avoid restarting all encrypted osd on cluster growth 2021-03-26 15:23:13 UTC
Red Hat Product Errata RHSA-2021:2041 0 None None None 2021-05-19 09:21:36 UTC

Description Oded 2021-03-25 17:17:14 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
All OSD pods (old + new) are re-spun after "add capacity" on a cluster with KMS.

Version of all relevant components (if applicable):
Provider: vSphere
OCP version: 4.7.0-0.nightly-2021-03-25-091845
OCS Version: ocs-operator.v4.7.0-318.ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes


Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:

1. Install an internal cluster with KMS.

2. Get OSD pod status:
$ oc get pods | grep osd
rook-ceph-osd-0-64596cd4ff-t4qgt                                  2/2     Running     0          25m
rook-ceph-osd-1-597dc6785f-m9rsg                                  2/2     Running     0          24m
rook-ceph-osd-2-5488ccf758-f8ldz                                  2/2     Running     0          24m
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-0w2mnx-vphvs      0/1     Completed   0          26m
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-0c4qkk-kdspj      0/1     Completed   0          26m
rook-ceph-osd-prepare-ocs-deviceset-thin-2-data-0pn997-xpqpc      0/1     Completed   0          25m

3. Verify that new secrets were created on the Vault server (KMS); see the CLI sketch after these steps.

4. Add capacity via the UI (a CLI alternative is included in the sketch below).

5. Check OSD pod status:
The old OSD pods were re-spun. [Failed]

$ oc get pods | grep osd
rook-ceph-osd-0-5885f488f9-8pp7g                                  2/2     Running     0          50s
rook-ceph-osd-1-59bf6fd5f5-kzfkx                                  2/2     Running     0          84s
rook-ceph-osd-2-7db9f85bfd-gqmdp                                  2/2     Running     0          115s
rook-ceph-osd-3-855d889c4c-qn47k                                  2/2     Running     0          68s
rook-ceph-osd-4-69d4967cc-5znh7                                   2/2     Running     0          36s
rook-ceph-osd-5-7ccdc4dc9d-fqx5t                                  2/2     Running     0          35s
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-0w2mnx-vphvs      0/1     Completed   0          28m
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-16qx7j-qf57j      0/1     Completed   0          2m8s
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-0c4qkk-kdspj      0/1     Completed   0          28m
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-1fnjjw-5mgcw      0/1     Completed   0          2m6s
rook-ceph-osd-prepare-ocs-deviceset-thin-2-data-0pn997-xpqpc      0/1     Completed   0          28m

6. Verify that new secrets were created on the Vault server (KMS).

* This issue does not reproduce on an internal cluster without KMS.
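For reference, steps 3, 4, and 6 can also be driven from the CLI. The commands below are a minimal sketch only: the ConfigMap name (ocs-kms-connection-details), the Vault KV backend path, the key-name pattern, and the StorageCluster name (ocs-storagecluster) are assumptions and may differ on a given deployment.

# Read the KMS connection details the cluster was configured with (ConfigMap name is an assumption).
$ oc get configmap ocs-kms-connection-details -n openshift-storage -o yaml

# List the per-OSD encryption keys in the configured KV backend (path and key naming are assumptions;
# VAULT_ADDR/VAULT_TOKEN must already point at the KMS used above).
$ vault kv list <backend_path>
$ vault kv get <backend_path>/rook-ceph-osd-encryption-key-ocs-deviceset-thin-0-data-0w2mnx

# CLI alternative to "Add Capacity" in the UI: bump the device set count on the StorageCluster.
$ oc patch storagecluster ocs-storagecluster -n openshift-storage --type json \
    -p '[{"op": "replace", "path": "/spec/storageDeviceSets/0/count", "value": 2}]'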


Actual results:
The old OSD pods were re-spun.

Expected results:
The old OSD pods are not re-spun.

Additional info:
https://docs.google.com/document/d/1pfzIeo2M7qJF5du0DvX5vw_ujhIR6timufGkLyBzXGs/edit
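A quick way to confirm the behaviour under "Actual results" is to compare the OSD pods' creation timestamps (or the ReplicaSet hash in the pod name, e.g. 64596cd4ff vs 5885f488f9 for osd-0 above) before and after adding capacity; a hash change means the Deployment's pod template was modified and the existing OSDs were rolled:

# Sort OSD pods by creation time; OSDs created before "add capacity" should keep their original timestamps.
$ oc get pods -n openshift-storage -l app=rook-ceph-osd \
    --sort-by=.metadata.creationTimestamp \
    -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp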

Comment 2 Mudit Agarwal 2021-03-25 17:18:36 UTC
Let's start with rook.

Comment 4 Oded 2021-03-26 09:49:21 UTC
This issue also reproduces when a worker node is rebooted.

Reboot one of the worker nodes (a sketch of the reboot command follows the listings below):
$ oc get nodes
NAME                                         STATUS     ROLES    AGE     VERSION
ip-10-0-139-50.us-east-2.compute.internal    Ready      master   3h24m   v1.20.0+bafe72f
ip-10-0-147-1.us-east-2.compute.internal     Ready      worker   3h15m   v1.20.0+bafe72f
ip-10-0-185-180.us-east-2.compute.internal   Ready      master   3h24m   v1.20.0+bafe72f
ip-10-0-197-54.us-east-2.compute.internal    Ready      worker   3h15m   v1.20.0+bafe72f
ip-10-0-248-214.us-east-2.compute.internal   Ready      master   3h23m   v1.20.0+bafe72f
ip-10-0-248-39.us-east-2.compute.internal    NotReady   worker   3h15m   v1.20.0+bafe72f


$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-139-50.us-east-2.compute.internal    Ready    master   3h25m   v1.20.0+bafe72f
ip-10-0-147-1.us-east-2.compute.internal     Ready    worker   3h16m   v1.20.0+bafe72f
ip-10-0-185-180.us-east-2.compute.internal   Ready    master   3h25m   v1.20.0+bafe72f
ip-10-0-197-54.us-east-2.compute.internal    Ready    worker   3h16m   v1.20.0+bafe72f
ip-10-0-248-214.us-east-2.compute.internal   Ready    master   3h24m   v1.20.0+bafe72f
ip-10-0-248-39.us-east-2.compute.internal    Ready    worker   3h16m   v1.20.0+bafe72f


$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
rook-ceph-osd-0-6b5dbdb767-scgbx                                  2/2     Running     0          16m
rook-ceph-osd-1-7ffc9f77bf-g9xms                                  2/2     Running     0          18m
rook-ceph-osd-2-6b4cdc58b7-grxnd                                  2/2     Running     0          17m
rook-ceph-osd-3-79b99d6fb-lw6f7                                   2/2     Running     0          17m
rook-ceph-osd-4-9669b8db6-ldgp8                                   2/2     Running     0          17m
rook-ceph-osd-5-7b66cf7484-mvkbh                                  2/2     Running     0          17m
rook-ceph-osd-prepare-ocs-deviceset-gp2-0-data-0chz6r-8fdwh       0/1     Completed   0          52m
rook-ceph-osd-prepare-ocs-deviceset-gp2-0-data-1z4mgd-lg9tt       0/1     Completed   0          18m
rook-ceph-osd-prepare-ocs-deviceset-gp2-1-data-0z9fcj-t7gcj       0/1     Completed   0          52m
rook-ceph-osd-prepare-ocs-deviceset-gp2-1-data-1nfh89-lkm5m       0/1     Completed   0          18m
rook-ceph-osd-prepare-ocs-deviceset-gp2-2-data-05m6zf-spnm9       0/1     Completed   0          52m
rook-ceph-osd-prepare-ocs-deviceset-gp2-2-data-1c8j7n-pcwrn       0/1     Completed   0          18m
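How the node was rebooted in this run is not recorded here; one common way to do it from the cluster is via a debug shell on the node (a sketch only, using the NotReady node from the listing above):

# Open a debug pod on the node and reboot the host; the debug pod is lost when the node goes down.
$ oc debug node/ip-10-0-248-39.us-east-2.compute.internal -- chroot /host systemctl reboot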

Comment 5 Oded 2021-03-26 10:00:43 UTC
Created attachment 1766561 [details]
rook-ceph-op logs

Comment 10 Oded 2021-03-31 11:57:08 UTC
The bug does not reproduce on an internal cluster with OCS version 4.7.0-327.ci.

Test procedure:

Install OCS via UI:
1.Deploy OCP cluster
Provider: VMware
OCP Version: 4.7.0-0.nightly-2021-03-27-082615

2. Install OCS operator [4.7.0-327.ci] + KMS

3. Add capacity via the UI:
Old OSD pods are not re-spun.
$ oc get pods | grep osd
rook-ceph-osd-0-7d9c4d8cc6-kpvlm                                  2/2     Running     0          19m
rook-ceph-osd-1-65c8d558c6-hz4sl                                  2/2     Running     0          18m
rook-ceph-osd-2-7cd94499cc-q6x67                                  2/2     Running     0          18m
rook-ceph-osd-3-5cc85664d7-vc4x7                                  2/2     Running     0          34s
rook-ceph-osd-4-846d988c65-wdkdw                                  2/2     Running     0          33s
rook-ceph-osd-5-7f68ccf76c-822rp                                  2/2     Running     0          33s
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-0t7n76-n7fkx      0/1     Completed   0          20m
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-17rn8l-m5t66      0/1     Completed   0          99s
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-0ww8gl-9d8jf      0/1     Completed   0          20m
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-1spfkv-b4v9n      0/1     Completed   0          97s
rook-ceph-osd-prepare-ocs-deviceset-thin-2-data-0lpkkc-2frtp      0/1     Completed   0          20m
rook-ceph-osd-prepare-ocs-deviceset-thin-2-data-1cgzd8-857q8      0/1     Completed   0          96s

4. Reboot one worker node:
OSD pods on compute-1 and compute-2 are not re-spun.
$ oc get pods -o wide | grep osd
rook-ceph-osd-0-7d9c4d8cc6-kpvlm                                  2/2     Running     0          36m     10.131.2.69    compute-2   <none>           <none>
rook-ceph-osd-1-65c8d558c6-7dvx8                                  2/2     Running     0          3m41s   10.130.2.9     compute-0   <none>           <none>
rook-ceph-osd-2-7cd94499cc-q6x67                                  2/2     Running     0          36m     10.128.4.16    compute-1   <none>           <none>
rook-ceph-osd-3-5cc85664d7-vc4x7                                  2/2     Running     0          18m     10.128.4.19    compute-1   <none>           <none>
rook-ceph-osd-4-846d988c65-wdkdw                                  2/2     Running     0          18m     10.131.2.73    compute-2   <none>           <none>
rook-ceph-osd-5-7f68ccf76c-j55q4                                  2/2     Running     0          3m41s   10.130.2.8     compute-0   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-0t7n76-n7fkx      0/1     Completed   0          37m     10.131.2.68    compute-2   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-17rn8l-m5t66      0/1     Completed   0          19m     10.131.2.72    compute-2   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-0ww8gl-9d8jf      0/1     Completed   0          37m     10.128.4.15    compute-1   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-1spfkv-b4v9n      0/1     Completed   0          19m     10.128.4.18    compute-1   <none>           <none>


5. Device replacement (see the sketch after step 7 for the CLI flow):

Only the rook-ceph-osd-0-55d46c8f7d-qzwbd pod is re-spun.
$ oc get pods | grep osd
rook-ceph-osd-0-55d46c8f7d-qzwbd                                  2/2     Running     0          14m
rook-ceph-osd-1-65c8d558c6-7dvx8                                  2/2     Running     0          36m
rook-ceph-osd-2-7cd94499cc-q6x67                                  2/2     Running     0          69m
rook-ceph-osd-3-5cc85664d7-vc4x7                                  2/2     Running     0          51m
rook-ceph-osd-4-846d988c65-wdkdw                                  2/2     Running     0          51m
rook-ceph-osd-5-7f68ccf76c-j55q4                                  2/2     Running     0          36m
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-17rn8l-m5t66      0/1     Completed   0          52m
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-29mhxp-hbd55      0/1     Completed   0          16m
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-0ww8gl-9d8jf      0/1     Completed   0          70m
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-1spfkv-b4v9n      0/1     Completed   0          52m

6. Drain one worker node:

Only the rook-ceph-osd-2 and rook-ceph-osd-3 pods are re-spun.

$ oc get pods -o wide | grep osd
rook-ceph-osd-0-55d46c8f7d-qzwbd                                  2/2     Running     0          77m     10.131.2.76    compute-2   <none>           <none>
rook-ceph-osd-1-65c8d558c6-7dvx8                                  2/2     Running     0          99m     10.130.2.9     compute-0   <none>           <none>
rook-ceph-osd-2-7cd94499cc-5qrzj                                  2/2     Running     0          4m45s   10.128.4.22    compute-1   <none>           <none>
rook-ceph-osd-3-5cc85664d7-9cbwj                                  2/2     Running     0          4m51s   10.128.4.23    compute-1   <none>           <none>
rook-ceph-osd-4-846d988c65-wdkdw                                  2/2     Running     0          114m    10.131.2.73    compute-2   <none>           <none>
rook-ceph-osd-5-7f68ccf76c-j55q4                                  2/2     Running     0          99m     10.130.2.8     compute-0   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-17rn8l-m5t66      0/1     Completed   0          115m    10.131.2.72    compute-2   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-29mhxp-hbd55      0/1     Completed   0          79m     10.131.2.75    compute-2   <none>           <none>

7. Node replacement:
Only the rook-ceph-osd-1 and rook-ceph-osd-5 pods are re-spun.
$ oc get pods -o wide | grep osd 
rook-ceph-osd-0-55d46c8f7d-qzwbd                                  2/2     Running     0          91m     10.131.2.76    compute-2   <none>           <none>
rook-ceph-osd-1-65c8d558c6-rcs7x                                  2/2     Running     0          7m32s   10.129.2.206   compute-3   <none>           <none>
rook-ceph-osd-2-7cd94499cc-5qrzj                                  2/2     Running     0          18m     10.128.4.22    compute-1   <none>           <none>
rook-ceph-osd-3-5cc85664d7-9cbwj                                  2/2     Running     0          18m     10.128.4.23    compute-1   <none>           <none>
rook-ceph-osd-4-846d988c65-wdkdw                                  2/2     Running     0          127m    10.131.2.73    compute-2   <none>           <none>
rook-ceph-osd-5-7f68ccf76c-j4vz2                                  2/2     Running     0          7m47s   10.129.2.207   compute-3   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-17rn8l-m5t66      0/1     Completed   0          128m    10.131.2.72    compute-2   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-29mhxp-hbd55      0/1     Completed   0          92m     10.131.2.75    compute-2   <none>           <none>
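The exact commands used for steps 5 and 6 are not shown above; the sketch below follows the documented OCS/OCP flows, with osd-0 and compute-1 purely as illustrative targets and the ocs-osd-removal template/parameter names assumed to match this build:

# Step 5 (device replacement): stop the OSD being replaced and run the removal job,
# then let the operator prepare a new OSD on the replacement disk.
$ oc scale deployment rook-ceph-osd-0 -n openshift-storage --replicas=0
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 | oc create -n openshift-storage -f -

# Step 6 (drain): cordon and drain the worker, which evicts the OSD pods scheduled there.
$ oc adm cordon compute-1
$ oc adm drain compute-1 --ignore-daemonsets --delete-emptydir-data --force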

for more details:

https://docs.google.com/document/d/1TRpohv6jrul-JRv25YT-h5sXMibHtlugxxIIRyAoF64/edit

Comment 11 Oded 2021-04-01 12:50:23 UTC
The bug does not reproduce on an LSO cluster with OCS version 4.7.0-336.ci.

1.Deploy OCP cluster
Provider: VMware
OCP Version: 4.7.0-0.nightly-2021-03-31-192013

2. Install Local Storage Operator [4.7.0-202103202139.p0]

3. Install OCS Operator 4.7.0-336.ci

4. Install storage cluster + KMS + OSD encryption (an encryption check sketch follows step 8)

5. Add capacity without adding a new node:
Old OSD pods are not re-spun.
$ oc get pods | grep osd
rook-ceph-osd-0-7bcd55bdb4-b8p7g                                  2/2     Running     0          15m
rook-ceph-osd-1-6475b5cb98-6sgxg                                  2/2     Running     0          15m
rook-ceph-osd-2-6676dfcb4-rgwb4                                   2/2     Running     0          15m
rook-ceph-osd-3-6f4c8f7956-b6f6x                                  2/2     Running     0          16s
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-0l5rtbnwm   0/1     Completed   0          15m
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-19lvzj52s   0/1     Completed   0          15m
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-26s56789m   0/1     Completed   0          15m
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-3zkrlzwqv   0/1     Completed   0          34s

6. Reboot one of the worker nodes:
OSD pods on compute-1 and compute-2 are not re-spun.
$ oc get pods -o wide | grep osd
rook-ceph-osd-0-7bcd55bdb4-bvq4n                                  2/2     Running     0          2m28s   10.131.0.11    compute-0   <none>           <none>
rook-ceph-osd-1-6475b5cb98-6sgxg                                  2/2     Running     0          34m     10.129.2.19    compute-1   <none>           <none>
rook-ceph-osd-2-6676dfcb4-rgwb4                                   2/2     Running     0          34m     10.128.2.16    compute-2   <none>           <none>
rook-ceph-osd-3-6f4c8f7956-l4dxh                                  2/2     Running     0          2m28s   10.131.0.10    compute-0   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-19lvzj52s   0/1     Completed   0          35m     10.129.2.18    compute-1   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-26s56789m   0/1     Completed   0          35m     10.128.2.15    compute-2   <none>           <none>

7. Device replacement:
Existing OSD pods are not re-spun (only the OSD for the replaced disk is).
$ oc get -n openshift-storage pods -o wide -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-865c5555f8-f6vj9   2/2     Running   0          81s   10.131.0.18   compute-0   <none>           <none>
rook-ceph-osd-1-6475b5cb98-6sgxg   2/2     Running   0          52m   10.129.2.19   compute-1   <none>           <none>
rook-ceph-osd-2-6676dfcb4-rgwb4    2/2     Running   0          52m   10.128.2.16   compute-2   <none>           <none>
rook-ceph-osd-3-6f4c8f7956-l4dxh   2/2     Running   0          20m   10.131.0.10   compute-0   <none>           <none>

8. Drain one worker node:
Only OSD-1 is re-spun.
$ oc get pods -o wide | grep osd
rook-ceph-osd-0-865c5555f8-f6vj9                                  2/2     Running     0          15m    10.131.0.18    compute-0   <none>           <none>
rook-ceph-osd-1-6475b5cb98-wfjbw                                  2/2     Running     0          115s   10.129.2.25    compute-1   <none>           <none>
rook-ceph-osd-2-6676dfcb4-rgwb4                                   2/2     Running     0          66m    10.128.2.16    compute-2   <none>           <none>
rook-ceph-osd-3-6f4c8f7956-l4dxh                                  2/2     Running     0          34m    10.131.0.10    compute-0   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-26s56789m   0/1     Completed   0          67m    10.128.2.15    compute-2   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-429wxz6m7   0/1     Completed   0          19m    10.131.0.17    compute-0   <none>           <none>
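Because this cluster was deployed with KMS and OSD encryption (step 4), it is also worth recording that the OSD devices are actually LUKS-encrypted on the nodes; a minimal sketch, with the node name purely illustrative:

# From a debug shell on a worker, the OSD's block device should show a crypto_LUKS layer
# with an open dm-crypt mapper on top of it.
$ oc debug node/compute-0 -- chroot /host lsblk --fs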


for more details:
https://docs.google.com/document/d/1dv4LKopEa0_c02NXs6NHtop6qh5cY5ppy0YXTEvChlA/edit

Comment 14 errata-xmlrpc 2021-05-19 09:20:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

