Description of problem (please be as detailed as possible and provide log snippets):
All OSD pods (old + new) were re-spun after "add capacity" on a cluster with KMS.

Version of all relevant components (if applicable):
Provider: VSphere
OCP version: 4.7.0-0.nightly-2021-03-25-091845
OCS version: ocs-operator.v4.7.0-318.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 1

Is this issue reproducible? yes

Can this issue be reproduced from the UI? yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install an internal cluster with KMS.
2. Get the OSD pod status:
$ oc get pods | grep osd
rook-ceph-osd-0-64596cd4ff-t4qgt                               2/2   Running     0   25m
rook-ceph-osd-1-597dc6785f-m9rsg                               2/2   Running     0   24m
rook-ceph-osd-2-5488ccf758-f8ldz                               2/2   Running     0   24m
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-0w2mnx-vphvs   0/1   Completed   0   26m
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-0c4qkk-kdspj   0/1   Completed   0   26m
rook-ceph-osd-prepare-ocs-deviceset-thin-2-data-0pn997-xpqpc   0/1   Completed   0   25m
3. Verify that new secrets were created on the Vault server (KMS).
4. Add capacity via the UI.
5. Check the OSD pod status: the old OSD pods were re-spun. [Failed]
$ oc get pods | grep osd
rook-ceph-osd-0-5885f488f9-8pp7g                               2/2   Running     0   50s
rook-ceph-osd-1-59bf6fd5f5-kzfkx                               2/2   Running     0   84s
rook-ceph-osd-2-7db9f85bfd-gqmdp                               2/2   Running     0   115s
rook-ceph-osd-3-855d889c4c-qn47k                               2/2   Running     0   68s
rook-ceph-osd-4-69d4967cc-5znh7                                2/2   Running     0   36s
rook-ceph-osd-5-7ccdc4dc9d-fqx5t                               2/2   Running     0   35s
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-0w2mnx-vphvs   0/1   Completed   0   28m
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-16qx7j-qf57j   0/1   Completed   0   2m8s
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-0c4qkk-kdspj   0/1   Completed   0   28m
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-1fnjjw-5mgcw   0/1   Completed   0   2m6s
rook-ceph-osd-prepare-ocs-deviceset-thin-2-data-0pn997-xpqpc   0/1   Completed   0   28m
6. Verify that new secrets were created on the Vault server (KMS).

* This issue is not reproducible on an internal cluster without KMS.

Actual results:
The old OSD pods were re-spun.

Expected results:
The old OSD pods should not be re-spun.

Additional info:
https://docs.google.com/document/d/1pfzIeo2M7qJF5du0DvX5vw_ujhIR6timufGkLyBzXGs/edit
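For reference, a minimal sketch of how the before/after state can be compared to confirm that existing OSDs were re-spun (rather than only new ones added), plus a listing of the OSD key-encryption-key secrets in Vault. The Vault KV backend path "ocs" is only an assumption here; substitute the backend path configured for the KMS connection.

# Snapshot OSD pod identity before and after "add capacity";
# a changed UID/start time for an existing OSD means it was re-spun
$ oc -n openshift-storage get pods -l app=rook-ceph-osd \
    -o custom-columns=NAME:.metadata.name,UID:.metadata.uid,STARTED:.status.startTime

# List the OSD encryption key secrets stored in Vault (backend path is an assumption)
$ vault kv list ocs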
Let's start with rook.
This issue was also reproduced by rebooting a worker node.

Reboot one of the worker nodes:
$ oc get nodes
NAME                                         STATUS     ROLES    AGE     VERSION
ip-10-0-139-50.us-east-2.compute.internal    Ready      master   3h24m   v1.20.0+bafe72f
ip-10-0-147-1.us-east-2.compute.internal     Ready      worker   3h15m   v1.20.0+bafe72f
ip-10-0-185-180.us-east-2.compute.internal   Ready      master   3h24m   v1.20.0+bafe72f
ip-10-0-197-54.us-east-2.compute.internal    Ready      worker   3h15m   v1.20.0+bafe72f
ip-10-0-248-214.us-east-2.compute.internal   Ready      master   3h23m   v1.20.0+bafe72f
ip-10-0-248-39.us-east-2.compute.internal    NotReady   worker   3h15m   v1.20.0+bafe72f

$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-139-50.us-east-2.compute.internal    Ready    master   3h25m   v1.20.0+bafe72f
ip-10-0-147-1.us-east-2.compute.internal     Ready    worker   3h16m   v1.20.0+bafe72f
ip-10-0-185-180.us-east-2.compute.internal   Ready    master   3h25m   v1.20.0+bafe72f
ip-10-0-197-54.us-east-2.compute.internal    Ready    worker   3h16m   v1.20.0+bafe72f
ip-10-0-248-214.us-east-2.compute.internal   Ready    master   3h24m   v1.20.0+bafe72f
ip-10-0-248-39.us-east-2.compute.internal    Ready    worker   3h16m   v1.20.0+bafe72f

$ oc get pods -n openshift-storage
NAME                                                          READY   STATUS      RESTARTS   AGE
rook-ceph-osd-0-6b5dbdb767-scgbx                              2/2     Running     0          16m
rook-ceph-osd-1-7ffc9f77bf-g9xms                              2/2     Running     0          18m
rook-ceph-osd-2-6b4cdc58b7-grxnd                              2/2     Running     0          17m
rook-ceph-osd-3-79b99d6fb-lw6f7                               2/2     Running     0          17m
rook-ceph-osd-4-9669b8db6-ldgp8                               2/2     Running     0          17m
rook-ceph-osd-5-7b66cf7484-mvkbh                              2/2     Running     0          17m
rook-ceph-osd-prepare-ocs-deviceset-gp2-0-data-0chz6r-8fdwh   0/1     Completed   0          52m
rook-ceph-osd-prepare-ocs-deviceset-gp2-0-data-1z4mgd-lg9tt   0/1     Completed   0          18m
rook-ceph-osd-prepare-ocs-deviceset-gp2-1-data-0z9fcj-t7gcj   0/1     Completed   0          52m
rook-ceph-osd-prepare-ocs-deviceset-gp2-1-data-1nfh89-lkm5m   0/1     Completed   0          18m
rook-ceph-osd-prepare-ocs-deviceset-gp2-2-data-05m6zf-spnm9   0/1     Completed   0          52m
rook-ceph-osd-prepare-ocs-deviceset-gp2-2-data-1c8j7n-pcwrn   0/1     Completed   0          18m
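To distinguish a plain reschedule after the reboot from an actual re-spin by the operator, one rough check (a suggested verification step, not part of the original report) is to look at the OSD Deployment generations: the generation only increments when the operator rewrites the pod template, whereas a pod that merely moved after the reboot leaves its Deployment untouched.

# Deployments whose generation increased around the event were updated by the operator
$ oc -n openshift-storage get deploy -l app=rook-ceph-osd \
    -o custom-columns=NAME:.metadata.name,GENERATION:.metadata.generation,OBSERVED:.status.observedGeneration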
Created attachment 1766561 [details]
rook-ceph-operator logs
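When triaging from a live cluster rather than the attachment, the equivalent operator logs can be pulled and filtered for OSD and encryption activity. The exact log messages vary by Rook version, so the grep pattern below is only a starting point.

$ oc -n openshift-storage logs deployment/rook-ceph-operator | grep -iE 'osd|encrypt'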
OCS must-gather: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1943275/bz-add-capacity/
Bug is not reproducible on an internal cluster with OCS version 4.7.0-327.ci.

Test procedure (OCS installed via the UI):
1. Deploy an OCP cluster
   Provider: VMware
   OCP version: 4.7.0-0.nightly-2021-03-27-082615
2. Install the OCS operator [4.7.0-327.ci] + KMS.
3. Add capacity via the UI: the old OSD pods were NOT re-spun.
$ oc get pods | grep osd
rook-ceph-osd-0-7d9c4d8cc6-kpvlm                               2/2   Running     0   19m
rook-ceph-osd-1-65c8d558c6-hz4sl                               2/2   Running     0   18m
rook-ceph-osd-2-7cd94499cc-q6x67                               2/2   Running     0   18m
rook-ceph-osd-3-5cc85664d7-vc4x7                               2/2   Running     0   34s
rook-ceph-osd-4-846d988c65-wdkdw                               2/2   Running     0   33s
rook-ceph-osd-5-7f68ccf76c-822rp                               2/2   Running     0   33s
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-0t7n76-n7fkx   0/1   Completed   0   20m
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-17rn8l-m5t66   0/1   Completed   0   99s
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-0ww8gl-9d8jf   0/1   Completed   0   20m
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-1spfkv-b4v9n   0/1   Completed   0   97s
rook-ceph-osd-prepare-ocs-deviceset-thin-2-data-0lpkkc-2frtp   0/1   Completed   0   20m
rook-ceph-osd-prepare-ocs-deviceset-thin-2-data-1cgzd8-857q8   0/1   Completed   0   96s
4. Reboot one worker node: the OSD pods on compute-1 and compute-2 were NOT re-spun.
$ oc get pods -o wide | grep osd
rook-ceph-osd-0-7d9c4d8cc6-kpvlm                               2/2   Running     0   36m     10.131.2.69   compute-2   <none>   <none>
rook-ceph-osd-1-65c8d558c6-7dvx8                               2/2   Running     0   3m41s   10.130.2.9    compute-0   <none>   <none>
rook-ceph-osd-2-7cd94499cc-q6x67                               2/2   Running     0   36m     10.128.4.16   compute-1   <none>   <none>
rook-ceph-osd-3-5cc85664d7-vc4x7                               2/2   Running     0   18m     10.128.4.19   compute-1   <none>   <none>
rook-ceph-osd-4-846d988c65-wdkdw                               2/2   Running     0   18m     10.131.2.73   compute-2   <none>   <none>
rook-ceph-osd-5-7f68ccf76c-j55q4                               2/2   Running     0   3m41s   10.130.2.8    compute-0   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-0t7n76-n7fkx   0/1   Completed   0   37m     10.131.2.68   compute-2   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-17rn8l-m5t66   0/1   Completed   0   19m     10.131.2.72   compute-2   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-0ww8gl-9d8jf   0/1   Completed   0   37m     10.128.4.15   compute-1   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-1spfkv-b4v9n   0/1   Completed   0   19m     10.128.4.18   compute-1   <none>   <none>
5. Device replacement: only the rook-ceph-osd-0-55d46c8f7d-qzwbd pod was re-spun.
$ oc get pods | grep osd
rook-ceph-osd-0-55d46c8f7d-qzwbd                               2/2   Running     0   14m
rook-ceph-osd-1-65c8d558c6-7dvx8                               2/2   Running     0   36m
rook-ceph-osd-2-7cd94499cc-q6x67                               2/2   Running     0   69m
rook-ceph-osd-3-5cc85664d7-vc4x7                               2/2   Running     0   51m
rook-ceph-osd-4-846d988c65-wdkdw                               2/2   Running     0   51m
rook-ceph-osd-5-7f68ccf76c-j55q4                               2/2   Running     0   36m
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-17rn8l-m5t66   0/1   Completed   0   52m
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-29mhxp-hbd55   0/1   Completed   0   16m
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-0ww8gl-9d8jf   0/1   Completed   0   70m
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-1spfkv-b4v9n   0/1   Completed   0   52m
6. Drain one worker node: only the rook-ceph-osd-2 and rook-ceph-osd-3 pods were re-spun (a generic drain sketch follows this procedure).
$ oc get pods -o wide | grep osd
rook-ceph-osd-0-55d46c8f7d-qzwbd                               2/2   Running     0   77m     10.131.2.76   compute-2   <none>   <none>
rook-ceph-osd-1-65c8d558c6-7dvx8                               2/2   Running     0   99m     10.130.2.9    compute-0   <none>   <none>
rook-ceph-osd-2-7cd94499cc-5qrzj                               2/2   Running     0   4m45s   10.128.4.22   compute-1   <none>   <none>
rook-ceph-osd-3-5cc85664d7-9cbwj                               2/2   Running     0   4m51s   10.128.4.23   compute-1   <none>   <none>
rook-ceph-osd-4-846d988c65-wdkdw                               2/2   Running     0   114m    10.131.2.73   compute-2   <none>   <none>
rook-ceph-osd-5-7f68ccf76c-j55q4                               2/2   Running     0   99m     10.130.2.8    compute-0   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-17rn8l-m5t66   0/1   Completed   0   115m    10.131.2.72   compute-2   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-29mhxp-hbd55   0/1   Completed   0   79m     10.131.2.75   compute-2   <none>   <none>
7. Node replacement: only the rook-ceph-osd-1 and rook-ceph-osd-5 pods were re-spun.
$ oc get pods -o wide | grep osd
rook-ceph-osd-0-55d46c8f7d-qzwbd                               2/2   Running     0   91m     10.131.2.76    compute-2   <none>   <none>
rook-ceph-osd-1-65c8d558c6-rcs7x                               2/2   Running     0   7m32s   10.129.2.206   compute-3   <none>   <none>
rook-ceph-osd-2-7cd94499cc-5qrzj                               2/2   Running     0   18m     10.128.4.22    compute-1   <none>   <none>
rook-ceph-osd-3-5cc85664d7-9cbwj                               2/2   Running     0   18m     10.128.4.23    compute-1   <none>   <none>
rook-ceph-osd-4-846d988c65-wdkdw                               2/2   Running     0   127m    10.131.2.73    compute-2   <none>   <none>
rook-ceph-osd-5-7f68ccf76c-j4vz2                               2/2   Running     0   7m47s   10.129.2.207   compute-3   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-17rn8l-m5t66   0/1   Completed   0   128m    10.131.2.72    compute-2   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-29mhxp-hbd55   0/1   Completed   0   92m     10.131.2.75    compute-2   <none>   <none>

For more details: https://docs.google.com/document/d/1TRpohv6jrul-JRv25YT-h5sXMibHtlugxxIIRyAoF64/edit
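For completeness, the drain in step 6 follows the standard OpenShift flow. The commands below are a generic sketch: the node name is a placeholder, and older oc clients use --delete-local-data instead of --delete-emptydir-data.

$ oc adm cordon <node-name>
$ oc adm drain <node-name> --force --ignore-daemonsets --delete-emptydir-data
# ... perform maintenance on the node ...
$ oc adm uncordon <node-name>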
Bug is not reproducible on an LSO cluster with OCS version 4.7.0-336.ci.

1. Deploy an OCP cluster
   Provider: VMware
   OCP version: 4.7.0-0.nightly-2021-03-31-192013
2. Install the Local Storage Operator [4.7.0-202103202139.p0].
3. Install the OCS operator 4.7.0-336.ci.
4. Install the storage cluster + KMS + OSD encryption.
5. Add capacity without adding a new node: the old OSD pods were not re-spun.
$ oc get pods | grep osd
rook-ceph-osd-0-7bcd55bdb4-b8p7g                                  2/2   Running     0   15m
rook-ceph-osd-1-6475b5cb98-6sgxg                                  2/2   Running     0   15m
rook-ceph-osd-2-6676dfcb4-rgwb4                                   2/2   Running     0   15m
rook-ceph-osd-3-6f4c8f7956-b6f6x                                  2/2   Running     0   16s
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-0l5rtbnwm   0/1   Completed   0   15m
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-19lvzj52s   0/1   Completed   0   15m
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-26s56789m   0/1   Completed   0   15m
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-3zkrlzwqv   0/1   Completed   0   34s
6. Reboot one of the worker nodes: the OSD pods on compute-1 and compute-2 were not re-spun.
$ oc get pods -o wide | grep osd
rook-ceph-osd-0-7bcd55bdb4-bvq4n                                  2/2   Running     0   2m28s   10.131.0.11   compute-0   <none>   <none>
rook-ceph-osd-1-6475b5cb98-6sgxg                                  2/2   Running     0   34m     10.129.2.19   compute-1   <none>   <none>
rook-ceph-osd-2-6676dfcb4-rgwb4                                   2/2   Running     0   34m     10.128.2.16   compute-2   <none>   <none>
rook-ceph-osd-3-6f4c8f7956-l4dxh                                  2/2   Running     0   2m28s   10.131.0.10   compute-0   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-19lvzj52s   0/1   Completed   0   35m     10.129.2.18   compute-1   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-26s56789m   0/1   Completed   0   35m     10.128.2.15   compute-2   <none>   <none>
7. Device replacement: the existing OSD pods were not re-spun (only the pod for the new disk); see the removal-flow sketch after this procedure.
$ oc get -n openshift-storage pods -o wide -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-865c5555f8-f6vj9   2/2     Running   0          81s   10.131.0.18   compute-0   <none>           <none>
rook-ceph-osd-1-6475b5cb98-6sgxg   2/2     Running   0          52m   10.129.2.19   compute-1   <none>           <none>
rook-ceph-osd-2-6676dfcb4-rgwb4    2/2     Running   0          52m   10.128.2.16   compute-2   <none>           <none>
rook-ceph-osd-3-6f4c8f7956-l4dxh   2/2     Running   0          20m   10.131.0.10   compute-0   <none>           <none>
8. Drain one worker node: only the rook-ceph-osd-1 pod was re-spun.
$ oc get pods -o wide | grep osd
rook-ceph-osd-0-865c5555f8-f6vj9                                  2/2   Running     0   15m    10.131.0.18   compute-0   <none>   <none>
rook-ceph-osd-1-6475b5cb98-wfjbw                                  2/2   Running     0   115s   10.129.2.25   compute-1   <none>   <none>
rook-ceph-osd-2-6676dfcb4-rgwb4                                   2/2   Running     0   66m    10.128.2.16   compute-2   <none>   <none>
rook-ceph-osd-3-6f4c8f7956-l4dxh                                  2/2   Running     0   34m    10.131.0.10   compute-0   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-26s56789m   0/1   Completed   0   67m    10.128.2.15   compute-2   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-429wxz6m7   0/1   Completed   0   19m    10.131.0.17   compute-0   <none>   <none>

For more details: https://docs.google.com/document/d/1dv4LKopEa0_c02NXs6NHtop6qh5cY5ppy0YXTEvChlA/edit
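As a reference for the device replacement in step 7, the standard OCS 4.7 flow roughly looks like the sketch below. The OSD ID "0" is an example value taken from the output above, and the ocs-osd-removal template and FAILED_OSD_IDS parameter names should be verified against the official device-replacement documentation for the installed version; this is a sketch of the documented procedure, not a record of the exact commands run here.

# Stop the OSD backed by the failed device
$ oc -n openshift-storage scale deployment rook-ceph-osd-0 --replicas=0

# Run the OSD removal job for that OSD ID
$ oc -n openshift-storage process ocs-osd-removal -p FAILED_OSD_IDS=0 | oc -n openshift-storage create -f -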
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041