Description of problem (please be as detailed as possible and provide log snippets):

In this hotfix https://access.redhat.com/articles/6035981 the procedure is to change CEPH_IMAGE in the CSV. But the rook operator is not restarted to propagate this change to the RGW and MDS pods.

$ oc get pod -n openshift-storage
NAME                                                              READY   STATUS    RESTARTS   AGE
csi-cephfsplugin-6w6zq                                            3/3     Running   0          58m
csi-cephfsplugin-fxrvn                                            3/3     Running   0          58m
csi-cephfsplugin-plcmn                                            3/3     Running   0          58m
csi-cephfsplugin-provisioner-66c59d467f-f9qjq                     6/6     Running   0          17m
csi-cephfsplugin-provisioner-66c59d467f-hm7cs                     6/6     Running   0          15m
csi-rbdplugin-2t6f5                                               3/3     Running   0          58m
csi-rbdplugin-provisioner-6b7dcf968-4pqdk                         6/6     Running   0          17m
csi-rbdplugin-provisioner-6b7dcf968-gkrf2                         6/6     Running   0          15m
csi-rbdplugin-r9q8b                                               3/3     Running   0          58m
csi-rbdplugin-v4zfl                                               3/3     Running   0          58m
noobaa-core-0                                                     1/1     Running   0          14m
noobaa-db-0                                                       1/1     Running   0          15m
noobaa-endpoint-8cd557c99-jrs5n                                   1/1     Running   1          17m
noobaa-operator-546db56fcc-vqknm                                  1/1     Running   0          15m
ocs-metrics-exporter-569957b47-4g7ft                              1/1     Running   0          15m
ocs-operator-67dcf65bf8-8trk4                                     1/1     Running   0          9m22s
rook-ceph-crashcollector-compute-0-8477f8cb98-55dhm               1/1     Running   0          8m36s
rook-ceph-crashcollector-compute-1-fc6b47b7c-wddlg                1/1     Running   0          5m35s
rook-ceph-crashcollector-compute-2-675f7d86d-vl8k5                1/1     Running   0          7m5s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-75d4db686dk2b   1/1     Running   0          18m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-79f4c6bdnsz8t   1/1     Running   0          15m
rook-ceph-mgr-a-6c7bfd476b-h4c4k                                  1/1     Running   0          5m12s
rook-ceph-mon-a-ff795f5c5-s25l2                                   1/1     Running   0          5m35s
rook-ceph-mon-b-7dc9957f8-7gjhb                                   1/1     Running   0          8m36s
rook-ceph-mon-c-787c5f555c-t9v7b                                  1/1     Running   0          7m5s
rook-ceph-operator-555cbb5cdf-rkht7                               1/1     Running   0          18m
rook-ceph-osd-0-6fdbb794fb-g4gfv                                  1/1     Running   0          4m56s
rook-ceph-osd-1-6bc6ffccc8-rjsz2                                  1/1     Running   0          3m39s
rook-ceph-osd-2-75bdc67b4b-9lxsp                                  1/1     Running   0          2m16s
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-f7865fb48lfv   1/1     Running   0          17m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-b964b87tzdjk   1/1     Running   0          15m
rook-ceph-tools-7ddd664854-prr4d                                  1/1     Running   0          18m

Here I see that rook-ceph-operator is 18m old, i.e. it did not get restarted after applying the hotfix by editing the CSV.
You can see here:

$ oc rsh -n openshift-storage rook-ceph-tools-7ddd664854-prr4d ceph versions
{
    "mon": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 7,
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 4
    }
}

After removing rook-ceph-operator and waiting a minute or two I see:

$ oc rsh -n openshift-storage rook-ceph-tools-7ddd664854-prr4d ceph versions
{
    "mon": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 11
    }
}

That restarted the MDS and RGW pods and the hotfix image is applied.

Version of all relevant components (if applicable):
OCS 4.6.4
OCP 4.6.12

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
Restarting the rook-ceph-operator pod

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
No - CLI steps for the hotfix

If this is a regression, please provide more details to justify this:
Not sure if it worked before

Steps to Reproduce:
1. Install OCP 4.6 with OCS 4.6.4
2. Edit the CSV and change CEPH_IMAGE
3. rook-ceph-operator is not restarted and the Ceph image is not propagated to the RGW and MDS pods

Actual results:
CEPH_IMAGE is not propagated to RGW and MDS because rook-ceph-operator is not restarted after editing the CSV.

Expected results:
rook-ceph-operator is restarted (or otherwise reconciles) so that CEPH_IMAGE is propagated to all pods.

Additional info:
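For reference, a minimal sketch of the workaround used above, i.e. forcing a restart of the Rook operator so that it re-reconciles with the new image. The label selector below is the one normally set on the operator pod in an OCS deployment; adjust it if your deployment differs:

# Restart the operator by deleting its pod (the Deployment recreates it):
$ oc delete pod -n openshift-storage -l app=rook-ceph-operator

# After a minute or two, confirm that all daemons report the hotfix version:
$ oc rsh -n openshift-storage rook-ceph-tools-7ddd664854-prr4d ceph versions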
The operator does not need to be restarted; the operator should just respond to the event that the CephCluster was updated. I see in the log that the CephCluster was updated, but I'm not sure why the mds and rgw controllers were not also triggered to update.
I talked to Neha and she told me that if we reproduce it I should open a BZ (and that you, Travis, told her to do this), so I did. I reproduced it 2 times in a row, so I think there is a bug if that reload is supposed to happen, hence I opened this BZ. Thanks
Thanks, good to hear there is a consistent repro. I agree there is a bug here; my previous comment was just trying to say it still needs investigation.
The issue is that the version of Ceph did not change. The Rook operator will notify the file and object controllers that they need to reconcile only when the Ceph version has changed. The version check is currently only based on the build number.

The two versions in this test are:
"ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 7,
"ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 4

In this case, the build numbers 14.2.11-139 are all equivalent. The rest of the build version is ignored by Rook for the comparison.

If the version is detected as changed [1], the operator log would show the message:
"upgrade in progress, notifying child CRs"

@Petr Is the Ceph build number actually expected to be unchanged during the hotfix? Or is this just an artifact found during testing?

[1] https://github.com/openshift/rook/blob/release-4.6/pkg/operator/ceph/cluster/cluster.go#L122
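To illustrate the comparison described above (a rough sketch, not Rook's actual code): only the leading major.minor.patch-build portion of the version string takes part in the check, so the hotfix suffix makes no difference. The log grep is simply one way to check whether the notification message mentioned above was emitted:

# Both version strings reduce to the same value for the check, so no change is detected:
$ echo "14.2.11-139.el8cp" | sed -E 's/^([0-9]+\.[0-9]+\.[0-9]+-[0-9]+).*/\1/'
14.2.11-139
$ echo "14.2.11-139.0.hotfix.bz1959254.el8cp" | sed -E 's/^([0-9]+\.[0-9]+\.[0-9]+-[0-9]+).*/\1/'
14.2.11-139

# If a version change had been detected, the operator log would contain the message above:
$ oc logs -n openshift-storage deploy/rook-ceph-operator | grep "upgrade in progress, notifying child CRs"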
Thanks Travis for the clarification. @bkunal this is more a question for Bipin. I got the image I should test, which is mentioned in the article: quay.io/rh-storage-partners/rhceph:4-50.0.hotfix.bz1959254. Bipin, can you please take a look at Travis's input? This will affect applying the hotfix if the version stays the same.
(In reply to Travis Nielsen from comment #5)
> The issue is that the version of Ceph did not change. The Rook operator will
> notify the file and object controllers that they need to reconcile only when
> the Ceph version has changed. The version check is currently only based on
> the build number.
>
> The two versions in this test are:
> "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp
> (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 7,
> "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5)
> nautilus (stable)": 4
>
> In this case, the build numbers 14.2.11-139 are all equivalent. The rest of
> the build version is ignored by Rook for the comparison.

The build number won't change for a hotfix build; a hotfix must be created on the same build. We do add a suffix (0.hotfix.bz1959254.el8cp), but I guess that doesn't get checked.

Then why did we see the image getting updated for OSD, MON, etc.? In my cluster, I did not even observe issues for MDS. In my cluster, I saw the ceph-detect-version pods getting respun as well.

> If the version is detected as changed [1], the operator log would show the
> message:
> "upgrade in progress, notifying child CRs"
>
> @Petr Is the Ceph build number actually expected to be unchanged during the
> hotfix? Or is this just an artifact found during testing?
>
> [1] https://github.com/openshift/rook/blob/release-4.6/pkg/operator/ceph/cluster/cluster.go#L122
@Bipin The main reconcile is triggered, which updates all the mon/mgr/osd daemons, but the mds and rgw need to have their controllers triggered also. This is being missed if the ceph version didn't change. Upstream issue opened for this: https://github.com/rook/rook/issues/7964
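In case it helps the investigation, one quick way to see this state on a cluster, i.e. which daemon pods already run the new image and which do not (the namespace and pod name patterns match the cluster shown above):

$ oc get pods -n openshift-storage -o custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[*].image | grep -E "rook-ceph-(mon|mgr|osd|mds|rgw)"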
Santosh, could you take a look?
(In reply to Travis Nielsen from comment #9)
> Santosh could you take a look?

On it.
In order to test this I will need to have a hotfix for one of the Ceph images. E.g. I just deployed the latest 4.8 cluster (ocs-operator.v4.8.0-432.ci) and I see this image is used in the CSV:

  - name: CEPH_IMAGE
    value: quay.io/rhceph-dev/rhceph@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c

ceph versions returns:
{
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 9
    }
}

Can someone create a hotfix build which will have a version like 14.2.11-181.0.hotfix.bzXXXXXX.el8cp so I can really verify this on the latest 4.8 build? Maybe @branto or @muagarwa can help here? Thanks
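For anyone reproducing this, one quick way to see which CEPH_IMAGE the installed CSV currently points at (the CSV name is the one from this cluster and will differ on other builds):

$ oc get csv ocs-operator.v4.8.0-432.ci -n openshift-storage -o yaml | grep -A1 "name: CEPH_IMAGE"
      - name: CEPH_IMAGE
        value: quay.io/rhceph-dev/rhceph@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c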
OCS 4.8 hotfix build is available now: quay.io/rhceph-dev/ocs-registry:4.8.0-449.ci

Build artifacts can be found here: https://ceph-downstream-jenkins-csb-storage.apps.ocp4.prod.psi.redhat.com/job/OCS%20Build%20Pipeline%204.8/162/

ocs-ci is still running though: https://ceph-downstream-jenkins-csb-storage.apps.ocp4.prod.psi.redhat.com/job/ocs-ci/455/
"name": "rhceph", "tag": "4-50.0.hotfix.bz1959254", "image": "quay.io/rhceph-dev/rhceph@sha256:6dbe1a5abfe1f3bf054b584d82f4011c0b0fec817924583ad834b4ff2a63c769", "nvr": "rhceph-container-4-50.0.hotfix.bz1959254" }, Deepshikha is this: 4-50.0.hotfix.bz1959254 image has version like: 14.2.11-181.0.hotfix.bzXXXXXX.el8cp as I see it has 4-50.0 in name? Deepshikha Please confirm that so I can continue with verification. Thanks
I am preparing a cluster for verification here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/4535/

As I didn't get an answer from Deepshikha, I will try with the image quay.io/rhceph-dev/rhceph@sha256:6dbe1a5abfe1f3bf054b584d82f4011c0b0fec817924583ad834b4ff2a63c769 and let you know the results.
I somehow missed the comment on this BZ. So sorry about that, Petr. Yes, you will have an image version like `2:14.2.11-139.0.hotfix.bz1959254.el8cp`. I can confirm from here: https://quay.io/repository/rhceph-dev/rhceph/manifest/sha256:286820cca8aa3d6b72eef6c59779c8931c14cf28dafabbb229235c3ccc26e763?tab=packages
Deepshikha, that version is not good enough. I need to have the exact same base version, which is supposed to be 14.2.11-181.el8cp, in order to test it. So I need an image with a version like 14.2.11-181.0.hotfix.bz1959254.el8cp.

For now I see all versions changed, but I cannot verify this BZ, as the hotfix must have the exact same base version as we have in the build itself in order to test this.

$ cat versions-after-hotfix.txt
{
    "mon": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 9
    }
}

$ cat versions-before-hotfix.txt
{
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 9
    }
}
Deepshikha, what Petr is asking is to create a temporary 4.8 build with ceph tag as 4-50.0.hotfix.bz1959254
So the rhceph hotfix build 4-50.0.hotfix.bz1959254 has the Ceph version `14.2.11-139.0.hotfix.bz1959254.el8cp`. Currently there is no hotfix rhceph image available for the current version, i.e. 4-57. We can probably create a recent 4.8 custom build with rhceph 4-50, and then you could upgrade from this new build to the hotfix build I provided earlier for verification. Let me know if that works for you.
I have triggered a custom build with rhceph tag 4-50. Link to the build pipeline: https://ceph-downstream-jenkins-csb-storage.apps.ocp4.prod.psi.redhat.com/job/OCS%20Build%20Pipeline%204.8/167/ It should probably help.
Tested with the custom build and it looks like it works well now.

$ cat versions-after-hotfix.txt
{
    "mon": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 10
    }
}

$ cat versions-before-hotfix.txt
{
    "mon": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 10
    }
}

Marking as verified.