Description of problem (please be as detailed as possible and provide log snippets):

In this hotfix https://access.redhat.com/articles/6035981 the procedure is to change CEPH_IMAGE in the CSV. But the rook operator is not restarted to propagate this change to the RGW and MDS pods.

$ oc get pod -n openshift-storage
NAME                                                              READY   STATUS    RESTARTS   AGE
csi-cephfsplugin-6w6zq                                            3/3     Running   0          58m
csi-cephfsplugin-fxrvn                                            3/3     Running   0          58m
csi-cephfsplugin-plcmn                                            3/3     Running   0          58m
csi-cephfsplugin-provisioner-66c59d467f-f9qjq                     6/6     Running   0          17m
csi-cephfsplugin-provisioner-66c59d467f-hm7cs                     6/6     Running   0          15m
csi-rbdplugin-2t6f5                                               3/3     Running   0          58m
csi-rbdplugin-provisioner-6b7dcf968-4pqdk                         6/6     Running   0          17m
csi-rbdplugin-provisioner-6b7dcf968-gkrf2                         6/6     Running   0          15m
csi-rbdplugin-r9q8b                                               3/3     Running   0          58m
csi-rbdplugin-v4zfl                                               3/3     Running   0          58m
noobaa-core-0                                                     1/1     Running   0          14m
noobaa-db-0                                                       1/1     Running   0          15m
noobaa-endpoint-8cd557c99-jrs5n                                   1/1     Running   1          17m
noobaa-operator-546db56fcc-vqknm                                  1/1     Running   0          15m
ocs-metrics-exporter-569957b47-4g7ft                              1/1     Running   0          15m
ocs-operator-67dcf65bf8-8trk4                                     1/1     Running   0          9m22s
rook-ceph-crashcollector-compute-0-8477f8cb98-55dhm               1/1     Running   0          8m36s
rook-ceph-crashcollector-compute-1-fc6b47b7c-wddlg                1/1     Running   0          5m35s
rook-ceph-crashcollector-compute-2-675f7d86d-vl8k5                1/1     Running   0          7m5s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-75d4db686dk2b   1/1     Running   0          18m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-79f4c6bdnsz8t   1/1     Running   0          15m
rook-ceph-mgr-a-6c7bfd476b-h4c4k                                  1/1     Running   0          5m12s
rook-ceph-mon-a-ff795f5c5-s25l2                                   1/1     Running   0          5m35s
rook-ceph-mon-b-7dc9957f8-7gjhb                                   1/1     Running   0          8m36s
rook-ceph-mon-c-787c5f555c-t9v7b                                  1/1     Running   0          7m5s
rook-ceph-operator-555cbb5cdf-rkht7                               1/1     Running   0          18m
rook-ceph-osd-0-6fdbb794fb-g4gfv                                  1/1     Running   0          4m56s
rook-ceph-osd-1-6bc6ffccc8-rjsz2                                  1/1     Running   0          3m39s
rook-ceph-osd-2-75bdc67b4b-9lxsp                                  1/1     Running   0          2m16s
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-f7865fb48lfv   1/1     Running   0          17m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-b964b87tzdjk   1/1     Running   0          15m
rook-ceph-tools-7ddd664854-prr4d                                  1/1     Running   0          18m

Here I see that rook-ceph-operator is 18m old, i.e. it did not get restarted after applying the hotfix by editing the CSV.
You can see here:

$ oc rsh -n openshift-storage rook-ceph-tools-7ddd664854-prr4d ceph versions
{
    "mon": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 7,
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 4
    }
}

After removing rook-ceph-operator and waiting a minute or two I see:

$ oc rsh -n openshift-storage rook-ceph-tools-7ddd664854-prr4d ceph versions
{
    "mon": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 11
    }
}

That restarted the MDS and RGW pods and the hotfix image is applied.

Version of all relevant components (if applicable):
OCS 4.6.4
OCP 4.6.12

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
Restarting the rook-ceph-operator pod

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
No - CLI steps for the hotfix

If this is a regression, please provide more details to justify this:
Not sure if it worked before

Steps to Reproduce:
1. Install OCP 4.6 with OCS 4.6.4
2. Edit the CSV and change CEPH_IMAGE
3. rook-ceph-operator is not restarted and the Ceph image is not propagated to the RGW and MDS pods

Actual results:
CEPH_IMAGE is not propagated to RGW and MDS because rook-ceph-operator is not restarted after editing the CSV.

Expected results:
rook-ceph-operator is restarted (or otherwise reconciles) so that CEPH_IMAGE is propagated to all pods.

Additional info:
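For reference, a minimal sketch of the workaround used above, i.e. forcing a restart of the Rook operator so that it re-reconciles with the new image. The label selector below is the one normally set on the operator pod in an OCS deployment; adjust it if your deployment differs:

# Restart the operator by deleting its pod (the Deployment recreates it):
$ oc delete pod -n openshift-storage -l app=rook-ceph-operator

# After a minute or two, confirm that all daemons report the hotfix version:
$ oc rsh -n openshift-storage rook-ceph-tools-7ddd664854-prr4d ceph versions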
The operator does not need to be restarted; the operator should just respond to the event that the CephCluster was updated. I see in the log that the CephCluster was updated, but I'm not sure why the mds and rgw controllers were not also triggered to update.
I talked to Neha and she told me that if we reproduce it I should open a BZ (and that you, Travis, told her to do this), so I did. I reproduced it 2 times in a row, so I think there is a bug if that reload is supposed to happen, hence I opened this BZ. Thanks
Thanks, good to hear there is a consistent repro. I agree there is a bug here; my previous comment was just trying to say it still needs investigation.
The issue is that the version of Ceph did not change. The Rook operator will notify the file and object controllers that they need to reconcile only when the Ceph version has changed. The version check is currently only based on the build number.

The two versions in this test are:
"ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 7,
"ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 4

In this case, the build numbers 14.2.11-139 are all equivalent. The rest of the build version is ignored by Rook for the comparison.

If the version is detected as changed [1], the operator log would show the message:
"upgrade in progress, notifying child CRs"

@Petr Is the Ceph build number actually expected to be unchanged during the hotfix? Or is this just an artifact found during testing?

[1] https://github.com/openshift/rook/blob/release-4.6/pkg/operator/ceph/cluster/cluster.go#L122
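To illustrate the comparison described above (a rough sketch, not Rook's actual code): only the leading major.minor.patch-build portion of the version string takes part in the check, so the hotfix suffix makes no difference. The log grep is simply one way to check whether the notification message mentioned above was emitted:

# Both version strings reduce to the same value for the check, so no change is detected:
$ echo "14.2.11-139.el8cp" | sed -E 's/^([0-9]+\.[0-9]+\.[0-9]+-[0-9]+).*/\1/'
14.2.11-139
$ echo "14.2.11-139.0.hotfix.bz1959254.el8cp" | sed -E 's/^([0-9]+\.[0-9]+\.[0-9]+-[0-9]+).*/\1/'
14.2.11-139

# If a version change had been detected, the operator log would contain the message above:
$ oc logs -n openshift-storage deploy/rook-ceph-operator | grep "upgrade in progress, notifying child CRs"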
Thanks Travis for the clarification. @bkunal this is more a question for Bipin. I got the image I should test, which is mentioned in the article: quay.io/rh-storage-partners/rhceph:4-50.0.hotfix.bz1959254. Bipin, can you please take a look at Travis's input? This will affect applying the hotfix if the version stays the same.
(In reply to Travis Nielsen from comment #5)
> The issue is that the version of Ceph did not change. The Rook operator will
> notify the file and object controllers that they need to reconcile only when
> the Ceph version has changed. The version check is currently only based on
> the build number.
>
> The two versions in this test are:
> "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp
> (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 7,
> "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5)
> nautilus (stable)": 4
>
> In this case, the build numbers 14.2.11-139 are all equivalent. The rest of
> the build version is ignored by Rook for the comparison.

The build number won't change for a hotfix build; a hotfix must be created on the same build. We do add a suffix (0.hotfix.bz1959254.el8cp), but I guess that doesn't get checked.

Then why did we see the image getting updated for OSD, MON, etc.? In my cluster, I did not even observe issues for MDS. In my cluster, I saw the ceph-detect-version pods getting respun as well.

> If the version is detected as changed [1], the operator log would show the
> message:
> "upgrade in progress, notifying child CRs"
>
> @Petr Is the Ceph build number actually expected to be unchanged during the
> hotfix? Or is this just an artifact found during testing?
>
> [1] https://github.com/openshift/rook/blob/release-4.6/pkg/operator/ceph/cluster/cluster.go#L122
@Bipin The main reconcile is triggered, which updates all the mon/mgr/osd daemons, but the mds and rgw need to have their controllers triggered also. This is being missed if the ceph version didn't change. Upstream issue opened for this: https://github.com/rook/rook/issues/7964
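In case it helps the investigation, one quick way to see this state on a cluster, i.e. which daemon pods already run the new image and which do not (the namespace and pod name patterns match the cluster shown above):

$ oc get pods -n openshift-storage -o custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[*].image | grep -E "rook-ceph-(mon|mgr|osd|mds|rgw)"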
Santosh, could you take a look?
(In reply to Travis Nielsen from comment #9)
> Santosh could you take a look?

On it.
In order to test this I will need to have a hotfix for one of the Ceph images. E.g. I just deployed the latest 4.8 cluster (ocs-operator.v4.8.0-432.ci) and I see this image is used in the CSV:

  - name: CEPH_IMAGE
    value: quay.io/rhceph-dev/rhceph@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c

ceph versions returns:
{
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 9
    }
}

Can someone create a hotfix build which will have a version like 14.2.11-181.0.hotfix.bzXXXXXX.el8cp so I can really verify this on the latest 4.8 build? Maybe @branto or @muagarwa can help here? Thanks
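For anyone reproducing this, one quick way to see which CEPH_IMAGE the installed CSV currently points at (the CSV name is the one from this cluster and will differ on other builds):

$ oc get csv ocs-operator.v4.8.0-432.ci -n openshift-storage -o yaml | grep -A1 "name: CEPH_IMAGE"
      - name: CEPH_IMAGE
        value: quay.io/rhceph-dev/rhceph@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c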
OCS 4.8 hotfix build is available now: quay.io/rhceph-dev/ocs-registry:4.8.0-449.ci

Build artifacts can be found here: https://ceph-downstream-jenkins-csb-storage.apps.ocp4.prod.psi.redhat.com/job/OCS%20Build%20Pipeline%204.8/162/

ocs-ci is still running though: https://ceph-downstream-jenkins-csb-storage.apps.ocp4.prod.psi.redhat.com/job/ocs-ci/455/
"name": "rhceph", "tag": "4-50.0.hotfix.bz1959254", "image": "quay.io/rhceph-dev/rhceph@sha256:6dbe1a5abfe1f3bf054b584d82f4011c0b0fec817924583ad834b4ff2a63c769", "nvr": "rhceph-container-4-50.0.hotfix.bz1959254" }, Deepshikha is this: 4-50.0.hotfix.bz1959254 image has version like: 14.2.11-181.0.hotfix.bzXXXXXX.el8cp as I see it has 4-50.0 in name? Deepshikha Please confirm that so I can continue with verification. Thanks
I am preparing a cluster for verification here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/4535/

As I didn't get an answer from Deepshikha, I will try with the image quay.io/rhceph-dev/rhceph@sha256:6dbe1a5abfe1f3bf054b584d82f4011c0b0fec817924583ad834b4ff2a63c769 and let you know the results.
I somehow missed the comment on this BZ. So sorry about that, Petr. Yes, you will have an image version like `2:14.2.11-139.0.hotfix.bz1959254.el8cp`. I can confirm from here: https://quay.io/repository/rhceph-dev/rhceph/manifest/sha256:286820cca8aa3d6b72eef6c59779c8931c14cf28dafabbb229235c3ccc26e763?tab=packages
Deepshikha, that version is not good enough. I need to have the exact same base version, which is supposed to be 14.2.11-181.el8cp, in order to test it. So I need an image with a version like 14.2.11-181.0.hotfix.bz1959254.el8cp.

For now I see all versions changed, but I cannot verify this BZ, as the hotfix must have the exact same base version as we have in the build itself in order to test this.

$ cat versions-after-hotfix.txt
{
    "mon": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 9
    }
}

$ cat versions-before-hotfix.txt
{
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 9
    }
}
Deepshikha, what Petr is asking is to create a temporary 4.8 build with ceph tag as 4-50.0.hotfix.bz1959254
So the rhceph hotfix build 4-50.0.hotfix.bz1959254 has the Ceph version `14.2.11-139.0.hotfix.bz1959254.el8cp`. Currently there is no hotfix rhceph image available for the current version, i.e. 4-57. We can probably create a recent 4.8 custom build with rhceph 4-50, and then you could upgrade from this new build to the hotfix build I provided earlier for verification. Let me know if that works for you.
I have triggered a custom build with rhceph tag 4-50. Link to the build pipeline: https://ceph-downstream-jenkins-csb-storage.apps.ocp4.prod.psi.redhat.com/job/OCS%20Build%20Pipeline%204.8/167/ It should probably help.
Tested with the custom build and it looks like it works well now.

$ cat versions-after-hotfix.txt
{
    "mon": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 10
    }
}

$ cat versions-before-hotfix.txt
{
    "mon": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 10
    }
}

Marking as verified.