Bug 1960784

Summary: After editing the CSV, the Rook operator does not update the MDS or RGW pods to apply the new Ceph hotfix image until the operator is restarted
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Petr Balogh <pbalogh>
Component: rook
Assignee: Santosh Pillai <sapillai>
Status: VERIFIED
QA Contact: Petr Balogh <pbalogh>
Severity: high
Priority: unspecified
Version: 4.6
CC: bkunal, dkhandel, muagarwa, shan
Target Milestone: ---
Keywords: AutomationBackLog
Target Release: OCS 4.8.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.8.0-416.ci
Doc Type: No Doc Update
Story Points: ---
Type: Bug
Regression: ---

Description Petr Balogh 2021-05-14 21:01:21 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
In this hotfix:
https://access.redhat.com/articles/6035981
the procedure is to change the CEPH_IMAGE in the CSV.

But the Rook operator is not restarted, so the new image is not propagated to the RGW and MDS pods.

$ oc get pod -n openshift-storage
NAME                                                              READY   STATUS    RESTARTS   AGE
csi-cephfsplugin-6w6zq                                            3/3     Running   0          58m
csi-cephfsplugin-fxrvn                                            3/3     Running   0          58m
csi-cephfsplugin-plcmn                                            3/3     Running   0          58m
csi-cephfsplugin-provisioner-66c59d467f-f9qjq                     6/6     Running   0          17m
csi-cephfsplugin-provisioner-66c59d467f-hm7cs                     6/6     Running   0          15m
csi-rbdplugin-2t6f5                                               3/3     Running   0          58m
csi-rbdplugin-provisioner-6b7dcf968-4pqdk                         6/6     Running   0          17m
csi-rbdplugin-provisioner-6b7dcf968-gkrf2                         6/6     Running   0          15m
csi-rbdplugin-r9q8b                                               3/3     Running   0          58m
csi-rbdplugin-v4zfl                                               3/3     Running   0          58m
noobaa-core-0                                                     1/1     Running   0          14m
noobaa-db-0                                                       1/1     Running   0          15m
noobaa-endpoint-8cd557c99-jrs5n                                   1/1     Running   1          17m
noobaa-operator-546db56fcc-vqknm                                  1/1     Running   0          15m
ocs-metrics-exporter-569957b47-4g7ft                              1/1     Running   0          15m
ocs-operator-67dcf65bf8-8trk4                                     1/1     Running   0          9m22s
rook-ceph-crashcollector-compute-0-8477f8cb98-55dhm               1/1     Running   0          8m36s
rook-ceph-crashcollector-compute-1-fc6b47b7c-wddlg                1/1     Running   0          5m35s
rook-ceph-crashcollector-compute-2-675f7d86d-vl8k5                1/1     Running   0          7m5s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-75d4db686dk2b   1/1     Running   0          18m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-79f4c6bdnsz8t   1/1     Running   0          15m
rook-ceph-mgr-a-6c7bfd476b-h4c4k                                  1/1     Running   0          5m12s
rook-ceph-mon-a-ff795f5c5-s25l2                                   1/1     Running   0          5m35s
rook-ceph-mon-b-7dc9957f8-7gjhb                                   1/1     Running   0          8m36s
rook-ceph-mon-c-787c5f555c-t9v7b                                  1/1     Running   0          7m5s
rook-ceph-operator-555cbb5cdf-rkht7                               1/1     Running   0          18m
rook-ceph-osd-0-6fdbb794fb-g4gfv                                  1/1     Running   0          4m56s
rook-ceph-osd-1-6bc6ffccc8-rjsz2                                  1/1     Running   0          3m39s
rook-ceph-osd-2-75bdc67b4b-9lxsp                                  1/1     Running   0          2m16s
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-f7865fb48lfv   1/1     Running   0          17m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-b964b87tzdjk   1/1     Running   0          15m
rook-ceph-tools-7ddd664854-prr4d                                  1/1     Running   0          18m

Here you can see that the rook-ceph-operator pod is 18m old, i.e. it was not restarted after applying the hotfix by editing the CSV.

You can see here:

$ oc rsh -n openshift-storage rook-ceph-tools-7ddd664854-prr4d ceph versions
{
    "mon": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 7,
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 4
    }
}

After deleting the rook-ceph-operator pod and waiting a minute or two I see:
$ oc rsh -n openshift-storage rook-ceph-tools-7ddd664854-prr4d ceph versions
{
    "mon": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 11
    }
}

The operator then restarted the MDS and RGW pods and the hotfix image was applied.

Version of all relevant components (if applicable):
OCS 4.6.4
OCP 4.6.12


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?
Restarting the rook-ceph-operator pod (see the example below).
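
For reference, a minimal sketch of the workaround (this assumes the standard app=rook-ceph-operator pod label set by the operator Deployment; adjust names to your cluster):

# Delete the operator pod; its Deployment recreates it, and the fresh
# operator instance reconciles the MDS and RGW deployments with the new image.
$ oc delete pod -n openshift-storage -l app=rook-ceph-operator

# Afterwards, confirm that every daemon reports the hotfix version.
$ oc rsh -n openshift-storage <rook-ceph-tools-pod> ceph versions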


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
No - the hotfix procedure uses CLI steps.

If this is a regression, please provide more details to justify this:
Not sure if it worked before

Steps to Reproduce:
1. Install OCP 4.6 - OCS 4.6.4
2. Edit the CSV and change the CEPH_IMAGE value (see the example commands below).
3. The rook-ceph-operator is not restarted and the new Ceph image is not propagated to the RGW and MDS pods.
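
As an illustration of steps 2-3 (the CSV name below is only an example; use the name reported by oc get csv on your cluster):

# Find the installed CSV and edit the CEPH_IMAGE env value of the
# rook-ceph-operator deployment embedded in it.
$ oc get csv -n openshift-storage
$ oc edit csv ocs-operator.v4.6.4 -n openshift-storage
#   ...set the value of the CEPH_IMAGE env variable to the hotfix image...

# Without the fix, the MDS and RGW pods keep running the old image until
# the rook-ceph-operator pod is deleted/restarted.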


Actual results:
CEPH_IMAGE is not propagated to the RGW and MDS pods because the rook-ceph-operator is not restarted after editing the CSV.

Expected results:
The rook-ceph-operator is restarted (or otherwise reconciles) so that CEPH_IMAGE is propagated to all pods.

Additional info:

Comment 2 Travis Nielsen 2021-05-14 21:34:29 UTC
The operator does not need to be restarted; it should simply respond to the event that the CephCluster was updated. I see in the log that the CephCluster was updated, but I am not sure why the MDS and RGW controllers were not also triggered to update.
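
For context (a generic query, not from this report): the image the operator reconciles from is the one in the CephCluster spec, which as far as I understand the ocs-operator populates from the CSV's CEPH_IMAGE value:

# Shows which Ceph image the CephCluster CR currently requests; if the CSV
# edit took effect, this already points at the hotfix image even though the
# MDS/RGW pods still run the old one.
$ oc -n openshift-storage get cephcluster -o jsonpath='{.items[0].spec.cephVersion.image}{"\n"}'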

Comment 3 Petr Balogh 2021-05-14 22:05:49 UTC
I talked to Neha and she told me that if we reproduce this I should open a BZ (and that you, Travis, told her to do this), so I did. I reproduced it twice in a row, so I think there is a bug if the operator is supposed to trigger that reload; hence I opened this BZ.

Thanks

Comment 4 Travis Nielsen 2021-05-14 22:13:45 UTC
Thanks, good to hear there is a consistent repro. I agree there is a bug here; my previous comment was just saying that it still needs investigation.

Comment 5 Travis Nielsen 2021-05-17 22:44:36 UTC
The issue is that the version of Ceph did not change. The Rook operator notifies the file and object controllers that they need to reconcile only when the Ceph version has changed, and the version check is currently based only on the build number.

The two versions in this test are:
"ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 7,
"ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 4

In this case, both build numbers resolve to 14.2.11-139, so they compare as equal; the rest of the build string, including the hotfix suffix, is ignored by Rook in the comparison.

If the version is detected as changed [1], the operator log would show the message:
"upgrade in progress, notifying child CRs"

@Petr Is the Ceph build number actually expected to be unchanged during the hotfix? Or is this just an artifact found during testing?


[1] https://github.com/openshift/rook/blob/release-4.6/pkg/operator/ceph/cluster/cluster.go#L122
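
As a quick way to check this on a live cluster (a generic command, not part of the original report), the operator log can be grepped for that message:

# If the operator detected a Ceph version change, this line is logged; with a
# hotfix that keeps the same build number (14.2.11-139), it never appears.
$ oc -n openshift-storage logs deploy/rook-ceph-operator | grep "notifying child CRs"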

Comment 6 Petr Balogh 2021-05-21 08:50:11 UTC
Thanks Travis for the clarification.

@bkunal this is more of a question for Bipin.

I got the image I should test, which is mentioned in the article: quay.io/rh-storage-partners/rhceph:4-50.0.hotfix.bz1959254 .

Bipin, can you please take a look at Travis's input?

This will affect applying the hotfix if the Ceph version stays the same.

Comment 7 Bipin Kunal 2021-05-21 12:48:33 UTC
(In reply to Travis Nielsen from comment #5)
> The issue is that the version of Ceph did not change. The Rook operator will
> notify the file and object controllers that they need to reconcile only when
> the Ceph version has changed. The version check is currently only based on
> the build number.
> 
> The two versions in this test are:
> "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp
> (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 7,
> "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5)
> nautilus (stable)": 4
> 
> In this case, the build numbers 14.2.11-139 are all equivalent. The rest of
> the build version is ignored by Rook for the comparison.

The build number won't change for a hotfix build; a hotfix must be created on the same build. We do add a suffix (0.hotfix.bz1959254.el8cp), but I guess that doesn't get checked.

Then why did we see the image getting updated for OSD, MON, etc.?

In my cluster, I did not even observe issues for MDS.
In my cluster, I also saw the ceph-detect-version pods getting respun.


> 
> If the version is detected as changed [1], the operator log would show the
> message:
> "upgrade in progress, notifying child CRs"
> 
> @Petr Is the Ceph build number actually expected to be unchanged during the
> hotfix? Or is this just an artifact found during testing?
> 
> 
> [1]
> https://github.com/openshift/rook/blob/release-4.6/pkg/operator/ceph/cluster/
> cluster.go#L122

Comment 8 Travis Nielsen 2021-05-21 17:02:52 UTC
@Bipin The main reconcile is triggered, which updates all the mon/mgr/osd daemons, but the mds and rgw need to have their controllers triggered also. This is being missed if the ceph version didn't change. 

Upstream issue opened for this: https://github.com/rook/rook/issues/7964
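
One way to see the partial update from outside (again a generic query, not taken from this report) is to compare the image each rook-ceph deployment is actually running:

# After editing the CSV, only the mon/mgr/osd deployments reference the new
# Ceph image; the mds and rgw deployments still point at the old one.
$ oc -n openshift-storage get deploy \
    -o custom-columns=NAME:.metadata.name,IMAGE:.spec.template.spec.containers[0].image \
    | grep rook-ceph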

Comment 9 Travis Nielsen 2021-05-21 17:03:16 UTC
Santosh could you take a look?

Comment 10 Santosh Pillai 2021-06-04 06:32:06 UTC
(In reply to Travis Nielsen from comment #9)
> Santosh could you take a look?

on it.

Comment 13 Petr Balogh 2021-06-29 16:19:24 UTC
In order to test this I will need a hotfix build of the Ceph image.

E.g. I just deployed the latest 4.8 (ocs-operator.v4.8.0-432.ci) cluster and I see this image used in the CSV:

- name: CEPH_IMAGE
  value: quay.io/rhceph-dev/rhceph@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c

ceph versions returns:
{
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 9
    }
}

Can someone create a hotfix build with a version like 14.2.11-181.0.hotfix.bzXXXXXX.el8cp so I can properly verify this on the latest 4.8 build?

Maybe @branto or @muagarwa can help here?

Thanks

Comment 14 Deepshikha khandelwal 2021-07-07 15:57:13 UTC
OCS 4.8 hotfix build is available now: quay.io/rhceph-dev/ocs-registry:4.8.0-449.ci

Build artifacts can be found here: https://ceph-downstream-jenkins-csb-storage.apps.ocp4.prod.psi.redhat.com/job/OCS%20Build%20Pipeline%204.8/162/

ocs-ci is still running though: https://ceph-downstream-jenkins-csb-storage.apps.ocp4.prod.psi.redhat.com/job/ocs-ci/455/

Comment 15 Petr Balogh 2021-07-09 17:18:36 UTC
    "name": "rhceph",
    "tag": "4-50.0.hotfix.bz1959254",
    "image": "quay.io/rhceph-dev/rhceph@sha256:6dbe1a5abfe1f3bf054b584d82f4011c0b0fec817924583ad834b4ff2a63c769",
    "nvr": "rhceph-container-4-50.0.hotfix.bz1959254"
  },

Deepshikha, does this 4-50.0.hotfix.bz1959254 image have a version like 14.2.11-181.0.hotfix.bzXXXXXX.el8cp? I see it has 4-50.0 in the name.

Deepshikha, please confirm so I can continue with verification.

Thanks

Comment 16 Petr Balogh 2021-07-13 13:45:43 UTC
I am preparing cluster for verification here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/4535/

As I didn't get an answer from Deepshikha, I will try with the image quay.io/rhceph-dev/rhceph@sha256:6dbe1a5abfe1f3bf054b584d82f4011c0b0fec817924583ad834b4ff2a63c769 and let you know the results.

Comment 17 Deepshikha khandelwal 2021-07-13 14:54:01 UTC
I somehow missed the comment on this BZ. So sorry about that, Petr.

Yes, you will have the image version like `2:14.2.11-139.0.hotfix.bz1959254.el8cp`.

I can confirm from here https://quay.io/repository/rhceph-dev/rhceph/manifest/sha256:286820cca8aa3d6b72eef6c59779c8931c14cf28dafabbb229235c3ccc26e763?tab=packages

Comment 18 Petr Balogh 2021-07-13 16:06:29 UTC
Deepshikha, that version is not good enough.

I need the exact same base version, which should be
14.2.11-181.el8cp, in order to test it.
So I need an image like:
14.2.11-181.0.hotfix.bz1959254.el8cp

For now I see all versions changed, but I cannot verify this BZ, as I need the exact same base version as in the build itself in order to test this.

$ cat versions-after-hotfix.txt
{
    "mon": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 9
    }
}
$ cat versions-before-hotfix.txt
{
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 9
    }
}

Comment 19 Mudit Agarwal 2021-07-13 16:32:12 UTC
Deepshikha, what Petr is asking for is a temporary 4.8 build with the ceph tag 4-50.0.hotfix.bz1959254.

Comment 20 Deepshikha khandelwal 2021-07-13 18:13:16 UTC
So the rhceph hotfix build 4-50.0.hotfix.bz1959254 has the ceph version `14.2.11-139.0.hotfix.bz1959254.el8cp`. Currently there is no hotfix rhceph image available for the same version, i.e. 4-57. We can probably create a recent 4.8 custom build with rhceph 4-50, and you can then upgrade from this new build to the hotfix build I provided earlier for verification. Let me know if that works for you.

Comment 21 Deepshikha khandelwal 2021-07-13 18:29:45 UTC
I have triggered a custom build with rhceph tag 4-50. 

Link to the build pipeline: https://ceph-downstream-jenkins-csb-storage.apps.ocp4.prod.psi.redhat.com/job/OCS%20Build%20Pipeline%204.8/167/

It should probably help.

Comment 22 Petr Balogh 2021-07-14 15:44:01 UTC
Tested with the custom build and it looks like it works well now.
$ cat versions-after-hotfix.txt
{
    "mon": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-139.0.hotfix.bz1959254.el8cp (5c0dc966af809fd1d429ec7bac48962a746af243) nautilus (stable)": 10
    }
}
$ cat versions-before-hotfix.txt
{
    "mon": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-139.el8cp (b8e1f91c99491fb2e5ede748a1c0738ed158d0f5) nautilus (stable)": 10
    }
}

Marking as verified