>> Description of problem (please be as detailed as possible and provide log snippets):
----------------------------------------------------------------------
While trying to figure out why some upgrades wait for 10 mins between each OSD pod re-spin and others do not, we confirmed the following behavior:

1. During OCS upgrade, the ceph pods like MONs, MGR, MDS and RGW are respun only once, in case there is a change in CEPH_IMAGE.
2. The OSD pods comprise containers with different images - CEPH_IMAGE and ROOK_CEPH_IMAGE. Hence, if there is a change in either (or both) of these 2 images, the OSD pods are respun during upgrade.

The OSD pod respin wait time of 10 min is enforced by the rook operator only if there is a change in the ceph image. If the OSD pods are respun due to a ROOK_CEPH_IMAGE change, they are respun one after the other without any wait time at all, and no ceph health check is performed by the rook operator.

No matter which image is getting upgraded, is it OK to respin all OSDs immediately after one another and not wait for PG status? The PGs are already in an unclean state when the next OSD respins. Tested this on a 6 OSD setup, and the total time taken for the 6 OSD upgrade was ~5m.

After a detailed discussion in the chatroom, it was confirmed that only a change in CEPH_IMAGE is treated as an upgrade, and in that case Rook will maintain the wait between OSDs. But I wanted to raise this issue to confirm whether we are ever planning to introduce this same wait when the pods respin because of a ROOK_CEPH_IMAGE version change.

BZ which confirms the wait time of 10 min when ROOK_CEPH_IMAGE is changed: Bug 1840729

As seen below, all 6 OSD pods were respun within max 5m34s:

rook-ceph-osd-0-c8cdcc6fd-4wz4h    1/1   Running   0   5m34s   10.129.2.75   compute-1   <none>   <none>
rook-ceph-osd-1-84958c4774-5bg5t   1/1   Running   0   2m23s   10.128.2.52   compute-0   <none>   <none>
rook-ceph-osd-2-5b8c8f6bf8-hwxsr   1/1   Running   0   3m50s   10.131.0.90   compute-2   <none>   <none>
rook-ceph-osd-3-5cfb6b4dd8-jrp6m   1/1   Running   0   4m14s   10.129.2.76   compute-1   <none>   <none>
rook-ceph-osd-4-685f7c6f89-25clx   1/1   Running   0   113s    10.128.2.53   compute-0   <none>   <none>
rook-ceph-osd-5-79cc7745ff-whpqg   1/1   Running   0   3m10s   10.131.0.92   compute-2   <none>   <none>

>> Version of all relevant components (if applicable):
----------------------------------------------------------------------
Pre-upgrade OCS = 4.3 GA
Post-upgrade OCS = 4.4 GA
OCP version = 4.4.6 (GA)

In 4.3
  - name: ROOK_CEPH_IMAGE
    value: quay.io/rhceph-dev/rook-ceph@sha256:8dee92b1f069fe7d5a00d4427a56b15f55034d58013e0f30bb68859bbc608914
  - name: CEPH_IMAGE
    value: registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:9e521d33c1b3c7f5899a8a5f36eee423b8003827b7d12d780a58a701d0a64f0d

In 4.4
  - name: ROOK_CEPH_IMAGE
    value: registry.redhat.io/ocs4/rook-ceph-rhel8-operator@sha256:8dee92b1f069fe7d5a00d4427a56b15f55034d58013e0f30bb68859bbc60891   <<<--- change in this image caused a respin of OSD pods
  - name: CEPH_IMAGE
    value: registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:9e521d33c1b3c7f5899a8a5f36eee423b8003827b7d12d780a58a701d0a64f0d   <<<--- same as OCS 4.3

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
----------------------------------------------------------------------
No. I do not know the real impact; for whatever IO was running on my cluster, I didn't see any IO fail. However, I am not sure whether respinning all the OSD pods in such quick succession could cause any user-data-related error.

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------------
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
----------------------------------------------------------------------
3

Can this issue be reproduced?
----------------------------------------------------------------------
Yes

Can this issue be reproduced from the UI?
----------------------------------------------------------------------
No, but the upgrade was initiated from the UI

If this is a regression, please provide more details to justify this:
----------------------------------------------------------------------
No. This has been the behavior in previous builds as well.

Steps to Reproduce:
----------------------------------------------------------------------
1. Create an OCS 4.3 cluster and add capacity to have 6 OSDs on the cluster
2. Initiate some IO via fedora pods (FIO), pgsql, etc. to use up some space in the ceph cluster
3. From the UI, change the channel to stable-4.4
4. With Approval Strategy: Automatic, the upgrade from 4.3 to 4.4 will be triggered automatically
5. It was observed that CEPH_IMAGE was the same between the 4.3 and 4.4 builds, but there was a change in ROOK_CEPH_IMAGE, and this resulted in the 6 OSD pods being respun with the newer image within 5-6 mins. No check for PG status was performed by the rook operator.

Actual results:
----------------------------------------------------------------------
During OCS upgrade, currently only a change in CEPH_IMAGE is treated as an upgrade, and only then does Rook wait between each OSD. It does not wait for a change in ROOK_CEPH_IMAGE.

Expected results:
----------------------------------------------------------------------
During upgrade, either the Rook or the OCS operator should enforce the wait time for each OSD pod respin, be it due to a change in CEPH_IMAGE or ROOK_CEPH_IMAGE.

Additional info:
----------------------------------------------------------------------
There was no mention of the following message in the rook logs:
"util: retrying after 1m0s, last error: cluster is not fully clean"

  mon: 3 daemons, quorum a,b,c (age 7h)
  mgr: a(active, since 7h)
  mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
  osd: 6 osds: 6 up (since 2m), 6 in (since 6h); 85 remapped pgs
  rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

  data:
    pools:   10 pools, 192 pgs
    objects: 144.64k objects, 186 GiB
    usage:   558 GiB used, 2.4 TiB / 3.0 TiB avail
    pgs:     16941/433911 objects degraded (3.904%)
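For reference, the images each OSD deployment is actually running, and the PG state while the OSDs are respinning, can be checked with something like the following. This is only a rough sketch; it assumes the default openshift-storage namespace and that the rook-ceph-tools toolbox deployment is present on the cluster.

# List the images used by each OSD deployment (init containers and regular
# containers, which is where ROOK_CEPH_IMAGE and CEPH_IMAGE show up)
oc -n openshift-storage get deployments -l app=rook-ceph-osd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.initContainers[*].image}{" "}{.spec.template.spec.containers[*].image}{"\n"}{end}'

# Check PG state from the toolbox pod while the OSDs are respinning
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
oc -n openshift-storage exec "$TOOLS_POD" -- ceph status
oc -n openshift-storage exec "$TOOLS_POD" -- ceph pg stat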
Since Rook v1.3 upstream (and OCS 4.5), the OSD upgrade behavior has already changed. Previously, there was only a wait when upgrading the OSDs if the ceph image was updated, but not if the Rook image was upgraded. This is what you are seeing in the 4.3 and 4.4 releases. Now the OSD upgrade behavior is to check whether there is a difference in the pod spec to determine if we should wait during the upgrade. I wouldn't expect to hit this issue anymore. @leseb, please correct me if needed.

@Neha, in that case, please confirm whether it is already fixed in the 4.5 builds.
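One quick way to see whether that wait is actually happening during an OSD update is to follow the operator log for the retry message quoted in the report. A rough sketch, assuming the operator deployment is named rook-ceph-operator in the openshift-storage namespace:

# Follow the Rook operator log during the upgrade and watch for the
# "cluster is not fully clean" retry message between OSD updates
oc -n openshift-storage logs deploy/rook-ceph-operator -f | grep -i "not fully clean"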
That's correct, Travis.
Acking as fixed in 4.5 and moving to ON_QA to validate.
Hi Travis,

With the current upgrade builds - OCS 4.4.2 and OCS 4.5 - even if we select 2 builds whose Ceph versions are the same internally, e.g. OCS 4.5 (v4.5.0-43.ci) and OCS 4.4.2 GA, these are the differences, so replicating the exact same behavior to verify is tough:

1. OCS 4.4 had both rhceph and rook-ceph-rhel8 versions in the pod containers.
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-vu1cs33-t1/jnk-vu1cs33-t1_20200805T161916/logs/failed_testcase_ocs_logs_1596649015/test_add_capacity_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-183cc9be0eaec7e3ecf74cce99cfe511f296f1e023798bb5296953d3c3ffb14f/ceph/namespaces/openshift-storage/pods/rook-ceph-osd-0-777ff99fcd-dxjv4/rook-ceph-osd-0-777ff99fcd-dxjv4.yaml

2. OCS 4.5 only has rhceph-dev.
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bug-1860418/must-gather.local.6020862434318975465/ceph/namespaces/openshift-storage/pods/rook-ceph-osd-0-b7859999b-xr5q6/rook-ceph-osd-0-b7859999b-xr5q6.yaml

So, could you let us know what we need to verify, or whether there is any other upgrade path by which we can test this?
@Neha You could simulate the upgrade scenarios with the following (a rough command sketch follows after the steps):

1) Simulate only a change in the ceph image
- Install OCS 4.5
- Change the ceph image tag so that it appears to be a different image and set it in the storage cluster CR
- Watch that the ceph pods are all updated, and that pod restarts wait for clean PGs

2) Simulate that the rook deployment has changed
- Install OCS 4.5
- Change something in the deployment/pod spec for an OSD, such as adding a new label
- Restart the rook operator
- Watch that the OSD is restarted because its pod spec changed
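A possible set of commands for the two simulations above. This is only a sketch: the namespace (openshift-storage), the CephCluster name (ocs-storagecluster-cephcluster), the placeholder image tag, the test label, and the OSD deployment name (rook-ceph-osd-0) are assumptions for illustration, and a direct edit of the CephCluster CR may be reconciled back by the ocs-operator.

# 1) Make the ceph image look different so Rook treats it as a Ceph upgrade
oc -n openshift-storage patch cephcluster ocs-storagecluster-cephcluster --type merge \
  -p '{"spec":{"cephVersion":{"image":"registry.redhat.io/rhceph/rhceph-4-rhel8:<different-tag>"}}}'
# Watch that the OSDs restart one at a time, waiting for clean PGs in between
oc -n openshift-storage get pods -l app=rook-ceph-osd -w

# 2) Change the OSD pod spec (add a label to the pod template), then restart the operator
oc -n openshift-storage patch deployment rook-ceph-osd-0 --type merge \
  -p '{"spec":{"template":{"metadata":{"labels":{"upgrade-test":"true"}}}}}'
oc -n openshift-storage delete pod -l app=rook-ceph-operator
# The operator should detect that the OSD pod spec differs from its desired spec and restart the OSD
oc -n openshift-storage get pods -l app=rook-ceph-osd -w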
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754