Bug 1895402
| Summary: | Mon pods didn't get upgraded in 720 second timeout from OCS 4.5 upgrade to 4.6 | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Petr Balogh <pbalogh> |
| Component: | rook | Assignee: | Sébastien Han <shan> |
| Status: | CLOSED ERRATA | QA Contact: | Petr Balogh <pbalogh> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.6 | CC: | kramdoss, madam, muagarwa, nberry, ocs-bugs, ratamir, sapillai, tnielsen |
| Target Milestone: | --- | Keywords: | Automation, Regression, Upgrades |
| Target Release: | OCS 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.6.0-178.ci | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-12-17 06:25:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
Description
Petr Balogh
2020-11-06 15:36:24 UTC
Anything in the logs?

Around the time we start checking that the MON pods have been upgraded, I see:

2020-11-05T00:08:00.065405541Z 2020-11-05 00:08:00.065373 I | op-mon: deployment for mon rook-ceph-mon-a already exists. updating if needed
2020-11-05T00:08:00.083993260Z 2020-11-05 00:08:00.083950 I | op-k8sutil: updating deployment "rook-ceph-mon-a" after verifying it is safe to stop
2020-11-05T00:08:00.083993260Z 2020-11-05 00:08:00.083971 I | op-mon: checking if we can stop the deployment rook-ceph-mon-a
2020-11-05T00:08:00.895006062Z 2020-11-05 00:08:00.894967 I | op-mon: parsing mon endpoints: b=172.30.225.185:6789,c=172.30.55.186:6789,a=172.30.70.159:6789
2020-11-05T00:08:01.324313361Z 2020-11-05 00:08:01.324271 E | ceph-block-pool-controller: failed to reconcile failed to fetch ceph version from cephcluster "ocs-storagecluster-cephcluster": attempt to determine ceph version for the current cluster image timed out
2020-11-05T00:08:03.893037995Z 2020-11-05 00:08:03.892985 I | op-mon: parsing mon endpoints: b=172.30.225.185:6789,c=172.30.55.186:6789,a=172.30.70.159:6789
2020-11-05T00:08:04.229074275Z 2020-11-05 00:08:04.229032 E | ceph-block-pool-controller: failed to reconcile failed to fetch ceph version from cephcluster "ocs-storagecluster-cephcluster": attempt to determine ceph version for the current cluster image timed out
2020-11-05T00:08:06.576691046Z 2020-11-05 00:08:06.576598 E | ceph-crashcollector-controller: ceph version not found for image "quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9" used by cluster "ocs-storagecluster-cephcluster". attempt to determine ceph version for the current cluster image timed out
2020-11-05T00:08:06.655759168Z 2020-11-05 00:08:06.655717 E | ceph-crashcollector-controller: ceph version not found for image "quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9" used by cluster "ocs-storagecluster-cephcluster". attempt to determine ceph version for the current cluster image timed out
2020-11-05T00:08:09.354702305Z 2020-11-05 00:08:09.354450 I | op-mon: parsing mon endpoints: b=172.30.225.185:6789,c=172.30.55.186:6789,a=172.30.70.159:6789
2020-11-05T00:08:11.961807373Z 2020-11-05 00:08:11.961747 E | ceph-crashcollector-controller: ceph version not found for image "quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9" used by cluster "ocs-storagecluster-cephcluster". attempt to determine ceph version for the current cluster image timed out
2020-11-05T00:08:12.066915154Z 2020-11-05 00:08:12.066861 E | ceph-crashcollector-controller: ceph version not found for image "quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9" used by cluster "ocs-storagecluster-cephcluster". attempt to determine ceph version for the current cluster image timed out
2020-11-05T00:08:13.152956483Z 2020-11-05 00:08:13.152910 I | op-k8sutil: finished waiting for updated deployment "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a"
2020-11-05T00:08:13.152956483Z 2020-11-05 00:08:13.152928 I | op-mon: checking if we can continue the deployment rook-ceph-mds-ocs-storagecluster-cephfilesystem-a
2020-11-05T00:08:15.687403102Z 2020-11-05 00:08:15.687323 E | ceph-block-pool-controller: failed to reconcile failed to fetch ceph version from cephcluster "ocs-storagecluster-cephcluster": attempt to determine ceph version for the current cluster image timed out

I found this in the rook operator logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j004vu1cs33-uan/j004vu1cs33-uan_20201104T222906/logs/failed_testcase_ocs_logs_1604533021/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-67f0978d8c422cc88268cbfc24223a71e8cc5ef9d6c152c6cb912ae0b50da8da/namespaces/openshift-storage/pods/rook-ceph-operator-659499495f-vwkss/rook-ceph-operator/rook-ceph-operator/logs/current.log

And from our job, at 00:20:54 I see the pod still hadn't gotten the new image.

Travis, can you please take a look?

Removing the needinfo from Raz, as I provided the QE input above; we need someone from engineering to take a look and get a better understanding of what delays the MON pods from being upgraded in a reasonable time.

Reproduced here:
dir/auth/kubeconfig get Pod rook-ceph-mon-a-58fc46c8-6ncqb -n openshift-storage -o yaml
13:26:55 - MainThread - ocs_ci.ocs.resources.pod - WARNING - Images: {'quay.io/rhceph-dev/rhceph@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8'} weren't upgraded in: rook-ceph-mon-a-58fc46c8-6ncqb!
13:26:55 - MainThread - ocs_ci.ocs.utils - INFO - Must gather image: quay.io/rhceph-dev/ocs-must-gather:latest-4.6 will be used.
13:26:55 - MainThread - ocs_ci.ocs.utils - INFO - OCS logs will be placed in location /home/jenkins/current-cluster-dir/logs/failed_testcase_ocs_logs_1604923078/test_upgrade_ocs_logs/ocs_must_gather
13:26:55 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.6 --dest-dir=/home/jenkins/current-cluster-dir/logs/failed_testcase_ocs_logs_1604923078/test_upgrade_ocs_logs/ocs_must_gather
13:27:25 - Thread-1 - ocs_ci.ocs.cluster_load - INFO - Cluster utilization:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/14442/consoleFull
This time it was reproduced on AWS, so we have two reproductions in the last two builds. BTW, this execution was with an upgrade to the RC1 build.
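For context, the test harness polls each pod and compares every container image against the expected post-upgrade digest until the timeout expires; that is what produces the "Images ... weren't upgraded in: ..." warnings above. Below is a minimal sketch of that kind of check (hypothetical; the real check lives in ocs-ci and is written in Python):

```go
package main

import "fmt"

// podUpgraded reports whether every container in a pod runs the expected
// post-upgrade image, returning the stale container images otherwise.
func podUpgraded(containerImages map[string]string, expected string) (bool, []string) {
	var stale []string
	for name, image := range containerImages {
		if image != expected {
			stale = append(stale, fmt.Sprintf("%s=%s", name, image))
		}
	}
	return len(stale) == 0, stale
}

func main() {
	expected := "quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9"
	// rook-ceph-mon-a still running the pre-upgrade image, as in the warning above.
	images := map[string]string{
		"mon": "quay.io/rhceph-dev/rhceph@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8",
	}
	if ok, stale := podUpgraded(images, expected); !ok {
		fmt.Printf("Images %v weren't upgraded\n", stale)
	}
}
```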
12 days back: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/13940/consoleFull
In this execution we didn't hit the issue, with build ocs-operator.v4.6.0-147.ci.

The osd prepare job likely cannot find the ceph image. This is the only indication in the log:

ceph version not found for image "quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9"

Can you get an "oc describe pod" on the ceph detect job? It should show more details about why the job is not completing, but I suspect it's just not finding the image.

A few more clues:
- After the operator is restarted, everything works as expected.
- This issue has been seen at least a couple of times during the OCS upgrade tests for 4.6.
- This issue was not seen in previous releases, so there appears to be a regression.
- In 4.6, the Rook controller for the cluster CR had a significant rewrite of the reconcile on the controller runtime, which could be related to a regression.

Seb and I suspect an in-memory issue that sometimes happens where the cephcluster version is not updated as expected after the detect-version job. That version is used by multiple controllers to know whether they should reconcile, and which version of the cephcluster is being reconciled. If it only happens occasionally, this could explain why we haven't seen it upstream since v1.4.

Based on what Travis mentioned in the above comment and the operator logs, I suspect the following could be happening:
1. In both operator logs given above, the upgrade was detected while a reconcile was already in progress:
2020-11-09T13:12:57.164427668Z 2020-11-09 13:12:57.164404 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2020-11-09T13:12:57.164523996Z 2020-11-09 13:12:57.164498 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2020-11-09T13:12:57.509119960Z 2020-11-09 13:12:57.509083 I | ceph-cluster-controller: cluster "openshift-storage": version "14.2.8-91 nautilus" detected for image "quay.io/rhceph-dev/rhceph@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8"
2020-11-09T13:12:57.583620056Z 2020-11-09 13:12:57.583570 I | op-k8sutil: CSI_CEPHFS_PLUGIN_RESOURCE="" (default)
2020-11-09T13:12:57.857705702Z 2020-11-09 13:12:57.857660 I | op-mon: start running mons
2020-11-09T13:12:57.966652084Z 2020-11-09 13:12:57.966596 I | ceph-spec: CR has changed for "ocs-storagecluster-cephcluster". diff= v1.ClusterSpec{
2020-11-09T13:12:57.966652084Z CephVersion: v1.CephVersionSpec{
2020-11-09T13:12:57.966652084Z Image: strings.Join({
2020-11-09T13:12:57.966652084Z "quay.io/rhceph-dev/rhceph@sha256:",
2020-11-09T13:12:57.966652084Z - "eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8",
2020-11-09T13:12:57.966652084Z + "22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9",
2020-11-09T13:12:57.966652084Z }, ""),
2020-11-09T13:12:57.966652084Z AllowUnsupported: false,
2. Since the upgrade was detected during an active reconcile, the cephVersion in the spec and in the status could have been different for that reconcile.
- We update the cephVersion in the status in this method: https://github.com/rook/rook/blob/16d3895fbcd78d949ff1ef6d1e9d47cfb6d65c39/pkg/operator/ceph/cluster/cephstatus.go#L211
3. Because of the different ceph versions in the spec and the status, the `GetImageVersion` method would always fail for the current reconcile (see https://github.com/rook/rook/blob/88434ceb5167ba9172d3c354f64cb513abee93ac/pkg/operator/ceph/controller/version.go#L43 and the sketch after this list).
4. When the current reconcile completes, the next reconcile (which was triggered by the ceph cluster upgrade) starts:
Log Snippets:
2020-11-09 13:49:53.670013 I | ceph-cluster-controller: done reconciling ceph cluster in namespace "openshift-storage"
2020-11-09 13:49:53.714311 I | ceph-cluster-controller: reconciling ceph cluster in namespace "openshift-storage"
2020-11-09 13:49:53.722273 I | op-mon: parsing mon endpoints: a=172.30.20.250:6789,b=172.30.105.63:6789,c=172.30.142.207:6789
2020-11-09 13:49:53.732153 I | ceph-cluster-controller: detecting the ceph image version for image quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9...
2020-11-09 13:49:57.272222 I | ceph-cluster-controller: detected ceph image version: "14.2.8-111 nautilus"
2020-11-09 13:49:57.272240 I | ceph-cluster-controller: validating ceph version from provided image
2020-11-09 13:49:57.289156 I | op-mon: parsing mon endpoints: a=172.30.20.250:6789,b=172.30.105.63:6789,c=172.30.142.207:6789
2020-11-09 13:49:57.289334 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2020-11-09 13:49:57.289391 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2020-11-09 13:49:57.619224 W | ceph-cluster-controller: it looks like we have more than one ceph version running. triggering upgrade. map[ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable):2 ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable):10]:
2020-11-09 13:49:57.968199 I | ceph-cluster-controller: upgrading ceph cluster to %q14.2.8-111 nautilus
2020-11-09 13:49:57.968223 I | ceph-cluster-controller: cluster "openshift-storage": version "14.2.8-111 nautilus" detected for image "quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9"
2020-11-09 13:49:58.037891 I | op-mon: start running mons
2020-11-09 13:49:58.045323 I | op-mon: parsing mon endpoints: a=172.30.20.250:6789,b=172.30.105.63:6789,c=172.30.142.207:6789
5. This time the cephVersion in the spec and in the status are the same, so the upgrade works fine.
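In other words: the other controllers look up the ceph version for the image now in the spec, but the version recorded in the status was detected for the old image, so the lookup keeps failing until the cluster controller's next reconcile records the new version. Below is a minimal Go sketch of that failure mode, with simplified, hypothetical types (not rook's actual code):

```go
package main

import "fmt"

// cephCluster is a simplified stand-in for the CephCluster CR fields involved.
type cephCluster struct {
	SpecImage     string // spec.cephVersion.image, already updated by the upgrade
	StatusImage   string // image the running cluster's version was detected for
	StatusVersion string // detected ceph version recorded for StatusImage
}

// getImageVersion mimics the behavior point 3 describes: the detected version
// is only usable when it was recorded for the image currently in the spec.
func getImageVersion(c cephCluster) (string, error) {
	if c.SpecImage != c.StatusImage {
		return "", fmt.Errorf("ceph version not found for image %q", c.SpecImage)
	}
	return c.StatusVersion, nil
}

func main() {
	// Mid-reconcile state from the logs above: the spec already points at the
	// new image (...22ea8e), but the status still holds the version detected
	// for the old one (...eafd1a). Digests truncated here for readability.
	c := cephCluster{
		SpecImage:     "quay.io/rhceph-dev/rhceph@sha256:22ea8e...",
		StatusImage:   "quay.io/rhceph-dev/rhceph@sha256:eafd1a...",
		StatusVersion: "14.2.8-91 nautilus",
	}
	if _, err := getImageVersion(c); err != nil {
		// The block-pool and crashcollector controllers keep retrying this
		// lookup until they give up with "...timed out", as in the logs above.
		fmt.Println(err)
	}
}
```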
We had some discussion with Travis; I saw a similar issue 28 days back in this execution: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-upgrade-ocs-auto-nightly/1/artifact/deploy-ocs-cluster-build.log/*view*/
In that execution, for example, I found that at 17:37:24 we started checking the upgrade of the mons, and at 17:46:10 the upgrade had completed for all mon pods.

This is 27 days back: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-upgrade-ocs-auto-nightly/1/
That upgrade job failed on another issue, not on the mon pods in this case: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j001vu1cs33-uan/j001vu1cs33-uan_20201013T144335/logs/failed_testcase_ocs_logs_1602604209/test_upgrade_ocs_logs/
Here is the must-gather from that run, if it helps to see how it looked about a month ago. The error was:

start_time = 1602611930.6920912, timeout = 720 err_msg = 'Waiting for 1 pods with selector: app=rook-ceph-rgw to be running and upgraded.

Santosh, thanks for the analysis; that's also what Seb and I found when analyzing another log. Ignore my previous comment about the version job not completing, the error messages were misleading.
The fundamental issue here is that upgrading OCS is a two-stage process:
1. Upgrade the operator --> kicks off a reconcile and restarts daemons if needed
2. Upgrade the ceph version --> kicks off another reconcile and restarts ceph daemons again
The issue is really that the reconcile started in step 1 is taking a long time (37 minutes in this case). In 4.6 there is an update to the OSD pod spec, so step 1 restarts all the daemons, which takes a long time because Rook waits for PGs to be clean after each daemon restart. What we really need to do is abort a reconcile if an updated cluster CR is committed (see the sketch after the key events below).
The key events are:
# The first reconcile begins and the ceph version is detected:
2020-11-09T13:12:57.509119960Z 2020-11-09 13:12:57.509083 I | ceph-cluster-controller: cluster "openshift-storage": version "14.2.8-91 nautilus" detected for image "quay.io/rhceph-dev/rhceph@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8"
# The cluster CR is updated, which will start the second reconcile as soon as the first reconcile is done. Notice how this happens immediately after the first reconcile started, so we have to wait a long time.
2020-11-09T13:12:57.966652084Z 2020-11-09 13:12:57.966596 I | ceph-spec: CR has changed for "ocs-storagecluster-cephcluster". diff= v1.ClusterSpec{
# The second reconcile finally starts 37 minutes later
2020-11-09 13:49:57.272222 I | ceph-cluster-controller: detected ceph image version: "14.2.8-111 nautilus"
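A minimal sketch of the suggested abort, assuming a hypothetical helper that re-reads the CR's generation between daemon restarts (an illustration of the idea, not the actual rook implementation):

```go
package main

import (
	"errors"
	"fmt"
)

// errAbortReconcile signals that the CR changed mid-reconcile and the
// controller should requeue instead of finishing work against a stale spec.
var errAbortReconcile = errors.New("cluster CR changed, aborting reconcile")

// latestGeneration stands in for re-reading the CephCluster CR's
// metadata.generation from the API server (hypothetical helper).
var latestGeneration = int64(1)

// restartDaemons walks the daemons one by one (as Rook does, waiting for PGs
// to be clean between OSD restarts) and bails out as soon as a newer CR
// generation shows up, so the next reconcile starts from the final spec.
func restartDaemons(observedGeneration int64, daemons []string) error {
	for _, d := range daemons {
		if latestGeneration != observedGeneration {
			return errAbortReconcile
		}
		fmt.Printf("restarting %s and waiting for clean PGs\n", d)
		if d == "rook-ceph-osd-0" {
			latestGeneration++ // simulate the cephVersion image edit landing mid-reconcile
		}
	}
	return nil
}

func main() {
	daemons := []string{"rook-ceph-mon-a", "rook-ceph-osd-0", "rook-ceph-osd-1"}
	if err := restartDaemons(1, daemons); err != nil {
		fmt.Println(err) // aborts before rook-ceph-osd-1 instead of finishing stale work
	}
}
```

With something along these lines, the first reconcile would stop as soon as the CR diff at 13:12:57 landed, instead of restarting every daemon before the second reconcile could begin.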
Upstream issue opened to track the suggested improvement: https://github.com/rook/rook/issues/6587

Since this is not a regression, I would recommend we not consider it blocking for 4.6, but instead get the fix into 4.7 and consider a backport to 4.6.z depending on the risk of the implementation.

Travis, if this is not a regression, why didn't we see it in the upgrade from 4.4 to 4.5, or from 4.5.1 to 4.5.2, or in fact in any OCS 4.5.z upgrade?

This issue only manifests itself during some upgrades. The key is step 1. If there is a change to the OSD pod spec in a rook release, Rook will restart the OSDs in both step 1 and step 2. If there is no update to the pod spec other than the ceph image, the OSD restart only happens in step 2. In patch releases and some minor releases we haven't had OSD pod spec updates that would impact step 1, which is why we haven't hit this until now.

I see, thanks for the explanation. I guess it will be discussed and decided on tomorrow's program call.

Running jobs with the increased timeout here:
AWS: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto/67/console
vSphere: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-upgrade-ocs-auto/42/console
Let's see if doubling the timeout from 720 to 1440 seconds helps.

Just to update here: in this execution on vSphere, https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/14587/consoleFull, the upgrade of the mon pods went much faster this time, but it took much longer for the OSD pods. We started checking whether the OSD pods had been upgraded at:

10:39:17 - MainThread - ocs_ci.ocs.resources.pod - WARNING - Images: {'registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8'} weren't upgraded in: rook-ceph-osd-0-8545d754f-7pncl!

And it finished this time at:

11:23:28 - MainThread - ocs_ci.ocs.ocp - INFO - All the images: {'container': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'osd': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'blkdevmapper': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'activate': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'expand-bluefs': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'chown-container-data-dir': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9'} were successfully upgraded in: rook-ceph-osd-5-5bb5f9d6f6-cp5q9!

That means it took about 45 minutes to upgrade 6 OSD pods. The must-gather collected at the end of the upgrade run is here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j042vu1cs33-ua/j042vu1cs33-ua_20201112T073540/logs/testcases_1605170699/

AWS run: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/14586/consoleFull
We started checking the MON pods at:

09:51:16 - MainThread - ocs_ci.ocs.resources.pod - INFO - Waiting for 3 pods with selector: app=rook-ceph-mon to be running and upgraded.
And finished at:

09:53:02 - MainThread - ocs_ci.ocs.ocp - INFO - All the images: {'container': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'mon': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'chown-container-data-dir': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'init-mon-fs': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9'} were successfully upgraded in: rook-ceph-mon-c-5bfff8f476-ddldd!

The OSD check started at:

09:53:19 - MainThread - ocs_ci.ocs.resources.pod - INFO - Waiting for 6 pods with selector: app=rook-ceph-osd to be running and upgraded.

And finished at:

10:15:12 - MainThread - ocs_ci.ocs.ocp - INFO - All the images: {'container': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'osd': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'blkdevmapper': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'activate': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'expand-bluefs': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'chown-container-data-dir': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9'} were successfully upgraded in: rook-ceph-osd-5-75f65d6c9c-nk9tm!

That is about 23 minutes. Must-gather from this upgrade execution: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j067ai3c33-ua/j067ai3c33-ua_20201112T073330/logs/testcases_1605169828/

Both upgrades passed, and we haven't seen this issue in either run, which is weird. Tier1 runs are in progress; I can update on their status once they finish.

Moving to 4.6.z as we don't see this as a blocker, but we really need to avoid the double upgrade, and the fix will be relatively contained for the z-stream.

Since there is a patch now, and there is a chance that it might be ready in time, we might reconsider taking this BZ for 4.6.0.

Downstream PR: https://github.com/openshift/rook/pull/149

Moved back to 4.6; please provide qa_ack.

Running verification job here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/15165/console

PR removing the workaround from the ocs-ci upgrade test: https://github.com/red-hat-storage/ocs-ci/pull/3423

Can this be moved to ON_QA if it's part of the RC4 build? Thanks.

In the verification job (https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/15165/consoleFull) I see we started checking whether the mon pods had been upgraded at 15:53:44:

15:53:44 - MainThread - ocs_ci.ocs.ocp - INFO - All the images: {'container': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'mon': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'chown-container-data-dir': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'init-mon-fs': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9'} were successfully upgraded in: rook-ceph-mon-a-89997f446-szmmg!
And here we already have the new images:

15:54:09 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-b-57f95787ff-qxkx9 -n openshift-storage -o yaml
15:54:10 - MainThread - ocs_ci.ocs.ocp - INFO - All the images: {'container': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'mon': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'chown-container-data-dir': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'init-mon-fs': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9'} were successfully upgraded in: rook-ceph-mon-b-57f95787ff-qxkx9!
15:54:10 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-c-74ccd6f578-mm5vq -n openshift-storage -o yaml
15:54:10 - MainThread - ocs_ci.ocs.ocp - INFO - All the images: {'container': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'mon': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'chown-container-data-dir': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'init-mon-fs': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9'} were successfully upgraded in: rook-ceph-mon-c-74ccd6f578-mm5vq!

And of course we also checked the OSDs; we started checking at 15:54:33:

15:54:33 - MainThread - ocs_ci.ocs.resources.pod - INFO - Waiting for 6 pods with selector: app=rook-ceph-osd to be running and upgraded.
15:54:33 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod -n openshift-storage -o yaml
15:54:38 - MainThread - ocs_ci.ocs.resources.pod - INFO - Found 6 pod(s) for selector: app=rook-ceph-osd
15:54:38 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-osd-0-578d75897b-jcl6r -n openshift-storage -o yaml
15:54:39 - MainThread - ocs_ci.ocs.resources.pod - WARNING - Images: {'registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8'} weren't upgraded in: rook-ceph-osd-0-578d75897b-jcl6r!

All 6 OSDs finished the upgrade at 16:18:47:

16:18:47 - MainThread - ocs_ci.ocs.ocp - INFO - All the images: {'container': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'osd': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'blkdevmapper': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'activate': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'expand-bluefs': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'chown-container-data-dir': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9'} were successfully upgraded in: rook-ceph-osd-5-68ddb95dc7-h6jf8!

So based on this execution I can mark this as verified on build ocs-operator.v4.6.0-178.ci.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5605