Bug 1895402

Summary: Mon pods didn't get upgraded within the 720-second timeout during the OCS 4.5 to 4.6 upgrade
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Petr Balogh <pbalogh>
Component: rook
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: Petr Balogh <pbalogh>
Severity: high
Priority: unspecified
Version: 4.6
CC: kramdoss, madam, muagarwa, nberry, ocs-bugs, ratamir, sapillai, tnielsen
Target Milestone: ---
Keywords: Automation, Regression, Upgrades
Target Release: OCS 4.6.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.6.0-178.ci
Doc Type: No Doc Update
Last Closed: 2020-12-17 06:25:30 UTC
Type: Bug
Regression: ---

Description Petr Balogh 2020-11-06 15:36:24 UTC
Description of problem:
In this testing upgrade job:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/14296/consoleFull

I see that we failed on a timeout when upgrading OCS with IO running in the background.

From must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j004vu1cs33-uan/j004vu1cs33-uan_20201104T222906/logs/failed_testcase_ocs_logs_1604533021/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-67f0978d8c422cc88268cbfc24223a71e8cc5ef9d6c152c6cb912ae0b50da8da/namespaces/openshift-storage/oc_output/pods_-owide

I see that the pod rook-ceph-mon-a-879746484-c4mfn, which our automation found was not upgraded within the 720-second timeout, is no longer present in the must gather - which means it probably got upgraded right after we hit the timeout, as I see:
rook-ceph-mon-a-f76db6bcb-66s8n                                   1/1     Running     0          10m     10.128.2.53   compute-0   <none>           <none>

00:20:53 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get Pod rook-ceph-mon-a-879746484-c4mfn -n openshift-storage -o yaml
00:20:54 - MainThread - ocs_ci.ocs.resources.pod - WARNING - Images: {'registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8'} weren't upgraded in: rook-ceph-mon-a-879746484-c4mfn!
00:20:55 - MainThread - ocs_ci.ocs.utils - INFO - Must gather image: quay.io/rhceph-dev/ocs-must-gather:latest-4.6 will be used.

We have two options: either update the timeout in our automation, if 720 seconds is not enough for the mon pods to be upgraded, or there is some regression that makes the upgrade take longer now.

I probably wouldn't consider this a blocker, as the pod was finally upgraded a few minutes after we timed out, but it's worth checking out.
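For context, the automation's upgrade check is essentially a polling loop with a hard deadline. A minimal sketch of that pattern (hypothetical helper names; the real ocs-ci implementation differs):

```python
import time

def wait_for_pods_upgraded(get_pod_images, expected_image, pod_names,
                           timeout=720, interval=10):
    """Poll until every pod runs only the expected image, or raise on timeout.

    get_pod_images(pod) is assumed to return the set of images used by the
    pod, e.g. parsed from `oc get pod <pod> -o yaml`.
    """
    deadline = time.time() + timeout
    pending = set(pod_names)
    while pending:
        for pod in list(pending):
            if get_pod_images(pod) == {expected_image}:
                pending.discard(pod)
        if not pending:
            return
        if time.time() > deadline:
            # Equivalent to the "Images ... weren't upgraded in: ..." failure
            raise TimeoutError(
                f"Images weren't upgraded in: {', '.join(sorted(pending))}"
            )
        time.sleep(interval)
```

With this shape, a pod that flips to the new image even one poll after the deadline still fails the run, which matches what the must gather showed here.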

Cluster utilization during the upgrade is:
00:17:25 - Thread-2 - ocs_ci.ocs.cluster_load - INFO - Cluster utilization:
==============================================================================================================================================
Throughput: 177.76 MB/s || Latency: 198.57 ms || IOPS: 1403.60 (30.80% of the 4557.15 limit) || Used Space: 141.14 GB || Number of FIO pods: 8
==============================================================================================================================================

Version-Release number of selected component (if applicable):
OCS: 4.5 live upgrade to 4.6.0-149.ci 
OCP: 4.6.0-0.nightly-2020-11-03-172112

How reproducible:


Steps to Reproduce:
1. Install OCS 4.5
2. Run some IO
3. Upgrade to 4.6

Actual results:
Mon pods aren't upgraded within the 720-second timeout


Expected results:
Mon pods upgraded in less than 720 seconds


Additional info:
Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j004vu1cs33-uan/j004vu1cs33-uan_20201104T222906/logs/failed_testcase_ocs_logs_1604533021/test_upgrade_ocs_logs/

Execution:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/14296

Comment 2 Yaniv Kaul 2020-11-09 09:32:28 UTC
Anything in the logs?

Comment 3 Petr Balogh 2020-11-09 10:24:39 UTC
Around the time we start checking that the MON pods are upgraded, I see:

2020-11-05T00:08:00.065405541Z 2020-11-05 00:08:00.065373 I | op-mon: deployment for mon rook-ceph-mon-a already exists. updating if needed
2020-11-05T00:08:00.083993260Z 2020-11-05 00:08:00.083950 I | op-k8sutil: updating deployment "rook-ceph-mon-a" after verifying it is safe to stop
2020-11-05T00:08:00.083993260Z 2020-11-05 00:08:00.083971 I | op-mon: checking if we can stop the deployment rook-ceph-mon-a
2020-11-05T00:08:00.895006062Z 2020-11-05 00:08:00.894967 I | op-mon: parsing mon endpoints: b=172.30.225.185:6789,c=172.30.55.186:6789,a=172.30.70.159:6789
2020-11-05T00:08:01.324313361Z 2020-11-05 00:08:01.324271 E | ceph-block-pool-controller: failed to reconcile failed to fetch ceph version from cephcluster "ocs-storagecluster-cephcluster": attempt to determine ceph version for the current cluster image timed out
2020-11-05T00:08:03.893037995Z 2020-11-05 00:08:03.892985 I | op-mon: parsing mon endpoints: b=172.30.225.185:6789,c=172.30.55.186:6789,a=172.30.70.159:6789
2020-11-05T00:08:04.229074275Z 2020-11-05 00:08:04.229032 E | ceph-block-pool-controller: failed to reconcile failed to fetch ceph version from cephcluster "ocs-storagecluster-cephcluster": attempt to determine ceph version for the current cluster image timed out
2020-11-05T00:08:06.576691046Z 2020-11-05 00:08:06.576598 E | ceph-crashcollector-controller: ceph version not found for image "quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9" used by cluster "ocs-storagecluster-cephcluster". attempt to determine ceph version for the current cluster image timed out
2020-11-05T00:08:06.655759168Z 2020-11-05 00:08:06.655717 E | ceph-crashcollector-controller: ceph version not found for image "quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9" used by cluster "ocs-storagecluster-cephcluster". attempt to determine ceph version for the current cluster image timed out
2020-11-05T00:08:09.354702305Z 2020-11-05 00:08:09.354450 I | op-mon: parsing mon endpoints: b=172.30.225.185:6789,c=172.30.55.186:6789,a=172.30.70.159:6789
2020-11-05T00:08:11.961807373Z 2020-11-05 00:08:11.961747 E | ceph-crashcollector-controller: ceph version not found for image "quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9" used by cluster "ocs-storagecluster-cephcluster". attempt to determine ceph version for the current cluster image timed out
2020-11-05T00:08:12.066915154Z 2020-11-05 00:08:12.066861 E | ceph-crashcollector-controller: ceph version not found for image "quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9" used by cluster "ocs-storagecluster-cephcluster". attempt to determine ceph version for the current cluster image timed out
2020-11-05T00:08:13.152956483Z 2020-11-05 00:08:13.152910 I | op-k8sutil: finished waiting for updated deployment "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a"
2020-11-05T00:08:13.152956483Z 2020-11-05 00:08:13.152928 I | op-mon: checking if we can continue the deployment rook-ceph-mds-ocs-storagecluster-cephfilesystem-a
2020-11-05T00:08:15.687403102Z 2020-11-05 00:08:15.687323 E | ceph-block-pool-controller: failed to reconcile failed to fetch ceph version from cephcluster "ocs-storagecluster-cephcluster": attempt to determine ceph version for the current cluster image timed out

I found this in the rook operator logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j004vu1cs33-uan/j004vu1cs33-uan_20201104T222906/logs/failed_testcase_ocs_logs_1604533021/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-67f0978d8c422cc88268cbfc24223a71e8cc5ef9d6c152c6cb912ae0b50da8da/namespaces/openshift-storage/pods/rook-ceph-operator-659499495f-vwkss/rook-ceph-operator/rook-ceph-operator/logs/current.log


And from our job, at 00:20:54 I see the pod still hadn't gotten the new image.
Travis, can you please take a look?

Removing the needinfo from Raz as I provided the QE input above; we need someone from engineering to take a look and get a better understanding of what delays the MON pod upgrade beyond a reasonable time.

Comment 5 Petr Balogh 2020-11-09 14:14:47 UTC
Reproduced here:

dir/auth/kubeconfig get Pod rook-ceph-mon-a-58fc46c8-6ncqb -n openshift-storage -o yaml
13:26:55 - MainThread - ocs_ci.ocs.resources.pod - WARNING - Images: {'quay.io/rhceph-dev/rhceph@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8'} weren't upgraded in: rook-ceph-mon-a-58fc46c8-6ncqb!
13:26:55 - MainThread - ocs_ci.ocs.utils - INFO - Must gather image: quay.io/rhceph-dev/ocs-must-gather:latest-4.6 will be used.
13:26:55 - MainThread - ocs_ci.ocs.utils - INFO - OCS logs will be placed in location /home/jenkins/current-cluster-dir/logs/failed_testcase_ocs_logs_1604923078/test_upgrade_ocs_logs/ocs_must_gather
13:26:55 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.6 --dest-dir=/home/jenkins/current-cluster-dir/logs/failed_testcase_ocs_logs_1604923078/test_upgrade_ocs_logs/ocs_must_gather
13:27:25 - Thread-1 - ocs_ci.ocs.cluster_load - INFO - Cluster utilization:

https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/14442/consoleFull

This time it was reproduced on AWS, so we have 2 reproductions in the last two builds. BTW, this execution was with an upgrade to the RC1 build.

Comment 7 Petr Balogh 2020-11-09 14:41:05 UTC
12 days back:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/13940/consoleFull

We didn't hit this issue in that execution, which used build:
ocs-operator.v4.6.0-147.ci

Comment 8 Travis Nielsen 2020-11-09 14:46:35 UTC
The osd prepare job likely cannot find the ceph image. This is the only indication in the log:

ceph version not found for image "quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9"

Can you get an `oc describe pod` on the ceph detect job? It should show more details about why the job is not completing, but I suspect it's just not finding the image.

Comment 9 Travis Nielsen 2020-11-09 15:20:45 UTC
A few more clues:
- After the operator is restarted, everything works as expected.
- This issue has been seen at least a couple times during the OCS upgrade tests for 4.6. 
- This issue was not seen in previous releases, so there appears to be a regression.
- In 4.6, the Rook controller for the cluster CR had a significant rewrite of the reconcile onto the controller runtime, which could be related to a regression.

Seb and I suspect an intermittent in-memory issue where the cephcluster version is not updated as expected after the detect-version job. That version is used by multiple controllers to know whether they should reconcile, and what version of the cephcluster is being reconciled. If it only happens occasionally, this could explain why we haven't seen it upstream since v1.4.

Comment 10 Santosh Pillai 2020-11-09 16:24:00 UTC
Based on what Travis mentioned in the above comment and the operator logs, I suspect the following could be happening:

1. In both operator logs given above, the upgrade was detected while a reconcile was already in progress:

2020-11-09T13:12:57.164427668Z 2020-11-09 13:12:57.164404 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2020-11-09T13:12:57.164523996Z 2020-11-09 13:12:57.164498 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2020-11-09T13:12:57.509119960Z 2020-11-09 13:12:57.509083 I | ceph-cluster-controller: cluster "openshift-storage": version "14.2.8-91 nautilus" detected for image "quay.io/rhceph-dev/rhceph@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8"
2020-11-09T13:12:57.583620056Z 2020-11-09 13:12:57.583570 I | op-k8sutil: CSI_CEPHFS_PLUGIN_RESOURCE="" (default)
2020-11-09T13:12:57.857705702Z 2020-11-09 13:12:57.857660 I | op-mon: start running mons
2020-11-09T13:12:57.966652084Z 2020-11-09 13:12:57.966596 I | ceph-spec: CR has changed for "ocs-storagecluster-cephcluster". diff=  v1.ClusterSpec{
2020-11-09T13:12:57.966652084Z   	CephVersion: v1.CephVersionSpec{
2020-11-09T13:12:57.966652084Z   		Image: strings.Join({
2020-11-09T13:12:57.966652084Z   			"quay.io/rhceph-dev/rhceph@sha256:",
2020-11-09T13:12:57.966652084Z - 			"eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8",
2020-11-09T13:12:57.966652084Z + 			"22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9",
2020-11-09T13:12:57.966652084Z   		}, ""),
2020-11-09T13:12:57.966652084Z   		AllowUnsupported: false,



2. Since the upgrade was detected during an active reconcile, the ceph version in the spec and status could have been different for that reconcile.
   - We update the cephVersion in the Status in this method: https://github.com/rook/rook/blob/16d3895fbcd78d949ff1ef6d1e9d47cfb6d65c39/pkg/operator/ceph/cluster/cephstatus.go#L211

3. Because of the different ceph versions in the spec and status, the `GetImageVersion` method would always fail for the current reconcile. See https://github.com/rook/rook/blob/88434ceb5167ba9172d3c354f64cb513abee93ac/pkg/operator/ceph/controller/version.go#L43

4. When the current reconcile completes, the next reconcile (which was triggered by the ceph cluster upgrade) starts: 
Log Snippets:

2020-11-09 13:49:53.670013 I | ceph-cluster-controller: done reconciling ceph cluster in namespace "openshift-storage"
2020-11-09 13:49:53.714311 I | ceph-cluster-controller: reconciling ceph cluster in namespace "openshift-storage"
2020-11-09 13:49:53.722273 I | op-mon: parsing mon endpoints: a=172.30.20.250:6789,b=172.30.105.63:6789,c=172.30.142.207:6789
2020-11-09 13:49:53.732153 I | ceph-cluster-controller: detecting the ceph image version for image quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9...
2020-11-09 13:49:57.272222 I | ceph-cluster-controller: detected ceph image version: "14.2.8-111 nautilus"
2020-11-09 13:49:57.272240 I | ceph-cluster-controller: validating ceph version from provided image
2020-11-09 13:49:57.289156 I | op-mon: parsing mon endpoints: a=172.30.20.250:6789,b=172.30.105.63:6789,c=172.30.142.207:6789
2020-11-09 13:49:57.289334 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2020-11-09 13:49:57.289391 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2020-11-09 13:49:57.619224 W | ceph-cluster-controller: it looks like we have more than one ceph version running. triggering upgrade. map[ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable):2 ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable):10]:
2020-11-09 13:49:57.968199 I | ceph-cluster-controller: upgrading ceph cluster to %q14.2.8-111 nautilus
2020-11-09 13:49:57.968223 I | ceph-cluster-controller: cluster "openshift-storage": version "14.2.8-111 nautilus" detected for image "quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9"
2020-11-09 13:49:58.037891 I | op-mon: start running mons
2020-11-09 13:49:58.045323 I | op-mon: parsing mon endpoints: a=172.30.20.250:6789,b=172.30.105.63:6789,c=172.30.142.207:6789

5. This time the ceph versions in the spec and the status are the same, so the upgrade works fine.
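The failure mode in steps 2-3 can be illustrated with a simplified model (plain Python standing in for Rook's Go code; the names are illustrative, not Rook's actual API):

```python
def get_image_version(spec_image, detected_versions):
    """Simplified stand-in for Rook's GetImageVersion.

    The lookup keys off the image named in the spec, but version detection
    ran against the image that was current when the reconcile started; if
    the spec changed mid-reconcile, the lookup keeps failing.
    """
    if spec_image not in detected_versions:
        raise RuntimeError(
            "attempt to determine ceph version for the current cluster "
            f"image timed out ({spec_image})"
        )
    return detected_versions[spec_image]

# Reconcile 1 detected the version of the *old* image only:
detected = {"rhceph@sha256:eafd1ac...": "14.2.8-91 nautilus"}
# After the CR was updated mid-reconcile to the *new* image, every lookup
# in the current reconcile fails until reconcile 2 re-detects and adds:
detected["rhceph@sha256:22ea8ee..."] = "14.2.8-111 nautilus"
```

Once the second reconcile re-runs detection against the new spec image (step 4 above), spec and status agree and the upgrade proceeds, matching the logs.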

Comment 11 Petr Balogh 2020-11-09 18:08:09 UTC
We had some discussion with Travis; I saw a similar issue 28 days back in this execution:

https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-upgrade-ocs-auto-nightly/1/artifact/deploy-ocs-cluster-build.log/*view*/

In that execution I found that we started checking the mon upgrade at 17:37:24, and all mon pods were upgraded by 17:46:10. This is from 27 days back: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-upgrade-ocs-auto-nightly/1/
The upgrade job failed on another issue, but not on the mon pods in that case.

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j001vu1cs33-uan/j001vu1cs33-uan_20201013T144335/logs/failed_testcase_ocs_logs_1602604209/test_upgrade_ocs_logs/
Here is the must gather from that run, in case it helps to see how things looked about a month ago.

Error was:
start_time = 1602611930.6920912, timeout = 720
err_msg = 'Waiting for 1 pods with selector: app=rook-ceph-rgw to be running and upgraded.

Comment 12 Travis Nielsen 2020-11-09 19:04:59 UTC
Santosh, thanks for the analysis; that's also what Seb and I found when analyzing another log. Ignore my previous comment about the version job not completing - the error messages were misleading.

The fundamental issue here is that upgrading OCS is a two-stage process:
1. Upgrade the operator --> kicks off a reconcile and restarts daemons if needed
2. Upgrade the ceph version --> kicks off another reconcile and restarts ceph daemons again

The real issue is that the reconcile started in step 1 takes a long time (37 minutes in this case). In 4.6 there is an update to the OSD pod spec, so step 1 restarts all the daemons, which takes a long time because it waits for PGs to be clean after each daemon restart. What we really need is to abort a reconcile when an updated cluster CR is committed.
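The queueing behavior behind this can be sketched as a toy model of one-reconcile-at-a-time semantics (illustrative Python, not Rook's actual controller-runtime code):

```python
class ToyController:
    """One reconcile runs at a time; CR updates that land mid-reconcile
    collapse into a single queued follow-up reconcile."""

    def __init__(self):
        self.log = []
        self.running = False
        self.queued = None

    def update_cr(self, image):
        if self.running:
            self.queued = image  # must wait for the current reconcile
        else:
            self.reconcile(image)

    def reconcile(self, image, during=None):
        self.running = True
        self.log.append(f"start {image}")
        if during:
            during()  # simulate the CR changing mid-reconcile
        # ... long daemon restarts happen here (37 minutes in this case) ...
        self.log.append(f"done {image}")
        self.running = False
        if self.queued is not None:
            nxt, self.queued = self.queued, None
            self.reconcile(nxt)

c = ToyController()
# Step 1 (operator upgrade) starts reconcile #1; the ceph image bump
# (step 2) arrives almost immediately but can only run afterwards:
c.reconcile("image:old", during=lambda: c.update_cr("image:new"))
```

The second reconcile (the actual ceph upgrade) cannot begin until the first has finished restarting every daemon, which is exactly the 37-minute gap visible in the timeline below.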

The key events are:

# The first reconcile begins and the ceph version is detected:
2020-11-09T13:12:57.509119960Z 2020-11-09 13:12:57.509083 I | ceph-cluster-controller: cluster "openshift-storage": version "14.2.8-91 nautilus" detected for image "quay.io/rhceph-dev/rhceph@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8"

# The cluster CR is updated, which will start the second reconcile as soon as the first reconcile is done. Notice how this happens immediately after the first reconcile started, so we have to wait a long time.
2020-11-09T13:12:57.966652084Z 2020-11-09 13:12:57.966596 I | ceph-spec: CR has changed for "ocs-storagecluster-cephcluster". diff=  v1.ClusterSpec{

# The second reconcile finally starts 37 minutes later
2020-11-09 13:49:57.272222 I | ceph-cluster-controller: detected ceph image version: "14.2.8-111 nautilus"

Comment 13 Travis Nielsen 2020-11-09 19:35:20 UTC
Upstream issue opened to track the suggested improvement.
https://github.com/rook/rook/issues/6587

Since this is not a regression I would recommend we not consider blocking for 4.6, but instead get the fix into 4.7 and consider a backport to 4.6.z depending on the risk of the implementation.

Comment 14 Petr Balogh 2020-11-09 22:05:58 UTC
Travis, if this is not a regression, why didn't we see it when upgrading from 4.4 to 4.5, from 4.5.1 to 4.5.2, or in fact in any OCS 4.5.z upgrade?

Comment 15 Travis Nielsen 2020-11-09 22:09:05 UTC
This issue only manifests during some upgrades. The key is step 1: if there is a change to the OSD pod spec in a Rook release, Rook restarts the OSDs in both steps 1 and 2. If there is no update to the pod spec other than the ceph image, the OSD restart only happens in step 2. In patch releases and some minor releases we haven't had OSD pod spec updates that would impact step 1, which is why we hadn't hit this until now.
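Travis's rule of thumb reduces to a tiny predicate (an illustrative sketch, not Rook's code):

```python
def osd_restart_count(pod_spec_changed, ceph_image_changed):
    """How many times each OSD restarts during the two-stage upgrade.

    Step 1 (operator upgrade) restarts an OSD only if the generated pod
    spec changed (new args, mounts, labels, ...); step 2 (ceph image
    bump) always restarts it.
    """
    step1 = 1 if pod_spec_changed else 0
    step2 = 1 if ceph_image_changed else 0
    return step1 + step2

# Patch release, image bump only: one restart per OSD.
# 4.5 -> 4.6 with the OSD pod spec rewrite: two restarts per OSD.
```

With clean-PG waits between restarts, doubling the restart count roughly doubles the upgrade wall-clock time, which is why 4.5 to 4.6 hit the timeout while 4.5.z upgrades did not.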

Comment 16 Petr Balogh 2020-11-09 22:20:20 UTC
I see, thanks for the explanation. I guess it will be discussed and decided on tomorrow's program call.

Comment 22 Petr Balogh 2020-11-12 15:22:56 UTC
Just to update here, in this execution on vSphere:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/14587/consoleFull


This time I see that the upgrade of the mon pods went much faster, but it took much longer for the OSD pods.

We started checking whether the OSD pods got upgraded at:
10:39:17 - MainThread - ocs_ci.ocs.resources.pod - WARNING - Images: {'registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8'} weren't upgraded in: rook-ceph-osd-0-8545d754f-7pncl!

And it finished this time at:
11:23:28 - MainThread - ocs_ci.ocs.ocp - INFO - All the images: {'container': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'osd': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'blkdevmapper': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'activate': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'expand-bluefs': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'chown-container-data-dir': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9'} were successfully upgraded in: rook-ceph-osd-5-5bb5f9d6f6-cp5q9!

So it took about 45 mins to upgrade the 6 OSD pods.

Must gather collected at the end of upgrade run is here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j042vu1cs33-ua/j042vu1cs33-ua_20201112T073540/logs/testcases_1605170699/



AWS Run:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/14586/consoleFull

We started checking MON pods:
09:51:16 - MainThread - ocs_ci.ocs.resources.pod - INFO - Waiting for 3 pods with selector: app=rook-ceph-mon to be running and upgraded.

and finished:
09:53:02 - MainThread - ocs_ci.ocs.ocp - INFO - All the images: {'container': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'mon': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'chown-container-data-dir': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'init-mon-fs': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9'} were successfully upgraded in: rook-ceph-mon-c-5bfff8f476-ddldd!

OSDs started:
09:53:19 - MainThread - ocs_ci.ocs.resources.pod - INFO - Waiting for 6 pods with selector: app=rook-ceph-osd to be running and upgraded.

And finished
10:15:12 - MainThread - ocs_ci.ocs.ocp - INFO - All the images: {'container': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'osd': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'blkdevmapper': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'activate': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'expand-bluefs': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'chown-container-data-dir': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9'} were successfully upgraded in: rook-ceph-osd-5-75f65d6c9c-nk9tm!

~ 23 Mins

Must gather from this upgrade execution:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j067ai3c33-ua/j067ai3c33-ua_20201112T073330/logs/testcases_1605169828/


Both upgrades passed, and we didn't see this issue in either run, which is weird.

Tier1 runs are in progress; I can update on their status once they finish.

Comment 23 Travis Nielsen 2020-11-16 16:58:25 UTC
Moving to 4.6.z as we don't see this as a blocker, but we really need to avoid the double upgrade, and the fix should be relatively contained for the z-stream.

Comment 24 Michael Adam 2020-11-24 15:55:20 UTC
Since there is a patch now, and there is a chance it might be ready in time, we might reconsider taking this BZ in 4.6.0.

Comment 25 Sébastien Han 2020-11-25 15:46:46 UTC
Downstream PR: https://github.com/openshift/rook/pull/149

Comment 28 Mudit Agarwal 2020-11-26 08:59:47 UTC
Moved back to 4.6, please provide qa_ack

Comment 31 Petr Balogh 2020-11-26 14:40:53 UTC
Running verification job here:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/15165/console

PR removing workaround from ocs-ci upgrade test:
https://github.com/red-hat-storage/ocs-ci/pull/3423

Comment 32 Petr Balogh 2020-11-26 14:40:56 UTC
Can this be moved to on_qe if it's part of RC4 build? Thanks

Comment 33 Petr Balogh 2020-11-26 18:50:05 UTC
In the verification job:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/15165/consoleFull

I see we started checking whether the mon pods got upgraded at 15:53:44:
15:53:44 - MainThread - ocs_ci.ocs.ocp - INFO - All the images: {'container': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'mon': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'chown-container-data-dir': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'init-mon-fs': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9'} were successfully upgraded in: rook-ceph-mon-a-89997f446-szmmg!

And here the other mons already have the new images:
15:54:09 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-b-57f95787ff-qxkx9 -n openshift-storage -o yaml
15:54:10 - MainThread - ocs_ci.ocs.ocp - INFO - All the images: {'container': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'mon': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'chown-container-data-dir': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'init-mon-fs': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9'} were successfully upgraded in: rook-ceph-mon-b-57f95787ff-qxkx9!
15:54:10 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-c-74ccd6f578-mm5vq -n openshift-storage -o yaml
15:54:10 - MainThread - ocs_ci.ocs.ocp - INFO - All the images: {'container': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'mon': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'chown-container-data-dir': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'init-mon-fs': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9'} were successfully upgraded in: rook-ceph-mon-c-74ccd6f578-mm5vq!

And of course we also checked the OSDs, starting at 15:54:33:
15:54:33 - MainThread - ocs_ci.ocs.resources.pod - INFO - Waiting for 6 pods with selector: app=rook-ceph-osd to be running and upgraded.
15:54:33 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod  -n openshift-storage -o yaml
15:54:38 - MainThread - ocs_ci.ocs.resources.pod - INFO - Found 6 pod(s) for selector: app=rook-ceph-osd
15:54:38 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-osd-0-578d75897b-jcl6r -n openshift-storage -o yaml
15:54:39 - MainThread - ocs_ci.ocs.resources.pod - WARNING - Images: {'registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8'} weren't upgraded in: rook-ceph-osd-0-578d75897b-jcl6r!

All 6 OSDs finished upgrading at 16:18:47:
16:18:47 - MainThread - ocs_ci.ocs.ocp - INFO - All the images: {'container': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'osd': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'blkdevmapper': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'activate': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'expand-bluefs': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9', 'chown-container-data-dir': 'quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9'} were successfully upgraded in: rook-ceph-osd-5-68ddb95dc7-h6jf8!

So based on this execution I can mark it as verified on build ocs-operator.v4.6.0-178.ci.

Comment 36 errata-xmlrpc 2020-12-17 06:25:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605