Description of problem (please be as detailed as possible and provide log snippets):

https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9707/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/

ocs_ci.ocs.exceptions.TimeoutException: Timeout 720 reached! Error: {err_msg}

This means that even after the 720 second timeout we still have the old image in the mon pods:

18:08:51 - MainThread - ocs_ci.ocs.resources.pod - WARNING - Images: {'registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:20cf789235e23ddaf38e109b391d1496bb88011239d16862c4c106d0e05fea9e'} weren't upgraded in: rook-ceph-mon-a-67d4c66748-x9ftg!

When I checked a fresh installation of OCS 4.5, the image which is supposed to be there is:
quay.io/rhceph-dev/rhceph@sha256:45e6f28bb67c81b826acb64fad5c0da1cac3dffb41a88992fe4ca2be79575fa6

I see that later on the mon pods got restarted and upgraded. I can see this from the must gather taken after a later test run:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200710T163031/logs/failed_testcase_ocs_logs_1594401865/test_upgrade_mcg_io_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-ab442c033fdeb4fdfe33af77ec81eda1eb4f7c41b623a7265f645f76df700353/oc_output/get_pods_-owide_-n_openshift-storage

I am curious whether there was some change in the order of upgrading the pods, or why we see this behavior.

Version of all relevant components (if applicable):
OCS 4.4.1 installed from live content - upgraded to internal build quay.io/rhceph-dev/ocs-olm-operator:4.5.0-485.ci
OCP version: 4.4.0-0.nightly-2020-07-10-113925

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?
Trying now: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto/27/console

Can this issue be reproduced from the UI? Haven't tried.

If this is a regression, please provide more details to justify this:
Yes, I don't remember hitting such a timeout for mon pods before.

Steps to Reproduce:
1. Install 4.4.1
2. Add capacity to the cluster
3. Run some IO in the background
4. Upgrade to a 4.5 internal build

Actual results:
Mon pods are not upgraded after the 720 second timeout.

Expected results:
Mon pods are upgraded.

Additional info:
Must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200710T163031/logs/failed_testcase_ocs_logs_1594401865/test_upgrade_ocs_logs/ocs_must_gather/
The order in which we check the components is defined here: https://github.com/red-hat-storage/ocs-ci/blob/master/ocs_ci/ocs/ocs_upgrade.py#L84-L122
The Rook operator log [1] shows a roughly 15 minute delay between the start of the Rook operator and when the upgrade of the Rook daemons begins. The sequence is:

1. (17:55:21) The Rook operator is updated and restarts.
2. The operator runs a complete reconcile, which takes about 15 minutes because the PGs were not clean.
3. (18:09:15) The CephCluster CR update event is triggered in the operator, which finally proceeds with the update to all the mons, osds, and other daemons as expected.

2020-07-10T17:55:21.506205148Z 2020-07-10 17:55:21.506083 I | rookcmd: starting Rook 4.5-32.e1c766a1.release_4.5 with arguments '/usr/local/bin/rook ceph operator'
...
2020-07-10T18:03:19.013126371Z 2020-07-10 18:03:19.013086 I | clusterdisruption-controller: pg health: "cluster is not fully clean. PGs: ...
...
2020-07-10T18:09:15.701716992Z 2020-07-10 18:09:15.701709 I | op-cluster: Done creating rook instance in namespace openshift-storage
...
2020-07-10T18:09:15.753901877Z 2020-07-10 18:09:15.753861 I | op-cluster: The Cluster CR has changed. diff= v1.ClusterSpec{

@Petr, was the health of the cluster different in this run from what typically happens in the test?

[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200710T163031/logs/failed_testcase_ocs_logs_1594401865/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-ab442c033fdeb4fdfe33af77ec81eda1eb4f7c41b623a7265f645f76df700353/ceph/namespaces/openshift-storage/pods/rook-ceph-operator-6f85fdf444-x7blh/rook-ceph-operator/rook-ceph-operator/logs/current.log
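The length of the delay can be read directly off the two operator log timestamps quoted above; a quick check of the arithmetic:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the rook-ceph-operator log quoted above.
operator_start = datetime.strptime("2020-07-10 17:55:21.506083", FMT)
daemons_update = datetime.strptime("2020-07-10 18:09:15.701709", FMT)

# Time between the operator restart and the CephCluster update event
# that finally kicks off the daemon upgrades: just under 14 minutes.
delay = daemons_update - operator_start
```

So almost the entire 720 second test timeout was consumed by the reconcile waiting on unclean PGs, before the mon deployments were ever touched.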
That's weird, Travis. I see that the Ceph health was OK before we started to run the upgrade test: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9707/consoleFull

17:53:31 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.
17:53:31 - MainThread - tests.conftest - INFO - Ceph health check passed at teardown
tests/ecosystem/upgrade/test_upgrade.py::test_upgrade
-------------------------------- live log call ---------------------------------

From the debug log of our test:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200710T163031/logs/ocs-ci-logs-1594401865/by_outcome/failed/tests/ecosystem/upgrade/test_upgrade.py/test_upgrade/logs
I see it's still HEALTH_OK even just before the upgrade itself:

2020-07-10 17:54:34,472 - MainThread - INFO - ocs_ci.ocs.ocs_upgrade.set_upgrade_images.383 - Image: quay.io/rhceph-dev/ocs-olm-operator:4.5.0-485.ci will be used for upgrade.
2020-07-10 17:54:34,479 - MainThread - INFO - ocs_ci.utility.templating.dump_data_to_temp_yaml.210 -
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: '2020-07-10T17:54:07Z'
  generation: 1
  labels:
    ocs-operator-internal: 'true'
  name: ocs-catalogsource
  namespace: openshift-marketplace
  resourceVersion: '50018'
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/openshift-marketplace/catalogsources/ocs-catalogsource
  uid: c9a2c8bf-e9bb-4c1a-b23b-e995df9a7473
spec:
  displayName: Openshift Container Storage
  icon:
    base64data: ''
    mediatype: ''
  image: quay.io/rhceph-dev/ocs-olm-operator:4.5.0-485.ci
  publisher: Red Hat
  sourceType: grpc
status:
  connectionState:
    address: ocs-catalogsource.openshift-marketplace.svc:50051
    lastConnect: '2020-07-10T17:54:31Z'
    lastObservedState: READY
  registryService:
    createdAt: '2020-07-10T17:54:07Z'
    port: '50051'
    protocol: grpc
    serviceName: ocs-catalogsource
    serviceNamespace: openshift-marketplace
2020-07-10 17:54:34,479 - MainThread - INFO - ocs_ci.utility.utils.exec_cmd.437 - Executing command: oc -n openshift-marketplace --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig apply -f /tmp/tmp9aeaacda
2020-07-10 17:54:35,054 - MainThread - DEBUG - ocs_ci.utility.utils.exec_cmd.450 - Command stdout: catalogsource.operators.coreos.com/ocs-catalogsource configured
2020-07-10 17:54:35,054 - MainThread - WARNING - ocs_ci.utility.utils.exec_cmd.456 - Command stderr: Warning: oc apply should be used on resource created by either oc create --save-config or oc apply
2020-07-10 17:54:35,054 - MainThread - DEBUG - ocs_ci.utility.utils.exec_cmd.459 - Command return code: 0
2020-07-10 17:54:35,055 - MainThread - INFO - ocs_ci.utility.utils.exec_cmd.437 - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get subscription ocs-subscription -n openshift-storage -o yaml
2020-07-10 17:54:35,211 - Thread-2 - DEBUG - ocs_ci.utility.utils.exec_cmd.450 - Command stdout: HEALTH_OK
2020-07-10 17:54:35,211 - Thread-2 - DEBUG - ocs_ci.utility.utils.exec_cmd.458 - Command stderr is empty
2020-07-10 17:54:35,212 - Thread-2 - DEBUG - ocs_ci.utility.utils.exec_cmd.459 - Command return code: 0
2020-07-10 17:54:35,296 - MainThread - DEBUG - ocs_ci.utility.utils.exec_cmd.450 - Command stdout: apiVersion: operators.coreos.com/v1alpha1

The CSV is now upgrading to 4.5:

2020-07-10 17:54:37,431 - MainThread - WARNING - ocs_ci.ocs.ocp.get.215 - Number of attempts to get resource reached!
2020-07-10 17:54:37,431 - MainThread - INFO - ocs_ci.ocs.ocp.check_phase.775 - Cannot find resource object ocs-operator.v4.5.0-485.ci
2020-07-10 17:54:37,431 - MainThread - INFO - ocs_ci.utility.utils.__iter__.774 - Going to sleep for 5 seconds before next iteration
2020-07-10 17:54:40,217 - Thread-2 - INFO - ocs_ci.utility.utils.exec_cmd.437 - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig rsh rook-ceph-tools-6c67d65646-dxwzb ceph health detail
2020-07-10 17:54:41,019 - Thread-2 - DEBUG - ocs_ci.utility.utils.exec_cmd.450 - Command stdout: HEALTH_OK

I still see health OK at this time.

2020-07-10 17:54:43,028 - MainThread - INFO - ocs_ci.ocs.ocp.check_phase.780 - Resource ocs-operator.v4.5.0-485.ci is in phase: Pending!
2020-07-10 17:54:43,029 - MainThread - INFO - ocs_ci.utility.utils.__iter__.774 - Going to sleep for 5 seconds before next iteration
2020-07-10 17:54:46,025 - Thread-2 - INFO - ocs_ci.utility.utils.exec_cmd.437 - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig rsh rook-ceph-tools-6c67d65646-dxwzb ceph health detail
2020-07-10 17:54:46,946 - Thread-2 - DEBUG - ocs_ci.utility.utils.exec_cmd.450 - Command stdout: HEALTH_OK

Here the CSV gets to the Succeeded state for a moment:

2020-07-10 17:55:46,442 - MainThread - INFO - ocs_ci.ocs.ocp.check_phase.780 - Resource ocs-operator.v4.5.0-485.ci is in phase: Succeeded!

Even here I still see:

2020-07-10 17:55:51,068 - Thread-2 - DEBUG - ocs_ci.utility.utils.exec_cmd.450 - Command stdout: HEALTH_OK

So the health of the cluster seems OK, Travis, even when the upgrade itself started. My understanding is that if the PGs became unclean, it was caused by the upgrade itself, as IMO we don't run anything in the background during this upgrade which could lead to this.
Running once more here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9782/consoleFull Thanks
@Petr
- Can I connect to the cluster when it is in this state?
- Do the PGs become clean after the upgrade is completed later? I don't see the end of the upgrade in the operator log, so I can't tell whether the PGs settled.
Travis, the cluster has already been deleted :( . The second job I triggered got stuck on some other issue (ocs-ci) and I needed to abort it. I re-triggered a new job but it's in the queue now. I will see whether the issue reproduces, and if so I can bring up another cluster for the upgrade tomorrow. When you are online you can ping me on chat and I will bring up the cluster for you. I will update tomorrow how the reproduction of the BZ went.
Hey Travis,

the same was reproduced here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9803/
OCS must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200714T040124/logs/failed_testcase_ocs_logs_1594702730/test_upgrade_ocs_logs/ocs_must_gather

To your question: I paused the env before the teardown, and after some time the cluster became clean and upgraded:

$ oc rsh -n openshift-storage rook-ceph-tools-6c67d65646-6dbvp ceph health detail
HEALTH_OK

$ oc get pod -n openshift-storage
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-6zrzj 3/3 Running 0 4h39m
csi-cephfsplugin-ckfbj 3/3 Running 0 4h40m
csi-cephfsplugin-lpxrh 3/3 Running 0 4h40m
csi-cephfsplugin-provisioner-785f964c-fl878 5/5 Running 0 4h40m
csi-cephfsplugin-provisioner-785f964c-pc5x5 5/5 Running 0 4h40m
csi-rbdplugin-52hv9 3/3 Running 0 4h40m
csi-rbdplugin-fwh2h 3/3 Running 0 4h40m
csi-rbdplugin-hhng8 3/3 Running 0 4h40m
csi-rbdplugin-provisioner-6597886578-jsk62 5/5 Running 0 4h40m
csi-rbdplugin-provisioner-6597886578-x2dbk 5/5 Running 0 4h40m
lib-bucket-provisioner-79f6dd6b99-g5tzr 1/1 Running 0 5h28m
noobaa-core-0 1/1 Running 0 4h39m
noobaa-db-0 1/1 Running 0 4h40m
noobaa-endpoint-b88d46f6d-q25mh 1/1 Running 0 4h40m
noobaa-operator-54649f5748-f8g7k 1/1 Running 0 4h40m
ocs-operator-b5ff748b-cssvj 1/1 Running 0 4h40m
rook-ceph-crashcollector-ip-10-0-149-42-54cfb7cdc5-qrx97 1/1 Running 0 4h27m
rook-ceph-crashcollector-ip-10-0-185-4-7c8d8fb8c5-89v2r 1/1 Running 0 4h26m
rook-ceph-crashcollector-ip-10-0-203-37-86889c889b-bv8rp 1/1 Running 0 4h24m
rook-ceph-drain-canary-0aa62f8c4250f3f076a9296dc1f9056c-5867j2k 1/1 Running 0 4h37m
rook-ceph-drain-canary-7b7faf30366f94a3a45e61490ff29bea-7bm8psg 1/1 Running 0 5h24m
rook-ceph-drain-canary-a3cfb071ba558635c1ab8d9a14129f9c-f8mdr76 1/1 Running 0 5h24m
rook-ceph-drain-canary-b0ab017f326b4d410305ae5028a48fe3-672894z 1/1 Running 0 4h31m
rook-ceph-drain-canary-ip-10-0-185-4.us-east-2.compute.intrjhmb 1/1 Running 0 4h34m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7c76dd87n4v45 1/1 Running 0 4h40m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-554f7b5b92b9m 1/1 Running 0 4h39m
rook-ceph-mgr-a-7868c78495-mlvkc 1/1 Running 0 4h24m
rook-ceph-mon-a-b9b7f9fc7-92dgj 1/1 Running 0 4h27m
rook-ceph-mon-b-54f84749dc-wj5mt 1/1 Running 0 4h26m
rook-ceph-mon-c-6f974df68f-874fx 1/1 Running 0 4h24m
rook-ceph-operator-6f85fdf444-2w788 1/1 Running 0 4h40m
rook-ceph-osd-0-8c69fb45-qgzmf 1/1 Running 0 4h24m
rook-ceph-osd-1-cb95b4665-7tlcv 1/1 Running 0 4h18m
rook-ceph-osd-2-86bbddf59-nbcql 1/1 Running 0 4h21m
rook-ceph-osd-3-5474cc8dbd-xf96v 1/1 Running 0 4h22m
rook-ceph-osd-4-666dc4758c-ntphz 1/1 Running 0 4h19m
rook-ceph-osd-5-6d99f4956c-m69xp 1/1 Running 0 4h17m
rook-ceph-osd-prepare-ocs-deviceset-0-0-v97bg-shnhf 0/1 Completed 0 5h25m
rook-ceph-osd-prepare-ocs-deviceset-0-1-2px2d-qb44p 0/1 Completed 0 4h47m
rook-ceph-osd-prepare-ocs-deviceset-1-0-4845h-cd5wj 0/1 Completed 0 5h25m
rook-ceph-osd-prepare-ocs-deviceset-1-1-nnsn9-x79sf 0/1 Completed 0 4h47m
rook-ceph-osd-prepare-ocs-deviceset-2-0-9kb56-xj8xn 0/1 Completed 0 5h25m
rook-ceph-osd-prepare-ocs-deviceset-2-1-4mghg-7v7n7 0/1 Completed 0 4h47m
rook-ceph-tools-6c67d65646-6dbvp 1/1 Running 0 5h23m

$ oc get csv -n openshift-storage
NAME DISPLAY VERSION REPLACES PHASE
lib-bucket-provisioner.v2.0.0 lib-bucket-provisioner 2.0.0 lib-bucket-provisioner.v1.0.0 Succeeded
ocs-operator.v4.5.0-485.ci OpenShift Container Storage 4.5.0-485.ci ocs-operator.v4.4.1 Succeeded

Do you still need the cluster? If so, in what shape? If I schedule the upgrade, would you like to check the progress during the upgrade itself, or have access to the cluster after the upgrade is finished?

Thanks,
Petr
@Petr A connection to the cluster is needed to analyze why the PGs are unclean and to see if the OSDs are failing to start. Though from the logs and the pod status it doesn't appear the OSDs are failing to start, so it must be another issue.
@Josh, suggestions on troubleshooting these issues with PGs during upgrade with active IO? Does the test need to give more priority to the recovery so it completes faster, or what could be the issue?

Connecting to a cluster in progress, I see that the PGs are not healthy during or after the upgrade. After the upgrade, one OSD seems to stay down for some amount of time, although I didn't capture which OSD or exactly how long it was down. After that one OSD recovered, the OSDs are always seen to be healthy according to "oc get pod" and "ceph osd status".

Here is a full capture over time of the output of several commands such as "ceph status" and "ceph osd status", showing the PGs in various stages of recovery:
https://gist.github.com/travisn/4a5caf6fe69881f9be27d13128d1ded5

To extract some of the highlights... Here is the first status captured during or shortly after the upgrade. One OSD is down and recovery is actively in progress:

~/src/go/src/github.com/rook/rook$ oc get pod
NAME READY STATUS RESTARTS AGE
rook-ceph-osd-0-9d44cfd49-mc5nh 1/1 Running 0 4m19s
rook-ceph-osd-1-6597749b-wt4zf 1/1 Running 0 7m17s
rook-ceph-osd-2-85889798dd-h5fwh 1/1 Running 0 78s
rook-ceph-osd-3-7587f98d4-np2fg 1/1 Running 0 13m
rook-ceph-osd-4-b96474f9d-ns8jb 1/1 Running 0 2m55s
rook-ceph-osd-5-6f8c4cc7b7-8zq4n 1/1 Running 0 5m48s

sh-4.4# ceph -s
  cluster:
    id:     bdbfe8d3-887c-4aa3-8564-c631b3d0aafc
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            1 osds down
            1 host (1 osds) down
            Degraded data redundancy: 4201/57402 objects degraded (7.319%), 205 pgs degraded
            13 slow ops, oldest one blocked for 56 sec, daemons [osd.0,osd.1,osd.2] have slow ops.
  services:
    mon: 3 daemons, quorum a,b,c (age 8m)
    mgr: a(active, since 7m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 6 osds: 5 up (since 8s), 6 in (since 13m); 217 remapped pgs

  task status:
    scrub status:
      mds.ocs-storagecluster-cephfilesystem-a: idle
      mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   3 pools, 288 pgs
    objects: 19.13k objects, 74 GiB
    usage:   226 GiB used, 12 TiB / 12 TiB avail
    pgs:     30.903% pgs not active
             4201/57402 objects degraded (7.319%)
             21976/57402 objects misplaced (38.284%)
             92 active+recovery_wait+degraded+remapped
             69 activating+degraded+remapped
             28 active+recovery_wait+degraded
             24 active+clean
             20 remapped+peering
             14 active+recovery_wait+remapped
             10 active+remapped+backfill_wait
             8  active+recovery_wait+undersized+degraded+remapped
             7  active+recovery_wait
             7  active+undersized+degraded
             5  active+undersized
             3  active+recovery_wait+undersized+remapped
             1  active+recovering+degraded+remapped

  io:
    client:   16 MiB/s wr, 0 op/s rd, 216 op/s wr
    recovery: 40 MiB/s, 9 objects/s

To condense the examples, several minutes later here is another extract:

    client:   329 KiB/s rd, 7.3 MiB/s wr, 82 op/s rd, 151 op/s wr
    recovery: 93 MiB/s, 23 objects/s

Now we see IO is really picking up and recovery is dropping:

    client:   12 MiB/s rd, 83 MiB/s wr, 3.13k op/s rd, 3.14k op/s wr
    recovery: 2.0 MiB/s, 0 objects/s

Then finally the recovery is not reported anymore:

    client: 5.8 MiB/s rd, 36 MiB/s wr, 1.47k op/s rd, 1.53k op/s wr

Even though the PGs are not healthy:

    pgs: 25796/57831 objects misplaced (44.606%)
         190 active+remapped+backfill_wait
         96  active+clean
         1   active+remapped+backfilling
         1   active+undersized+remapped+backfilling
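When eyeballing captures like the one in the gist over time, it can help to pull the recovery rate out of each `ceph -s` io section programmatically. A small sketch of such a parser (the regex is naive and assumes MiB/s units, as in the samples quoted above):

```python
import re

def recovery_mibps(io_line):
    """Extract the MiB/s figure from a `ceph -s` recovery line,
    e.g. 'recovery: 40 MiB/s, 9 objects/s' -> 40.0 (None if absent)."""
    m = re.search(r"recovery:\s*([\d.]+)\s*MiB/s", io_line)
    return float(m.group(1)) if m else None

# Samples taken from the capture quoted in this comment.
samples = [
    "recovery: 40 MiB/s, 9 objects/s",
    "recovery: 93 MiB/s, 23 objects/s",
    "recovery: 2.0 MiB/s, 0 objects/s",
    "client: 5.8 MiB/s rd, 36 MiB/s wr",  # recovery no longer reported
]
rates = [recovery_mibps(s) for s in samples]
```

Plotting these per-sample rates makes the pattern visible at a glance: recovery throughput collapses exactly as client IO ramps up.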
(In reply to Travis Nielsen from comment #10)
> @Josh suggestions on troubleshooting these issues with PGs during upgrade
> with active IO? Does the test need to give more priority to the recovery to
> complete faster, or what could be the issue?

There are two issues here:

1) Unrealistic expectations in the test - it appears to be saturating the cluster with I/O, then expecting losing 1/6th of the I/O capacity to be recoverable quickly while still saturating the cluster with I/O. The PGs go active after the first ceph -s output, so the remainder of the time is spent catching up to the ongoing I/O.

Running a cluster at full I/O capacity leaves you with no room for handling failures, upgrades, or even basic operations like scrubbing. It's not a realistic way to operate - there should always be some I/O headroom in the cluster, just like we don't fill up a cluster's storage capacity to 100%. We should measure the performance of the cluster footprint we're using for these tests, and test with a limit of e.g. 50% of that max throughput. Even then, a 12 minute timeout may be too short.

If the goal is to make the tests run faster, we can tweak recovery settings to make it faster just for testing. That's not necessarily the right trade-off for customer workloads, which often care more about client I/O than recovery speed. Running these tests with a larger cluster would also help, since there would be more parallelism for recovery, and a smaller proportion of the cluster is affected by a single OSD going down. As we saw in https://bugzilla.redhat.com/show_bug.cgi?id=1809064#c24, recovery with no client I/O is quick.

2) Only half the cluster has much data - 3 osds have 70G while the other 3 have 2G. This means all I/O is concentrated on 3 disks, reducing the possible performance of the cluster by 1/2. Were 3 fresh OSDs just added? If so, recovery will perform much better if we let it complete before the upgrade, since there will be twice the available throughput.
If these OSDs should already have data, we have some sort of issue with the crush rules or number of pgs that is causing half the cluster to go underutilized. If we had a crush dump and pg dump we could see what's happening there.
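The second point - half the OSDs holding ~70G while the other half hold ~2G - can be quantified as a simple utilization skew check. A sketch (the threshold and per-OSD numbers here are illustrative, mirroring the split described above, not taken from a real `ceph osd df`):

```python
def osd_skew(used_gib):
    """Ratio of the most- to least-utilized OSD; inf if one is empty."""
    lo, hi = min(used_gib), max(used_gib)
    return float("inf") if lo == 0 else hi / lo

# Per-OSD used capacity mirroring the 3x70G / 3x2G split in this cluster.
used = [70, 70, 70, 2, 2, 2]
skew = osd_skew(used)
imbalanced = skew > 3  # arbitrary threshold for "half the cluster idle"
```

A check like this against `ceph osd df` output would flag the imbalance before an upgrade, whether it comes from freshly added OSDs or from a crush/pg-count problem.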
Hey Josh,

1) I didn't write the pre-upgrade test which does IO in the background, but I will try to gather some data about the FIO settings and what is actually running there. I will send an email with some questions regarding the IO, as this is an area where I don't have much information, and I guess we will get more details from other folks.

I see the I/O is initialized by this fixture:
https://github.com/red-hat-storage/ocs-ci/blob/2f196c6ba00f3bde26acfdc5c6881b8f72dd478d/tests/ecosystem/upgrade/conftest.py#L564

And the FIO config is probably done here:
https://github.com/red-hat-storage/ocs-ci/blob/2f196c6ba00f3bde26acfdc5c6881b8f72dd478d/tests/ecosystem/upgrade/conftest.py#L179-L218

But I don't know much about FIO, so I would appreciate it if other folks could add their thoughts on whether the settings used by Filip produce unmanageable IO that we should reduce somehow.

2) The add capacity test is running at timestamp 22:34:37:

tests/manage/z_cluster/cluster_expansion/test_add_capacity.py::TestAddCapacityPreUpgrade::test_add_capacity_pre_upgrade
-------------------------------- live log call ---------------------------------
22:34:37 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get storagecluster -n openshift-storage -o yaml
22:34:37 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get storagecluster -n openshift-storage -o yaml

And the upgrade of OCS is starting at 22:36:39:

22:36:39 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.
22:36:39 - MainThread - tests.conftest - INFO - Ceph health check passed at teardown
tests/ecosystem/upgrade/test_upgrade.py::test_upgrade

So the difference is about 2 minutes from what I can see. The upgrade itself starts here:

22:37:53 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig patch subscription ocs-subscription -n openshift-storage --type merge -p '{"spec":{"channel": "stable-4.5", "source": "ocs-catalogsource"}}'

> If we had a crush dump and pg dump we could see what's happening there.

I'm not sure if that's part of the OCS must gather. If so, we should have it somewhere here for the first occurrence:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200710T163031/logs/failed_testcase_ocs_logs_1594401865/test_upgrade_ocs_logs/ocs_must_gather/

Thanks,
Petr
Elad, are you helping Petr with this?
1. Can we ensure that, without saturating the cluster, the upgrade works?
2. Let's ensure we do not have a performance regression.
Hi Yaniv, Petr,

What I suggest is to run the upgrade suite as we did so far, only with the `--log-cluster-utilization` run-ci param. It will print latency, IOPS, throughput, and storage capacity utilization to the console log every 10 seconds. That way, we will see how saturated the cluster is during the upgrade with the load we are generating in the tests right now.

After that, we will start using the `--io-in-bg` run-ci param, which will trigger an intelligent load mechanism that first determines the cluster's limit and then drops to the percentage we want. I saw we already have a PR for that in our infra project.
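The calibrate-then-throttle idea (and Josh's earlier suggestion of capping at e.g. 50% of measured max throughput) boils down to something like the following sketch; the function name and numbers are illustrative, not the actual ocs-ci `--io-in-bg` implementation:

```python
def background_io_cap(measured_max_mibps, target_pct=50):
    """Cap background IO at target_pct of the measured maximum,
    leaving headroom for recovery, scrubbing, and upgrade work."""
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in (0, 100]")
    return measured_max_mibps * target_pct / 100

# e.g. if a calibration run measured ~120 MiB/s sustained throughput,
# the resulting cap could feed a rate limit on the fio workload.
cap = background_io_cap(120, target_pct=50)
```

The key design point is the calibration step: the cap is relative to what this particular cluster footprint can actually sustain, rather than a fixed absolute rate that may or may not saturate it.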
Moving to the ceph component since Rook is reconciling the upgrade as expected, and it is a question of IO load during the test related to PG health.
Running the background IO implemented by Elad here: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/10004/console with Filip's IO tests disabled.
Recovery with Elad's IO was much quicker. @Elad, or someone from the performance team, can maybe take a look at the numbers we see in the console output.
Looks like there is additional info since you looked at this last.
*** Bug 1856975 has been marked as a duplicate of this bug. ***
This appears to be purely a test setup issue - is there any need to keep this bz open?
*** Bug 1809064 has been marked as a duplicate of this bug. ***
I have updated the question in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1856975#c8 - if you can answer there, that would be great. Thanks
(In reply to Petr Balogh from comment #23) > I have updated the question in this BZ: > https://bugzilla.redhat.com/show_bug.cgi?id=1856975#c8 if you can please > answer there will be great. Thanks Can we close this one?
I think we can. Thanks, Josh, for the answer in the other BZ.