Bug 1856254 - PGs are unhealthy before and after the upgrade OCS 4.4 to 4.5 with IO load
Summary: PGs are unhealthy before and after the upgrade OCS 4.4 to 4.5 with IO load
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Josh Durgin
QA Contact: Raz Tamir
URL:
Whiteboard:
Duplicates: 1809064 1856975
Depends On:
Blocks:
 
Reported: 2020-07-13 07:51 UTC by Petr Balogh
Modified: 2020-07-30 13:58 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-30 13:58:14 UTC
Embargoed:



Description Petr Balogh 2020-07-13 07:51:56 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9707/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/
ocs_ci.ocs.exceptions.TimeoutException: Timeout 720 reached! Error: {err_msg}

This means that even after the 720-second timeout, the mon pods still have the old image.

18:08:51 - MainThread - ocs_ci.ocs.resources.pod - WARNING - Images: {'registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:20cf789235e23ddaf38e109b391d1496bb88011239d16862c4c106d0e05fea9e'} weren't upgraded in: rook-ceph-mon-a-67d4c66748-x9ftg!

When I checked a fresh installation of OCS 4.5, I see that the image which is supposed to be there is:
quay.io/rhceph-dev/rhceph@sha256:45e6f28bb67c81b826acb64fad5c0da1cac3dffb41a88992fe4ca2be79575fa6
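
For reference, a quick way to list the image each mon pod is actually running (a sketch using plain oc; the `app=rook-ceph-mon` label selector and the jsonpath are my assumptions, not something the test itself uses):

$ oc -n openshift-storage get pods -l app=rook-ceph-mon \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'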


I see that later on the mon pods got restarted and upgraded.

I can see this from a must-gather which was taken after a later test run, here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200710T163031/logs/failed_testcase_ocs_logs_1594401865/test_upgrade_mcg_io_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-ab442c033fdeb4fdfe33af77ec81eda1eb4f7c41b623a7265f645f76df700353/oc_output/get_pods_-owide_-n_openshift-storage


I am curious whether there was some change in the order in which the pods are upgraded, or why we see this behavior.


Version of all relevant components (if applicable):
OCS 4.4.1 installed from live content - upgrade to internal build quay.io/rhceph-dev/ocs-olm-operator:4.5.0-485.ci

OCP version:
4.4.0-0.nightly-2020-07-10-113925


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Trying now:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto/27/console


Can this issue be reproduced from the UI?
Haven't tried.


If this is a regression, please provide more details to justify this:
Yes, I don't remember hitting such a timeout for mon pods before.


Steps to Reproduce:
1. Install 4.4.1
2. Add capacity to the cluster
3. Run some IO in the background
4. Upgrade to the 4.5 internal build (see the sketch below)
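
For step 4, the upgrade is triggered by pointing the OCS subscription at the new channel and catalog source, roughly like this (the subscription name, namespace and catalog source here are taken from the command the test runs later in the logs; paths to kubeconfig etc. are omitted):

$ oc -n openshift-storage patch subscription ocs-subscription --type merge \
    -p '{"spec":{"channel": "stable-4.5", "source": "ocs-catalogsource"}}'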


Actual results:
Mon pods are not upgraded within the 720-second timeout.


Expected results:
Mon pods are upgraded within the timeout.


Additional info:
Must gather: 
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200710T163031/logs/failed_testcase_ocs_logs_1594401865/test_upgrade_ocs_logs/ocs_must_gather/

Comment 3 Petr Balogh 2020-07-13 13:38:15 UTC
The order of the components we check is defined here:

https://github.com/red-hat-storage/ocs-ci/blob/master/ocs_ci/ocs/ocs_upgrade.py#L84-L122

Comment 4 Travis Nielsen 2020-07-13 15:01:10 UTC
The Rook operator log [1] shows a 15 minute delay between the start of the rook operator and when the upgrade of the rook daemons begins. The sequence is:
1. (17:55:21) The Rook operator is updated and restarts
2. The operator runs a complete reconcile, which takes about 15 minutes because the PGs were not clean.
3. (18:09:15) The CephCluster CR update event is triggered in the operator, which finally proceeds with the update to all the mons, osds, and other daemons as expected.

2020-07-10T17:55:21.506205148Z 2020-07-10 17:55:21.506083 I | rookcmd: starting Rook 4.5-32.e1c766a1.release_4.5 with arguments '/usr/local/bin/rook ceph operator'
...
2020-07-10T18:03:19.013126371Z 2020-07-10 18:03:19.013086 I | clusterdisruption-controller: pg health: "cluster is not fully clean. PGs: 
...
2020-07-10T18:09:15.701716992Z 2020-07-10 18:09:15.701709 I | op-cluster: Done creating rook instance in namespace openshift-storage
...
2020-07-10T18:09:15.753901877Z 2020-07-10 18:09:15.753861 I | op-cluster: The Cluster CR has changed. diff=  v1.ClusterSpec{
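
For reference, the PG state the reconcile is waiting on can be checked manually from the toolbox pod, e.g. (a sketch; the `app=rook-ceph-tools` label selector is an assumption):

$ TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
$ oc -n openshift-storage rsh $TOOLS ceph status
$ oc -n openshift-storage rsh $TOOLS ceph pg stat   # the reconcile waits while PGs are not clean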

@Petr, was the health of the cluster different in this run from what typically happens in the test? 


[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200710T163031/logs/failed_testcase_ocs_logs_1594401865/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-ab442c033fdeb4fdfe33af77ec81eda1eb4f7c41b623a7265f645f76df700353/ceph/namespaces/openshift-storage/pods/rook-ceph-operator-6f85fdf444-x7blh/rook-ceph-operator/rook-ceph-operator/logs/current.log

Comment 5 Petr Balogh 2020-07-13 15:46:14 UTC
That's weird, Travis.

I see that the Ceph health was OK before we started to run the upgrade test:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9707/consoleFull

17:53:31 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.
17:53:31 - MainThread - tests.conftest - INFO - Ceph health check passed at teardown

tests/ecosystem/upgrade/test_upgrade.py::test_upgrade 
-------------------------------- live log call ---------------------------------

From the debug log of our test:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200710T163031/logs/ocs-ci-logs-1594401865/by_outcome/failed/tests/ecosystem/upgrade/test_upgrade.py/test_upgrade/logs
I see it's still HEALTH_OK even just before the upgrade itself:



2020-07-10 17:54:34,472 - MainThread - INFO - ocs_ci.ocs.ocs_upgrade.set_upgrade_images.383 - Image: quay.io/rhceph-dev/ocs-olm-operator:4.5.0-485.ci will be used for upgrade.
2020-07-10 17:54:34,479 - MainThread - INFO - ocs_ci.utility.templating.dump_data_to_temp_yaml.210 - apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: '2020-07-10T17:54:07Z'
  generation: 1
  labels:
    ocs-operator-internal: 'true'
  name: ocs-catalogsource
  namespace: openshift-marketplace
  resourceVersion: '50018'
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/openshift-marketplace/catalogsources/ocs-catalogsource
  uid: c9a2c8bf-e9bb-4c1a-b23b-e995df9a7473
spec:
  displayName: Openshift Container Storage
  icon:
    base64data: ''
    mediatype: ''
  image: quay.io/rhceph-dev/ocs-olm-operator:4.5.0-485.ci
  publisher: Red Hat
  sourceType: grpc
status:
  connectionState:
    address: ocs-catalogsource.openshift-marketplace.svc:50051
    lastConnect: '2020-07-10T17:54:31Z'
    lastObservedState: READY
  registryService:
    createdAt: '2020-07-10T17:54:07Z'
    port: '50051'
    protocol: grpc
    serviceName: ocs-catalogsource
    serviceNamespace: openshift-marketplace

2020-07-10 17:54:34,479 - MainThread - INFO - ocs_ci.utility.utils.exec_cmd.437 - Executing command: oc -n openshift-marketplace --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig apply -f /tmp/tmp9aeaacda
2020-07-10 17:54:35,054 - MainThread - DEBUG - ocs_ci.utility.utils.exec_cmd.450 - Command stdout: catalogsource.operators.coreos.com/ocs-catalogsource configured

2020-07-10 17:54:35,054 - MainThread - WARNING - ocs_ci.utility.utils.exec_cmd.456 - Command stderr: Warning: oc apply should be used on resource created by either oc create --save-config or oc apply

2020-07-10 17:54:35,054 - MainThread - DEBUG - ocs_ci.utility.utils.exec_cmd.459 - Command return code: 0
2020-07-10 17:54:35,055 - MainThread - INFO - ocs_ci.utility.utils.exec_cmd.437 - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get subscription ocs-subscription -n openshift-storage -o yaml
2020-07-10 17:54:35,211 - Thread-2 - DEBUG - ocs_ci.utility.utils.exec_cmd.450 - Command stdout: HEALTH_OK

2020-07-10 17:54:35,211 - Thread-2 - DEBUG - ocs_ci.utility.utils.exec_cmd.458 - Command stderr is empty
2020-07-10 17:54:35,212 - Thread-2 - DEBUG - ocs_ci.utility.utils.exec_cmd.459 - Command return code: 0
2020-07-10 17:54:35,296 - MainThread - DEBUG - ocs_ci.utility.utils.exec_cmd.450 - Command stdout: apiVersion: operators.coreos.com/v1alpha1



CSV is now upgrading to 4.5:

2020-07-10 17:54:37,431 - MainThread - WARNING - ocs_ci.ocs.ocp.get.215 - Number of attempts to get resource reached!
2020-07-10 17:54:37,431 - MainThread - INFO - ocs_ci.ocs.ocp.check_phase.775 - Cannot find resource object ocs-operator.v4.5.0-485.ci
2020-07-10 17:54:37,431 - MainThread - INFO - ocs_ci.utility.utils.__iter__.774 - Going to sleep for 5 seconds before next iteration
2020-07-10 17:54:40,217 - Thread-2 - INFO - ocs_ci.utility.utils.exec_cmd.437 - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig rsh rook-ceph-tools-6c67d65646-dxwzb ceph health detail
2020-07-10 17:54:41,019 - Thread-2 - DEBUG - ocs_ci.utility.utils.exec_cmd.450 - Command stdout: HEALTH_OK

I still see health OK at this time.

2020-07-10 17:54:43,028 - MainThread - INFO - ocs_ci.ocs.ocp.check_phase.780 - Resource ocs-operator.v4.5.0-485.ci is in phase: Pending!
2020-07-10 17:54:43,029 - MainThread - INFO - ocs_ci.utility.utils.__iter__.774 - Going to sleep for 5 seconds before next iteration
2020-07-10 17:54:46,025 - Thread-2 - INFO - ocs_ci.utility.utils.exec_cmd.437 - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig rsh rook-ceph-tools-6c67d65646-dxwzb ceph health detail
2020-07-10 17:54:46,946 - Thread-2 - DEBUG - ocs_ci.utility.utils.exec_cmd.450 - Command stdout: HEALTH_OK


Here the CSV gets to the Succeeded state for a moment:
2020-07-10 17:55:46,442 - MainThread - INFO - ocs_ci.ocs.ocp.check_phase.780 - Resource ocs-operator.v4.5.0-485.ci is in phase: Succeeded!

Even here I still see:
2020-07-10 17:55:51,068 - Thread-2 - DEBUG - ocs_ci.utility.utils.exec_cmd.450 - Command stdout: HEALTH_OK

So the health of the cluster seems OK, Travis, even when the upgrade itself started.  My understanding is that if the PGs became unclean, it was caused by the upgrade itself, as IMO we don't run anything in the background during this upgrade that could lead to this.


Running once more here:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9782/consoleFull

Thanks

Comment 6 Travis Nielsen 2020-07-13 17:02:27 UTC
@Petr 
- Can I connect to the cluster when it is in this state?
- Do the PGs become clean after the upgrade completes later? I don't see the end of the upgrade in the operator log, so I can't tell whether the PGs settled.

Comment 7 Petr Balogh 2020-07-13 21:59:12 UTC
Travis, the cluster has already been deleted :( . The second job I triggered got stuck with some other issue (in ocs-ci) and I needed to abort the job.

I re-triggered a new job, but it's in the queue now. I will see if the issue reproduces, and if so I can bring up another cluster for the upgrade tomorrow. When you are online you can ping me on chat and I will bring up the cluster for you. I will update tomorrow on how the reproduction of the BZ went.

Comment 8 Petr Balogh 2020-07-14 10:13:58 UTC
Hey Travis, the same issue was reproduced here:

https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9803/

OCS must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200714T040124/logs/failed_testcase_ocs_logs_1594702730/test_upgrade_ocs_logs/ocs_must_gather


To your question:

I paused the environment before the teardown, and after some time the cluster became clean and upgraded:

$ oc rsh -n openshift-storage rook-ceph-tools-6c67d65646-6dbvp ceph health detail
HEALTH_OK

$ oc get pod -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-6zrzj                                            3/3     Running     0          4h39m
csi-cephfsplugin-ckfbj                                            3/3     Running     0          4h40m
csi-cephfsplugin-lpxrh                                            3/3     Running     0          4h40m
csi-cephfsplugin-provisioner-785f964c-fl878                       5/5     Running     0          4h40m
csi-cephfsplugin-provisioner-785f964c-pc5x5                       5/5     Running     0          4h40m
csi-rbdplugin-52hv9                                               3/3     Running     0          4h40m
csi-rbdplugin-fwh2h                                               3/3     Running     0          4h40m
csi-rbdplugin-hhng8                                               3/3     Running     0          4h40m
csi-rbdplugin-provisioner-6597886578-jsk62                        5/5     Running     0          4h40m
csi-rbdplugin-provisioner-6597886578-x2dbk                        5/5     Running     0          4h40m
lib-bucket-provisioner-79f6dd6b99-g5tzr                           1/1     Running     0          5h28m
noobaa-core-0                                                     1/1     Running     0          4h39m
noobaa-db-0                                                       1/1     Running     0          4h40m
noobaa-endpoint-b88d46f6d-q25mh                                   1/1     Running     0          4h40m
noobaa-operator-54649f5748-f8g7k                                  1/1     Running     0          4h40m
ocs-operator-b5ff748b-cssvj                                       1/1     Running     0          4h40m
rook-ceph-crashcollector-ip-10-0-149-42-54cfb7cdc5-qrx97          1/1     Running     0          4h27m
rook-ceph-crashcollector-ip-10-0-185-4-7c8d8fb8c5-89v2r           1/1     Running     0          4h26m
rook-ceph-crashcollector-ip-10-0-203-37-86889c889b-bv8rp          1/1     Running     0          4h24m
rook-ceph-drain-canary-0aa62f8c4250f3f076a9296dc1f9056c-5867j2k   1/1     Running     0          4h37m
rook-ceph-drain-canary-7b7faf30366f94a3a45e61490ff29bea-7bm8psg   1/1     Running     0          5h24m
rook-ceph-drain-canary-a3cfb071ba558635c1ab8d9a14129f9c-f8mdr76   1/1     Running     0          5h24m
rook-ceph-drain-canary-b0ab017f326b4d410305ae5028a48fe3-672894z   1/1     Running     0          4h31m
rook-ceph-drain-canary-ip-10-0-185-4.us-east-2.compute.intrjhmb   1/1     Running     0          4h34m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7c76dd87n4v45   1/1     Running     0          4h40m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-554f7b5b92b9m   1/1     Running     0          4h39m
rook-ceph-mgr-a-7868c78495-mlvkc                                  1/1     Running     0          4h24m
rook-ceph-mon-a-b9b7f9fc7-92dgj                                   1/1     Running     0          4h27m
rook-ceph-mon-b-54f84749dc-wj5mt                                  1/1     Running     0          4h26m
rook-ceph-mon-c-6f974df68f-874fx                                  1/1     Running     0          4h24m
rook-ceph-operator-6f85fdf444-2w788                               1/1     Running     0          4h40m
rook-ceph-osd-0-8c69fb45-qgzmf                                    1/1     Running     0          4h24m
rook-ceph-osd-1-cb95b4665-7tlcv                                   1/1     Running     0          4h18m
rook-ceph-osd-2-86bbddf59-nbcql                                   1/1     Running     0          4h21m
rook-ceph-osd-3-5474cc8dbd-xf96v                                  1/1     Running     0          4h22m
rook-ceph-osd-4-666dc4758c-ntphz                                  1/1     Running     0          4h19m
rook-ceph-osd-5-6d99f4956c-m69xp                                  1/1     Running     0          4h17m
rook-ceph-osd-prepare-ocs-deviceset-0-0-v97bg-shnhf               0/1     Completed   0          5h25m
rook-ceph-osd-prepare-ocs-deviceset-0-1-2px2d-qb44p               0/1     Completed   0          4h47m
rook-ceph-osd-prepare-ocs-deviceset-1-0-4845h-cd5wj               0/1     Completed   0          5h25m
rook-ceph-osd-prepare-ocs-deviceset-1-1-nnsn9-x79sf               0/1     Completed   0          4h47m
rook-ceph-osd-prepare-ocs-deviceset-2-0-9kb56-xj8xn               0/1     Completed   0          5h25m
rook-ceph-osd-prepare-ocs-deviceset-2-1-4mghg-7v7n7               0/1     Completed   0          4h47m
rook-ceph-tools-6c67d65646-6dbvp                                  1/1     Running     0          5h23m

$ oc get csv -n openshift-storage
NAME                            DISPLAY                       VERSION        REPLACES                        PHASE
lib-bucket-provisioner.v2.0.0   lib-bucket-provisioner        2.0.0          lib-bucket-provisioner.v1.0.0   Succeeded
ocs-operator.v4.5.0-485.ci      OpenShift Container Storage   4.5.0-485.ci   ocs-operator.v4.4.1             Succeeded

Do you still need the cluster? If so, in what shape? If I schedule the upgrade, would you like to check the progress during the upgrade itself, or have access to the cluster after the upgrade is finished?

Thanks, Petr

Comment 9 Travis Nielsen 2020-07-14 16:53:13 UTC
@Petr A connection to the cluster is needed to analyze why the PGs are unclean, to see if the OSDs are failing to start. Though from the logs and the pod status it doesn't appear the OSDs are failing to start, so it must be another issue.

Comment 10 Travis Nielsen 2020-07-15 00:03:53 UTC
@Josh, any suggestions on troubleshooting these issues with PGs during an upgrade with active IO? Does the test need to give more priority to recovery so it completes faster, or could something else be the issue?

Connecting to a cluster in progress, I see that the PGs are not healthy during or after the upgrade.

After the upgrade, one OSD seems to stay down for some amount of time, although I didn't capture which OSD or exactly how long it was down.
After that one OSD recovered, the OSDs are always seen to be healthy according to "oc get pod" and "ceph osd status". 
 
Here is a full capture over time of the output of several commands such as "ceph status" and "ceph osd status", showing the PGs in various stages of recovery.
https://gist.github.com/travisn/4a5caf6fe69881f9be27d13128d1ded5
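
The capture is just periodic status output; a loop along these lines reproduces it (a sketch; the interval, file name and label selector are arbitrary assumptions):

$ TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
$ while true; do
>   date
>   oc -n openshift-storage rsh $TOOLS ceph status
>   oc -n openshift-storage rsh $TOOLS ceph osd status
>   sleep 30
> done | tee pg-recovery-capture.log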

To extract some of the highlights...

Here is the first status captured during or shortly after the upgrade. One OSD is down and recovery is actively in progress:

~/src/go/src/github.com/rook/rook$ oc get pod
NAME                                                              READY   STATUS      RESTARTS   AGE
rook-ceph-osd-0-9d44cfd49-mc5nh                                   1/1     Running     0          4m19s
rook-ceph-osd-1-6597749b-wt4zf                                    1/1     Running     0          7m17s
rook-ceph-osd-2-85889798dd-h5fwh                                  1/1     Running     0          78s
rook-ceph-osd-3-7587f98d4-np2fg                                   1/1     Running     0          13m
rook-ceph-osd-4-b96474f9d-ns8jb                                   1/1     Running     0          2m55s
rook-ceph-osd-5-6f8c4cc7b7-8zq4n                                  1/1     Running     0          5m48s

sh-4.4# ceph -s
  cluster:
    id:     bdbfe8d3-887c-4aa3-8564-c631b3d0aafc
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            1 osds down
            1 host (1 osds) down
            Degraded data redundancy: 4201/57402 objects degraded (7.319%), 205 pgs degraded
            13 slow ops, oldest one blocked for 56 sec, daemons [osd.0,osd.1,osd.2] have slow ops.

  services:
    mon: 3 daemons, quorum a,b,c (age 8m)
    mgr: a(active, since 7m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 6 osds: 5 up (since 8s), 6 in (since 13m); 217 remapped pgs

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   3 pools, 288 pgs
    objects: 19.13k objects, 74 GiB
    usage:   226 GiB used, 12 TiB / 12 TiB avail
    pgs:     30.903% pgs not active
             4201/57402 objects degraded (7.319%)
             21976/57402 objects misplaced (38.284%)
             92 active+recovery_wait+degraded+remapped
             69 activating+degraded+remapped
             28 active+recovery_wait+degraded
             24 active+clean
             20 remapped+peering
             14 active+recovery_wait+remapped
             10 active+remapped+backfill_wait
             8  active+recovery_wait+undersized+degraded+remapped
             7  active+recovery_wait
             7  active+undersized+degraded
             5  active+undersized
             3  active+recovery_wait+undersized+remapped
             1  active+recovering+degraded+remapped

  io:
    client:   16 MiB/s wr, 0 op/s rd, 216 op/s wr
    recovery: 40 MiB/s, 9 objects/s


To condense the examples, several minutes later here is another extract:

    client:   329 KiB/s rd, 7.3 MiB/s wr, 82 op/s rd, 151 op/s wr
    recovery: 93 MiB/s, 23 objects/s

Now we see IO is really picking up and recovery is dropping:

    client:   12 MiB/s rd, 83 MiB/s wr, 3.13k op/s rd, 3.14k op/s wr
    recovery: 2.0 MiB/s, 0 objects/s

Then finally the recovery is not reported anymore:

   client:   5.8 MiB/s rd, 36 MiB/s wr, 1.47k op/s rd, 1.53k op/s wr

Even though the PGs are not healthy:

    pgs:     25796/57831 objects misplaced (44.606%)
             190 active+remapped+backfill_wait
             96  active+clean
             1   active+remapped+backfilling
             1   active+undersized+remapped+backfilling

Comment 11 Josh Durgin 2020-07-15 01:35:39 UTC
(In reply to Travis Nielsen from comment #10)
> @Josh suggestions on troubleshooting these issues with PGs during upgrade
> with active IO? Does the test need to give more priority to the recovery to
> complete faster, or what could be the issue? 

There are two issues here:

1) Unrealistic expectations in the test - it appears to be saturating the cluster with I/O, then expecting recovery from losing 1/6th of the I/O capacity to complete quickly while still saturating the cluster with I/O.

The PGs go active after the first ceph -s output, so the remainder of the time is spent catching up to the ongoing I/O.

Running a cluster at full I/O capacity leaves you with no room for handling failures, upgrades, or even basic operations like scrubbing. It's not a realistic way to operate - there should always be some I/O headroom in the cluster, just like we don't fill up a cluster's storage capacity to 100%.

We should measure the performance of the cluster footprint we're using for these tests, and test with a limit of e.g. 50% of that max throughput. Even then, a 12 minute timeout may be too short.
If the goal is to make the tests run faster, we can tweak recovery settings to make it faster just for testing. That's not necessarily the right trade-off for customer workloads, which often care more about client I/O than recovery speed.
Running these tests with a larger cluster would also help, since there would be more parallelism for recovery, and a smaller proportion of the cluster is affected by a single OSD going down.

As we saw in https://bugzilla.redhat.com/show_bug.cgi?id=1809064#c24 recovery with no client I/O is quick.
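
For illustration, the usual knobs for speeding up recovery in a test environment would be something like the following (a sketch only; these are standard Ceph config options, but the values are arbitrary test-only choices, not a recommendation for customer clusters):

# Run from the rook-ceph-tools pod; values are illustrative, not validated for OCS.
$ ceph config set osd osd_max_backfills 4          # default 1
$ ceph config set osd osd_recovery_max_active 8    # default 3
$ ceph config set osd osd_recovery_sleep_hdd 0     # default 0.1, removes the recovery throttle sleep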

2) Only half the cluster has much data - 3 osds have 70G while the other 3 have 2G. This means all I/O is concentrated on 3 disks, reducing the possible performance of the cluster by 1/2.

Were 3 fresh OSDs just added? If so, recovery will perform much better if we let it complete before the upgrade, since there will be twice the available throughput. If these OSDs
should already have data, we have some sort of issue with the crush rules or number of pgs that is causing half the cluster to go underutilized. If we had a crush dump and pg dump we could see what's happening there.
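
For reference, those dumps (plus `ceph osd df`, which shows the per-OSD utilization imbalance directly) can be grabbed from the toolbox pod roughly like this (a sketch; the label selector and file names are assumptions):

$ TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
$ oc -n openshift-storage rsh $TOOLS ceph osd crush dump > crush-dump.json
$ oc -n openshift-storage rsh $TOOLS ceph pg dump > pg-dump.txt
$ oc -n openshift-storage rsh $TOOLS ceph osd df tree   # per-OSD data and utilization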

Comment 12 Petr Balogh 2020-07-15 09:07:39 UTC
Hey Josh,


1) I didn't write the pre-upgrade test which does IO in the background, but I will try to gather some data about the FIO settings and what is actually running there.
Actually, I will send an email with some questions regarding the IO, as this is an area where I don't have much information, and I guess we will get more details from other folks.
I see I/O is initialized by this fixture:
https://github.com/red-hat-storage/ocs-ci/blob/2f196c6ba00f3bde26acfdc5c6881b8f72dd478d/tests/ecosystem/upgrade/conftest.py#L564
And I see that the FIO config is probably done here:
https://github.com/red-hat-storage/ocs-ci/blob/2f196c6ba00f3bde26acfdc5c6881b8f72dd478d/tests/ecosystem/upgrade/conftest.py#L179-L218

But I don't know much about FIO, so I would appreciate it if other folks added their thoughts on whether the settings used by Filip generate unmanageable IO and whether we should reduce them somehow.
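
For illustration, fio itself can cap the generated load so the cluster is not saturated. A hypothetical stand-alone invocation, just to show the relevant knobs (none of these values come from the ocs-ci fixture, and the mount path is made up):

# rate_iops caps read,write IOPS per job; the fixture builds its own job spec.
$ fio --name=bg-load --directory=/mnt/test-pvc \
      --rw=randrw --bs=4k --size=2g --numjobs=4 --group_reporting \
      --time_based --runtime=3600 \
      --rate_iops=100,100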



2) The add-capacity test is running at this timestamp, 22:34:37:
tests/manage/z_cluster/cluster_expansion/test_add_capacity.py::TestAddCapacityPreUpgrade::test_add_capacity_pre_upgrade 
-------------------------------- live log call ---------------------------------
22:34:37 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get storagecluster  -n openshift-storage -o yaml
22:34:37 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get storagecluster  -n openshift-storage -o yaml


And the upgrade of OCS starts at 22:36:39:
22:36:39 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.
22:36:39 - MainThread - tests.conftest - INFO - Ceph health check passed at teardown

tests/ecosystem/upgrade/test_upgrade.py::test_upgrade 

So the difference is 2 minutes, from what I can see.

And the upgrade itself starts here:
22:37:53 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig patch subscription ocs-subscription -n openshift-storage --type merge -p '{"spec":{"channel": "stable-4.5", "source": "ocs-catalogsource"}}'



> If we had a crush dump and pg dump we could see what's happening there.
I'm not sure if that's part of the OCS must-gather? If so, we should have it somewhere here (for the first occurrence):
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200710T163031/logs/failed_testcase_ocs_logs_1594401865/test_upgrade_ocs_logs/ocs_must_gather/



Thanks,
Petr

Comment 13 Yaniv Kaul 2020-07-16 13:11:29 UTC
Elad, are you helping Petr with this?
1. Can we ensure that, without saturating the cluster, the upgrade works?
2. Let's ensure we do not have a performance regression.

Comment 14 Elad 2020-07-16 13:27:42 UTC
Hi Yaniv, Petr,

What I suggest is to run the upgrade suite as we have so far, only with the `--log-cluster-utilization` run-ci param. It will print latency, IOPS, throughput, and storage capacity utilization to the console log every 10 seconds.
That way, we will see how saturated the cluster is during the upgrade with the load we are doing in the tests right now.


After that, we will start using the `--io-in-bg` run-ci param, which will trigger an intelligent load mechanism that first determines the limit and then drops to the percentage we want. I saw we already have a PR for that in our infra project.
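
A hypothetical invocation, just to show where these run-ci params go (the cluster/conf flags and paths are placeholders for whatever the Jenkins job normally passes; only `--log-cluster-utilization` and `--io-in-bg` are the params named above):

$ run-ci --cluster-name mycluster --cluster-path ~/clusters/mycluster \
         --ocsci-conf conf/my-upgrade-conf.yaml \
         --log-cluster-utilization --io-in-bg \
         tests/ecosystem/upgrade/test_upgrade.py::test_upgrade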

Comment 16 Travis Nielsen 2020-07-16 23:16:28 UTC
Moving to the ceph component since Rook is reconciling the upgrade as expected, and it is a question of IO load during the test related to PG health.

Comment 17 Petr Balogh 2020-07-17 09:43:08 UTC
Running IO in the background here, using what was done by Elad:

https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/10004/console

With Filip's IO tests disabled.

Comment 18 Petr Balogh 2020-07-17 13:38:31 UTC
Recovery with Elad's IO was much quicker.  @Elad, or someone from the performance team, can maybe take a look at the numbers we see in the console output.

Comment 19 Scott Ostapovicz 2020-07-21 16:00:26 UTC
Looks like there is additional info since you looked at this last.

Comment 20 Josh Durgin 2020-07-21 19:50:19 UTC
*** Bug 1856975 has been marked as a duplicate of this bug. ***

Comment 21 Josh Durgin 2020-07-21 19:54:45 UTC
This appears to be purely a test setup issue - is there any need to keep this bz open?

Comment 22 Josh Durgin 2020-07-22 19:54:46 UTC
*** Bug 1809064 has been marked as a duplicate of this bug. ***

Comment 23 Petr Balogh 2020-07-27 08:30:30 UTC
I have updated the question in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1856975#c8. If you can please answer there, that would be great. Thanks

Comment 24 Yaniv Kaul 2020-07-30 13:54:20 UTC
(In reply to Petr Balogh from comment #23)
> I have updated the question in this BZ: 
> https://bugzilla.redhat.com/show_bug.cgi?id=1856975#c8 if you can please
> answer there will be great. Thanks

Can we close this one?

Comment 25 Petr Balogh 2020-07-30 13:58:14 UTC
I think we can.  Thanks, Josh, for the answer in the other BZ.

