Bug 2209375 - [RHCS Tracker] After add capacity the rebalance does not complete, and we see 2 PGs in active+clean+scrubbing and 1 active+clean+scrubbing+deep
Summary: [RHCS Tracker] After add capacity the rebalance does not complete, and we see...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 6.1
Assignee: Aishwarya Mathuria
QA Contact: Pawan
URL:
Whiteboard:
Duplicates: 2210975 (view as bug list)
Depends On:
Blocks: 2208269
 
Reported: 2023-05-23 16:46 UTC by Neha Ojha
Modified: 2023-06-15 09:17 UTC
CC List: 18 users

Fixed In Version: ceph-17.2.6-67.el9cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2208269
Environment:
Last Closed: 2023-06-15 09:17:24 UTC
Embargoed:


Links
Red Hat Issue Tracker RHCEPH-6719 (last updated 2023-05-23 16:46:51 UTC)
Red Hat Product Errata RHSA-2023:3623 (last updated 2023-06-15 09:17:46 UTC)

Description Neha Ojha 2023-05-23 16:46:04 UTC
+++ This bug was initially created as a clone of Bug #2208269 +++

Description of problem (please be as detailed as possible and provide log
snippets):

After the add capacity test, we see that the Ceph cluster gets stuck in this state:

    pgs:     166 active+clean
             2   active+clean+scrubbing
             1   active+clean+scrubbing+deep
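
A quick, illustrative way to see which PGs are sitting in these states (the summary above appears to come from ceph -s) is to filter the brief PG dump:

ceph pg dump pgs_brief | grep scrubbing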


Version of all relevant components (if applicable):
I've noticed that it started happening with this build:
quay.io/rhceph-dev/ocs-registry:4.13.0-198
With
quay.io/rhceph-dev/ocs-registry:4.13.0-197 it worked well.

Starting with build -198, it's reproducible in every run.

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, it's blocking our testing framework, as the cluster is not in a good state.

Is there any workaround available to the best of your knowledge?
NO

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes, starting from build -198

Can this issue be reproduced from the UI?
not relevant

If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Perform add capacity
2. Rebalance does not finish within the timeout we used to have (see the illustrative check below)
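
As a rough illustration of the kind of wait involved (this is not the actual ocs-ci check; the ~30-minute budget and 30-second polling interval are assumptions), one could poll until ceph pg stat reports a single active+clean state:

# Illustrative sketch only: wait up to ~30 minutes for all PGs to be exactly active+clean
for i in $(seq 60); do
    ceph pg stat | grep -qE 'pgs: [0-9]+ active\+clean;' && break
    sleep 30
done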


Actual results:
PGs:
             2   active+clean+scrubbing
             1   active+clean+scrubbing+deep

Expected results:
All PGs are active+clean

Additional info:

--- Additional comment from Petr Balogh on 2023-05-18 12:31:19 UTC ---

Jenkins job:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-acceptance/426/

Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-426vu1cs33-a/j-426vu1cs33-a_20230518T071906/logs/failed_testcase_ocs_logs_1684397650/test_add_capacity_cli_ocs_logs/j-426vu1cs33-a/


Compared to the build -197 job, which worked:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-acceptance/427/
Must gather for the -197 build, which worked OK:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-427vu1cs33-a/j-427vu1cs33-a_20230518T084506/logs/testcases_1684402689/j-427vu1cs33-a/

--- Additional comment from RHEL Program Management on 2023-05-18 12:31:26 UTC ---

This bug previously had no release flag set; the release flag 'odf-4.13.0' has now been set to '?', so the bug is being proposed to be fixed in the ODF 4.13.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2023-05-18 12:31:26 UTC ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from Petr Balogh on 2023-05-18 12:38:02 UTC ---

Forgot to mention that we see this issue only on vSphere; the AWS acceptance suite is passing OK.

--- Additional comment from Neha Ojha on 2023-05-18 14:23:13 UTC ---

This could be related to https://bugzilla.redhat.com/show_bug.cgi?id=2163473 and if so, a change in the test to use high_recovery_ops, like we did for teuthology https://github.com/ceph/ceph/pull/51449, should be enough. Sridhar, can you please take a look to confirm?

--- Additional comment from Ilya Dryomov on 2023-05-18 14:49:43 UTC ---

(In reply to Neha Ojha from comment #5)
> This could be related to https://bugzilla.redhat.com/show_bug.cgi?id=2163473
> and if so, a change in the test to use high_recovery_ops, like we did for
> teuthology https://github.com/ceph/ceph/pull/51449, should be enough.

An argument can be made that QE setups should be as close to customer setups as possible. Once this acceptance test is taken care of, other yet-to-be-run tests might exhibit the same failure. This is up to the RADOS team of course, but I would advise downstream QE against switching to high_recovery_ops throughout, as was done for upstream teuthology, and instead consider adjusting test timeouts (filing separate BZs on a case-by-case basis if the necessary adjustment feels too excessive).

--- Additional comment from Sridhar Seshasayee on 2023-05-18 15:29:42 UTC ---

I looked at the available logs to confirm the scrub rates. Here's an example of the scrub rates from the mgr logs,

2023-05-18T10:20:38.083+0000 7ff213b3b640  0 log_channel(cluster) log [DBG] : pgmap v4391: 169 pgs: 1 active+clean+scrubbing, 1 active+clean+scrubbing+deep, 167 active+clean; 23 GiB data, 27 GiB used, 3.0 TiB / 3 TiB avail; 1023 B/s rd, 115 KiB/s wr, 4 op/s
2023-05-18T10:20:40.083+0000 7ff213b3b640  0 log_channel(cluster) log [DBG] : pgmap v4392: 169 pgs: 1 active+clean+scrubbing, 1 active+clean+scrubbing+deep, 167 active+clean; 23 GiB data, 27 GiB used, 3.0 TiB / 3 TiB avail; 1.6 KiB/s rd, 140 KiB/s wr, 4 op/s
2023-05-18T10:20:42.084+0000 7ff213b3b640  0 log_channel(cluster) log [DBG] : pgmap v4393: 169 pgs: 1 active+clean+scrubbing+deep, 1 active+clean+scrubbing, 167 active+clean; 23 GiB data, 27 GiB used, 3.0 TiB / 3 TiB avail; 1.9 KiB/s rd, 180 KiB/s wr, 6 op/s

The rate is very low and can be attributed to the current high cost setting for scrubs (osd_scrub_cost = 50 MiB).

As per Neha's update in comment #5, an interim solution would be to change the mClock profile (osd_mclock_profile) to 'high_recovery_ops'
and see if the scrub rate improves. This will confirm that the issue is indeed the high scrub cost that is throttling the scrub rate.
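
For reference, this interim change can be applied and verified with the standard ceph config commands (osd.0 below is just an example target to read the effective value from):

ceph config set osd osd_mclock_profile high_recovery_ops
ceph config show osd.0 osd_mclock_profile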

Another observation I wanted to highlight and confirm: are the underlying OSD devices of rotational type (HDD)? The OSD bench test during
OSD boot-up reports inflated IOPS capacity for HDDs, and we therefore set the IOPS capacity to a realistic default value of 315 IOPS.
For example, the osd.0 benchmark shown below reports 3677.781 random write IOPS (at a 4 KiB block size), which is not realistic for a rotational device.

If the default value is not accurate, then it may be changed accordingly.
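
If an override is warranted, it would follow the pattern named in the warning messages below; the 3000 IOPS value here is purely a placeholder and should come from an external benchmark such as fio (use the _ssd variant instead if the OSDs are registered as non-rotational):

ceph config set osd osd_mclock_max_capacity_iops_hdd 3000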

Here are some logs from a subset of the OSDs,

osd.0
-----
2023-05-18T08:01:53.188+0000 7ff8682d72c0  1 osd.0 0  bench count 12288000 bsize 4 KiB
2023-05-18T08:02:00.193+0000 7ff8682d72c0  1 osd.0 0 maybe_override_max_osd_capacity_for_qos osd bench result - bandwidth (MiB/sec): 14.366 iops: 3677.781 elapsed_sec: 0.816
2023-05-18T08:02:00.193+0000 7ff8682d72c0  0 log_channel(cluster) log [WRN] : OSD bench result of 3677.781217 IOPS exceeded the threshold limit of 500.000000 IOPS for osd.0. IOPS capacity is unchanged at 315.000000 IOPS. The recommendation is to establish the osd's IOPS capacity using other benchmark tools (e.g. Fio) and then override osd_mclock_max_capacity_iops_[hdd|ssd].

osd.1
------
2023-05-18T08:01:56.278+0000 7f91f70e12c0  1 osd.1 0  bench count 12288000 bsize 4 KiB
2023-05-18T08:02:07.076+0000 7f91f70e12c0  1 osd.1 0 maybe_override_max_osd_capacity_for_qos osd bench result - bandwidth (MiB/sec): 13.003 iops: 3328.798 elapsed_sec: 0.901
2023-05-18T08:02:07.076+0000 7f91f70e12c0  0 log_channel(cluster) log [WRN] : OSD bench result of 3328.797968 IOPS exceeded the threshold limit of 500.000000 IOPS for osd.1. IOPS capacity is unchanged at 315.000000 IOPS. The recommendation is to establish the osd's IOPS capacity using other benchmark tools (e.g. Fio) and then override osd_mclock_max_capacity_iops_[hdd|ssd].

osd.3
-----
2023-05-18T09:15:47.556+0000 7ffa56fe32c0  1 osd.3 0  bench count 12288000 bsize 4 KiB
2023-05-18T09:15:53.902+0000 7ffa56fe32c0  1 osd.3 0 maybe_override_max_osd_capacity_for_qos osd bench result - bandwidth (MiB/sec): 18.204 iops: 4660.125 elapsed_sec: 0.644
2023-05-18T09:15:53.902+0000 7ffa56fe32c0  0 log_channel(cluster) log [WRN] : OSD bench result of 4660.124803 IOPS exceeded the threshold limit of 500.000000 IOPS for osd.3. IOPS capacity is unchanged at 315.000000 IOPS. The recommendation is to establish the osd's IOPS capacity using other benchmark tools (e.g. Fio) and then override osd_mclock_max_capacity_iops_[hdd|ssd].

--- Additional comment from Petr Balogh on 2023-05-22 07:48:33 UTC ---

We are using vSAN storage in vSphere, which consists of NVMe disks.

Is this something that should be changed in the product, or should we do some extra configuration in our QE setup?

There is also background IO running while all the test cases are running.

Please let us know if we need to change something or if it has to be changed in ODF.

And if we need to change something, where and how can we do that?

Thanks

--- Additional comment from Aishwarya Mathuria on 2023-05-22 08:24:55 UTC ---

Hello Petr,
We have been working on a fix, currently being tested and reviewed, that reduces osd_scrub_cost in order to speed up scrubs with mClock.
The new scrub cost would be 102400 (osd_scrub_chunk_max (25) * 4 KiB = 25 * 4096 bytes).

You can change osd_scrub_cost using the following command:
ceph config set osd 102400

To check osd_scrub_cost after modifying it: 
ceph config show osd.1 osd_scrub_cost

Regards,
Aishwarya

--- Additional comment from Petr Balogh on 2023-05-22 13:48:32 UTC ---

I will try to verify this workaround in this execution:

https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-acceptance/431/

I am going to connect to the cluster once it's deployed, re-run the add capacity test after applying the mentioned Ceph configuration change, and will get back to you.

I'm not sure if the fix is going to be added in Ceph or if it's going to be fixed in ODF.

--- Additional comment from Petr Balogh on 2023-05-22 15:18:43 UTC ---

bash-5.1$ ceph config set osd 102400
Invalid command: missing required parameter value(<string>)
config set <who> <name> <value> [--force] :  Set a configuration option for one or more entities
Error EINVAL: invalid command

--- Additional comment from Aishwarya Mathuria on 2023-05-22 16:13:04 UTC ---

Sorry about that! 

ceph config set osd osd_scrub_cost 102400

This should work fine.

--- Additional comment from Petr Balogh on 2023-05-22 18:02:13 UTC ---

https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/7921/testReport/

Looks like after this update the add capacity test passed OK.
Must gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-431vu1cs33-a/j-431vu1cs33-a_20230522T135058/logs/testcases_1684770123/j-431vu1cs33-a/

--- Additional comment from Mudit Agarwal on 2023-05-23 05:51:39 UTC ---

Aishwarya, do we have a Ceph BZ or tracker where the fix you are working on can be tracked by ODF?

--- Additional comment from RHEL Program Management on 2023-05-23 05:51:49 UTC ---

This BZ is being approved for the ODF 4.13.0 release, upon receipt of the 3 ACKs (PM, Devel, QA) for the release flag 'odf-4.13.0'.

--- Additional comment from RHEL Program Management on 2023-05-23 05:51:49 UTC ---

Since this bug has been approved for the ODF 4.13.0 release, through release flag 'odf-4.13.0+', the Target Release is being set to 'ODF 4.13.0'.

--- Additional comment from Aishwarya Mathuria on 2023-05-23 06:00:01 UTC ---

Hi Mudit, 
The fix is being tracked here: https://tracker.ceph.com/issues/61313; it is currently under review.

Comment 2 Aishwarya Mathuria 2023-05-24 09:52:06 UTC
PR currently under review: https://github.com/ceph/ceph/pull/51728

Comment 6 Pawan 2023-06-01 10:48:11 UTC
*** Bug 2210975 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2023-06-15 09:17:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3623

