Bug 2129456 - Unable to Expand Storage Cannot Add New OSDs to Existing Hosts/Racks (Local Storage)
Summary: Unable to Expand Storage Cannot Add New OSDs to Existing Hosts/Racks (Local Storage)
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Neha Ojha
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-09-23 20:20 UTC by Craig Wayman
Modified: 2023-08-09 16:37 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-04-05 19:07:24 UTC
Embargoed:



Description Craig Wayman 2022-09-23 20:20:17 UTC
Description of problem (please be detailed as possible and provide log
snippets):

  The customer opened the case due to latency associated with Ceph. This cluster uses local storage (LSO). Review of the must-gather showed that osd.22 was nearfull, along with the two biggest pools. The customer decided to expand storage and added three new NVMe disks to three pre-existing hosts. From what the customer provided, they appear to have performed the process correctly: the three rook-ceph-osd-prepare-ocs-deviceset jobs/pods completed successfully and the PVs/PVCs were created. However, the rook-ceph operator did not pick up the three OSDs (osd.27, osd.28, and osd.29). When the customer scaled down the ocs-operator and rook-ceph operator, deleted ocsinit, and scaled up just the ocs-operator, osd.29 was added to rack0. Unfortunately, osd.27 and osd.28 were not picked up and added to the existing hosts/racks.

  The storagecluster.yaml reflects the correct count of 10 OSDs per host (previously the count was 9). osd.29 has a deployment and a pod; osd.27 and osd.28 have neither a deployment nor a pod.
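
  (The device-set count can be checked directly on the StorageCluster CR; a hypothetical check, assuming the default resource name ocs-storagecluster and the openshift-storage namespace:)

$ oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{range .spec.storageDeviceSets[*]}{.name}{": count="}{.count}{"\n"}{end}'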

  When looking at the CRUSH map, osd.29 is populated with all the information it needs; osd.27 and osd.28 do appear, but only as empty entries:

        },
        {
            "id": 27
        },
        {
            "id": 28
        },
        {
            "id": 29,
            "arch": "x86_64",
            "back_addr": "[v2:10.224.18.27:6802/406,v1:10.224.18.27:6803/406]",
            "back_iface": "",
            "bluefs": "1",
            "bluefs_dedicated_db": "0",.......and so on...........omitted.........
        }


Version of all relevant components (if applicable):

ODF (CSV):

NAME                                     DISPLAY                                          VERSION    REPLACES                                 PHASE
container-security-operator.v3.7.6       Red Hat Quay Container Security Operator         3.7.6      container-security-operator.v3.7.5       Succeeded
elasticsearch-operator.5.5.1             OpenShift Elasticsearch Operator                 5.5.1      elasticsearch-operator.5.5.0             Succeeded
jaeger-operator.v1.36.0-2                Red Hat OpenShift distributed tracing platform   1.36.0-2   jaeger-operator.v1.34.1-5                Succeeded
kiali-operator.v1.48.2                   Kiali Operator                                   1.48.2     kiali-operator.v1.48.1                   Succeeded
kubernetes-imagepuller-operator.v1.0.1   Kubernetes Image Puller Operator                 1.0.1      kubernetes-imagepuller-operator.v1.0.0   Installing
mcg-operator.v4.10.5                     NooBaa Operator                                  4.10.5     mcg-operator.v4.9.10                     Succeeded
ocs-operator.v4.10.5                     OpenShift Container Storage                      4.10.5     ocs-operator.v4.9.10                     Succeeded
odf-csi-addons-operator.v4.10.5          CSI Addons                                       4.10.5     odf-csi-addons-operator.v4.10.4          Succeeded
odf-operator.v4.10.5                     OpenShift Data Foundation                        4.10.5     odf-operator.v4.9.10                     Succeeded
serverless-operator.v1.7.2               OpenShift Serverless Operator                    1.7.2      serverless-operator.v1.7.1               Installing
servicemeshoperator.v2.2.1               Red Hat OpenShift Service Mesh                   2.2.1-0    servicemeshoperator.v2.1.3               Succeeded

Cluster Version (OCP):
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.28   True        False         8d      Cluster version is 4.10.28



Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

  Yes, the Production Cluster is now Nearfull

Is there any workaround available to the best of your knowledge?

  The workaround we were able to implement was to: 1. scale the ocs-operator and rook-ceph-operator deployments to --replicas=0; 2. delete ocsinitialization/ocsinit; and 3. scale up just the ocs-operator. With that process, rook-ceph was able to add osd.29; however, even after repeating that process, osd.27 and osd.28 could not be added.
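
  A minimal sketch of that workaround sequence, assuming the default openshift-storage namespace and the standard ocs-operator/rook-ceph-operator deployment names:

$ oc scale deployment ocs-operator rook-ceph-operator -n openshift-storage --replicas=0
$ oc delete ocsinitialization ocsinit -n openshift-storage
$ oc scale deployment ocs-operator -n openshift-storage --replicas=1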


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

N/A

Is this issue reproducible?

Not in a testing environment.

Can this issue reproduce from the UI?
No

Additional info:

  It may be worth mentioning that this is a very busy/demanding cluster. We've seen the odf-operator get OOMKilled even after increasing its memory limits/requests to 700Mi; since then they've increased it to 900Mi. We're also constantly seeing two noobaa-endpoint pods deployed, which indicates high workload/traffic.

Comment 3 Santosh Pillai 2022-09-26 14:12:36 UTC
Looked at the must-gather logs. There is no mention of OSD 27 or OSD 28 anywhere in them, so they look like stale/leftover OSDs. If no data is associated with these OSDs, I would suggest purging them.

@tnielsen Any different suggestion?

Comment 4 Travis Nielsen 2022-09-26 17:15:34 UTC
Some log entries of interest in the rook operator log. These must be related to OSD 27 and 28, showing that the OSD prepare jobs had been run previously and are now running again and completing. What they don't show is why the OSD daemon pods aren't being created.

2022-09-22T21:07:34.464893032Z 2022-09-22 21:07:34.464803 I | op-osd: OSD will have its main bluestore block on "ocs-deviceset-0-data-9z89m4"
2022-09-22T21:07:34.487802419Z 2022-09-22 21:07:34.485598 I | op-k8sutil: Removing previous job rook-ceph-osd-prepare-ocs-deviceset-0-data-9z89m4 to start a new one
2022-09-22T21:07:34.499272039Z 2022-09-22 21:07:34.499071 I | op-k8sutil: batch job rook-ceph-osd-prepare-ocs-deviceset-0-data-9z89m4 still exists
2022-09-22T21:07:37.697943877Z 2022-09-22 21:07:37.695202 I | op-osd: started OSD provisioning job for PVC "ocs-deviceset-0-data-9z89m4"
...
2022-09-22T21:07:37.697943877Z 2022-09-22 21:07:37.695307 I | op-osd: OSD will have its main bluestore block on "ocs-deviceset-1-data-9zqpmn"
2022-09-22T21:07:37.903936821Z 2022-09-22 21:07:37.889714 I | op-k8sutil: Removing previous job rook-ceph-osd-prepare-ocs-deviceset-1-data-9zqpmn to start a new one
2022-09-22T21:07:37.916934008Z 2022-09-22 21:07:37.912401 I | op-k8sutil: batch job rook-ceph-osd-prepare-ocs-deviceset-1-data-9zqpmn still exists
2022-09-22T21:07:40.975107008Z 2022-09-22 21:07:40.946586 I | op-k8sutil: batch job rook-ceph-osd-prepare-ocs-deviceset-1-data-9zqpmn deleted
2022-09-22T21:07:41.041076830Z 2022-09-22 21:07:41.039806 I | op-osd: started OSD provisioning job for PVC "ocs-deviceset-1-data-9zqpmn"
...
2022-09-22T21:07:46.815205610Z 2022-09-22 21:07:46.815058 I | op-osd: OSD orchestration status for node ocs-deviceset-0-data-9z89m4 is "orchestrating"
2022-09-22T21:07:46.815619258Z 2022-09-22 21:07:46.815578 I | op-osd: OSD orchestration status for PVC ocs-deviceset-0-data-9z89m4 is "orchestrating"
2022-09-22T21:07:46.815963586Z 2022-09-22 21:07:46.815941 I | op-osd: OSD orchestration status for PVC ocs-deviceset-0-data-9z89m4 is "completed"
2022-09-22T21:07:46.824453895Z 2022-09-22 21:07:46.824370 I | op-osd: OSD orchestration status for node ocs-deviceset-1-data-9zqpmn is "orchestrating"
2022-09-22T21:07:46.824781145Z 2022-09-22 21:07:46.824760 I | op-osd: OSD orchestration status for PVC ocs-deviceset-1-data-9zqpmn is "orchestrating"
...
2022-09-22T21:07:53.087939312Z 2022-09-22 21:07:53.086125 I | op-osd: OSD orchestration status for PVC ocs-deviceset-1-data-9zqpmn is "completed"


The OSD prepare log for pod rook-ceph-osd-prepare-ocs-deviceset-1-data-9zqpmn-dlpj7 shows that it can't determine the status of the LUKS device:

2022-09-22T21:07:44.972667064Z 2022-09-22 21:07:44.972595 I | cephosd: creating and starting the osds
2022-09-22T21:07:44.972760954Z 2022-09-22 21:07:44.972750 D | cephosd: desiredDevices are [{Name:/mnt/ocs-deviceset-1-data-9zqpmn OSDsPerDevice:1 MetadataDevice: DatabaseSizeMB:0 DeviceClass: InitialWeight: IsFilter:false IsDevicePathFilter:false}]
2022-09-22T21:07:44.972779505Z 2022-09-22 21:07:44.972772 D | cephosd: context.Devices are:
2022-09-22T21:07:44.972821487Z 2022-09-22 21:07:44.972811 D | cephosd: &{Name:/mnt/ocs-deviceset-1-data-9zqpmn Parent: HasChildren:false DevLinks:/dev/disk/by-id/scsi-36000c299c61ea642991160eeb1090604 /dev/disk/by-id/scsi-SVMware_Virtual_disk_6000c299c61ea642991160eeb1090604 /dev/disk/by-id/wwn-0x6000c299c61ea642991160eeb1090604 /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:11:0 Size:2199023255552 UUID:8a84ff25-21ae-4017-81b5-8773b4580ca3 Serial:36000c299c61ea642991160eeb1090604 Type:data Rotational:false Readonly:false Partitions:[] Filesystem: Vendor:VMware Model:Virtual_disk WWN:0x6000c299c61ea642 WWNVendorExtension:0x6000c299c61ea642991160eeb1090604 Empty:false CephVolumeData: RealPath:/dev/sdk KernelName:sdk Encrypted:false}
2022-09-22T21:07:44.972871730Z 2022-09-22 21:07:44.972840 D | exec: Running command: cryptsetup luksDump /mnt/ocs-deviceset-1-data-9zqpmn
2022-09-22T21:07:44.985602529Z 2022-09-22 21:07:44.985543 E | cephosd: failed to determine if the encrypted block "/mnt/ocs-deviceset-1-data-9zqpmn" is from our cluster. failed to dump LUKS header for disk "/mnt/ocs-deviceset-1-data-9zqpmn". Device /mnt/ocs-deviceset-1-data-9zqpmn is not a valid LUKS device.: exit status 1
2022-09-22T21:07:44.985866773Z 2022-09-22 21:07:44.985851 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/ocs-deviceset-1-data-9zqpmn --format json
2022-09-22T21:07:45.372591165Z 2022-09-22 21:07:45.372408 D | cephosd: {
2022-09-22T21:07:45.372591165Z     "7739597f-b777-4620-a86c-2ba1d95c708a": {
2022-09-22T21:07:45.372591165Z         "ceph_fsid": "3bc51fe9-c37e-4fd3-8599-bf49a5012407",
2022-09-22T21:07:45.372591165Z         "device": "/mnt/ocs-deviceset-1-data-9zqpmn",
2022-09-22T21:07:45.372591165Z         "osd_id": 27,
2022-09-22T21:07:45.372591165Z         "osd_uuid": "7739597f-b777-4620-a86c-2ba1d95c708a",
2022-09-22T21:07:45.372591165Z         "type": "bluestore"
2022-09-22T21:07:45.372591165Z     }
2022-09-22T21:07:45.372591165Z }


The OSD prepare job for OSD 28 shows a similar error in the log of pod rook-ceph-osd-prepare-ocs-deviceset-0-data-9z89m4-qtnnl.

The logs from the original provisioning of OSDs 27 and 28 are no longer available, so the original OSD creation may have failed with a different error. We cannot determine the original cause without those OSD prepare logs.

To get these OSDs created, I would also suggest purging these two OSDs (as Santosh mentioned) and wiping the disks, then trying again to create them. If this happens again after the purge and re-creation, please share the OSD prepare logs from the failure before restarting the operator and losing them.
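
For reference, a rough sketch of the purge step from the rook-ceph-tools pod (the disk wipe on the node, e.g. with wipefs or sgdisk, and the PV/PVC cleanup are environment-specific and not shown here):

$ oc rsh -n openshift-storage deploy/rook-ceph-tools
sh-4.4$ ceph osd purge 27 --yes-i-really-mean-it
sh-4.4$ ceph osd purge 28 --yes-i-really-mean-it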

Comment 8 Craig Wayman 2022-10-04 21:43:03 UTC
Good Evening, 

  When the customer followed the latest BZ recommendation to remove OSD 27 and 28, zap/wipe the disks, and re-add them, they were unsuccessful and hit the following error:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[g4yy@edcloscd101 ~ ]$ echo $osd_id_to_remove
27
[g4yy@edcloscd101 ~ ]$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
error: unknown parameter name "FORCE_OSD_REMOVAL"
error: no objects passed to create
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

  An interesting find came from the # oc get templates ocs-osd-removal -o yaml output the customer provided. I vimdiff'd it against the output from my test cluster (same version), and the customer's template was missing the following arguments:

 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      spec:
        containers:
        - args:
          - ceph
          - osd
          - remove
          - --osd-ids=${FAILED_OSD_IDS}
          - --force-osd-removal  <------------MISSING
          - ${FORCE_OSD_REMOVAL} <------------MISSING
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

  Additionally, there were quite a few differences in the description at the end of the # oc get templates ocs-osd-removal -o yaml output. I will attach the latest must-gather along with the template outputs from my test cluster and from the customer in a private comment following this one.
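
  (As a hypothetical check, the template's declared parameters can be listed on both clusters and compared, assuming the default openshift-storage namespace:)

$ oc process -n openshift-storage ocs-osd-removal --parameters
$ oc get template ocs-osd-removal -n openshift-storage -o yaml | grep -A1 -- '--osd-ids'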

  Looking for further direction from Engineering, because osd.27 and osd.28 were originally OSDs deployed to scale/add storage capacity, and that operation was unsuccessful. Now, following section 5.1, "Replacing operational or failed storage devices on clusters backed by local storage devices," in the product documentation is not working either. Should we have the customer delete the PVs/PVCs associated with osd.27 and osd.28, then go into the rook-ceph-tools pod and remove the OSDs manually from the rack/host and the CRUSH map? Thank you for your time.


Regards,


Craig Wayman
TSE Red Hat OpenShift Data Foundations (ODF) 
Customer Experience and Engagement, NA

Comment 10 Santosh Pillai 2022-10-07 04:53:00 UTC
(In reply to Craig Wayman from comment #8)
> Good Evening, 
> 
>   As the customer proceeded to follow the latest BZ recommendation to remove
> OSD 27 and 28, zap/wipe, and re-add them back, they were unsuccessful and
> met with the following error:
> 
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> --------------------------
> [g4yy@edcloscd101 ~ ]$ echo $osd_id_to_remove
> 27
> [g4yy@edcloscd101 ~ ]$ oc process -n openshift-storage ocs-osd-removal -p
> FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=false |oc create -n
> openshift-storage -f -
> error: unknown parameter name "FORCE_OSD_REMOVAL"
> error: no objects passed to create
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> --------------------------
> 
>   An interesting find was from the # oc get templates ocs-osd-removal -o
> yaml output the customer provided. I vimdiff'd against the output of my test
> cluster (same version) and the customer's template was missing the following
> information:
> 
>  
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> --------------------------
>       spec:
>         containers:
>         - args:
>           - ceph
>           - osd
>           - remove
>           - --osd-ids=${FAILED_OSD_IDS}
>           - --force-osd-removal  <------------MISSING
>           - ${FORCE_OSD_REMOVAL} <------------MISSING
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> --------------------------
> 
> 
That's strange. OCS 4.10 has this `FORCE_OSD_REMOVAL` parameter, but somehow the customer is not seeing it in the template. 
I'm assuming the OCS operator is not updating this template when the cluster is upgraded. I'll take a look into this.

Comment 11 Santosh Pillai 2022-10-07 04:57:30 UTC
(In reply to Craig Wayman from comment #8)
>   Looking for further direction from Engineering, however, because osd.27
> and osd.28 were originally planned as OSDs deployed to scale/add storage
> capacity that was unsuccessful. 

Is the operation unsuccessful because the `ocs-osd-removal` job fails to start with `error: unknown parameter name "FORCE_OSD_REMOVAL"`?
If that's the case, can you try without passing this argument? For example:
 oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} |oc create -n openshift-storage -f -

> Then following the 5.1. Replacing
> operational or failed storage devices on clusters backed by local storage
> devices product documentation is not working. Should we have the customer
> delete PVs/PVCs associated with osd.27 and osd.28 followed by going into the
> rook-ceph-tools pod and removing the OSDs manually from rack/host and crush
> map? Thank you for your time.

Comment 12 Craig Wayman 2022-10-07 13:32:33 UTC
Good Morning, 

  I am unsure why the removal operation is failing. After seeing the customer's error messages regarding those parameters, I compared their OSD-removal template output against the template from my test cluster and observed that those parameters were missing. The OSDs were in an awkward state to begin with: they were picked up by LSO, PVs/PVCs were created, and the count in the storagecluster.yaml is spot on, yet Rook did not assign them to a rack. On the Ceph side they show up in # ceph osd df as down, and they appear in the CRUSH map, but with no data between the curly braces. The other odd observation is that osd.29 was in the exact same predicament during the initial deployment of the three new disks; however, when the customer was given the process to scale down the operators, delete ocsinit, and scale the ocs-operator back up, only osd.29 was deployed successfully, and osd.27 and osd.28 were not picked up by Rook.

  I will have the customer attempt the removal again using your command, which leaves out that parameter, and update the case with the result. Thank you for your help.

Regards,

Craig Wayman
TSE Red Hat OpenShift Data Foundations (ODF) 
Customer Experience and Engagement, NA

Comment 14 Craig Wayman 2022-10-11 19:19:53 UTC
Good Afternoon, 

  I would like to post an update. The customer successfully removed and re-added the devices (osd.27 and osd.28) by leaving out the FORCE_OSD_REMOVAL parameter that appears in the v4.10 product documentation.

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -   <------------- Failed to remove devices

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} |oc create -n openshift-storage -f -   <------------- Successfully removed devices

  The customer is now in HEALTH_OK and has satisfied the count of "10" in their storagecluster.yaml, yielding a total of 30 OSDs, 10 per ODF node/rack. 

  I also figured out why the first command (the one that failed to remove the devices) didn't work. That command is from the v4.10 product documentation. The customer is on ODF v4.10, but this cluster has been upgraded over time all the way from v4.4, and their storagecluster version is still v4.6.0. The oc process command processes the osd-removal template to run the OSD removal job, and the customer's template is a by-product of their v4.6.0 storagecluster version. I've noticed that during most upgrades, the majority of the OCS/ODF components are upgraded except the storagecluster version.
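
  (As an illustrative check of this, assuming the default openshift-storage namespace, the storagecluster version can be read directly:)

$ oc get storagecluster -n openshift-storage
$ oc get storagecluster -n openshift-storage -o yaml | grep -i version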

  If you review the RHOCS v4.6 Replacing Devices product documentation, specifically section 4.3, "Replacing operational or failed storage devices on clusters backed by local storage devices," step 4-iii, there is no FORCE_OSD_REMOVAL parameter. The command below is the same command the customer ran to finally remove the devices. Because my v4.10 cluster was a freshly deployed v4.10 test cluster, that parameter was in fact in my template but not in the customer's.

Command Below is from v4.6 Product Documentation
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} |oc create -n openshift-storage -f -

  As of right now, we're still finishing up a few things with the customer, and once completed I will update the BZ for closure. 

Regards,


Craig Wayman
TSE Red Hat OpenShift Data Foundations (ODF) 
Customer Experience and Engagement, NA

Comment 15 Travis Nielsen 2022-10-17 15:07:03 UTC
Good to hear it's in a better state. Any update on whether we're ready to close this?

Comment 16 Craig Wayman 2022-10-17 16:28:44 UTC
Good Morning, 

  As of now, the devices were all successfully removed and added back. Their Ceph backend looks very good with respect to the storagecluster.yaml count and how the OSDs are distributed across the racks/hosts. However, the PGs on their two biggest pools and on the OSDs are very low, even with the autoscaler set to "on." Currently I am working with them to increase the PGs manually. Right now one OSD has exceeded the 75% threshold and put them in HEALTH_WARN, even though other OSDs are sitting at around 50% use. We could re-weight a couple of OSDs; however, given how low the PGs were, the way forward is to increase them. Increasing the PGs, followed by a Ceph balancer run, should spread the data out a little better. This BZ was opened because we were having issues removing/adding storage devices; as of now that is not the issue, so you can close the BZ. If the same issue arises, I will update the BZ.


Regards,

Craig Wayman
TSE Red Hat OpenShift Data Foundations (ODF) 
Customer Experience and Engagement, NA

Comment 17 Travis Nielsen 2022-10-17 16:57:40 UTC
Good to hear the issue is resolved for removing OSDs, thanks for the details. We will close this BZ for now.

Comment 18 Craig Wayman 2022-10-25 21:12:44 UTC
Good Afternoon,

  I decided to re-open the BZ since this is still related to the process of adding more OSDs. When the customer opened the case, it was because one OSD (osd.22) crossed the ODF/Ceph nearfull threshold of 75% and triggered HEALTH_WARN. The customer added three 2 TiB disks, bringing Ceph back to HEALTH_OK, but only for a short while. I know they could continue to add disks; however, this is an issue I've seen in ODF before, where rook-ceph doesn't autoscale PGs even with the autoscaler set to "on" and the PGs on the OSDs are noticeably low. As of now most of the OSDs are hovering around 50-60% use, with just a few OSDs at around 70% use, and osd.22 is one that has exceeded 75%. Looking at the PGs on the OSDs, they're pretty low as well (50-60 PGs per OSD). The customer's concern is: why isn't ODF autoscaling the PGs with the autoscaler set to "on"?

  As a proactive step, having observed that the OSD PGs were low, I wanted the customer to increase not only pg_num but also pgp_num on the two biggest pools, both to increase the PGs on the OSDs and to effectively perform a kind of re-weight, since increasing pgp_num splits the PGs and sends them to other OSDs. The odd part is that although the customer increased pg_num and pgp_num on the cephblockpool to 512, the PGs on the OSDs did not increase; since then the autoscaler must have scaled it back down to 256 PGs. This could be an issue specific to this cluster, which has been upgraded all the way from OCS v4.4, because I've performed this process on test clusters previously and was able to scale PGs on both pools and OSDs.
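
  (For reference, a sketch of that manual increase from the rook-ceph-tools pod, using the pool name and value mentioned above; these are standard Ceph commands shown for illustration:)

sh-4.4$ ceph osd pool set ocs-storagecluster-cephblockpool pg_num 512
sh-4.4$ ceph osd pool set ocs-storagecluster-cephblockpool pgp_num 512
sh-4.4$ ceph osd pool autoscale-status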

  I know we can re-weight osd.22 or any of the OSDs above 70%, but there are a couple of issues. First, with ODF v4.10, I would assume that once those three disks were added, Ceph remapped, and the balancer ran, the PGs should have been spread out/increased. It's just odd that with OSDs at around 50-60 PGs and pool/OSD utilization this high, the PGs weren't autoscaled. I know this likely has a lot to do with the TARGET_RATIO set on the pools, which is 0.49 on the two biggest pools. Should this be changed? I would like to address the customer's concerns, listed below:

1. The PGs on the OSDs look very low, how do we increase them? Shouldn't rook-ceph do it automatically?

2. Should we adjust the TARGET_RATIO to accomplish this? If so, what values should they be set to and on which pools?

  I have pulled some data from the must-gather and pasted it below. For more information and logs, the recent must-gather can be found here: https://drive.google.com/drive/folders/1uFAzCAMIt7fzJ5geqFnNfv24vOP8gtbg?usp=sharing

  Thank you for your time and help!
 

--------------ceph osd df tree-----------
ID   CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL     %USE   VAR   PGS  STATUS  TYPE NAME                           
 -1         59.99799         -   60 TiB   37 TiB   37 TiB  7.4 GiB  112 GiB    23 TiB  61.38  1.00    -          root default                        
 -8         20.00000         -   20 TiB   12 TiB   12 TiB  2.6 GiB   38 GiB   7.7 TiB  61.37  1.00    -              rack rack0                      
 -7         20.00000         -   20 TiB   12 TiB   12 TiB  2.6 GiB   38 GiB   7.7 TiB  61.37  1.00    -                  host edcxosid001g-bcbsfl-com
  1    ssd   2.00000   1.00000    2 TiB  1.3 TiB  1.3 TiB  978 MiB  4.3 GiB   696 GiB  66.02  1.08   67      up              osd.1                   
  4    ssd   2.00000   1.00000    2 TiB  1.2 TiB  1.2 TiB   14 MiB  3.5 GiB   798 GiB  61.02  0.99   61      up              osd.4                   
  7    ssd   2.00000   1.00000    2 TiB  1.1 TiB  1.1 TiB  596 MiB  3.9 GiB   896 GiB  56.24  0.92   58      up              osd.7                   
 11    ssd   2.00000   1.00000    2 TiB  1.2 TiB  1.2 TiB   69 MiB  3.7 GiB   842 GiB  58.87  0.96   59      up              osd.11                  
 13    ssd   2.00000   1.00000    2 TiB  1.4 TiB  1.4 TiB   32 MiB  4.0 GiB   606 GiB  70.43  1.15   60      up              osd.13                  
 16    ssd   2.00000   1.00000    2 TiB  1.2 TiB  1.2 TiB   32 MiB  3.5 GiB   791 GiB  61.38  1.00   61      up              osd.16                  
 20    ssd   2.00000   1.00000    2 TiB  1.3 TiB  1.3 TiB  350 MiB  3.8 GiB   736 GiB  64.07  1.04   62      up              osd.20                  
 23    ssd   2.00000   1.00000    2 TiB  1.3 TiB  1.3 TiB      0 B  3.8 GiB   734 GiB  64.18  1.05   63      up              osd.23                  
 24    ssd   2.00000   1.00000    2 TiB  1.1 TiB  1.1 TiB  288 MiB  3.6 GiB   942 GiB  54.01  0.88   55      up              osd.24                  
 29    ssd   2.00000   1.00000    2 TiB  1.1 TiB  1.1 TiB  281 MiB  3.7 GiB   871 GiB  57.49  0.94   55      up              osd.29                  
-12         19.99899         -   20 TiB   12 TiB   12 TiB  2.2 GiB   38 GiB   7.7 TiB  61.38  1.00    -              rack rack1                      
-11         19.99899         -   20 TiB   12 TiB   12 TiB  2.2 GiB   38 GiB   7.7 TiB  61.38  1.00    -                  host edcxosid001f-bcbsfl-com
  2    ssd   1.99899   1.00000  2.0 TiB  1.2 TiB  1.2 TiB  494 MiB  3.7 GiB   814 GiB  60.24  0.98   65      up              osd.2                   
  3    ssd   2.00000   1.00000  2.0 TiB  1.4 TiB  1.4 TiB      0 B  4.0 GiB   657 GiB  67.94  1.11   58      up              osd.3                   
  6    ssd   2.00000   1.00000  2.0 TiB  1.2 TiB  1.2 TiB  730 MiB  4.3 GiB   769 GiB  62.43  1.02   67      up              osd.6                   
  9    ssd   2.00000   1.00000    2 TiB  1.0 TiB  1.0 TiB  303 MiB  3.4 GiB  1013 GiB  50.54  0.82   54      up              osd.9                   
 12    ssd   2.00000   1.00000    2 TiB  1.1 TiB  1.1 TiB      0 B  2.9 GiB   933 GiB  54.43  0.89   57      up              osd.12                  
 15    ssd   2.00000   1.00000    2 TiB  1.0 TiB  1.0 TiB  240 MiB  3.3 GiB  1012 GiB  50.59  0.82   55      up              osd.15                  
 18    ssd   2.00000   1.00000    2 TiB  1.4 TiB  1.4 TiB   48 MiB  4.1 GiB   651 GiB  68.24  1.11   63      up              osd.18                  
 21    ssd   2.00000   1.00000    2 TiB  1.4 TiB  1.4 TiB  311 MiB  4.4 GiB   615 GiB  69.98  1.14   60      up              osd.21                  
 26    ssd   2.00000   1.00000    2 TiB  1.4 TiB  1.4 TiB   63 MiB  3.9 GiB   586 GiB  71.39  1.16   63      up              osd.26                  
 27    ssd   2.00000   1.00000    2 TiB  1.2 TiB  1.2 TiB   98 MiB  3.8 GiB   859 GiB  58.06  0.95   59      up              osd.27                  
 -4         19.99899         -   20 TiB   12 TiB   12 TiB  2.6 GiB   36 GiB   7.7 TiB  61.38  1.00    -              rack rack2                      
 -3         19.99899         -   20 TiB   12 TiB   12 TiB  2.6 GiB   36 GiB   7.7 TiB  61.38  1.00    -                  host edcxosid001e-bcbsfl-com
  0    ssd   1.99899   1.00000  2.0 TiB  1.0 TiB  1.0 TiB   80 MiB  3.2 GiB   990 GiB  51.64  0.84   56      up              osd.0                   
  5    ssd   2.00000   1.00000  2.0 TiB  1.4 TiB  1.4 TiB      0 B  3.9 GiB   650 GiB  68.27  1.11   61      up              osd.5                   
  8    ssd   2.00000   1.00000  2.0 TiB  1.2 TiB  1.2 TiB  319 MiB  3.7 GiB   827 GiB  59.61  0.97   57      up              osd.8                   
 10    ssd   2.00000   1.00000    2 TiB  1.2 TiB  1.1 TiB  271 MiB  3.6 GiB   869 GiB  57.59  0.94   57      up              osd.10                  
 14    ssd   2.00000   1.00000    2 TiB  1.3 TiB  1.2 TiB  275 MiB  3.9 GiB   765 GiB  62.66  1.02   65      up              osd.14                  
 17    ssd   2.00000   1.00000    2 TiB  1.3 TiB  1.3 TiB      0 B  3.8 GiB   696 GiB  66.00  1.08   61      up              osd.17                  
 19    ssd   2.00000   1.00000    2 TiB  1.3 TiB  1.3 TiB   28 MiB  3.6 GiB   741 GiB  63.83  1.04   59      up              osd.19                  
 22    ssd   2.00000   1.00000    2 TiB  1.5 TiB  1.5 TiB  620 MiB  4.7 GiB   487 GiB  76.22  1.24   72      up              osd.22                  
 25    ssd   2.00000   1.00000    2 TiB  1.0 TiB  1.0 TiB   63 MiB  2.2 GiB  1008 GiB  50.80  0.83   53      up              osd.25                  
 28    ssd   2.00000   1.00000    2 TiB  1.1 TiB  1.1 TiB  961 MiB  3.7 GiB   878 GiB  57.15  0.93   60      up              osd.28                  
                         TOTAL   60 TiB   37 TiB   37 TiB  7.4 GiB  112 GiB    23 TiB  61.38                                                         
MIN/MAX VAR: 0.82/1.24  STDDEV: 6.63

-------------------------------------------------------------

-------------------ceph df detail-----------------------------
--- RAW STORAGE ---
CLASS    SIZE   AVAIL    USED  RAW USED  %RAW USED
ssd    60 TiB  23 TiB  37 TiB    37 TiB      61.38
TOTAL  60 TiB  23 TiB  37 TiB    37 TiB      61.38
 
--- POOLS ---
POOL                                                   ID  PGS   STORED   (DATA)   (OMAP)  OBJECTS     USED   (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
.rgw.root                                               1    8  4.7 KiB  4.7 KiB      0 B       16  224 KiB  224 KiB      0 B      0    1.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephblockpool                        2  256  5.1 TiB  5.1 TiB  4.9 KiB    1.34M   15 TiB   15 TiB  4.9 KiB  74.35    1.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.control          3    8      0 B      0 B      0 B        8      0 B      0 B      0 B      0    1.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephfilesystem-metadata              4    8  2.0 GiB  509 MiB  1.5 GiB    1.09M  3.0 GiB  1.5 GiB  1.5 GiB   0.06    1.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephfilesystem-data0                 5  256  4.9 TiB  4.9 TiB      0 B    3.57M   15 TiB   15 TiB      0 B  73.66    1.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.meta             6    8  4.4 KiB  3.9 KiB    441 B       17  208 KiB  208 KiB    441 B      0    1.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.log              7    8   22 MiB  9.3 KiB   22 MiB      214   23 MiB  720 KiB   22 MiB      0    1.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.index    8    8  305 MiB      0 B  305 MiB       12  305 MiB      0 B  305 MiB      0    1.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec   9    8      0 B      0 B      0 B        0      0 B      0 B      0 B      0    1.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.data    10   32  2.2 TiB  2.2 TiB      0 B    1.07M  6.7 TiB  6.7 TiB      0 B  55.98    1.8 TiB            N/A          N/A    N/A         0 B          0 B
device_health_metrics                                  11    1  3.5 MiB      0 B  3.5 MiB       30  3.5 MiB      0 B  3.5 MiB      0    1.8 TiB            N/A          N/A    N/A         0 B          0 B
------------------------------------------------------------

---------------------ceph osd pool autoscale status-------------------
POOL                                                     SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK   
.rgw.root                                               4792                 3.0        61437G  0.0000                                  1.0       8              on         False  
ocs-storagecluster-cephblockpool                        5211G                3.0        61437G  0.5000        0.4900           0.5000   1.0     256              on         False  
ocs-storagecluster-cephobjectstore.rgw.control             0                 3.0        61437G  0.0000                                  1.0       8              on         False  
ocs-storagecluster-cephfilesystem-metadata              2088M                3.0        61437G  0.0001                                  4.0       8              on         False  
ocs-storagecluster-cephfilesystem-data0                 4996G                3.0        61437G  0.5000        0.4900           0.5000   1.0     256              on         False  
ocs-storagecluster-cephobjectstore.rgw.meta             4472                 3.0        61437G  0.0000                                  1.0       8              on         False  
ocs-storagecluster-cephobjectstore.rgw.log             22426k                3.0        61437G  0.0000                                  1.0       8              on         False  
ocs-storagecluster-cephobjectstore.rgw.buckets.index   304.8M                3.0        61437G  0.0000                                  1.0       8              on         False  
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec      0                 3.0        61437G  0.0000                                  1.0       8              on         False  
ocs-storagecluster-cephobjectstore.rgw.buckets.data     2285G                3.0        61437G  0.1116                                  1.0      32              on         False  
device_health_metrics                                   3572k                3.0        61437G  0.0000                                  1.0       1              on         False  
-------------------------------------------------------------

Regards,


Craig Wayman
TSE Red Hat OpenShift Data Foundations (ODF) 
Customer Experience and Engagement, NA

Comment 19 Santosh Pillai 2022-10-28 03:35:23 UTC
(In reply to Craig Wayman from comment #18)
> Good Afternoon,
> 
>   I decided to re-open the BZ since this is still related to the process of
>
> 
> 1. The PGs on the OSDs look very low, how do we increase them? Shouldn't
> rook-ceph do it automatically?

The number of PGs, based on the number of OSDs available, looks good to me. But I'll check again and get back to you. 

> 2. Should we adjust the TARGET_RATIO to accomplish this? If so, what values
> should they be set to and on which pools?

Not sure about this right now. I'll check and get back to you.

Comment 20 Santosh Pillai 2022-10-31 09:25:44 UTC
Checking with someone from Ceph about this. I'll update the BZ once I get the answers.

Comment 21 Craig Wayman 2022-10-31 13:19:39 UTC
  Acknowledged. The consensus on our team is that for customers with a decent amount of data, each OSD should carry at least 100 PGs, and these PGs look pretty low. In addition, a couple of OSDs are flirting with the 75% %USE (the ODF nearfull limit) while the vast majority of the OSDs are sitting around high-50s to low-60s %USE. They have since added three more new disks, which decreased the %USE further, but those couple of OSDs are still causing trouble (I will post ceph df and ceph osd df tree below). 

  Now, we could re-weight those two OSDs; however, because the PGs were so low, I wanted to use the PG-increase process to accomplish both goals by increasing pg_num and pgp_num, the pgp_num change being what actually splits the PGs and sends them to other OSDs. Even that process didn't work as expected: the pool PGs went up but the OSD PGs stayed the same, which does not happen in my test environment. I wonder if it has to do with this cluster being upgraded over time from v4.4, so perhaps components like rook-ceph aren't functioning the way they would on a fresh install.

  All that said, the big questions are: why isn't rook-ceph scaling these PGs when the autoscaler is set to "on"? Why isn't Ceph balancing these OSDs better when the balancer runs? I've seen posts about the autoscaler not being particularly good at scaling PGs in ODF, which is why I'm wondering what the next steps should be. Should we adjust the TARGET_RATIO on the pools? With plain Ceph there is a lot more hands-on/manual tweaking, but ODF is generally supposed to be a more hands-off experience, which is why the customer is curious as to why things aren't working properly. The reason this case was opened is that they were in HEALTH_WARN because of just one OSD, and after adding three new disks they were still in HEALTH_WARN because of that same OSD (osd.22).  

  I just wanted to explain this a little better; looking forward to Engineering's input.


--- RAW STORAGE ---
CLASS    SIZE   AVAIL    USED  RAW USED  %RAW USED
ssd    66 TiB  29 TiB  37 TiB    37 TiB      56.09
TOTAL  66 TiB  29 TiB  37 TiB    37 TiB      56.09
 

--- POOLS ---
POOL                                                   ID  PGS   STORED   (DATA)   (OMAP)  OBJECTS     USED   (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
.rgw.root                                               1    8  4.7 KiB  4.7 KiB      0 B       16  232 KiB  232 KiB      0 B      0    2.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephblockpool                        2  256  5.1 TiB  5.1 TiB  4.9 KiB    1.34M   15 TiB   15 TiB  4.9 KiB  64.14    2.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.control          3    8      0 B      0 B      0 B        8      0 B      0 B      0 B      0    2.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephfilesystem-metadata              4    8  2.1 GiB  512 MiB  1.6 GiB    1.09M  3.1 GiB  1.5 GiB  1.6 GiB   0.03    2.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephfilesystem-data0                 5  256  4.9 TiB  4.9 TiB      0 B    3.57M   15 TiB   15 TiB      0 B  63.28    2.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.meta             6    8  4.4 KiB  3.9 KiB    441 B       17  208 KiB  208 KiB    441 B      0    2.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.log              7    8   22 MiB  3.6 KiB   22 MiB      214   23 MiB  528 KiB   22 MiB      0    2.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.index    8    8  310 MiB      0 B  310 MiB       12  310 MiB      0 B  310 MiB      0    2.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec   9    8      0 B      0 B      0 B        0      0 B      0 B      0 B      0    2.8 TiB            N/A          N/A    N/A         0 B          0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.data    10   32  2.3 TiB  2.3 TiB      0 B    1.11M  6.9 TiB  6.9 TiB      0 B  44.72    2.8 TiB            N/A          N/A    N/A         0 B          0 B
device_health_metrics                                  11    1  3.9 MiB      0 B  3.9 MiB       33  3.9 MiB      0 B  3.9 MiB      0    2.8 TiB            N/A          N/A    N/A         0 B          0 B

ID   CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE   DATA      OMAP      META     AVAIL     %USE   VAR   PGS  STATUS  TYPE NAME                           
 -1         65.99799         -   66 TiB    37 TiB    37 TiB   8.1 GiB  121 GiB    29 TiB  56.09  1.00    -          root default                        
 -8         22.00000         -   22 TiB    12 TiB    12 TiB   2.8 GiB   40 GiB   9.7 TiB  56.08  1.00    -              rack rack0                      
 -7         22.00000         -   22 TiB    12 TiB    12 TiB   2.8 GiB   40 GiB   9.7 TiB  56.08  1.00    -                  host edcxosid001g-bcbsfl-com
  1    ssd   2.00000   1.00000    2 TiB   1.2 TiB   1.2 TiB  1013 MiB  4.3 GiB   788 GiB  61.53  1.10   62      up              osd.1                   
  4    ssd   2.00000   1.00000    2 TiB   1.0 TiB   1.0 TiB    14 MiB  3.7 GiB   986 GiB  51.84  0.92   54      up              osd.4                   
  7    ssd   2.00000   1.00000    2 TiB   1.1 TiB   1.1 TiB   576 MiB  3.7 GiB   934 GiB  54.41  0.97   56      up              osd.7                   
 11    ssd   2.00000   1.00000    2 TiB   963 GiB   960 GiB    71 MiB  3.4 GiB   1.1 TiB  47.04  0.84   52      up              osd.11                  
 13    ssd   2.00000   1.00000    2 TiB   1.3 TiB   1.3 TiB    33 MiB  4.3 GiB   724 GiB  64.65  1.15   56      up              osd.13                  
 16    ssd   2.00000   1.00000    2 TiB   1.1 TiB   1.1 TiB    33 MiB  3.2 GiB   888 GiB  56.65  1.01   56      up              osd.16                  
 20    ssd   2.00000   1.00000    2 TiB   1.2 TiB   1.2 TiB   324 MiB  3.7 GiB   849 GiB  58.57  1.04   55      up              osd.20                  
 23    ssd   2.00000   1.00000    2 TiB   1.1 TiB   1.1 TiB       0 B  3.7 GiB   922 GiB  55.00  0.98   56      up              osd.23                  
 24    ssd   2.00000   1.00000    2 TiB   990 GiB   987 GiB   274 MiB  3.5 GiB   1.0 TiB  48.36  0.86   49      up              osd.24                  
 29    ssd   2.00000   1.00000    2 TiB   1.1 TiB   1.1 TiB   213 MiB  3.5 GiB   925 GiB  54.84  0.98   51      up              osd.29                  
 30    ssd   2.00000   1.00000    2 TiB   1.3 TiB   1.3 TiB   281 MiB  3.3 GiB   737 GiB  64.02  1.14   54      up              osd.30                  
-12         21.99899         -   22 TiB    12 TiB    12 TiB   2.6 GiB   41 GiB   9.7 TiB  56.10  1.00    -              rack rack1                      
-11         21.99899         -   22 TiB    12 TiB    12 TiB   2.6 GiB   41 GiB   9.7 TiB  56.10  1.00    -                  host edcxosid001f-bcbsfl-com
  2    ssd   1.99899   1.00000  2.0 TiB   1.1 TiB   1.1 TiB   506 MiB  3.7 GiB   892 GiB  56.44  1.01   60      up              osd.2                   
  3    ssd   2.00000   1.00000  2.0 TiB   1.2 TiB   1.2 TiB       0 B  4.2 GiB   852 GiB  58.39  1.04   53      up              osd.3                   
  6    ssd   2.00000   1.00000  2.0 TiB   1.0 TiB   1.0 TiB   728 MiB  4.2 GiB  1015 GiB  50.44  0.90   56      up              osd.6                   
  9    ssd   2.00000   1.00000    2 TiB   936 GiB   932 GiB   296 MiB  3.4 GiB   1.1 TiB  45.70  0.81   48      up              osd.9                   
 12    ssd   2.00000   1.00000    2 TiB   1.0 TiB   1.0 TiB       0 B  3.0 GiB  1011 GiB  50.64  0.90   53      up              osd.12                  
 15    ssd   2.00000   1.00000    2 TiB  1018 GiB  1014 GiB   242 MiB  3.7 GiB   1.0 TiB  49.72  0.89   54      up              osd.15                  
 18    ssd   2.00000   1.00000    2 TiB   1.2 TiB   1.2 TiB    33 MiB  3.6 GiB   771 GiB  62.35  1.11   59      up              osd.18                  
 21    ssd   2.00000   1.00000    2 TiB   1.3 TiB   1.3 TiB   316 MiB  4.4 GiB   736 GiB  64.07  1.14   56      up              osd.21                  
 26    ssd   2.00000   1.00000    2 TiB   1.3 TiB   1.3 TiB    66 MiB  4.2 GiB   757 GiB  63.03  1.12   54      up              osd.26                  
 27    ssd   2.00000   1.00000    2 TiB   1.1 TiB   1.1 TiB    74 MiB  3.4 GiB   956 GiB  53.30  0.95   52      up              osd.27                  
 32    ssd   2.00000   1.00000    2 TiB   1.3 TiB   1.3 TiB   365 MiB  3.5 GiB   758 GiB  62.99  1.12   56      up              osd.32                  
 -4         21.99899         -   22 TiB    12 TiB    12 TiB   2.8 GiB   39 GiB   9.7 TiB  56.09  1.00    -              rack rack2                      
 -3         21.99899         -   22 TiB    12 TiB    12 TiB   2.8 GiB   39 GiB   9.7 TiB  56.09  1.00    -                  host edcxosid001e-bcbsfl-com
  0    ssd   1.99899   1.00000  2.0 TiB   844 GiB   841 GiB    83 MiB  3.1 GiB   1.2 TiB  41.24  0.74   48      up              osd.0                   
  5    ssd   2.00000   1.00000  2.0 TiB   1.4 TiB   1.4 TiB       0 B  4.1 GiB   658 GiB  67.86  1.21   60      up              osd.5                   
  8    ssd   2.00000   1.00000  2.0 TiB   1.2 TiB   1.2 TiB   341 MiB  3.7 GiB   862 GiB  57.92  1.03   55      up              osd.8                   
 10    ssd   2.00000   1.00000    2 TiB   1.1 TiB   1.1 TiB   289 MiB  3.7 GiB   924 GiB  54.89  0.98   54      up              osd.10                  
 14    ssd   2.00000   1.00000    2 TiB   1.2 TiB   1.2 TiB       0 B  3.9 GiB   861 GiB  57.97  1.03   59      up              osd.14                  
 17    ssd   2.00000   1.00000    2 TiB   1.2 TiB   1.2 TiB       0 B  3.9 GiB   768 GiB  62.50  1.11   57      up              osd.17                  
 19    ssd   2.00000   1.00000    2 TiB   1.2 TiB   1.2 TiB    31 MiB  3.7 GiB   792 GiB  61.33  1.09   55      up              osd.19                  
 22    ssd   2.00000   1.00000    2 TiB   1.4 TiB   1.4 TiB   642 MiB  4.1 GiB   572 GiB  72.06  1.28   67      up              osd.22                  
 25    ssd   2.00000   1.00000    2 TiB   980 GiB   978 GiB    67 MiB  2.3 GiB   1.0 TiB  47.87  0.85   50      up              osd.25                  
 28    ssd   2.00000   1.00000    2 TiB   1.0 TiB   1.0 TiB   746 MiB  3.7 GiB   994 GiB  51.45  0.92   51      up              osd.28                  
 31    ssd   2.00000   1.00000    2 TiB   858 GiB   854 GiB   640 MiB  3.0 GiB   1.2 TiB  41.88  0.75   45      up              osd.31                  
                         TOTAL   66 TiB    37 TiB    37 TiB   8.1 GiB  121 GiB    29 TiB  56.09                                                         
MIN/MAX VAR: 0.74/1.28  STDDEV: 7.26




Regards,


Craig Wayman
TSE Red Hat OpenShift Data Foundations (ODF) 
Customer Experience and Engagement, NA

Comment 23 Travis Nielsen 2022-11-03 19:25:22 UTC
Everything looks correct from Rook's perspective since the OSDs are all online as desired, including the new OSDs created to scale up the cluster. 

The PG autoscaler is handled by core Ceph. Vikhyat, could someone from your team take a look to see if the PGs are as expected or if there is an issue with the autoscaler?

Comment 24 Vikhyat Umrao 2022-11-03 21:26:03 UTC
(In reply to Travis Nielsen from comment #23)
> Everything looks correct from Rook's perspective since the OSDs are all
> online as desired, including the new OSDs created to scale up the cluster. 
> 
> The PG autoscaler is handled by core ceph. Vikhyat could someone from your
> team take a look to see if the PGs are expected or if there is an issue with
> the auto scaler?

Sure, Travis. Let me ask Junior if he can help!

Junior - see comment #21; it looks like the autoscaler is not scaling PGs. Can you please take a look and help the support team? If you need more debug data, feel free to request it.

Thank you,
Vikhyat

Comment 32 Craig Wayman 2022-11-23 20:44:44 UTC
(In reply to Kamoltat (Junior) Sirivadhna from comment #29)
> Hi Craig,
> 
> apologies for the delay, thank you for your patience.
> 
> The output from `ceph osd pool autoscale-status` suggests we did not utilize
> the `bulk` flag. Basically what the `bulk` flag does is that it tells the
> autoscaler that the pool is expected to be large.
> Therefore, the autoscaler will start out that pool with a large amount of
> PGs for performance purposes.
> 
> From the comments in this BZ, I can see that you want to increase
> `ocs-storagecluster-cephblockpool` and maybe
> `ocs-storagecluster-cephfilesystem-data0`,
> `ocs-storagecluster-cephobjectstore.rgw.buckets.data`.
> 
> So what you can do is use this command:
> 
> `ceph osd pool set <pool-name> bulk true`
> 
> Let me know if this helps in the short term.
> 
> To be honest, I feel like a lot of people are still unaware of the `bulk`
> flag feature, especially during pool creation, I'll also look into improving
> the pool creation process, when the autoscaler is enabled `bulk` flag should
> also be set for data pools.

Junior,

  First, thank you for pointing out the bulk flag; you were correct that members of my team, including myself, weren't very familiar with that particular Ceph flag. The interesting thing is that when the customer set the bulk flag, the PGs increased, but because the PG autoscaler was still set to "on" for those pools, the PGs then decreased again. The good news is that this process ended up balancing the %USE across the OSDs; everything is much more balanced now. With that said, I hopefully have one more question that, once answered, should put this BZ to bed: if the customer uses the bulk flag on a pool, should that pool have the autoscaler set to "off"?

  In my view, I believe it should. I know this is use-case specific, but for this cluster, which has a good amount of data, a lot of OSDs, and a high workload, the OSDs are sitting at around 75 PGs per OSD on average. My goal was to get them over at least 100 PGs per OSD, and when the bulk flag was applied, the PGs increased to around 200 PGs per OSD, which is a good desired state. In my opinion, there is no need to keep the autoscaler set to "on" since we're already at the desired state. 
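
  (For reference, a rough sketch of how the bulk flag can be set and checked from the rook-ceph-tools pod; the exact pools the customer applied it to may differ:)

sh-4.4$ ceph osd pool set ocs-storagecluster-cephblockpool bulk true
sh-4.4$ ceph osd pool set ocs-storagecluster-cephfilesystem-data0 bulk true
sh-4.4$ ceph osd pool autoscale-status
sh-4.4$ ceph osd df tree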

  Again, thank you for your time and insight on the issue. This was helpful.


Regards,


Craig Wayman
TSE Red Hat OpenShift Data Foundations (ODF) 
Customer Experience and Engagement, NA

Comment 33 Travis Nielsen 2022-11-29 00:22:40 UTC
(In reply to Craig Wayman from comment #32)
> (In reply to Kamoltat (Junior) Sirivadhna from comment #29)
> > Hi Craig,
> > 
> > apologies for the delay, thank you for your patience.
> > 
> > The output from `ceph osd pool autoscale-status` suggests we did not utilize
> > the `bulk` flag. Basically what the `bulk` flag does is that it tells the
> > autoscaler that the pool is expected to be large.
> > Therefore, the autoscaler will start out that pool with a large amount of
> > PGs for performance purposes.
> > 
> > From the comments in this BZ, I can see that you want to increase
> > `ocs-storagecluster-cephblockpool` and maybe
> > `ocs-storagecluster-cephfilesystem-data0`,
> > `ocs-storagecluster-cephobjectstore.rgw.buckets.data`.
> > 
> > So what you can do is use this command:
> > 
> > `ceph osd pool set <pool-name> bulk true`
> > 
> > Let me know if this helps in the short term.
> > 
> > To be honest, I feel like a lot of people are still unaware of the `bulk`
> > flag feature, especially during pool creation, I'll also look into improving
> > the pool creation process, when the autoscaler is enabled `bulk` flag should
> > also be set for data pools.
> 
> Junior,
> 
>   First, thank you for pointing out that bulk flag, and yes, you were
> correct members of my team including myself, weren't very familiar with that
> particular ceph bulk flag. The interesting thing that happened is that when
> the customer used the bulk flag it increased the PGs however, once they were
> increased since the PG autoscaler still being set to "on" for those pools
> the bulk flag was applied to, the PGs increased, and then decreased. The
> good news is that this process ended up balancing their %USE on the OSDs.
> Everything is much more balanced now after that process. With that said, I
> hopefully have one more question, that once answered should put this BZ to
> bed. If the customer uses the bulk flag on a pool. Should that pool have the
> autoscaler set to "off"?
> 
>   In my view, I believe this should be the case. Now, I know this is use
> case specific, but for this cluster that has a good amount of data, a lot of
> OSDs, high-use/workload... Their PGs on their OSDs are sitting on average at
> around 75 PGs per OSD. My goal was to get them up over at least 100 PGs per
> OSD, so if the bulk flag was applied, the PGs increased to around 200 PGs
> per OSD. That is a good desired state. In my opinion, I don't think there is
> a need to have the autoscaler set to "on" since we're already at the desired
> state. 
> 
>   Again, thank you for your time and insight on the issue. This was helpful.
> 
> 
> Regards,
> 
> 
> Craig Wayman
> TSE Red Hat OpenShift Data Foundations (ODF) 
> Customer Experience and Engagement, NA

The bulk setting is only meaningful if the autoscaler is enabled, so if you're going to set it, you need the autoscaler enabled. If the autoscaler is disabled, then all PG management becomes manual, which is something ODF (and also Ceph) aims to avoid for users as much as possible.

