Bug 2054359 - [RFE][ODF] Deliver capability to automatically scale a deployed ODF cluster when the utilized-capacity threshold has been reached
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.8
Hardware: All
OS: All
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Mudit Agarwal
QA Contact: Elad
URL:
Whiteboard:
Depends On: 2025043
Blocks:
 
Reported: 2022-02-14 19:16 UTC by Alex Handy
Modified: 2023-08-09 17:00 UTC
CC List: 16 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of: 2025043
Environment:
Last Closed:
Embargoed:


Links:
System ID: Red Hat Issue Tracker RHSTOR-1943 | Private: 0 | Priority: None | Status: None | Summary: None | Last Updated: 2023-03-05 13:01:13 UTC

Description Alex Handy 2022-02-14 19:16:38 UTC
Description of problem:
Presently, growing a deployed ODF environment requires manually modifying the StorageCluster definition to increment the 'count' parameter of the default storageDeviceSet (assuming the appropriate provisioner is available to create the new volumes that back the additional OSDs).
The MVP for the desired capability would allow a utilized-capacity threshold to be defined that, once crossed, automatically instantiates a new set of OSDs (by incrementing the 'count' described above).
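
For reference, below is a minimal sketch (in Go, using the client-go dynamic client) of that manual step done programmatically: bumping spec.storageDeviceSets[0].count on the StorageCluster CR. This is purely illustrative and not part of ocs-operator; the namespace "openshift-storage", the object name "ocs-storagecluster", and the helper name incrementDeviceSetCount are assumptions for the sketch.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

// storageClusterGVR identifies the StorageCluster custom resource served by ocs-operator.
var storageClusterGVR = schema.GroupVersionResource{
	Group: "ocs.openshift.io", Version: "v1", Resource: "storageclusters",
}

// incrementDeviceSetCount reads the StorageCluster, adds 'step' to
// spec.storageDeviceSets[setIndex].count, and writes the object back.
func incrementDeviceSetCount(ctx context.Context, client dynamic.Interface, ns, name string, setIndex int, step int64) error {
	sc, err := client.Resource(storageClusterGVR).Namespace(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	sets, found, err := unstructured.NestedSlice(sc.Object, "spec", "storageDeviceSets")
	if err != nil || !found || setIndex >= len(sets) {
		return fmt.Errorf("storageDeviceSets[%d] not found: %v", setIndex, err)
	}
	set, ok := sets[setIndex].(map[string]interface{})
	if !ok {
		return fmt.Errorf("unexpected storageDeviceSets entry type")
	}
	count, _, _ := unstructured.NestedInt64(set, "count")
	if err := unstructured.SetNestedField(set, count+step, "count"); err != nil {
		return err
	}
	sets[setIndex] = set
	if err := unstructured.SetNestedSlice(sc.Object, sets, "spec", "storageDeviceSets"); err != nil {
		return err
	}
	_, err = client.Resource(storageClusterGVR).Namespace(ns).Update(ctx, sc, metav1.UpdateOptions{})
	return err
}

func main() {
	// Assumes a local kubeconfig; inside a cluster, rest.InClusterConfig() would be used instead.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	// "openshift-storage" / "ocs-storagecluster" are the usual defaults, assumed here.
	if err := incrementDeviceSetCount(context.Background(), dynamic.NewForConfigOrDie(cfg),
		"openshift-storage", "ocs-storagecluster", 0, 1); err != nil {
		panic(err)
	}
}

Any automation delivered for this RFE would presumably perform an equivalent update from inside the operator's reconcile loop.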

- To account for transient spikes, we would want the threshold to be exceeded for a minimum period of time before scaling (say 2 or 4 hours, but optimally user-definable in minutes); see the decision-logic sketch after this list
- A user-definable scale-factor integer may be desired to control the increment used for the count increase
- Optionally, a user-defined value selecting which storageDeviceSet index to scale (default: 0), to accommodate deployments that have multiple storageDeviceSets defined
- In an environment where dynamically provisioned Machines are used for ODF worker nodes, the MachineSet Auto-Scaler should be configured to create new nodes once the current node selection is at capacity and cannot house any more OSDs
- To ensure a cluster does not get into an uncontrolled/runaway state and scale without bound given effectively unlimited resources (as could be available on a cloud platform), a user-definable limit should be set for the maximum value permitted in 'count', effectively bounding how large the cluster can grow automatically
- A value tracking the current state of this automated capacity scaling feature should be created to ensure additional scaling operations are not attempted while one is in progress or is awaiting resources.
- Optionally, a cool-down timer could be implemented to ensure another scaling operation is not attempted within a user-defined period of time of the previous scaling operation (e.g. 12+ hours)
- Optionally, a way to temporarily suspend this auto-scale capability would be beneficial for migrations and other known transient events (planned AZ outage, upgrade, etc.)
- A high-severity alert should be generated if the cluster cannot perform this scaling operation (e.g. the scale limit has been reached, the cool-down period has not yet elapsed, MachineSet auto-scaling is not functioning or has hit its limit, the StorageCluster cannot be modified, or the new OSDs cannot be provisioned for any other reason)
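
To make the interaction of these guards concrete, the following is a rough, purely illustrative Go sketch of the decision logic; every name in it (Config, State, Threshold, Window, Cooldown, MaxCount, ShouldScale, and so on) is invented for this sketch and nothing like it exists in ocs-operator today.

package autoscale

import "time"

// Config holds the user-definable knobs described in the requirements above.
type Config struct {
	Threshold    float64       // e.g. 0.75 = act once 75% of raw capacity is used
	Window       time.Duration // threshold must be exceeded continuously this long
	Cooldown     time.Duration // minimum gap between two scaling operations
	ScaleFactor  int           // how much to add to 'count' per scaling operation
	MaxCount     int           // upper bound on 'count'; blocks runaway growth
	DeviceSetIdx int           // which storageDeviceSet to scale (default 0)
}

// State tracks the progress of the automated capacity-scaling feature.
type State struct {
	OverThresholdSince time.Time // zero value means utilization is below threshold
	LastScale          time.Time // when the previous scaling operation happened
	ScalingInProgress  bool      // a scale-up is running or awaiting resources
}

// ShouldScale returns the new 'count' to set, or (0, false) if no action is due.
func ShouldScale(cfg Config, st State, utilization float64, currentCount int, now time.Time) (int, bool) {
	if st.ScalingInProgress {
		return 0, false // a previous scale-up is still waiting for OSDs/resources
	}
	if utilization < cfg.Threshold || st.OverThresholdSince.IsZero() {
		return 0, false // below threshold, or crossing not yet recorded
	}
	if now.Sub(st.OverThresholdSince) < cfg.Window {
		return 0, false // transient spike; wait until the threshold is sustained
	}
	if !st.LastScale.IsZero() && now.Sub(st.LastScale) < cfg.Cooldown {
		return 0, false // still in cool-down after the previous scaling operation
	}
	next := currentCount + cfg.ScaleFactor
	if next > cfg.MaxCount {
		return 0, false // hitting this bound should raise a high-severity alert
	}
	return next, true
}

The caller would be responsible for maintaining State (e.g. recording OverThresholdSince when utilization first crosses the threshold) and for raising the alert when scaling is blocked by the MaxCount bound or any other condition in the list above.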


Version of all relevant components (if applicable):
- Optimally, 4.8+ EUS (the customer is running 4.6 EUS today, but ideally this feature would be available in the next EUS release they deploy)

Does this issue impact your ability to continue to work with the product?
- This customer will be deploying a large number of ODF environments and does not want to have to individually manage storage-scaling for each environment
- Products that compete with ODF do offer this capability, and it is likely to be a common RFP requirement for customers that procure through that process, especially when RFPs are seeded by competitors who know this is a capability we do not have.

Is there any workaround available to the best of your knowledge?
- No. It would likely require developing a custom operator that monitors the state of the environment through Prometheus metrics and then performs the StorageCluster modifications accordingly (a rough sketch of the metrics query follows).
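
To illustrate the "monitor through Prometheus metrics" half of such a workaround, here is a small Go sketch that polls the cluster-wide utilization ratio. The metric names (ceph_cluster_total_used_bytes, ceph_cluster_total_bytes) are the ones the Ceph mgr prometheus module normally exposes in ODF, but both they and the Prometheus address should be treated as assumptions and verified against the target environment.

package workaround

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// clusterUtilization returns used/total raw capacity as a ratio in [0, 1].
func clusterUtilization(ctx context.Context, addr string) (float64, error) {
	client, err := api.NewClient(api.Config{Address: addr})
	if err != nil {
		return 0, err
	}
	// Used raw capacity divided by total raw capacity, as reported by the Ceph mgr
	// prometheus module and scraped by the OpenShift monitoring stack.
	const query = "ceph_cluster_total_used_bytes / ceph_cluster_total_bytes"
	result, warnings, err := promv1.NewAPI(client).Query(ctx, query, time.Now())
	if err != nil {
		return 0, err
	}
	if len(warnings) > 0 {
		fmt.Println("prometheus warnings:", warnings)
	}
	vec, ok := result.(model.Vector)
	if !ok || len(vec) == 0 {
		return 0, fmt.Errorf("unexpected or empty query result: %v", result)
	}
	return float64(vec[0].Value), nil
}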

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
N/A

Can this issue be reproduced from the UI?
N/A

