Bug 1890135 - When the PG limit is reached via pool creation, the pool is listed in oc get cephblockpool but at the Ceph level it is not created
Summary: When the PG limit is reached via pool creation, the pool is listed in oc get cephblockp...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Console Storage Plugin
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: gowtham
QA Contact: Shay Rozen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-21 14:01 UTC by Shay Rozen
Modified: 2023-09-15 00:50 UTC
CC List: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-15 10:22:03 UTC
Target Upstream Version:
Embargoed:


Attachments
failed pool creation (176.24 KB, image/png)
2020-10-21 14:01 UTC, Shay Rozen
ceph osd dump & ceph pg dump (76.71 KB, text/plain)
2020-10-28 14:01 UTC, Shay Rozen

Description Shay Rozen 2020-10-21 14:01:13 UTC
Created attachment 1723231 [details]
failed pool creation

Description of problem (please be as detailed as possible and provide log
snippets):
When creating multiple storage classes with new pools via the UI, the third or fourth pool creation fails due to the PG limit, but "oc get cephblockpool" still lists the pool even though it was not created at the Ceph level.
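
For reference, the mismatch can be confirmed by comparing the Kubernetes view with the Ceph view. A rough sketch; it assumes the openshift-storage namespace and access to the Ceph toolbox:

$ oc -n openshift-storage get cephblockpools.ceph.rook.io    # CRs known to Rook
$ ceph osd pool ls                                           # pools that actually exist in Ceph
# A pool such as p5 appearing in the first list but not the second shows the mismatch.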


Version of all relevant components (if applicable):
4.6.0-128c

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCS cluster
2. On the storage class page in the UI, create 3-4 storage classes with new pools.



Actual results:
The third or fourth pool fails to create with the error "An error occurred, Pool "p5" was not created" (see attached image). If you check via the CLI with oc get cephblockpool, the pool is there. Checking at the Ceph level, the pool doesn't exist.

Expected results:
oc get cephblockpool should not list a pool that was not created at the Ceph level.

Additional info:
 ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED 
    hdd       768 GiB     544 GiB     221 GiB      224 GiB         29.10 
    TOTAL     768 GiB     544 GiB     221 GiB      224 GiB         29.10 
 
POOLS:
    POOL                                                      ID     STORED      OBJECTS     USED        %USED     MAX AVAIL 
    ocs-storagecluster-cephblockpool                           1      73 GiB      19.13k     220 GiB     33.92       143 GiB 
    ocs-storagecluster-cephobjectstore.rgw.control             2         0 B           8         0 B         0       143 GiB 
    ocs-storagecluster-cephfilesystem-metadata                 3     146 KiB          25     2.1 MiB         0       143 GiB 
    ocs-storagecluster-cephfilesystem-data0                    4       158 B           1     192 KiB         0       143 GiB 
    ocs-storagecluster-cephobjectstore.rgw.meta                5     3.0 KiB          12     1.9 MiB         0       143 GiB 
    ocs-storagecluster-cephobjectstore.rgw.log                 6     159 KiB         211     6.5 MiB         0       143 GiB 
    ocs-storagecluster-cephobjectstore.rgw.buckets.index       7         0 B          22         0 B         0       143 GiB 
    ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec      8         0 B           0         0 B         0       143 GiB 
    .rgw.root                                                  9     4.7 KiB          16     2.8 MiB         0       143 GiB 
    ocs-storagecluster-cephobjectstore.rgw.buckets.data       10       1 KiB           1     192 KiB         0       143 GiB 
    p1                                                        20         0 B           0         0 B         0       215 GiB 
    p2                                                        21         0 B           0         0 B         0       215 GiB 
    p3                                                        22         0 B           0         0 B         0       215 GiB 
$ oc get cephblockpools.ceph.rook.io 
NAME                               AGE
ocs-storagecluster-cephblockpool   7d22h
p1                                 52m
p2                                 51m
p3                                 51m
p4                                 50m
p5                                 45m

Comment 3 Kanika Murarka 2020-10-27 10:26:04 UTC
The UI shows the error message when the CephBlockPool object created by the user gets into a failed state (due to some error during creation on the Ceph side). So even if pool creation fails at the Ceph level, the corresponding CephBlockPool k8s object will still exist in the CephBlockPool list (in a failed status).
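
For reference, the failed state should be visible on the CR itself; a sketch, assuming the openshift-storage namespace and a Rook version that populates .status.phase on CephBlockPool:

$ oc -n openshift-storage get cephblockpool p5 -o jsonpath='{.status.phase}'
$ oc -n openshift-storage describe cephblockpool p5    # shows the status and any recorded events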

Comment 4 Elad 2020-10-27 10:53:30 UTC
Nishanth, Travis, any chance this BZ is similar to bug 1748001?

Comment 5 Elad 2020-10-27 10:55:10 UTC
Also, the fact that the pool is not cleaned up and keeps being listed by oc get cephblockpool is a bug. Therefore, re-opening.
Nishanth/Kanika, feel free to move to the correct component

Comment 6 Nishanth Thomas 2020-10-27 11:19:53 UTC
Moving the BZ to Rook for Travis to take a look. This doesn't belong to the UI.

Comment 7 Travis Nielsen 2020-10-27 14:16:02 UTC
There are a couple of approaches to this issue.
1. The UI could stop allowing pool creation once the PG limit is hit. 
2. Rook could automatically increase the PG limit if pool creation hits the limit. 

#1 is disruptive to users creating pools, so clearly #2 is preferred. 

@Josh Any concern with Rook increasing the max PG count automatically when needed to create a new pool? OCS users don't know anything about PG management.
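
For reference, option #2 would essentially mean Rook (or an admin) raising the per-OSD PG ceiling. A minimal sketch with an illustrative value, not a recommendation:

$ ceph config get mon mon_max_pg_per_osd            # current ceiling
$ ceph config set global mon_max_pg_per_osd 400     # 400 is only an example value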

Comment 8 Sébastien Han 2020-10-27 14:20:49 UTC
Shay, please paste the output of the following:

ceph osd dump
ceph pg dump

Or else, provide the must-gather logs.
Thanks

Comment 9 Josh Durgin 2020-10-27 14:43:02 UTC
(In reply to Travis Nielsen from comment #7)
> There are a couple of approaches to this issue.
> 1. The UI could stop allowing pool creation once the PG limit is hit. 
> 2. Rook could automatically increase the PG limit if pool creation hits the
> limit. 
> 
> #1 is disruptive to the user creation of pools, so clearly #2 is preferred. 
> 
> @Josh Any concern with Rook increasing the max PG count automatically when
> needed to create a new pool? OCS users don't know anything about PG
> management.

Yes, PGs take up finite memory/CPU resources, so OCS should prevent users from consuming too many. Limiting the number of pools in the UI makes sense to me, and I thought that was the direction we were already heading for this.

Comment 10 Travis Nielsen 2020-10-27 17:21:06 UTC
I do remember now that we were going to limit the number of pools. Creating three additional pools should cover production needs for now.

@Nithin @Eran In 4.6 can we restrict the number of pools created in the UI to 3 (or 4 if it's allowed)? Let's see in 4.6 if this is sufficient, or what feedback we get from customers. 

@Josh remind me... in larger clusters could we support more PGs? Some customer will surely want to create more pools at some point.

Comment 11 Kanika Murarka 2020-10-27 20:40:45 UTC
There was some discussion about having PG count validation in the admission controller (https://issues.redhat.com/browse/RHSTOR-1257), because this would provide the same experience in the CLI and the UI.

Comment 12 Josh Durgin 2020-10-28 00:52:24 UTC
(In reply to Travis Nielsen from comment #10)
> I do remember now that we were going limit the number of pools. Creating
> three additional pools should cover production needs for now.
> 
> @Nithin @Eran In 4.6 can we restrict the number of pools created in the UI
> to 3 (or 4 if it's allowed)? Let's see in 4.6 if this is sufficient, or what
> feedback we get from customers. 
> 
> @Josh remind me... in larger clusters could we support more PGs? Some
> customer will surely want to create more pools at some point.

Yes, we're running into this particularly with OCS due to the small initial size.
We target 100 pgs per OSD, with a hard cutoff at 300 by default. A larger cluster
will allow more pools before hitting this limit.
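
For reference, a rough way to see how close a cluster is to that ceiling (the option name mon_max_pg_per_osd and its default depend on the Ceph release):

$ ceph osd pool ls detail                     # pg_num and replica size per pool
$ ceph osd stat                               # number of OSDs
$ ceph config get mon mon_max_pg_per_osd      # the per-OSD PG limit
# PG replicas per OSD is roughly sum(pg_num * size) over all pools, divided by the OSD count.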

Comment 13 Travis Nielsen 2020-10-28 01:29:13 UTC
(In reply to Kanika Murarka from comment #11)
> There were some discussion to have PG count validation from admission
> controller https://issues.redhat.com/browse/RHSTOR-1257, because this will
> provide same experience with CLI and UI.

Right, the admission controller is needed for this. Until then, can another check be added to the UI such as limiting the number of pools created in the UI to 3?

Comment 14 Mudit Agarwal 2020-10-28 05:13:53 UTC
Moving it back to mgmt-console based on https://bugzilla.redhat.com/show_bug.cgi?id=1890135#c13

Comment 15 Shay Rozen 2020-10-28 14:01:41 UTC
Created attachment 1724833 [details]
ceph osd dump & ceph pg dump

Comment 17 Yuval 2020-11-02 12:00:55 UTC
In this case, I would say that the pool was not created because we have reached the PG limit. Does that make sense?

Comment 18 Shay Rozen 2021-04-06 11:23:50 UTC
I have opened https://bugzilla.redhat.com/show_bug.cgi?id=1946243 for the block pool under the ocs-operator page.
Maybe the fix could cover both of them.

Comment 19 Yaniv Kaul 2021-04-06 12:25:19 UTC
(In reply to Travis Nielsen from comment #13)
> (In reply to Kanika Murarka from comment #11)
> > There were some discussion to have PG count validation from admission
> > controller https://issues.redhat.com/browse/RHSTOR-1257, because this will
> > provide same experience with CLI and UI.
> 
> Right, the admission controller is needed for this. Until then, can another
> check be added to the UI such as limiting the number of pools created in the
> UI to 3?

3 sounds very low to me. Are we sure?

Comment 20 Travis Nielsen 2021-04-06 18:56:58 UTC
3 is certainly too limiting. Likely many more can be created; it just depends on the number of OSDs. As suggested in the linked BZ, it sounds better to show an alert telling users they need to expand the cluster, rather than adding a hard limit.

Comment 27 Travis Nielsen 2021-04-07 21:57:45 UTC
@anbehl Yes Rook just started adding events to CRs, so we are planning to add an event to the pools in case of failure too.
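
Once those events land, something like the following should surface the failure reason on the pool itself (a sketch, assuming the openshift-storage namespace):

$ oc -n openshift-storage describe cephblockpool p5
$ oc -n openshift-storage get events --field-selector involvedObject.kind=CephBlockPool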

Comment 34 Red Hat Bugzilla 2023-09-15 00:50:03 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

