Created attachment 1723231 [details]
failed pool creation

Description of problem (please be detailed as possible and provide log snippets):
When creating multiple storage classes with new pools via the UI, the third or fourth pool creation fails due to the PG limit, but "oc get cephblockpool" still lists the pool even though it was never created at the Ceph level.

Version of all relevant components (if applicable):
4.6.0-128c

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced? yes

Can this issue be reproduced from the UI? yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install an OCS cluster
2. On the storage class page in the UI, create 3-4 storage classes with new pools

Actual results:
The third or fourth pool fails to create with the error "An error occurred, Pool "p5" was not created" (see attached image). Checking via the CLI with "oc get cephblockpool" shows the pool, but at the Ceph level the pool does not exist.

Expected results:
"oc get cephblockpool" should not show a pool which was not created.

Additional info:

ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    hdd       768 GiB     544 GiB     221 GiB     224 GiB      29.10
    TOTAL     768 GiB     544 GiB     221 GiB     224 GiB      29.10

POOLS:
    POOL                                                      ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    ocs-storagecluster-cephblockpool                           1     73 GiB      19.13k      220 GiB     33.92     143 GiB
    ocs-storagecluster-cephobjectstore.rgw.control             2     0 B         8           0 B         0         143 GiB
    ocs-storagecluster-cephfilesystem-metadata                 3     146 KiB     25          2.1 MiB     0         143 GiB
    ocs-storagecluster-cephfilesystem-data0                    4     158 B       1           192 KiB     0         143 GiB
    ocs-storagecluster-cephobjectstore.rgw.meta                5     3.0 KiB     12          1.9 MiB     0         143 GiB
    ocs-storagecluster-cephobjectstore.rgw.log                 6     159 KiB     211         6.5 MiB     0         143 GiB
    ocs-storagecluster-cephobjectstore.rgw.buckets.index       7     0 B         22          0 B         0         143 GiB
    ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec      8     0 B         0           0 B         0         143 GiB
    .rgw.root                                                  9     4.7 KiB     16          2.8 MiB     0         143 GiB
    ocs-storagecluster-cephobjectstore.rgw.buckets.data       10     1 KiB       1           192 KiB     0         143 GiB
    p1                                                        20     0 B         0           0 B         0         215 GiB
    p2                                                        21     0 B         0           0 B         0         215 GiB
    p3                                                        22     0 B         0           0 B         0         215 GiB

$ oc get cephblockpools.ceph.rook.io
NAME                               AGE
ocs-storagecluster-cephblockpool   7d22h
p1                                 52m
p2                                 51m
p3                                 51m
p4                                 50m
p5                                 45m
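[Editor's note: a minimal way to confirm the mismatch described above, assuming the default openshift-storage namespace and that the rook-ceph-tools toolbox deployment is enabled under its usual name; "p5" is the pool from this report.]

# List the CephBlockPool objects that Kubernetes knows about
$ oc -n openshift-storage get cephblockpools.ceph.rook.io

# List the pools that actually exist at the Ceph level, via the toolbox
$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd pool ls

# In the state reported here, p4 and p5 appear in the first listing
# but not in the second.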
The UI shows the error message if the CephBlockPool object created by the user gets into a failed state (due to some error during creation on the Ceph side). So even if the pool creation fails at Ceph, the corresponding CephBlockPool k8s object will still exist in the CephBlockPool list (in failed status).
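[Editor's note: as a sketch of what that failed status looks like, the CR itself can be inspected; the exact status field names depend on the Rook version in use.]

$ oc -n openshift-storage get cephblockpool p5 -o yaml
# Inspect the .status section for the failure phase and message;
# the precise fields vary by Rook release.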
Nishanth, Travis, any chance this BZ is similar to bug 1748001?
Also, the fact that the pool is not cleaned up and keeps being listed by "oc get cephblockpool" is a bug. Therefore, re-opening. Nishanth/Kanika, feel free to move this to the correct component.
Moving the BZ to Rook for Travis to take a look. This doesn't belong to the UI.
There are a couple of approaches to this issue:
1. The UI could stop allowing pool creation once the PG limit is hit.
2. Rook could automatically increase the PG limit if pool creation hits the limit.

#1 is disruptive to the user creation of pools, so clearly #2 is preferred.

@Josh Any concern with Rook increasing the max PG count automatically when needed to create a new pool? OCS users don't know anything about PG management.
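[Editor's note: for context, the limit in question is the mon's per-OSD PG cap, mon_max_pg_per_osd. A sketch of what approach #2 would amount to, run from the toolbox; the value 400 is only an example, and on older releases "ceph config get" may need a specific daemon name such as mon.a.]

# Show the current per-OSD PG cap enforced at pool creation time
$ ceph config get mon mon_max_pg_per_osd

# What approach #2 would boil down to: raising the cap, e.g. to 400
$ ceph config set mon mon_max_pg_per_osd 400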
Shay, please paste the output of the following:
ceph osd dump
ceph pg dump

Or else, provide the must-gather logs. Thanks
(In reply to Travis Nielsen from comment #7)
> There are a couple of approaches to this issue:
> 1. The UI could stop allowing pool creation once the PG limit is hit.
> 2. Rook could automatically increase the PG limit if pool creation hits the limit.
>
> #1 is disruptive to the user creation of pools, so clearly #2 is preferred.
>
> @Josh Any concern with Rook increasing the max PG count automatically when needed to create a new pool? OCS users don't know anything about PG management.

Yes, PGs take up finite memory/CPU resources, so OCS should prevent users from taking up too many. Limiting the number of pools in the UI makes sense to me, and I thought that was the direction we were already headed for this.
I do remember now that we were going to limit the number of pools. Creating three additional pools should cover production needs for now.

@Nithin @Eran In 4.6 can we restrict the number of pools created in the UI to 3 (or 4 if it's allowed)? Let's see in 4.6 if this is sufficient, or what feedback we get from customers.

@Josh remind me... in larger clusters could we support more PGs? Some customer will surely want to create more pools at some point.
There was some discussion about having PG count validation in the admission controller (https://issues.redhat.com/browse/RHSTOR-1257), because this would provide the same experience between the CLI and the UI.
(In reply to Travis Nielsen from comment #10)
> I do remember now that we were going to limit the number of pools. Creating three additional pools should cover production needs for now.
>
> @Nithin @Eran In 4.6 can we restrict the number of pools created in the UI to 3 (or 4 if it's allowed)? Let's see in 4.6 if this is sufficient, or what feedback we get from customers.
>
> @Josh remind me... in larger clusters could we support more PGs? Some customer will surely want to create more pools at some point.

Yes, we're running into this particularly with OCS due to the small initial size. We target 100 PGs per OSD, with a hard cutoff at 300 by default. A larger cluster will allow more pools before hitting this limit.
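[Editor's note: a rough worked example of that budget, using the 300-PG-per-OSD cutoff Josh describes; the 3-OSD count, replica-3 size, and pg_num 32 are illustrative assumptions, not confirmed values from this cluster.]

# At the hard cutoff: 3 OSDs * 300 PGs/OSD = 900 PG replicas total.
# Each replica-3 pool created with pg_num 32 consumes 32 * 3 = 96 PG
# replicas, i.e. 32 PGs on every OSD, so only a handful of extra pools
# fit on top of the ~10 pools OCS creates by default.

# The PGS column here shows how close each OSD already is to the cutoff:
$ ceph osd df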
(In reply to Kanika Murarka from comment #11)
> There was some discussion about having PG count validation in the admission controller (https://issues.redhat.com/browse/RHSTOR-1257), because this would provide the same experience between the CLI and the UI.

Right, the admission controller is needed for this. Until then, can another check be added to the UI, such as limiting the number of pools created in the UI to 3?
Moving it back to mgmt-console based on https://bugzilla.redhat.com/show_bug.cgi?id=1890135#c13
Created attachment 1724833 [details]
ceph osd dump & ceph pg dump
In this case, I would say that the pool was not created because we have reached the PG limit. Does that make sense?
I have opened https://bugzilla.redhat.com/show_bug.cgi?id=1946243 for the block pool page under the ocs-operator. Maybe the fix could cover both of them.
(In reply to Travis Nielsen from comment #13)
> (In reply to Kanika Murarka from comment #11)
> > There was some discussion about having PG count validation in the admission controller (https://issues.redhat.com/browse/RHSTOR-1257), because this would provide the same experience between the CLI and the UI.
>
> Right, the admission controller is needed for this. Until then, can another check be added to the UI, such as limiting the number of pools created in the UI to 3?

3 sounds very low to me. Are we sure?
3 is certainly too limiting. Likely many more can be created; it just depends on the number of OSDs. As suggested in the linked BZ, it sounds better to show an alert that tells them they need to expand the cluster, rather than adding a hard limit.
@anbehl Yes Rook just started adding events to CRs, so we are planning to add an event to the pools in case of failure too.
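[Editor's note: once those events land, a creation failure should be visible without digging into the CR YAML; a sketch, assuming the pool name from this report and the default namespace.]

# Events attached to the CR (including a creation-failure event, once
# Rook emits one) show up in the Events section here:
$ oc -n openshift-storage describe cephblockpool p5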
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days