Bug 1355723
Summary: Pools list dashboard page provides incorrect storage utilization/capacity data

Product: [Red Hat Storage] Red Hat Storage Console
Component: Ceph
Ceph sub component: configuration
Status: CLOSED EOL
Severity: high
Priority: unspecified
Version: 2
Target Milestone: ---
Target Release: 3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Reporter: Martin Bukatovic <mbukatov>
Assignee: anmol babu <anbabu>
QA Contact: Martin Kudlej <mkudlej>
Docs Contact:
CC: anbabu, mkudlej, nthomas, rghatvis
Fixed In Version: rhscon-ceph-0.0.38-1.el7scon.x86_64
Doc Type: Known Issue
Doc Text:

Pools list in Console displays incorrect storage utilization and capacity data

Pool utilization values are not calculated by Ceph appropriately if there are multiple CRUSH hierarchies. As a result of this:

* Pool utilization values on the dashboard, clusters view, and pool listing page are displayed incorrectly.
* No alerts will be sent if the actual pool utilization surpasses the configured thresholds.
* False alerts might be generated for pool utilization.

This issue occurs only when the user creates multiple storage profiles for a cluster, which in turn creates multiple CRUSH hierarchies. To avoid this problem, include all the OSDs in a single storage profile.
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1360230
Bug Blocks: 1346350
Attachments:
Description
Martin Bukatovic
2016-07-12 10:52:02 UTC
Created attachment 1178853 [details]
screenshot 1: pools list before cluster sync
Created attachment 1178856 [details]
screenshot 2: pools list after cluster sync
Created attachment 1178859 [details]
screenshot 3: pool utilization event detail page
(In reply to Martin Bukatovic from comment #0)
> Description of problem
> ======================
>
> Pools page which provides a list of pools shows information about maximal
> available storage capacity for each pool. But this information is valid only
> when there is no problem with the cluster.
>
> E.g. when (some) OSDs are removed from the cluster map (e.g. in case of some
> problem), the pool item in the Pools list provides incorrect storage
> capacity information.
>
> Version-Release
> ===============
>
> On RHSC 2.0 server:
>
> rhscon-ceph-0.0.27-1.el7scon.x86_64
> rhscon-core-0.0.28-1.el7scon.x86_64
> rhscon-core-selinux-0.0.28-1.el7scon.noarch
> rhscon-ui-0.0.42-1.el7scon.noarch
> ceph-ansible-1.0.5-23.el7scon.noarch
> ceph-installer-1.0.12-3.el7scon.noarch
>
> On Ceph Storage nodes:
>
> rhscon-agent-0.0.13-1.el7scon.noarch
> ceph-osd-10.2.2-5.el7cp.x86_64
>
> How reproducible
> ================
>
> 100 %
>
> Steps to Reproduce
> ==================
>
> 1. Install RHSC 2.0 following the documentation.
>
> 2. Accept a few nodes for the ceph cluster.
>
> 3. Create a new ceph cluster named 'alpha'.
>
> 4. Create 2 RBDs (along with a new backing pool each time) in the cluster.
>
> 5. Check the CRUSH cluster map, make sure it's ok and then make a backup of it:
>
> ~~~
> # ceph --cluster alpha osd getcrushmap -o ceph-crushmap.ok.compiled
> # crushtool -d ceph-crushmap.ok.compiled -o ceph-crushmap.ok
> ~~~
>
> 7. Edit the CRUSH cluster map so that there are no OSDs in the cluster
> hierarchy which is used by the backing pools of the RBDs created in step 4:
>
> ~~~
> # cp ceph-crushmap.ok ceph-crushmap.err01
> # sed -i '/.*item\ osd\.[0-9]\+\ weight\ [0-9\.]\+$/d' ceph-crushmap.err01
> $ sed -i 's/weight\ [0-9\.]\+$/weight 0.000/' ceph-crushmap.err01
> ~~~
>
> So that for example:
>
> ~~~
> # diff ceph-crushmap.ok ceph-crushmap.err01
> 65c65
> < # weight 0.010
> ---
> > # weight 0.000
> 68d67
> < item osd.1 weight 0.010
> 72c71
> < # weight 0.010
> ---
> > # weight 0.000
> 75d73
> < item osd.2 weight 0.010
> 79c77
> < # weight 0.010
> ---
> > # weight 0.000
> 82d79
> < item osd.0 weight 0.010
> 86c83
> < # weight 0.010
> ---
> > # weight 0.000
> 89d85
> < item osd.3 weight 0.010
> 93c89
> < # weight 0.040
> ---
> > # weight 0.000
> 96,99c92,95
> < item mbukatov-usm1-node2.os1.phx2.redhat.com-general weight 0.010
> < item mbukatov-usm1-node3.os1.phx2.redhat.com-general weight 0.010
> < item mbukatov-usm1-node1.os1.phx2.redhat.com-general weight 0.010
> < item mbukatov-usm1-node4.os1.phx2.redhat.com-general weight 0.010
> ---
> > item mbukatov-usm1-node2.os1.phx2.redhat.com-general weight 0.000
> > item mbukatov-usm1-node3.os1.phx2.redhat.com-general weight 0.000
> > item mbukatov-usm1-node1.os1.phx2.redhat.com-general weight 0.000
> > item mbukatov-usm1-node4.os1.phx2.redhat.com-general weight 0.000
> ~~~
>
> 8. Compile the new broken CRUSH map and push it into the cluster:
>
> ~~~
> # crushtool -c ceph-crushmap.err01 -o ceph-crushmap.err01.compiled
> # ceph --cluster alpha osd setcrushmap -i ceph-crushmap.err01.compiled
> ~~~
>
> At this point, `ceph --cluster alpha status` should report something like:
>
> * health HEALTH_WARN
> * recovery ... objects misplaced (100.000%)
>
> 9. Check the output of the `ceph df` command. Since you haven't loaded any
> data into the RBDs, ceph should report that both pools are empty (0 for
> %USED and 0 for MAX AVAIL).
>
> 10. Make sure the sync of the cluster state happened before you go on.
>
> You can either wait for the RHSC 2.0 UI to sync the new cluster state
> automatically. But it means waiting for at least 24 hours (which is
> the current default value of clustersSyncInterval).
>
> Or you can force the sync by restarting the skyring service:
>
> ~~~
> systemctl restart skyring
> ~~~
>
> 11. Check the Pools list page.
>
> Actual results
> ==============
>
> While `ceph df` provides this information:
>
> ~~~
> # ceph --cluster alpha df
> GLOBAL:
>     SIZE       AVAIL      RAW USED     %RAW USED
>     40915M     40760M         155M          0.38
> POOLS:
>     NAME         ID     USED     %USED     MAX AVAIL     OBJECTS
>     rbd_pool     1      7486         0             0          16
>     def_pool     3       114         0             0           4
> ~~~

Here:

* USED --> used space of the pool
* MAX AVAIL --> maximum size available to the pool, i.e. TOTAL - USED

So the total is not provided by the ceph CLI, and USM calculates TOTAL as
USED + MAX AVAIL. Also note that USM executes a JSON-formatted API call to get
the pool stats (`ceph df -f json`), and this output does not provide the
percentage value, so we need to calculate it ourselves, which is done as
USED / TOTAL * 100, where TOTAL is USED + MAX AVAIL.
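As an illustration of the calculation described above, here is a minimal sketch of the stated logic (not the actual rhscon-ceph/skyring code), showing why a pool whose CRUSH hierarchy has lost all its OSDs comes out as 100% used:

~~~
# Minimal sketch of the calculation described above -- an illustration of the
# stated logic, not the actual rhscon-ceph/skyring code. `ceph df -f json`
# reports per-pool used bytes and max available bytes but no percentage, so
# the console derives both the total and the percentage itself.

def pool_utilization(used, max_avail):
    """TOTAL = USED + MAX AVAIL, %USED = USED / TOTAL * 100."""
    total = used + max_avail
    if total == 0:
        return 0, 0.0
    return total, used / total * 100.0

# Healthy pool: 7486 bytes used, ~10 GiB available -> effectively 0% used.
print(pool_utilization(7486, 10 * 1024**3))

# Pool whose CRUSH hierarchy has no OSDs left: ceph reports MAX AVAIL = 0,
# the derived total collapses to USED, and the UI shows
# "7.3 KB of 7.3 KB used, 100.0%".
print(pool_utilization(7486, 0))
~~~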
> The *Pools* page showed different storage capacity data for both pools:
>
> * rbd_pool: 100.0%, 7.3 KB of 7.3 KB used
> * def_pool: 100.0%, 114.0 B of 114.0 B used
>
> (see screenshot #2)
>
> When we ignore the wrong units there for now (there is another BZ for that,
> see https://bugzilla.redhat.com/show_bug.cgi?id=1340747#c4), the data
> presented there doesn't match the data reported by `ceph df`:
>
> * '%USED' should not be reported as 100% by the UI (ceph df still shows 0%,
>   moreover both pools are essentially empty).
> * MAX AVAIL is not shown in the pool list by the UI, and the
>   `7.3 KB of 7.3 KB used` statement is misleading in this case.
>
> The related event states:
>
> > Pool utilization for pool rbd_pool on alpha cluster has moved to CRITICAL
>
> which, while technically true, doesn't cover the actual issue in any way,
> and the admin is again forced to use the ceph command line tools to debug
> the problem.
>
> Expected results
> ================
>
> The storage capacity information should not conflict with the information
> provided by the `ceph df` command.
>
> The statement "7.3 KB of 7.3 KB used" would be better replaced by presenting
> the 'USED' and 'MAX AVAIL' values instead.
>
> Related events should better describe the problem. At least the information
> about the zero value of 'MAX AVAIL' should be conveyed.

I have explained the logic above and it is clear that USM is doing the
calculation as expected, so I don't treat this as an issue. We can discuss it
in the bug scrub meeting and take a call.

> Additional info
> ===============
>
> Besides the problem with interpreting the storage capacity data, the sync
> interval itself makes things harder to understand as well:
>
> * See screenshot #1, which I created after I did all steps from the "Steps
>   to Reproduce" section but without restarting skyring, so that the cluster
>   state is still not synced.
> * And compare that with screenshot #2, which shows the Pools list page after
>   the skyring restart (which triggered the sync).

We now pick the utilization from the normal CLI command `ceph df` instead of
its JSON form. So the percentage used and the used size are now the same as
what the CLI returns, and the total is calculated as follows (as per sjust's
suggestions):

    POOL TOTAL SIZE = MAX AVAIL / (.01 * (100 - %USED))

where MAX AVAIL and %USED are taken directly from the `ceph df` CLI output.
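To make the revised derivation concrete, here is a small sketch (illustrative only, not the shipped rhscon-ceph code) of the formula above applied to values taken from the plain `ceph df` output:

~~~
# Illustrative sketch of the revised derivation above -- not the shipped
# rhscon-ceph code. MAX AVAIL and %USED are read from the plain `ceph df`
# output and the pool total is recovered from them.

def pool_total(max_avail, percent_used):
    """POOL TOTAL SIZE = MAX AVAIL / (.01 * (100 - %USED))"""
    remaining = 0.01 * (100.0 - percent_used)
    if remaining <= 0:
        # Degenerate case (%USED == 100): the formula cannot recover a total.
        return float("nan")
    return max_avail / remaining

# Example: MAX AVAIL = 3 GiB and %USED = 70 -> total = 3 GiB / 0.30 = 10 GiB,
# consistent with 7 GiB used out of a 10 GiB pool.
print(pool_total(3 * 1024**3, 70.0) / 1024**3)   # 10.0
~~~

Note that this only recovers a meaningful total if %USED is relative to the pool's own capacity; the verification comments below show where that assumption breaks.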
As you can see on the screenshots, the utilization charts are not correct.

My configuration:

* 1 cluster
* 2 storage profiles: default - 2 OSDs, user defined - 2 OSDs
* 2 pools - double replication each; one on each storage profile
  (pool1 on default, pool2 on user_defined)
* each OSD has 10 GB

Please check the screenshots.

Tested with:

ceph-ansible-1.0.5-31.el7scon.noarch
ceph-installer-1.0.14-1.el7scon.noarch
rhscon-ceph-0.0.39-1.el7scon.x86_64
rhscon-core-0.0.39-1.el7scon.x86_64
rhscon-core-selinux-0.0.39-1.el7scon.noarch
rhscon-ui-0.0.51-1.el7scon.noarch

Created attachment 1186442 [details]
build 39 list of pools
As you can see, used is bigger than the total size and the % utilization is not correct.
There are 7 objects of 1 GB each in the double-replicated pool.
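For orientation, a rough back-of-the-envelope estimate of what this setup should report (ideal numbers only, ignoring filesystem overhead and full ratios; the real `ceph df` values will differ somewhat):

~~~
# Rough estimate for the configuration above: the pool's storage profile has
# 2 OSDs of 10 GB, and the pool holds 7 objects of 1 GB at 2x replication.
# Ideal numbers only -- ignores filesystem overhead and full ratios.

raw_capacity = 2 * 10              # 20 GB raw in the pool's hierarchy
raw_used = 7 * 1 * 2               # 14 GB raw consumed by the replicas
usable_total = raw_capacity / 2    # ~10 GB usable for the pool at 2x
used = 7 * 1                       # 7 GB of user data

print(raw_used, "GB raw of", raw_capacity, "GB raw")        # 14 of 20
print(used, "GB used of ~", usable_total, "GB,",
      round(100.0 * used / usable_total), "% utilization")  # 7 of ~10, 70 %
~~~

So the pool should show roughly 7 GB used out of about 10 GB (around 70%); any listing where used exceeds the total indicates the derived total is wrong.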
Created attachment 1186443 [details]
build 39 - main dashboard
Created attachment 1186444 [details]
build 39 - cluster dashboard
Created attachment 1186445 [details]
build 39 - cluster list
%USED doesn't show how much of the space for the pool is used, but how much of
the space of the cluster is used in the pool. With that said, if I have
several hierarchies, the number is not correct. (At least not correct for
calculating the total size of the pool from it.)

Looks good to me.

This product is EOL now.
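To illustrate the point above with hypothetical numbers (two storage profiles of 2 x 10 GB OSDs, so 40 GB raw cluster-wide, with the pool confined to one 20 GB hierarchy): if %USED is effectively relative to the whole cluster rather than the pool's own hierarchy, the total recovered by `MAX AVAIL / (.01 * (100 - %USED))` comes out below the used size:

~~~
# Hypothetical numbers matching the reported setup: 40 GB raw cluster-wide,
# pool confined to a 20 GB hierarchy, 7 GB of user data at 2x replication,
# so roughly MAX AVAIL = (20 - 14) / 2 = 3 GB.

max_avail = 3.0
used = 7.0

# %USED relative to the pool's own hierarchy -- what the formula assumes:
pct_pool = 100.0 * used / (used + max_avail)       # 70.0
print(max_avail / (0.01 * (100 - pct_pool)))       # 10.0 GB total, consistent

# %USED effectively relative to the whole 40 GB cluster (multiple
# hierarchies; the exact numerator depends on the Ceph version, but either
# way the percentage is far too small):
pct_cluster = 100.0 * used / 40.0                  # 17.5
print(max_avail / (0.01 * (100 - pct_cluster)))    # ~3.6 GB "total", i.e.
                                                   # less than the 7 GB used
~~~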