Bug 1358267
Summary: | Wrong size and utilization of pool | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Storage Console | Reporter: | Martin Kudlej <mkudlej> |
Component: | core | Assignee: | anmol babu <anbabu> |
core sub component: | monitoring | QA Contact: | Filip Balák <fbalak> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | unspecified | ||
Priority: | unspecified | CC: | anbabu, asriram, fbalak, japplewh, ltrilety, nthomas, rghatvis, shtripat, vsarmila |
Version: | 2 | Keywords: | TestBlocker |
Target Milestone: | --- | ||
Target Release: | 2 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | rhscon-ceph-0.0.41-1.el7scon.x86_64 | Doc Type: | Bug Fix |
Doc Text: |
Previously, the pools list in the Console displayed incorrect storage utilization and capacity data when multiple CRUSH hierarchies were present. As a result:
* Pool utilization values on the dashboard, the clusters view, and the pool listing page were displayed incorrectly.
* No alerts were sent when the actual pool utilization surpassed the configured thresholds.
* False alerts might have been generated for pool utilization.
With this update, the pools list in the Console displays the correct storage utilization and capacity data.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2016-10-19 15:20:24 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1360230, 1377269 | ||
Bug Blocks: | 1346350, 1357777, 1357845 |
Description
Martin Kudlej
2016-07-20 11:57:21 UTC
I still see this with:

~~~
ceph-ansible-1.0.5-31.el7scon.noarch
ceph-installer-1.0.14-1.el7scon.noarch
rhscon-ceph-0.0.34-1.el7scon.x86_64
rhscon-core-0.0.35-1.el7scon.x86_64
rhscon-core-selinux-0.0.35-1.el7scon.noarch
rhscon-ui-0.0.49-1.el7scon.noarch
~~~

~~~
# ./clusters.sh | jq .
[
  {
    "storageid": "00875c43-6221-4922-bc5a-e5499a7cae97",
    "name": "klkl",
    "type": "",
    "tags": [],
    "clusterid": "af45a20f-b0a6-4c6a-952a-9145715e53f4",
    "size": "5GB",
    "status": 0,
    "replicas": 2,
    "profile": "default",
    "snapshots_enabled": false,
    "snapshot_schedule_ids": [],
    "quota_enabled": false,
    "quota_params": {},
    "options": {
      "crash_replay_interval": "0",
      "crush_ruleset": "0",
      "full": "false",
      "hashpspool": "true",
      "id": "1",
      "min_size": "1",
      "pg_num": "128",
      "pgp_num": "128"
    },
    "usage": {
      "used": 9810477056,
      "total": 876523520,
      "percentused": 1119.2485805743124,
      "updatedat": "2016-07-21 14:01:53.359976493 +0200 CEST"
    },
    "state": "",
    "almstatus": 1,
    "almwarncount": 0,
    "almcritcount": 1,
    "slus": [
      "9ad26031-c48c-44d5-94b7-4fea54f0a18a",
      "c2b85ce3-eaff-4b7c-8b3a-74da1216056f"
    ]
  }
]
~~~

Tested on:

~~~
rhscon-core-0.0.36-1.el7scon.x86_64
rhscon-ui-0.0.50-1.el7scon.noarch
rhscon-ceph-0.0.36-1.el7scon.x86_64
rhscon-core-selinux-0.0.36-1.el7scon.noarch
~~~

~~~
{
  "storageid": "f268a5d8-66be-42f2-af31-f16dd6f7c55f",
  "name": "default",
  "type": "",
  "tags": [],
  "clusterid": "712ba28f-9840-4b5d-b19f-2ef2631ee762",
  "size": "5GB",
  "status": 0,
  "replicas": 4,
  "profile": "default",
  "snapshots_enabled": false,
  "snapshot_schedule_ids": [],
  "quota_enabled": false,
  "quota_params": {},
  "options": {
    "crash_replay_interval": "0",
    "crush_ruleset": "0",
    "full": "false",
    "hashpspool": "true",
    "id": "2",
    "min_size": "2",
    "pg_num": "128",
    "pgp_num": "128"
  },
  "usage": {
    "used": 5242880000,
    "total": 12015363917,
    "percentused": 43.63479987969473,
    "updatedat": "2016-07-25 15:30:28.211639431 +0200 CEST"
  },
  "state": "",
  "almstatus": 4,
  "almwarncount": 1,
  "almcritcount": 0,
  "slus": [
    "99ff63c0-974c-4737-b016-c4d4ef8488f8",
    "cf944995-4ccd-4311-be8a-551519120e9c",
    "4c5f6d14-68be-430c-8e35-75907fef7bbe",
    "32b6f092-e5c7-4c9f-9f1a-053040371c4e",
    "e71cd79a-1f7b-4803-a442-4267322599ac"
  ]
}
~~~

~~~
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    92060M     68733M     23327M       25.34
POOLS:
    NAME        ID     USED      %USED     MAX AVAIL     OBJECTS
    ectest      1      2000M     3.26      18385M        2
    default     2      5000M     21.72     6458M         5
~~~

We depend on the JSON output of the `ceph df` command, and that does not have the information about % used. So we calculate it in skyring as follows:

    Percent Used = (USED * 100) / (USED + MAX AVAIL)

But this value doesn't match the one provided by the `ceph df` output. There is confusion around how to calculate the utilization percent of a pool. I have raised a bug against Ceph to get clarity about this: https://bugzilla.redhat.com/show_bug.cgi?id=1360230

If you decide to reuse the standard output of `ceph df`, you may need to tweak the
template displayed in the pool list.
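The skyring calculation quoted above can be sketched in a few lines. This is an illustrative sketch only (not skyring source code; the function name is mine), using the `ectest` numbers from the `ceph df` output to show the mismatch being reported:

```python
# Sketch of the percent-used calculation described above:
#   Percent Used = (USED * 100) / (USED + MAX AVAIL)
def percent_used(used_mb, max_avail_mb):
    """Utilization as computed from USED and MAX AVAIL (both in MB)."""
    return (used_mb * 100.0) / (used_mb + max_avail_mb)

# 'ectest' pool from the `ceph df` output above:
# USED = 2000M, MAX AVAIL = 18385M, while `ceph df` itself prints %USED = 3.26.
calculated = percent_used(2000, 18385)
print(round(calculated, 2))  # 9.81 -- does not match the 3.26 shown by `ceph df`
```

The gap arises because `ceph df` reports `%USED` relative to the raw cluster capacity, not to `USED + MAX AVAIL` for the pool.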
Right now, the template from the pools list states something like:
> 2.0 GB out of 10.0 GB used
While the `ceph df` would show:
~~~
# ceph --cluster beta df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    20457M     16292M     4165M        20.36
POOLS:
    NAME             ID     USED      %USED     MAX AVAIL     OBJECTS
    beta_01_pool     1      2048M     20.02     8146M         2
~~~
So if you decide to reuse `ceph df` completely, you need to rework the
message, because showing e.g. `2048M of 8146M` would not make sense.
(In reply to Lubos Trilety from comment #5)
> BTW 'rados df' gives better information about the pool than 'ceph df':
>
> # rados df -p default
> pool name       KB        objects  clones  degraded  unfound  rd  rd KB  wr    wr KB
> default         5242880   5        0       0         0        0   0      1280  5242880
> total used      21258036  5
> total avail     62537772
> total space     83795808
>
> I am pretty sure we could use used, avail and total space without any
> problem. With that said, the percentage of used makes sense.

In my case it was:

~~~
# ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    81831M     61072M     20759M       25.37
POOLS:
    NAME        ID     USED      %USED     MAX AVAIL     OBJECTS
    default     1      5120M     25.03     5072M         5
~~~

However, the total space is not correct, as it seems all disks are counted, not just those which are part of the used hierarchy. In my case I have a pool with replication set to 4 and the default storage profile with 4 OSDs.

~~~
# ceph osd tree
ID  WEIGHT  TYPE NAME                      UP/DOWN  REWEIGHT  PRIMARY-AFFINITY
-11 0.00999 root test
-10 0.00999     host dhcp-126-101-test
  0 0.00999         osd.0                       up   1.00000           1.00000
 -9 0.02998 root ec_test
 -6 0.00999     host dhcp-126-103-ec_test
  4 0.00999         osd.4                       up   1.00000           1.00000
 -7 0.00999     host dhcp-126-102-ec_test
  3 0.00999         osd.3                       up   1.00000           1.00000
 -8 0.00999     host dhcp-126-105-ec_test
  6 0.00999         osd.6                       up   1.00000           1.00000
 -1 0.03998 root default
 -2 0.00999     host dhcp-126-101
  1 0.00999         osd.1                       up   1.00000           1.00000
 -3 0.00999     host dhcp-126-102
  2 0.00999         osd.2                       up   1.00000           1.00000
 -4 0.00999     host dhcp-126-103
  5 0.00999         osd.5                       up   1.00000           1.00000
 -5 0.00999     host dhcp-126-105
  7 0.00999         osd.7                       up   1.00000           1.00000
~~~

All my OSDs have 10GB of space. So I agree that what USM is reporting is more correct than what Ceph says. From that point of view, the issue is more on the Ceph side than on the USM side.

We now pick the utilization from the normal CLI command `ceph df` instead of its JSON form.
So the percentage used and used size are now the same as what the CLI returns, and the total calculation is as follows (as per sjust's suggestions):

    POOL TOTAL SIZE = MAX AVAIL / (.01 * (100 - %USED))

where MAX AVAIL and %USED are taken directly from the `ceph df` CLI output.

%USED doesn't show how much of the space for the pool is used, but how much of the space of the cluster is used in the pool. With that said, if I have several hierarchies the number is not correct. (At least not correct for calculating the total size of a pool from it.) Reopening BZ 1360230.

(In reply to anmol babu from comment #8)
> We now pick the utilization from the normal cli command ceph df instead of
> its json form. So the percentage used and used size is now the same as what
> the cli returns and the total calculation is as follows (as per sjust's
> suggestions):
> POOL TOTAL SIZE = MAX AVAIL / (.01 * (100 - %USED))
> where MAX AVAIL and %USED are directly from ceph df cli o/p.

BTW your calculation for POOL TOTAL SIZE is wrong, as MAX AVAIL counts with other pools as well. E.g. if I have two pools in the same hierarchy (the same storage profile) and I fill one pool with some data, MAX AVAIL for the other pool will change as well, because the free space in the hierarchy is the same for both pools.

My configuration: 10GB OSDs, two pools with the same replication of 4, one without replication. One pool is filled with some data.

~~~
# ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    81831M     52873M     28957M       35.39
POOLS:
    NAME          ID     USED      %USED     MAX AVAIL     OBJECTS
    testpool      1      7168M     35.04     3023M         7
    testpool2     4      0         0         3023M         0
    norep         2      0         0         12093M        0
~~~

As you can see, MAX AVAIL was changed for all of them, not just the one with some data.

I forgot to add that in my configuration pools are created on the default hierarchy with 4 OSDs.
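The proposed total-size formula and the objection to it can be checked numerically. Below is an illustrative sketch (not skyring source code; the function name is mine), using the `testpool` numbers from the `ceph df` output above:

```python
# Sketch of the proposed formula:
#   POOL TOTAL SIZE = MAX AVAIL / (.01 * (100 - %USED))
# It is derived from: total = used + avail, with %USED = used / total * 100.
# Because `ceph df` reports %USED relative to the cluster (not the pool),
# the result can disagree with USED + MAX AVAIL.
def pool_total_size(max_avail_mb, percent_used):
    """Total pool size (MB) implied by MAX AVAIL and %USED from `ceph df`."""
    return max_avail_mb / (0.01 * (100 - percent_used))

# 'testpool' from the `ceph df` output above:
# USED = 7168M, %USED = 35.04, MAX AVAIL = 3023M
total = pool_total_size(3023, 35.04)
print(round(total, 1))  # 4653.6 -- smaller than USED (7168M) alone, so inconsistent
```

The inconsistency (a computed "total" below the pool's own USED) is exactly what happens when the cluster-relative `%USED` is plugged into a pool-relative formula.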
~~~
# ceph osd tree
ID  WEIGHT  TYPE NAME                     UP/DOWN  REWEIGHT  PRIMARY-AFFINITY
-11 0.00999 root test
-10 0.00999     host dhcp-126-101-test
  0 0.00999         osd.0                      up   1.00000           1.00000
 -9 0.02998 root ectest
 -6 0.00999     host dhcp-126-105-ectest
  6 0.00999         osd.6                      up   1.00000           1.00000
 -7 0.00999     host dhcp-126-102-ectest
  2 0.00999         osd.2                      up   1.00000           1.00000
 -8 0.00999     host dhcp-126-103-ectest
  4 0.00999         osd.4                      up   1.00000           1.00000
 -1 0.03998 root default
 -2 0.00999     host dhcp-126-101
  1 0.00999         osd.1                      up   1.00000           1.00000
 -3 0.00999     host dhcp-126-102
  3 0.00999         osd.3                      up   1.00000           1.00000
 -4 0.00999     host dhcp-126-103
  5 0.00999         osd.5                      up   1.00000           1.00000
 -5 0.00999     host dhcp-126-105
  7 0.00999         osd.7                      up   1.00000           1.00000
~~~

Looks good to me.

Tested with the latest Ceph builds (ceph-mon-10.2.2-39.el7cp.x86_64) and it works as expected. The dependent bug (https://bugzilla.redhat.com/show_bug.cgi?id=1360230) is also marked as verified.

Tested with:

Server:
~~~
ceph-ansible-1.0.5-33.el7scon.noarch
ceph-installer-1.0.15-2.el7scon.noarch
graphite-web-0.9.12-8.1.el7.noarch
rhscon-ceph-0.0.42-1.el7scon.x86_64
rhscon-core-selinux-0.0.43-1.el7scon.noarch
rhscon-core-0.0.43-1.el7scon.x86_64
rhscon-ui-0.0.57-1.el7scon.noarch
~~~

Node:
~~~
calamari-server-1.4.8-1.el7cp.x86_64
ceph-base-10.2.2-41.el7cp.x86_64
ceph-common-10.2.2-41.el7cp.x86_64
ceph-mon-10.2.2-41.el7cp.x86_64
ceph-osd-10.2.2-41.el7cp.x86_64
ceph-selinux-10.2.2-41.el7cp.x86_64
libcephfs1-10.2.2-41.el7cp.x86_64
python-cephfs-10.2.2-41.el7cp.x86_64
rhscon-agent-0.0.19-1.el7scon.noarch
rhscon-core-selinux-0.0.43-1.el7scon.noarch
~~~

It works as expected. --> Verified

Hi Anmol,

I have edited the doc text for this bug. Kindly review and approve the text to be included in the async errata.

Regards,
Anjana

Looks good to me.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:2082