Bug 1349513
Summary: Pool PGs stuck in creating state in ceph

| Field | Value |
|---|---|
| Product: | [Red Hat Storage] Red Hat Storage Console |
| Component: | unclassified |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | unspecified |
| Version: | 2 |
| Target Milestone: | --- |
| Target Release: | 2 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | Lubos Trilety <ltrilety> |
| Assignee: | Nishanth Thomas <nthomas> |
| QA Contact: | Lubos Trilety <ltrilety> |
| CC: | bjq1016, gmeno, ltrilety, mbukatov, mkudlej, sankarshan |
| Keywords: | TestBlocker |
| Fixed In Version: | rhscon-core-0.0.34-1.el7scon.x86_64 rhscon-ceph-0.0.33-1.el7scon.x86_64 rhscon-ui-0.0.47-1.el7scon.noarch |
| Doc Type: | If docs needed, set a value |
| Type: | Bug |
| Last Closed: | 2016-08-23 19:55:56 UTC |
| Bug Blocks: | 1353450 |
Description (Lubos Trilety, 2016-06-23 15:01:17 UTC)
Pools can be created with ceph commands, and new objects can be created in them. In my setup the UI wanted to create a pool with 128 PGs, but those PGs were never created; they stayed stuck in the creating or creating+peering state. I tried to make a pool using the 'ceph osd pool create <pool_name> <PG_number>' command with 8000 PGs. It took a long time, but it succeeded, so the number of PGs is not the issue.

(Gregory Meno, comment #2)

This seems to me to have the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1329190

Would you please try going through these steps:
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-constraints-cannot-be-satisfied

Additionally, provide the crush map, the output of "ceph health detail", and access to the environment where this issue is present.
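As an aside, PGs stuck as described also show up in the `ceph health detail` / `ceph pg dump_stuck inactive` output requested above. A minimal sketch of filtering for them; the sample lines stand in for live output, whose exact column layout is an assumption here:

```shell
# Print PG ids whose state contains "creating".
# On a live cluster the input would come from something like:
#   ceph pg dump_stuck inactive
# Sample lines in "pgid state" form stand in for real output.
list_creating() {
    awk '$2 ~ /creating/ { print $1 }'
}

printf '%s\n' \
    '1.0 creating+peering' \
    '1.1 active+clean' \
    '1.2 creating' \
| list_creating
# prints: 1.0 and 1.2, each on its own line
```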
(In reply to Gregory Meno from comment #2)

All our environments have the same issue. An example from mine: I have 4 OSDs, 2 in the general storage profile and 2 in my new_profile.

```shell
$ ceph osd crush rule ls
[
    "replicated_ruleset",
    "general",
    "new_profile"
]
$ ceph osd crush rule dump general
{
    "rule_id": 1,
    "rule_name": "general",
    "ruleset": 1,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -8,
            "item_name": "general"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
$ ceph osd getcrushmap > crush.map
$ crushtool -i crush.map --test --show-bad-mappings --rule 1 --num-rep 2 --min-x 1 --max-x $((1024 * 1024))
bad mapping rule 2 x 1 num_rep 2 result []
bad mapping rule 2 x 2 num_rep 2 result []
bad mapping rule 2 x 3 num_rep 2 result []
bad mapping rule 2 x 4 num_rep 2 result []
bad mapping rule 2 x 5 num_rep 2 result []
bad mapping rule 2 x 6 num_rep 2 result []
...
```

and so on; all of them are bad, and the same holds for any other num-rep. Just to be sure that the problem is not somewhere else, I ran the same command for the default replicated crush rule, replicated_ruleset:

```shell
$ crushtool -i crush.map --test --show-bad-mappings --rule 0 --num-rep 2 --min-x 1 --max-x $((1024 * 1024))
$
```

No problem here.

```shell
$ crushtool --decompile crush.map > crush.txt
$ cat crush.txt
...
# buckets
host dhcp-126-101 {
    id -2        # do not change unnecessarily
    # weight 0.010
    alg straw
    hash 0       # rjenkins1
    item osd.0 weight 0.010
}
...
root default {
    id -1        # do not change unnecessarily
    # weight 0.040
    alg straw
    hash 0       # rjenkins1
    item dhcp-126-101 weight 0.010
    item dhcp-126-102 weight 0.010
    item dhcp-126-103 weight 0.010
    item dhcp-126-105 weight 0.010
}
host dhcp-126-101.lab.eng.brq.redhat.com-general {
    id -6        # do not change unnecessarily
    # weight 0.000
    alg straw
    hash 0       # rjenkins1
}
...
root general {
    id -8        # do not change unnecessarily
    # weight 0.000
    alg straw
    hash 0       # rjenkins1
    item dhcp-126-101.lab.eng.brq.redhat.com-general weight 0.000
    item dhcp-126-102.lab.eng.brq.redhat.com-general weight 0.000
}
...
```
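Each failed mapping in the transcripts above is one `bad mapping` line, and with x ranging over a million values the output is easier to digest as a count. A small sketch; sample lines stand in for real `crushtool` output, since running it needs the map file:

```shell
# Count "bad mapping" lines emitted by e.g.:
#   crushtool -i crush.map --test --show-bad-mappings --rule 1 --num-rep 2 \
#       --min-x 1 --max-x $((1024 * 1024))
# A count of 0 means every tested input mapped successfully.
# (`|| true` because grep -c exits non-zero when the count is 0.)
count_bad_mappings() {
    grep -c '^bad mapping' || true
}

printf '%s\n' \
    'bad mapping rule 2 x 1 num_rep 2 result []' \
    'bad mapping rule 2 x 2 num_rep 2 result []' \
| count_bad_mappings
# prints: 2
```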
I noticed that, for some unknown reason, general uses different hosts than default. I tried changing them to the same ones, i.e. dhcp-126-101 and dhcp-126-102.

```shell
$ crushtool --compile crush.txt -o better-crush.map
$ crushtool -i better-crush.map --test --show-bad-mappings --rule 1 --num-rep 1 --min-x 1 --max-x $((1024 * 1024)) | more
$ crushtool -i better-crush.map --test --show-bad-mappings --rule 1 --num-rep 2 --min-x 1 --max-x $((1024 * 1024))
bad mapping rule 1 x 1 num_rep 2 result [0]
bad mapping rule 1 x 2 num_rep 2 result [0]
bad mapping rule 1 x 3 num_rep 2 result [0]
bad mapping rule 1 x 4 num_rep 2 result [0]
bad mapping rule 1 x 5 num_rep 2 result [0]
...
```

There were still some issues, so I also changed the weights to the same value as for default, i.e. 0.010:

```shell
$ crushtool --compile crush.txt -o better-crush.map
$ crushtool -i better-crush.map --test --show-bad-mappings --rule 1 --num-rep 1 --min-x 1 --max-x $((1024 * 1024)) | more
$ crushtool -i better-crush.map --test --show-bad-mappings --rule 1 --num-rep 2 --min-x 1 --max-x $((1024 * 1024)) | more
$
```

No problems with these settings. Original crush map attached.

Created attachment 1171871 [details]: crush map
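The weight part of the manual edit above can also be scripted. A sketch, assuming the decompiled map is in crush.txt and using the 0.010 target weight taken from the default root; the sed expression is illustrative, covers only the weight change (not the host membership change), and the result should always be re-tested with crushtool before injecting it back with `ceph osd setcrushmap`:

```shell
# Rewrite "weight 0.000" to "weight 0.010" on item lines that reference
# a *-general bucket, mirroring the manual fix described above.
fix_general_weights() {
    sed 's/^\([[:space:]]*item .*-general\) weight 0\.000$/\1 weight 0.010/'
}

# Demonstrated on one sample line from the decompiled map:
printf '\titem dhcp-126-101.lab.eng.brq.redhat.com-general weight 0.000\n' \
| fix_general_weights
```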
Checked on:
rhscon-ui-0.0.42-1.el7scon.noarch
rhscon-core-0.0.28-1.el7scon.x86_64
rhscon-ceph-0.0.27-1.el7scon.x86_64
rhscon-core-selinux-0.0.28-1.el7scon.noarch

Rules created by the console are working, but the default one is not working anymore. Hence pools created from the CLI using the default ruleset are not working.

```shell
$ ceph osd getcrushmap > crush.map
$ crushtool -i crush.map --test --show-bad-mappings --rule 0 --num-rep 2 --min-x 1 --max-x $((1024 * 1024))
bad mapping rule 2 x 1 num_rep 2 result []
bad mapping rule 2 x 2 num_rep 2 result []
bad mapping rule 2 x 3 num_rep 2 result []
bad mapping rule 2 x 4 num_rep 2 result []
bad mapping rule 2 x 5 num_rep 2 result []
bad mapping rule 2 x 6 num_rep 2 result []
...
```

and so on; all of them are bad, the same for any other num-rep.

```shell
$ crushtool -i crush.map --test --show-bad-mappings --rule 1 --num-rep 2 --min-x 1 --max-x $((1024 * 1024))
$
```

No problem here.

```shell
$ cat crush.txt
...
# buckets
host dhcp-126-101 {
    id -2        # do not change unnecessarily
    # weight 0.000
    alg straw
    hash 0       # rjenkins1
}
...
root default {
    id -1        # do not change unnecessarily
    # weight 0.000
    alg straw
    hash 0       # rjenkins1
    item dhcp-126-101 weight 0.000
    item dhcp-126-102 weight 0.000
    item dhcp-126-103 weight 0.000
    item dhcp-126-105 weight 0.000
}
host dhcp-126-101.lab.eng.brq.redhat.com-general {
    id -6        # do not change unnecessarily
    # weight 0.000
    alg straw
    hash 0       # rjenkins1
    item osd.0 weight 0.010
}
...
root general {
    id -8        # do not change unnecessarily
    # weight 0.040
    alg straw
    hash 0       # rjenkins1
    item dhcp-126-101.lab.eng.brq.redhat.com-general weight 0.010
    item dhcp-126-102.lab.eng.brq.redhat.com-general weight 0.010
    item dhcp-126-103.lab.eng.brq.redhat.com-general weight 0.010
    item dhcp-126-105.lab.eng.brq.redhat.com-general weight 0.010
}
...
```

The OSD entries are now missing for the hosts that the default ruleset uses; before, it was the other way around.

*** Bug 1354603 has been marked as a duplicate of this bug. ***
Since BZ 1354603 has been flagged as a duplicate of this BZ, the QE team has to fully test the scenario described in BZ 1354603 during validation of this BZ.

Tested on:
rhscon-core-0.0.38-1.el7scon.x86_64
rhscon-ceph-0.0.38-1.el7scon.x86_64
rhscon-core-selinux-0.0.38-1.el7scon.noarch
rhscon-ui-0.0.51-1.el7scon.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1754

(In reply to errata-xmlrpc from comment #11)

But where can I find the commits for this bug?