Description of problem:
Any pool created using the UI fails with a timeout error:

    Failed. error: admin:32ebc534-6a8d-4a34-afff-f78021bf8c57 - Syncing request status timed out

However, the pool is created in Ceph, but all of its PGs are stuck in creating or creating+peering state.

# ceph -s
    cluster e973da70-c41b-493b-a7a1-4053e4b80df3
     health HEALTH_ERR
            128 pgs are stuck inactive for more than 300 seconds
            73 pgs peering
            128 pgs stuck inactive
     monmap e3: 3 mons at {dhcp-126-71=10.34.126.71:6789/0,dhcp-126-72=10.34.126.72:6789/0,dhcp-126-73=10.34.126.73:6789/0}
            election epoch 14, quorum 0,1,2 dhcp-126-71,dhcp-126-72,dhcp-126-73
     osdmap e105: 4 osds: 4 up, 4 in
            flags sortbitwise
      pgmap v282: 328 pgs, 2 pools, 0 bytes data, 0 objects
            147 MB used, 40768 MB / 40915 MB avail
                 200 active+clean
                  73 creating+peering
                  55 creating

The PGs never change to active+clean state. Because of that, the cluster state changes to HEALTH_ERR. Moreover, no object can be created in such pools.

Version-Release number of selected component (if applicable):
rhscon-core-selinux-0.0.26-1.el7scon.noarch
rhscon-ui-0.0.41-1.el7scon.noarch
rhscon-core-0.0.26-1.el7scon.x86_64
rhscon-ceph-0.0.26-1.el7scon.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a cluster
2. Try to create an object pool

Actual results:
Pool creation fails with a timeout; however, the pool is created in Ceph and therefore appears in the list. The pool cannot be used, because all of its PGs are stuck.

Expected results:
Pool creation succeeds.

Additional info:
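For reference, a minimal sketch of the commands that show which pool the stuck PGs belong to and which CRUSH rule that pool uses (the pool name test_ui_pool is only a placeholder; on this Jewel-based release the pool option is named crush_ruleset):

$ ceph health detail | grep stuck                 # lists the stuck PG ids
$ ceph pg dump_stuck inactive                     # stuck PGs with their acting OSD sets
$ ceph osd lspools                                # map the PG id prefix (pool id) to a pool name
$ ceph osd pool get test_ui_pool crush_ruleset    # which CRUSH rule the pool uses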
Pools can be created with ceph CLI commands, and new objects can be created in them. In my setup the UI wants to create the pool with 128 PGs, and they were never created; they stayed stuck in creating or creating+peering state. I tried to make a pool using the 'ceph osd pool create <pool_name> <PG_number>' command with 8000 PGs. It takes a long time, but it succeeds. So the number of PGs is not the issue.
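For clarity, a minimal sketch of that CLI check, with placeholder names (cli_pool, testobj) and a small PG count instead of 8000:

$ ceph osd pool create cli_pool 128
$ rados -p cli_pool put testobj /etc/hosts
$ rados -p cli_pool ls    # should list testobj once the pool's PGs are active+clean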
This seems to me to be the same root cause as this:
https://bugzilla.redhat.com/show_bug.cgi?id=1329190

Would you please try going through these steps:
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-constraints-cannot-be-satisfied

Additionally, provide the crush map, the output of "ceph health detail", and access to the environment where this issue is present.
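For anyone collecting the requested data, a sketch of the usual commands (the output file names are arbitrary):

$ ceph health detail > health-detail.txt
$ ceph osd getcrushmap -o crush.map
$ crushtool -d crush.map -o crush.txt    # decompiled, human-readable crush map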
(In reply to Gregory Meno from comment #2)
> This seems to me to be the same root cause as this
> https://bugzilla.redhat.com/show_bug.cgi?id=1329190
> 
> Would you please try going through these steps
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-constraints-cannot-be-satisfied
> Additionally provide the crush map, output of "ceph health detail" and
> access to the environment where this issue is present.

All our environments have the same issue. As an example, on mine I have 4 OSDs: 2 in the general storage profile and 2 in my new_profile.

$ ceph osd crush rule ls
[
    "replicated_ruleset",
    "general",
    "new_profile"
]

$ ceph osd crush rule dump general
{
    "rule_id": 1,
    "rule_name": "general",
    "ruleset": 1,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -8,
            "item_name": "general"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

$ ceph osd getcrushmap > crush.map
$
$ crushtool -i crush.map --test --show-bad-mappings --rule 1 --num-rep 2 --min-x 1 --max-x $((1024 * 1024))
bad mapping rule 2 x 1 num_rep 2 result []
bad mapping rule 2 x 2 num_rep 2 result []
bad mapping rule 2 x 3 num_rep 2 result []
bad mapping rule 2 x 4 num_rep 2 result []
bad mapping rule 2 x 5 num_rep 2 result []
bad mapping rule 2 x 6 num_rep 2 result []
...

... and so on, all of them bad; the same for any other num-rep.

Just to be sure that the problem is not somewhere else, I ran the same command for the default replicated crush rule replicated_ruleset:

$ crushtool -i crush.map --test --show-bad-mappings --rule 0 --num-rep 2 --min-x 1 --max-x $((1024 * 1024))
$

No problem here.

$ crushtool --decompile crush.map > crush.txt
$
$ cat crush.txt
...
# buckets
host dhcp-126-101 {
        id -2           # do not change unnecessarily
        # weight 0.010
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 0.010
}
...
root default {
        id -1           # do not change unnecessarily
        # weight 0.040
        alg straw
        hash 0  # rjenkins1
        item dhcp-126-101 weight 0.010
        item dhcp-126-102 weight 0.010
        item dhcp-126-103 weight 0.010
        item dhcp-126-105 weight 0.010
}
host dhcp-126-101.lab.eng.brq.redhat.com-general {
        id -6           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
}
...
root general {
        id -8           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item dhcp-126-101.lab.eng.brq.redhat.com-general weight 0.000
        item dhcp-126-102.lab.eng.brq.redhat.com-general weight 0.000
}
...

I noticed that, for some unknown reason, the general root uses different hosts than default. I tried to change them to the same ones, i.e. dhcp-126-101 and dhcp-126-102.

$ crushtool --compile crush.txt -o better-crush.map
$
$ crushtool -i better-crush.map --test --show-bad-mappings --rule 1 --num-rep 1 --min-x 1 --max-x $((1024 * 1024)) | more
$
$ crushtool -i better-crush.map --test --show-bad-mappings --rule 1 --num-rep 2 --min-x 1 --max-x $((1024 * 1024))
bad mapping rule 1 x 1 num_rep 2 result [0]
bad mapping rule 1 x 2 num_rep 2 result [0]
bad mapping rule 1 x 3 num_rep 2 result [0]
bad mapping rule 1 x 4 num_rep 2 result [0]
bad mapping rule 1 x 5 num_rep 2 result [0]
...

Still some issues here, so I changed the weights to the same value as for default, i.e. 0.010.

$ crushtool --compile crush.txt -o better-crush.map
$
$ crushtool -i better-crush.map --test --show-bad-mappings --rule 1 --num-rep 1 --min-x 1 --max-x $((1024 * 1024)) | more
$
$ crushtool -i better-crush.map --test --show-bad-mappings --rule 1 --num-rep 2 --min-x 1 --max-x $((1024 * 1024)) | more
$

No problems with these settings.

Original crush map attached.
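Note that the crushtool runs above only test the edited map offline; to actually apply it to the cluster, the usual sequence (with the file names used above) would be something like:

$ crushtool --compile crush.txt -o better-crush.map
$ ceph osd setcrushmap -i better-crush.map
$ ceph -s    # the creating/creating+peering PGs should then peer and go active+clean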
Created attachment 1171871 [details]
crush map
Checked on:
rhscon-ui-0.0.42-1.el7scon.noarch
rhscon-core-0.0.28-1.el7scon.x86_64
rhscon-ceph-0.0.27-1.el7scon.x86_64
rhscon-core-selinux-0.0.28-1.el7scon.noarch

Rules created by the console are working, but the default one is not working anymore. Hence pools created from the CLI using the default rule set are not working.

$ ceph osd getcrushmap > crush.map
$
$ crushtool -i crush.map --test --show-bad-mappings --rule 0 --num-rep 2 --min-x 1 --max-x $((1024 * 1024))
bad mapping rule 2 x 1 num_rep 2 result []
bad mapping rule 2 x 2 num_rep 2 result []
bad mapping rule 2 x 3 num_rep 2 result []
bad mapping rule 2 x 4 num_rep 2 result []
bad mapping rule 2 x 5 num_rep 2 result []
bad mapping rule 2 x 6 num_rep 2 result []
...

... and so on, all of them bad; the same for any other num-rep.

$ crushtool -i crush.map --test --show-bad-mappings --rule 1 --num-rep 2 --min-x 1 --max-x $((1024 * 1024))
$

No problem here.

$ cat crush.txt
...
# buckets
host dhcp-126-101 {
        id -2           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
}
...
root default {
        id -1           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item dhcp-126-101 weight 0.000
        item dhcp-126-102 weight 0.000
        item dhcp-126-103 weight 0.000
        item dhcp-126-105 weight 0.000
}
host dhcp-126-101.lab.eng.brq.redhat.com-general {
        id -6           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 0.010
}
...
root general {
        id -8           # do not change unnecessarily
        # weight 0.040
        alg straw
        hash 0  # rjenkins1
        item dhcp-126-101.lab.eng.brq.redhat.com-general weight 0.010
        item dhcp-126-102.lab.eng.brq.redhat.com-general weight 0.010
        item dhcp-126-103.lab.eng.brq.redhat.com-general weight 0.010
        item dhcp-126-105.lab.eng.brq.redhat.com-general weight 0.010
}
...

The OSD entries are now missing from the hosts used by the default rule set; before, it was the other way around.
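A quick way to compare the two hierarchies and re-check the mapping for both rules (rule ids 0 and 1, as in the dumps above):

$ ceph osd tree    # shows which hosts actually hold OSDs under each root
$ crushtool -i crush.map --test --show-bad-mappings --rule 0 --num-rep 2
$ crushtool -i crush.map --test --show-bad-mappings --rule 1 --num-rep 2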
*** Bug 1354603 has been marked as a duplicate of this bug. ***
Since BZ 1354603 has been marked as a duplicate of this BZ, the QE team has to fully test the scenario described in BZ 1354603 during validation of this BZ.
Tested on:
rhscon-core-0.0.38-1.el7scon.x86_64
rhscon-ceph-0.0.38-1.el7scon.x86_64
rhscon-core-selinux-0.0.38-1.el7scon.noarch
rhscon-ui-0.0.51-1.el7scon.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2016:1754
(In reply to errata-xmlrpc from comment #11)
> Since the problem described in this bug report should be
> resolved in a recent advisory, it has been closed with a
> resolution of ERRATA.
> 
> For information on the advisory, and where to find the updated
> files, follow the link below.
> 
> If the solution does not work for you, open a new bug report.
> 
> https://access.redhat.com/errata/RHEA-2016:1754

But where can I find the commits for this bug?