Bug 1231630 - Monitor Crash while creating ecpool
Summary: Monitor Crash while creating ecpool
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 1.3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 1.3.2
Assignee: Loic Dachary
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks: ceph131rn
 
Reported: 2015-06-15 06:17 UTC by Tanay Ganguly
Modified: 2017-07-30 15:14 UTC
11 users

Fixed In Version: ceph-0.94.3-3.el7cp (RHEL) Ceph v0.94.3.3 (Ubuntu)
Doc Type: Known Issue
Doc Text:
.PG creation can hang after creating a pool using incorrect values

An attempt to create a new erasure-coded pool using values that do not align with the OSD crush map causes placement groups (PGs) to remain in the "creating" state indefinitely. As a consequence, the Ceph cluster cannot achieve the `active+clean` state. To fix this problem, delete the erasure-coded pool and the associated crush ruleset, delete the profile that was used to create that pool, and use a new, corrected erasure code profile that aligns with the crush map.
Clone Of:
Environment:
Last Closed: 2015-12-11 21:30:34 UTC
Embargoed:


Attachments
Mon log (3.79 MB, text/plain)
2015-06-15 06:17 UTC, Tanay Ganguly
no flags Details
Mon log (3.79 MB, text/plain)
2015-06-15 06:21 UTC, Tanay Ganguly
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 11814 0 None None None Never

Description Tanay Ganguly 2015-06-15 06:17:37 UTC
Created attachment 1038776 [details]
Mon log

Description of problem:
Seeing a monitor crash while creating an erasure-coded pool with wrong parameters

Version-Release number of selected component (if applicable):
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)

How reproducible:
1/1

Steps to Reproduce:
1. Create an EC profile with the below command.

ceph osd erasure-code-profile set myprofile plugin=lrc mapping=__DD__DD layers='[[ "_cDD_cDD", "" ],[ "cDDD____", "" ],[ "____cDDD", "" ],]' ruleset-steps='[ [ "choose", "datacenter", 3 ], [ "chooseleaf", "osd", 0] ]'


2.  ceph osd pool create ecpool 12 12 erasure myprofile
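
For reference, one way to sanity-check what step 1 actually stored before creating the pool (a rough sketch; both are standard commands in this release, and the profile name is the one used above):

ceph osd erasure-code-profile get myprofile    # shows the stored plugin, mapping, layers and ruleset-steps
ceph osd crush rule dump                       # lists the crush rules already in the map, for comparison with ruleset-steps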


Actual results: The monitor is crashing.
BT:

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-mon() [0x901e52]
 2: (()+0xf130) [0x7f205a9ca130]
 3: (crush_do_rule()+0x291) [0x833501]
 4: (OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*, int*, unsigned int*) const+0xff) [0x77b76f]
 5: (OSDMap::_pg_to_up_acting_osds(pg_t const&, std::vector<int, std::allocator<int> >*, int*, std::vector<int, std::allocator<int> >*, int*) const+0x104) [0x77be24]
 6: (PGMonitor::map_pg_creates()+0x268) [0x65b748]
 7: (PGMonitor::post_paxos_update()+0x25) [0x65bf35]
 8: (Monitor::refresh_from_paxos(bool*)+0x221) [0x575721]
 9: (Monitor::init_paxos()+0x95) [0x575ac5]
 10: (Monitor::preinit()+0x7f1) [0x57a881]
 11: (main()+0x24a1) [0x54d881]
 12: (__libc_start_main()+0xf5) [0x7f20593d0af5]
 13: /usr/bin/ceph-mon() [0x55d0f9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Expected results: There should not be any crash.
While executing any ceph command, I am getting an error.

ceph -s
2015-06-15 02:09:16.514340 7fe8301a6700  0 -- :/1012129 >> 10.16.154.227:6789/0 pipe(0x7fe82c028050 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe82c0250d0).fault
2015-06-15 02:09:19.513490 7fe827d78700  0 -- :/1012129 >> 10.16.154.227:6789/0 pipe(0x7fe81c000c00 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe81c004ea0).fault


Also, there is no way to recover once the monitor has crashed; I have to purge and re-create the cluster.

Additional info: Mon log, Crush Map

Comment 2 Tanay Ganguly 2015-06-15 06:21:37 UTC
Created attachment 1038777 [details]
Mon log

Comment 3 Christina Meno 2015-06-15 23:12:24 UTC
Sam, blocker for 1.3.0?

Comment 4 Loic Dachary 2015-06-16 20:03:39 UTC
This is not a blocker for 1.3.0.

Comment 5 Samuel Just 2015-07-14 18:17:00 UTC
Loic, is there a fix merged upstream that we want for 1.3.1?

Comment 6 Ken Dreyer (Red Hat) 2015-07-16 01:18:51 UTC
(In reply to Samuel Just from comment #5)
> Loic, is there a fix merged upstream that we want for 1.3.1?

If not, please re-target to 1.3.2.

Comment 7 Ken Dreyer (Red Hat) 2015-07-22 18:57:50 UTC
Loic, it's not totally clear to me what exact patches we'd need on top of 0.94.2. Can you please clarify?

Comment 8 Loic Dachary 2015-09-30 10:07:15 UTC
Ken, 

http://tracker.ceph.com/issues/11814 has "Copied to" issues, each of them being a backport of the fix to the relevant stable releases. In this case there is only one, http://tracker.ceph.com/issues/11824, which is targeted to hammer, as shown by the Release field.

The description of http://tracker.ceph.com/issues/11824 is a link to the pull request that has the commits that were backported. This convention is strictly enforced (the stable release team verifies it on a weekly basis).

The Target version field tells you in which version this backport will be published. In this case it is v0.94.4, i.e. the next hammer point release at the time of this writing. When this field is set, it means the backport has been tested (via the relevant ceph-qa-suite) and approved by the developer who merged the corresponding commit in master.

To answer your question, the patches you need are at https://github.com/ceph/ceph/pull/5276 (3 of them). 

If you'd like to know more about the backport process it's at http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO

Cheers

Comment 9 Ken Dreyer (Red Hat) 2015-09-30 19:47:47 UTC
Thanks Loic. The thing that confused me about the upstream tracker was that there were two issues marked as "related", and 11824 upstream also has a comment about "It must be backported together with the fix for #12419" ... so I really appreciate your clarification that only https://github.com/ceph/ceph/pull/5276 is needed.

Comment 10 Ken Dreyer (Red Hat) 2015-10-13 14:36:32 UTC
Loic, I take it that we'll also want the patches at http://tracker.ceph.com/issues/13477 ? "crush/mapper: ensure bucket id is valid before indexing buckets array" and "crush/mapper: ensure take bucket value is valid" ?

Comment 11 Loic Dachary 2015-10-26 11:40:28 UTC
Ken, yes, that's a patch worth having

Comment 13 Federico Lucifredi 2015-11-04 17:39:45 UTC
If we re-spin to fix the Ceph-deploy issues, we will include this fix as well.

Comment 14 Ken Dreyer (Red Hat) 2015-11-04 22:04:47 UTC
I've cherry-picked the changes from https://github.com/ceph/ceph/pull/5276 and https://github.com/ceph/ceph/pull/6430 here.

(I did not cherry-pick 6f0af185ad7cf9640557efb7f61a7ea521871b5b because it only fixes the vstart.sh file in /src/test/, and the upstream v0.94.3 tarball does not contain /src/test. Also, vstart.sh is not used in RHCS downstream, so the patch is not relevant.)

The exact commands I ran on the "ceph-1.3-rhel-patches" branch in Gerrit (for RHEL) and the "rhcs-0.94.3-ubuntu" branch in GitHub (for Ubuntu):

git cherry-pick -x b58cbbab4f74e352c3d4a61190cea2731057b3c9
git cherry-pick -x f47ba4b1a1029a55f8bc4ab393a7fa3712cd4e00
git fetch https://github.com/SUSE/ceph wip-13654-hammer
git cherry-pick -x 81d8aa14f3f2b7bf4bdd0b4e53e3a653a600ef38
git cherry-pick -x a52f7cb372339dffbeed7dae8ce2680586760754

Comment 17 Tanay Ganguly 2015-11-07 18:51:08 UTC
Running the same command as mentioned in the bug is not working.
Now I am not seeing the crash, but it's getting stuck in PG creation forever.

1. Create an EC profile with the below command.

ceph osd erasure-code-profile set myprofile plugin=lrc mapping=__DD__DD layers='[[ "_cDD_cDD", "" ],[ "cDDD____", "" ],[ "____cDDD", "" ],]' ruleset-steps='[ [ "choose", "datacenter", 3 ], [ "chooseleaf", "osd", 0] ]'

2.  ceph osd pool create ecpool 12 12 erasure myprofile

ceph osd tree
ID WEIGHT   TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 14.52994 root default                                        
-2  1.44998     host cephqe11                                   
 1  0.35999         osd.1          up  1.00000          1.00000 
14  1.09000         osd.14         up  1.00000          1.00000 
-3  4.35999     host cephqe8                                    
 2  1.09000         osd.2          up  1.00000          1.00000 
 3  1.09000         osd.3          up  1.00000          1.00000 
 4  1.09000         osd.4          up  1.00000          1.00000 
 5  1.09000         osd.5          up  1.00000          1.00000 
-4  4.35999     host cephqe9                                    
 6  1.09000         osd.6          up  1.00000          1.00000 
 7  1.09000         osd.7          up  1.00000          1.00000 
 8  1.09000         osd.8          up  1.00000          1.00000 
 9  1.09000         osd.9          up  1.00000          1.00000 
-5  4.35999     host cephqe10                                   
10  1.09000         osd.10         up  1.00000          1.00000 
11  1.09000         osd.11         up  1.00000          1.00000 
12  1.09000         osd.12         up  1.00000          1.00000 
13  1.09000         osd.13         up  1.00000          1.00000 
 0        0 osd.0                down        0          1.00000 

ceph -s
    cluster 4b86e8aa-7004-45b4-8328-319f23fbcd6f
     health HEALTH_WARN
            clock skew detected on mon.cephqe5, mon.cephqe6
            12 pgs stuck inactive
            12 pgs stuck unclean
            too few PGs per OSD (13 < min 30)
            Monitor clock skew detected 
     monmap e1: 3 mons at {cephqe4=10.70.44.42:6789/0,cephqe5=10.70.44.44:6789/0,cephqe6=10.70.44.46:6789/0}
            election epoch 4, quorum 0,1,2 cephqe4,cephqe5,cephqe6
     osdmap e79: 15 osds: 14 up, 14 in
      pgmap v184: 76 pgs, 2 pools, 3072 MB data, 3 objects
            9701 MB used, 14816 GB / 14826 GB avail
                  64 active+clean
                  12 creating


ceph pg dump |grep ^1
dumped all in format plain  
1.a     0       0       0       0       0       0       0       0       creating        0.000000        0'0     0:0     []      -1      []      -1      0'0     2015-11-08 05:22:31.669950      0'0     2015-11-08 05:22:31.669950
1.b     0       0       0       0       0       0       0       0       creating        0.000000        0'0     0:0     []      -1      []      -1      0'0     2015-11-08 05:22:31.669951      0'0     2015-11-08 05:22:31.669951
1.8     0       0       0       0       0       0       0       0       creating        0.000000        0'0     0:0     []      -1      []      -1      0'0     2015-11-08 05:22:31.669949      0'0     2015-11-08 05:22:31.669949
1.9     0       0       0       0       0       0       0       0       creating        0.000000        0'0     0:0     []      -1      []      -1      0'0     2015-11-08 05:22:31.669950      0'0     2015-11-08 05:22:31.669950
1.6     0       0       0       0       0       0       0       0       creating        0.000000        0'0     0:0     []      -1      []      -1      0'0     2015-11-08 05:22:31.669937      0'0     2015-11-08 05:22:31.669937
1.7     0       0       0       0       0       0       0       0       creating        0.000000        0'0     0:0     []      -1      []      -1      0'0     2015-11-08 05:22:31.669949      0'0     2015-11-08 05:22:31.669949
1.4     0       0       0       0       0       0       0       0       creating        0.000000        0'0     0:0     []      -1      []      -1      0'0     2015-11-08 05:22:31.669936      0'0     2015-11-08 05:22:31.669936
1.5     0       0       0       0       0       0       0       0       creating        0.000000        0'0     0:0     []      -1      []      -1      0'0     2015-11-08 05:22:31.669936      0'0     2015-11-08 05:22:31.669936
1.2     0       0       0       0       0       0       0       0       creating        0.000000        0'0     0:0     []      -1      []      -1      0'0     2015-11-08 05:22:31.669935      0'0     2015-11-08 05:22:31.669935
1.3     0       0       0       0       0       0       0       0       creating        0.000000        0'0     0:0     []      -1      []      -1      0'0     2015-11-08 05:22:31.669935      0'0     2015-11-08 05:22:31.669935
1.0     0       0       0       0       0       0       0       0       creating        0.000000        0'0     0:0     []      -1      []      -1      0'0     2015-11-08 05:22:31.669934      0'0     2015-11-08 05:22:31.669934
1.1     0       0       0       0       0       0       0       0       creating        0.000000        0'0     0:0     []      -1      []      -1      0'0     2015-11-08 05:22:31.669934      0'0     2015-11-08


Version: ceph-0.94.3-3.el7cp.x86_64

Comment 18 Ken Dreyer (Red Hat) 2015-11-09 17:17:10 UTC
(In reply to Tanay Ganguly from comment #17)
> Running the same command as mentioned in the bug is not working.
> Now I am not seeing the crash, but it's getting stuck in PG creation forever.

Loic, any idea why pg creation would get stuck here? Is there another bug to fix with this?

Comment 19 Loic Dachary 2015-11-09 17:22:22 UTC
> ruleset-steps='[ [ "choose", "datacenter", 3 ], [ "chooseleaf", "osd", 0] ]'

The crush rule requires 3 datacenters but there are none in the crush map, so no OSD can be mapped to the PGs and they are stuck.
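
A rough way to confirm that from the command line (the file paths are only examples; the commands exist in hammer): extract and decompile the crush map, look for buckets of type datacenter, and test-map the rule with crushtool.

ceph osd getcrushmap -o /tmp/crushmap              # extract the compiled crush map (example path)
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt    # decompile it to text
grep datacenter /tmp/crushmap.txt                  # the type is defined, but no buckets of that type exist in this map
crushtool -i /tmp/crushmap --test --rule <ruleset-id> --num-rep 8 --show-bad-mappings   # 8 = size of the __DD__DD mapping; rule id from "ceph osd crush rule dump"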

Comment 20 Federico Lucifredi 2015-11-10 00:03:17 UTC
The fix appears to have failed, and it was a non-blocker "target of opportunity" in the re-spin. 

I agree with Harish's assessment; pushing to 1.3.2.

Comment 21 Ken Dreyer (Red Hat) 2015-11-10 19:38:19 UTC
With ceph 0.94.3.3, it sounds like the following is occurring:

1) User specifies an erasure code profile with three datacenters.
2) There are not three datacenters in the crush map.
3) Ceph waits for a crushmap adjustment

If I'm understanding correctly, there is no remaining issue to fix here. If a user was in this situation, the solution would be to re-do the erasure code profile with something that aligns with the current crush map, or else adjust the crush map to fit the erasure code profile. But Ceph can't guess at what the user wants to do in this situation.
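
As a sketch of the "adjust the crush map" option (the dc1/dc2/dc3 bucket names are hypothetical; the crush bucket commands themselves are standard):

ceph osd crush add-bucket dc1 datacenter       # repeat for dc2 and dc3
ceph osd crush move dc1 root=default           # place each datacenter under the root
ceph osd crush move cephqe8 datacenter=dc1     # move each host under one of the datacenters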

Do I have that right Loic?

Comment 22 Harish NV Rao 2015-11-12 11:10:56 UTC
When an invalid value as mentioned above was provided, the command hung. The expectation is that it should time out and print a user-understandable error message rather than hanging.

Comment 25 Ken Dreyer (Red Hat) 2015-11-12 15:44:33 UTC
(In reply to Harish NV Rao from comment #22)
> When an invalid value as mentioned above was provided, the command hung.

Which command hung?

Comment 26 Tanay Ganguly 2015-11-12 17:03:47 UTC
Hi Ken,

Just to clarify, the PG creation gets hung, not the command. The Ceph cluster never becomes active+clean as it doesn't get valid parameters.

I.e., if a user specifies datacenter or rack in ruleset-steps, where these entities do not exist in the crush map, then the PG creation gets stuck.

This was part of negative testing, a situation a customer can sometimes face.

The only solution is to delete the pool created using that profile, then delete the profile, and also remove the ruleset from the crush map (to prevent the user from using the same ruleset again).
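
In commands, that workaround is roughly the following (a sketch: hammer's pool delete takes the name twice plus the confirmation flag, and the ruleset created for an erasure pool is normally named after the pool; "ceph osd crush rule ls" will confirm the name):

ceph osd pool delete ecpool ecpool --yes-i-really-really-mean-it    # remove the stuck pool
ceph osd crush rule rm ecpool                                       # remove the ruleset created for that pool
ceph osd erasure-code-profile rm myprofile                          # remove the profile before defining a corrected one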

From a QE standpoint, it would have been better if the command had failed, stating the reason for it.

Thanks,
Tanay

Comment 27 Ken Dreyer (Red Hat) 2015-11-12 17:09:30 UTC
(In reply to Tanay Ganguly from comment #26)
> From a QE standpoint, it would have been better if the command had failed,
> stating the reason for it.

Loic, would it be possible to do some sort of input validation as Tanay describes here, so it's clearer to the user?

Comment 29 Loic Dachary 2015-11-12 17:42:28 UTC
> Loic, would it be possible to do some sort of input validation as Tanay describes here, so it's clearer to the user?

There would be value in clarifying this, indeed. It's worth a discussion on ceph-devel I think: I can't think of a trivial way to do that.

Comment 30 Loic Dachary 2015-11-12 17:44:46 UTC
@Ken 

> If I'm understanding correctly, there is no remaining issue to fix here.

Yes, this issue does not need fixing. Another could be opened to suggest a usability improvement.

Comment 32 Federico Lucifredi 2015-12-11 21:30:34 UTC
Validating a CRUSH map would be a different bug. The rest of the issue is resolved.

NO QE required.

