Bug 1552101
Summary: | OC deploy timeouts with 3ceph nodes - ceph fsid hangs | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Pavel Sedlák <psedlak> | |
Component: | Ceph-Ansible | Assignee: | Guillaume Abrioux <gabrioux> | |
Status: | CLOSED ERRATA | QA Contact: | Yogev Rabl <yrabl> | |
Severity: | urgent | Docs Contact: | ||
Priority: | urgent | |||
Version: | 3.0 | CC: | adeza, anharris, aschoen, aschultz, ceph-eng-bugs, dbecker, dsariel, gfidente, gmeno, johfulto, kdreyer, mburns, morazi, nthomas, ohochman, pgrist, rhel-osp-director-maint, sankarshan, skatlapa, vashastr | |
Target Milestone: | rc | Keywords: | Reopened | |
Target Release: | 3.1 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | RHEL: ceph-ansible-3.1.0-0.1.beta3.el7cp | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1552769 1559275 | Environment: | |
Last Closed: | 2018-09-26 18:19:40 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1548353, 1552769 |
Description
Pavel Sedlák
2018-03-06 13:59:04 UTC
Not sure what ceph fsid is doing; a snippet of its strace output (which repeats when watched live) is below.

Also, ceph --help shows:
> Monitor commands:
> =================
> [Contacting monitor, timeout after 5 seconds]
> 2018-03-06 14:00:36.589137 7f757a0d6700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
> 2018-03-06 14:00:36.591611 7f7578158700 0 -- :/3709006390 >> 172.17.3.13:6789/0 pipe(0x7f757405dba0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f757405ee60).fault
> 2018-03-06 14:00:39.589775 7f7570ff9700 0 -- :/3709006390 >> 172.17.3.16:6789/0 pipe(0x7f7568000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f7568001f90).fault

strace snippet:
> [pid 45190] <... select resumed> ) = 0 (Timeout)
> [pid 45190] clone(child_stack=0x7f41c6ffcfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f41c6ffd9d0, tls=0x7f41c6ffd700, child_tidptr=0x7f41c6ffd9d0) = 354090
> /tmp/strace: Process 354090 attached
> [pid 45190] futex(0x10c9180, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354090] set_robust_list(0x7f41c6ffd9e0, 24) = 0
> [pid 354090] futex(0x10c9180, FUTEX_WAKE_PRIVATE, 1) = 1
> [pid 45190] <... futex resumed> ) = 0
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354090] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 45190] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 354090] <... futex resumed> ) = 0
> [pid 354090] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354090] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> ) = 0
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354090] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 1
> [pid 45190] <... futex resumed> ) = 0
> [pid 354090] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354090] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> ) = 0
> [pid 354090] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354090] <... futex resumed> ) = 0
> [pid 354090] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 1
> [pid 354090] madvise(0x7f41c67fd000, 8368128, MADV_DONTNEED <unfinished ...>
> [pid 45190] <... futex resumed> ) = 0
> [pid 354090] <... madvise resumed> ) = 0
> [pid 354090] exit(0) = ?
> [pid 354090] +++ exited with 0 +++
> [pid 45190] clone(/tmp/strace: Process 354091 attached
> <unfinished ...>
> [pid 354091] set_robust_list(0x7f41c6ffd9e0, 24 <unfinished ...>
> [pid 45190] <... clone resumed> child_stack=0x7f41c6ffcfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f41c6ffd9d0, tls=0x7f41c6ffd700, child_tidptr=0x7f41c6ffd9d0) = 354091
> [pid 354091] <... set_robust_list resumed> ) = 0
> [pid 354091] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354091] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> ) = 0
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354091] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 45190] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 354091] futex(0xf64370, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0xf64370, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354091] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> ) = 0
> [pid 354091] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354091] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> ) = 0
> [pid 45190] futex(0x10c9180, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354091] futex(0x10c9180, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 45190] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354091] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 45190] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 354091] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 354091] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354091] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> ) = 0
> [pid 45190] select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
> [pid 354091] madvise(0x7f41c67fd000, 8368128, MADV_DONTNEED) = 0
> [pid 354091] exit(0) = ?
> [pid 354091] +++ exited with 0 +++
> [pid 45190] <... select resumed> ) = 0 (Timeout)

This is not a duplicate. The issues are different.

ceph-ansible 3.1 contains a fix which resolves this problem:
https://github.com/ceph/ceph-ansible/commit/ec16cbdb1af9069de09d4a2e2e88739c2c303350

This bug now depends on 1548353, whose goal is to get ceph-ansible 3.1 into osp13.

John, would you please confirm 3.1.0beta3 (or 3.1.0beta4) fixes this bug?
- resetting assignee to the guits because his linked PR is what fixed the issue
- I verify that ceph-ansible-3.1.0.0-0.beta4.1.el7.noarch contains the fix
- I did several deployments and didn't experience the reported timeout

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2819
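
For anyone trying to reproduce the hang from the description on a build without the fix, the three symptoms in the original output (no keyring at any of the listed paths, connection faults to the monitors on port 6789, and ceph fsid never returning) can be checked separately. Below is a minimal sketch in Python, not part of ceph-ansible or the fix. The keyring paths and monitor addresses are copied from the log above; the helper names (find_keyring, monitor_reachable, run_with_timeout) and the timeout values are assumptions made for illustration.

```python
import os
import socket
import subprocess

# Keyring search paths and monitor addresses copied from the log in this report.
KEYRING_PATHS = [
    "/etc/ceph/ceph.client.admin.keyring",
    "/etc/ceph/ceph.keyring",
    "/etc/ceph/keyring",
    "/etc/ceph/keyring.bin",
]
MONITORS = [("172.17.3.13", 6789), ("172.17.3.16", 6789)]


def find_keyring(paths=KEYRING_PATHS):
    """Return the first keyring path that exists as a file, or None."""
    for path in paths:
        if os.path.isfile(path):
            return path
    return None


def monitor_reachable(host, port, timeout=5.0):
    """Return True if a plain TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False


def run_with_timeout(cmd, timeout=30.0):
    """Run cmd and return (returncode, stdout); (None, "") if it timed out."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None, ""
    return proc.returncode, proc.stdout.strip()


def diagnose():
    """Print one line per check; the 30 second limit is an arbitrary choice."""
    print("keyring:", find_keyring() or "none found")
    for host, port in MONITORS:
        state = "reachable" if monitor_reachable(host, port) else "unreachable"
        print("monitor %s:%d %s" % (host, port, state))
    try:
        code, out = run_with_timeout(["ceph", "fsid"], timeout=30.0)
    except OSError:
        print("ceph fsid: ceph binary not found")
        return
    if code is None:
        print("ceph fsid: timed out")
    else:
        print("ceph fsid: exit %d, output %r" % (code, out))
```

Calling diagnose() on an affected node runs the three checks in order. Bounding the command with a timeout only keeps the caller from blocking; it does not address the underlying problem, which the ceph-ansible commit linked above resolves.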