
Bug 1552101

Summary: OC deploy timeouts with 3 ceph nodes - ceph fsid hangs
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Pavel Sedlák <psedlak>
Component: Ceph-Ansible
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED ERRATA
QA Contact: Yogev Rabl <yrabl>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.0
CC: adeza, anharris, aschoen, aschultz, ceph-eng-bugs, dbecker, dsariel, gfidente, gmeno, johfulto, kdreyer, mburns, morazi, nthomas, ohochman, pgrist, rhel-osp-director-maint, sankarshan, skatlapa, vashastr
Target Milestone: rc
Keywords: Reopened
Target Release: 3.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: RHEL: ceph-ansible-3.1.0-0.1.beta3.el7cp
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1552769 1559275 (view as bug list)
Environment:
Last Closed: 2018-09-26 18:19:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1548353, 1552769

Description Pavel Sedlák 2018-03-06 13:59:04 UTC
Overcloud deploy times out,
possibly due to a hanging  ceph --connect-timeout 3 --cluster ceph fsid  command on controller-1.

ceph-ansible-3.0.14-1.el7cp.noarch (uc)
ceph-common-10.2.10-17.el7cp.x86_64  (ctl-1)

openstack workflow execution list|grep -v ' SUCCESS '
> +--------------------------------------+--------------------------------------+------------------------------------------------------------------------+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
> | ID                                   | Workflow ID                          | Workflow name                                                          | Workflow namespace | Description                                                                                                                                                                                                                       | Task Execution ID                    | State   | State info | Created at          | Updated at          |
> +--------------------------------------+--------------------------------------+------------------------------------------------------------------------+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
> | 3e1b36ab-7974-4d72-b50b-eb07836074e2 | 344043da-a011-4753-81cd-1290cf020a68 | tripleo.plan_management.v1.publish_ui_logs_to_swift                    |                    | {"triggered_by": {"type": "cron_trigger", "id": "8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8", "name": "publish-ui-logs-hourly"}, "description": "Workflow execution created by cron trigger '(8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8)'."} | <none>                               | ERROR   | None       | 2018-03-06 10:05:01 | 2018-03-06 10:05:07 |
> | c63f2e47-8e90-4e7d-bd1c-db6f2e580b25 | cbf3f2c2-5fc3-461e-8f7f-6fe28c438599 | tripleo.overcloud.workflow_tasks.step2                                 |                    | Heat managed                                                                                                                                                                                                                      | <none>                               | RUNNING | None       | 2018-03-06 10:16:48 | 2018-03-06 10:16:48 |
> | cca9c080-2709-4562-be83-b87db61fd878 | 35626a61-1127-44ef-ae59-2ba188b8592d | tripleo.storage.v1.ceph-install                                        |                    | sub-workflow execution                                                                                                                                                                                                            | a72f6980-4a8d-4be6-afb9-fb1ae4be7e16 | RUNNING | None       | 2018-03-06 10:16:49 | 2018-03-06 10:16:49 |
> | 9ed63a6c-0c7c-417f-947d-31649f4bd949 | 344043da-a011-4753-81cd-1290cf020a68 | tripleo.plan_management.v1.publish_ui_logs_to_swift                    |                    | {"triggered_by": {"type": "cron_trigger", "id": "8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8", "name": "publish-ui-logs-hourly"}, "description": "Workflow execution created by cron trigger '(8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8)'."} | <none>                               | ERROR   | None       | 2018-03-06 11:06:29 | 2018-03-06 11:06:36 |
> | d09b34ad-ab8e-4632-8f2e-0d0485de3e8d | 344043da-a011-4753-81cd-1290cf020a68 | tripleo.plan_management.v1.publish_ui_logs_to_swift                    |                    | {"triggered_by": {"type": "cron_trigger", "id": "8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8", "name": "publish-ui-logs-hourly"}, "description": "Workflow execution created by cron trigger '(8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8)'."} | <none>                               | ERROR   | None       | 2018-03-06 12:08:20 | 2018-03-06 12:08:27 |
> | 0141d397-b0c4-4362-8786-413f8287c1f7 | 344043da-a011-4753-81cd-1290cf020a68 | tripleo.plan_management.v1.publish_ui_logs_to_swift                    |                    | {"triggered_by": {"type": "cron_trigger", "id": "8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8", "name": "publish-ui-logs-hourly"}, "description": "Workflow execution created by cron trigger '(8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8)'."} | <none>                               | ERROR   | None       | 2018-03-06 13:09:25 | 2018-03-06 13:09:32 |
> +--------------------------------------+--------------------------------------+------------------------------------------------------------------------+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+

Notice how long they have been running; the current time is:
> date --utc
> Tue Mar  6 13:43:18 UTC 2018
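
For reference, the elapsed time of the stuck ceph-install workflow can be computed from the timestamps above (a minimal sketch, assuming GNU date on the undercloud):

# workflow created 2018-03-06 10:16:49 UTC, current time 2018-03-06 13:43:18 UTC
start=$(date --utc -d '2018-03-06 10:16:49' +%s)
now=$(date --utc -d '2018-03-06 13:43:18' +%s)
echo $(( (now - start) / 60 )) minutes elapsed   # about 206 minutes (~3.5 hours)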

ps -ef |grep ansible
> mistral  10767  1371  4 05:17 ?        00:10:06 /usr/bin/python2 /bin/ansible-playbook /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actionpFS4X7/inventory.yaml --private-key /tmp/ansible-mistral-actionpFS4X7/ssh_private_key --skip-tags package-install,with_pkg
> mistral  10789     1  0 05:17 ?        00:00:00 ssh: /tmp/ansible-mistral-actionpFS4X7/.ansible/cp/192.168.24.10-tripleo-admin-22 [mux]
> mistral  12320 10767  0 05:19 ?        00:00:00 /usr/bin/python2 /bin/ansible-playbook /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actionpFS4X7/inventory.yaml --private-key /tmp/ansible-mistral-actionpFS4X7/ssh_private_key --skip-tags package-install,with_pkg
> mistral  12326 12320  0 05:19 ?        00:00:00 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o IdentityFile="/tmp/ansible-mistral-actionpFS4X7/ssh_private_key" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=tripleo-admin -o ConnectTimeout=30 -o ControlPath=/tmp/ansible-mistral-actionpFS4X7/.ansible/cp/%h-%r-%p -tt 192.168.24.10 /bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-dfogouwpzqdyrcchojmzbthdrrsfdtfa; /usr/bin/python /home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/command.py; rm -rf "/home/tripleo-admin/.ansible/tmp ansible-tmp-1520331591.88-25827043162241/" > /dev/null 2>&1'"'"' && sleep 0'
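
If one wanted to reproduce the stuck step outside of Mistral, the same playbook invocation could in principle be re-run by hand with extra verbosity (a sketch assembled from the ps output above; the /tmp/ansible-mistral-actionpFS4X7 inventory and private key only exist while the Mistral action is alive):

# re-run the ceph-ansible playbook manually, with verbose output, using the same arguments
ansible-playbook -vvv /usr/share/ceph-ansible/site-docker.yml.sample \
  --user tripleo-admin --become --become-user root \
  --extra-vars '{"ireallymeanit": "yes"}' \
  --inventory-file /tmp/ansible-mistral-actionpFS4X7/inventory.yaml \
  --private-key /tmp/ansible-mistral-actionpFS4X7/ssh_private_key \
  --skip-tags package-install,with_pkg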

The IP 192.168.24.10 visible above is controller-1; from there:

> [heat-admin@controller-1 ~]$ ps axf -o etime,args  # snippet
> 03:37:47  \_ sshd: tripleo-admin [priv]
> 03:37:46  |   \_ sshd: tripleo-admin@pts/0
> 03:35:51  |       \_ /bin/sh -c sudo -H -S -n -u root /bin/sh -c 'echo BECOME-SUCCESS-dfogouwpzqdyrcchojmzbthdrrsfdtfa; /usr/bin/python /home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/command.py; rm -rf "/home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/" > /dev/null 2>&1' && sleep 0
> 03:35:51  |           \_ sudo -H -S -n -u root /bin/sh -c echo BECOME-SUCCESS-dfogouwpzqdyrcchojmzbthdrrsfdtfa; /usr/bin/python /home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/command.py; rm -rf "/home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/" > /dev/null 2>&1
> 03:35:51  |               \_ /bin/sh -c echo BECOME-SUCCESS-dfogouwpzqdyrcchojmzbthdrrsfdtfa; /usr/bin/python /home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/command.py; rm -rf "/home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/" > /dev/null 2>&1
> 03:35:51  |                   \_ /usr/bin/python /home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/command.py
> 03:35:51  |                       \_ /usr/bin/python /tmp/ansible_c7pDky/ansible_module_command.py
> 03:35:51  |                           \_ python /bin/ceph --connect-timeout 3 --cluster ceph fsid
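
To confirm the hang independently of Ansible, the same command can be run by hand on controller-1 and bounded with timeout (a sketch; an exit code of 124 would mean the command did not return on its own):

# run the exact command ceph-ansible is stuck on, but give up after 30 seconds
sudo timeout 30 ceph --connect-timeout 3 --cluster ceph fsid
echo "exit code: $?"   # 124 = killed by timeout, i.e. the command hung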

Comment 1 Pavel Sedlák 2018-03-06 14:02:24 UTC
Not sure what the ceph fsid command is doing; a snippet of an strace of it (which repeats when watched live) is below.

Also, ceph --help shows:
> Monitor commands: 
>  =================
> [Contacting monitor, timeout after 5 seconds]
> 2018-03-06 14:00:36.589137 7f757a0d6700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
> 2018-03-06 14:00:36.591611 7f7578158700  0 -- :/3709006390 >> 172.17.3.13:6789/0 pipe(0x7f757405dba0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f757405ee60).fault
> 2018-03-06 14:00:39.589775 7f7570ff9700  0 -- :/3709006390 >> 172.17.3.16:6789/0 pipe(0x7f7568000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f7568001f90).fault
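
The output above points at two separate problems: no client keyring under /etc/ceph, and no reachable monitor on port 6789. Both can be checked directly from controller-1 (a minimal sketch; the monitor IPs are taken from the fault messages above):

# is there any keyring the ceph CLI could use?
ls -l /etc/ceph/

# can the monitors named in the fault messages be reached on the monitor port?
for mon in 172.17.3.13 172.17.3.16; do
    timeout 3 bash -c "</dev/tcp/${mon}/6789" \
        && echo "${mon}:6789 reachable" \
        || echo "${mon}:6789 unreachable"
done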

strace snippet:
> [pid 45190] <... select resumed> )      = 0 (Timeout)
> [pid 45190] clone(child_stack=0x7f41c6ffcfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f41c6ffd9d0, tls=0x7f41c6ffd700, child_tidptr=0x7f41c6ffd9d0) = 354090
> /tmp/strace: Process 354090 attached
> [pid 45190] futex(0x10c9180, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354090] set_robust_list(0x7f41c6ffd9e0, 24) = 0
> [pid 354090] futex(0x10c9180, FUTEX_WAKE_PRIVATE, 1) = 1
> [pid 45190] <... futex resumed> )       = 0
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354090] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 45190] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
> [pid 354090] <... futex resumed> )      = 0
> [pid 354090] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354090] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> )       = 0
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354090] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 1
> [pid 45190] <... futex resumed> )       = 0
> [pid 354090] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354090] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> )       = 0
> [pid 354090] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354090] <... futex resumed> )      = 0
> [pid 354090] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 1
> [pid 354090] madvise(0x7f41c67fd000, 8368128, MADV_DONTNEED <unfinished ...>
> [pid 45190] <... futex resumed> )       = 0
> [pid 354090] <... madvise resumed> )    = 0
> [pid 354090] exit(0)                    = ?
> [pid 354090] +++ exited with 0 +++
> [pid 45190] clone(/tmp/strace: Process 354091 attached
>  <unfinished ...>
> [pid 354091] set_robust_list(0x7f41c6ffd9e0, 24 <unfinished ...>
> [pid 45190] <... clone resumed> child_stack=0x7f41c6ffcfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f41c6ffd9d0, tls=0x7f41c6ffd700, child_tidptr=0x7f41c6ffd9d0) = 354091
> [pid 354091] <... set_robust_list resumed> ) = 0
> [pid 354091] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354091] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> )       = 0
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354091] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 45190] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
> [pid 354091] futex(0xf64370, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0xf64370, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354091] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> )       = 0
> [pid 354091] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354091] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> )       = 0
> [pid 45190] futex(0x10c9180, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354091] futex(0x10c9180, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 45190] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354091] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 45190] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
> [pid 354091] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 354091] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354091] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> )       = 0
> [pid 45190] select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
> [pid 354091] madvise(0x7f41c67fd000, 8368128, MADV_DONTNEED) = 0
> [pid 354091] exit(0)                    = ?
> [pid 354091] +++ exited with 0 +++
> [pid 45190] <... select resumed> )      = 0 (Timeout)

Comment 5 Alex Schultz 2018-03-07 15:39:31 UTC
This is not a duplicate. The issues are different.

Comment 7 John Fulton 2018-03-07 17:04:36 UTC
ceph-ansible 3.1 contains a fix which resolves this problem: 

https://github.com/ceph/ceph-ansible/commit/ec16cbdb1af9069de09d4a2e2e88739c2c303350

This bug now depends on bug 1548353, whose goal is to get ceph-ansible 3.1 into OSP 13. A version check is sketched below.
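
Once the newer package lands, whether a given undercloud already carries the fix can be checked from the installed package (a sketch; per the comment above, a 3.1.0 build is what is needed):

# installed ceph-ansible version on the undercloud
rpm -q ceph-ansible                              # e.g. ceph-ansible-3.0.14-1.el7cp is affected; a 3.1.0 build should contain the fix
rpm -q --changelog ceph-ansible | head -n 20     # changelog can help confirm which upstream fixes were pulled in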

Comment 9 Ken Dreyer (Red Hat) 2018-03-19 15:50:14 UTC
John, would you please confirm that 3.1.0beta3 (or 3.1.0beta4) fixes this bug?

Comment 10 John Fulton 2018-03-27 13:21:20 UTC
- Resetting the assignee to guits because his linked PR is what fixed the issue.
- I verified that ceph-ansible-3.1.0.0-0.beta4.1.el7.noarch contains the fix.
- I did several deployments and did not hit the reported timeout.

Comment 12 errata-xmlrpc 2018-09-26 18:19:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2819