Bug 1552101

Summary: OC deploy timeouts with 3 ceph nodes - ceph fsid hangs
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: Ceph-Ansible
Version: 3.0
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Pavel Sedlák <psedlak>
Assignee: Guillaume Abrioux <gabrioux>
QA Contact: Yogev Rabl <yrabl>
CC: adeza, anharris, aschoen, aschultz, ceph-eng-bugs, dbecker, dsariel, gfidente, gmeno, johfulto, kdreyer, mburns, morazi, nthomas, ohochman, pgrist, rhel-osp-director-maint, sankarshan, skatlapa, vashastr
Keywords: Reopened
Target Milestone: rc
Target Release: 3.1
Hardware: Unspecified
OS: Unspecified
Fixed In Version: RHEL: ceph-ansible-3.1.0-0.1.beta3.el7cp
Doc Type: If docs needed, set a value
Clones: 1552769, 1559275 (view as bug list)
Bug Blocks: 1548353, 1552769
Type: Bug
Last Closed: 2018-09-26 18:19:40 UTC

Description Pavel Sedlák 2018-03-06 13:59:04 UTC
Overcloud deploy times out, possibly due to a hanging "ceph --connect-timeout 3 --cluster ceph fsid" process on controller-1.

ceph-ansible-3.0.14-1.el7cp.noarch (undercloud)
ceph-common-10.2.10-17.el7cp.x86_64 (controller-1)

openstack workflow execution list|grep -v ' SUCCESS '
> +--------------------------------------+--------------------------------------+------------------------------------------------------------------------+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
> | ID                                   | Workflow ID                          | Workflow name                                                          | Workflow namespace | Description                                                                                                                                                                                                                       | Task Execution ID                    | State   | State info | Created at          | Updated at          |
> +--------------------------------------+--------------------------------------+------------------------------------------------------------------------+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
> | 3e1b36ab-7974-4d72-b50b-eb07836074e2 | 344043da-a011-4753-81cd-1290cf020a68 | tripleo.plan_management.v1.publish_ui_logs_to_swift                    |                    | {"triggered_by": {"type": "cron_trigger", "id": "8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8", "name": "publish-ui-logs-hourly"}, "description": "Workflow execution created by cron trigger '(8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8)'."} | <none>                               | ERROR   | None       | 2018-03-06 10:05:01 | 2018-03-06 10:05:07 |
> | c63f2e47-8e90-4e7d-bd1c-db6f2e580b25 | cbf3f2c2-5fc3-461e-8f7f-6fe28c438599 | tripleo.overcloud.workflow_tasks.step2                                 |                    | Heat managed                                                                                                                                                                                                                      | <none>                               | RUNNING | None       | 2018-03-06 10:16:48 | 2018-03-06 10:16:48 |
> | cca9c080-2709-4562-be83-b87db61fd878 | 35626a61-1127-44ef-ae59-2ba188b8592d | tripleo.storage.v1.ceph-install                                        |                    | sub-workflow execution                                                                                                                                                                                                            | a72f6980-4a8d-4be6-afb9-fb1ae4be7e16 | RUNNING | None       | 2018-03-06 10:16:49 | 2018-03-06 10:16:49 |
> | 9ed63a6c-0c7c-417f-947d-31649f4bd949 | 344043da-a011-4753-81cd-1290cf020a68 | tripleo.plan_management.v1.publish_ui_logs_to_swift                    |                    | {"triggered_by": {"type": "cron_trigger", "id": "8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8", "name": "publish-ui-logs-hourly"}, "description": "Workflow execution created by cron trigger '(8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8)'."} | <none>                               | ERROR   | None       | 2018-03-06 11:06:29 | 2018-03-06 11:06:36 |
> | d09b34ad-ab8e-4632-8f2e-0d0485de3e8d | 344043da-a011-4753-81cd-1290cf020a68 | tripleo.plan_management.v1.publish_ui_logs_to_swift                    |                    | {"triggered_by": {"type": "cron_trigger", "id": "8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8", "name": "publish-ui-logs-hourly"}, "description": "Workflow execution created by cron trigger '(8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8)'."} | <none>                               | ERROR   | None       | 2018-03-06 12:08:20 | 2018-03-06 12:08:27 |
> | 0141d397-b0c4-4362-8786-413f8287c1f7 | 344043da-a011-4753-81cd-1290cf020a68 | tripleo.plan_management.v1.publish_ui_logs_to_swift                    |                    | {"triggered_by": {"type": "cron_trigger", "id": "8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8", "name": "publish-ui-logs-hourly"}, "description": "Workflow execution created by cron trigger '(8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8)'."} | <none>                               | ERROR   | None       | 2018-03-06 13:09:25 | 2018-03-06 13:09:32 |
> +--------------------------------------+--------------------------------------+------------------------------------------------------------------------+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+

Notice how long they have been running; the current time is:
> date --utc
> Tue Mar  6 13:43:18 UTC 2018

ps -ef |grep ansible
> mistral  10767  1371  4 05:17 ?        00:10:06 /usr/bin/python2 /bin/ansible-playbook /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actionpFS4X7/inventory.yaml --private-key /tmp/ansible-mistral-actionpFS4X7/ssh_private_key --skip-tags package-install,with_pkg
> mistral  10789     1  0 05:17 ?        00:00:00 ssh: /tmp/ansible-mistral-actionpFS4X7/.ansible/cp/192.168.24.10-tripleo-admin-22 [mux]
> mistral  12320 10767  0 05:19 ?        00:00:00 /usr/bin/python2 /bin/ansible-playbook /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actionpFS4X7/inventory.yaml --private-key /tmp/ansible-mistral-actionpFS4X7/ssh_private_key --skip-tags package-install,with_pkg
> mistral  12326 12320  0 05:19 ?        00:00:00 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o IdentityFile="/tmp/ansible-mistral-actionpFS4X7/ssh_private_key" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=tripleo-admin -o ConnectTimeout=30 -o ControlPath=/tmp/ansible-mistral-actionpFS4X7/.ansible/cp/%h-%r-%p -tt 192.168.24.10 /bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-dfogouwpzqdyrcchojmzbthdrrsfdtfa; /usr/bin/python /home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/command.py; rm -rf "/home/tripleo-admin/.ansible/tmp ansible-tmp-1520331591.88-25827043162241/" > /dev/null 2>&1'"'"' && sleep 0'

The IP 192.168.24.10 visible above belongs to controller-1; from there:

> [heat-admin@controller-1 ~]$ ps axf -o etime,args  # snippet
> 03:37:47  \_ sshd: tripleo-admin [priv]
> 03:37:46  |   \_ sshd: tripleo-admin@pts/0
> 03:35:51  |       \_ /bin/sh -c sudo -H -S -n -u root /bin/sh -c 'echo BECOME-SUCCESS-dfogouwpzqdyrcchojmzbthdrrsfdtfa; /usr/bin/python /home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/command.py; rm -rf "/home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/" > /dev/null 2>&1' && sleep 0
> 03:35:51  |           \_ sudo -H -S -n -u root /bin/sh -c echo BECOME-SUCCESS-dfogouwpzqdyrcchojmzbthdrrsfdtfa; /usr/bin/python /home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/command.py; rm -rf "/home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/" > /dev/null 2>&1
> 03:35:51  |               \_ /bin/sh -c echo BECOME-SUCCESS-dfogouwpzqdyrcchojmzbthdrrsfdtfa; /usr/bin/python /home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/command.py; rm -rf "/home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/" > /dev/null 2>&1
> 03:35:51  |                   \_ /usr/bin/python /home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/command.py
> 03:35:51  |                       \_ /usr/bin/python /tmp/ansible_c7pDky/ansible_module_command.py
> 03:35:51  |                           \_ python /bin/ceph --connect-timeout 3 --cluster ceph fsid
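
To confirm the hang is in the ceph CLI itself rather than in the Ansible plumbing, the same command can be re-run by hand on controller-1 with a hard cap. This is an assumed diagnostic step (the 60-second bound and the use of coreutils timeout are mine, not taken from the deployment):

sudo timeout 60 ceph --connect-timeout 3 --cluster ceph fsid; echo "exit code: $?"
# an exit code of 124 means the 60 s cap expired, i.e. the client was still hung
# and its own --connect-timeout never took effect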

Comment 1 Pavel Sedlák 2018-03-06 14:02:24 UTC
Not sure what the ceph fsid command is doing; a snippet of its strace output (which keeps repeating when watched live) is included below.

Also, ceph --help shows:
> Monitor commands: 
>  =================
> [Contacting monitor, timeout after 5 seconds]
> 2018-03-06 14:00:36.589137 7f757a0d6700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
> 2018-03-06 14:00:36.591611 7f7578158700  0 -- :/3709006390 >> 172.17.3.13:6789/0 pipe(0x7f757405dba0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f757405ee60).fault
> 2018-03-06 14:00:39.589775 7f7570ff9700  0 -- :/3709006390 >> 172.17.3.16:6789/0 pipe(0x7f7568000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f7568001f90).fault
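
The errors above point at two separate things worth checking on controller-1: whether ceph.conf and a keyring have been deployed yet, and whether the monitors at 172.17.3.x:6789 are reachable at all. A couple of quick checks I would assume here (not taken from the report; the bash /dev/tcp probe and 5-second bound are my own choices):

ls -l /etc/ceph/
# is ceph.conf and any keyring present yet, or is the client running before the files land?
timeout 5 bash -c '</dev/tcp/172.17.3.13/6789' && echo "mon 172.17.3.13:6789 reachable" || echo "mon 172.17.3.13:6789 not reachable"
# exit 0 = TCP connect succeeded, exit 1 = refused, exit 124 = filtered/hanging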

strace snippet:
> [pid 45190] <... select resumed> )      = 0 (Timeout)
> [pid 45190] clone(child_stack=0x7f41c6ffcfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f41c6ffd9d0, tls=0x7f41c6ffd700, child_tidptr=0x7f41c6ffd9d0) = 354090
> /tmp/strace: Process 354090 attached
> [pid 45190] futex(0x10c9180, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354090] set_robust_list(0x7f41c6ffd9e0, 24) = 0
> [pid 354090] futex(0x10c9180, FUTEX_WAKE_PRIVATE, 1) = 1
> [pid 45190] <... futex resumed> )       = 0
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354090] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 45190] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
> [pid 354090] <... futex resumed> )      = 0
> [pid 354090] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354090] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> )       = 0
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354090] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 1
> [pid 45190] <... futex resumed> )       = 0
> [pid 354090] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354090] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> )       = 0
> [pid 354090] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354090] <... futex resumed> )      = 0
> [pid 354090] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 1
> [pid 354090] madvise(0x7f41c67fd000, 8368128, MADV_DONTNEED <unfinished ...>
> [pid 45190] <... futex resumed> )       = 0
> [pid 354090] <... madvise resumed> )    = 0
> [pid 354090] exit(0)                    = ?
> [pid 354090] +++ exited with 0 +++
> [pid 45190] clone(/tmp/strace: Process 354091 attached
>  <unfinished ...>
> [pid 354091] set_robust_list(0x7f41c6ffd9e0, 24 <unfinished ...>
> [pid 45190] <... clone resumed> child_stack=0x7f41c6ffcfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f41c6ffd9d0, tls=0x7f41c6ffd700, child_tidptr=0x7f41c6ffd9d0) = 354091
> [pid 354091] <... set_robust_list resumed> ) = 0
> [pid 354091] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354091] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> )       = 0
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354091] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 45190] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
> [pid 354091] futex(0xf64370, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0xf64370, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354091] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> )       = 0
> [pid 354091] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354091] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> )       = 0
> [pid 45190] futex(0x10c9180, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354091] futex(0x10c9180, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 45190] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354091] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 45190] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
> [pid 354091] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 354091] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354091] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> )       = 0
> [pid 45190] select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
> [pid 354091] madvise(0x7f41c67fd000, 8368128, MADV_DONTNEED) = 0
> [pid 354091] exit(0)                    = ?
> [pid 354091] +++ exited with 0 +++
> [pid 45190] <... select resumed> )      = 0 (Timeout)

Comment 5 Alex Schultz 2018-03-07 15:39:31 UTC
This is not a duplicate. The issues are different.

Comment 7 John Fulton 2018-03-07 17:04:36 UTC
ceph-ansible 3.1 contains a fix which resolves this problem: 

https://github.com/ceph/ceph-ansible/commit/ec16cbdb1af9069de09d4a2e2e88739c2c303350

This bug now depends on bug 1548353, whose goal is to get ceph-ansible 3.1 into OSP 13.
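
Without digging into the linked commit here, the general shape of this class of fix is to bound the monitor-dependent call so that an unreachable monitor makes the task fail quickly instead of wedging the whole ceph-ansible run. An illustrative shell sketch only, not the actual patch (the 10-second bound and use of coreutils timeout are assumptions):

timeout 10 ceph --connect-timeout 3 --cluster ceph fsid \
  || echo "fsid lookup did not finish in time; fail the task instead of hanging"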

Comment 9 Ken Dreyer (Red Hat) 2018-03-19 15:50:14 UTC
John, would you please confirm that 3.1.0beta3 (or 3.1.0beta4) fixes this bug?

Comment 10 John Fulton 2018-03-27 13:21:20 UTC
- Resetting assignee to guits (Guillaume Abrioux) because his linked PR is what fixed the issue.
- I verified that ceph-ansible-3.1.0.0-0.beta4.1.el7.noarch contains the fix.
- I did several deployments and did not experience the reported timeout.
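
For anyone re-checking on a similar setup, the installed build can be confirmed on the undercloud before re-running the deploy (assumed verification step, not quoted from the runs above):

rpm -q ceph-ansible
# expect ceph-ansible-3.1.0.0-0.beta4.1.el7.noarch or newer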

Comment 12 errata-xmlrpc 2018-09-26 18:19:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2819