Overcloud deploys time out, possibly due to a hanging "ceph --connect-timeout 3 --cluster ceph fsid" on controller-1.

Versions:
ceph-ansible-3.0.14-1.el7cp.noarch (undercloud)
ceph-common-10.2.10-17.el7cp.x86_64 (controller-1)

openstack workflow execution list | grep -v ' SUCCESS '
> | ID | Workflow ID | Workflow name | Workflow namespace | Description | Task Execution ID | State | State info | Created at | Updated at |
> | 3e1b36ab-7974-4d72-b50b-eb07836074e2 | 344043da-a011-4753-81cd-1290cf020a68 | tripleo.plan_management.v1.publish_ui_logs_to_swift | | {"triggered_by": {"type": "cron_trigger", "id": "8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8", "name": "publish-ui-logs-hourly"}, "description": "Workflow execution created by cron trigger '(8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8)'."} | <none> | ERROR | None | 2018-03-06 10:05:01 | 2018-03-06 10:05:07 |
> | c63f2e47-8e90-4e7d-bd1c-db6f2e580b25 | cbf3f2c2-5fc3-461e-8f7f-6fe28c438599 | tripleo.overcloud.workflow_tasks.step2 | | Heat managed | <none> | RUNNING | None | 2018-03-06 10:16:48 | 2018-03-06 10:16:48 |
> | cca9c080-2709-4562-be83-b87db61fd878 | 35626a61-1127-44ef-ae59-2ba188b8592d | tripleo.storage.v1.ceph-install | | sub-workflow execution | a72f6980-4a8d-4be6-afb9-fb1ae4be7e16 | RUNNING | None | 2018-03-06 10:16:49 | 2018-03-06 10:16:49 |
> | 9ed63a6c-0c7c-417f-947d-31649f4bd949 | 344043da-a011-4753-81cd-1290cf020a68 | tripleo.plan_management.v1.publish_ui_logs_to_swift | | {"triggered_by": {"type": "cron_trigger", "id": "8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8", "name": "publish-ui-logs-hourly"}, "description": "Workflow execution created by cron trigger '(8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8)'."} | <none> | ERROR | None | 2018-03-06 11:06:29 | 2018-03-06 11:06:36 |
> | d09b34ad-ab8e-4632-8f2e-0d0485de3e8d | 344043da-a011-4753-81cd-1290cf020a68 | tripleo.plan_management.v1.publish_ui_logs_to_swift | | {"triggered_by": {"type": "cron_trigger", "id": "8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8", "name": "publish-ui-logs-hourly"}, "description": "Workflow execution created by cron trigger '(8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8)'."} | <none> | ERROR | None | 2018-03-06 12:08:20 | 2018-03-06 12:08:27 |
> | 0141d397-b0c4-4362-8786-413f8287c1f7 | 344043da-a011-4753-81cd-1290cf020a68 | tripleo.plan_management.v1.publish_ui_logs_to_swift | | {"triggered_by": {"type": "cron_trigger", "id": "8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8", "name": "publish-ui-logs-hourly"}, "description": "Workflow execution created by cron trigger '(8eca229a-5ece-4fb4-9306-ae9d0fe6dbd8)'."} | <none> | ERROR | None | 2018-03-06 13:09:25 | 2018-03-06 13:09:32 |

Notice how long the RUNNING executions have been going; the current time is:
> date --utc
> Tue Mar 6 13:43:18 UTC 2018

ps -ef | grep ansible
> mistral  10767   1371  4 05:17 ?  00:10:06 /usr/bin/python2 /bin/ansible-playbook /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actionpFS4X7/inventory.yaml --private-key /tmp/ansible-mistral-actionpFS4X7/ssh_private_key --skip-tags package-install,with_pkg
> mistral  10789      1  0 05:17 ?  00:00:00 ssh: /tmp/ansible-mistral-actionpFS4X7/.ansible/cp/192.168.24.10-tripleo-admin-22 [mux]
> mistral  12320  10767  0 05:19 ?  00:00:00 /usr/bin/python2 /bin/ansible-playbook /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actionpFS4X7/inventory.yaml --private-key /tmp/ansible-mistral-actionpFS4X7/ssh_private_key --skip-tags package-install,with_pkg
> mistral  12326  12320  0 05:19 ?  00:00:00 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o IdentityFile="/tmp/ansible-mistral-actionpFS4X7/ssh_private_key" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=tripleo-admin -o ConnectTimeout=30 -o ControlPath=/tmp/ansible-mistral-actionpFS4X7/.ansible/cp/%h-%r-%p -tt 192.168.24.10 /bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-dfogouwpzqdyrcchojmzbthdrrsfdtfa; /usr/bin/python /home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/command.py; rm -rf "/home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/" > /dev/null 2>&1'"'"' && sleep 0'

The 192.168.24.10 address visible above is controller-1; from there:
> [heat-admin@controller-1 ~]$ ps axf -o etime,args   # snippet
> 03:37:47  \_ sshd: tripleo-admin [priv]
> 03:37:46  |   \_ sshd: tripleo-admin@pts/0
> 03:35:51  |       \_ /bin/sh -c sudo -H -S -n -u root /bin/sh -c 'echo BECOME-SUCCESS-dfogouwpzqdyrcchojmzbthdrrsfdtfa; /usr/bin/python /home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/command.py; rm -rf "/home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/" > /dev/null 2>&1' && sleep 0
> 03:35:51  |           \_ sudo -H -S -n -u root /bin/sh -c echo BECOME-SUCCESS-dfogouwpzqdyrcchojmzbthdrrsfdtfa; /usr/bin/python /home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/command.py; rm -rf "/home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/" > /dev/null 2>&1
> 03:35:51  |               \_ /bin/sh -c echo BECOME-SUCCESS-dfogouwpzqdyrcchojmzbthdrrsfdtfa; /usr/bin/python /home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/command.py; rm -rf "/home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/" > /dev/null 2>&1
> 03:35:51  |                   \_ /usr/bin/python /home/tripleo-admin/.ansible/tmp/ansible-tmp-1520331591.88-25827043162241/command.py
> 03:35:51  |                       \_ /usr/bin/python /tmp/ansible_c7pDky/ansible_module_command.py
> 03:35:51  |                           \_ python /bin/ceph --connect-timeout 3 --cluster ceph fsid
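The etime column above makes long-stuck tasks easy to spot. As an aside, procps-ng also exposes etimes (elapsed time in whole seconds), which is easier to filter numerically than the dd-hh:mm:ss form of etime; a small sketch of that kind of check:

```shell
#!/bin/sh
# List every process that has been running for more than an hour.
# etimes (procps-ng) is the elapsed time in whole seconds, so the
# awk comparison is a plain numeric test against 3600.
ps -eo etimes=,pid=,args= | awk '$1 > 3600 { print }'
```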
I'm not sure what the ceph fsid call is doing internally (it should just print the cluster's UUID); an strace snippet of it, which just repeats like this when watched live, is below. Note that even ceph --help tries to contact a monitor, and shows the client can neither find a keyring nor reach the monitors:
> Monitor commands:
> =================
> [Contacting monitor, timeout after 5 seconds]
> 2018-03-06 14:00:36.589137 7f757a0d6700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
> 2018-03-06 14:00:36.591611 7f7578158700  0 -- :/3709006390 >> 172.17.3.13:6789/0 pipe(0x7f757405dba0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f757405ee60).fault
> 2018-03-06 14:00:39.589775 7f7570ff9700  0 -- :/3709006390 >> 172.17.3.16:6789/0 pipe(0x7f7568000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f7568001f90).fault

strace snippet:
> [pid 45190] <... select resumed> ) = 0 (Timeout)
> [pid 45190] clone(child_stack=0x7f41c6ffcfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f41c6ffd9d0, tls=0x7f41c6ffd700, child_tidptr=0x7f41c6ffd9d0) = 354090
> /tmp/strace: Process 354090 attached
> [pid 45190] futex(0x10c9180, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354090] set_robust_list(0x7f41c6ffd9e0, 24) = 0
> [pid 354090] futex(0x10c9180, FUTEX_WAKE_PRIVATE, 1) = 1
> [pid 45190] <... futex resumed> ) = 0
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354090] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 45190] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 354090] <... futex resumed> ) = 0
> [pid 354090] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354090] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> ) = 0
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354090] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 1
> [pid 45190] <... futex resumed> ) = 0
> [pid 354090] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354090] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> ) = 0
> [pid 354090] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354090] <... futex resumed> ) = 0
> [pid 354090] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 1
> [pid 354090] madvise(0x7f41c67fd000, 8368128, MADV_DONTNEED <unfinished ...>
> [pid 45190] <... futex resumed> ) = 0
> [pid 354090] <... madvise resumed> ) = 0
> [pid 354090] exit(0) = ?
> [pid 354090] +++ exited with 0 +++
> [pid 45190] clone(/tmp/strace: Process 354091 attached
> <unfinished ...>
> [pid 354091] set_robust_list(0x7f41c6ffd9e0, 24 <unfinished ...>
> [pid 45190] <... clone resumed> child_stack=0x7f41c6ffcfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f41c6ffd9d0, tls=0x7f41c6ffd700, child_tidptr=0x7f41c6ffd9d0) = 354091
> [pid 354091] <... set_robust_list resumed> ) = 0
> [pid 354091] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354091] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> ) = 0
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354091] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 45190] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 354091] futex(0xf64370, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0xf64370, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354091] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> ) = 0
> [pid 354091] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354091] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> ) = 0
> [pid 45190] futex(0x10c9180, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354091] futex(0x10c9180, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 45190] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 354091] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 45190] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 354091] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 354091] futex(0x1073f40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 45190] futex(0x1073f40, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 354091] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 45190] <... futex resumed> ) = 0
> [pid 45190] select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
> [pid 354091] madvise(0x7f41c67fd000, 8368128, MADV_DONTNEED) = 0
> [pid 354091] exit(0) = ?
> [pid 354091] +++ exited with 0 +++
> [pid 45190] <... select resumed> ) = 0 (Timeout)
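Taken together, the strace loop (clone a short-lived thread, spin on futexes, sleep in select, repeat) and the keyring/monitor errors suggest the client keeps retrying the unreachable monitors and its own --connect-timeout never takes effect. A safe way to confirm a hang like this by hand, without wedging your shell, is to re-run the suspect command bounded by coreutils timeout(1); in this sketch "sleep 300" stands in for the wedged ceph call:

```shell
#!/bin/sh
# Bounded re-run of a suspect command. "sleep 300" is a stand-in for
# "ceph --connect-timeout 3 --cluster ceph fsid", which here never returns.
timeout 2 sleep 300
rc=$?
if [ "$rc" -eq 124 ]; then
    # timeout(1) exits with 124 when it had to kill the command itself
    echo "command hung: killed after 2s (exit $rc)"
else
    echo "command returned on its own (exit $rc)"
fi
```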
This is not a duplicate. The issues are different.
ceph-ansible 3.1 contains a fix which resolves this problem: https://github.com/ceph/ceph-ansible/commit/ec16cbdb1af9069de09d4a2e2e88739c2c303350

This bug now depends on bug 1548353, whose goal is to get ceph-ansible 3.1 into OSP 13.
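For context, the usual shape of this kind of fix (a hedged sketch of the pattern only, not the verbatim ceph-ansible patch; see the linked commit for the real change) is to put a hard timeout around the fsid lookup so a wedged client fails fast instead of stalling the play:

```shell
#!/bin/sh
# Sketch of the bounded-lookup pattern. "sleep 60" stands in for
# "ceph --connect-timeout 3 --cluster ceph fsid"; -s KILL guarantees
# termination even if the client ignores SIGTERM.
fsid=$(timeout -s KILL 3 sleep 60 2>/dev/null)
rc=$?
if [ "$rc" -eq 0 ]; then
    echo "fsid: $fsid"
else
    # 137 = 128 + 9 (SIGKILL); treat the fsid as unknown and move on
    echo "fsid lookup killed (exit $rc); leaving fsid empty"
fi
```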
John, would you please confirm that 3.1.0beta3 (or 3.1.0beta4) fixes this bug?
- Resetting assignee to guits, because his linked PR is what fixed the issue.
- I verified that ceph-ansible-3.1.0.0-0.beta4.1.el7.noarch contains the fix.
- I did several deployments and did not experience the reported timeout.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2819