Bug 1729896 - ansible hangs on task trying to open ssh connection
Summary: ansible hangs on task trying to open ssh connection
Keywords:
Status: ASSIGNED
Alias: None
Product: Fedora
Classification: Fedora
Component: ansible
Version: 31
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Kevin Fenzi
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-07-15 09:19 UTC by Pavel Březina
Modified: 2019-11-01 08:52 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:


Attachments
ansible.log (5.64 MB, text/plain)
2019-07-15 09:42 UTC, Pavel Březina
journal log (8.64 KB, text/plain)
2019-07-15 09:42 UTC, Pavel Březina
main process backtrace (31.41 KB, text/plain)
2019-08-28 09:01 UTC, Pavel Březina
child process backtrace (55.05 KB, text/plain)
2019-08-28 09:01 UTC, Pavel Březina

Description Pavel Březina 2019-07-15 09:19:51 UTC
Description of problem:
Ansible hangs on a task when trying to establish an ssh connection.

I have this playbook which is run on several vagrant machines:
https://github.com/SSSD/sssd-test-suite/blob/master/provision/prepare-guests.yml

I never run into an issue running it locally, but when I run it in OpenStack with nested virtualization, Ansible quite often hangs at a specific point in the playbook. It sometimes does not hang, but more often it does.

It hangs on line 31 of the playbook in the first task of the role:
https://github.com/SSSD/sssd-test-suite/blob/master/provision/prepare-guests.yml#L31
https://github.com/SSSD/sssd-test-suite/blob/master/provision/roles/ipa/tasks/main.yml#L1

It waits on the ssh command and never finishes or times out; I have to kill it manually.

Version-Release number of selected component (if applicable):
ansible-2.8.1-1.fc29.noarch
vagrant-2.1.2-3.fc29.noarch
openssh-7.9p1-6.fc29.x86_64


How reproducible:
Quite often but not every time.


Steps to Reproduce:
$ POOL_DIR=/path/to/sssd/test/suite/pool
$ git clone https://github.com/SSSD/sssd-test-suite.git
$ cd sssd-test-suite
$ ./install-cli-deps.sh
$ ./sssd-test-suite provision host --pool "$POOL_DIR"
$ cp ./configs/sssd-f30.json config.json
$ ./sssd-test-suite up ipa ldap client -s && ./sssd-test-suite provision guest ipa ldap client

Actual results:
Ansible hangs on TASK [ipa : Remove hostname from /etc/hosts]


Expected results:
Ansible does not hang.


Additional info:
Attached ansible logs and journal log from the target machine.

Comment 1 Pavel Březina 2019-07-15 09:42:29 UTC
Created attachment 1590672 [details]
ansible.log

Comment 2 Pavel Březina 2019-07-15 09:42:46 UTC
Created attachment 1590673 [details]
journal log

Comment 3 Kevin Fenzi 2019-07-17 22:01:33 UTC
It looks like that connection never happens/completes (at least from the target machine side).

Can you run ansible-playbook with -vvvv and ANSIBLE_DEBUG=1 and see if it gives any hints as to what's happening?

Comment 4 Pavel Březina 2019-08-05 07:20:47 UTC
The output was generated with ANSIBLE_DEBUG=true and ANSIBLE_LOG_PATH=file.log, and I believe -vvvv was also passed to ansible-playbook. In addition, -vvv was passed in the ssh args. As you can see at the end, there is 'SSH: EXEC ssh -vvv', but no debug information was printed by ssh, so it looks like the ssh command never gets executed, or Ansible is unable to read its pipe, or something like that.

I was unable to reproduce it this time, but if you see that the log file does not contain -vvvv information, I will try again.

Comment 5 Pavel Březina 2019-08-28 08:39:53 UTC
I finally managed to get a gdb backtrace from the deadlock. There is one ansible-playbook process that spawned another ansible-playbook process, which I believe is supposed to exec ssh later. The two processes end up in a deadlock; see the attached backtraces.

I consulted this with Florian Weimer from glibc. Apparently it is not OK to call os.fork() in a multi-threaded process, as it requires special treatment; Ansible should use subprocess to re-invoke itself.

Quoting Florian (several mails concatenated together):
> It looks like the parent process is multi-threaded and calls fork.  That
> requires special care to work properly.  There's a glibc upstream bug
> about it:
>
>  <https://sourceware.org/bugzilla/show_bug.cgi?id=19282>
>
> Making this work in general on the glibc side could be really tricky.
> We can likely fix this grantpt issue, but then ansible might run into
> something else.
>
> I posted a patch to fix the grantpt issue:
>
>  <https://sourceware.org/bugzilla/show_bug.cgi?id=24941>
>  <https://sourceware.org/ml/libc-alpha/2019-08/msg00666.html>
>
> Ansible uses pseudo-terminals extensively.  Maybe behavior differs
> depending on whether it is invoked with a terminal or not?  It could
> also be related to daemon mode.
>
> I'm not myself an ansible user, so I really don't know the details.  I
> just had a quick look at the Fedora sources.
>
> If it gets accepted upstream, I will put it into Fedora.
>
>> What can Ansible do about it?
> It should use the subprocess to re-invoke itself, and stop using
> os.fork.  At the very least, they should not call fork while
> multi-threaded.

BTW, I never ran into this bug locally, but it happens quite often when it is run via screen or cron. In fact, this makes it unreliable and unusable.
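Florian's advice can be illustrated with a minimal, self-contained Python sketch (this is not Ansible code; the child command is just a placeholder): subprocess performs fork and exec back to back, so the child never runs Python-level code while the parent is multi-threaded.

```python
import subprocess
import sys
import threading

# Simulate the situation from the backtraces: the parent process is
# multi-threaded.  After fork() in a multi-threaded process, POSIX only
# guarantees the child can safely call async-signal-safe functions
# before exec; glibc internals such as grantpt() are not, so the child
# can block forever on locks held by threads that no longer exist.
stop = threading.Event()
worker = threading.Thread(target=stop.wait)
worker.start()

# The suggested fix: instead of os.fork() followed by running Python
# code in the child, re-invoke the interpreter via subprocess.  There
# fork and exec happen back to back, so the child never executes
# arbitrary code while holding a copy of the parent's lock state.
result = subprocess.run(
    [sys.executable, "-c", "print('child ran safely')"],
    capture_output=True,
    text=True,
)

stop.set()
worker.join()
```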

Comment 6 Pavel Březina 2019-08-28 09:01:26 UTC
Created attachment 1608911 [details]
main process backtrace

Comment 7 Pavel Březina 2019-08-28 09:01:47 UTC
Created attachment 1608912 [details]
child process backtrace

Comment 8 Kevin Fenzi 2019-09-02 20:13:33 UTC
Would you be willing to file this upstream? Or would you prefer I do so?

I am not at all sure what's going on here. ;(

Comment 9 Pavel Březina 2019-09-03 09:07:37 UTC
I submitted https://github.com/ansible/ansible/issues/61701

Comment 10 Ben Cotton 2019-10-31 18:52:51 UTC
This message is a reminder that Fedora 29 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 29 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version prior to this bug being closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

