Description of problem: Ansible hangs on task when trying to establish ssh connection. I have this playbook which is run on several vagrant machines: https://github.com/SSSD/sssd-test-suite/blob/master/provision/prepare-guests.yml I never run into an issue running it locally but when I run it in openstack with nested virtualization, ansible quite often hangs at specific point of the playbook. It sometimes does not hang but more often it does. It hangs on line 31 of the playbook in the first task of the role: https://github.com/SSSD/sssd-test-suite/blob/master/provision/prepare-guests.yml#L31 https://github.com/SSSD/sssd-test-suite/blob/master/provision/roles/ipa/tasks/main.yml#L1 It waits on ssh command and never finishes or timeout, I have to kill it manually. Version-Release number of selected component (if applicable): ansible-2.8.1-1.fc29.noarch vagrant-2.1.2-3.fc29.noarch openssh-7.9p1-6.fc29.x86_64 How reproducible: Quite often but not every time. Steps to Reproduce: $ POOL_DIR=/path/to/sssd/test/suite/pool $ git clone https://github.com/SSSD/sssd-test-suite.git $ cd sssd-test-suite $ ./install-cli-deps.sh $ ./sssd-test-suite provision host --pool "$POOL_DIR" $ cp ./configs/sssd-f30.json config.json $ ./sssd-test-suite up ipa ldap client -s && ./sssd-test-suite provision guest ipa ldap client Actual results: Ansible hangs on TASK [ipa : Remove hostname from /etc/hosts] Expected results: Ansible does not hang. Additional info: Attached ansible logs and journal log from the target machine.
Created attachment 1590672 [details] ansible.log
Created attachment 1590673 [details] journal log
It looks like that connection never happens/compeltes (at least from the target machine side). Can you run ansible-playbook with -vvvv and ANSIBLE_DEBUG=1 and see if it gives any hints as to whats happening?
The output was generated with ANSIBLE_DEBUG=true, ANSIBLE_LOG_PATH=file.log and I believe also -vvvv was passed to ansible-playbook. In addition to this, -vvv was passed to ssh args. As you can see in the end, there is 'SSH: EXEC ssh -vvv' however no debug information was printed by ssh so it looks like the ssh command gets never executed or ansible is unable to read its pipe or something like that. I was unable to reproduce it this time, but if you see that the log file does not contain -vvvv information, I will try again.
I finally managed to get gdb backtrace from the dead lock. There is one ansible-playbook process that spawned another ansible-playbook process which I believe is suppose to become ssh later. The two processes ends up in deadlock, see the attached backtrace. I consulted this with Florian Weimer from glibc. Apparently it is not OK to call os.fork() in multi-threaded processes as it requires some special treatment and Ansible should use subprocess to re-invoke itself. Quoting Florian (concatenated several mail together): > It looks like the parent process is multi-threaded and calls fork. That > requires special care to work properly. There's a glibc upstream bug > about it: > > <https://sourceware.org/bugzilla/show_bug.cgi?id=19282> > > Making this work in general on the glibc side could be really tricky. > We can likely fix this grantpt issue, but then ansible might run into > something else. > > I posted a patch to fix the grantpt issue: > > <https://sourceware.org/bugzilla/show_bug.cgi?id=24941> > <https://sourceware.org/ml/libc-alpha/2019-08/msg00666.html> > > Ansible uses pseudo-terminals extensively. Maybe behavior differs > depending on whether it is invoked with a terminal or not? It could > also be related to daemon mode. > > I'm not myself an ansible user, so I really don't know the details. I > just had a quick look at the Fedora sources. > > If it gets accepted upstream, I will put it into Fedora. > >> What can Ansible do about it? > It should use the subprocess to re-invoke itself, and stop using > os.fork. At the very least, they should not call fork while > multi-threaded. BTW I never run into this bug locally, but it happens quite often when it is run via screen or cron. It fact it makes it unreliable and unusable.
Created attachment 1608911 [details] main process backtrace
Created attachment 1608912 [details] child process backtrace
Would you be willing to file this upstream? Or would you prefer I do so? I am not at all sure whats going on here. ;(
I submitted https://github.com/ansible/ansible/issues/61701
This message is a reminder that Fedora 29 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '29'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 29 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
This message is a reminder that Fedora 31 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 31 on 2020-11-24. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '31'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 31 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 31 changed to end-of-life (EOL) status on 2020-11-24. Fedora 31 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.