Bug 1635923
| Summary: | Deployment freezes on openshift_node : install needed rpm(s) |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Installer |
| Version: | 3.10.0 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED DEFERRED |
| Severity: | medium |
| Priority: | low |
| Reporter: | jmselmi <jmselmi> |
| Assignee: | Scott Dodson <sdodson> |
| QA Contact: | Johnny Liu <jialiu> |
| CC: | aos-bugs, jmselmi, jokerman, mmccomas, rgordill, wmeng |
| Target Release: | 3.10.z |
| Type: | Bug |
| Last Closed: | 2019-02-15 21:18:34 UTC |
Description (jmselmi, 2018-10-04 00:23:16 UTC)
Created attachment 1490427 [details]
ansible-openshift -vvv
The playbook deploy_cluster.yml keeps freezing; no error is generated. When I canceled and tried again, the rpms were already installed on some nodes, but the playbook still froze on the remaining ones.
```
$ rpm -q openshift-ansible
openshift-ansible-3.10.47-1.git.0.95bc2d2.el7_5.noarch
$ rpm -q ansible
ansible-2.6.4-1.el7.noarch
$ ansible --version
ansible 2.6.4
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/jmselmi/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May 3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]
```
This often happens when yum is broken on the target hosts. What happens if you run this on the host?

```
yum install ansible bash-completion dnsmasq ntp logrotate httpd-tools bind-utils firewalld libselinux-python conntrack-tools openssl iproute python-dbus PyYAML yum-utils glusterfs-fuse device-mapper-multipath nfs-utils cockpit-ws cockpit-system cockpit-bridge cockpit-docker iscsi-initiator-utils ceph-common atomic
```

It's not broken:

```
$ sudo yum install ansible bash-completion dnsmasq ntp logrotate httpd-tools bind-utils firewalld libselinux-python conntrack-tools openssl iproute python-dbus PyYAML yum-utils glusterfs-fuse device-mapper-multipath nfs-utils cockpit-ws cockpit-system cockpit-bridge cockpit-docker iscsi-initiator-utils ceph-common atomic
Loaded plugins: product-id, search-disabled-repos, subscription-manager
rhel-7-server-ansible-2.4-rpms                           | 4.0 kB  00:00:00
rhel-7-server-extras-rpms                                | 3.4 kB  00:00:00
rhel-7-server-ose-3.10-rpms                              | 4.0 kB  00:00:00
rhel-7-server-rpms                                       | 3.5 kB  00:00:00
Package ansible-2.4.6.0-1.el7ae.noarch already installed and latest version
Package 1:bash-completion-2.1-6.el7.noarch already installed and latest version
Package matching dnsmasq-2.76-5.el7.x86_64 already installed. Checking for update.
Package ntp-4.2.6p5-28.el7.x86_64 already installed and latest version
Package logrotate-3.8.6-15.el7.x86_64 already installed and latest version
Package matching httpd-tools-2.4.6-80.el7_5.1.x86_64 already installed. Checking for update.
Package matching 32:bind-utils-9.9.4-61.el7_5.1.x86_64 already installed. Checking for update.
Package libselinux-python-2.5-12.el7.x86_64 already installed and latest version
Package matching conntrack-tools-1.4.4-3.el7_3.x86_64 already installed. Checking for update.
Package 1:openssl-1.0.2k-12.el7.x86_64 already installed and latest version
Package iproute-4.11.0-14.el7.x86_64 already installed and latest version
Package dbus-python-1.1.1-9.el7.x86_64 already installed and latest version
Package PyYAML-3.10-11.el7.x86_64 already installed and latest version
Package yum-utils-1.1.31-46.el7_5.noarch already installed and latest version
Package matching glusterfs-fuse-3.8.4-54.15.el7.x86_64 already installed. Checking for update.
Package matching device-mapper-multipath-0.4.9-119.el7_5.1.x86_64 already installed. Checking for update.
Package matching 1:nfs-utils-1.3.0-0.54.el7.x86_64 already installed. Checking for update.
Package matching iscsi-initiator-utils-6.2.0.874-7.el7.x86_64 already installed. Checking for update.
Package matching 1:ceph-common-0.94.5-2.el7.x86_64 already installed. Checking for update.
Package 1:atomic-1.22.1-25.git5a342e3.el7.x86_64 already installed and latest version
```

From the playbook:
```yaml
---
- name: install needed rpm(s)
  package:
    name: "{{ r_openshift_node_image_prep_packages | join(',') }}"
    state: present
  register: result
  until: result is succeeded
  when: not (openshift_is_atomic | default(False) | bool)
```
All the rpms were installed on the nodes, but I believe the result is somehow not reported correctly. Any ideas? Here's an example:
```
ocp310-infra-0 | SUCCESS | rc=0 >>
Loaded plugins: product-id, search-disabled-repos, subscription-manager
Package ansible-2.4.6.0-1.el7ae.noarch already installed and latest version
Package 1:bash-completion-2.1-6.el7.noarch already installed and latest version
Package dnsmasq-2.76-5.el7.x86_64 already installed and latest version
Package ntp-4.2.6p5-28.el7.x86_64 already installed and latest version
Package logrotate-3.8.6-15.el7.x86_64 already installed and latest version
Package httpd-tools-2.4.6-80.el7_5.1.x86_64 already installed and latest version
Package 32:bind-utils-9.9.4-61.el7_5.1.x86_64 already installed and latest version
Package firewalld-0.4.4.4-15.el7_5.noarch already installed and latest version
Package libselinux-python-2.5-12.el7.x86_64 already installed and latest version
Package conntrack-tools-1.4.4-3.el7_3.x86_64 already installed and latest version
Package 1:openssl-1.0.2k-12.el7.x86_64 already installed and latest version
Package iproute-4.11.0-14.el7.x86_64 already installed and latest version
Package dbus-python-1.1.1-9.el7.x86_64 already installed and latest version
Package PyYAML-3.10-11.el7.x86_64 already installed and latest version
Package yum-utils-1.1.31-46.el7_5.noarch already installed and latest version
Package glusterfs-fuse-3.8.4-54.15.el7.x86_64 already installed and latest version
Package device-mapper-multipath-0.4.9-119.el7_5.1.x86_64 already installed and latest version
Package 1:nfs-utils-1.3.0-0.54.el7.x86_64 already installed and latest version
Package cockpit-ws-154-3.el7.x86_64 already installed and latest version
Package cockpit-system-154-3.el7.noarch already installed and latest version
Package cockpit-bridge-154-3.el7.x86_64 already installed and latest version
Package cockpit-docker-176-2.el7.x86_64 already installed and latest version
Package iscsi-initiator-utils-6.2.0.874-7.el7.x86_64 already installed and latest version
Package 1:ceph-common-0.94.5-2.el7.x86_64 already installed and latest version
Nothing to do
```
No ideas; does it reproduce on a fresh set of hosts?

All I have is a GCP platform. I rebuilt all the instances, tried again, and still got the same results. One workaround that I applied is to install all the rpms before launching the deploy_cluster playbook. Cheers.

The only other item that comes to mind is that, prior to running the installer, you should make sure that all installed packages are up to date. Can you make sure to run `yum upgrade` and reboot all hosts before running the installer?

Well, I used `yum update -y` instead before running the installer.

One potential cause for this is another process holding the yum lock. When the stall happens on a host, can you look at the contents of /var/run/yum.pid and investigate which process that is and what it's doing? https://github.com/ansible/ansible/issues/44120 is related, but only insofar as it adds a timeout for acquiring the lock; it would still fail.

```
$ file /var/run/yum.pid
/var/run/yum.pid: cannot open (No such file or directory)
```

I see the playbook running:

```
├─sshd -D
│ ├─sshd
│ │ └─sshd
│ │   └─sh -c sudo -H -S -n -u root /bin/sh -c 'echo BECOME-SUCCESS-bdpufjozgtbwxxukkyrnkhoqexikfsfd; /usr/bin/python' && sleep 0
│ │     └─sudo -H -S -n -u root /bin/sh -c echo BECOME-SUCCESS-bdpufjozgtbwxxukkyrnkhoqexikfsfd; /usr/bin/python
│ │       └─sh -c echo BECOME-SUCCESS-bdpufjozgtbwxxukkyrnkhoqexikfsfd; /usr/bin/python
│ │         └─python
│ │           └─python /tmp/ansible_VB9uyN/ansible_module_yum.py
```

Repos are fine.
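As a hedged aside (not part of the thread): the lock-file check suggested above can be scripted. The helper name `check_yum_lock` and the demonstration with a temporary file are illustrative; on a real host you would point it at the default /var/run/yum.pid.

```shell
#!/bin/sh
# Sketch: report which process, if any, holds the yum lock.
# The lock file path is parameterized so the logic can be exercised
# against any file; it defaults to yum's real lock file.
check_yum_lock() {
    lockfile="${1:-/var/run/yum.pid}"
    if [ ! -f "$lockfile" ]; then
        echo "no yum lock held"
        return 0
    fi
    pid=$(cat "$lockfile")
    if [ -d "/proc/$pid" ]; then
        # Show the command line of the lock holder (args are NUL-separated).
        echo "yum lock held by PID $pid: $(tr '\0' ' ' < "/proc/$pid/cmdline")"
    else
        echo "stale lock file: PID $pid is not running"
    fi
}

# Demonstration against a fake lock file holding a dead PID
# (99999999 exceeds the Linux pid_max limit, so it cannot be live).
tmplock=$(mktemp)
echo 99999999 > "$tmplock"
check_yum_lock "$tmplock"   # -> stale lock file: PID 99999999 is not running
rm -f "$tmplock"
```

A stale lock file would explain a hang on older yum versions that wait on it; here the reporter's `file /var/run/yum.pid` already shows no lock exists, which is what pushed the investigation elsewhere.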
```
$ yum repolist
Loaded plugins: product-id, search-disabled-repos, subscription-manager
repo id                                 repo name                                                                  status
google-cloud-compute                    Google Cloud Compute                                                           11
rhel-7-server-ansible-2.4-rpms/x86_64   Red Hat Ansible Engine 2.4 RPMs for Red Hat Enterprise Linux 7 Server          23
rhel-7-server-extras-rpms/x86_64        Red Hat Enterprise Linux 7 Server - Extras (RPMs)                             923
rhel-7-server-ose-3.10-rpms/x86_64      Red Hat OpenShift Container Platform 3.10 (RPMs)                              610
rhel-7-server-rpms/7Server/x86_64       Red Hat Enterprise Linux 7 Server (RPMs)                                   21,065
repolist: 22,632
```

The same happens with 3.11.16 in GCP.
I don't know if it is something related to the ansible arguments. On the bastion host, I see the ssh processes:

```
ssh -o ControlMaster=auto -o ControlPersist=600s -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=false -o ForwardAgent=yes -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=cloud-user -o ConnectTimeout=10 -o ControlPath=/home/cloud-user/.ansible/cp/%h-%r openshift-iberia-311-master-0 /bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-hpfawrdvmllxusjotnlmzmszjolxyzgs; /usr/bin/python'"'"' && sleep 0'
```
But when I execute the same command in parallel, I receive:

```
mux_client_request_session: read from master failed: Broken pipe
```
And then, in my ansible installation:

```
fatal: [openshift-iberia-311-master-0]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"openshift-iberia-311-master-0\". Make sure this host can be reached over ssh", "unreachable": true}
```
And the ansible mux communication is restarted:

```
cloud-u+ 13267     1  0 10:27 ?        00:00:00 ssh: /home/cloud-user/.ansible/cp/openshift-iberia-311-app-0-cloud-user [mux]
cloud-u+ 18743     1  0 12:03 ?        00:00:00 ssh: /home/cloud-user/.ansible/cp/openshift-iberia-311-master-0-cloud-user [mux]
```
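Not from the thread, but a quick triage step for a wedged ControlMaster like this is to enumerate the sockets under Ansible's ControlPath directory and probe each one. The helper below is an illustrative sketch (`list_mux_sockets` is not a real tool); the `ssh -O check` probe is left commented out because it needs a live remote host.

```shell
#!/bin/sh
# Sketch: list SSH ControlMaster sockets in Ansible's ControlPath dir.
# A dead master leaves a socket behind; removing it (or `ssh -O exit`)
# forces the next task to open a fresh master connection.
list_mux_sockets() {
    cpdir="${1:-$HOME/.ansible/cp}"
    found=0
    for sock in "$cpdir"/*; do
        [ -S "$sock" ] || continue
        found=1
        echo "mux socket: $sock"
        # A live master answers `ssh -O check`; a wedged one can be
        # torn down so the connection is re-established cleanly:
        #   ssh -O check -o ControlPath="$sock" placeholder-host
        #   ssh -O exit  -o ControlPath="$sock" placeholder-host
    done
    if [ "$found" -eq 0 ]; then
        echo "no control sockets found"
    fi
}

list_mux_sockets /tmp/no-such-control-dir   # -> no control sockets found
```

This fits the symptom above: the "Broken pipe" from mux_client_request_session means the client reached a master socket whose process was gone, after which Ansible reported the host unreachable and started a new mux.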
We've consulted with the ansible team and there are no known causes of the yum module hanging in current releases other than the lock files. I believe we need to strace the process on the target host to figure out why it's hanging. Can either of you do that to pinpoint which resource the module is waiting for? https://tannerjc.net/wiki/index.php?title=Ansible_Hangs_Filament provides some generic documentation regarding how to debug ansible hangs.

This is believed to be a deadlock in yum/rpm that's beyond our control. Additionally, there appear to be no active cases related to this bug. As such, we're closing this bug in order to focus on bugs that are still tied to active customer cases. Please re-open this bug if you feel it was closed in error or a new active case is attached.
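The debugging step asked for above can be sketched generically. This is not from the bug: the PID must be replaced with that of the hung `ansible_module_yum.py` process from the process tree, and `/proc/<pid>/wchan` is offered as a lighter-weight first look than a full strace.

```shell
#!/bin/sh
# Sketch: inspect what a hung process is blocked on.
# Locate the module process first (pattern matches the process tree
# shown earlier in this bug):
#   pgrep -f ansible_module_yum.py
pid=$$   # illustrative only; substitute the hung process's PID

# Scheduler state: S = sleeping, D = uninterruptible I/O wait,
# the latter often pointing at a stuck mount or device.
awk '/^State:/ {print "state: " $2}' "/proc/$pid/status"

# Kernel function the task is sleeping in ("0" if runnable).
echo "wchan: $(cat "/proc/$pid/wchan")"

# Full syscall trace, following forks (needs strace, usually root):
#   strace -f -tt -p "$pid"
```

A process parked in a futex or flock syscall would support the yum/rpm deadlock theory the bug was closed with.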