Bug 1635923
| Summary: | Deployment freezes on openshift_node : install needed rpm(s) |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Installer |
| Version: | 3.10.0 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED DEFERRED |
| Severity: | medium |
| Priority: | low |
| Reporter: | jmselmi <jmselmi> |
| Assignee: | Scott Dodson <sdodson> |
| QA Contact: | Johnny Liu <jialiu> |
| CC: | aos-bugs, jmselmi, jokerman, mmccomas, rgordill, wmeng |
| Target Release: | 3.10.z |
| Type: | Bug |
| Last Closed: | 2019-02-15 21:18:34 UTC |
Description (jmselmi, 2018-10-04 00:23:16 UTC)
Created attachment 1490427 [details]
ansible-openshift -vvv
The playbook deploy_cluster.yml keeps freezing; no error is generated. When I canceled and tried again, the rpms were already installed on some nodes, but the playbook still froze on the remaining ones.
```
$ rpm -q openshift-ansible
openshift-ansible-3.10.47-1.git.0.95bc2d2.el7_5.noarch
$ rpm -q ansible
ansible-2.6.4-1.el7.noarch
$ ansible --version
ansible 2.6.4
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/jmselmi/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May 3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]
```
This often happens when yum is broken on the target hosts. What happens if you run this on the host?

```
yum install ansible bash-completion dnsmasq ntp logrotate httpd-tools bind-utils firewalld libselinux-python conntrack-tools openssl iproute python-dbus PyYAML yum-utils glusterfs-fuse device-mapper-multipath nfs-utils cockpit-ws cockpit-system cockpit-bridge cockpit-docker iscsi-initiator-utils ceph-common atomic
```

It's not broken:

```
$ sudo yum install ansible bash-completion dnsmasq ntp logrotate httpd-tools bind-utils firewalld libselinux-python conntrack-tools openssl iproute python-dbus PyYAML yum-utils glusterfs-fuse device-mapper-multipath nfs-utils cockpit-ws cockpit-system cockpit-bridge cockpit-docker iscsi-initiator-utils ceph-common atomic
Loaded plugins: product-id, search-disabled-repos, subscription-manager
rhel-7-server-ansible-2.4-rpms                           | 4.0 kB  00:00:00
rhel-7-server-extras-rpms                                | 3.4 kB  00:00:00
rhel-7-server-ose-3.10-rpms                              | 4.0 kB  00:00:00
rhel-7-server-rpms                                       | 3.5 kB  00:00:00
Package ansible-2.4.6.0-1.el7ae.noarch already installed and latest version
Package 1:bash-completion-2.1-6.el7.noarch already installed and latest version
Package matching dnsmasq-2.76-5.el7.x86_64 already installed. Checking for update.
Package ntp-4.2.6p5-28.el7.x86_64 already installed and latest version
Package logrotate-3.8.6-15.el7.x86_64 already installed and latest version
Package matching httpd-tools-2.4.6-80.el7_5.1.x86_64 already installed. Checking for update.
Package matching 32:bind-utils-9.9.4-61.el7_5.1.x86_64 already installed. Checking for update.
Package libselinux-python-2.5-12.el7.x86_64 already installed and latest version
Package matching conntrack-tools-1.4.4-3.el7_3.x86_64 already installed. Checking for update.
Package 1:openssl-1.0.2k-12.el7.x86_64 already installed and latest version
Package iproute-4.11.0-14.el7.x86_64 already installed and latest version
Package dbus-python-1.1.1-9.el7.x86_64 already installed and latest version
Package PyYAML-3.10-11.el7.x86_64 already installed and latest version
Package yum-utils-1.1.31-46.el7_5.noarch already installed and latest version
Package matching glusterfs-fuse-3.8.4-54.15.el7.x86_64 already installed. Checking for update.
Package matching device-mapper-multipath-0.4.9-119.el7_5.1.x86_64 already installed. Checking for update.
Package matching 1:nfs-utils-1.3.0-0.54.el7.x86_64 already installed. Checking for update.
Package matching iscsi-initiator-utils-6.2.0.874-7.el7.x86_64 already installed. Checking for update.
Package matching 1:ceph-common-0.94.5-2.el7.x86_64 already installed. Checking for update.
Package 1:atomic-1.22.1-25.git5a342e3.el7.x86_64 already installed and latest version
```

From the playbook:
```yaml
---
- name: install needed rpm(s)
  package:
    name: "{{ r_openshift_node_image_prep_packages | join(',') }}"
    state: present
  register: result
  until: result is succeeded
  when: not (openshift_is_atomic | default(False) | bool)
```
All the rpms were installed on the nodes, but I believe the result is somehow not reported correctly. Any ideas? Here's an example:
```
ocp310-infra-0 | SUCCESS | rc=0 >>
Loaded plugins: product-id, search-disabled-repos, subscription-manager
Package ansible-2.4.6.0-1.el7ae.noarch already installed and latest version
Package 1:bash-completion-2.1-6.el7.noarch already installed and latest version
Package dnsmasq-2.76-5.el7.x86_64 already installed and latest version
Package ntp-4.2.6p5-28.el7.x86_64 already installed and latest version
Package logrotate-3.8.6-15.el7.x86_64 already installed and latest version
Package httpd-tools-2.4.6-80.el7_5.1.x86_64 already installed and latest version
Package 32:bind-utils-9.9.4-61.el7_5.1.x86_64 already installed and latest version
Package firewalld-0.4.4.4-15.el7_5.noarch already installed and latest version
Package libselinux-python-2.5-12.el7.x86_64 already installed and latest version
Package conntrack-tools-1.4.4-3.el7_3.x86_64 already installed and latest version
Package 1:openssl-1.0.2k-12.el7.x86_64 already installed and latest version
Package iproute-4.11.0-14.el7.x86_64 already installed and latest version
Package dbus-python-1.1.1-9.el7.x86_64 already installed and latest version
Package PyYAML-3.10-11.el7.x86_64 already installed and latest version
Package yum-utils-1.1.31-46.el7_5.noarch already installed and latest version
Package glusterfs-fuse-3.8.4-54.15.el7.x86_64 already installed and latest version
Package device-mapper-multipath-0.4.9-119.el7_5.1.x86_64 already installed and latest version
Package 1:nfs-utils-1.3.0-0.54.el7.x86_64 already installed and latest version
Package cockpit-ws-154-3.el7.x86_64 already installed and latest version
Package cockpit-system-154-3.el7.noarch already installed and latest version
Package cockpit-bridge-154-3.el7.x86_64 already installed and latest version
Package cockpit-docker-176-2.el7.x86_64 already installed and latest version
Package iscsi-initiator-utils-6.2.0.874-7.el7.x86_64 already installed and latest version
Package 1:ceph-common-0.94.5-2.el7.x86_64 already installed and latest version
Nothing to do
```
No ideas; does it reproduce on a fresh set of hosts?

All I have is a GCP platform. I rebuilt all the instances, tried again, and still got the same results. One workaround that I applied is to install all the rpms before launching the deploy_cluster playbook. Cheers.

The only other item that comes to mind is that, prior to running the installer, you should make sure that all installed packages are up to date. Can you make sure to run `yum upgrade` and reboot all hosts before running the installer?

Well, I used `yum update -y` instead before running the installer.

One potential cause for this is another process holding the yum lock. When the stall happens on a host, can you look at the contents of /var/run/yum.pid and investigate which process that is and what it's doing? https://github.com/ansible/ansible/issues/44120 is related, but only insofar as it adds a timeout for acquiring the lock; it would still fail.

```
$ file /var/run/yum.pid
/var/run/yum.pid: cannot open (No such file or directory)
```

I see the playbook running:

```
├─sshd -D
│ ├─sshd
│ │ └─sshd
│ │   └─sh -c sudo -H -S -n -u root /bin/sh -c 'echo BECOME-SUCCESS-bdpufjozgtbwxxukkyrnkhoqexikfsfd; /usr/bin/python' && sleep 0
│ │     └─sudo -H -S -n -u root /bin/sh -c echo BECOME-SUCCESS-bdpufjozgtbwxxukkyrnkhoqexikfsfd; /usr/bin/python
│ │       └─sh -c echo BECOME-SUCCESS-bdpufjozgtbwxxukkyrnkhoqexikfsfd; /usr/bin/python
│ │         └─python
│ │           └─python /tmp/ansible_VB9uyN/ansible_module_yum.py
```

Repos are fine.
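As a hedged aside (not part of the thread): the lock-file check suggested above can be scripted. The helper name `check_yum_lock` and the demonstration with a temporary file are illustrative; on a real host you would point it at the default /var/run/yum.pid.

```shell
#!/bin/sh
# Sketch: report which process, if any, holds the yum lock.
# The lock file path is parameterized so the logic can be exercised
# against any file; it defaults to yum's real lock file.
check_yum_lock() {
    lockfile="${1:-/var/run/yum.pid}"
    if [ ! -f "$lockfile" ]; then
        echo "no yum lock held"
        return 0
    fi
    pid=$(cat "$lockfile")
    if [ -d "/proc/$pid" ]; then
        # Show the command line of the lock holder (args are NUL-separated).
        echo "yum lock held by PID $pid: $(tr '\0' ' ' < "/proc/$pid/cmdline")"
    else
        echo "stale lock file: PID $pid is not running"
    fi
}

# Demonstration against a fake lock file holding a dead PID
# (99999999 exceeds the Linux pid_max limit, so it cannot be live).
tmplock=$(mktemp)
echo 99999999 > "$tmplock"
check_yum_lock "$tmplock"   # -> stale lock file: PID 99999999 is not running
rm -f "$tmplock"
```

A stale lock file would explain a hang on older yum versions that wait on it; here the reporter's `file /var/run/yum.pid` already shows no lock exists, which is what pushed the investigation elsewhere.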
```
$ yum repolist
Loaded plugins: product-id, search-disabled-repos, subscription-manager
repo id                                 repo name                                                                  status
google-cloud-compute                    Google Cloud Compute                                                           11
rhel-7-server-ansible-2.4-rpms/x86_64   Red Hat Ansible Engine 2.4 RPMs for Red Hat Enterprise Linux 7 Server          23
rhel-7-server-extras-rpms/x86_64        Red Hat Enterprise Linux 7 Server - Extras (RPMs)                             923
rhel-7-server-ose-3.10-rpms/x86_64      Red Hat OpenShift Container Platform 3.10 (RPMs)                              610
rhel-7-server-rpms/7Server/x86_64       Red Hat Enterprise Linux 7 Server (RPMs)                                   21,065
repolist: 22,632
```

The same happens with 3.11.16 in GCP.
I don't know if it is something related to the ansible arguments. On the bastion host, I see the ssh processes:

```
ssh -o ControlMaster=auto -o ControlPersist=600s -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=false -o ForwardAgent=yes -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=cloud-user -o ConnectTimeout=10 -o ControlPath=/home/cloud-user/.ansible/cp/%h-%r openshift-iberia-311-master-0 /bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-hpfawrdvmllxusjotnlmzmszjolxyzgs; /usr/bin/python'"'"' && sleep 0'
```
But when I execute the same command in parallel, I receive:

```
mux_client_request_session: read from master failed: Broken pipe
```
And then, in my ansible installation:

```
fatal: [openshift-iberia-311-master-0]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"openshift-iberia-311-master-0\". Make sure this host can be reached over ssh", "unreachable": true}
```
And the ansible mux communication is restarted:

```
cloud-u+ 13267     1  0 10:27 ?        00:00:00 ssh: /home/cloud-user/.ansible/cp/openshift-iberia-311-app-0-cloud-user [mux]
cloud-u+ 18743     1  0 12:03 ?        00:00:00 ssh: /home/cloud-user/.ansible/cp/openshift-iberia-311-master-0-cloud-user [mux]
```
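Not from the thread, but a quick triage step for a wedged ControlMaster like this is to enumerate the sockets under Ansible's ControlPath directory and probe each one. The helper below is an illustrative sketch (`list_mux_sockets` is not a real tool); the `ssh -O check` probe is left commented out because it needs a live remote host.

```shell
#!/bin/sh
# Sketch: list SSH ControlMaster sockets in Ansible's ControlPath dir.
# A dead master leaves a socket behind; removing it (or `ssh -O exit`)
# forces the next task to open a fresh master connection.
list_mux_sockets() {
    cpdir="${1:-$HOME/.ansible/cp}"
    found=0
    for sock in "$cpdir"/*; do
        [ -S "$sock" ] || continue
        found=1
        echo "mux socket: $sock"
        # A live master answers `ssh -O check`; a wedged one can be
        # torn down so the connection is re-established cleanly:
        #   ssh -O check -o ControlPath="$sock" placeholder-host
        #   ssh -O exit  -o ControlPath="$sock" placeholder-host
    done
    if [ "$found" -eq 0 ]; then
        echo "no control sockets found"
    fi
}

list_mux_sockets /tmp/no-such-control-dir   # -> no control sockets found
```

This fits the symptom above: the "Broken pipe" from mux_client_request_session means the client reached a master socket whose process was gone, after which Ansible reported the host unreachable and started a new mux.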
We've consulted with the ansible team and there are no known causes of the yum module hanging in current releases other than the lock files. I believe we need to strace the process on the target host to figure out why it's hanging. Can either of you do that to pinpoint which resource the module is waiting for? https://tannerjc.net/wiki/index.php?title=Ansible_Hangs_Filament provides some generic documentation regarding how to debug ansible hangs.

This is believed to be a deadlock in yum/rpm that's beyond our control. Additionally, there appear to be no active cases related to this bug. As such, we're closing this bug in order to focus on bugs that are still tied to active customer cases. Please re-open this bug if you feel it was closed in error or a new active case is attached.
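The debugging step asked for above can be sketched generically. This is not from the bug: the PID must be replaced with that of the hung `ansible_module_yum.py` process from the process tree, and `/proc/<pid>/wchan` is offered as a lighter-weight first look than a full strace.

```shell
#!/bin/sh
# Sketch: inspect what a hung process is blocked on.
# Locate the module process first (pattern matches the process tree
# shown earlier in this bug):
#   pgrep -f ansible_module_yum.py
pid=$$   # illustrative only; substitute the hung process's PID

# Scheduler state: S = sleeping, D = uninterruptible I/O wait,
# the latter often pointing at a stuck mount or device.
awk '/^State:/ {print "state: " $2}' "/proc/$pid/status"

# Kernel function the task is sleeping in ("0" if runnable).
echo "wchan: $(cat "/proc/$pid/wchan")"

# Full syscall trace, following forks (needs strace, usually root):
#   strace -f -tt -p "$pid"
```

A process parked in a futex or flock syscall would support the yum/rpm deadlock theory the bug was closed with.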