Bug 1386398

Summary: QCI RHV+CFME+OCP deployment stuck at 83.3%. RHV deployment hangs during ansible TASK [hypervisor : Install dependencies]
Product: Red Hat Quickstart Cloud Installer
Reporter: Landon LaSmith <llasmith>
Component: Installation - RHEV
Assignee: Fabian von Feilitzsch <fabian>
Status: CLOSED ERRATA
QA Contact: Tasos Papaioannou <tpapaioa>
Severity: high
Docs Contact: Dan Macpherson <dmacpher>
Priority: unspecified
Version: 1.1
CC: apagac, bthurber, jmatthew, qci-bugzillas, tpapaioa
Target Milestone: ---
Keywords: Automation, Triaged
Target Release: 1.1
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-28 01:40:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Landon LaSmith 2016-10-18 20:19:43 UTC
Description of problem: During a QCI deployment of RHV+CFME+OCP, the deployment hangs for 8+ hours at 83.3%, during the RHV portion of the deployment. The ansible.log shows that it is running TASK [hypervisor : Install dependencies], but there is no error or failure that ends the deployment. All hypervisors are accessible via ssh from fusor, and there are no disk space issues.

QCI Media Version: QCI-1.1-RHEL-7-20161013.t.0

How reproducible: First occurrence


Steps to Reproduce:
1. Install QCI
2. Start RHV CFME OCP deployment
3. Wait for deployment to hang at 83.3% of RHV deployment

Actual results: QCI shows the deployment as still running 8+ hours later


Expected results: QCI or Ansible would time out after an acceptable amount of time

Comment 3 Tasos Papaioannou 2016-10-26 19:37:34 UTC
FYI, downgrading tfm-rubygem-fusor_server-1.1.7-1.el7sat to tfm-rubygem-fusor_server-1.1.5-1.el7sat (then restarting services) resolved the issue for me on QCI-1.1-RHEL-7-20161025.t.0. Before the downgrade, every RHV self-hosted deployment would hang indefinitely, with no response from the ansible task that yum-installs genisoimage/rhevm-appliance/glusterfs-fuse/ovirt-hosted-engine-setup, even though the installation completes successfully on the host.

Comment 4 Tasos Papaioannou 2016-10-27 01:02:02 UTC
The problem appears to be the ansible-playbook process on the Satellite hanging when too much output is written to its stdout. On my self-hosted RHV deployment that was hanging at 66.7%, I see:

1.) The client is installing the packages via yum:

# ps -eF|grep yum
root     10789     1  0 37203  3980   0 21:35 ?        00:00:00 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1477517716.08-166185076484205/async_wrapper 306955908063 7200 /root/.ansible/tmp/ansible-tmp-1477517716.08-166185076484205/yum
root     10790 10789  0 37203  4304   2 21:35 ?        00:00:00 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1477517716.08-166185076484205/async_wrapper 306955908063 7200 /root/.ansible/tmp/ansible-tmp-1477517716.08-166185076484205/yum
root     10791 10790  0 47293  8072   1 21:35 ?        00:00:00 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1477517716.08-166185076484205/yum
root     10792 10791  0 112614 33836  3 21:35 ?        00:00:01 /usr/bin/python /tmp/ansible_6frsOZ/ansible_module_yum.py
root     10795 10792 17 107686 102200 1 21:35 ?        00:01:08 /usr/bin/python /usr/bin/yum -d 2 -y install genisoimage rhevm-appliance glusterfs-fuse ovirt-hosted-engine-setup
root     13274 10855  0 28163   960   2 21:41 pts/0    00:00:00 grep --color=auto yum

2.) After the installation is complete, the yum and ansible-related processes have exited:

# grep -e genisoimage -e rhevm-appliance -e glusterfs-fuse -e ovirt-hosted-engine-setup /var/log/yum.log 
Oct 26 21:37:36 Installed: glusterfs-fuse-3.7.9-10.el7.x86_64
Oct 26 21:40:48 Installed: genisoimage-1.1.11-23.el7.x86_64
Oct 26 21:41:17 Installed: ovirt-hosted-engine-setup-2.0.2.2-2.el7ev.noarch
Oct 26 21:42:04 Installed: rhevm-appliance-20160922.0-1.el7ev.noarch

# ps -eF|grep yum
root     13332 10855  0 28162   960   3 21:43 pts/0    00:00:00 grep --color=auto yum

3.) On the Satellite server, nothing further gets logged to /var/log/foreman/deployments/<deployment>/ansible.log, and no more tasks from /usr/share/ansible-ovirt/self_hosted.yml are getting executed on the client. The ansible-playbook processes are still running on the Satellite, though:

# ps -eF|grep ansible
foreman  19741 10517  0 68071 31932   2 17:34 ?        00:00:04 /usr/bin/python2 /usr/bin/ansible-playbook /usr/share/ansible-ovirt/self_hosted.yml -i /usr/share/foreman/tmp/ansible-ovirt/tpapaioa_11/inventory -e {"admin_password":"changeme","cluster_name":"Default","dc_name":"Default","cpu_model":"model_Nehalem","cpu_type":"Intel Nehalem Family","hosted_storage_address":"192.168.100.254","hosted_storage_name":"my_hosted_storage","hosted_storage_path":"/home/export/self_hosted_domain","data_storage_address":"192.168.100.254","data_storage_name":"my_storage","data_storage_path":"/home/export/data_domain","create_export_domain":false,"export_storage_address":null,"export_storage_name":"my_export","export_storage_path":null,"engine_activation_key":"RHV_Engine-tpapaioa_11-RHV_Engine","engine_db_password":"changeme","engine_fqdn":"tpapaioa-11-rhv-engine.example.com","engine_mac_address":"22:29:4c:d5:f9:e9","gateway":"192.168.100.254","mac_address_range":"00:1A:07:00:00:00,00:1A:07:FF:FF:FF","mac_pool_name":"qci","root_password":"changeme","satellite_fqdn":"fusor.example.com","config_dir":"/etc/qci/","storageDatacenterName":"hosted_storage","storage_type":"nfs","engine_repositories":["rhel-7-server-beta-rpms","rhel-7-server-satellite-tools-6.2-rpms","rhel-7-server-supplementary-beta-rpms","rhel-7-server-rhv-4.0-rpms","jb-eap-7-for-rhel-7-server-rpms","rhel-7-server-optional-beta-rpms"]} -vvvv
foreman  19752 19741  9 83331 23696   3 17:34 ?        00:00:49 /usr/bin/python2 /usr/bin/ansible-playbook /usr/share/ansible-ovirt/self_hosted.yml -i /usr/share/foreman/tmp/ansible-ovirt/tpapaioa_11/inventory -e {"admin_password":"changeme","cluster_name":"Default","dc_name":"Default","cpu_model":"model_Nehalem","cpu_type":"Intel Nehalem Family","hosted_storage_address":"192.168.100.254","hosted_storage_name":"my_hosted_storage","hosted_storage_path":"/home/export/self_hosted_domain","data_storage_address":"192.168.100.254","data_storage_name":"my_storage","data_storage_path":"/home/export/data_domain","create_export_domain":false,"export_storage_address":null,"export_storage_name":"my_export","export_storage_path":null,"engine_activation_key":"RHV_Engine-tpapaioa_11-RHV_Engine","engine_db_password":"changeme","engine_fqdn":"tpapaioa-11-rhv-engine.example.com","engine_mac_address":"22:29:4c:d5:f9:e9","gateway":"192.168.100.254","mac_address_range":"00:1A:07:00:00:00,00:1A:07:FF:FF:FF","mac_pool_name":"qci","root_password":"changeme","satellite_fqdn":"fusor.example.com","config_dir":"/etc/qci/","storageDatacenterName":"hosted_storage","storage_type":"nfs","engine_repositories":["rhel-7-server-beta-rpms","rhel-7-server-satellite-tools-6.2-rpms","rhel-7-server-supplementary-beta-rpms","rhel-7-server-rhv-4.0-rpms","jb-eap-7-for-rhel-7-server-rpms","rhel-7-server-optional-beta-rpms"]} -vvvv
root     20750 20015  0 28162   960   0 17:43 pts/2    00:00:00 grep --color=auto ansible

4.) An strace of the processes shows that one of them is hanging while trying to write to stdout:

# strace -p19741
Process 19741 attached
write(1, "a-libGL.x86_64 0:11.2.2-2.201606"..., 106496^CProcess 19741 detached

5.) If I cat this file descriptor, I see a huge amount of yum-related output: dependency resolution messages, the list of packages to install, package download output, then installation output...

# cat /proc/19741/fd/1
Loaded plugins: package_upload, product-id, search-disabled-repos, subscription-\n              : manager\nResolving Dependencies\n--> Running transaction check\n---> Package genisoimage.x86_64 0:1.1.11-23.el7 will be installed\n--> Processing Dependency: [...] yajl.x86_64 0:2.0.4-4.el7   
                                                  \n  yum-utils.noarch 0:1.1.31-40.el7                                              \n\nComplete!\n

6.) Reading the file descriptor contents apparently provides the flush it needs: the huge amount of yum output immediately gets logged to ansible.log, and the next step in the playbook finally starts.

7.) I have to do this one more time during the hosted-engine task, to flush the huge amount of output from that command to the log. After doing that, the deployment completes successfully.

These steps were done on QCI-1.1-RHEL-7-20161025.t.0, without the package downgrade mentioned in my previous comment.
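
As an aside, the hang can be reduced to a few lines of Ruby (illustrative only, not fusor code; ~64 KiB is the default Linux pipe buffer size): if the parent spawns a child with Open3.popen2e and waits for it to exit before reading any output, the child blocks in write(2) as soon as its output exceeds the pipe buffer, which is exactly the pattern the strace above shows.

require 'open3'

# Hypothetical reduction of the hang: the child tries to print ~1 MiB and exit.
stdin, out_and_err, wait_thr =
  Open3.popen2e('ruby', '-e', 'print "x" * (1024 * 1024)')
stdin.close
status = wait_thr.value    # blocks forever: nothing drains the pipe, so the
                           # child stalls in write(2) once the buffer fills
output = out_and_err.read  # never reached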

Comment 5 Fabian von Feilitzsch 2016-10-28 01:00:10 UTC
Tasos, this is great information, thanks so much for the debugging. I was investigating further yesterday/today, and I think this might be an issue with the way we are running the command from Ruby: it tries to read the whole chunk from standard out at once. I updated it to a method that breaks the output into sane chunks, which seemed to work for me. I'm going to test it a little more, and if it works I'll try to get that into a compose tomorrow.
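
A minimal sketch of the chunked-read approach described above (illustrative only, not the actual fusor change; the method name and log object are made up): keep draining the child's combined stdout/stderr in bounded reads while it runs, so the pipe never fills and the child never blocks.

require 'open3'

# Illustrative sketch: stream output to a log as it is produced instead of
# reading it all at once after the child exits.
def run_and_log(cmd, log)
  Open3.popen2e(*cmd) do |stdin, out_and_err, wait_thr|
    stdin.close
    until out_and_err.eof?
      log.write(out_and_err.readpartial(64 * 1024))  # bounded chunks
    end
    wait_thr.value  # safe to wait now; the pipe has already been drained
  end
end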

Comment 6 Landon LaSmith 2016-10-28 17:48:01 UTC
I was able to successfully complete an RHV Self-hosted+CFME+OCP deployment by repeating Tasos' steps above. I waited for the ansible processes on the Satellite to get stuck writing to stdout (file descriptor 1); cat'ing the file descriptor allowed the ansible process to continue. Each time the RHV install got stuck (4-5 times total), I was able to repeat the steps and continue the RHV deployment.

OCP and CFME installed successfully w/o any manual intervention.

ISO Media Version: QCI-1.1-RHEL-7-20161026.t.0

Comment 7 Fabian von Feilitzsch 2016-10-28 20:01:23 UTC
https://github.com/fusor/fusor/pull/1276

Caused by large stdout writes blocking (nothing was draining the pipe), which hangs the ansible process. Fixed by using Open3.capture2e instead of Open3.popen2e, which handles reading the stdout in chunks so that we don't have to.
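
A rough before/after of the kind of change described (sketch only, not the actual diff; cmd stands in for the ansible-playbook invocation):

require 'open3'

# Before: nothing reads the pipe until the child exits, so a child producing
# more output than the pipe buffer blocks forever in write(2).
stdin, out_and_err, wait_thr = Open3.popen2e(*cmd)
stdin.close
status = wait_thr.value
output = out_and_err.read

# After: Open3.capture2e reads the combined stdout/stderr while waiting for
# the child, so the pipe is drained continuously and the child cannot block.
output, status = Open3.capture2e(*cmd)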

Thanks again everybody for the extra debugging information, very helpful.

Comment 8 John Matthews 2016-11-01 14:06:21 UTC
Will be fixed in compose 11/1

Comment 9 Tasos Papaioannou 2016-11-01 19:09:10 UTC
Verified on QCI-1.1-RHEL-7-20161101.t.0.

Comment 10 Fabian von Feilitzsch 2016-11-16 14:45:15 UTC
*** Bug 1388929 has been marked as a duplicate of this bug. ***

Comment 13 errata-xmlrpc 2017-02-28 01:40:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:0335