Bug 1386398 - QCI RHV+CFME+OCP deployment stuck 83.3%. RHV deployment hangs during ansible TASK [hypervisor : Install dependencies]
Summary: QCI RHV+CFME+OCP deployment stuck 83.3%. RHV deployment hangs during ansible ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Quickstart Cloud Installer
Classification: Red Hat
Component: Installation - RHEV
Version: 1.1
Hardware: All
OS: All
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 1.1
Assignee: Fabian von Feilitzsch
QA Contact: Tasos Papaioannou
Docs Contact: Dan Macpherson
URL:
Whiteboard:
Duplicates: 1388929
Depends On:
Blocks:
 
Reported: 2016-10-18 20:19 UTC by Landon LaSmith
Modified: 2017-02-28 01:40 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-28 01:40:22 UTC
Target Upstream Version:




Links:
Red Hat Product Errata RHEA-2017:0335 (normal, SHIPPED_LIVE): Red Hat Quickstart Installer 1.1, last updated 2017-02-28 06:36:13 UTC

Description Landon LaSmith 2016-10-18 20:19:43 UTC
Description of problem: During a QCI deployment of RHV+CFME+OCP, the deployment hangs at 83.3% of the RHV deployment for 8+ hours. The ansible.log shows that it is still running TASK [hypervisor : Install dependencies], but no error or failure is reported that would end the deployment. All hypervisors are reachable over ssh from fusor and have no disk space issues.

QCI Media Version: QCI-1.1-RHEL-7-20161013.t.0

How reproducible: First occurrence


Steps to Reproduce:
1. Install QCI
2. Start RHV CFME OCP deployment
3. Wait for deployment to hang at 83.3% of RHV deployment

Actual results: QCI shows the deployment as still running 8+ hours later


Expected results: QCI or Ansible would time out after an acceptable amount of time

Comment 3 Tasos Papaioannou 2016-10-26 19:37:34 UTC
FYI, downgrading tfm-rubygem-fusor_server-1.1.7-1.el7sat to tfm-rubygem-fusor_server-1.1.5-1.el7sat (then restarting services) resolved the issue for me on QCI-1.1-RHEL-7-20161025.t.0. Before the downgrade, every RHV self-hosted deployment would hang indefinitely, never getting a response from the ansible task that yum-installs genisoimage/rhevm-appliance/glusterfs-fuse/ovirt-hosted-engine-setup, even though the installation completed successfully on the host.

Comment 4 Tasos Papaioannou 2016-10-27 01:02:02 UTC
The problem appears to be with the ansible-playbook process on the Satellite hanging when stdout gets too large. On my self-hosted RHV deployment that was hanging at 66.7%, I see:

1.) The client is installing the packages via yum:

# ps -eF|grep yum
root     10789     1  0 37203  3980   0 21:35 ?        00:00:00 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1477517716.08-166185076484205/async_wrapper 306955908063 7200 /root/.ansible/tmp/ansible-tmp-1477517716.08-166185076484205/yum
root     10790 10789  0 37203  4304   2 21:35 ?        00:00:00 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1477517716.08-166185076484205/async_wrapper 306955908063 7200 /root/.ansible/tmp/ansible-tmp-1477517716.08-166185076484205/yum
root     10791 10790  0 47293  8072   1 21:35 ?        00:00:00 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1477517716.08-166185076484205/yum
root     10792 10791  0 112614 33836  3 21:35 ?        00:00:01 /usr/bin/python /tmp/ansible_6frsOZ/ansible_module_yum.py
root     10795 10792 17 107686 102200 1 21:35 ?        00:01:08 /usr/bin/python /usr/bin/yum -d 2 -y install genisoimage rhevm-appliance glusterfs-fuse ovirt-hosted-engine-setup
root     13274 10855  0 28163   960   2 21:41 pts/0    00:00:00 grep --color=auto yum

2.) After the installation is complete, the yum and ansible-related processes are now done:

# grep -e genisoimage -e rhevm-appliance -e glusterfs-fuse -e ovirt-hosted-engine-setup /var/log/yum.log 
Oct 26 21:37:36 Installed: glusterfs-fuse-3.7.9-10.el7.x86_64
Oct 26 21:40:48 Installed: genisoimage-1.1.11-23.el7.x86_64
Oct 26 21:41:17 Installed: ovirt-hosted-engine-setup-2.0.2.2-2.el7ev.noarch
Oct 26 21:42:04 Installed: rhevm-appliance-20160922.0-1.el7ev.noarch

# ps -eF|grep yum
root     13332 10855  0 28162   960   3 21:43 pts/0    00:00:00 grep --color=auto yum

3.) On the Satellite server, nothing further gets logged to /var/log/foreman/deployments/<deployment>/ansible.log, and no more tasks from /usr/share/ansible-ovirt/self_hosted.yml are getting executed on the client. The ansible-playbook processes are still running on the Satellite, though:

# ps -eF|grep ansible
foreman  19741 10517  0 68071 31932   2 17:34 ?        00:00:04 /usr/bin/python2 /usr/bin/ansible-playbook /usr/share/ansible-ovirt/self_hosted.yml -i /usr/share/foreman/tmp/ansible-ovirt/tpapaioa_11/inventory -e {"admin_password":"changeme","cluster_name":"Default","dc_name":"Default","cpu_model":"model_Nehalem","cpu_type":"Intel Nehalem Family","hosted_storage_address":"192.168.100.254","hosted_storage_name":"my_hosted_storage","hosted_storage_path":"/home/export/self_hosted_domain","data_storage_address":"192.168.100.254","data_storage_name":"my_storage","data_storage_path":"/home/export/data_domain","create_export_domain":false,"export_storage_address":null,"export_storage_name":"my_export","export_storage_path":null,"engine_activation_key":"RHV_Engine-tpapaioa_11-RHV_Engine","engine_db_password":"changeme","engine_fqdn":"tpapaioa-11-rhv-engine.example.com","engine_mac_address":"22:29:4c:d5:f9:e9","gateway":"192.168.100.254","mac_address_range":"00:1A:07:00:00:00,00:1A:07:FF:FF:FF","mac_pool_name":"qci","root_password":"changeme","satellite_fqdn":"fusor.example.com","config_dir":"/etc/qci/","storageDatacenterName":"hosted_storage","storage_type":"nfs","engine_repositories":["rhel-7-server-beta-rpms","rhel-7-server-satellite-tools-6.2-rpms","rhel-7-server-supplementary-beta-rpms","rhel-7-server-rhv-4.0-rpms","jb-eap-7-for-rhel-7-server-rpms","rhel-7-server-optional-beta-rpms"]} -vvvv
foreman  19752 19741  9 83331 23696   3 17:34 ?        00:00:49 /usr/bin/python2 /usr/bin/ansible-playbook /usr/share/ansible-ovirt/self_hosted.yml -i /usr/share/foreman/tmp/ansible-ovirt/tpapaioa_11/inventory -e {"admin_password":"changeme","cluster_name":"Default","dc_name":"Default","cpu_model":"model_Nehalem","cpu_type":"Intel Nehalem Family","hosted_storage_address":"192.168.100.254","hosted_storage_name":"my_hosted_storage","hosted_storage_path":"/home/export/self_hosted_domain","data_storage_address":"192.168.100.254","data_storage_name":"my_storage","data_storage_path":"/home/export/data_domain","create_export_domain":false,"export_storage_address":null,"export_storage_name":"my_export","export_storage_path":null,"engine_activation_key":"RHV_Engine-tpapaioa_11-RHV_Engine","engine_db_password":"changeme","engine_fqdn":"tpapaioa-11-rhv-engine.example.com","engine_mac_address":"22:29:4c:d5:f9:e9","gateway":"192.168.100.254","mac_address_range":"00:1A:07:00:00:00,00:1A:07:FF:FF:FF","mac_pool_name":"qci","root_password":"changeme","satellite_fqdn":"fusor.example.com","config_dir":"/etc/qci/","storageDatacenterName":"hosted_storage","storage_type":"nfs","engine_repositories":["rhel-7-server-beta-rpms","rhel-7-server-satellite-tools-6.2-rpms","rhel-7-server-supplementary-beta-rpms","rhel-7-server-rhv-4.0-rpms","jb-eap-7-for-rhel-7-server-rpms","rhel-7-server-optional-beta-rpms"]} -vvvv
root     20750 20015  0 28162   960   0 17:43 pts/2    00:00:00 grep --color=auto ansible

4.) An strace of the processes shows that one of them is hanging while trying to write to stdout:

# strace -p19741
Process 19741 attached
write(1, "a-libGL.x86_64 0:11.2.2-2.201606"..., 106496^CProcess 19741 detached

5.) If I cat this file descriptor, I see a huge amount of yum-related output: dependency resolution messages, the list of packages to install, package download output, then installation output:

# cat /proc/19741/fd/1
Loaded plugins: package_upload, product-id, search-disabled-repos, subscription-\n              : manager\nResolving Dependencies\n--> Running transaction check\n---> Package genisoimage.x86_64 0:1.1.11-23.el7 will be installed\n--> Processing Dependency: [...] yajl.x86_64 0:2.0.4-4.el7   
                                                  \n  yum-utils.noarch 0:1.1.31-40.el7                                              \n\nComplete!\n

6.) Reading the file descriptor contents is apparently the flush it needs: the huge amount of yum output immediately gets logged to ansible.log, and the next step in the playbook finally starts.

7.) I have to do this one more time during the hosted-engine task, to flush the huge amount of output from that command to the log. After doing that, the deployment completes successfully.

These steps were done on QCI-1.1-RHEL-7-20161025.t.0, without the package downgrade mentioned in my previous comment.
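
For reference, here is a minimal Ruby sketch of the deadlock described above (purely illustrative, not the fusor code; it assumes the default ~64 KB Linux pipe buffer): the parent spawns the command over a pipe and waits for it without ever reading from the pipe, so the child blocks on write(2) as soon as its output exceeds the buffer.

require 'open3'

# Hypothetical reproduction: spawn a command that produces more output than
# the pipe buffer can hold, then wait for it without draining the pipe.
stdin, out_err, wait_thr = Open3.popen2e('seq 1 200000')
stdin.close
# Hangs here: the child has filled the pipe and is blocked in write(2),
# so it never exits and wait_thr.value never returns.
status = wait_thr.value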

Comment 5 Fabian von Feilitzsch 2016-10-28 01:00:10 UTC
Tasos, this is great information, thanks so much for the debugging. I was investigating further yesterday/today, and I think this is an issue with the way we run the command from Ruby: it tries to read the whole output from standard out at once. I updated it to a method that breaks the output into sane chunks, which seemed to work for me. I'm going to test it a little more, and if it works I'll try to get it into a compose tomorrow.
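
A minimal sketch of what reading the output in sane chunks could look like (hypothetical, not the actual fusor change): the combined stdout/stderr pipe is drained continuously while the playbook runs, so ansible-playbook can never fill the pipe and block on write.

require 'open3'

# Hypothetical helper: stream a command's combined stdout/stderr to a log
# in 4 KB chunks instead of reading everything in one go at the end.
def run_and_log(cmd, log_io)
  Open3.popen2e(cmd) do |stdin, out_err, wait_thr|
    stdin.close
    while (chunk = out_err.read(4096))
      log_io.write(chunk)
    end
    wait_thr.value   # Process::Status once the command has exited
  end
end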

Comment 6 Landon LaSmith 2016-10-28 17:48:01 UTC
I was able to successfully complete an RHV Self-hosted+CFME+OCP deployment by repeating Tasos' steps above. I waited for the ansible processes on the Satellite to get stuck writing to stdout (file descriptor 1); cat'ing the file descriptor allowed the ansible process to continue. Each time the RHV install got stuck (4-5 times total), I repeated the steps and the RHV deployment continued.

OCP and CFME installed successfully w/o any manual intervention.

ISO Media Version: QCI-1.1-RHEL-7-20161026.t.0

Comment 7 Fabian von Feilitzsch 2016-10-28 20:01:23 UTC
https://github.com/fusor/fusor/pull/1276

Caused by large writes to stdout blocking, which hung the ansible process. Fixed by using Open3.capture2e instead of Open3.popen2e; capture2e handles reading stdout in chunks so that we don't have to.
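
Illustratively (hypothetical names, the real diff is in the PR above), the shape of the change: Open3.capture2e spawns the command, drains its combined stdout/stderr internally until the process exits, and returns the whole output as a string, so there is no undrained pipe for ansible-playbook to block on.

require 'open3'

# Hypothetical wrapper around the playbook run: capture2e reads stdout+stderr
# for us and returns the output together with the exit status.
def run_playbook(playbook, inventory, extra_vars_json, log_io)
  output, status = Open3.capture2e(
    'ansible-playbook', playbook, '-i', inventory, '-e', extra_vars_json, '-vvvv'
  )
  log_io.write(output)
  status.success?
end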

Thanks again everybody for the extra debugging information, very helpful.

Comment 8 John Matthews 2016-11-01 14:06:21 UTC
Will be fixed in compose 11/1

Comment 9 Tasos Papaioannou 2016-11-01 19:09:10 UTC
Verified on QCI-1.1-RHEL-7-20161101.t.0.

Comment 10 Fabian von Feilitzsch 2016-11-16 14:45:15 UTC
*** Bug 1388929 has been marked as a duplicate of this bug. ***

Comment 13 errata-xmlrpc 2017-02-28 01:40:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:0335

