| Summary: | QCI RHV+CFME+OCP deployment stuck at 83.3%. RHV deployment hangs during ansible TASK [hypervisor : Install dependencies] | | |
|---|---|---|---|
| Product: | Red Hat Quickstart Cloud Installer | Reporter: | Landon LaSmith <llasmith> |
| Component: | Installation - RHEV | Assignee: | Fabian von Feilitzsch <fabian> |
| Status: | CLOSED ERRATA | QA Contact: | Tasos Papaioannou <tpapaioa> |
| Severity: | high | Docs Contact: | Dan Macpherson <dmacpher> |
| Priority: | unspecified | | |
| Version: | 1.1 | CC: | apagac, bthurber, jmatthew, qci-bugzillas, tpapaioa |
| Target Milestone: | --- | Keywords: | Automation, Triaged |
| Target Release: | 1.1 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-02-28 01:40:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Landon LaSmith
2016-10-18 20:19:43 UTC
FYI, downgrading tfm-rubygem-fusor_server-1.1.7-1.el7sat to tfm-rubygem-fusor_server-1.1.5-1.el7sat (then restarting services) resolved the issue for me on QCI-1.1-RHEL-7-20161025.t.0. Before the downgrade, every self-hosted RHV deployment would hang indefinitely on the ansible task that yum-installs genisoimage/rhevm-appliance/glusterfs-fuse/ovirt-hosted-engine-setup, even though the installation completes successfully on the host. The problem appears to be the ansible-playbook process on the Satellite hanging when its stdout grows too large. On my self-hosted RHV deployment that was hanging at 66.7%, I see:
1.) The client is installing the packages via yum:
# ps -eF|grep yum
root 10789 1 0 37203 3980 0 21:35 ? 00:00:00 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1477517716.08-166185076484205/async_wrapper 306955908063 7200 /root/.ansible/tmp/ansible-tmp-1477517716.08-166185076484205/yum
root 10790 10789 0 37203 4304 2 21:35 ? 00:00:00 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1477517716.08-166185076484205/async_wrapper 306955908063 7200 /root/.ansible/tmp/ansible-tmp-1477517716.08-166185076484205/yum
root 10791 10790 0 47293 8072 1 21:35 ? 00:00:00 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1477517716.08-166185076484205/yum
root 10792 10791 0 112614 33836 3 21:35 ? 00:00:01 /usr/bin/python /tmp/ansible_6frsOZ/ansible_module_yum.py
root 10795 10792 17 107686 102200 1 21:35 ? 00:01:08 /usr/bin/python /usr/bin/yum -d 2 -y install genisoimage rhevm-appliance glusterfs-fuse ovirt-hosted-engine-setup
root 13274 10855 0 28163 960 2 21:41 pts/0 00:00:00 grep --color=auto yum
2.) After the installation completes, the yum and ansible-related processes on the client have exited:
# grep -e genisoimage -e rhevm-appliance -e glusterfs-fuse -e ovirt-hosted-engine-setup /var/log/yum.log
Oct 26 21:37:36 Installed: glusterfs-fuse-3.7.9-10.el7.x86_64
Oct 26 21:40:48 Installed: genisoimage-1.1.11-23.el7.x86_64
Oct 26 21:41:17 Installed: ovirt-hosted-engine-setup-2.0.2.2-2.el7ev.noarch
Oct 26 21:42:04 Installed: rhevm-appliance-20160922.0-1.el7ev.noarch
# ps -eF|grep yum
root 13332 10855 0 28162 960 3 21:43 pts/0 00:00:00 grep --color=auto yum
3.) On the Satellite server, nothing further gets logged to /var/log/foreman/deployments/<deployment>/ansible.log, and no more tasks from /usr/share/ansible-ovirt/self_hosted.yml are getting executed on the client. The ansible-playbook processes are still running on the Satellite, though:
# ps -eF|grep ansible
foreman 19741 10517 0 68071 31932 2 17:34 ? 00:00:04 /usr/bin/python2 /usr/bin/ansible-playbook /usr/share/ansible-ovirt/self_hosted.yml -i /usr/share/foreman/tmp/ansible-ovirt/tpapaioa_11/inventory -e {"admin_password":"changeme","cluster_name":"Default","dc_name":"Default","cpu_model":"model_Nehalem","cpu_type":"Intel Nehalem Family","hosted_storage_address":"192.168.100.254","hosted_storage_name":"my_hosted_storage","hosted_storage_path":"/home/export/self_hosted_domain","data_storage_address":"192.168.100.254","data_storage_name":"my_storage","data_storage_path":"/home/export/data_domain","create_export_domain":false,"export_storage_address":null,"export_storage_name":"my_export","export_storage_path":null,"engine_activation_key":"RHV_Engine-tpapaioa_11-RHV_Engine","engine_db_password":"changeme","engine_fqdn":"tpapaioa-11-rhv-engine.example.com","engine_mac_address":"22:29:4c:d5:f9:e9","gateway":"192.168.100.254","mac_address_range":"00:1A:07:00:00:00,00:1A:07:FF:FF:FF","mac_pool_name":"qci","root_password":"changeme","satellite_fqdn":"fusor.example.com","config_dir":"/etc/qci/","storageDatacenterName":"hosted_storage","storage_type":"nfs","engine_repositories":["rhel-7-server-beta-rpms","rhel-7-server-satellite-tools-6.2-rpms","rhel-7-server-supplementary-beta-rpms","rhel-7-server-rhv-4.0-rpms","jb-eap-7-for-rhel-7-server-rpms","rhel-7-server-optional-beta-rpms"]} -vvvv
foreman 19752 19741 9 83331 23696 3 17:34 ? 00:00:49 /usr/bin/python2 /usr/bin/ansible-playbook /usr/share/ansible-ovirt/self_hosted.yml -i /usr/share/foreman/tmp/ansible-ovirt/tpapaioa_11/inventory -e {"admin_password":"changeme","cluster_name":"Default","dc_name":"Default","cpu_model":"model_Nehalem","cpu_type":"Intel Nehalem Family","hosted_storage_address":"192.168.100.254","hosted_storage_name":"my_hosted_storage","hosted_storage_path":"/home/export/self_hosted_domain","data_storage_address":"192.168.100.254","data_storage_name":"my_storage","data_storage_path":"/home/export/data_domain","create_export_domain":false,"export_storage_address":null,"export_storage_name":"my_export","export_storage_path":null,"engine_activation_key":"RHV_Engine-tpapaioa_11-RHV_Engine","engine_db_password":"changeme","engine_fqdn":"tpapaioa-11-rhv-engine.example.com","engine_mac_address":"22:29:4c:d5:f9:e9","gateway":"192.168.100.254","mac_address_range":"00:1A:07:00:00:00,00:1A:07:FF:FF:FF","mac_pool_name":"qci","root_password":"changeme","satellite_fqdn":"fusor.example.com","config_dir":"/etc/qci/","storageDatacenterName":"hosted_storage","storage_type":"nfs","engine_repositories":["rhel-7-server-beta-rpms","rhel-7-server-satellite-tools-6.2-rpms","rhel-7-server-supplementary-beta-rpms","rhel-7-server-rhv-4.0-rpms","jb-eap-7-for-rhel-7-server-rpms","rhel-7-server-optional-beta-rpms"]} -vvvv
root 20750 20015 0 28162 960 0 17:43 pts/2 00:00:00 grep --color=auto ansible
4.) An strace of the processes shows that one of them is hanging while trying to write to stdout (a minimal sketch of this deadlock follows these steps):
# strace -p19741
Process 19741 attached
write(1, "a-libGL.x86_64 0:11.2.2-2.201606"..., 106496^CProcess 19741 detached
5.) If I cat this file descriptor, I see a huge amount of yum-related output: dependency resolution messages, the list of packages to install, package download output, then installation output:
# cat /proc/19741/fd/1
Loaded plugins: package_upload, product-id, search-disabled-repos, subscription-\n : manager\nResolving Dependencies\n--> Running transaction check\n---> Package genisoimage.x86_64 0:1.1.11-23.el7 will be installed\n--> Processing Dependency: [...] yajl.x86_64 0:2.0.4-4.el7
\n yum-utils.noarch 0:1.1.31-40.el7 \n\nComplete!\n
6.) Reading the file descriptor is apparently the flush it needs: the huge backlog of yum output immediately gets logged to ansible.log, and the next task in the playbook finally starts.
7.) I have to do this once more during the hosted-engine task, to flush that command's large output to the log. After that, the deployment completes successfully.
These steps were done on QCI-1.1-RHEL-7-20161025.t.0, without the package downgrade mentioned in my previous comment.
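The hang in step 4 is the classic pipe-buffer deadlock: a pipe holds only about 64 KB on Linux, and once it fills, the writer blocks in write(2) until a reader drains it. A minimal Ruby sketch of this failure mode, assuming a long-output command (`seq` here is an illustrative stand-in, not the actual fusor/ansible-playbook invocation):

```ruby
require 'open3'

# Deadlock sketch: the child emits more output than the pipe can buffer,
# and the parent waits for it to exit before reading anything. The child
# blocks in write(2) on a full pipe; the parent blocks in wait.
Open3.popen2e('seq 1 100000') do |stdin, out, wait_thr|
  stdin.close
  wait_thr.value # hangs: nobody is draining the pipe
  puts out.read
end
```

cat'ing /proc/<pid>/fd/1 in steps 5-7 drains the pipe by hand, which is why it unblocks the writer. The programmatic equivalent is a reader that runs concurrently with the wait:

```ruby
require 'open3'

# The same call made safe: a dedicated thread drains the pipe while the
# parent waits, so the child can never stall on a full buffer.
Open3.popen2e('seq 1 100000') do |stdin, out, wait_thr|
  stdin.close
  reader = Thread.new { out.read } # drains continuously
  wait_thr.value                   # child now runs to completion
  puts reader.value.bytesize
end
```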
Tasos, this is great information, thanks so much for the debugging. I was investigating further yesterday/today, and I think this might be an issue with the way we are running the command from Ruby: it is trying to read in the whole chunk from standard out at once. I updated it to a method which breaks the output into sane chunks, which seemed to work for me. I'm going to test it a little more, and if it works I will try to get that into a compose tomorrow.

I was able to successfully complete a RHV self-hosted+CFME+OCP deployment by repeating Tasos' steps above. I waited for the ansible processes on the Satellite to get stuck writing to stdout (file descriptor 1); cat'ing the file descriptor allowed the ansible process to continue. Each time the RHV install got stuck (4-5 times total), I was able to repeat the steps and continue the RHV deployment. OCP and CFME installed successfully without any manual intervention. ISO Media Version: QCI-1.1-RHEL-7-20161026.t.0

https://github.com/fusor/fusor/pull/1276

Caused by large stdout chunks failing to write, hanging the ansible process. Fixed by using Open3.capture2e instead of Open3.popen2e, which handles reading the stdout in chunks so that we don't have to. Thanks again everybody for the extra debugging information, very helpful. Will be fixed in the 11/1 compose.

Verified on QCI-1.1-RHEL-7-20161101.t.0.

*** Bug 1388929 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:0335
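For illustration, a minimal sketch of the Open3.popen2e to Open3.capture2e change described in the fix comment above; this is not the actual fusor code (see the linked PR), and `seq` is again a stand-in for the ansible-playbook command line:

```ruby
require 'open3'

# Open3.capture2e spawns the command and concurrently reads its combined
# stdout/stderr to EOF before returning, so the pipe can never fill up
# and stall the child process.
output, status = Open3.capture2e('seq 1 100000')
puts "exit #{status.exitstatus}, #{output.bytesize} bytes captured"
```

Note that capture2e returns only after the command exits and buffers the entire output in memory; a caller that needs incremental progress (e.g., tailing ansible.log as tasks run) would instead read from a popen2e pipe as output arrives.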