Hide Forgot
Description of problem: When running node scaleup playbook for 100 nodes the playbook fails with following error 2018-03-20 18:10:58,961 p=11077 u=root | TASK [openshift_node : Template multipath configuration] ************************************************************************************ 2018-03-20 18:10:58,961 p=11077 u=root | task path: /root/openshift-ansible/roles/openshift_node/tasks/storage_plugins/iscsi.yml:24 2018-03-20 18:11:04,361 p=11077 u=root | ERROR! Unexpected Exception, this is probably a bug: [Errno 12] Cannot allocate memory 2018-03-20 18:11:04,370 p=11077 u=root | the full traceback was: Traceback (most recent call last): File "/usr/bin/ansible-playbook", line 106, in <module> exit_code = cli.run() File "/usr/lib/python2.7/site-packages/ansible/cli/playbook.py", line 122, in run results = pbex.run() File "/usr/lib/python2.7/site-packages/ansible/executor/playbook_executor.py", line 154, in run result = self._tqm.run(play=play) File "/usr/lib/python2.7/site-packages/ansible/executor/task_queue_manager.py", line 290, in run play_return = strategy.run(iterator, play_context) File "/usr/lib/python2.7/site-packages/ansible/plugins/strategy/linear.py", line 277, in run self._queue_task(host, task, task_vars, play_context) File "/usr/lib/python2.7/site-packages/ansible/plugins/strategy/__init__.py", line 254, in _queue_task worker_prc.start() File "/usr/lib64/python2.7/multiprocessing/process.py", line 130, in start self._popen = Popen(self) File "/usr/lib64/python2.7/multiprocessing/forking.py", line 121, in __init__ self.pid = os.fork() OSError: [Errno 12] Cannot allocate memory forks in ansible.conf is set to 25 Version-Release number of the following components: rpm -q openshift-ansible openshift-ansible-3.9.11-1.git.0.153fcff.el7.noarch openshift-ansible head is 6155f6e6c03f3a2545f47e3a08a91674e7526f98 rpm -q ansible ansible-2.4.3.0-1.el7ae.noarch ansible --version ansible 2.4.3.0 config file = /root/openshift-ansible/ansible.cfg configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules'] ansible python module location = /usr/lib/python2.7/site-packages/ansible executable location = /usr/bin/ansible python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] How reproducible: ansible fork set to 25 and run node scaleup playbook Actual results: Please include the entire output from the last TASK line through the end of output if an error is generated Expected results: Playbook should complete Additional info: Please attach logs from ansible-playbook with the -vvv flag Attached
Created attachment 1410703 [details] ansible log with -vvv
Created attachment 1410705 [details] pbench pidstat memory usage during the run
Created attachment 1410706 [details] pbench sar memory usage during the run
This seems to have started when ansible-2.4.3 became required (along with the functionality in openshift-ansible that made it required). We are testing going back to 2.4.2 and an older level of openshift-ansible to double check, but those scaleups did not encounter the issue. In the scale-ci lab @jmencak could scale up running ansible on an instance with 2GB memory but can no longer do so with ansible-2.4.3 and openshift-ansible latest.
Created attachment 1411215 [details] A small 8n cluster with ansible 2.4.2(success)/2.4.3(fail) and openshift-ansible 3.9.7
Per discussion; we will include a release note about observable RAM requirements for 3.9 and work on solving the root issue for 3.9.z
Ran some memory profiling with Fedora Atomic 27, single host install: 1. Installed v3.9.0 cluster 2. Re-ran playbook: 2.4.2 memory usage at end of run: 95.297 MiB 2.4.3 memory usage at end of run: 190.195 MiB 2.5.0rc3 memory usage at end of run: 72.312 MiB There is definitely a regression for 2.4.3, and 2.4.2 performance is not great. I'm going to test with more than one host to see if the problem is linear, etc.
For comment 14 above, the playbook ran against already-operational cluster was: /home/mgugino/git/openshift-ansible/playbooks/openshift-node/config.yml openshift-node play seemed to be the biggest spike of memory during cluster install; wanted to see if that was an artifact of being called after a good amount of plays have already run. Seems ansible doesn't like that play and roles.
Deployed new 3.9 cluster w/ 5 hosts (2 masters, 3 nodes). Ran same playbook as above (openshift-node/config.yml) v2.5.0rc3: 92.961 MiB. Max: 96.941 v2.4.4-0.2.rc1: Mem at the end of run: 93.383 MiB. Max: 93.383 MiB v2.4.3: At end of run: 254.383 MiB. Max: 254.383 MiB v2.4.2: At end of run: 129.465 MiB. Max: 147.062 MiB In all cases, the memory at the start of the playbook run is ~55MiB. These numbers are per: fork of ansible. The free RAM on a machine can vary dramatically depending on the number of simultaneous operations.
@alex, similar bug, https://bugzilla.redhat.com/show_bug.cgi?id=1557290. it works with large memory. Could you add it to know issues list too?
@scott, It seems the Cannot allocate memory issue occurred on scaleup/redeploy. Is that related to fact gather?
Hit memory issue again. when install OCP with logging enabled. TASK [openshift_logging : Add Kibana route information to web console asset config] *** Thursday 22 March 2018 00:24:57 +0000 (0:00:00.066) 0:13:51.448 ******** TASK [openshift_web_console : Read the existing web console config map] ******** Thursday 22 March 2018 00:24:59 +0000 (0:00:01.533) 0:13:52.981 ******** ERROR! Unexpected Exception, this is probably a bug: [Errno 12] Cannot allocate memory to see the full traceback, use -vvv tools/launch_instance.rb:430:in `block in run_ansible_playbook': ansible failed execution, see logs (RuntimeError) from tools/launch_instance.rb:418:in `times' from tools/launch_instance.rb:418:in `run_ansible_playbook' from tools/launch_instance.rb:489:in `installation_task' from tools/launch_instance.rb:560:in `block in launch_template' from tools/launch_instance.rb:559:in `each' from tools/launch_instance.rb:559:in `launch_template' from tools/launch_instance.rb:54:in `block (2 levels) in run' from /home/slave3/.gem/ruby/gems/commander-4.4.3/lib/commander/command.rb:178:in `call' from /home/slave3/.gem/ruby/gems/commander-4.4.3/lib/commander/command.rb:153:in `run' from /home/slave3/.gem/ruby/gems/commander-4.4.3/lib/commander/runner.rb:446:in `run_active_command' from /home/slave3/.gem/ruby/gems/commander-4.4.3/lib/commander/runner.rb:68:in `run!' from /home/slave3/.gem/ruby/gems/commander-4.4.3/lib/commander/delegates.rb:15:in `run!' from tools/launch_instance.rb:92:in `run' from tools/launch_instance.rb:662:in `<main>' waiting for operation up to 36000 seconds.. [00:25:01] INFO> Exit Status: 64000 ############ ANSIBLE END: /home/slave3/workspace/Launch Environment Flexy/private-openshift-ansible/playbooks/deploy_cluster.yml ##
Created attachment 1412102 [details] profiled task memory without serial This file contains profiled memory output of ansible control host without modification to 2.4.3.
Created attachment 1412104 [details] profiled task memory serial 10 forks, 38 nodes, serial 25% in playbooks/openshift-node/private/configure_nodes.yml
Created attachment 1412164 [details] rhel 2.4.3 (pip) output: 38 nodes 10 forks
Created attachment 1412165 [details] rhel 2.4.4 (pip) output: 38 nodes 10 forks
Created attachment 1412180 [details] 2.4.3 (venv) rhel pip freeze
Just ran another test. RHEL control host, 38 nodes, 10 forks. This time, I put "serial: 25%" on the play that calls the openshift_node role. Memory usage peaked at 9,417,699,328 bytes and the run complated without OOM. profile_memory reported max of 928 MiB during this run. The same setup without the "serial: 25%" would have failed. My theory at this point is that: 1) Memory ballooning is made worse by having additional hosts in-scope during a play. 2) The memory is copied once for each host in-scope, not per fork. 3) Possible problem with logic causing too many forks to be created.
Also, these issues only seem to exist on RHEL. Fedora + Python 3.6 + ansible 2.4.4 or 2.4.3 do not have the serious memory ballooning issues. The profile_memory callback does report 1-2 GiB on Fedora, but the actual total RAM usage is ~3.5 GiB for 10 forks and 38 hosts. On RHEL, ansible 2.4.4 installed via pip into a virtualenv (python 2) yields a max RAM usage of 4,650,131,456 bytes. I believe that the issue may be a dependent library bug causing the issue on RHEL.
RHEL + rpm installed ansible 2.4.3 + https://github.com/openshift/openshift-ansible/pull/7648 fixes the issue entirely. include_tasks vs import_tasks seems to be the main culprit. There is an 'include_task' during upgrades for node_config_hook. node_config_hook runs during the serial portion of the upgrade, and based on previous testing 'serial' greatly reduces any memory impact.
Running with the PR in comment 39 on ansible 2.4.3, the problem is not encountered. There is one quick spike to 8GB but that is it. Screen shot of kbmemused attached.
Created attachment 1413217 [details] kbmemused for PR 7648 fix kbmemused with https://github.com/openshift/openshift-ansible/pull/7648 fix in place.
Created attachment 1413297 [details] kbmemused for PR 7648 fix with ansible 2.4.4 Attached kbmemused graph available during scaleup with the PR code and ansible 2.4.4. It looks similar to 2.4.3, in this case spike is 9G vs 8G in ansible 2.4.3
The fix do reduce the memory usage. Deploy OCP with Logging succeed using 8G Memory. Note: the pure logging redeploy still failed, it was traced in https://bugzilla.redhat.com/show_bug.cgi?id=1557290.
Fresh install and scaleup is running well, no regression issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0600