Bug 1558672

Summary: openshift ansible node scaleup fails - Cannot allocate memory
Product: OpenShift Container Platform
Reporter: Vikas Laad <vlaad>
Component: Installer
Assignee: Michael Gugino <mgugino>
Status: CLOSED ERRATA
QA Contact: Johnny Liu <jialiu>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.9.0
CC: adellape, anli, aos-bugs, jeder, jmencak, jokerman, mifiedle, mmccomas, smunilla, tparsons
Target Milestone: ---
Keywords: Regression
Target Release: 3.9.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: aos-scalability-39
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The node scaleup and installation playbooks may have consumed more memory than expected due to the use of 'include_tasks' in several places. The majority of these instances have now been converted to 'import_tasks' calls, which do not consume as much memory. After this change, memory consumption should be below 100 MiB per host; for large (100+ hosts) environments we recommend a control host with at least 16 GiB of memory.
Story Points: ---
Clone Of:
: 1559527 (view as bug list)
Environment:
Last Closed: 2018-03-28 17:35:12 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1559527    
Attachments:
  ansible log with -vvv
  pbench pidstat memory usage during the run
  pbench sar memory usage during the run
  A small 8n cluster with ansible 2.4.2(success)/2.4.3(fail) and openshift-ansible 3.9.7
  profiled task memory without serial
  profiled task memory serial
  rhel 2.4.3 (pip) output: 38 nodes 10 forks
  rhel 2.4.4 (pip) output: 38 nodes 10 forks
  2.4.3 (venv) rhel pip freeze
  kbmemused for PR 7648 fix
  kbmemused for PR 7648 fix with ansible 2.4.4

Description Vikas Laad 2018-03-20 18:27:13 UTC
Description of problem:
When running the node scaleup playbook for 100 nodes, the playbook fails with the following error:
2018-03-20 18:10:58,961 p=11077 u=root |  TASK [openshift_node : Template multipath configuration] ************************************************************************************
2018-03-20 18:10:58,961 p=11077 u=root |  task path: /root/openshift-ansible/roles/openshift_node/tasks/storage_plugins/iscsi.yml:24
2018-03-20 18:11:04,361 p=11077 u=root |  ERROR! Unexpected Exception, this is probably a bug: [Errno 12] Cannot allocate memory
2018-03-20 18:11:04,370 p=11077 u=root |  the full traceback was:

Traceback (most recent call last):
  File "/usr/bin/ansible-playbook", line 106, in <module>
    exit_code = cli.run()
  File "/usr/lib/python2.7/site-packages/ansible/cli/playbook.py", line 122, in run
    results = pbex.run()
  File "/usr/lib/python2.7/site-packages/ansible/executor/playbook_executor.py", line 154, in run
    result = self._tqm.run(play=play)
  File "/usr/lib/python2.7/site-packages/ansible/executor/task_queue_manager.py", line 290, in run
    play_return = strategy.run(iterator, play_context)
  File "/usr/lib/python2.7/site-packages/ansible/plugins/strategy/linear.py", line 277, in run
    self._queue_task(host, task, task_vars, play_context)
  File "/usr/lib/python2.7/site-packages/ansible/plugins/strategy/__init__.py", line 254, in _queue_task
    worker_prc.start()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/usr/lib64/python2.7/multiprocessing/forking.py", line 121, in __init__
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

forks in ansible.cfg is set to 25

Version-Release number of the following components:
rpm -q openshift-ansible
openshift-ansible-3.9.11-1.git.0.153fcff.el7.noarch

openshift-ansible head is 6155f6e6c03f3a2545f47e3a08a91674e7526f98

rpm -q ansible
ansible-2.4.3.0-1.el7ae.noarch

ansible --version
ansible 2.4.3.0
  config file = /root/openshift-ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible:
Set ansible forks to 25 and run the node scaleup playbook.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:
Playbook should complete

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
Attached

Comment 1 Vikas Laad 2018-03-20 18:32:04 UTC
Created attachment 1410703 [details]
ansible log with -vvv

Comment 3 Vikas Laad 2018-03-20 18:41:30 UTC
Created attachment 1410705 [details]
pbench pidstat memory usage during the run

Comment 4 Vikas Laad 2018-03-20 18:41:56 UTC
Created attachment 1410706 [details]
pbench sar memory usage during the run

Comment 6 Mike Fiedler 2018-03-20 18:48:03 UTC
This seems to have started when ansible-2.4.3 became required (along with the functionality in openshift-ansible that made it required).  We are testing going back to 2.4.2 and an older level of openshift-ansible to double-check; those earlier scaleups did not encounter the issue.

In the scale-ci lab, @jmencak could scale up running ansible on an instance with 2 GB of memory, but can no longer do so with ansible-2.4.3 and the latest openshift-ansible.

Comment 12 Jiří Mencák 2018-03-21 13:23:39 UTC
Created attachment 1411215 [details]
A small 8n cluster with ansible 2.4.2(success)/2.4.3(fail) and openshift-ansible 3.9.7

Comment 13 N. Harrison Ripps 2018-03-21 14:15:45 UTC
Per discussion, we will include a release note about the observed RAM requirements for 3.9 and work on solving the root issue in 3.9.z.

Comment 14 Michael Gugino 2018-03-21 15:17:43 UTC
Ran some memory profiling with Fedora Atomic 27, single host install:

1. Installed v3.9.0 cluster
2. Re-ran playbook:

2.4.2 memory usage at end of run: 95.297 MiB
2.4.3 memory usage at end of run: 190.195 MiB
2.5.0rc3 memory usage at end of run: 72.312 MiB

There is definitely a regression for 2.4.3, and 2.4.2 performance is not great.

I'm going to test with more than one host to see if the problem is linear, etc.

Comment 15 Michael Gugino 2018-03-21 15:19:14 UTC
For comment 14 above, the playbook run against the already-operational cluster was: /home/mgugino/git/openshift-ansible/playbooks/openshift-node/config.yml

The openshift-node play seemed to cause the biggest memory spike during cluster install; I wanted to see whether that was an artifact of it being called after a good number of plays had already run.  It seems ansible doesn't like that play and its roles.

Comment 17 Michael Gugino 2018-03-21 17:26:23 UTC
Deployed new 3.9 cluster w/ 5 hosts (2 masters, 3 nodes).

Ran same playbook as above (openshift-node/config.yml)

v2.5.0rc3: 92.961 MiB.  Max: 96.941

v2.4.4-0.2.rc1: Mem at the end of run: 93.383 MiB.  Max: 93.383 MiB

v2.4.3: At end of run: 254.383 MiB. Max: 254.383 MiB

v2.4.2: At end of run: 129.465 MiB. Max: 147.062 MiB

In all cases, the memory at the start of the playbook run is ~55MiB.

These numbers are per fork of ansible.  The free RAM on a machine can vary dramatically depending on the number of simultaneous operations.
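
For a rough, back-of-the-envelope sense of scale only (actual usage also depends on the number of hosts in scope, not just forks), with the 25 forks configured in ansible.cfg:

  25 forks x ~254 MiB per fork (2.4.3 figure above) ≈ 6.2 GiB of worker memory

which is already far beyond the 2 GB control host mentioned in comment 6.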

Comment 23 Anping Li 2018-03-22 01:44:21 UTC
@alex, 
similar bug: https://bugzilla.redhat.com/show_bug.cgi?id=1557290. It works with a large amount of memory. Could you add it to the known issues list too?

Comment 24 Anping Li 2018-03-22 01:47:25 UTC
@scott, It seems the Cannot allocate memory issue occurred on scaleup/redeploy.  Is that related to fact gathering?

Comment 25 Anping Li 2018-03-22 03:27:33 UTC
Hit the memory issue again when installing OCP with logging enabled.

TASK [openshift_logging : Add Kibana route information to web console asset config] ***
Thursday 22 March 2018  00:24:57 +0000 (0:00:00.066)       0:13:51.448 ******** 

TASK [openshift_web_console : Read the existing web console config map] ********
Thursday 22 March 2018  00:24:59 +0000 (0:00:01.533)       0:13:52.981 ******** 
ERROR! Unexpected Exception, this is probably a bug: [Errno 12] Cannot allocate memory
to see the full traceback, use -vvv
tools/launch_instance.rb:430:in `block in run_ansible_playbook': ansible failed execution, see logs (RuntimeError)
	from tools/launch_instance.rb:418:in `times'
	from tools/launch_instance.rb:418:in `run_ansible_playbook'
	from tools/launch_instance.rb:489:in `installation_task'
	from tools/launch_instance.rb:560:in `block in launch_template'
	from tools/launch_instance.rb:559:in `each'
	from tools/launch_instance.rb:559:in `launch_template'
	from tools/launch_instance.rb:54:in `block (2 levels) in run'
	from /home/slave3/.gem/ruby/gems/commander-4.4.3/lib/commander/command.rb:178:in `call'
	from /home/slave3/.gem/ruby/gems/commander-4.4.3/lib/commander/command.rb:153:in `run'
	from /home/slave3/.gem/ruby/gems/commander-4.4.3/lib/commander/runner.rb:446:in `run_active_command'
	from /home/slave3/.gem/ruby/gems/commander-4.4.3/lib/commander/runner.rb:68:in `run!'
	from /home/slave3/.gem/ruby/gems/commander-4.4.3/lib/commander/delegates.rb:15:in `run!'
	from tools/launch_instance.rb:92:in `run'
	from tools/launch_instance.rb:662:in `<main>'
waiting for operation up to 36000 seconds..

[00:25:01] INFO> Exit Status: 64000
############ ANSIBLE END: /home/slave3/workspace/Launch Environment Flexy/private-openshift-ansible/playbooks/deploy_cluster.yml ##

Comment 30 Michael Gugino 2018-03-23 13:06:11 UTC
Created attachment 1412102 [details]
profiled task memory without serial

This file contains the profiled task memory output of the ansible control host running unmodified ansible 2.4.3.

Comment 31 Michael Gugino 2018-03-23 13:09:43 UTC
Created attachment 1412104 [details]
profiled task memory serial

10 forks, 38 nodes, serial 25% in playbooks/openshift-node/private/configure_nodes.yml

Comment 34 Michael Gugino 2018-03-23 15:16:46 UTC
Created attachment 1412164 [details]
rhel 2.4.3 (pip) output: 38 nodes 10 forks

Comment 35 Michael Gugino 2018-03-23 15:28:38 UTC
Created attachment 1412165 [details]
rhel 2.4.4 (pip) output: 38 nodes 10 forks

Comment 36 Michael Gugino 2018-03-23 15:48:14 UTC
Created attachment 1412180 [details]
2.4.3 (venv) rhel pip freeze

Comment 37 Michael Gugino 2018-03-23 23:47:23 UTC
Just ran another test.  RHEL control host, 38 nodes, 10 forks.  This time, I put "serial: 25%" on the play that calls the openshift_node role (a sketch of this change is shown below).  Memory usage peaked at 9,417,699,328 bytes and the run completed without OOM.  profile_memory reported a max of 928 MiB during this run.

The same setup without the "serial: 25%" would have failed.

My theory at this point is that:
1) Memory ballooning is made worse by having additional hosts in-scope during a play.
2) The memory is copied once for each host in-scope, not per fork.
3) Possible problem with logic causing too many forks to be created.
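
For illustration only (not the exact play from configure_nodes.yml; play and group names are made up), the modification amounts to adding a serial keyword to the play so that only 25% of the in-scope hosts are processed per batch:

- name: Configure nodes            # hypothetical play name
  hosts: oo_nodes_to_config        # hypothetical group name
  serial: "25%"                    # handle 25% of the hosts at a time instead of all at once
  roles:
    - role: openshift_node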

Comment 38 Michael Gugino 2018-03-24 00:17:40 UTC
Also, these issues only seem to exist on RHEL.  Fedora + Python 3.6 + ansible 2.4.4 or 2.4.3 do not have the serious memory ballooning issues.  The profile_memory callback does report 1-2 GiB on Fedora, but the actual total RAM usage is ~3.5 GiB for 10 forks and 38 hosts.

On RHEL, ansible 2.4.4 installed via pip into a virtualenv (python 2) yields a max RAM usage of 4,650,131,456 bytes.

I believe the issue on RHEL may be caused by a bug in a dependent library.

Comment 39 Michael Gugino 2018-03-26 02:42:58 UTC
RHEL + rpm-installed ansible 2.4.3 + https://github.com/openshift/openshift-ansible/pull/7648 fixes the issue entirely.

include_tasks vs import_tasks seems to be the main culprit.  There is an 'include_tasks' during upgrades for node_config_hook.  node_config_hook runs during the serial portion of the upgrade, and based on previous testing, 'serial' greatly reduces any memory impact.
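
To make the difference concrete, here is an illustrative before/after of the kind of conversion in PR 7648 (not the actual openshift-ansible code; task and file names are made up). import_tasks is resolved statically when the playbook is parsed, whereas include_tasks is processed dynamically at runtime, per host, which appears to be where the extra memory comes from:

# Before: dynamic include, processed at runtime for each host in scope
- name: Configure node storage plugins    # hypothetical task name
  include_tasks: storage_plugins.yml      # hypothetical file name

# After: static import, resolved once at parse time
- name: Configure node storage plugins
  import_tasks: storage_plugins.yml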

Comment 40 Mike Fiedler 2018-03-26 15:23:06 UTC
Running with the PR from comment 39 on ansible 2.4.3, the problem is not encountered.  There is one quick spike to 8 GB, but that is it. Screenshot of kbmemused attached.

Comment 41 Mike Fiedler 2018-03-26 15:24:15 UTC
Created attachment 1413217 [details]
kbmemused for PR 7648 fix

kbmemused with https://github.com/openshift/openshift-ansible/pull/7648 fix in place.

Comment 42 Vikas Laad 2018-03-26 19:07:21 UTC
Created attachment 1413297 [details]
kbmemused for PR 7648 fix with ansible 2.4.4

Attached is the kbmemused graph captured during scaleup with the PR code and ansible 2.4.4. It looks similar to 2.4.3; in this case the spike is 9 GB vs 8 GB with ansible 2.4.3.

Comment 45 Anping Li 2018-03-27 07:09:40 UTC
The fix does reduce the memory usage. Deploying OCP with Logging succeeded using 8 GB of memory.

Note: the pure logging redeploy still failed; it is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1557290.

Comment 46 Johnny Liu 2018-03-27 10:48:05 UTC
Fresh install and scaleup are running well; no regression issues.

Comment 49 errata-xmlrpc 2018-03-28 17:35:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0600

Comment 50 Red Hat Bugzilla 2023-09-14 04:25:46 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days