Bug 1504075 - [3.9] Installer terminates after single node reports error despite error tolerance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: low
Target Milestone: ---
Target Release: 3.9.0
Assignee: Russell Teague
QA Contact: Weihua Meng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-10-19 13:03 UTC by Justin Pierce
Modified: 2018-03-28 14:08 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Incorrect batch size calculations in Ansible 2.4.1 would cause playbooks to fail when using max_fail_percentage. The batch calculations were updated in Ansible 2.4.2 to correctly account for failures in each batch.
Clone Of:
Cloned To: 1538807
Environment:
Last Closed: 2018-03-28 14:08:09 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2018:0489 (last updated 2018-03-28 14:08:33 UTC)

Description Justin Pierce 2017-10-19 13:03:46 UTC
Description of problem:

Despite being invoked with some error tolerance (openshift_upgrade_nodes_max_fail_percentage=30):

/usr/bin/ansible-playbook -f 20 -i ./cicd-to-productization-inventory.py -M /home/opsmedic/aos-cd/tmp/tmp.linMudJ57G/openshift-ansible_extract/library/ -e docker_version=1.12.6-58.git85d7426.el7 -e openshift_upgrade_nodes_serial=5% -e openshift_upgrade_nodes_max_fail_percentage=30 -e osm_cluster_network_cidr=10.128.0.0/14 -e osm_host_subnet_length=9 -e openshift_portal_net=172.30.0.0/16 -e openshift_disable_check=disk_availability,docker_storage,memory_availability,package_version /home/opsmedic/aos-cd/tmp/tmp.linMudJ57G/openshift-ansible_extract/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_nodes.yml

a recent cluster upgrade terminated after a single node failure:

>>>>>>>>>>>>
TASK [openshift_node_upgrade : Upgrade Docker] *********************************
Thursday 19 October 2017  01:12:38 +0000 (0:00:11.869)       7:31:22.998 ****** 
fatal: [starter-ca-central-1-node-compute-55314]: FAILED! => {"changed": false, "failed": true, "module_stderr": "error: rpmdb: BDB0113 Thread/process 8621/140301613152256 failed: BDB1507 Thread died in Berkeley DB library\nerror: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery\nerror: cannot open Packages index using db5 -  (-30973)\nerror: cannot open Packages database in /var/lib/rpm\nTraceback (most recent call last):\n  File \"/tmp/ansible_0y109f/ansible_module_yum.py\", line 1267, in <module>\n    main()\n  File \"/tmp/ansible_0y109f/ansible_module_yum.py\", line 1235, in main\n    my.conf\n  File \"/usr/lib/python2.7/site-packages/yum/__init__.py\", line 1078, in <lambda>\n    conf = property(fget=lambda self: self._getConfig(),\n  File \"/usr/lib/python2.7/site-packages/yum/__init__.py\", line 349, in _getConfig\n    startupconf = config.readStartupConfig(fn, root, releasever)\n  File \"/usr/lib/python2.7/site-packages/yum/config.py\", line 1093, in readStartupConfig\n    startupconf.distroverpkg)\n  File \"/usr/lib/python2.7/site-packages/yum/config.py\", line 1235, in _getsysver\n    raise Errors.YumBaseError(\"Error: \" + str(e))\nyum.Errors.YumBaseError: Error: rpmdb open failed\n", "module_stdout": "", "msg": "MODULE FAILURE", "rc": 1}
ok: [starter-ca-central-1-node-compute-d9c6e]
ok: [starter-ca-central-1-node-compute-541af]
ok: [starter-ca-central-1-node-compute-5038d]
<<<<<<<<<<<<


Version-Release number of the following components:
v3.7.0-0.143.7

Comment 5 Russell Teague 2017-10-25 20:48:16 UTC
It appears that max_fail_percentage is broken when used with serial.  After further investigation, I will open an upstream Ansible issue with a simple reproducer.

Comment 6 Russell Teague 2017-10-27 14:36:47 UTC
Upstream Ansible issue: https://github.com/ansible/ansible/issues/32255
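
For context, a playbook along these lines demonstrates the behavior (a sketch with a hypothetical inventory and tasks, not necessarily the exact reproducer attached upstream): with 4 hosts, serial: 4 and max_fail_percentage: 30, a single failure is only 25% of the batch and should be tolerated, yet pre-2.4.2 Ansible recomputes the percentage against the 3 surviving hosts on the next task (33%) and aborts the play.

    # Hypothetical reproducer sketch; assumes an inventory of exactly 4 hosts.
    - hosts: all
      gather_facts: false
      serial: 4
      max_fail_percentage: 30
      tasks:
        - name: Fail on exactly one host in the batch
          fail:
            msg: simulated node failure
          when: inventory_hostname == groups['all'][0]

        - name: Runs on the 3 surviving hosts and triggers the recalculation
          debug:
            msg: still upgrading {{ inventory_hostname }}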

Comment 8 Scott Dodson 2017-10-30 12:58:06 UTC
To summarize Russ's findings:

Assume you have a batch size of 4 (serial=4) and you've set max_fail_percentage=30 as in the original bug report. When the first node fails, that means 25% of your nodes have failed. The playbook then executes the next task with 3 nodes and computes a new failure percentage using (1 failure / 3 nodes), which yields 33% and exceeds the 30% threshold; the playbook aborts.

A few general observations on max_fail_percentage. First, I think it's wise to use an integer serial value so that the number of nodes in a batch is not a function of environment size and is therefore predictable, which lets us work around the math Ansible is doing. Then we have to set max_fail_percentage so that it triggers at $failures / ($serial - $failures). So if we have a serial of 4 and we want the play to abort only on a second failure, then 1 / (4 - 1) = 33.3, and since the computed failure percentage has to exceed the threshold, we'd pick 34.
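
As a concrete illustration of that workaround (illustrative numbers only, using the same variables as the original invocation, e.g. passed via an extra-vars file):

    # Fixed batch of 4 nodes; tolerate 1 failure per batch.
    # Buggy per-task recalculation: 1 / (4 - 1) = 33.3%, so the threshold
    # must sit just above that value.
    openshift_upgrade_nodes_serial: 4
    openshift_upgrade_nodes_max_fail_percentage: 34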

Comment 9 Russell Teague 2017-11-01 19:33:30 UTC
This should be fixed in Ansible 2.4.2 once that ships.
https://github.com/ansible/ansible/pull/32362#issuecomment-341146880

Comment 11 Russell Teague 2017-12-13 20:06:46 UTC
Ansible 2.4.2 has shipped.
https://errata.devel.redhat.com/advisory/31715

Comment 15 Russell Teague 2018-01-22 22:04:45 UTC
Waiting for Ansible 2.4.2 to ship in Extras.
https://errata.devel.redhat.com/advisory/31922

Comment 16 Russell Teague 2018-01-25 14:47:45 UTC
Ansible 2.4.2 shipped in Extras.

Comment 17 Weihua Meng 2018-01-26 01:18:44 UTC
Fixed.
ansible-2.4.2.0-2.el7.noarch.rpm is in repo AtomicOpenShift/3.9/v3.9.0-0.24.0_2018-01-25.1/x86_64/os/Packages/

Comment 20 errata-xmlrpc 2018-03-28 14:08:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

