Description of problem:
v3.6.173.0.5

How reproducible:
100%

Steps to Reproduce:
1. Make a configuration change to the admission controllers in master-config
2. Upgrade using openshift-ansible

Actual results:
Manually introduced configuration changes are lost.

This was a particularly problematic change for our starter-us-east-2 upgrade. The timeline was as follows:

[A] The cluster was running 3.5 and had admission controller settings to set resource.request.cpu on pods to 6% of resource.limit.cpu.

[B] The openshift-ansible upgrade ran, modifying master-config.yaml according to the inventory file and removing these configuration customizations.

[C] openshift-ansible changed imagestreams, which caused hundreds of deployment configs to re-create their pods. These DCs had resource.limit.cpu=2 but no value set for resource.requests.cpu. Since the admission controller from [A] had been removed by [B], the pods defaulted to requesting 2 full CPUs. This rapidly consumed all available CPU on the cluster and left the majority of the pods stuck in the Pending state, waiting for adequate CPU.

[D] The operations configuration loop eventually ran and restored the desired admission controller settings, but this was too late for the pods created between [C] and [D] (virtually all of them). These CPU-hungry pods prevented the cluster from scheduling virtually anything.

Expected results:
Customers should, at a minimum, be able to opt out of configuration changes to master, node, docker, etc.
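For reference, a customization like the one in [A] would live under admissionConfig in master-config.yaml. A minimal sketch, assuming the ClusterResourceOverride admission plugin was used to derive CPU requests from limits (the exact plugin and values on the affected cluster are not shown in this report):

    # Hypothetical excerpt from /etc/origin/master/master-config.yaml
    admissionConfig:
      pluginConfig:
        ClusterResourceOverride:
          configuration:
            apiVersion: v1
            kind: ClusterResourceOverrideConfig
            # Set resource.requests.cpu to 6% of resource.limits.cpu
            cpuRequestToLimitPercent: 6

If this stanza is removed during an upgrade, pods created afterwards fall back to requesting their full CPU limit, which is exactly the failure mode described in [C].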
We'll figure out the specific bug; there is code related to migrating from the old admissionConfig format to the new admissionConfig format that is the likely root cause of this.

But I'm concerned about the request to enable opting out of configuration changes during upgrade. I don't think it's possible over the long term to allow people to opt out of configuration updates while ensuring a successful cluster lifecycle. We need to be able to assert changes to configuration over the lifecycle of the product. Changing the default logging options for docker is one specific case where we've been asked to change defaults. When those options are set outside of openshift-ansible, what are we to do when the logging team requests that we change them? We don't know whether what's there is simply the old default or a deliberate action on the part of the admin. And what do we do when the configuration change is vital to the upgrade and the product simply won't work with the old values?

My opinion is that we need to push forward with improvements that allow the config loop and openshift-ansible to co-exist, and I think the best return on investment will be ensuring that config-loop customizations are accurately expressed in openshift-ansible variables. If and when we find items that cannot be expressed as openshift-ansible variables, we need to add them.
We should still be able to manage cluster lifecycle while maintaining customizations made outside of openshift-ansible by adopting a slightly different theory of configuration:

1. Patch configuration instead of re-writing it. The config-loop generally modifies only the pertinent areas of a configuration file while leaving other configuration choices in place. Example: if a customer has not defined admission controllers in the o-a inventory, don't modify them. (See the merge-patch sketch after this list.)

2. When absolutely necessary, migrate existing configuration without loss. This should be demanded rarely, if ever, by the platform. Changes to the schema/semantics of the platform's configuration should be treated with the same gravity as an API change. Example: if a new admission controller is required for the platform to function reliably/securely, requiring it to be added to the configuration seems like a questionable decision. If we have no other choice, patch it into the configuration, but do not lose existing admission controllers.

3. Where patching cannot safely be achieved, consider it a platform defect. Layered configuration files may be necessary.

4. If an external application like docker has an incompatible configuration, fail fast. If a customer has opted out of o-a managing docker configuration changes and the existing configuration is incompatible with a platform requirement, exit with an error. In the specific logging example, the failure should only occur if the customer selects that logging should be installed and the configuration is incompatible.

#4 is particularly important for operations, as it gives us a warning signal before rolling out a potentially disruptive change to our environments.
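To illustrate point 1, here is a minimal sketch of the merge-patch idea: only the keys the installer explicitly manages are written, and everything else already in the file is preserved. This is purely illustrative (the helper name and the ImagePolicy payload are hypothetical), not how openshift-ansible behaves today.

    # merge_patch_sketch.py -- illustrative only, not part of openshift-ansible
    import yaml

    def merge_patch(existing, desired):
        """Recursively apply 'desired' onto 'existing', keeping any keys
        the installer does not explicitly manage."""
        if not isinstance(existing, dict) or not isinstance(desired, dict):
            # Leaf values the installer manages replace whatever was there.
            return desired
        merged = dict(existing)
        for key, value in desired.items():
            merged[key] = merge_patch(existing.get(key), value)
        return merged

    with open("master-config.yaml") as f:
        current = yaml.safe_load(f)

    # Only the settings this upgrade actually needs to assert.
    installer_managed = {
        "admissionConfig": {
            "pluginConfig": {
                "openshift.io/ImagePolicy": {
                    "configuration": {"apiVersion": "v1", "kind": "ImagePolicyConfig"},
                },
            },
        },
    }

    # Existing plugins (e.g. ClusterResourceOverride) survive the merge.
    updated = merge_patch(current, installer_managed)
    print(yaml.safe_dump(updated, default_flow_style=False))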
This is really a pretty old issue showing up again, and IIRC it goes back to the decision that the config file should be a fully-specified manifestation of the config API object and not leave any defaults to be supplied by code. We knew at the time it would lead to this problem, but that's what we did, so now we have this problem to manage. I'm pretty sure there's a lot of prior discussion that could be dug up but I can't find it offhand. For my two cents, layered config files merging into the fully-specified config seems like a decent approach.
It looks like we messed things up when adding openshift.io/ImagePolicy via openshift_master_admission_plugin_config
These configuration changes could have been implemented using the variable openshift_master_admission_plugin_config. If this variable had been set, your desired values would have been applied during upgrades. In any case, the upgrade utility has been modified to only run that task if openshift_master_admission_plugin_config is set in the inventory rather than left at the default fact value. PR created: https://github.com/openshift/openshift-ansible/pull/5763
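For reference, a hedged sketch of how such a customization might be expressed in the o-a inventory so it survives upgrades (the plugin payload here is illustrative, based on the description in comment 0):

    # Hypothetical excerpt from the openshift-ansible inventory
    [OSEv3:vars]
    # Dict of admission plugin configs to apply to master-config.yaml
    openshift_master_admission_plugin_config={"ClusterResourceOverride": {"configuration": {"apiVersion": "v1", "kind": "ClusterResourceOverrideConfig", "cpuRequestToLimitPercent": 6}}}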
@Michael Gugino
Failed to reproduce the issue on an old v3.7 build which did not include your PR 5763.

Version: openshift-ansible-3.7.0-0.145.0.git.0.b37c5e6.el7.noarch

Steps:
1. Install OCP v3.6.
2. Change master-config.yml according to comment 1 and restart the master service.
3. Prepare repos and upgrade the above cluster to v3.7.

After the upgrade, master-config.yml was checked and found to be the same as before the upgrade.
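For reference, a rough sketch of how the before/after comparison can be done (paths assume a default RPM install; the exact method QE used is not stated):

    # Save a copy before running the upgrade (default config path for RPM installs)
    cp /etc/origin/master/master-config.yaml /root/master-config.pre-upgrade.yaml
    # ... run the upgrade playbook ...
    # An empty diff means the config survived the upgrade unchanged
    diff -u /root/master-config.pre-upgrade.yaml /etc/origin/master/master-config.yaml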
PR created to resolve the master config not being upgraded during upgrades: https://github.com/openshift/openshift-ansible/pull/5875
PR Merged.
The upgrade log from v3.5 to v3.6 is in the attachment. I checked it and found that master_config_upgrade.yml was skipped, just the same as on the earlier v3.7. openshift-ansible-3.6.173.0.5-3.git.0.522a92a.el7.noarch.rpm is the first available release build of v3.6. So if the version is right, some other steps must be needed to reproduce this issue.
@Justin Pierce
Before verifying the bug, QE needs to reproduce it first. Could you help confirm the steps in comment 22? According to your description, no special steps are needed, but QE cannot reproduce it on either the earlier v3.7 build or v3.6, which is the same version as yours.
(In reply to liujia from comment #27)
> @Justin Pierce
>
> Before verifying the bug, QE needs to reproduce it first. Could you help
> confirm the steps in comment 22? According to your description, no special
> steps are needed, but QE cannot reproduce it on either the earlier v3.7
> build or v3.6, which is the same version as yours.

Okay, I have this sorted out.

The originally reported issue was caused by running the upgrade_control_plane.yml playbook, not the upgrade.yml playbook. That issue was fixed with this PR: https://github.com/openshift/openshift-ansible/pull/5763

However, when attempting to verify the issue, QE used upgrade.yml (to upgrade both the control plane and the nodes). That playbook neglected to apply the appropriate configuration settings to the master config, due to an unreported bug which appears to have existed for quite some time. This second, unreported bug (not running the master config upgrade) was fixed in this PR: https://github.com/openshift/openshift-ansible/pull/5875

@liujia To verify the original bug as reported in this BZ, we should run upgrade_control_plane.yml instead of upgrade.yml. That should be all that is necessary for this BZ to be verified.
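For anyone following along, the two invocations differ only in which playbook is run; roughly (playbook paths vary by openshift-ansible release, so check the installed tree):

    # Control-plane-only upgrade -- the path that triggered the original report
    ansible-playbook -i /path/to/inventory \
        playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_control_plane.yml

    # Full upgrade of control plane and nodes -- the path QE used initially
    ansible-playbook -i /path/to/inventory \
        playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml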
@Michael Gugino
Thanks for the important info about running upgrade_control_plane.yml instead of upgrade.yml. Unfortunately, I hit another issue: https://bugzilla.redhat.com/show_bug.cgi?id=1508290. I will come back to this issue after that one is fixed.
Verified on openshift-ansible-3.7.4-1.git.0.254e849.el7.noarch. Checked that PR 5763 and PR 5875 have merged. Upgrading the masters first succeeded with no changes to the admission plugin config; upgrading the nodes afterwards also succeeded with no changes to the admission plugin config.
Test case added; removing the testcaseneeded keyword.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188