Description of problem:
v3.6.173.0.5

How reproducible:
100%

Steps to Reproduce:
1. Make a configuration change to the admission controllers in master-config
2. Upgrade using openshift-ansible

Actual results:
Manually introduced configuration changes are lost.

This was a particularly problematic change for our starter-us-east-2 upgrade. The timeline was as follows:

[A] The cluster was running 3.5 and had admission controller settings to set resource.request.cpu on pods to 6% of resource.limit.cpu.

[B] The openshift-ansible upgrade ran, modifying master-config.yaml according to the inventory file and removing these configuration customizations.

[C] openshift-ansible changed imagestreams, which caused hundreds of deployment configs to re-create their pods. These DCs had resource.limit.cpu=2 but no value set for resource.requests.cpu. Since the admission controller from [A] had been removed by [B], the pods defaulted to requesting 2 full CPUs. This rapidly consumed all available CPU on the cluster and left the majority of the pods stuck in the Pending state, waiting for adequate CPU.

[D] The operations configuration loop eventually ran and restored the desired admission controller settings, but this was too late for the pods created between [C] and [D] (virtually all of them). These CPU-hungry pods prevented the cluster from scheduling virtually anything.

Expected results:
Customers should, at a minimum, be able to opt out of configuration changes to master, node, docker, etc.
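For reference, a customization like the one in [A] would live under admissionConfig in master-config.yaml. A minimal sketch, assuming the ClusterResourceOverride admission plugin was used to derive CPU requests from limits (the exact plugin and values on the affected cluster are not shown in this report):

    # Hypothetical excerpt from /etc/origin/master/master-config.yaml
    admissionConfig:
      pluginConfig:
        ClusterResourceOverride:
          configuration:
            apiVersion: v1
            kind: ClusterResourceOverrideConfig
            # Set resource.requests.cpu to 6% of resource.limits.cpu
            cpuRequestToLimitPercent: 6

If this stanza is removed during an upgrade, pods created afterwards fall back to requesting their full CPU limit, which is exactly the failure mode described in [C].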
We'll figure out the specific bug; there is code related to migrating from the old admissionConfig format to the new admissionConfig format that is the likely root cause of this.

But I'm concerned about the request to enable opting out of configuration changes during upgrade. I don't think it's possible over the long term to allow people to opt out of configuration updates while ensuring a successful cluster lifecycle. We need to be able to assert changes to configuration over the lifecycle of the product. Changing the default logging options for docker is one specific case where we've been asked to change defaults. When those options are set outside of openshift-ansible, what are we to do when the logging team requests that we change them? We don't know whether what's there is simply the old default or a deliberate action on the part of the admin. And what do we do when the configuration change is vital to the upgrade and the product simply won't work with the old values?

My opinion is that we need to push forward with improvements that allow the config loop and openshift-ansible to co-exist, and I think the best return on investment will be ensuring that config-loop customizations are accurately expressed in openshift-ansible variables. If and when we find items that cannot be expressed as openshift-ansible variables, we need to add them.
We should still be able to manage cluster lifecycle while maintaining customizations made outside of openshift-ansible by adopting a slightly different theory of configuration:

1. Patch configuration instead of re-writing it. The config-loop generally modifies only the pertinent areas of a configuration file while leaving other configuration choices in place. Example: if a customer has not defined admission controllers in the o-a inventory, don't modify them. (See the merge-patch sketch after this list.)

2. When absolutely necessary, migrate existing configuration without loss. This should be demanded rarely, if ever, by the platform. Changes to the schema/semantics of the platform's configuration should be treated with the same gravity as an API change. Example: if a new admission controller is required for the platform to function reliably/securely, requiring it to be added to the configuration seems like a questionable decision. If we have no other choice, patch it into the configuration, but do not lose existing admission controllers.

3. Where patching cannot safely be achieved, consider it a platform defect. Layered configuration files may be necessary.

4. If an external application like docker has an incompatible configuration, fail fast. If a customer has opted out of o-a managing docker configuration changes and the existing configuration is incompatible with a platform requirement, exit with an error. In the specific logging example, the failure should only occur if the customer selects that logging should be installed and the configuration is incompatible.

#4 is particularly important for operations, as it gives us a warning signal before rolling out a potentially disruptive change to our environments.
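To illustrate point 1, here is a minimal sketch of the merge-patch idea: only the keys the installer explicitly manages are written, and everything else already in the file is preserved. This is purely illustrative (the helper name and the ImagePolicy payload are hypothetical), not how openshift-ansible behaves today.

    # merge_patch_sketch.py -- illustrative only, not part of openshift-ansible
    import yaml

    def merge_patch(existing, desired):
        """Recursively apply 'desired' onto 'existing', keeping any keys
        the installer does not explicitly manage."""
        if not isinstance(existing, dict) or not isinstance(desired, dict):
            # Leaf values the installer manages replace whatever was there.
            return desired
        merged = dict(existing)
        for key, value in desired.items():
            merged[key] = merge_patch(existing.get(key), value)
        return merged

    with open("master-config.yaml") as f:
        current = yaml.safe_load(f)

    # Only the settings this upgrade actually needs to assert.
    installer_managed = {
        "admissionConfig": {
            "pluginConfig": {
                "openshift.io/ImagePolicy": {
                    "configuration": {"apiVersion": "v1", "kind": "ImagePolicyConfig"},
                },
            },
        },
    }

    # Existing plugins (e.g. ClusterResourceOverride) survive the merge.
    updated = merge_patch(current, installer_managed)
    print(yaml.safe_dump(updated, default_flow_style=False))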
This is really a pretty old issue showing up again, and IIRC it goes back to the decision that the config file should be a fully-specified manifestation of the config API object and not leave any defaults to be supplied by code. We knew at the time it would lead to this problem, but that's what we did, so now we have this problem to manage. I'm pretty sure there's a lot of prior discussion that could be dug up but I can't find it offhand. For my two cents, layered config files merging into the fully-specified config seems like a decent approach.
It looks like we messed things up when adding openshift.io/ImagePolicy via openshift_master_admission_plugin_config
These configuration changes could have been implemented using the variable openshift_master_admission_plugin_config. If this variable had been set, your desired values would have been applied during upgrades. In any case, the upgrade utility has been modified to only run that task if openshift_master_admission_plugin_config is set in the inventory rather than left at the default fact value. PR created: https://github.com/openshift/openshift-ansible/pull/5763
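For reference, a hedged sketch of how such a customization might be expressed in the o-a inventory so it survives upgrades (the plugin payload here is illustrative, based on the description in comment 0):

    # Hypothetical excerpt from the openshift-ansible inventory
    [OSEv3:vars]
    # Dict of admission plugin configs to apply to master-config.yaml
    openshift_master_admission_plugin_config={"ClusterResourceOverride": {"configuration": {"apiVersion": "v1", "kind": "ClusterResourceOverrideConfig", "cpuRequestToLimitPercent": 6}}}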
@Michael Gugino
Failed to reproduce the issue on an old v3.7 build which did not include your PR 5763.

Version: openshift-ansible-3.7.0-0.145.0.git.0.b37c5e6.el7.noarch

Steps:
1. Install OCP v3.6.
2. Change master-config.yml according to comment 1 and restart the master service.
3. Prepare repos and upgrade the above cluster to v3.7.

After the upgrade, master-config.yml was checked and found to be the same as before the upgrade.
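For reference, a rough sketch of how the before/after comparison can be done (paths assume a default RPM install; the exact method QE used is not stated):

    # Save a copy before running the upgrade (default config path for RPM installs)
    cp /etc/origin/master/master-config.yaml /root/master-config.pre-upgrade.yaml
    # ... run the upgrade playbook ...
    # An empty diff means the config survived the upgrade unchanged
    diff -u /root/master-config.pre-upgrade.yaml /etc/origin/master/master-config.yaml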
PR created to resolve the master config not being upgraded during upgrades: https://github.com/openshift/openshift-ansible/pull/5875
PR Merged.
The upgrade log from v3.5 to v3.6 is in the attachment. I checked it and found that master_config_upgrade.yml was skipped, just the same as on the earlier v3.7. openshift-ansible-3.6.173.0.5-3.git.0.522a92a.el7.noarch.rpm is the first available release build of v3.6. So if the version is right, some other steps must be needed to reproduce this issue.
@Justin Pierce
Before verifying the bug, QE needs to reproduce it first. Could you help confirm the steps in comment 22? According to your description, no special steps are needed, but QE cannot reproduce it on either the earlier v3.7 build or v3.6, which is the same version as yours.
(In reply to liujia from comment #27)
> @Justin Pierce
>
> Before verifying the bug, QE needs to reproduce it first. Could you help
> confirm the steps in comment 22? According to your description, no special
> steps are needed, but QE cannot reproduce it on either the earlier v3.7
> build or v3.6, which is the same version as yours.

Okay, I have this sorted out.

The originally reported issue was caused by running the upgrade_control_plane.yml playbook, not the upgrade.yml playbook. That issue was fixed with this PR: https://github.com/openshift/openshift-ansible/pull/5763

However, when attempting to verify the issue, QE used upgrade.yml (to upgrade both the control plane and the nodes). That playbook neglected to apply the appropriate configuration settings to the master config, due to an unreported bug which appears to have existed for quite some time. This second, unreported bug (not running the master config upgrade) was fixed in this PR: https://github.com/openshift/openshift-ansible/pull/5875

@liujia To verify the original bug as reported in this BZ, we should run upgrade_control_plane.yml instead of upgrade.yml. That should be all that is necessary for this BZ to be verified.
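For anyone following along, the two invocations differ only in which playbook is run; roughly (playbook paths vary by openshift-ansible release, so check the installed tree):

    # Control-plane-only upgrade -- the path that triggered the original report
    ansible-playbook -i /path/to/inventory \
        playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_control_plane.yml

    # Full upgrade of control plane and nodes -- the path QE used initially
    ansible-playbook -i /path/to/inventory \
        playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml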
@Michael Gugino
Thanks for the important info about running upgrade_control_plane.yml instead of upgrade.yml. Unfortunately, I hit another issue: https://bugzilla.redhat.com/show_bug.cgi?id=1508290. I will come back to this issue after that one is fixed.
Verified on openshift-ansible-3.7.4-1.git.0.254e849.el7.noarch. Checked that PR 5763 and PR 5875 have merged. Upgrading the masters first succeeded with no changes to the admission plugin config; upgrading the nodes afterwards also succeeded with no changes to the admission plugin config.
Test case added; removing the testcaseneeded keyword.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188