Bug 1858400
| Field | Value |
| --- | --- |
| Summary | [Performance] Lease refresh period for machine-api-controllers is too high, causes heavy writes to etcd at idle |
| Product | OpenShift Container Platform |
| Component | Cloud Compute |
| Sub component | Other Providers |
| Reporter | Clayton Coleman <ccoleman> |
| Assignee | Danil Grigorev <dgrigore> |
| QA Contact | sunzhaohua <zhsun> |
| CC | dgoodwin, kewang, mimccune |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Version | 4.5 |
| Target Release | 4.6.0 |
| Type | Bug |
| Doc Type | If docs needed, set a value |
| Cloned to | 1858403 (view as bug list) |
| Last Closed | 2020-10-27 16:15:58 UTC |
Description
Clayton Coleman
2020-07-17 20:02:32 UTC
Please see https://bugzilla.redhat.com/show_bug.cgi?id=1858403. In investigating how to solve this for the cloud credential operator, I found that this is more complicated than it looks, and this issue is possibly not fixed for machine-api (unless I've made a mistake in my testing).

Checked the audit log; it seems there is still a gap between the machine-api components and the machine-config controller. Tested on 4.6.0-0.nightly-2020-08-02-091622:

```
$ grep -ir "system:serviceaccount:openshift-machine-config-operator:machine-config-controller" | wc -l
177
$ grep -ir "cluster-api-provider-healthcheck-leader" | wc -l
996
$ grep -ir "cluster-api-provider-aws-leader" | wc -l
994
$ grep -ir "cluster-api-provider-nodelink-leader" | wc -l
994
```

Clayton has added a comment at https://bugzilla.redhat.com/show_bug.cgi?id=1858403#c5 on how to fix this properly. I will be pursuing this for the cloud credential operator this week as well.

After some consideration, settling on 120/110/90s for each provider.

Danil: I suspect this may still not be what you want; controller-runtime does not presently expose the correct way to do this, where the lease is released when the leader process stops. As implemented in the PRs here, you have likely added a 90s startup delay, which will be irritating in development and I believe will also impact installation times. The correct method Clayton set us onto can be seen in https://github.com/openshift/cloud-credential-operator/pull/231.

Makes sense; I agree with you. But we don't mind experiencing this issue, as we are currently working through the same problem with our MAO deployment. The 120/110/90s values were agreed upon in a Slack discussion and are acceptable for us. I like the implementation, and I'm going to transfer it to controller-runtime later, but you raise a good point. This is just for the sake of closing this bug, hoping to avoid possible friction in implementing this upstream.
Are you confident this does not push the default installation out 90+ seconds, perhaps during the transition from bootstrap to the real control plane? For reference, we tried what you're using here, and Clayton's response is at https://bugzilla.redhat.com/show_bug.cgi?id=1858403#c5.

@Devan, the Machine controllers are only started after the pivot from bootstrap to the real control plane, as far as I'm aware. None of the Machine API components are used in bootstrapping the control plane machines, so we won't be adding any extra delay to installation. Since we haven't had time to fully explore releaseOnCancel and the effects it may have on the system, we were discussing as a team merging these PRs as-is for now, and then creating a new BZ to introduce the releaseOnCancel behaviour once a new release of controller-runtime is cut (the option was merged upstream overnight). Do you think that would be an acceptable approach here?

Just wanted to drop an update here: we need to add the extended-duration patches to the baremetal, ovirt, and openstack controllers. I am working to propose these changes today. Here are the last patches, which should complete this sequence:

- https://github.com/openshift/cluster-api-provider-baremetal/pull/100
- https://github.com/openshift/cluster-api-provider-ovirt/pull/66
- https://github.com/openshift/cluster-api-provider-openstack/pull/114

I am resetting this BZ to POST and updating the pull requests.

Verified on GCP; checked audit logs on all 3 masters.
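As a side note on the deferred releaseOnCancel follow-up: once a controller-runtime release exposes the option, the wiring from openshift/cloud-credential-operator#231 would look roughly like the configuration sketch below (assuming controller-runtime's `manager.Options` field names; the lock name and namespace are illustrative, not taken from the actual PRs):

```go
leaseDuration := 120 * time.Second
renewDeadline := 110 * time.Second
retryPeriod := 90 * time.Second

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	LeaderElection:          true,
	LeaderElectionID:        "cluster-api-provider-aws-leader", // illustrative lock name
	LeaderElectionNamespace: "openshift-machine-api",           // illustrative namespace
	LeaseDuration:           &leaseDuration,
	RenewDeadline:           &renewDeadline,
	RetryPeriod:             &retryPeriod,
	// Release the lock when the manager's context is cancelled, so a
	// replacement leader does not have to wait out the full 120s lease.
	LeaderElectionReleaseOnCancel: true,
})
```

The long lease keeps idle etcd writes low, while releasing on cancel avoids the matching startup delay during rollouts and installs.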
Cluster version: 4.6.0-0.nightly-2020-08-31-194600 (GCP). The counts below are from `grep -ir "<pattern>" | wc -l` run in `/var/log/kube-apiserver` on each master.

| Pattern | master 1 | master 2 | master 3 |
| --- | --- | --- | --- |
| system:serviceaccount:openshift-machine-config-operator:machine-config-controller | 881 | 94 | 461 |
| cluster-api-provider-healthcheck-leader | 102 | 14 | 0 |
| cluster-api-provider-gcp-leader | 454 | 55 | 22 |
| cluster-api-provider-nodelink-leader | 88 | 6 | 22 |

Verified on Azure:

| Pattern | master 1 | master 2 | master 3 |
| --- | --- | --- | --- |
| system:serviceaccount:openshift-machine-config-operator:machine-config-controller | 601 | 319 | 341 |
| cluster-api-provider-healthcheck-leader | 18 | 66 | 24 |
| cluster-api-provider-azure-leader | 330 | 32 | 30 |
| cluster-api-provider-nodelink-leader | 44 | 50 | 18 |

Verified on AWS, cluster version 4.6.0-0.nightly-2020-09-01-205915:

| Pattern | master 1 | master 2 | master 3 |
| --- | --- | --- | --- |
| system:serviceaccount:openshift-machine-config-operator:machine-config-controller | 488 | 364 | 0 |
| cluster-api-provider-healthcheck-leader | 20 | 40 | 12 |
| cluster-api-provider-aws-leader | 68 | 171 | 56 |
| cluster-api-provider-nodelink-leader | 2 | 60 | 8 |

Verified on OSP, cluster version 4.6.0-0.nightly-2020-09-05-015624:

| Pattern | master 1 | master 2 | master 3 |
| --- | --- | --- | --- |
| system:serviceaccount:openshift-machine-config-operator:machine-config-controller | 918 | 0 | 2123 |
| cluster-api-provider-healthcheck-leader | 33 | 9 | 327 |
| cluster-api-provider-openstack-leader | 113 | 1342 | 54 |
| cluster-api-provider-nodelink-leader | 35 | 238 | 86 |

Verified on vSphere, cluster version 4.6.0-0.nightly-2020-09-05-015624:

| Pattern | master 1 | master 2 | master 3 |
| --- | --- | --- | --- |
| system:serviceaccount:openshift-machine-config-operator:machine-config-controller | 126 | 2067 | 424 |
| cluster-api-provider-healthcheck-leader | 76 | 84 | 176 |
| cluster-api-provider-vsphere-leader | 83 | 80 | 174 |
| cluster-api-provider-nodelink-leader | 85 | 84 | 177 |

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196