Description of problem:

Upgrading 3.9.30 to 3.10.0-0.56.0 in a scale cluster fails with:

E0603 19:49:35.301890 error: -n kube-service-catalog configmaps/service-catalog-controller-manager: Operation cannot be fulfilled on configmaps "service-catalog-controller-manager": the object has been modified; please apply your changes to the latest version and try again
E0603 19:49:35.343325 error: -n kube-system configmaps/kube-scheduler: Operation cannot be fulfilled on configmaps "kube-scheduler": the object has been modified; please apply your changes to the latest version and try again
summary: total=169741 errors=2 ignored=0 unchanged=116443 migrated=53296
info: to rerun only failing resources, add --include=configmaps
error: 2 resources failed to migrate

The cluster is on AWS with:
- 3 x master m4.2xlarge
- 2 x infra m4.xlarge
- 198 x compute t2.medium

The cluster is running fine at the 3.9 level: no pods in a state other than Running, and all nodes Ready. The cluster is loaded with 2000 projects containing the following resources:
- 10000 build configs
- 20000 completed builds
- 10000 deployments
- 10000 services
- 40000 secrets
- 10000 routes

The deployment inventory used for the upgrade (redacted) will be linked in a follow-up comment.

Version-Release number of the following components:

root@ip-172-31-47-73: ~/svt/openshift_scalability # rpm -q openshift-ansible
openshift-ansible-3.10.0-0.56.0.git.0.b921fb9.el7.noarch
root@ip-172-31-47-73: ~/svt/openshift_scalability # rpm -q ansible
ansible-2.4.4.0-1.el7ae.noarch
root@ip-172-31-47-73: ~/svt/openshift_scalability # ansible --version
ansible 2.4.4.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May 4 2018, 09:38:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-34)]

How reproducible:
Unknown - difficult repro environment.

Steps to Reproduce:
1. Install a cluster on AWS (specs above).
2. Load the cluster with resources using the SVT cluster-loader (https://github.com/openshift/svt/tree/master/openshift_scalability) with the following specs:

projects:
  - num: 2000
    basename: svt-b-
    templates:
      - num: 5
        file: ./content/build-config-template.json
      - num: 10
        file: ./content/build-template.json
      - num: 5
        file: ./content/deployment-config-0rep-pause-template.json
        parameters:
          - ENV_VALUE: "asodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij12"
      - num: 20
        file: ./content/ssh-secret-template.json
      - num: 5
        file: ./content/route-template.json
# rcs and services are implemented in deployments.
quotas:
  - name: default

3. Verify the cluster is OK - check pod status, create a new project, build a new application, etc.
4. Upgrade the cluster to 3.10 following the documentation.
5. Run the upgrade_control_plane playbook.

Actual results:
ansible -vvv output attached. Install fails migrating service catalog configmaps.

Expected results:
Successful install.
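For context on the failure mode: "Operation cannot be fulfilled ... the object has been modified" is the apiserver's optimistic-concurrency conflict (HTTP 409), returned when a PUT carries a resourceVersion that is no longer the latest. Below is a minimal client-go sketch of the GET-then-write-back pattern that storage migration relies on, wrapped in the standard conflict retry. This is not the actual `oc adm migrate` code; the kubeconfig path is an assumption and the signatures shown are the context-less client-go ones of the 2018 era.

package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/retry"
)

func main() {
	// Assumption: a kubeconfig at this path on a master; any valid path works.
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// RetryOnConflict re-runs the read+write pair whenever the apiserver
	// answers 409 Conflict ("the object has been modified").
	err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
		cm, err := client.CoreV1().ConfigMaps("kube-system").Get("kube-scheduler", metav1.GetOptions{})
		if err != nil {
			return err
		}
		// An unmodified write-back, which is what storage migration does:
		// the PUT succeeds only if cm.ResourceVersion is still current in etcd.
		_, err = client.CoreV1().ConfigMaps("kube-system").Update(cm)
		return err
	})
	fmt.Println("write-back result:", err)
}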
The control plane upgrade made it as far as running etcd as static pods, but the master controller/api processes were not yet containerized. They were upgraded to 3.10:

NAME                                          STATUS    ROLES     AGE       VERSION
ip-172-31-38-223.us-west-2.compute.internal   Ready     master    10h       v1.10.0+b81c8f8
ip-172-31-47-68.us-west-2.compute.internal    Ready     master    10h       v1.10.0+b81c8f8
ip-172-31-59-186.us-west-2.compute.internal   Ready     master    2d        v1.10.0+b81c8f8
Comment 1, bullet 1 should be "inventory used during upgrade".
Master logs seem to point to slow response times for these resources. There may be high contention on these items due to lease changes, which may point to a different issue in the cluster. migrate storage will retry conflict errors up to ten times, so I do not think the issue has anything to do with the command itself.

If the environment is readily able to reproduce this issue, please run and attach the logs of:

oc adm migrate storage --confirm --include=configmaps --all-namespaces=false --loglevel=10 -n kube-service-catalog
oc adm migrate storage --confirm --include=configmaps --all-namespaces=false --loglevel=10 -n kube-system

Assigning to master team so they can review the logs.
Re-attempting with an io1 provisioned-IOPS disk for etcd. I did see some etcd cluster-unhealthy messages and leader re-elections, so the failure might have been etcd slowness.
Created attachment 1447604 [details]
system log, ansible log, oc adm storage output

Reproduced on a clean cluster. The cluster only had 3 masters/2 infra/5 computes/1 load balancer. I used the same cluster load as described in the original description, but did not have the 200 nodes this time; since it was failing in the control plane, I figured they might not be relevant.

- 3.9.27 cluster in AWS
- 3 masters: m4.2xlarge
- 2 infra, 5 computes, 1 lb, all m4.xlarge

Masters had io1 storage provisioned for 1500 IOPS.

Loaded the cluster as in the description and then started upgrade_control_plane. It failed with:

E0604 18:04:05.869994 error: -n kube-service-catalog configmaps/service-catalog-controller-manager: Operation cannot be fulfilled on configmaps "service-catalog-controller-manager": the object has been modified; please apply your changes to the latest version and try again

Captured the full system log from all 3 masters, ansible -vvv output, and the output of the oc adm storage commands mentioned in comment 4. I also captured pbench data and will link the master cpu/memory/network data in a private comment.
Marking this as TestBlocker. It blocks the SVT test case to upgrade a 200-node cluster with a loaded control plane.
The migrate logs have no errors (and need to be run at a higher loglevel). I assume the migrate error is a race, probably due to the cluster being under load from a full storage migration. Can we get logs from those commands when the cluster is at load?

Easy way to generate load:

oc adm migrate storage --confirm --include='*'

Concurrently with the above, capture the logs as mentioned in comment 4.
Created attachment 1447664 [details]
oc adm migrate output

Sorry, forgot to redirect stderr last time. Captured new logs:

1. oc adm migrate storage --confirm --include='*'
2. Run the commands in comment 4.

Repeated 3 times each, but I don't think that made a difference.
Is it always failing on service-catalog-controller-manager? No other errors?
> Is it always failing on service-catalog-controller-manager? No other errors? 3 times out of 3. Not a huge sample. I can re-try today - let me know if any additional instrumentation would be useful or if there is anything you'd like to see run outside of the openshift-ansible upgrade. I'll get a cluster in the "before" condition now.
The attached logs show success and no errors. Can we get a sense of the write rate for those configmaps? How many updates do you see on the two problematic configmaps per minute doing this:

oc get configmap/kube-scheduler -n kube-system -w
oc get configmap/service-catalog-controller-manager -n kube-service-catalog -w

How long does a single request take on average:

time oc get configmap/kube-scheduler -n kube-system
time oc get configmap/kube-scheduler -n kube-system
time oc get configmap/kube-scheduler -n kube-system
time oc get configmap/kube-scheduler -n kube-system
time oc get configmap/kube-scheduler -n kube-system
Updates in the watches for both configmaps in comment 14 pop every 2 seconds. Response time for oc get configmap seems reasonable:

root@ip-172-31-42-161: ~ # time oc get configmap/kube-scheduler -n kube-system
NAME             DATA      AGE
kube-scheduler   0         1d

real    0m0.139s
user    0m0.155s
sys     0m0.041s

root@ip-172-31-42-161: ~ # time oc get configmap/kube-scheduler -n kube-system
NAME             DATA      AGE
kube-scheduler   0         1d

real    0m0.136s
user    0m0.148s
sys     0m0.043s

root@ip-172-31-42-161: ~ # time oc get configmap/kube-scheduler -n kube-system
NAME             DATA      AGE
kube-scheduler   0         1d

real    0m0.135s
user    0m0.147s
sys     0m0.040s

root@ip-172-31-42-161: ~ # time oc get configmap/kube-scheduler -n kube-system
NAME             DATA      AGE
kube-scheduler   0         1d

real    0m0.135s
user    0m0.155s
sys     0m0.031s

root@ip-172-31-42-161: ~ # time oc get configmap/kube-scheduler -n kube-system
NAME             DATA      AGE
kube-scheduler   0         1d

real    0m0.141s
user    0m0.160s
sys     0m0.035s

root@ip-172-31-42-161: ~ # time oc get configmap/kube-scheduler -n kube-system
NAME             DATA      AGE
kube-scheduler   0         1d

real    0m0.141s
user    0m0.154s
sys     0m0.042s

root@ip-172-31-42-161: ~ # time oc get configmap/kube-scheduler -n kube-system
NAME             DATA      AGE
kube-scheduler   0         1d

real    0m0.134s
user    0m0.155s
sys     0m0.031s
No idea if this will help or not: https://github.com/openshift/origin/pull/19916/files?w=1
Mo, that fix looks right. Delaying between the GET and PUT makes us vulnerable to conflict errors on regularly updated resources. That would also explain why the issue was only seen when running the full migration, when the rate limiter would be kicking in.
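To make the ordering concrete, here is a toy Go simulation of the race — no Kubernetes involved, all names invented. A writer bumps the object's resourceVersion every 2 seconds (like the lease renewals observed in the watches above). Putting a multi-second rate-limit pause between the GET and the PUT almost guarantees the PUT straddles a renewal and conflicts, while paying the same pause after the pair does not; the command's overall request rate is unchanged, only the position of the sleep moves.

package main

import (
	"fmt"
	"sync"
	"time"
)

// store mimics etcd's compare-and-swap on resourceVersion.
type store struct {
	mu sync.Mutex
	rv int
}

func (s *store) get() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.rv
}

// put succeeds only if the caller still holds the latest resourceVersion.
func (s *store) put(rv int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if rv != s.rv {
		return fmt.Errorf("conflict: the object has been modified")
	}
	s.rv++
	return nil
}

func main() {
	s := &store{}

	// A controller (think service-catalog-controller-manager renewing its
	// leader-election lease) bumps the object every 2 seconds.
	go func() {
		for range time.Tick(2 * time.Second) {
			s.mu.Lock()
			s.rv++
			s.mu.Unlock()
		}
	}()

	migrate := func(between, after time.Duration) error {
		rv := s.get()
		time.Sleep(between) // old ordering: rate limiter sat between GET and PUT
		err := s.put(rv)
		time.Sleep(after) // fixed ordering: rate limiter runs after the pair
		return err
	}

	// Old ordering: a 3s pause between read and write straddles a 2s
	// lease renewal, so the PUT conflicts.
	fmt.Println("limit between GET and PUT:", migrate(3*time.Second, 0))

	// New ordering: GET and PUT run back to back, so the 2s writer
	// rarely slips in between them.
	fmt.Println("limit after GET+PUT:      ", migrate(0, 3*time.Second))
}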
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/c20900a287ad5c2b52bfe82d65dd9f695ed00b88

Make API storage migration less sensitive to conflicts

This change moves the rate limiting logic in migrate storage to after
both the GET and PUT have completed (instead of being in between the
two operations). This should make the command less sensitive to
conflicts caused by resources that change often.

Bug 1585511

Signed-off-by: Monis Khan <mkhan>
Verified on 3.10.0-0.67.0 independently by mffiedler and vlaad on HA clusters with both co-located and separate master/etcd installs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816