Bug 1585511
| Field | Value |
|---|---|
| Summary | Upgrade failure: service-catalog-controller-manager has been modified - upgrading scalability env |
| Product | OpenShift Container Platform |
| Component | Master |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Version | 3.10.0 |
| Target Release | 3.10.0 |
| Target Milestone | --- |
| Hardware | x86_64 |
| OS | Linux |
| Reporter | Mike Fiedler <mifiedle> |
| Assignee | Mo <mkhan> |
| QA Contact | Vikas Laad <vlaad> |
| CC | aos-bugs, jliggitt, jokerman, mifiedle, mkhan, mmccomas, vlaad, wmeng |
| Keywords | TestBlocker |
| Type | Bug |
| Last Closed | 2018-07-30 19:16:54 UTC |
| Bug Depends On | 1588707 |
Description (Mike Fiedler, 2018-06-03 21:57:20 UTC)
The control plane upgrade made it as far as running etcd as static pods, but the master controller/api processes were not yet containerized. They were upgraded to 3.10:

```
NAME                                          STATUS   ROLES    AGE   VERSION
ip-172-31-38-223.us-west-2.compute.internal   Ready    master   10h   v1.10.0+b81c8f8
ip-172-31-47-68.us-west-2.compute.internal    Ready    master   10h   v1.10.0+b81c8f8
ip-172-31-59-186.us-west-2.compute.internal   Ready    master   2d    v1.10.0+b81c8f8
```

comment 1, bullet 1 should be "inventory used during upgrade"

Master logs seem to point to slow response times for these resources. There may be high contention on these items due to lease changes, which may point to a different issue in the cluster. migrate storage will retry conflict errors up to ten times, so I do not think the issue has anything to do with the command itself. If the environment is readily able to reproduce this issue, please run and attach the logs of:

```
oc adm migrate storage --confirm --include=configmaps --all-namespaces=false --loglevel=10 -n kube-service-catalog
oc adm migrate storage --confirm --include=configmaps --all-namespaces=false --loglevel=10 -n kube-system
```

Assigning to the master team so they can review the logs.

Re-attempting with an io1 provisioned-IOPS disk for etcd. I did see some etcd cluster unhealthy/leader re-elections, so it might have been slowness.

Created attachment 1447604 [details] system log, ansible log, oc adm storage output

Reproduced on a clean cluster. The cluster only had 3 master/2 infra/5 computes/1 load balancer. I used the same cluster load as described in the original description, but did not have the 200 nodes this time. Figured since it was failing in the control plane, they might not be relevant.

- 3.9.27 cluster in AWS
- 2 master: m4.2xlarge
- 2 infra, 5 computes, 1 lb, all m4.xlarge

Masters had io1 storage provisioned for 1500 IOPS. Loaded the cluster as in the description and then started upgrade_control_plane.
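As a rough illustration of the retry-on-conflict behavior mentioned above (migrate storage retries conflict errors up to ten times), here is a toy optimistic-concurrency loop in Python. The `Store`, the racing writer, and the attempt counts are simplified stand-ins for illustration only, not the actual Go implementation inside `oc adm migrate storage`:

```python
class ConflictError(Exception):
    """PUT rejected: resourceVersion no longer matches the stored object."""

class Store:
    """Toy object store with optimistic concurrency via a resourceVersion."""
    def __init__(self):
        self.resource_version = 1

    def get(self):
        return self.resource_version

    def put(self, resource_version):
        if resource_version != self.resource_version:
            raise ConflictError("the object has been modified; "
                                "please apply your changes to the latest version")
        self.resource_version += 1  # every successful write bumps the version

def migrate_one(store, racing_writer, max_attempts=10):
    """GET then PUT, retrying on conflict up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        rv = store.get()
        racing_writer(store)   # another client may write between GET and PUT
        try:
            store.put(rv)
            return attempt     # number of attempts the write took
        except ConflictError:
            continue
    raise ConflictError("conflict retries exhausted")

# Simulate a writer that races with the first three migration attempts.
conflict_plan = iter([True, True, True])
def racing_writer(store):
    if next(conflict_plan, False):
        store.put(store.get())  # concurrent update bumps the resourceVersion

attempts = migrate_one(Store(), racing_writer)
print(f"migration succeeded after {attempts} attempt(s)")  # 4
```

With ten attempts, the retry loop only fails if a concurrent writer lands inside the GET-to-PUT window on every single attempt, which is why the comment above suspects the failure lies outside the command itself.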
It failed with:

```
E0604 18:04:05.869994 error: -n kube-service-catalog configmaps/service-catalog-controller-manager: Operation cannot be fulfilled on configmaps "service-catalog-controller-manager": the object has been modified; please apply your changes to the latest version and try again
```

Captured the full system log from all 3 masters, ansible -vvv, and the output of the oc adm storage commands mentioned in comment 4. I also captured pbench data and will link the master cpu/memory/network data in a private comment.

Marking this as TestBlocker. It blocks the SVT test case to upgrade a 200 node cluster with a loaded control plane.

The migrate logs have no errors (and need to be run at a higher loglevel). I assume the migrate error is a race, probably due to the cluster being under load from a full storage migration. Can we get logs from those commands when the cluster is at load? An easy way to generate load:

```
oc adm migrate storage --confirm --include='*'
```

Concurrently with the above, capture the logs as mentioned in comment 4.

Created attachment 1447664 [details] oc adm migrate output

Sorry, forgot to redirect stderr last time. Captured new logs.

1. oc adm migrate storage --confirm --include='*'
2. Run the commands in comment 4.

Repeated 3 times each, but I don't think that made a difference.

> Is it always failing on service-catalog-controller-manager? No other errors?
3 times out of 3. Not a huge sample. I can re-try today - let me know if any additional instrumentation would be useful or if there is anything you'd like to see run outside of the openshift-ansible upgrade. I'll get a cluster in the "before" condition now.
The attached logs show success and no errors. Can we get a sense of the write rate for those configmaps? How many updates do you see on the two problematic configmaps per minute doing this:

```
oc get configmap/kube-scheduler -n kube-system -w
oc get configmap/service-catalog-controller-manager -n kube-system -w
```

How long does a single request take on average:

```
time oc get configmap/kube-scheduler -n kube-system
time oc get configmap/kube-scheduler -n kube-system
time oc get configmap/kube-scheduler -n kube-system
time oc get configmap/kube-scheduler -n kube-system
time oc get configmap/kube-scheduler -n kube-system
```

Updates in the watches for both configmaps in comment 14 pop every 2 seconds. Response time for oc get configmap seems reasonable (seven consecutive runs; the identical NAME/DATA/AGE output is shown once):

```
root@ip-172-31-42-161: ~ # time oc get configmap/kube-scheduler -n kube-system
NAME             DATA   AGE
kube-scheduler   0      1d
real 0m0.139s   user 0m0.155s   sys 0m0.041s
real 0m0.136s   user 0m0.148s   sys 0m0.043s
real 0m0.135s   user 0m0.147s   sys 0m0.040s
real 0m0.135s   user 0m0.155s   sys 0m0.031s
real 0m0.141s   user 0m0.160s   sys 0m0.035s
real 0m0.141s   user 0m0.154s   sys 0m0.042s
real 0m0.134s   user 0m0.155s   sys 0m0.031s
```

No idea if this will help or not: https://github.com/openshift/origin/pull/19916/files?w=1

Mo, that fix looks right.
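As a quick sanity check on "seems reasonable": averaging the seven `real` wall-clock times reported above gives roughly 0.137 s per request.

```python
# `real` wall-clock times (seconds) from the seven oc get runs above
real_times = [0.139, 0.136, 0.135, 0.135, 0.141, 0.141, 0.134]

avg = sum(real_times) / len(real_times)
print(f"average real time: {avg:.3f}s over {len(real_times)} runs")
# average real time: 0.137s over 7 runs
```

An average request latency well under the 2-second update cadence of the configmaps supports the conclusion that plain GETs are not the bottleneck.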
Delaying between the GET and PUT makes us vulnerable to conflict errors on regularly updated resources. That would also explain why the issue was only seen when running the full migration, when the rate limiter would be kicking in.

Commit pushed to master at https://github.com/openshift/origin
https://github.com/openshift/origin/commit/c20900a287ad5c2b52bfe82d65dd9f695ed00b88

> Make API storage migration less sensitive to conflicts
>
> This change moves the rate limiting logic in migrate storage to after both the GET and PUT have completed (instead of being in between the two operations). This should make the command less sensitive to conflicts caused by resources that change often.
>
> Bug 1585511
> Signed-off-by: Monis Khan <mkhan>

Verified on 3.10.0-0.67.0 independently by mffiedler and vlaad on an HA cluster with co-located and separate master/etcd installs.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816
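The reasoning behind the fix can be made concrete with a back-of-the-envelope model: if a configmap is rewritten every ~2 seconds (as observed in the watches) and the rate limiter sleeps between the GET and the PUT, the chance that an external update lands inside that window scales with the window's length. A minimal sketch; the 1 s and 10 ms window sizes are illustrative assumptions, not measured values from the actual rate limiter:

```python
def conflict_probability(window_s, update_period_s):
    # Chance that an external update (arriving every update_period_s
    # seconds with a random phase) lands inside the window_s-second gap
    # between the migrate command's GET and its PUT.
    return min(1.0, window_s / update_period_s)

PERIOD = 2.0  # the configmaps were observed updating every ~2 s

# Rate limiter sleeping ~1 s between GET and PUT (assumed, pre-fix):
pre_fix = conflict_probability(1.0, PERIOD)
print(pre_fix)    # 0.5: conflicts on about half the writes

# Rate limiter moved to after the PUT, leaving only a ~10 ms
# request round-trip between GET and PUT (assumed, post-fix):
post_fix = conflict_probability(0.010, PERIOD)
print(post_fix)   # 0.005
```

Under these assumptions, moving the sleep out of the GET-to-PUT window drops the per-write conflict probability by roughly two orders of magnitude, which is consistent with the failure only appearing during a full (rate-limited) migration.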