Bug 1585511

Summary: Upgrade failure: service-catalog-controller-manager has been modified - upgrading scalability env
Product: OpenShift Container Platform
Component: Master
Version: 3.10.0
Target Release: 3.10.0
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Keywords: TestBlocker
Reporter: Mike Fiedler <mifiedle>
Assignee: Mo <mkhan>
QA Contact: Vikas Laad <vlaad>
CC: aos-bugs, jliggitt, jokerman, mifiedle, mkhan, mmccomas, vlaad, wmeng
Type: Bug
Last Closed: 2018-07-30 19:16:54 UTC
Bug Depends On: 1588707

Attachments:
- system log, ansible log, oc adm storage output
- oc adm migrate output

Description Mike Fiedler 2018-06-03 21:57:20 UTC
Description of problem:

Upgrading 3.9.30 to 3.10.0-0.56.0 in a scale cluster fails with:

"stdout": "E0603 19:49:35.301890 error:     -n kube-service-catalog configmaps/service-catalog-controller-manager: Operation cannot be fulfilled on configmaps \"service-catalog-controller-manager\": the object has been modified; please apply your changes to the latest version and try again\nE0603 19:49:35.343325 error:     -n kube-system configmaps/kube-scheduler: Operation cannot be fulfilled on configmaps \"kube-
scheduler\": the object has been modified; please apply your changes to the latest version and try again\nsummary: total=169741 errors=2 ignored=0 unchanged=116443 migrated=53296\ninfo: to rerun only failing resources, add --include=configmaps\nerror: 2 resources failed to migrate"

The cluster is on AWS with 

- 3 x master m4.2xlarge
- 2 x infra m4.xlarge
- 198 x compute t2.medium

The cluster is running fine at the 3.9 level: no pods in a state other than Running, and all nodes Ready. The cluster is loaded with 2000 projects containing the following resources:

- 10000 build configs
- 20000 completed builds
- 10000 deployments
- 10000 services
- 40000 secrets
- 10000 routes

The deployment inventory used for the upgrade (redacted) will be linked in a follow-up comment.

Version-Release number of the following components:
root@ip-172-31-47-73: ~/svt/openshift_scalability # rpm -q openshift-ansible
openshift-ansible-3.10.0-0.56.0.git.0.b921fb9.el7.noarch
root@ip-172-31-47-73: ~/svt/openshift_scalability # rpm -q ansible
ansible-2.4.4.0-1.el7ae.noarch
root@ip-172-31-47-73: ~/svt/openshift_scalability # ansible --version
ansible 2.4.4.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May  4 2018, 09:38:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-34)]

How reproducible: unknown - difficult repro environment

Steps to Reproduce:
1. Install cluster on AWS (specs above)
2. Load the cluster with resources using the SVT cluster-loader (https://github.com/openshift/svt/tree/master/openshift_scalability) with the following spec:

projects:
  - num: 2000
    basename: svt-b-
    templates:
      -
        num: 5
        file: ./content/build-config-template.json
      -
        num: 10
        file: ./content/build-template.json
      - 
        num: 5   
        file: ./content/deployment-config-0rep-pause-template.json
        parameters:
          -
            ENV_VALUE: "asodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij12"
      -
        num: 20
        file: ./content/ssh-secret-template.json
      -
        num: 5
        file: ./content/route-template.json
      # rcs and services are implemented in deployments.
quotas:
  - name: default

3.  Verify the cluster is ok - check pod status, create a new project, build a new application, etc
4.  Upgrade the cluster to 3.10 following the documentation
5.  Run the upgrade_control_plane playbook

Actual results:
ansible -vvv output attached.  The upgrade fails while migrating the service catalog configmaps.

Expected results:

Successful upgrade.

Comment 2 Mike Fiedler 2018-06-03 22:04:05 UTC
The control plane upgrade made it as far as running etcd as static pods, but the master controller/API processes were not yet containerized.  The masters were upgraded to 3.10:

NAME                                          STATUS    ROLES     AGE       VERSION
ip-172-31-38-223.us-west-2.compute.internal   Ready     master    10h       v1.10.0+b81c8f8
ip-172-31-47-68.us-west-2.compute.internal    Ready     master    10h       v1.10.0+b81c8f8
ip-172-31-59-186.us-west-2.compute.internal   Ready     master    2d        v1.10.0+b81c8f8

Comment 3 Mike Fiedler 2018-06-04 01:40:47 UTC
comment 1, bullet 1 should be "inventory used during upgrade"

Comment 4 Mo 2018-06-04 14:21:42 UTC
Master logs seem to point to slow response times for these resources.  There may be high contention on these items due to lease changes, which may point to a different issue in the cluster.  migrate storage retries conflict errors up to ten times, so I do not think the issue has anything to do with the command itself.  If the environment can readily reproduce this issue, please run and attach the logs of:

oc adm migrate storage --confirm --include=configmaps --all-namespaces=false --loglevel=10 -n kube-service-catalog

oc adm migrate storage --confirm --include=configmaps --all-namespaces=false --loglevel=10 -n kube-system

Assigning to master team so they can review the logs.
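
For reference, the conflict in the error is the standard Kubernetes optimistic-concurrency failure: migrate storage reads each object and writes it back carrying the resourceVersion from the read, and the API server rejects the write with 409 Conflict if anything else updated the object in between. A minimal client-go sketch of that read/write-and-retry pattern (this is not the migrate code itself; it assumes a recent client-go, and the kubeconfig path is a placeholder):

package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/retry"
)

func main() {
	// Placeholder kubeconfig path.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	cms := kubernetes.NewForConfigOrDie(cfg).CoreV1().ConfigMaps("kube-service-catalog")

	// RetryOnConflict re-runs the read/write when the update is rejected with
	// 409 Conflict, roughly what migrate storage does up to its retry limit.
	err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
		cm, err := cms.Get(context.TODO(), "service-catalog-controller-manager", metav1.GetOptions{})
		if err != nil {
			return err
		}
		// The update carries the resourceVersion from the Get; if another writer
		// touched the object in the meantime, the API server returns a conflict.
		_, err = cms.Update(context.TODO(), cm, metav1.UpdateOptions{})
		return err
	})
	if apierrors.IsConflict(err) {
		fmt.Println("still conflicting after retries:", err)
	} else if err != nil {
		panic(err)
	}
}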

Comment 5 Mike Fiedler 2018-06-04 17:10:19 UTC
Re-attempting with an io1 provisioned-IOPS disk for etcd.  I did see some etcd cluster-unhealthy events and leader re-elections, so slowness might have been the cause.

Comment 6 Mike Fiedler 2018-06-04 18:39:39 UTC
Created attachment 1447604 [details]
system log, ansible log, oc adm storage output

Reproduced on a clean cluster.  The cluster had only 3 masters / 2 infra / 5 computes / 1 load balancer.  I used the same cluster load as described in the original description, but without the 200 nodes this time; since the failure was in the control plane, they seemed unlikely to be relevant.

- 3.9.27 cluster in AWS
     - 3 masters: m4.2xlarge
     - 2 infra, 5 computes, 1 load balancer, all m4.xlarge

Masters had io1 storage provisioned for 1500 IOPS

Loaded the cluster as in the description and then started upgrade_control_plane.  It failed with:

"E0604 18:04:05.869994 error:     -n kube-service-catalog configmaps/service-catalog-controller-manager: Operation cannot be fulfilled on configmaps \"service-catalog-controller-manager\": the object has been modified; please apply your changes to the latest version and try again",

Captured the full system log from all 3 masters, ansible -vvv and the output of the oc adm storage commands mentioned in comment 4.

I also captured pbench data and will link the master cpu/memory/network data in a private comment.

Comment 8 Mike Fiedler 2018-06-04 19:39:57 UTC
Marking this as a TestBlocker.  It blocks the SVT test case that upgrades a 200-node cluster with a loaded control plane.

Comment 9 Mo 2018-06-04 22:20:12 UTC
The migrate logs show no errors (and the commands need to be re-run at a higher loglevel).  I assume the migrate error is a race, probably due to the cluster being under load from a full storage migration.  Can we get logs from those commands while the cluster is under load?

Easy way to generate load:

oc adm migrate storage --confirm --include='*'

Concurrently with the above, capture the logs as mentioned in comment 4.

Comment 10 Mike Fiedler 2018-06-05 01:32:12 UTC
Created attachment 1447664 [details]
oc adm migrate output

Sorry, forgot to redirect stderr last time.   Captured new logs.

1. oc adm migrate storage --confirm --include='*'
2. Run the commands in comment 4.  Repeated 3 times each, but I don't think that made a difference.

Comment 11 Michal Fojtik 2018-06-05 08:39:08 UTC
Is it always failing on service-catalog-controller-manager? No other errors?

Comment 13 Mike Fiedler 2018-06-05 11:28:27 UTC
> Is it always failing on service-catalog-controller-manager? No other errors?

3 times out of 3, though not a huge sample.  I can retry today; let me know if any additional instrumentation would be useful, or if there is anything you'd like to see run outside of the openshift-ansible upgrade.  I'll get a cluster into the "before" condition now.

Comment 14 Jordan Liggitt 2018-06-05 14:40:40 UTC
The attached logs show success and no errors.

Can we get a sense of the write rate for those configmaps?

How many updates per minute do you see on the two problematic configmaps when running:

oc get configmap/kube-scheduler -n kube-system -w
oc get configmap/service-catalog-controller-manager -n kube-service-catalog -w

How long does a single request take on average:

time oc get configmap/kube-scheduler -n kube-system
time oc get configmap/kube-scheduler -n kube-system
time oc get configmap/kube-scheduler -n kube-system
time oc get configmap/kube-scheduler -n kube-system
time oc get configmap/kube-scheduler -n kube-system
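
If eyeballing the watch output is tedious, a small client-go program along these lines counts the update events per minute (a sketch only, assuming a recent client-go; the kubeconfig path is a placeholder):

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch only the kube-scheduler configmap; swap the namespace/name to
	// watch service-catalog-controller-manager in kube-service-catalog.
	w, err := client.CoreV1().ConfigMaps("kube-system").Watch(context.TODO(), metav1.ListOptions{
		FieldSelector: "metadata.name=kube-scheduler",
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	// Count MODIFIED events and report once per minute.
	count := 0
	ticker := time.NewTicker(time.Minute)
	for {
		select {
		case ev := <-w.ResultChan():
			if ev.Type == watch.Modified {
				count++
			}
		case <-ticker.C:
			fmt.Printf("%d updates in the last minute\n", count)
			count = 0
		}
	}
}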

Comment 15 Mike Fiedler 2018-06-05 15:07:17 UTC
Updates in the watches for both configmaps in comment 14 pop every 2 seconds.

Response time for oc get configmap seems reasonable.

root@ip-172-31-42-161: ~ # time oc get configmap/kube-scheduler -n kube-system
NAME             DATA      AGE
kube-scheduler   0         1d

real    0m0.139s
user    0m0.155s
sys     0m0.041s
root@ip-172-31-42-161: ~ # time oc get configmap/kube-scheduler -n kube-system
NAME             DATA      AGE
kube-scheduler   0         1d

real    0m0.136s
user    0m0.148s
sys     0m0.043s
root@ip-172-31-42-161: ~ # time oc get configmap/kube-scheduler -n kube-system
NAME             DATA      AGE
kube-scheduler   0         1d

real    0m0.135s
user    0m0.147s
sys     0m0.040s
root@ip-172-31-42-161: ~ # time oc get configmap/kube-scheduler -n kube-system
NAME             DATA      AGE
kube-scheduler   0         1d

real    0m0.135s
user    0m0.155s
sys     0m0.031s
root@ip-172-31-42-161: ~ # time oc get configmap/kube-scheduler -n kube-system
NAME             DATA      AGE
kube-scheduler   0         1d

real    0m0.141s
user    0m0.160s
sys     0m0.035s
root@ip-172-31-42-161: ~ # time oc get configmap/kube-scheduler -n kube-system
NAME             DATA      AGE
kube-scheduler   0         1d

real    0m0.141s
user    0m0.154s
sys     0m0.042s
root@ip-172-31-42-161: ~ # time oc get configmap/kube-scheduler -n kube-system
NAME             DATA      AGE
kube-scheduler   0         1d

real    0m0.134s
user    0m0.155s
sys     0m0.031s

Comment 16 Mo 2018-06-05 17:35:26 UTC
No idea if this will help or not: https://github.com/openshift/origin/pull/19916/files?w=1

Comment 17 Jordan Liggitt 2018-06-06 13:49:27 UTC
Mo, that fix looks right.

Delaying between the GET and PUT makes us vulnerable to conflict errors on regularly updated resources.

That would also explain why the issue was only seen when running the full migration, when the rate limiter would be kicking in.

Comment 18 openshift-github-bot 2018-06-06 15:58:30 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/c20900a287ad5c2b52bfe82d65dd9f695ed00b88
Make API storage migration less sensitive to conflicts

This change moves the rate limiting logic in migrate storage to
after both the GET and PUT have completed (instead of being in
between the two operations).  This should make the command less
sensitive to conflicts caused by resources that change often.

Bug 1585511

Signed-off-by: Monis Khan <mkhan>
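
In outline, the reordering described above looks roughly like the following sketch (schematic only, with illustrative names such as getObject, putObject, and rateLimiter; this is not the actual origin code):

package migrate

// rateLimiter stands in for the client-side rate limiter used by migrate
// storage; only the blocking Accept call matters for this sketch.
type rateLimiter interface {
	Accept()
}

type object struct{ key string }

// Placeholder GET and PUT; in the real command these are API round trips.
func getObject(o object) (object, error) { return o, nil }
func putObject(o object) error           { return nil }

// Before the fix: throttling between the GET and the PUT leaves a window in
// which a frequently written object (such as a leader-election configmap)
// can change, so the PUT fails with a conflict.
func migrateOldOrder(items []object, limiter rateLimiter) error {
	for _, item := range items {
		obj, err := getObject(item)
		if err != nil {
			return err
		}
		limiter.Accept() // waiting here lets the resourceVersion from the GET go stale
		if err := putObject(obj); err != nil {
			return err
		}
	}
	return nil
}

// After the fix: the GET and PUT run back to back and the throttling happens
// once both have completed, so the write uses a fresh resourceVersion.
func migrateNewOrder(items []object, limiter rateLimiter) error {
	for _, item := range items {
		obj, err := getObject(item)
		if err != nil {
			return err
		}
		if err := putObject(obj); err != nil {
			return err
		}
		limiter.Accept() // throttle after the write instead of before it
	}
	return nil
}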

Comment 20 Mike Fiedler 2018-06-14 01:54:55 UTC
Verified on 3.10.0-0.67.0 independently by mffiedler and vlaad on HA clusters with co-located and separate master/etcd installs.

Comment 22 errata-xmlrpc 2018-07-30 19:16:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816