Description of problem:
When running the upgrade_control_plane.yml playbook, all oc commands take a very long time to complete during the upgrade. On inspection, the commands were waiting on the servicecatalog.k8s.io/v1beta1 aggregated API (the service catalog). The log shows the playbook restarting the masters one at a time, so the load balancer for the master API should still be distributing requests across the remaining masters.

Version-Release number of the following components:
rpm -q openshift-ansible
openshift-ansible-3.9.27-1.git.0.52e35b5.el7.noarch
rpm -q ansible
ansible-2.5.0-2.el7.noarch
ansible --version
ansible 2.5.0

How reproducible:

Steps to Reproduce:
1.
2.
3.
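One way to confirm that the service-catalog aggregated API is the endpoint being waited on (a minimal check sketch; the apiservice name and flags are assumed to match what a 3.9 oc client accepts):

# oc get apiservice v1beta1.servicecatalog.k8s.io -o yaml
# oc get --raw /apis/servicecatalog.k8s.io/v1beta1 --request-timeout=35

The first command shows the Available condition on the registered apiservice; the second hits the aggregated endpoint directly, which is the kind of call that hangs or times out while the masters are being restarted.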
BTW, why are you using ansible 2.5? Please follow the documentation, which instructs you to enable the ansible 2.4 repo and install ansible from there; we've done zero testing with 3.9 and ansible 2.5.
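For reference, switching a RHEL 7 host back to the documented ansible stream would look roughly like this (the repo IDs below are assumptions based on the standard Ansible channel naming; use whatever IDs the documentation lists):

# subscription-manager repos --disable=rhel-7-server-ansible-2.5-rpms
# subscription-manager repos --enable=rhel-7-server-ansible-2.4-rpms
# yum downgrade ansible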
Hi- What version are you upgrading from? Was service-catalog installed before the upgrade? If so, was service-catalog working correctly before the upgrade? Thanks!
What version are you upgrading from? Which versions exactly?
openshift v3.9.14
kubernetes v1.9.1+a0ce1bc657

Was service-catalog installed before the upgrade?
Yes

If so, was service-catalog working correctly before the upgrade?
We don't really use the service catalog at the moment, so there are no real-life examples to show that it worked, but all pods in the service-catalog projects were running and "ready".
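For reference, "running and ready" above is based on checking the pod status in the catalog namespace, something along the lines of (namespace name as used in the later comments):

# oc get pods -n kube-service-catalog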
I installed 3.9.14 and upgraded to 3.9.27 without any issues. I'd like to see the inventory file for the upgrade along with catalog logs from before/after the upgrade:

# kubectl logs -n kube-service-catalog -lapp=controller-manager
# kubectl logs -n kube-service-catalog -lapp=apiserver

I don't see anything in the attached ansible log that points to any issue as of yet.
I did an upgrade from 3.9.14 to 3.9.33 with multiple masters and could not reproduce any appreciable delays in oc get requests, except during the following expected portions:

- When the first master API was restarted (about 5 seconds)
- During the SDN upgrade (about 45 seconds)

In order to verify that having the catalog run on multiple masters wasn't an issue (Service Catalog wasn't set to run on all the masters until 3.9.21), I also installed 3.9.30 and did an upgrade to 3.9.33. I never saw any non-catalog operations become completely unresponsive during the upgrade. I was monitoring this output during the entire upgrade:

#!/bin/bash
while true; do
  now=$(date)
  echo "$now: checking pods"
  oc get po --all-namespaces --loglevel=9
  now=$(date)
  echo "==="
  echo "$now: checking serviceinstances"
  oc get serviceinstances --loglevel=9
  now=$(date)
  echo "$now: $(oc version)"
  echo ""
  sleep 5
done

I'm told that the customer did not experience any problems in the latest upgrade (3.9.33), so I'm waiting to hear back on whether the next upgrade also works correctly.
I've reproduced this in the following environment: upgrading an AWS multi-master cluster (3 masters) from 3.10.76-1 to 3.11.44-1. During the storage upgrade, most oc commands hang for 10-15 minutes. Prior to the upgrade I loaded about 125 namespaces with data via svt/openshift_scalability/cluster-loader.py.

I see https://bugzilla.redhat.com/show_bug.cgi?id=1599428 reported this same basic issue and was marked fixed, but that fix focused on clarifying the "CancelRequest not implemented" message; it doesn't appear to me that cluster availability was addressed.

Excerpts from the installer logs:

2018-11-14 16:49:02,018 p=20116 u=root | TASK [openshift_control_plane : Wait for APIs to become available]
2018-11-14 16:49:05,719 p=20116 u=root | TASK [openshift_control_plane : Get API logs]
2018-11-14 16:49:05,803 p=20116 u=root | TASK [openshift_control_plane : debug]
2018-11-14 16:49:05,888 p=20116 u=root | TASK [openshift_control_plane : fail]
2018-11-14 16:49:05,975 p=20116 u=root | TASK [openshift_control_plane : Check for apiservices/v1beta1.metrics.k8s.io registration]
2018-11-14 16:49:06,478 p=20116 u=root | TASK [openshift_control_plane : Wait for /apis/metrics.k8s.io/v1beta1 when registered]
2018-11-14 16:49:06,563 p=20116 u=root | TASK [openshift_control_plane : Check for apiservices/v1beta1.servicecatalog.k8s.io registration]
2018-11-14 16:49:07,013 p=20116 u=root | TASK [openshift_control_plane : Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered]
2018-11-14 16:49:07,530 p=20116 u=root | TASK [Upgrade all storage]

"E1114 16:50:58.764325 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 16:51:30.769271 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 16:52:02.772536 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 16:52:34.775768 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 16:53:06.779309 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 16:53:38.782635 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 16:54:42.789704 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 16:55:14.792854 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 16:55:46.796156 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 16:56:18.800079 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 16:56:50.803123 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 16:57:22.806648 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 16:58:26.814095 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 16:58:58.817702 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 16:59:30.820948 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 17:00:02.824252 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 17:00:34.827742 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 17:01:06.937605 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 17:01:38.941411 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 17:02:10.944769 79576 round_trippers.go:169] CancelRequest not implemented",
"E1114 17:03:13.247153 84510 round_trippers.go:169] CancelRequest not implemented\nerror: could not calculate the list of available resources: Get https://ip-172-18-6-50.ec2.internal:443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)",
"E1114 17:03:13.247153 84510 round_trippers.go:169] CancelRequest not implemented",

2018-11-14 17:06:01,011 p=20116 u=root | TASK [Migrate legacy HPA scale target refs]
2018-11-14 17:06:01,475 p=20116 u=root | TASK [openshift_master_facts : Verify required variables are set]
2018-11-14 17:06:01,691 p=20116 u=root | TASK [openshift_master_facts : Set g_metrics_hostname]

The storage upgrade ran from 16:49:07 to 17:06. It is interesting to note that there is very limited Service Catalog data to be migrated: I had the Ansible Service Broker and Template Service Broker installed, but no Service Instances were present.

I've run through reproducing this numerous times, including without loading data first, so I don't believe the SVT data loading is necessary. Also of interest, the storage upgrade always seems to take about the same amount of time regardless of whether I load no data, approximately 125 namespaces, 500, or 1,000.

I had a script running in another terminal executing oc commands from a jump box (external to the cluster) with --request-timeout=35, which showed these same timeouts (CancelRequest not implemented) consistently during this same timeframe. The oc commands included `oc get clusterservicebrokers`, `oc get hostsubnets`, `oc get pods`, `oc get endpoints`, etc. The actual command doesn't seem to matter; most failed during this period with the same "CancelRequest not implemented".

I captured events in the kube-service-catalog namespace and I don't believe anything terribly abnormal is occurring with Service Catalog. I have a variety of logs for review, including:

* openshift-ansible log (run with -vvv)
* scripts and their output for executing oc commands with --loglevel=9
* output from oc get events -n kube-service-catalog
* the API, MASTER and NODE logs collected from 2 of the masters

I don't think Service Catalog is the problem here. It could be an issue with the aggregated API, or maybe it's something else.
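A rough sketch of the kind of jump-box loop described above (the actual script isn't attached here, so the exact command list, ordering, sleep interval, and --loglevel are assumptions; only --request-timeout=35 and the resource types are taken from this comment):

#!/bin/bash
# Run a handful of read-only oc commands in a cycle and timestamp each one
# so hangs/timeouts during the upgrade are easy to spot in the output.
while true; do
  for args in "get clusterservicebrokers" "get hostsubnets" "get pods --all-namespaces" "get endpoints"; do
    echo "$(date): oc $args"
    oc $args --request-timeout=35 --loglevel=9
  done
  sleep 5
done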
Waiting on https://github.com/openshift/origin/pull/21633 that backports kubernetes/kubernetes#66929 and kubernetes/kubernetes#67024 to 3.10. Will retest with these in place.
Tested upgrading oc v3.10.96 to openshift v3.11.66 and this appears to be rectified by https://github.com/openshift/origin/pull/21633. I could not reproduce the hang or timeouts.
Back to ASSIGNED; these fixes are needed in 3.9 as well.
3.9 backport: https://github.com/openshift/origin/pull/21741
Fix first seen in v3.9.67-1.
Will verify it today
Per comment 32, verified in both the v3.9.65 --> v3.9.74 upgrade (the version in the comment 33 link above) and the v3.9.68 --> v3.9.74 upgrade; neither hit the hang/timeout issue, with ASB/TSB installed and 100 projects and app data loaded before the upgrade.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0619