Description of problem:

Upgrading 3.9.27 to 3.10.0-0.67.0 using openshift-ansible 3.10.0-0.67.0 results in a storage upgrade failure.

1. Installed an HA cluster at 3.9.27 (1 lb, 3 master/etcd colocated, 2 infra, 5 compute) on AWS - inventory attached. Loaded the cluster up with:
   - 10000 build configs
   - 20000 completed builds
   - 10000 deployments
   - 10000 services
   - 40000 secrets
   - 10000 routes
2. Verified the cluster is working properly - nodes are ready, new applications can be deployed from templates, pods are running OK.
3. Ran openshift_node_group.yml to create the node config groups.
4. Ran upgrade_control_plane.yml to upgrade the control plane. The storage upgrade fails:

"stderr_lines": [
    "Flag --confirm has been deprecated, storage migration does not support dry run, this flag is ignored",
    "ERROR: logging before flag.Parse: E0613 19:57:45.903432 70009 round_trippers.go:169] CancelRequest not implemented",
    [... the same CancelRequest error repeats every ~32 seconds through E0613 20:05:45.954826 ...]
    "error: the server doesn't have a resource type \"clusterservicebrokers\""
],
"stdout": "",
"stdout_lines": []

Version-Release number of selected component (if applicable):
Upgrade 3.9.27 to 3.10.0-0.67.0 using openshift-ansible 3.10.0-0.67.0

How reproducible:
Once so far in one attempt

Steps to Reproduce:
1. See above
Created attachment 1451077 [details]
inventory, ansible log, syslogs from master

1. openshift-ansible inventory used for upgrade_control_plane
2. ansible -vvv log
3. journals from the 3 masters
Service broker and catalog pods:

# oc get clusterservicebroker
NAME                      AGE
ansible-service-broker    8h
template-service-broker   8h

# oc get pods --all-namespaces | grep -E "service"
kube-service-catalog                apiserver-424rn            1/1   Running   1   8h
kube-service-catalog                apiserver-wg44n            1/1   Running   1   8h
kube-service-catalog                apiserver-zlghf            1/1   Running   1   8h
kube-service-catalog                controller-manager-mjxpf   1/1   Running   4   8h
kube-service-catalog                controller-manager-wbnwv   1/1   Running   4   8h
kube-service-catalog                controller-manager-zd9cb   1/1   Running   3   8h
openshift-ansible-service-broker    asb-1-97dfq                1/1   Running   2   8h
openshift-ansible-service-broker    asb-etcd-1-lhgpz           1/1   Running   0   8h
openshift-template-service-broker   apiserver-9d2ww            1/1   Running   0   8h
openshift-template-service-broker   apiserver-bd4v6            1/1   Running   0   8h
Re-started the upgrade_control_plane.yml playbook after the failure and it completed successfully. Will retry on a fresh cluster and see if the original failure reoccurs.
The storage upgrade is now retried twice with a delay of 30 secs; see PR https://github.com/openshift/openshift-ansible/pull/8661
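For readers unfamiliar with how that kind of change looks, this is standard Ansible retry semantics on the migration task. A rough sketch only, not the literal task from the PR (the command line matches the failure output elsewhere in this bug):

```yaml
# Sketch: retried twice with a 30-second delay, per the comment above.
- name: Upgrade all storage
  command: >
    oc adm --config=/etc/origin/master/admin.kubeconfig
    migrate storage --include=* --confirm
  register: l_migrate_result
  until: l_migrate_result.rc == 0
  retries: 2
  delay: 30
```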
That change would be in openshift-ansible-3.10.0-0.67.0 already. I'm not sure what else we can do. One common mitigation against storage migration problems is to run the migration prior to performing the upgrade and then skip it during the upgrade:

oc adm migrate storage --include=*

Then set openshift_upgrade_pre_storage_migration_enabled=false in your inventory. Moving this to master in case there's anything for them to debug here.
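Concretely, the two-step mitigation above looks like the following. The [OSEv3:vars] group name is the usual openshift-ansible inventory convention; adjust to your inventory layout:

```
# Step 1: on a master, run the migration manually before starting the upgrade:
#   oc adm migrate storage --include=*
#
# Step 2: then skip the pre-upgrade migration in the Ansible inventory:
[OSEv3:vars]
openshift_upgrade_pre_storage_migration_enabled=false
```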
Re: comment 7. When I reproduced the issue it failed with the same message about clusterservicebrokers. Coincidence? Or something about migrating this part of storage?

fatal: [ec2-35-164-67-17.us-west-2.compute.amazonaws.com]: FAILED! => {
    "changed": true,
    "cmd": ["oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"],
    "delta": "0:08:32.583116",
    "end": "2018-06-14 17:28:25.422116",
    "failed_when_result": true,
    "invocation": {
        "module_args": {
            "_raw_params": "oc adm --config=/etc/origin/master/admin.kubeconfig migrate storage --include=* --confirm",
            "_uses_shell": false, "chdir": null, "creates": null, "executable": null,
            "removes": null, "stdin": null, "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2018-06-14 17:19:52.839000",
    "stderr_lines": [
        "Flag --confirm has been deprecated, storage migration does not support dry run, this flag is ignored",
        "ERROR: logging before flag.Parse: E0614 17:20:25.369073 66519 round_trippers.go:169] CancelRequest not implemented",
        [... the same CancelRequest error repeats every ~32 seconds through E0614 17:28:25.418992 ...]
        "error: the server doesn't have a resource type \"clusterservicebrokers\""
    ],
    "stdout": "",
    "stdout_lines": []
}
I saw a similar problem on a different cluster; this cluster has the CRI-O runtime and a separate etcd install.

fatal: [ec2-54-212-222-67.us-west-2.compute.amazonaws.com]: FAILED! => {
    "changed": true,
    "cmd": ["oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"],
    "delta": "0:00:00.131388",
    "end": "2018-06-14 18:56:47.807243",
    "failed_when_result": true,
    "invocation": {
        "module_args": {
            "_raw_params": "oc adm --config=/etc/origin/master/admin.kubeconfig migrate storage --include=* --confirm",
            "_uses_shell": false, "chdir": null, "creates": null, "executable": null,
            "removes": null, "stdin": null, "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2018-06-14 18:56:47.675855",
    "stderr_lines": [
        "Flag --confirm has been deprecated, storage migration does not support dry run, this flag is ignored",
        "error: could not calculate the list of available resources: the server could not find the requested resource"
    ],
    "stdout": "",
    "stdout_lines": []
}
Marking this as TestBlocker. Unable to upgrade this configuration without hitting failed storage migrations.
I've dropped the --confirm from that migration invocation since it only complains about stuff we're not using any more (based on https://github.com/openshift/origin/pull/19691), see https://github.com/openshift/openshift-ansible/pull/8781.
The problem with the missing clusterservicebrokers resource is related to the service catalog API server having some issues. I'm moving this to the service catalog team to figure out what's going on with their server. I think it'll be helpful for them to get the logs from the API server for investigation.
Any chance we can get the logs from the Service Catalog API Server and Controller Pods?
From 19:39:34 through 19:55:09 all 3 masters are reporting errors communicating with the Catalog API server:

available_controller.go:295] v1beta1.servicecatalog.k8s.io failed with: Get https://172.26.131.21:443: dial tcp 172.26.131.21:443: getsockopt: no route to host

These errors only list one API server IP - I would expect there are 3 running (1 on each master). I'm not certain if this is the only SC API pod or if the others are working properly at this time. I'd like to see all of this immediately after the failure:

oc describe endpoints -n kube-service-catalog apiserver
oc describe pods -l app=apiserver -n kube-service-catalog
oc describe apiservices v1beta1.servicecatalog.k8s.io -n kube-service-catalog

along with logs from all pods in the kube-service-catalog ns.
Comment 2 shows there are in fact 3 SC API server pods along with 3 SC controller pods, and at the time all were available; the SC API servers had only one restart each.
That cluster was terminated (AWS). I am spinning up an identical one now - it is in the process of loading. When/if the failure is reproduced I will gather the requested info in comment 14 and comment 13.
One more log might help: kube-proxy

Thanks very much Mike. If it gets into that state and you are done with it, I'd like the chance to jump on it and poke around, although it kind of sounds like it may be timing specific as well.
From:
# yum list installed | grep openshift
atomic-openshift.x86_64    3.9.27-1.git.0.964617d.el7

To:
# yum list installed | grep openshift
atomic-openshift.x86_64    3.10.0-0.69.0.git.0.248ae38.el7

CRI-O as runtime. Hit this once, and rerunning the upgrade playbook worked around it. Afterwards, I hit this: https://bugzilla.redhat.com/show_bug.cgi?id=1591752

This one does NOT have the "logging before flag.Parse" part in the error msg:

TASK [Upgrade all storage] **************************************************************************************************
fatal: [ec2-52-34-224-227.us-west-2.compute.amazonaws.com]: FAILED! => {
    "changed": true,
    "cmd": ["oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"],
    "delta": "0:08:28.225847",
    "end": "2018-06-15 16:38:03.922025",
    "failed_when_result": true,
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2018-06-15 16:29:35.696178",
    "stderr_lines": [
        "Flag --confirm has been deprecated, storage migration does not support dry run, this flag is ignored",
        "E0615 16:30:08.449939 129872 round_trippers.go:169] CancelRequest not implemented",
        [... the same CancelRequest error repeats every ~32 seconds through E0615 16:37:36.502617 ...]
        "error: the server doesn't have a resource type \"clusterservicebrokers\""
    ],
    "stdout": "",
    "stdout_lines": []
}
        to retry, use: --limit @/root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.retry

PLAY RECAP ******************************************************************************************************************
ec2-18-236-173-89.us-west-2.compute.amazonaws.com  : ok=41   changed=6    unreachable=0   failed=0
ec2-52-11-68-23.us-west-2.compute.amazonaws.com    : ok=241  changed=74   unreachable=0   failed=0
ec2-52-34-224-227.us-west-2.compute.amazonaws.com  : ok=418  changed=128  unreachable=0   failed=1
ec2-54-245-184-161.us-west-2.compute.amazonaws.com : ok=241  changed=74   unreachable=0   failed=0
localhost                                          : ok=22   changed=0    unreachable=0   failed=0

Failure summary:

1. Hosts:   ec2-52-34-224-227.us-west-2.compute.amazonaws.com
   Play:    Pre master upgrade - Upgrade all storage
   Task:    Upgrade all storage
   Message: non-zero return code
*** Bug 1591758 has been marked as a duplicate of this bug. ***
great, thank you, I'll take a look.
And of course my attempt to recreate (comment 16) did not reproduce it. The best we have right now is comment 20. I will attempt to reproduce it over the weekend.
Thanks very much Mike. I'm still reviewing the deployment from comment 20, but I'm not finding much, as the Service Catalog resources were restarted as part of the 2nd upgrade attempt.
Trying to reproduce to gather additional data and troubleshoot.
I tried 3 times on different platforms to upgrade an HA env with service-catalog, and I did not reproduce this issue.
I think you have to have data loaded - I've never tried to do it on an empty cluster, but am hitting it on the majority of upgrades on loaded clusters (2000 projects loaded as per the description).
Our upgrade tests don't load anything prior to performing the upgrade. We're doing the following and still seeing this intermittently:

1. Install v3.9
2. Upgrade from v3.9 to v3.10
3. Run our CI tests to ensure our broker is functional after upgrade.
I haven't been able to reproduce this yet. I have a set of scripts to collect info every 5 seconds (including some oc commands to get hostsubnets, pods, service and endpoint status, iptables-save, and a curl against the catalog service) that I'd like to start prior to the upgrade. Plus after discussing with Dan Williams I understand that we need to collect the logs from the sdn pods in the openshift-sdn namespace after the failure is observed.
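A minimal sketch of the kind of periodic collection loop described above. The specific probes, paths, and defaults here are illustrative, not the actual scripts used; each probe is guarded so the loop keeps running on hosts where a tool is missing:

```shell
#!/bin/sh
# Periodic debug-data collection loop (sketch; probe commands are illustrative).
OUTDIR="${OUTDIR:-/tmp/sc-debug}"
INTERVAL="${INTERVAL:-5}"   # seconds between samples
SAMPLES="${SAMPLES:-2}"     # bounded here so the sketch terminates
mkdir -p "$OUTDIR"
i=0
while [ "$i" -lt "$SAMPLES" ]; do
  ts="$(date +%Y%m%d-%H%M%S)-$i"
  {
    # Guarded probes: skip quietly if the tool is not on this host.
    command -v oc >/dev/null 2>&1 && oc get hostsubnets
    command -v oc >/dev/null 2>&1 && oc get pods -n kube-service-catalog -o wide
    command -v oc >/dev/null 2>&1 && oc get endpoints -n kube-service-catalog
    command -v iptables-save >/dev/null 2>&1 && iptables-save
    command -v curl >/dev/null 2>&1 && \
      curl -sk -m 5 https://apiserver.kube-service-catalog.svc:443/healthz
    true  # keep a sample file even if every probe was unavailable or failed
  } > "$OUTDIR/sample-$ts.log" 2>&1
  i=$((i + 1))
  [ "$i" -lt "$SAMPLES" ] && sleep "$INTERVAL"
done
echo "collected $i samples in $OUTDIR"
```

On a live master this would run unbounded (or with a large SAMPLES) in the background while the upgrade playbook executes, leaving one timestamped file per sample to correlate against the Ansible failure time.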
mffiedler got a reproduction for me while I was monitoring the process.

oc commands executed on the master node hung from 16:39:42 to 16:46:03 (perhaps fine).

Some scripted curl commands against the catalog apiserver service https://apiserver.kube-service-catalog.svc:443 had errors prior to 16:33, but nothing after that - always got a quick response.

However, the master API logs show:

16:40:40 master-api[36110]: memcache.go:153] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server was unable to return a response in the time allotted, but may still be processing the request
16:41:50 repeat of above
16:43:00 repeat of above
16:44:10 repeat of above
16:45:20 repeat of above
16:45:58 master-api[36110]: memcache.go:153] couldn't get resource list for servicecatalog.k8s.io/v1beta1: an error on the server ("Error: 'read tcp 172.21.0.1:45285->172.27.237.234:443: read: no route to host'\nTrying to reach: 'https://172.27.237.234:443/apis/servicecatalog.k8s.io/v1beta1'") has prevented the request from succeeding

The Ansible task failed here at 16:44:39:

2018-06-20 16:36:05,803 p=70646 u=root | PLAY [Pre master upgrade - Upgrade all storage] ...
2018-06-20 16:36:06,450 p=70646 u=root | TASK [Upgrade all storage]
2018-06-20 16:36:06,450 p=70646 u=root | task path: /usr/share/ansible/openshift-ansible/playbooks/openshift-master/private/upgrade.yml:63
2018-06-20 16:36:06,495 p=70646 u=root | Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
2018-06-20 16:44:39,323 p=70646 u=root | FAILED - RETRYING: Upgrade all storage (2 retries left). Result was: {
    "attempts": 1,
    "changed": true,
    "cmd": ["oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"],
    "delta": "0:08:32.557575",
    "end": "2018-06-20 16:44:39.292596",
    "failed_when_result": true,
    "invocation": {
        "module_args": {
            "_raw_params": "oc adm --config=/etc/origin/master/admin.kubeconfig migrate storage --include=* --confirm",
            "_uses_shell": false, "chdir": null, "creates": null, "executable": null,
            "removes": null, "stdin": null, "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "retries": 3,
    "start": "2018-06-20 16:36:06.735021",
    "stderr": "Flag --confirm has been deprecated, storage migration does not support dry run, this flag is ignored E0620 16:36:39.238143 117200 round_trippers.go:169] CancelRequest not implemented ...repeats same msg... error: the server doesn't have a resource type \"clusterservicebrokers\"",
    .......
A very similar error was seen last Friday:

Jun 15 20:21:43 ip-172-31-33-38.us-west-2.compute.internal atomic-openshift-master-api[101277]: E0615 20:21:43.612129 101277 available_controller.go:295] v1beta1.servicecatalog.k8s.io failed with: Get https://172.24.53.127:443: dial tcp 172.24.53.127:443: getsockopt: no route to host
20:22:06 master-api[101277]: E0615 20:22:06.927984 101277 available_controller.go:295] v1beta1.servicecatalog.k8s.io failed with: Get https://172.24.53.127:443: dial tcp 172.24.53.127:443: getsockopt: no route to host
20:31:45 master-api[101277]: E0615 20:31:45.028654 101277 memcache.go:153] couldn't get resource list for servicecatalog.k8s.io/v1beta1: an error on the server ("Error: 'read tcp 172.20.0.1:43300->172.24.53.127:443: read: no route to host'\nTrying to reach: 'https://172.24.53.127:443/apis/servicecatalog.k8s.io/v1beta1'") has prevented the request from succeeding
20:31:45 master-controllers[101881]: E0615 20:31:45.035700 101881 resource_quota_controller.go:428] failed to discover resources: unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1beta1: an error on the server ("Error: 'read tcp 172.20.0.1:43300->172.24.53.127:443: read: no route to host'\nTrying to reach: 'https://172.24.53.127:443/apis/servicecatalog.k8s.io/v1beta1'") has prevented the request from succeeding
20:31:45 master-controllers[101881]: W0615 20:31:45.036080 101881 garbagecollector.go:590] failed to discover some groups: %vmap[servicecatalog.k8s.io/v1beta1:an error on the server ("Error: 'read tcp 172.20.0.1:43300->172.24.53.127:443: read: no route to host'\nTrying to reach: 'https://172.24.53.127:443/apis/servicecatalog.k8s.io/v1beta1'") has prevented the request from succeeding]

with the ansible log showing the failure at 20:30:42: error: the server doesn't have a resource type "clusterservicebrokers"
@wmeng, if you have a chance, can you try the failure case from comment 24 using openshift-ansible 3.10.1 or higher? I have attempted to reproduce with that level a couple of times and have not seen the error. 3.10.1 has additional retries for storage migration.
openshift-ansible-3.10.1-1.git.157.2bb6250.el7.noarch.rpm was used in comment 34; however, the instances were deleted due to resource limitations.
Created attachment 1453415 [details]
ansible-playbook -vvv log for upgrade failure under 3.10.1

I hit the same upgrade failure as wmeng in comment 41 in 2 out of 3 upgrades with 3.10.1 - full ansible log attached.

@sdodson @jaboyd Open a new bugzilla or work it under this one, since it is still apparently in service catalog data migration?

could not calculate the list of available resources: Get https://ec2-35-161-126-109.us-west-2.compute.amazonaws.com:8443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)

===== Full task output - ansible log attached

FAILED! => {
    "attempts": 2,
    "changed": true,
    "cmd": ["oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"],
    "delta": "0:00:32.181916",
    "end": "2018-06-21 01:36:35.489250",
    "failed_when_result": true,
    "invocation": {
        "module_args": {
            "_raw_params": "oc adm --config=/etc/origin/master/admin.kubeconfig migrate storage --include=* --confirm",
            "_uses_shell": false, "chdir": null, "creates": null, "executable": null,
            "removes": null, "stdin": null, "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2018-06-21 01:36:03.307334",
    "stderr_lines": [
        "Flag --confirm has been deprecated, storage migration does not support dry run, this flag is ignored",
        "error: could not calculate the list of available resources: Get https://ec2-35-161-126-109.us-west-2.compute.amazonaws.com:8443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    ],
    "stdout": "",
    "stdout_lines": []
}
Created attachment 1453513 [details] logs that go with comment 37
Looking for some help from the network team regarding analysis of the logs from comment 37. Dan, can you help review? I've attached SDN logs. The time period of the issue is 16:41 or so.

SDN logs look OK to me; grepping across all nodes for the catalog apiserver service, I only get these entries around this time period:

16:33:43.848601  122992 roundrobin.go:276] LoadBalancerRR: Setting endpoints for kube-service-catalog/apiserver:secure to [172.20.0.4:6443 172.21.0.5:6443 172.23.0.6:6443]
16:33:43.901614  122992 service.go:310] Adding new service port "kube-service-catalog/apiserver:secure" at 172.27.237.234:443/TCP
16:34:01.260158  120871 roundrobin.go:310] LoadBalancerRR: Setting endpoints for kube-service-catalog/apiserver:secure to [172.20.0.6:6443 172.21.0.5:6443 172.23.0.6:6443]
16:34:01.260203  120871 roundrobin.go:240] Delete endpoint 172.20.0.6:6443 for service "kube-service-catalog/apiserver:secure"
16:34:01.260219  120871 roundrobin.go:240] Delete endpoint 172.20.0.4:6443 for service "kube-service-catalog/apiserver:secure"
16:34:01.261035   56906 roundrobin.go:310] LoadBalancerRR: Setting endpoints for kube-service-catalog/apiserver:secure to [172.20.0.6:6443 172.21.0.5:6443 172.23.0.6:6443]
16:34:01.261099   56906 roundrobin.go:240] Delete endpoint 172.20.0.4:6443 for service "kube-service-catalog/apiserver:secure"
16:34:01.261116   56906 roundrobin.go:240] Delete endpoint 172.20.0.6:6443 for service "kube-service-catalog/apiserver:secure"
16:34:01.262471  122992 roundrobin.go:310] LoadBalancerRR: Setting endpoints for kube-service-catalog/apiserver:secure to [172.20.0.6:6443 172.21.0.5:6443 172.23.0.6:6443]
16:34:01.262514  122992 roundrobin.go:240] Delete endpoint 172.20.0.6:6443 for service "kube-service-catalog/apiserver:secure"
16:34:01.262530  122992 roundrobin.go:240] Delete endpoint 172.20.0.4:6443 for service "kube-service-catalog/apiserver:secure"

Yet the master API is complaining about connectivity:

16:40:40 master-api[36110]: memcache.go:153] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server was unable to return a response in the time allotted, but may still be processing the request
16:41:50 repeat of above
16:43:00 repeat of above
16:44:10 repeat of above
16:45:20 repeat of above
16:45:58 master-api[36110]: memcache.go:153] couldn't get resource list for servicecatalog.k8s.io/v1beta1: an error on the server ("Error: 'read tcp 172.21.0.1:45285->172.27.237.234:443: read: no route to host'\nTrying to reach: 'https://172.27.237.234:443/apis/servicecatalog.k8s.io/v1beta1'")
Created attachment 1453646 [details] log on wmeng39syscon-master-etcd-1.0621-e03.qe.rhcloud.com
Reviewing the master log from comment 45: the storage upgrade failed with a more general error, with no mention of clusterservicebrokers or catalog in any of the Ansible installer output. The command was:

oc adm --config=/etc/origin/master/admin.kubeconfig migrate storage --include=* --confirm

"msg": "non-zero return code", "rc": 1, "retries": 3, "start": "2018-06-21 05:39:57.052415",
"stderr": "error: could not calculate the list of available resources: the server was unable to return a response in the time allotted, but may still be processing the request"

It looks like this command was tried 3 times and each attempt failed with the same issue:

"start": "2018-06-21 05:38:26.276022", "end": "2018-06-21 05:39:26.623248"
"start": "2018-06-21 05:39:57.052415", "end": "2018-06-21 05:40:57.314836"
"start": "2018-06-21 05:41:27.734848", "end": "2018-06-21 05:42:27

Reviewing the master logs attached via comment 45, I do see some of the same errors about connectivity to the Catalog apiserver services:

failed to discover resources: unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1beta1: the server was unable to return a response in the time allotted, but may still be processing the request

and "no route to host" issues, all around 05:29 - 05:42. It's unclear to me if these resulted in the errors seen in the Ansible logs. I noted in comment 37 that oc commands I was executing on the master during the storage upgrade hung for over 7 minutes. SDodson said this wasn't unexpected, but I do wonder if perhaps there should be more retries on this command and, even better, some backoff injected after a failure, waiting for the system to settle down.
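The retry-plus-backoff idea floated above can be sketched generically. The real task would wrap `oc adm migrate storage`; the `flaky` stand-in command below is hypothetical, used only so the wrapper can be exercised without a cluster:

```shell
#!/bin/sh
# Generic retry wrapper with exponential backoff between attempts (sketch only).
retry_with_backoff() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  while :; do
    if "$@"; then
      return 0
    fi
    [ "$n" -ge "$attempts" ] && return 1
    echo "attempt $n failed; sleeping ${delay}s before retry" >&2
    sleep "$delay"
    delay=$((delay * 2))  # double the wait after each failure
    n=$((n + 1))
  done
}

# Hypothetical stand-in for the migration command: fails twice, then succeeds.
STATE=/tmp/backoff-demo-count
rm -f "$STATE"
flaky() {
  count=$(cat "$STATE" 2>/dev/null || echo 0)
  count=$((count + 1))
  echo "$count" > "$STATE"
  [ "$count" -ge 3 ]
}

retry_with_backoff 5 1 flaky && echo "succeeded after $(cat "$STATE") attempts"
```

Compared with the fixed 30-second delay the installer uses, the growing gaps give a briefly-unavailable aggregated API server more time to settle before the final attempt.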
With openshift-ansible-3.10.8-1.git.230.830efc0.el7.noarch I am repeatedly getting:

TASK [Upgrade all storage] *******************************************************************************************************************************************************************
FAILED - RETRYING: Upgrade all storage (2 retries left).
FAILED - RETRYING: Upgrade all storage (1 retries left).
fatal: [192.168.121.107.nip.io]: FAILED! => {"attempts": 2, "changed": true, "cmd": ["oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"], "delta": "0:00:32.271083", "end": "2018-06-26 17:43:34.909451", "failed_when_result": true, "msg": "non-zero return code", "rc": 1, "start": "2018-06-26 17:43:02.638368", "stderr": "Flag --confirm has been deprecated, storage migration does not support dry run, this flag is ignored\nE0626 17:43:34.901012 12864 round_trippers.go:169] CancelRequest not implemented\nerror: could not calculate the list of available resources: Get https://192.168.121.107.nip.io:8443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "stderr_lines": [same messages split per line], "stdout": "", "stdout_lines": []}
root@ip-172-31-7-80: ~ # curl -v -k https://api-server.kube-service-catalog.svc:443/apis/servicecatalog.k8s.io/v1beta1
* Could not resolve host: api-server.kube-service-catalog.svc; Unknown error
* Closing connection 0
curl: (6) Could not resolve host: api-server.kube-service-catalog.svc; Unknown error
root@ip-172-31-7-80: ~ # curl -v -k https://172.xx.xx.xx:443/apis/servicecatalog.k8s.io/v1beta1
* Could not resolve host: 172.xx.xx.xx; Unknown error
* Closing connection 0
curl: (6) Could not resolve host: 172.xx.xx.xx; Unknown error
Jordan, re comment 52: I had a shell script that ran every 5 seconds doing

oc get hostsubnets -o wide
oc get pods --all-namespaces -o wide
oc get services -n kube-service-catalog -o wide
oc describe endpoints -n kube-service-catalog

but found that the oc commands hung for over 6 minutes even with --request-timeout="35". I was running this on one of the master hosts.

@Vikas, re comment 53: You first want to get the service IP address:

$ oc get services apiserver -n kube-service-catalog -o wide

which will show something like:

NAME        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE   SELECTOR
apiserver   ClusterIP   172.27.237.234   <none>        443/TCP   2h    app=apiserver

You then want to use that CLUSTER-IP in the 2nd curl command - not the literal 172.xx.xx.xx. So in this case it would be:

curl -v -k https://172.27.237.234:443/apis/servicecatalog.k8s.io/v1beta1

and either do this on the master node or within a container.
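The two steps above (look up the service's cluster IP, then curl it) can be combined into one helper; this is just a convenience sketch, the function name is made up, and the jsonpath lookup assumes the service object looks like the output pasted above:

```shell
# Sketch: resolve the catalog apiserver's cluster IP and probe the
# v1beta1 discovery endpoint in one step (run on a master node).
probe_catalog_api() {
  local ip
  ip=$(oc get services apiserver -n kube-service-catalog \
        -o jsonpath='{.spec.clusterIP}') || return 1
  curl -v -k "https://${ip}:443/apis/servicecatalog.k8s.io/v1beta1"
}
```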
If I understand correctly, the upgrade process is:

1. run pre-master upgrade (does storage migration prior to changing cluster state)
2. upgrade components (apiservers, nodes, sdn, router, registry, etc)
3. run post upgrade storage migration

The pre-master upgrade storage migration failed, and no other steps were performed. I see the SDN pods have already been upgraded to 3.10:

oc get pods -n openshift-sdn -o wide -o yaml | grep image:
    image: registry.reg-aws.openshift.com:443/openshift3/ose-node:v3.10
    (same line repeated for all 6 SDN pods)

the apiservers are still at 3.9:

oc get --raw /version/openshift
{
  "major": "3",
  "minor": "9+",
  "gitCommit": "964617d",
  "gitVersion": "v3.9.27",
  "buildDate": "2018-04-26T16:48:23Z"
}

I did not expect to see *any* 3.9->3.10 component changes in the cluster, given that the storage migration failed. Scott, does that mean some upgrades are happening before or in parallel with the storage migration?

For the networking team, is it expected that rolling the SDN pods would leave the cluster in a state in which the API server could not resolve service/endpoint IPs? Do we have test coverage of that specific scenario (independent of any upgrade)?
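The two version checks above can be wrapped into a small helper so they're easy to re-run at different points during an upgrade; the namespace and /version endpoint are the ones from the paste, while the helper name is made up:

```shell
# Sketch: summarize which image tags the SDN pods are running and what
# version the apiservers report, deduplicating the repeated image lines.
upgrade_state_summary() {
  echo "== SDN pod images =="
  oc get pods -n openshift-sdn -o yaml | grep 'image:' | sort | uniq -c
  echo "== apiserver version =="
  oc get --raw /version/openshift
}
```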
I need to verify, but I believe the 3.10 daemonsets are created prior to upgrading the control plane. Also, the node *on the control plane hosts* is updated prior to upgrading the control plane from 3.9 to 3.10, so before step #1 in comment 55.
I would assume that if the SDN got upgraded to 3.10 and everything else was still running 3.9 that massive breakage would be expected...
in particular, you'd be running both the 3.9 in-process SDN code and the 3.10 daemonset SDN code *at the same time*, right?
> in particular, you'd be running both the 3.9 in-process SDN code and the 3.10 daemonset SDN code *at the same time*, right? on all the non-master nodes, yes. that does seem bad. do the service catalog pods run only on master nodes?
Yes, catalog pods run on all master nodes
SDN pods should crashloop until the host-level SDN services are shut down, at which point they take over.
@weliang could not reproduce it yet. He's trying a different config to see if we can get it to show up.
@Mike: Do you have a broken cluster I can look at?
Would it be possible to get logs from the SDN pod? These will not be in the system journal; they will only be available from "docker logs" or "kubectl logs". Ben observed that the symptoms match those of a kube-proxy (service proxy) issue. If both the 3.10 and 3.9 proxies are running at the same time, that would be bad. (this is just a guess).
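Collecting those SDN pod logs could look like the sketch below; the namespace matches the earlier `oc get pods -n openshift-sdn` output, while the function name and output directory are made up:

```shell
# Sketch: dump each SDN pod's logs to files, since they are not in the
# system journal and are only reachable via docker/kubectl/oc logs.
collect_sdn_logs() {
  local outdir=${1:-/tmp/sdn-logs} pod
  mkdir -p "$outdir"
  for pod in $(oc get pods -n openshift-sdn -o name); do
    # pod looks like "pod/sdn-xxxxx"; strip the prefix for the filename
    oc logs -n openshift-sdn "$pod" > "$outdir/${pod##*/}.log"
  done
}
```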
@vlaad: api-server is incorrect (in comment #53). It needs to be all one word: apiserver. This curl worked on the test cluster that @mfiedler set up: curl -v -k https://api-server.kube-service-catalog.svc:443/apis/servicecatalog.k8s.io/v1beta1
The update is that the test cluster that showed the problem with ansible had no visible networking problems when I got on it. The masters had a working SDN, and DNS was functioning properly. I could curl the service:

curl -v -k https://apiserver.kube-service-catalog.svc.cluster.local:443/apis/servicecatalog.k8s.io/v1beta1

[my curl line in comment 68 is wrong, it still has the -]

When @vlaad re-ran the ansible playbook it completed the second time. Perhaps some more time needs to be given for things to settle. @vlaad is still re-running the upgrade to see if we can catch the failure.
When I did a recreate with Mike in comment 37 I had scripts that executed commands every 5 seconds, started at the same time the upgrade was started. 30 or 40 minutes in we hit the failure and some things experienced issues (oc commands hung, the master logs showed errors communicating with the catalog service, etc), but those issues only lasted 6-10 minutes. I think manually poking around during the issue is going to be hard since it's a short-lived window in which things are failing.
I'm passing this back to the installer because it looks like things settle down; we may just need to increase the timeouts to allow things to settle. I'm going to keep working to try to reproduce this at a useful log level, but our test cluster hit another bug blocking further reproduction.
https://github.com/openshift/openshift-ansible/pull/9090 is the proposed fix. It waits for all core APIs to come up and increases the number of retries on storage migration to 6, with a 30 second delay between each one.
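The "waits for all core apis to come up" part of the proposed fix could be sketched as a simple poll loop; this is illustrative only (the real PR implements this as Ansible tasks), and the retry/delay defaults here are assumptions:

```shell
# Sketch: poll the apiserver's /healthz endpoint until it reports "ok",
# giving the control plane time to settle before storage migration runs.
wait_for_api() {
  local retries=${1:-24} delay=${2:-5} i
  for ((i = 1; i <= retries; i++)); do
    if oc get --raw /healthz 2>/dev/null | grep -q '^ok$'; then
      return 0
    fi
    sleep "$delay"
  done
  echo "apiserver not healthy after $retries checks" >&2
  return 1
}
```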
The PR https://github.com/openshift/openshift-ansible/pull/9099 is merged into openshift-ansible-3.10.15-1; please check the bug.
@Scott I think this is a performance/timing issue. Based on my upgrade test experience, I suggest increasing the number of retries or the delay between them.
I agree that this is either performance related or some sort of problem that takes time for the API service to get back to a state that allows successful storage migration. Since the pre-upgrade storage migration can be completed very early in the process, I'm going to simply move it into an early upgrade pre-check phase that happens before we do anything about converting a host to static pods and before any services have been restarted.
I see a higher failure rate with a system container install on RHEL: OCP HA v3.9.33 with the service catalog enabled, upgraded to OCP v3.10. @Mike, did you cover this upgrade path? Could you give it a try? Thanks.
We previously covered this path with 3.9.27 -> 3.10.15 and did not see the error. We will re-test today with 3.9.33 -> 3.10.16
@wmeng We hit a failure on 1 out of 4 upgrades from 3.9.33 to 3.10.17 today. The task it failed on was different from comment 80, but the actual failure is still an issue with a resource type that should be there. New bz?

1. Hosts: ec2-54-202-210-187.us-west-2.compute.amazonaws.com
   Play: Upgrade Service Catalog
   Task: scale up asb deploymentconfig
   Message: {u'cmd': u'/usr/bin/oc get dc asb -o json -n openshift-ansible-service-broker', u'returncode': 1, u'results': [{}], u'stderr': u'error: the server doesn\'t have a resource type "dc"\n', u'stdout': u''}

I'll attach the log
Created attachment 1458776 [details] ansible log - server doesn't have a resource type "dc"
Thanks Mike. There is bug tracking the issue. Bug 1579261 - [upgrade] ASB upgrade to 3.10 failed at 'scale up asb deploymentconfig'
Verified on 3.10.17. System test team performed multiple 3.9.33 -> 3.10.17 upgrades and did not encounter this issue. See comment 89 and comment 91.