Description of problem:
Upgrade failed at TASK [openshift_control_plane : Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered]

task path: /usr/share/ansible/openshift-ansible/roles/openshift_control_plane/tasks/check_master_api_is_ready.yml:49
Friday 14 September 2018 06:54:43 +0000 (0:00:02.808) 0:10:24.190 ******
FAILED - RETRYING: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered (30 retries left).
...
FAILED - RETRYING: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered (1 retries left).
fatal: [x]: FAILED! => {"attempts": 30, "changed": true, "cmd": ["/usr/local/bin/oc", "get", "--raw", "/apis/servicecatalog.k8s.io/v1beta1"], "delta": "0:00:00.389708", "end": "2018-09-14 06:58:28.374427", "msg": "non-zero return code", "rc": 1, "start": "2018-09-14 06:58:27.984719", "stderr": "Error from server (ServiceUnavailable): the server is currently unable to handle the request", "stderr_lines": ["Error from server (ServiceUnavailable): the server is currently unable to handle the request"], "stdout": "", "stdout_lines": []}

==================================================
All control plane pods were in Pending status:

[root@ip-172-18-6-202 ~]# oc get po -n kube-system
NAME                                              READY  STATUS   RESTARTS  AGE
master-api-ip-172-18-6-202.ec2.internal           0/1    Pending  0         25m
master-controllers-ip-172-18-6-202.ec2.internal   0/1    Pending  0         25m
master-etcd-ip-172-18-6-202.ec2.internal          0/1    Pending  0         18m

[root@ip-172-18-6-202 ~]# oc get --raw /apis/apps.openshift.io/v1
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"apps.openshift.io/v1","resources":[{"name":"deploymentconfigs","singularName":"","namespaced":true,"kind":"DeploymentConfig","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["dc"],"categories":["all"]},{"name":"deploymentconfigs/instantiate","singularName":"","namespaced":true,"kind":"DeploymentRequest","verbs":["create"]},{"name":"deploymentconfigs/log","singularName":"","namespaced":true,"kind":"DeploymentLog","verbs":["get"]},{"name":"deploymentconfigs/rollback","singularName":"","namespaced":true,"kind":"DeploymentConfigRollback","verbs":["create"]},{"name":"deploymentconfigs/scale","singularName":"","namespaced":true,"group":"extensions","version":"v1beta1","kind":"Scale","verbs":["get","patch","update"]},{"name":"deploymentconfigs/status","singularName":"","namespaced":true,"kind":"DeploymentConfig","verbs":["get","patch","update"]}]}

[root@ip-172-18-6-202 ~]# oc get --raw /apis/servicecatalog.k8s.io/v1beta1
Error from server (ServiceUnavailable): the server is currently unable to handle the request

master-api log:
E0914 07:04:25.687857 1 controller.go:111] loading OpenAPI spec for "v1beta1.servicecatalog.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
E0914 07:05:11.325785 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted

Detailed log in attachment.

Version-Release number of the following components:
ansible-2.6.4-1.el7ae.noarch
openshift-ansible-3.11.4-1.git.0.d727082.el7_5.noarch

How reproducible:
Always

Steps to Reproduce:
1. Run upgrade against OCP v3.10

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
Disabling the service catalog during install can work around the issue.
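For context, the failing Ansible task is essentially a retry loop around `oc get --raw`, failing the play once the retries are exhausted. A minimal Python sketch of that polling logic (the `probe` callable, retry count, and delay are illustrative stand-ins, not the playbook's actual values):

```python
import time

def wait_for_api(probe, retries=30, delay=10, sleep=time.sleep):
    """Poll `probe` until it returns 0 (success) or retries run out.

    `probe` stands in for running something like:
        /usr/local/bin/oc get --raw /apis/servicecatalog.k8s.io/v1beta1
    and returning the command's exit code.
    """
    for attempt in range(1, retries + 1):
        if probe() == 0:
            return attempt  # number of attempts it took
        if attempt < retries:
            sleep(delay)
    raise RuntimeError("API never became available after %d attempts" % retries)

# Example: an API that only starts answering on the 3rd probe.
codes = iter([1, 1, 0])
attempts = wait_for_api(lambda: next(codes), delay=0, sleep=lambda _: None)
```

With 30 retries at the observed ~8s per cycle, the task gives the aggregated API roughly four minutes to come up before the play fails, which matches the timestamps in the fatal output above.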
Are these hosts available for us to look at? If not, we need node and control plane pod logs from `master-logs etcd etcd`, `master-logs api api`, and `master-logs controllers controllers`.
Never mind, the attachment has all those logs.
Over to the Service Catalog team to diagnose why that API resource becomes unavailable for in excess of 10 minutes whenever the API static pods are restarted. Pod logs from all pods in the kube-service-catalog namespace would be helpful.
I can reproduce this readily on my cluster.

Service Catalog API servers cannot hit the kube service due to no route to host:

E0914 12:47:11.458055 1 reflector.go:205] github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1beta1.MutatingWebhookConfiguration: Get https://172.30.0.1:443/apis/admissionregistration.k8s.io/v1beta1/mutatingwebhookconfigurations?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: getsockopt: no route to host
E0914 12:47:11.463175 1 reflector.go:205] github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1.Namespace: Get https://172.30.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: getsockopt: no route to host

If I delete all the pods in the service catalog they restart and the API works just fine.

This may be networking related.
I don't have much more to add beyond what Scott noted. This looks to be a network issue. Service Catalog pods are deployed to all master nodes. We've seen similar issues with network routes not being resolvable during content upgrades in prior release updates.
Taking a look. If anyone has any pods in this state right now, could you paste the output of `ip route`?
[root@192 ~]# ip route
default via 192.168.121.1 dev eth0 proto dhcp metric 100
10.128.0.0/14 dev tun0 scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.30.0.0/16 dev tun0
192.168.121.0/24 dev eth0 proto kernel scope link src 192.168.121.19 metric 100

It looks like the pods in the openshift-sdn namespace are stuck terminating:

openshift-sdn   ovs-pkk87   1/1   Terminating   0   16h
openshift-sdn   sdn-qtpr6   1/1   Terminating   0   16h

Unfortunately I can't see any logs:

[root@192 ~]# oc logs -f -n openshift-sdn ovs-pkk87
Error from server (BadRequest): container "openvswitch" in pod "ovs-pkk87" is terminated
[root@192 ~]# oc logs -f -n openshift-sdn sdn-qtpr6
Error from server (BadRequest): container "sdn" in pod "sdn-qtpr6" is terminated

Multiple other pods are pending:

[root@192 ~]# oc get pods --all-namespaces
NAMESPACE                           NAME                                       READY  STATUS        RESTARTS  AGE
default                             docker-registry-1-csmrl                    1/1    Running       0         16h
default                             registry-console-1-2567l                   1/1    Running       0         16h
default                             router-1-x254g                             1/1    Running       0         16h
kube-service-catalog                apiserver-dqk49                            1/1    Running       0         16h
kube-service-catalog                controller-manager-vcn5s                   1/1    Running       0         16h
kube-system                         master-api-192.168.121.19.nip.io           0/1    Pending       0         16h
kube-system                         master-controllers-192.168.121.19.nip.io   0/1    Pending       0         16h
kube-system                         master-etcd-192.168.121.19.nip.io          0/1    Pending       0         16h
openshift-ansible-service-broker    asb-1-cphst                                1/1    Running       0         16h
openshift-node                      sync-685n9                                 1/1    Running       0         16h
openshift-sdn                       ovs-pkk87                                  1/1    Terminating   0         16h
openshift-sdn                       sdn-qtpr6                                  1/1    Terminating   0         16h
openshift-template-service-broker   apiserver-hs2th                            1/1    Running       0         16h
openshift-web-console               webconsole-7b88c47974-dxs84                1/1    Running       0         16h
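One mechanical sanity check on output like the above is whether the cluster service network (172.30.0.0/16 here) still has a route on the node; "no route to host" to 172.30.0.1 would be consistent with that route missing. A small hypothetical parser sketch (`has_route_for` is not a real tool, just an illustration over the `ip route` text):

```python
def has_route_for(ip_route_output, prefix):
    """Return True if some line of `ip route` output starts with the given prefix."""
    return any(line.split()[0] == prefix
               for line in ip_route_output.strip().splitlines() if line.strip())

# Sample taken from the `ip route` output quoted above.
sample = """\
default via 192.168.121.1 dev eth0 proto dhcp metric 100
10.128.0.0/14 dev tun0 scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.30.0.0/16 dev tun0
192.168.121.0/24 dev eth0 proto kernel scope link src 192.168.121.19 metric 100
"""

service_cidr_ok = has_route_for(sample, "172.30.0.0/16")  # SDN service network
```

In this capture the service route is present, which is part of why the problem looked less like routing and more like stale endpoints behind the service VIP.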
Excellent, thank you! Can you attach the kubelet (node process) logs? Are you able to get the SDN logs via `docker ps -a` and `docker logs`? They should be on the host until a reboot.
Created attachment 1483345 [details]
sos report from affected system (all-in-one master/node)

I don't see any running or exited containers for the sdn pods in `docker ps -a` output. I've attached an sosreport from the host.
(In reply to Scott Dodson from comment #6)
> I can reproduce this readily on my cluster.
>
> Service Catalog API servers cannot hit the kube service due to no route to
> host.
>
> E0914 12:47:11.458055 1 reflector.go:205]
> github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/client-go/
> informers/factory.go:87: Failed to list
> *v1beta1.MutatingWebhookConfiguration: Get
> https://172.30.0.1:443/apis/admissionregistration.k8s.io/v1beta1/
> mutatingwebhookconfigurations?limit=500&resourceVersion=0: dial tcp
> 172.30.0.1:443: getsockopt: no route to host
> E0914 12:47:11.463175 1 reflector.go:205]
> github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/client-go/
> informers/factory.go:87: Failed to list *v1.Namespace: Get
> https://172.30.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial
> tcp 172.30.0.1:443: getsockopt: no route to host
>
> If I delete all the pods in the service catalog they restart and the API
> works just fine.
>
> This may be networking related.

Scott, can you get `ip route` output from the system that the Service Catalog API server is running on?
Sure, this may be the controllers crash-looping due to misconfiguration. See https://bugzilla.redhat.com/show_bug.cgi?id=1628961#c3
After fixing the crash-looping controllers the service catalog API became available, but it took 12 minutes. And at least in this iteration the problem is apiserver -> kube-service-catalog apiserver service, rather than before, where it was the other way around that was getting no route to host. Here's all the information I could think to gather:

# ip route
default via 192.168.122.1 dev eth0 proto dhcp metric 100
10.128.0.0/14 dev tun0 scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.30.0.0/16 dev tun0
192.168.122.0/24 dev eth0 proto kernel scope link src 192.168.122.52 metric 100

openshift api server:
# master-logs api api
E0914 15:19:50.353925 1 available_controller.go:311] v1beta1.servicecatalog.k8s.io failed with: Get https://172.30.212.254:443: dial tcp 172.30.212.254:443: getsockopt: no route to host
E0914 15:19:50.511708 1 available_controller.go:311] v1beta1.servicecatalog.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.servicecatalog.k8s.io": the object has been modified; please apply your changes to the latest version and try again
I0914 15:19:51.111014 1 healthz.go:72] /healthz/log check
E0914 15:19:55.413717 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E0914 15:19:56.414758 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I0914 15:20:01.109979 1 healthz.go:72] /healthz/log check
I0914 15:20:11.109989 1 healthz.go:72] /healthz/log check
E0914 15:20:18.229589 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E0914 15:20:19.345441 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W0914 15:20:19.348025 1 reflector.go:341] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:87: watch of *v1.Namespace ended with: The resourceVersion for the provided watch is too old.
I0914 15:20:21.107436 1 healthz.go:72] /healthz/log check
E0914 15:20:26.521093 1 memcache.go:153] couldn't get resource list for servicecatalog.k8s.io/v1beta1: Get https://ose3-master.example.com:8443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0914 15:20:31.109267 1 healthz.go:72] /healthz/log check
E0914 15:20:34.821763 1 controller.go:111] loading OpenAPI spec for "v1beta1.servicecatalog.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: request timed out, Header: map[]
I0914 15:20:34.821901 1 controller.go:119] OpenAPI AggregationController: action for item v1beta1.servicecatalog.k8s.io: Rate Limited Requeue.
E0914 15:20:39.490374 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I0914 15:20:41.109320 1 healthz.go:72] /healthz/log check
I0914 15:20:51.110221 1 healthz.go:72] /healthz/log check
I0914 15:21:01.122115 1 healthz.go:72] /healthz/log check
E0914 15:21:05.500057 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W0914 15:21:05.500296 1 reflector.go:341] github.com/openshift/origin/vendor/k8s.io/apiextensions-apiserver/pkg/client/informers/internalversion/factory.go:74: watch of *apiextensions.CustomResourceDefinition ended with: The resourceVersion for the provided watch is too old.
E0914 15:21:08.569572 1 memcache.go:153] couldn't get resource list for servicecatalog.k8s.io/v1beta1: Get https://ose3-master.example.com:8443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0914 15:21:11.110125 1 healthz.go:72] /healthz/log check
E0914 15:21:14.215423 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E0914 15:21:19.284964 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I0914 15:21:21.109606 1 healthz.go:72] /healthz/log check
I0914 15:21:31.110875 1 healthz.go:72] /healthz/log check
E0914 15:21:32.575301 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I0914 15:21:34.822526 1 controller.go:105] OpenAPI AggregationController: Processing item v1beta1.servicecatalog.k8s.io
I0914 15:21:41.109374 1 healthz.go:72] /healthz/log check
E0914 15:21:47.352011 1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E0914 15:21:50.614935 1 memcache.go:153] couldn't get resource list for servicecatalog.k8s.io/v1beta1: Get https://ose3-master.example.com:8443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0914 15:21:51.109679 1 healthz.go:72] /healthz/log check

kube-service-catalog apiserver:
# docker logs 6cec
I0914 15:21:17.693745 1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterserviceplans/65b5aab9-aa29-11e8-abd9-52540059342d: (12.009087ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:17.849090 1 wrap.go:42] GET /: (223.982µs) 200 [[Go-http-client/1.1] 10.128.0.1:52524]
I0914 15:21:17.890127 1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterserviceplans/668edd91-aa29-11e8-abd9-52540059342d: (8.500606ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:18.116357 1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterserviceplans/64a19398-aa29-11e8-abd9-52540059342d: (26.154161ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:18.323522 1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterserviceplans/669d84a8-aa29-11e8-abd9-52540059342d: (35.423439ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:18.493388 1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterserviceplans/64aeac57-aa29-11e8-abd9-52540059342d: (12.540691ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:18.689195 1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterserviceplans/64d550ed-aa29-11e8-abd9-52540059342d: (9.805952ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:18.940518 1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterserviceplans/64752ccb-aa29-11e8-abd9-52540059342d: (51.623995ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:19.125488 1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterserviceplans/656842d7-aa29-11e8-abd9-52540059342d: (15.836569ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:19.311576 1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterservicebrokers/template-service-broker/status: (32.406532ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:23.573734 1 wrap.go:42] GET /apis/servicecatalog.k8s.io/v1beta1?timeout=32s: (293.008µs) 200 [[openshift/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8] 10.128.0.1:52236]
I0914 15:21:25.687995 1 wrap.go:42] GET /: (2.387451ms) 200 [[Go-http-client/1.1] 10.129.0.1:36906]
I0914 15:21:30.678283 1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterservicebrokers/ansible-service-broker/status: (39.36612ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:33.610356 1 wrap.go:42] GET /apis/servicecatalog.k8s.io/v1beta1?timeout=32s: (564.128µs) 200 [[openshift/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8] 10.128.0.1:52236]
I0914 15:21:43.638916 1 wrap.go:42] GET /apis/servicecatalog.k8s.io/v1beta1?timeout=32s: (258.042µs) 200 [[openshift/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8] 10.128.0.1:52236]
I0914 15:21:53.663391 1 wrap.go:42] GET /apis/servicecatalog.k8s.io/v1beta1?timeout=32s: (264.29µs) 200 [[openshift/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8] 10.128.0.1:52236]
I0914 15:21:55.695431 1 wrap.go:42] GET /: (36.632223ms) 200 [[Go-http-client/1.1] 10.129.0.1:36950]
I0914 15:22:03.750833 1 wrap.go:42] GET /apis/servicecatalog.k8s.io/v1beta1?timeout=32s: (373.081µs) 200 [[openshift/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8] 10.128.0.1:52236]
I0914 15:22:11.391636 1 wrap.go:42] GET /: (6.681933ms) 200 [[Go-http-client/1.1] 10.129.0.1:36976]
I0914 15:22:13.784406 1 wrap.go:42] GET /apis/servicecatalog.k8s.io/v1beta1?timeout=32s: (332.208µs) 200 [[openshift/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8] 10.128.0.1:52236]
I0914 15:22:14.685239 1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterservicebrokers/ansible-service-broker/status: (24.291521ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:22:18.502187 1 wrap.go:42] GET /: (147.791µs) 200 [[Go-http-client/1.1] 10.128.0.1:52774]
W0914 15:22:21.120162 1 reflector.go:341] github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/client-go/informers/factory.go:87: watch of *v1beta1.MutatingWebhookConfiguration ended with: The resourceVersion for the provided watch is too old.
I0914 15:22:22.121260 1 reflector.go:240] Listing and watching *v1beta1.MutatingWebhookConfiguration from github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/client-go/informers/factory.go:87
I0914 15:22:22.765669 1 wrap.go:42] GET /openapi/v2: (5.481822ms) 404 [[] 10.128.0.1:52236]
I0914 15:22:22.774718 1 wrap.go:42] GET /swagger.json: (6.420071ms) 404 [[] 10.128.0.1:52236]
I0914 15:22:23.822502 1 wrap.go:42] GET /apis/servicecatalog.k8s.io/v1beta1?timeout=32s: (267.381µs) 200 [[openshift/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8] 10.128.0.1:52236]
I0914 15:22:25.750217 1 wrap.go:42] GET /: (34.655935ms) 200 [[Go-http-client/1.1] 10.129.0.1:36996]
I0914 15:22:33.872827 1 wrap.go:42] GET /apis/servicecatalog.k8s.io/v1beta1?timeout=32s: (378.239µs) 200 [[openshift/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8] 10.128.0.1:52236]

# oc describe svc apiserver -n kube-service-catalog
E0914 11:25:18.621122 1343 round_trippers.go:169] CancelRequest not implemented
E0914 11:25:50.629434 1343 round_trippers.go:169] CancelRequest not implemented
Name:              apiserver
Namespace:         kube-service-catalog
Labels:            <none>
Annotations:       <none>
Selector:          app=apiserver
Type:              ClusterIP
IP:                172.30.212.254
Port:              secure  443/TCP
TargetPort:        6443/TCP
Endpoints:         10.128.0.59:6443,10.130.0.62:6443
Session Affinity:  None
Events:            <none>
Looking at the timing, the last kube api -> service catalog timeout was at 15:21:08, and the first log lines for the service catalog apiserver start at 15:21:17. So did kube-service-catalog take a long time to start, or was it unreachable? Can you pull logs from the kubelet and/or check events to see what the timeline is there?

This isn't looking like a network bug. However, we might have a chicken-and-egg issue that is exacerbating the upgrade.
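The timing argument above can be checked mechanically: klog-style prefixes such as `E0914 15:21:08.569572` carry month/day plus a microsecond timestamp, so the gap between the last timeout and the first service catalog log line is easy to compute. A sketch (assumes both timestamps fall in the same year; the two values are taken from the logs quoted above):

```python
from datetime import datetime

def klog_time(prefix):
    """Parse the 'MMDD HH:MM:SS.ffffff' portion of a klog line prefix
    such as 'E0914 15:21:08.569572', stripping the severity letter."""
    return datetime.strptime(prefix[1:], "%m%d %H:%M:%S.%f")

last_timeout = klog_time("E0914 15:21:08.569572")   # last kube-api -> catalog timeout
first_catalog = klog_time("I0914 15:21:17.693745")  # first catalog apiserver log line
gap_seconds = (first_catalog - last_timeout).total_seconds()
```

A gap of only a few seconds between the two events is what makes "slow to start" versus "unreachable" hard to distinguish from these logs alone, hence the request for kubelet logs and events.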
@Scott or anyone else that knows: is the storage upgrade in progress at this time, or has it yet to be kicked off?
Still hit it on v3.11.7
I hit the bug with openshift-ovs-networkpolicy enabled.

# oc get --raw /apis/servicecatalog.k8s.io/v1beta1
Error from server (ServiceUnavailable): the server is currently unable to handle the request
This blocks upgrade testing with the service catalog.
Maybe also related to https://bugzilla.redhat.com/show_bug.cgi?id=1625194
Just an update: I'm running the upgrade playbook now, trying to repro this.
(In reply to Casey Callendrello from comment #15) > This isn't looking like a network bug. Did we figure out why the SDN pod was lingering in a "terminated" state for a long time without being restarted (comment 9)? Because that would be a problem... eg, if the SDN pod on a node was previously running but has since been killed, then the network will be up and running, but it won't get updated for new changes. So, eg, if the set of Endpoints for a service changes, that node won't get updated iptables rules pointing to the new endpoints, and so attempts by local pods to connect to that service might end up sending packets to dead pods/nodes, eventually resulting in "No route to host" errors.
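The failure mode described above (a dead SDN pod leaving stale service endpoints programmed on a node) can be modeled trivially: if the node's view of the endpoints stops updating, the service VIP keeps resolving to pods that no longer exist. A toy illustration only, not actual iptables/kube-proxy behavior; all names here are invented:

```python
class NodeProxy:
    """Toy model of per-node service routing (stands in for the SDN/iptables layer)."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)  # endpoint snapshot programmed on the node
        self.running = True               # is the SDN pod still syncing?

    def sync(self, endpoints):
        if self.running:                  # a dead SDN pod never applies updates
            self.endpoints = list(endpoints)

    def resolve(self, live):
        # A connection only succeeds if the chosen endpoint is still alive.
        return [ep for ep in self.endpoints if ep in live]

node = NodeProxy(endpoints=["10.128.0.59:6443"])
node.running = False                        # SDN pod stuck terminating
node.sync(["10.130.0.62:6443"])             # endpoint change never lands on this node
reachable = node.resolve(live={"10.130.0.62:6443"})  # stale endpoint, nothing reachable
```

Packets sent to the stale endpoint would eventually surface as exactly the "no route to host" errors seen in the service catalog reflector logs.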
@sdodson observes:

I believe that the lingering sdn pod in terminated/terminating state was due to an order-of-operations issue and has since been resolved. I've seen this happen during upgrades though.

There are also some API aggregator fixes that the master team is working on backporting, which were leading to API aggregation not updating endpoints until a 1000s timeout was hit. https://github.com/openshift/origin/pull/21012 is the fix for that.

If you guys have been able to make it through a few upgrades successfully with the service catalog installed, then I'd say we send it over to QE to test.

Given that the above would explain the symptoms, and that @cdc was unable to reproduce it when he tried, I'm going to kick it back. If you encounter the error, please see if any of the sdn pods are stuck terminating and grab the node logs to see if there is any indication of why they are not quickly exiting.
I was able to reproduce it after trying harder :-). However, I agree that https://github.com/openshift/origin/pull/21012 is the likely cause. After the service apiserver was upgraded, the apiserver became extremely slow, taking a few minutes to even list <5 pods. At this point, the cluster is in a bad state. Certainly one of the consequences could be that the network doesn't get endpoint changes in a timely fashion. However, restarting the kube apiserver instantly brought the cluster back to life.
Looking at the logs:

NOTE: bring-up uses a lot of deprecated flags. I wouldn't expect that in a GA product.

----------------
The node failed 3 times between Sep 14 06:10:37 and Sep 14 07:21:19:

grep "process exited, code=exited, status=1" *
node:Sep 14 06:14:45 ip-172-18-6-202.ec2.internal systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=1/FAILURE
node:Sep 14 06:16:05 ip-172-18-6-202.ec2.internal systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=1/FAILURE
node:Sep 14 06:26:47 ip-172-18-6-202.ec2.internal systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=1/FAILURE

612 ": E" errors. Do we expect the node to fail 3 times before it stays up?

----
The master-controller log runs from 06:53:43.704203 to 07:15:38.251939. The last error is at line 2054 of 61527 log lines:

E0914 06:54:37.383987

So things settled out. 31 "^E" errors.

-----
The master-api log starts at 06:53:40.930154 and ends at 07:15:00.142865 (168347 lines). Line 9502:

E0914 06:54:32.771669 1 controller.go:111] loading OpenAPI spec for "v1beta1.servicecatalog.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable

138 "^E" errors.

---------------------
This looks to me like things are racing and the favorites don't always win. Some of the errors are for unexpectedly missing files, failures to talk to an expected process, and such. Having the node fail and restart 3 times seems unexpected.

Just my 2 cents.
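The failure count above comes from grepping the journal; the same tally can be sketched in a few lines of Python (the sample lines are from the grep output quoted above, hostnames shortened):

```python
def count_unit_failures(journal_lines, unit="atomic-openshift-node.service"):
    """Count 'main process exited' failures for a given systemd unit."""
    needle = "%s: main process exited, code=exited, status=1" % unit
    return sum(1 for line in journal_lines if needle in line)

lines = [
    "Sep 14 06:14:45 ip-172-18-6-202 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=1/FAILURE",
    "Sep 14 06:16:05 ip-172-18-6-202 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=1/FAILURE",
    "Sep 14 06:26:47 ip-172-18-6-202 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=1/FAILURE",
    "Sep 14 06:30:00 ip-172-18-6-202 systemd[1]: Started Atomic OpenShift Node.",
]
node_failures = count_unit_failures(lines)
```

This makes the "failed 3 times before it stays up" observation reproducible against any journal export rather than eyeballed from grep.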
I suspect this is still the apiserver bug. We need 21012 to merge.

Something is also wrong with RBAC. We're getting lots and lots of errors like this in the kubelet logs:

Sep 19 03:59:02 ip-172-18-2-142.ec2.internal atomic-openshift-node[8504]: W0919 03:59:02.212381 8520 status_manager.go:498] Failed to update status for pod "sdn-4z6zf_openshift-sdn(e529a4b9-bbba-11e8-bd35-0e701f90095c)": failed to patch status "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastTransitionTime\":\"2018-09-19T03:59:02Z\",\"status\":\"True\",\"type\":\"Ready\"},{\"lastProbeTime\":null,\"lastTransitionTime\":null,\"status\":\"True\",\"type\":\"ContainersReady\"}]}}" for pod "openshift-sdn"/"sdn-4z6zf": pods "sdn-4z6zf" is forbidden: User "system:node:ip-172-18-2-142.ec2.internal" cannot patch pods/status in the namespace "openshift-sdn": User "system:node:ip-172-18-2-142.ec2.internal" cannot "patch" "pods/status" with name "sdn-4z6zf" in project "openshift-sdn"

Scott, is this a bug?
I'm not sure. I've long suspected that the control plane is suffering a general failure but I've not been able to convince the master team to investigate that.
Still hit it on openshift-ansible-3.11.11-1.git.0.5d4f9d4.el7_5.noarch
Created attachment 1485253 [details]
Successful 3.10.45 -> 3.11.11 upgrade ansible log

I was able to do a successful upgrade from 3.10.45 using 3.11.11 and openshift-ansible 3.11.11. The Ansible log is attached. The failing task from the original description was successful.

However: I did not hit this issue during my 3.11.4 and 3.11.8 upgrades either, which were also successful. It would be best for @liujia to verify the fix as well.
The problem here seems to be related to RBAC, specifically this:

I0921 03:24:37.261834 1 rbac.go:118] RBAC DENY: user "system:node:ip-172-18-11-156.ec2.internal" groups ["system:nodes" "system:authenticated"] cannot "patch" resource "pods/status" named "master-api-ip-172-18-11-156.ec2.internal" in namespace "kube-system"

This results in the kubelet trying forever to patch the pod status but never succeeding, which in turn leaves the system incomplete. The current playbooks verify the API by calling `oc get --raw /apis/${group}/v1` on all groups, but I think that's not sufficient: based on what I'm seeing, docker managed to set up the API pod and it is working as expected, so the gets will always succeed, although sometimes they take longer. That, I'm guessing, is why the installation failed on `oc get apiservices/v1beta1.servicecatalog.k8s.io`, which btw works just fine now.

I've compared the system:node cluster role between 3.10 and 3.11, and the latter needs patch on pods/status, which was not in place in 3.10. I haven't compared all of the cluster roles, but scrolling through the logs I only found SCC reconciliation, nothing wrt cluster roles.

So here go my 2 questions:
1. @Scott, do we run cluster role reconciliation? I haven't seen anything in the playbook.
2. @Mo, do we need to run role reconciliation? Clayton mentioned something about the master dealing with it automatically; if so, and that fails, how would I know?

And one suggestion: @Scott, I think that on top of the API verification we do now (`oc get --raw /apis/${group}/v1`) we should additionally verify that the master api and controller pods are in Running state, to confirm the cluster is healthy. Checking `oc get --raw /healthz` is not sufficient; like I said before, the API is there and serving requests, but there is a problem with communication between components due to RBAC, in this particular case.
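To make the RBAC DENY concrete: authorization boils down to finding some rule in the subject's roles that grants the requested (verb, resource) pair. A toy model of that check, not Kubernetes' actual authorizer; the simplified rule lists below are invented to illustrate why a role missing `patch` on `pods/status` denies the kubelet:

```python
def allowed(rules, verb, resource):
    """Toy RBAC check: a request passes if some rule lists both the verb and the resource."""
    return any(verb in rule["verbs"] and resource in rule["resources"]
               for rule in rules)

# Hypothetical, heavily simplified system:node rules before and after
# 'patch' on pods/status was added to the role.
node_role_old = [{"verbs": ["get", "update"], "resources": ["pods/status"]}]
node_role_new = [{"verbs": ["get", "update", "patch"], "resources": ["pods/status"]}]

denied_before = not allowed(node_role_old, "patch", "pods/status")  # -> RBAC DENY, as in the log
allowed_after = allowed(node_role_new, "patch", "pods/status")
```

This is also why the `oc get --raw /apis/${group}/v1` probe can pass while the cluster is still broken: that probe exercises a different (verb, resource) pair than the one being denied.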
This is on atomic. In atomic installs, during the phase where we deploy authentication credentials, we update the node system container in order to mount up /var/lib/origin/.docker/config.json. I think that should already be mounted, so I think we should skip that phase?
https://github.com/openshift/openshift-ansible/pull/10194
https://github.com/openshift/openshift-ansible/pull/10200 pick for release-3.11
In 3.11.13-1 and later.
Version: openshift-ansible-3.11.14-1.git.0.65a0c0c.el7.noarch

Upgrade against OCP with a system container node and the service catalog deployed works well now.

> In my previous tests: upgrading against a system container node hit it 100% of the time; upgrading against an rpm node hit it 30% of the time; upgrading without the service catalog hit it 0% of the time.

Per comment 37, I will watch the rpm upgrade today. If there is no issue, I'll change the bug status.
The task [openshift_control_plane : Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered] also passed during an upgrade against rpm-installed OCP on openshift-ansible-3.11.14-1.git.0.65a0c0c.el7.noarch. Combined with comment 52, marking the bug verified.
Scott, thank you! Updating just openshift-ansible before running the upgrade worked.
This was addressed in openshift-ansible-3.11.16-1.git.0.4ac6f81.el7. It was flipped back to ASSIGNED due to workflow problems outside of openshift-ansible that caused the failure. As such, CLOSED CURRENTRELEASE.