Bug 1628881 - Upgrade failed at TASK [openshift_control_plane : Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered]
Summary: Upgrade failed at TASK [openshift_control_plane : Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered]
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Scott Dodson
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-14 08:52 UTC by liujia
Modified: 2023-10-06 17:55 UTC
CC List: 21 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-26 12:58:14 UTC
Target Upstream Version:
Embargoed:


Attachments
sos report from affected system (all-in-one master/node) (13.98 MB, application/x-xz), 2018-09-14 14:28 UTC, Jason Montleon
Successful 3.10.45 -> 3.11.11 upgrade ansible log (448.37 KB, application/x-gzip), 2018-09-20 17:50 UTC, Mike Fiedler

Description liujia 2018-09-14 08:52:32 UTC
Description of problem:
Upgrade failed at TASK [openshift_control_plane : Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered]

task path: /usr/share/ansible/openshift-ansible/roles/openshift_control_plane/tasks/check_master_api_is_ready.yml:49
Friday 14 September 2018  06:54:43 +0000 (0:00:02.808)       0:10:24.190 ****** 
FAILED - RETRYING: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered (30 retries left).
...
FAILED - RETRYING: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered (1 retries left).
fatal: [x]: FAILED! => {"attempts": 30, "changed": true, "cmd": ["/usr/local/bin/oc", "get", "--raw", "/apis/servicecatalog.k8s.io/v1beta1"], "delta": "0:00:00.389708", "end": "2018-09-14 06:58:28.374427", "msg": "non-zero return code", "rc": 1, "start": "2018-09-14 06:58:27.984719", "stderr": "Error from server (ServiceUnavailable): the server is currently unable to handle the request", "stderr_lines": ["Error from server (ServiceUnavailable): the server is currently unable to handle the request"], "stdout": "", "stdout_lines": []}
==================================================
All control plane pods were in Pending status:
[root@ip-172-18-6-202 ~]# oc get po -n kube-system
NAME                                              READY     STATUS    RESTARTS   AGE
master-api-ip-172-18-6-202.ec2.internal           0/1       Pending   0          25m
master-controllers-ip-172-18-6-202.ec2.internal   0/1       Pending   0          25m
master-etcd-ip-172-18-6-202.ec2.internal          0/1       Pending   0          18m

[root@ip-172-18-6-202 ~]# oc get --raw /apis/apps.openshift.io/v1
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"apps.openshift.io/v1","resources":[{"name":"deploymentconfigs","singularName":"","namespaced":true,"kind":"DeploymentConfig","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["dc"],"categories":["all"]},{"name":"deploymentconfigs/instantiate","singularName":"","namespaced":true,"kind":"DeploymentRequest","verbs":["create"]},{"name":"deploymentconfigs/log","singularName":"","namespaced":true,"kind":"DeploymentLog","verbs":["get"]},{"name":"deploymentconfigs/rollback","singularName":"","namespaced":true,"kind":"DeploymentConfigRollback","verbs":["create"]},{"name":"deploymentconfigs/scale","singularName":"","namespaced":true,"group":"extensions","version":"v1beta1","kind":"Scale","verbs":["get","patch","update"]},{"name":"deploymentconfigs/status","singularName":"","namespaced":true,"kind":"DeploymentConfig","verbs":["get","patch","update"]}]}
[root@ip-172-18-6-202 ~]# 
[root@ip-172-18-6-202 ~]# oc get --raw /apis/servicecatalog.k8s.io/v1beta1
Error from server (ServiceUnavailable): the server is currently unable to handle the request

master-api log:
E0914 07:04:25.687857       1 controller.go:111] loading OpenAPI spec for "v1beta1.servicecatalog.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
E0914 07:05:11.325785       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted

Detail log in attachment.
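
For reference, the failing task just retries the raw GET shown above. A rough bash equivalent (30 attempts per the log above; the delay between attempts is an assumption, the real value lives in check_master_api_is_ready.yml):

for attempt in $(seq 1 30); do
    /usr/local/bin/oc get --raw /apis/servicecatalog.k8s.io/v1beta1 && break
    sleep 10    # assumed retry delay
done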

Version-Release number of the following components:
ansible-2.6.4-1.el7ae.noarch
openshift-ansible-3.11.4-1.git.0.d727082.el7_5.noarch

How reproducible:
always

Steps to Reproduce:
1. Run upgrade against ocp v3.10

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
Disabling the service catalog at install time can work around the issue.
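
A hedged sketch of that workaround; the inventory variable name and the inventory path are assumptions to be checked against the openshift-ansible version in use:

# Illustrative check of the Ansible inventory (path and variable name assumed):
grep -n 'openshift_enable_service_catalog' /etc/ansible/hosts
# Expected entry under [OSEv3:vars] to skip the service catalog:
# openshift_enable_service_catalog=false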

Comment 2 Scott Dodson 2018-09-14 12:28:17 UTC
Are these hosts available for us to look at? If not we need node and control plane pod logs from `master-logs etcd etcd`, `master-logs api api`, `master-logs controllers controllers`

Comment 3 Scott Dodson 2018-09-14 12:33:21 UTC
Nevermind, attachment has all those logs.

Comment 5 Scott Dodson 2018-09-14 12:39:11 UTC
Over to the Service Catalog team to diagnose why that API resource becomes unavailable for in excess of 10 minutes whenever the api static pods are restarted.

Pod logs from all pods in kube-service-catalog namespace would be helpful.
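
A minimal sketch of gathering those logs; the pod name suffixes below are placeholders taken from the output in comment 9, so substitute the ones on the affected cluster:

oc get pods -n kube-service-catalog -o wide
oc logs -n kube-service-catalog apiserver-dqk49              # placeholder pod name
oc logs -n kube-service-catalog controller-manager-vcn5s     # placeholder pod name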

Comment 6 Scott Dodson 2018-09-14 12:57:07 UTC
I can reproduce this readily on my cluster. 

Service Catalog API servers cannot hit the kube service due to no route to host.

E0914 12:47:11.458055       1 reflector.go:205] github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1beta1.MutatingWebhookConfiguration: Get https://172.30.0.1:443/apis/admissionregistration.k8s.io/v1beta1/mutatingwebhookconfigurations?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: getsockopt: no route to host
E0914 12:47:11.463175       1 reflector.go:205] github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/client-go/informers/factory.go:87: Failed to list *v1.Namespace: Get https://172.30.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: getsockopt: no route to host

If I delete all the pods in the service catalog they restart and the API works just fine.

This may be networking related.
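
A sketch of that recovery step, assuming the service catalog pods are recreated automatically by their controllers:

# Force the service catalog pods to be recreated, then re-check the aggregated API:
oc delete pods --all -n kube-service-catalog
oc get pods -n kube-service-catalog -w
oc get --raw /apis/servicecatalog.k8s.io/v1beta1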

Comment 7 Jay Boyd 2018-09-14 13:34:58 UTC
I don't have much more to add to this than what Scott noted. This looks to be a network issue. Service Catalog pods are deployed to all master nodes. We've seen similar issues with network routes not being resolvable during content upgrade in prior release updates.

Comment 8 Casey Callendrello 2018-09-14 13:46:49 UTC
Taking a look.

If anyone has any pods in this state right now, could you paste the output of `ip route`?

Comment 9 Jason Montleon 2018-09-14 14:03:56 UTC
[root@192 ~]# ip route
default via 192.168.121.1 dev eth0 proto dhcp metric 100 
10.128.0.0/14 dev tun0 scope link 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
172.30.0.0/16 dev tun0 
192.168.121.0/24 dev eth0 proto kernel scope link src 192.168.121.19 metric 100 

It looks like the pods in the openshift-sdn namespace are stuck terminating.
openshift-sdn                       ovs-pkk87                                  1/1       Terminating   0          16h
openshift-sdn                       sdn-qtpr6                                  1/1       Terminating   0          16h

Unfortunately I can't see any logs:
[root@192 ~]# oc logs -f -n openshift-sdn                       ovs-pkk87
Error from server (BadRequest): container "openvswitch" in pod "ovs-pkk87" is terminated

[root@192 ~]# oc logs -f -n openshift-sdn                       sdn-qtpr6
Error from server (BadRequest): container "sdn" in pod "sdn-qtpr6" is terminated

Multiple other pods are pending:
[root@192 ~]# oc get pods --all-namespaces
NAMESPACE                           NAME                                       READY     STATUS        RESTARTS   AGE
default                             docker-registry-1-csmrl                    1/1       Running       0          16h
default                             registry-console-1-2567l                   1/1       Running       0          16h
default                             router-1-x254g                             1/1       Running       0          16h
kube-service-catalog                apiserver-dqk49                            1/1       Running       0          16h
kube-service-catalog                controller-manager-vcn5s                   1/1       Running       0          16h
kube-system                         master-api-192.168.121.19.nip.io           0/1       Pending       0          16h
kube-system                         master-controllers-192.168.121.19.nip.io   0/1       Pending       0          16h
kube-system                         master-etcd-192.168.121.19.nip.io          0/1       Pending       0          16h
openshift-ansible-service-broker    asb-1-cphst                                1/1       Running       0          16h
openshift-node                      sync-685n9                                 1/1       Running       0          16h
openshift-sdn                       ovs-pkk87                                  1/1       Terminating   0          16h
openshift-sdn                       sdn-qtpr6                                  1/1       Terminating   0          16h
openshift-template-service-broker   apiserver-hs2th                            1/1       Running       0          16h
openshift-web-console               webconsole-7b88c47974-dxs84                1/1       Running       0          16h

Comment 10 Casey Callendrello 2018-09-14 14:13:36 UTC
Excellent, thank you!

Can you attach the kubelet (node process) logs? Are you able to get the SDN logs via "docker ps -a" and "docker logs" ? They should be on the host until a reboot.
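
A hedged sketch of pulling those logs on the host; the docker name filter is an assumption, and the journal unit name matches the atomic-openshift-node unit that shows up later in this bug:

# Exited SDN/OVS containers should still be on the host until a reboot:
docker ps -a | grep -E 'sdn|openvswitch'
docker logs <container-id>            # id taken from the previous command (placeholder)
# Kubelet / node process logs:
journalctl -u atomic-openshift-node --no-pager | tail -n 500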

Comment 11 Jason Montleon 2018-09-14 14:28:53 UTC
Created attachment 1483345 [details]
sos report from affected system (all-in-one master/node)

I don't see any running or exited containers for the sdn pods in docker ps -a output. I've attached an sosreport from the host.

Comment 12 Dan Williams 2018-09-14 14:38:08 UTC
(In reply to Scott Dodson from comment #6)
> I can reproduce this readily on my cluster. 
> 
> Service Catalog API servers cannot hit the kube service due to no route to
> host.
> 
> E0914 12:47:11.458055       1 reflector.go:205]
> github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/client-go/
> informers/factory.go:87: Failed to list
> *v1beta1.MutatingWebhookConfiguration: Get
> https://172.30.0.1:443/apis/admissionregistration.k8s.io/v1beta1/
> mutatingwebhookconfigurations?limit=500&resourceVersion=0: dial tcp
> 172.30.0.1:443: getsockopt: no route to host
> E0914 12:47:11.463175       1 reflector.go:205]
> github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/client-go/
> informers/factory.go:87: Failed to list *v1.Namespace: Get
> https://172.30.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial
> tcp 172.30.0.1:443: getsockopt: no route to host
> 
> If I delete all the pods in the service catalog they restart and the API
> works just fine.
> 
> This may be networking related.

Scott, can you get "ip route" output from the system that runs the Service Catalog API server?

Comment 13 Scott Dodson 2018-09-14 15:08:42 UTC
Sure, this may be controllers crash looping due to misconfiguration. See 

https://bugzilla.redhat.com/show_bug.cgi?id=1628961#c3

Comment 14 Scott Dodson 2018-09-14 15:32:37 UTC
After fixing the crash-looping controllers the service catalog API became available, but it took 12 minutes. And at least in this iteration the problem is apiserver -> kube-service-catalog apiserver service, rather than before where it was the other direction that was getting no route to host.

Here's all the information I could think to gather, 

# ip route
default via 192.168.122.1 dev eth0 proto dhcp metric 100 
10.128.0.0/14 dev tun0 scope link 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
172.30.0.0/16 dev tun0 
192.168.122.0/24 dev eth0 proto kernel scope link src 192.168.122.52 metric 100

 openshift api server
# master-logs api api 
E0914 15:19:50.353925       1 available_controller.go:311] v1beta1.servicecatalog.k8s.io failed with: Get https://172.30.212.254:443: dial tcp 172.30.212.254:443: getsockopt: no route to host
E0914 15:19:50.511708       1 available_controller.go:311] v1beta1.servicecatalog.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.servicecatalog.k8s.io": the object has been modified; please apply your changes to the latest version and try again
I0914 15:19:51.111014       1 healthz.go:72] /healthz/log check
E0914 15:19:55.413717       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E0914 15:19:56.414758       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I0914 15:20:01.109979       1 healthz.go:72] /healthz/log check
I0914 15:20:11.109989       1 healthz.go:72] /healthz/log check
E0914 15:20:18.229589       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E0914 15:20:19.345441       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W0914 15:20:19.348025       1 reflector.go:341] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:87: watch of *v1.Namespace ended with: The resourceVersion for the provided watch is too old.
I0914 15:20:21.107436       1 healthz.go:72] /healthz/log check
E0914 15:20:26.521093       1 memcache.go:153] couldn't get resource list for servicecatalog.k8s.io/v1beta1: Get https://ose3-master.example.com:8443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0914 15:20:31.109267       1 healthz.go:72] /healthz/log check
E0914 15:20:34.821763       1 controller.go:111] loading OpenAPI spec for "v1beta1.servicecatalog.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: request timed out, Header: map[]
I0914 15:20:34.821901       1 controller.go:119] OpenAPI AggregationController: action for item v1beta1.servicecatalog.k8s.io: Rate Limited Requeue.
E0914 15:20:39.490374       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I0914 15:20:41.109320       1 healthz.go:72] /healthz/log check
I0914 15:20:51.110221       1 healthz.go:72] /healthz/log check
I0914 15:21:01.122115       1 healthz.go:72] /healthz/log check
E0914 15:21:05.500057       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W0914 15:21:05.500296       1 reflector.go:341] github.com/openshift/origin/vendor/k8s.io/apiextensions-apiserver/pkg/client/informers/internalversion/factory.go:74: watch of *apiextensions.CustomResourceDefinition ended with: The resourceVersion for the provided watch is too old.
E0914 15:21:08.569572       1 memcache.go:153] couldn't get resource list for servicecatalog.k8s.io/v1beta1: Get https://ose3-master.example.com:8443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0914 15:21:11.110125       1 healthz.go:72] /healthz/log check
E0914 15:21:14.215423       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E0914 15:21:19.284964       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I0914 15:21:21.109606       1 healthz.go:72] /healthz/log check
I0914 15:21:31.110875       1 healthz.go:72] /healthz/log check
E0914 15:21:32.575301       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
I0914 15:21:34.822526       1 controller.go:105] OpenAPI AggregationController: Processing item v1beta1.servicecatalog.k8s.io
I0914 15:21:41.109374       1 healthz.go:72] /healthz/log check
E0914 15:21:47.352011       1 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
E0914 15:21:50.614935       1 memcache.go:153] couldn't get resource list for servicecatalog.k8s.io/v1beta1: Get https://ose3-master.example.com:8443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0914 15:21:51.109679       1 healthz.go:72] /healthz/log check

 kube-service-catalog apiserver
# docker logs 6cec
I0914 15:21:17.693745       1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterserviceplans/65b5aab9-aa29-11e8-abd9-52540059342d: (12.009087ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:17.849090       1 wrap.go:42] GET /: (223.982µs) 200 [[Go-http-client/1.1] 10.128.0.1:52524]
I0914 15:21:17.890127       1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterserviceplans/668edd91-aa29-11e8-abd9-52540059342d: (8.500606ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:18.116357       1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterserviceplans/64a19398-aa29-11e8-abd9-52540059342d: (26.154161ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:18.323522       1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterserviceplans/669d84a8-aa29-11e8-abd9-52540059342d: (35.423439ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:18.493388       1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterserviceplans/64aeac57-aa29-11e8-abd9-52540059342d: (12.540691ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:18.689195       1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterserviceplans/64d550ed-aa29-11e8-abd9-52540059342d: (9.805952ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:18.940518       1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterserviceplans/64752ccb-aa29-11e8-abd9-52540059342d: (51.623995ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:19.125488       1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterserviceplans/656842d7-aa29-11e8-abd9-52540059342d: (15.836569ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:19.311576       1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterservicebrokers/template-service-broker/status: (32.406532ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:23.573734       1 wrap.go:42] GET /apis/servicecatalog.k8s.io/v1beta1?timeout=32s: (293.008µs) 200 [[openshift/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8] 10.128.0.1:52236]
I0914 15:21:25.687995       1 wrap.go:42] GET /: (2.387451ms) 200 [[Go-http-client/1.1] 10.129.0.1:36906]
I0914 15:21:30.678283       1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterservicebrokers/ansible-service-broker/status: (39.36612ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:21:33.610356       1 wrap.go:42] GET /apis/servicecatalog.k8s.io/v1beta1?timeout=32s: (564.128µs) 200 [[openshift/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8] 10.128.0.1:52236]
I0914 15:21:43.638916       1 wrap.go:42] GET /apis/servicecatalog.k8s.io/v1beta1?timeout=32s: (258.042µs) 200 [[openshift/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8] 10.128.0.1:52236]
I0914 15:21:53.663391       1 wrap.go:42] GET /apis/servicecatalog.k8s.io/v1beta1?timeout=32s: (264.29µs) 200 [[openshift/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8] 10.128.0.1:52236]
I0914 15:21:55.695431       1 wrap.go:42] GET /: (36.632223ms) 200 [[Go-http-client/1.1] 10.129.0.1:36950]
I0914 15:22:03.750833       1 wrap.go:42] GET /apis/servicecatalog.k8s.io/v1beta1?timeout=32s: (373.081µs) 200 [[openshift/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8] 10.128.0.1:52236]
I0914 15:22:11.391636       1 wrap.go:42] GET /: (6.681933ms) 200 [[Go-http-client/1.1] 10.129.0.1:36976]
I0914 15:22:13.784406       1 wrap.go:42] GET /apis/servicecatalog.k8s.io/v1beta1?timeout=32s: (332.208µs) 200 [[openshift/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8] 10.128.0.1:52236]
I0914 15:22:14.685239       1 wrap.go:42] PUT /apis/servicecatalog.k8s.io/v1beta1/clusterservicebrokers/ansible-service-broker/status: (24.291521ms) 200 [[service-catalog/v0.0.0 (linux/amd64) kubernetes/$Format/service-catalog-controller-manager] 10.128.0.1:52236]
I0914 15:22:18.502187       1 wrap.go:42] GET /: (147.791µs) 200 [[Go-http-client/1.1] 10.128.0.1:52774]
W0914 15:22:21.120162       1 reflector.go:341] github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/client-go/informers/factory.go:87: watch of *v1beta1.MutatingWebhookConfiguration ended with: The resourceVersion for the provided watch is too old.
I0914 15:22:22.121260       1 reflector.go:240] Listing and watching *v1beta1.MutatingWebhookConfiguration from github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/client-go/informers/factory.go:87
I0914 15:22:22.765669       1 wrap.go:42] GET /openapi/v2: (5.481822ms) 404 [[] 10.128.0.1:52236]
I0914 15:22:22.774718       1 wrap.go:42] GET /swagger.json: (6.420071ms) 404 [[] 10.128.0.1:52236]
I0914 15:22:23.822502       1 wrap.go:42] GET /apis/servicecatalog.k8s.io/v1beta1?timeout=32s: (267.381µs) 200 [[openshift/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8] 10.128.0.1:52236]
I0914 15:22:25.750217       1 wrap.go:42] GET /: (34.655935ms) 200 [[Go-http-client/1.1] 10.129.0.1:36996]
I0914 15:22:33.872827       1 wrap.go:42] GET /apis/servicecatalog.k8s.io/v1beta1?timeout=32s: (378.239µs) 200 [[openshift/v1.10.0+b81c8f8 (linux/amd64) kubernetes/b81c8f8] 10.128.0.1:52236]

# oc describe svc apiserver -n kube-service-catalog
E0914 11:25:18.621122    1343 round_trippers.go:169] CancelRequest not implemented
E0914 11:25:50.629434    1343 round_trippers.go:169] CancelRequest not implemented
Name:              apiserver
Namespace:         kube-service-catalog
Labels:            <none>
Annotations:       <none>
Selector:          app=apiserver
Type:              ClusterIP
IP:                172.30.212.254
Port:              secure  443/TCP
TargetPort:        6443/TCP
Endpoints:         10.128.0.59:6443,10.130.0.62:6443
Session Affinity:  None
Events:            <none>
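
Two more data points that might be worth correlating with the above (a sketch; the apiservice name and VIP are taken from the errors earlier in this comment):

# Availability condition the kube apiserver records for the aggregated API:
oc get apiservice v1beta1.servicecatalog.k8s.io -o yaml | grep -A 5 'conditions:'
# Endpoints backing the 172.30.212.254 service VIP from the errors above:
oc get endpoints apiserver -n kube-service-catalog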

Comment 15 Casey Callendrello 2018-09-14 16:02:11 UTC
Looking at the timing, the last kube api -> service catalog timeout was at 15:21:08, and the first log lines for the service catalog apiserver start at 15:21:17.

So did kube-service-catalog take a long time to start, or was it unreachable?

Can you pull logs from the kubelet and/or check events to see what the timeline is there?

This isn't looking like a network bug. However, we might have a chicken-and-egg issue that is exacerbating the upgrade.
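
A sketch of how that timeline could be pulled, with an assumed window around the 15:21 timestamps above:

oc get events -n kube-service-catalog --sort-by=.lastTimestamp
journalctl -u atomic-openshift-node --since '2018-09-14 15:15' --until '2018-09-14 15:25' --no-pager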

Comment 16 Jay Boyd 2018-09-14 20:10:15 UTC
@Scott or anyone else that knows: is the storage upgrade in progress at this time, or has it yet to be kicked off?

Comment 17 liujia 2018-09-17 03:17:45 UTC
Still hit it on v3.11.7

Comment 18 sheng.lao 2018-09-17 11:42:05 UTC
I hit the bug with openshift-ovs-networkpolicy enabled.

# oc get --raw /apis/servicecatalog.k8s.io/v1beta1
Error from server (ServiceUnavailable): the server is currently unable to handle the request

Comment 19 liujia 2018-09-18 03:48:04 UTC
This blocks upgrade testing with the service catalog.

Comment 20 Meng Bo 2018-09-18 08:45:34 UTC
Maybe also related to https://bugzilla.redhat.com/show_bug.cgi?id=1625194

Comment 21 Casey Callendrello 2018-09-18 14:30:15 UTC
Just an update: I'm running the upgrade playbook now, trying to repro this.

Comment 22 Dan Winship 2018-09-18 18:40:04 UTC
(In reply to Casey Callendrello from comment #15)
> This isn't looking like a network bug.

Did we figure out why the SDN pod was lingering in a "terminated" state for a long time without being restarted (comment 9)? Because that would be a problem...

eg, if the SDN pod on a node was previously running but has since been killed, then the network will be up and running, but it won't get updated for new changes. So, eg, if the set of Endpoints for a service changes, that node won't get updated iptables rules pointing to the new endpoints, and so attempts by local pods to connect to that service might end up sending packets to dead pods/nodes, eventually resulting in "No route to host" errors.
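
If that theory is right, a quick way to check it on an affected node would be something like this sketch (the VIP comes from comment 14):

# Endpoints currently registered for the service catalog apiserver:
oc get endpoints apiserver -n kube-service-catalog
# iptables rules the node has programmed for the service VIP; stale entries here
# would point at dead pods and produce "no route to host":
iptables-save | grep 172.30.212.254
# State of the SDN pods that are supposed to keep those rules current:
oc get pods -n openshift-sdn -o wide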

Comment 23 Ben Bennett 2018-09-18 19:38:21 UTC
@sdodson observes:
  I believe that the lingering sdn pod in terminated/terminating state was due to an order of operations issue and has since been resolved.

  I've seen this happen during upgrades though. There are also some API aggregator fixes that the master team is working on backporting, which were leading to API aggregation not updating endpoints until a 1000s timeout was hit. https://github.com/openshift/origin/pull/21012 is the fix for that.

  If you guys have been able to make it through a few upgrades successfully with the service catalog installed then I'd say we send it over to QE to test.

Given that the above would explain the symptoms, and that @cdc was unable to reproduce it when he tried, I'm going to kick it back.


If you encounter the error, please see if any of the sdn pods are stuck terminating and grab the node logs to see if there is any indication of why they are not quickly exiting.

Comment 24 Casey Callendrello 2018-09-18 20:00:53 UTC
I was able to reproduce it after trying harder :-).

However, I agree that https://github.com/openshift/origin/pull/21012 is the likely cause. After the service apiserver was upgraded, the apiserver became extremely slow, taking a few minutes to even list <5 pods.

At this point, the cluster is in a bad state. Certainly one of the consequences could be that the network doesn't get endpoint changes in a timely fashion.

However, restarting the kube apiserver instantly brought the cluster back to life.
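
For the record, the restart that recovered things here would be roughly the following on the master; master-restart is assumed to be available alongside the master-logs helper referenced earlier in this bug:

# Restart the static control plane pods:
master-restart api
master-restart controllers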

Comment 25 Phil Cameron 2018-09-18 21:02:16 UTC
Looking at the logs:
NOTE: bring up uses a lot of deprecated flags. I wouldn't expect that in a GA product.

----------------

node failed 3 times between Sep 14 06:10:37 and Sep 14 07:21:19
 grep "process exited, code=exited, status=1" *
node:Sep 14 06:14:45 ip-172-18-6-202.ec2.internal systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=1/FAILURE
node:Sep 14 06:16:05 ip-172-18-6-202.ec2.internal systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=1/FAILURE
node:Sep 14 06:26:47 ip-172-18-6-202.ec2.internal systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=1/FAILURE

612 ": E" errors
Do we expect node to fail 3 times before it stays up?

----
master-controller log runs 06:53:43.704203 07:15:38.251939
Last Error at line 2054 in log of 61527 lines
E0914 06:54:37.383987 
So things settled out. 

31 "^E"  errors   

-----
master-api log starts 06:53:40.930154 ends 07:15:00.142865 168347 lines

Line 9502: E0914 06:54:32.771669       1 controller.go:111] loading OpenAPI spec for "v1beta1.servicecatalog.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable

138 "^E"

---------------------
This looks to me like things are racing and the favorites don't always win.
Some of the errors are for unexpectedly missing files, being unable to talk to an expected process, and such.
Having the node fail and restart 3 times seems unexpected.
Just my 2 cents.

Comment 29 Casey Callendrello 2018-09-19 08:39:44 UTC
I suspect this is still the apiserver bug. We need 21012 to merge.

Something is also wrong with RBAC. Getting lots and lots of errors like:

Sep 19 03:59:02 ip-172-18-2-142.ec2.internal atomic-openshift-node[8504]: W0919 03:59:02.212381    8520 status_manager.go:498] Failed to update status for pod "sdn-4z6zf_openshift-sdn(e529a
4b9-bbba-11e8-bd35-0e701f90095c)": failed to patch status "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"ty
pe\":\"PodScheduled\"}],\"conditions\":[{\"lastTransitionTime\":\"2018-09-19T03:59:02Z\",\"status\":\"True\",\"type\":\"Ready\"},{\"lastProbeTime\":null,\"lastTransitionTime\":null,\"status
\":\"True\",\"type\":\"ContainersReady\"}]}}" for pod "openshift-sdn"/"sdn-4z6zf": pods "sdn-4z6zf" is forbidden: User "system:node:ip-172-18-2-142.ec2.internal" cannot patch pods/status in
 the namespace "openshift-sdn": User "system:node:ip-172-18-2-142.ec2.internal" cannot "patch" "pods/status" with name "sdn-4z6zf" in project "openshift-sdn"


in the kubelet logs. Scott, is this a bug?

Comment 30 Scott Dodson 2018-09-19 14:53:59 UTC
I'm not sure. I've long suspected that the control plane is suffering a general failure but I've not been able to convince the master team to investigate that.

Comment 31 liujia 2018-09-20 08:56:05 UTC
Still hit it on openshift-ansible-3.11.11-1.git.0.5d4f9d4.el7_5.noarch

Comment 36 Mike Fiedler 2018-09-20 17:50:43 UTC
Created attachment 1485253 [details]
Successful 3.10.45 -> 3.11.11 upgrade ansible log

I was able to do a successful upgrade from 3.10.45 using 3.11.11 and openshift-ansible 3.11.11. The Ansible log is attached. The failing task in the original description was successful.

However, I did not hit this issue during the 3.11.4 and 3.11.8 upgrades, which were also successful. It would be best for @liujia to verify the fix as well.

Comment 47 Maciej Szulik 2018-09-21 12:12:27 UTC
The problem here seems to be related with RBAC, specifically this:

I0921 03:24:37.261834       1 rbac.go:118] RBAC DENY: user "system:node:ip-172-18-11-156.ec2.internal" groups ["system:nodes" "system:authenticated"] cannot "patch" resource "pods/status" named "master-api-ip-172-18-11-156.ec2.internal" in namespace "kube-system"

This results in the kubelet trying forever to patch the pod status but never succeeding, which in turn leaves the system incomplete. The current playbooks verify the API by calling oc get --raw /apis/${group}/v1 on all groups, but I think that's not sufficient, because based on what I'm seeing docker managed to set up the API pod and it is working as expected, so those gets will always succeed, although sometimes they'll take longer. That's why, I'm guessing, the installation failed on oc get apiservices/v1beta1.servicecatalog.k8s.io, which, by the way, works just fine now.

I've compared the system:node cluster role between 3.10 and 3.11; the latter needs patch on pods/status, and that was not in place in 3.10. I haven't compared all of the cluster roles, but scrolling through the logs I only found SCC reconciliation and nothing with regard to cluster roles. So here are my 2 questions:

1. @Scott, do we run cluster role reconciliation? I haven't seen anything in the playbook.
2. @Mo, do we need to run role reconciliation? Clayton mentioned something about the master dealing with it automatically; if so, what happens if that fails, and how would I know?

And one suggestion:

@Scott I think that on top of the API verification we do now (oc get --raw /apis/${group}/v1) we should additionally verify that the master api and controller pods are in Running state, to confirm that the cluster is healthy. Checking oc get --raw /healthz is not sufficient: like I said before, the API is there and is serving requests, but there is a problem with communication between components, due to RBAC in this particular case.
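
A sketch of the suggested extra verification, plus a check for the system:node permission discussed above (the grep context widths are arbitrary):

# The health endpoint alone is not enough, per the comment above:
oc get --raw /healthz
# The static control plane pods should be Running, not Pending:
oc get pods -n kube-system
# The aggregated API the playbook waits on:
oc get --raw /apis/servicecatalog.k8s.io/v1beta1
# Confirm the 3.11 system:node cluster role carries the patch verb on pods/status:
oc get clusterrole system:node -o yaml | grep -B 2 -A 6 'pods/status'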

Comment 48 Scott Dodson 2018-09-21 14:24:22 UTC
This is on atomic. In atomic installs, during the phase where we deploy authentication credentials we update the node system container in order to mount up /var/lib/origin/.docker/config.json

I think that should already be mounted, so we should probably skip that phase?

Comment 50 Scott Dodson 2018-09-21 18:11:33 UTC
https://github.com/openshift/openshift-ansible/pull/10200 pick for release-3.11

Comment 51 Scott Dodson 2018-09-24 12:53:34 UTC
In 3.11.13-1 and later.

Comment 52 liujia 2018-09-25 01:37:46 UTC
Version:openshift-ansible-3.11.14-1.git.0.65a0c0c.el7.noarch

Upgrade against ocp with a system container node and the service catalog deployed works well now.

>In my previous tests: when doing the upgrade against a system container node, I hit it 100% of the time; against an rpm node, 30%; without the service catalog, 0%.

According to comment 37, I will watch an rpm upgrade today. If there is no issue, I will then change the bug status.

Comment 53 liujia 2018-09-25 08:27:15 UTC
The task [openshift_control_plane : Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered] also passed during an upgrade against rpm-installed ocp on openshift-ansible-3.11.14-1.git.0.65a0c0c.el7.noarch.

Combined with comment 52, verifying the bug.

Comment 58 Jason Montleon 2018-09-29 20:04:11 UTC
Scott, thank you! Updating just openshift-ansible before running the upgrade worked.

Comment 59 Scott Dodson 2018-10-26 12:58:14 UTC
This was addressed in openshift-ansible-3.11.16-1.git.0.4ac6f81.el7. It was flipped back to ASSIGNED due to workflow problems outside of openshift-ansible that caused the failure.

As such, CLOSED CURRENTRELEASE.

