Bug 1465361
| Summary: | Failed to watch networking object api errors appear in the master log | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Meng Bo <bmeng> |
| Component: | Node | Assignee: | Clayton Coleman <ccoleman> |
| Status: | CLOSED ERRATA | QA Contact: | Meng Bo <bmeng> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.6.0 | CC: | aos-bugs, bbennett, ccoleman, deads, decarr, eparis, jeder, jokerman, mmccomas, wmeng, xtian |
| Target Milestone: | --- | | |
| Target Release: | 3.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | aos-scalability-36 | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-11-28 21:58:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Meng Bo
2017-06-27 10:00:33 UTC
Best I can tell, this is coming from vendor/k8s.io/kubernetes/staging/src/k8s.io/client-go/rest/request.go: Watch() -> transformResponse() -> newUnstructuredResponseError() -> NewGenericServerResponse(). This is quite bizarre, as the only way we would get "unknown" in the error message is when the apiserver returns a completely unknown HTTP status code.

Meng Bo, does this message appear for any other resources, or just SDN-related ones like hostsubnets and netnamespaces?

Yes. With the openshift-ovs-subnet plugin, only the hostsubnets errors appear; with openshift-ovs-networkpolicy or openshift-ovs-multitenant, both the hostsubnets and the netnamespaces errors are reported. I cannot see any other related errors besides these two resources. The errors appear quite often, about once per second at log level 0.
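[Editor's note] To make the client-go path described above concrete, here is a minimal sketch (not from the bug report, using a hypothetical 500 response) of what newUnstructuredResponseError() hands to NewGenericServerResponse(): when the response body is not readable text, client-go substitutes the literal placeholder "unknown" as the server message, which is how that word ends up in the logged error.

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// Mirror the arguments newUnstructuredResponseError passes when the
	// apiserver returns a 5xx with a body the client cannot interpret.
	hostsubnets := schema.GroupResource{Resource: "hostsubnets"}
	err := apierrors.NewGenericServerResponse(
		500,         // status code taken from the raw HTTP response
		"get",       // watches are issued as HTTP GETs
		hostsubnets, // resource parsed from the request path
		"",          // no individual object name
		"unknown",   // placeholder used when the body is not readable text
		0,           // no Retry-After hint
		true,        // the body was not a machine-readable Status object
	)
	// Prints roughly:
	//   an error on the server ("unknown") has prevented the request
	//   from succeeding (get hostsubnets)
	fmt.Println(err)
}
```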
Try turning up the log level to 8 (--loglevel=8) on the failing process. Only do it briefly, because the output will be very large. We should be able to see the exact endpoint that is failing and the response that is causing it, and we can chase it from there. You're probably not seeing issues because the controller trying to watch is re-listing instead, which results in a working client, but also a very expensive one.

Also, I looked for an example HostSubnet or NetNamespace in the origin source tree to see if I could reproduce locally with just a watch, and I didn't find one.

Even with loglevel=8, I can see only the hostsubnet and netnamespace related errors in the master log.

(In reply to Meng Bo from comment #4)
> Even with loglevel=8, I can see only the hostsubnet and netnamespace related errors in the master log.

With the latest build, 3.6.126.1.

Increasing severity: this happens on multiple clusters and results in a failure to start the cluster. This may be related to a bug Mo was looking at where certain API endpoints dropped out of the API server.

I'm fairly sure this is a problem with how SDN is initialized; I don't think Derek's team has done anything here recently.

This could be the master crashing when that endpoint is hit, which causes the client to see a blank error.

```
Jun 29 14:15:12 ci-claytontest-ig-m-8281 origin-node[24871]: W0629 14:15:12.889087 24871 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Jun 29 14:15:12 ci-claytontest-ig-m-8281 origin-node[24871]: E0629 14:15:12.889261 24871 kubelet.go:2072] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Jun 29 14:15:13 ci-claytontest-ig-m-8281 origin-node[24871]: W0629 14:15:13.366288 24871 sdn_controller.go:38] Could not find an allocated subnet for node: ci-claytontest-ig-m-8281, Waiting...
Jun 29 14:15:17 ci-claytontest-ig-m-8281 origin-node[24871]: W0629 14:15:17.950897 24871 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Jun 29 14:15:17 ci-claytontest-ig-m-8281 origin-node[24871]: E0629 14:15:17.951061 24871 kubelet.go:2072] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Jun 29 14:15:19 ci-claytontest-ig-m-8281 origin-node[24871]: W0629 14:15:19.768979 24871 sdn_controller.go:38] Could not find an allocated subnet for node: ci-claytontest-ig-m-8281, Waiting...
Jun 29 14:15:19 ci-claytontest-ig-m-8281 origin-node[24871]: F0629 14:15:19.769021 24871 node.go:305] error: SDN node startup failed: failed to get subnet for this host: ci-claytontest-ig-m-8281, error: timed out waiting for the condition
Jun 29 14:15:19 ci-claytontest-ig-m-8281 systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
```

From that cluster, it appears that this is a 403:

```
sudo oc get --raw '/oapi/v1/hostsubnets?resourceVersion=2231&timeoutSeconds=554&watch=true' --as=system:serviceaccount:openshift-infra:sdn-controller
Error from server (Forbidden): User "system:serviceaccount:openshift-infra:sdn-controller" cannot watch all hostsubnets in the cluster
```

How are networking tests even working? The controller has no permissions.

Fix in https://github.com/openshift/origin/pull/14968. Because the controller doesn't have permission to watch, all resources are retrieved only via list. The behavior when watch is denied is to simply retry the list, which runs on a one-second period by default. Not sure how that leads to this race, but it does.

Checked on OCP version 3.6.133. No such errors in the master log.

```
Jul 05 18:34:31 ose-master.bmeng.local atomic-openshift-master[11624]: I0705 18:34:31.064445 11624 rest.go:324] Starting watch for /oapi/v1/netnamespaces, rv=280 labels= fields= timeout=7m33s
Jul 05 18:35:54 ose-master.bmeng.local atomic-openshift-master[11624]: I0705 18:35:54.476359 11624 rest.go:324] Starting watch for /oapi/v1/hostsubnets, rv=852 labels= fields= timeout=7m18s
```

Verifying the bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
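[Editor's note] For anyone retracing the 403 diagnosis above, the permission that the `oc get --raw ... --as=...` impersonation command probes can also be checked programmatically with a SubjectAccessReview. The following is a minimal sketch, not part of the original fix: it assumes cluster-admin credentials in the default kubeconfig and uses the modern network.openshift.io API group (in 3.6 these objects were served from the ungrouped legacy /oapi/v1 endpoint).

```go
package main

import (
	"context"
	"fmt"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load credentials from the default kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Ask the apiserver whether the sdn-controller service account may
	// watch hostsubnets cluster-wide -- the access that was missing here.
	review := &authorizationv1.SubjectAccessReview{
		Spec: authorizationv1.SubjectAccessReviewSpec{
			User: "system:serviceaccount:openshift-infra:sdn-controller",
			ResourceAttributes: &authorizationv1.ResourceAttributes{
				Verb:     "watch",
				Group:    "network.openshift.io", // assumption: grouped API; 3.6 used legacy /oapi/v1
				Resource: "hostsubnets",
			},
		},
	}
	result, err := client.AuthorizationV1().SubjectAccessReviews().Create(
		context.TODO(), review, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	// Before the fix in origin PR 14968 this would report allowed=false,
	// matching the Forbidden error shown above.
	fmt.Printf("allowed=%v reason=%q\n", result.Status.Allowed, result.Status.Reason)
}
```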