Bug 1465361
| Summary: | Failed to watch networking object api errors appear in the master log | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Meng Bo <bmeng> |
| Component: | Node | Assignee: | Clayton Coleman <ccoleman> |
| Status: | CLOSED ERRATA | QA Contact: | Meng Bo <bmeng> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.6.0 | CC: | aos-bugs, bbennett, ccoleman, deads, decarr, eparis, jeder, jokerman, mmccomas, wmeng, xtian |
| Target Milestone: | --- | | |
| Target Release: | 3.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | aos-scalability-36 | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-11-28 21:58:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Meng Bo
2017-06-27 10:00:33 UTC
Best I can tell, this is coming from vendor/k8s.io/kubernetes/staging/src/k8s.io/client-go/rest/request.go: Watch() -> transformResponse() -> newUnstructuredResponseError() -> NewGenericServerResponse(). This is quite bizarre, as the only way we would get "unknown" in the error message is when the apiserver returns a completely unknown HTTP status code.

Meng Bo, does this message appear for any other resources, or just SDN-related ones like hostsubnets and netnamespaces?

Yes. With the openshift-ovs-subnet plugin, only the hostsubnets errors appear; with openshift-ovs-networkpolicy or openshift-ovs-multitenant, both the hostsubnets and the netnamespaces errors are reported. I cannot see any other related errors besides these two resources. The errors appear quite often, about once per second at log level 0.
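[Editor's note] To make the client-go path described above concrete, here is a minimal sketch (not from the bug report, using a hypothetical 500 response) of what newUnstructuredResponseError() hands to NewGenericServerResponse(): when the response body is not readable text, client-go substitutes the literal placeholder "unknown" as the server message, which is how that word ends up in the logged error.

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// Mirror the arguments newUnstructuredResponseError passes when the
	// apiserver returns a 5xx with a body the client cannot interpret.
	hostsubnets := schema.GroupResource{Resource: "hostsubnets"}
	err := apierrors.NewGenericServerResponse(
		500,         // status code taken from the raw HTTP response
		"get",       // watches are issued as HTTP GETs
		hostsubnets, // resource parsed from the request path
		"",          // no individual object name
		"unknown",   // placeholder used when the body is not readable text
		0,           // no Retry-After hint
		true,        // the body was not a machine-readable Status object
	)
	// Prints roughly:
	//   an error on the server ("unknown") has prevented the request
	//   from succeeding (get hostsubnets)
	fmt.Println(err)
}
```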
Try turning up the log level to 8 (--loglevel=8) on the failing process. Only do it briefly, because the output will be very large. We should be able to see the exact endpoint that is failing and the response that is causing it, and we can chase it from there. You're probably not seeing issues because the controller trying to watch is re-listing instead, which results in a working client, but also a very expensive one.

Also, I looked for an example HostSubnet or NetNamespace in the origin source tree to see if I could reproduce locally with just a watch, and I didn't find one.

Even with loglevel=8, I can see only the hostsubnet and netnamespace related errors in the master log.

(In reply to Meng Bo from comment #4)
> Even with loglevel=8, I can see only the hostsubnet and netnamespace related errors in the master log.

With the latest build, 3.6.126.1.

Increasing severity: this happens on multiple clusters and results in a failure to start the cluster. This may be related to a bug Mo was looking at where certain API endpoints dropped out of the API server.

I'm fairly sure this is a problem with how SDN is initialized; I don't think Derek's team has done anything here recently.

This could be the master crashing when that endpoint is hit, which causes the client to see a blank error.

```
Jun 29 14:15:12 ci-claytontest-ig-m-8281 origin-node[24871]: W0629 14:15:12.889087 24871 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Jun 29 14:15:12 ci-claytontest-ig-m-8281 origin-node[24871]: E0629 14:15:12.889261 24871 kubelet.go:2072] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Jun 29 14:15:13 ci-claytontest-ig-m-8281 origin-node[24871]: W0629 14:15:13.366288 24871 sdn_controller.go:38] Could not find an allocated subnet for node: ci-claytontest-ig-m-8281, Waiting...
Jun 29 14:15:17 ci-claytontest-ig-m-8281 origin-node[24871]: W0629 14:15:17.950897 24871 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Jun 29 14:15:17 ci-claytontest-ig-m-8281 origin-node[24871]: E0629 14:15:17.951061 24871 kubelet.go:2072] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Jun 29 14:15:19 ci-claytontest-ig-m-8281 origin-node[24871]: W0629 14:15:19.768979 24871 sdn_controller.go:38] Could not find an allocated subnet for node: ci-claytontest-ig-m-8281, Waiting...
Jun 29 14:15:19 ci-claytontest-ig-m-8281 origin-node[24871]: F0629 14:15:19.769021 24871 node.go:305] error: SDN node startup failed: failed to get subnet for this host: ci-claytontest-ig-m-8281, error: timed out waiting for the condition
Jun 29 14:15:19 ci-claytontest-ig-m-8281 systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
```

From that cluster, it appears that this is a 403:

```
sudo oc get --raw '/oapi/v1/hostsubnets?resourceVersion=2231&timeoutSeconds=554&watch=true' --as=system:serviceaccount:openshift-infra:sdn-controller
Error from server (Forbidden): User "system:serviceaccount:openshift-infra:sdn-controller" cannot watch all hostsubnets in the cluster
```

How are networking tests even working? The controller has no permissions.

Fix in https://github.com/openshift/origin/pull/14968. Because the controller doesn't have permission to watch, all resources are retrieved only via list. The behavior when watch is denied is to simply retry the list, which runs on a one-second period by default. Not sure how that leads to this race, but it does.

Checked on OCP version 3.6.133. No such errors in the master log.

```
Jul 05 18:34:31 ose-master.bmeng.local atomic-openshift-master[11624]: I0705 18:34:31.064445 11624 rest.go:324] Starting watch for /oapi/v1/netnamespaces, rv=280 labels= fields= timeout=7m33s
Jul 05 18:35:54 ose-master.bmeng.local atomic-openshift-master[11624]: I0705 18:35:54.476359 11624 rest.go:324] Starting watch for /oapi/v1/hostsubnets, rv=852 labels= fields= timeout=7m18s
```

Verifying the bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
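[Editor's note] For anyone retracing the 403 diagnosis above, the permission that the `oc get --raw ... --as=...` impersonation command probes can also be checked programmatically with a SubjectAccessReview. The following is a minimal sketch, not part of the original fix: it assumes cluster-admin credentials in the default kubeconfig and uses the modern network.openshift.io API group (in 3.6 these objects were served from the ungrouped legacy /oapi/v1 endpoint).

```go
package main

import (
	"context"
	"fmt"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load credentials from the default kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Ask the apiserver whether the sdn-controller service account may
	// watch hostsubnets cluster-wide -- the access that was missing here.
	review := &authorizationv1.SubjectAccessReview{
		Spec: authorizationv1.SubjectAccessReviewSpec{
			User: "system:serviceaccount:openshift-infra:sdn-controller",
			ResourceAttributes: &authorizationv1.ResourceAttributes{
				Verb:     "watch",
				Group:    "network.openshift.io", // assumption: grouped API; 3.6 used legacy /oapi/v1
				Resource: "hostsubnets",
			},
		},
	}
	result, err := client.AuthorizationV1().SubjectAccessReviews().Create(
		context.TODO(), review, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	// Before the fix in origin PR 14968 this would report allowed=false,
	// matching the Forbidden error shown above.
	fmt.Printf("allowed=%v reason=%q\n", result.Status.Allowed, result.Status.Reason)
}
```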