Description of problem:
In a large cluster, the sdn daemonset can DoS the kube-apiserver with un-paginated LIST calls on high-count resources.

Version-Release number of selected component (if applicable):

How reproducible: NA

Steps to Reproduce: NA

Actual results:
The kube-apiserver and openshift-apiserver in one of the clusters keep restarting without a meaningful error. The cluster is not accessible.

Expected results:
The kube-apiserver and openshift-apiserver should be stable.

Additional info:
Filling in more information on the bug description. Please let me know if I can provide anything else here; I am happy to assist.

Description of problem:
In a large cluster, the sdn daemonset can DoS the kube-apiserver with un-paginated LIST calls on high-count resources.

Version-Release number of selected component (if applicable): 4.8.23

How reproducible: 100%

Steps to Reproduce:
1. Create more than 500 pods, networkpolicies, services, endpoints, netnamespaces, or projects in a project.
2. Restart one or more SDN pods.

Actual results:
Kube-apiserver audit events show that LIST calls on these resources are executed without paging, and are thus querying >500 resources in a single LIST request. Repeated, significantly large list requests (>15k resources) can cause the kube-apiserver, openshift-apiserver, and etcd to consume extremely large amounts of memory, which can lead to other issues.

Expected results:
SDN should make fixed-size LIST calls using pagination to limit memory ballooning on the control plane.

Additional info:
These are the counts of the resources in a cluster environment that are being listed at high frequency when the control plane becomes unstable, which only contributes further to the control-plane instability.

```
$ oc get --raw '/api/v1/endpoints?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=480361449' | jq -s ' .[].items[].metadata.name' | wc
  10694   10694  251354

$ oc get --raw '/apis/network.openshift.io/v1/netnamespaces?resourceVersion=480360525' | jq -s ' .[].items[].metadata.name' | wc
   4984    4984  106907

$ oc get --raw '/apis/network.openshift.io/v1/hostsubnets?resourceVersion=480361230' | jq -s ' .[].items[].metadata.name' | wc
    256     256   10365

$ oc get --raw '/api/v1/pods?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=480361631' | jq -s ' .[].items[].metadata.name' | wc
  18134   18134  586113

$ oc get --raw '/api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=480361112' | jq -s ' .[].items[].metadata.name' | wc
  11012   11012  260087

$ oc get --raw '/apis/networking.k8s.io/v1/networkpolicies?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=480360489' | jq -s ' .[].items[].metadata.name' | wc
  15438   15438  456408
```
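For contrast, here is a minimal sketch of what fixed-size, paginated LIST calls against the same API look like, using the standard `limit` and `continue` query parameters; the page size of 500, the pods path, and the token placeholder are illustrative, not the exact requests SDN would issue.

```
# A hedged sketch: request the first page and capture the continue token from the
# list metadata, then pass the token back on subsequent requests until it is empty.
$ oc get --raw '/api/v1/pods?limit=500' | jq -r '.metadata.continue'
<opaque-continue-token>

# Each follow-up request returns at most 500 items plus the next token (if any).
$ oc get --raw '/api/v1/pods?limit=500&continue=<opaque-continue-token>' | jq '.items | length'
500
```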
@qili Hi Qiujie, could you take a look and help verify this bug during your testing? Thanks.
> @ancollin is there any way to verify if this change has actually helped at all?

I took a look through the audit logs (thank you, Qiujie, for uploading them). I do still see list calls with `limit=500&resourceVersion=0`, but they appear to be followed by a watch request. These look like the consequence of ListWatch, which we can't get around. I looked for the request parameters that I originally filed on (i.e. "labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=480361631") and I do see fewer occurrences of these: only two list calls, for services and endpointslices. Unfortunately both of these also carry the page-negating `resourceVersion=0`, so I gather these are also the initial List of the ListWatch.

To restate the problem: the large un-paginated list calls only come after the API has already become unstable, and they only generate additional load on already-burdened API servers. Even though these un-paginated calls only happen under certain conditions, I believe the times when those conditions are met are precisely when the paginated calls are needed most. I will not know whether these changes make a difference in the customer environment until they are released in a z-stream, but based on these results I do expect the un-paginated list calls to continue to be disruptive.

I do see pagination being used in two calls, evident from the subsequent "continue" calls (resources: namespaces and netnamespaces, time: 2022-04-26T09:18:03), so I believe you have done as much as you can from the sdn side, and the rest is chasing down client-go (as you said). Thank you for your help in bringing this to a close. If there is some way to use this as data to support removing un-paginated ListWatch calls, to improve API stability on large clusters, or to back similar client-go bugs, I am all for that; please let me know how best you think to approach those maintainers.
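For anyone repeating this audit-log check, a rough sketch of how the un-paginated list calls can be spotted follows; the node-name placeholder and the jq filter are assumptions for illustration, not the exact queries used above.

```
# A sketch only: stream the kube-apiserver audit log from one control-plane node and
# keep list requests whose URI carries no limit parameter (i.e. un-paginated LISTs).
# <master-node-name> is a placeholder for an actual control-plane node name.
$ oc adm node-logs <master-node-name> --path=kube-apiserver/audit.log \
    | jq -c 'select(.verb == "list" and (.requestURI | contains("limit=") | not))
             | {requestURI, user: .user.username}'
```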
@jcaamano I didn't see where I can remove FailedQA.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.13.0 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:1326