Created attachment 1249446
partial log of oc get secrets with loglevel=6
Description of problem:
In OCP 3.5, in a large cluster (300 nodes, 1000 projects, 5000 pods, 7000 builds, 30000 secrets, etc.), oc get commands on resources such as these take many minutes. The commands do eventually complete successfully.
5000 pods: 6m 26s
7000 builds: 5m 31s
30000 secrets: 44m 39s
In 3.4 these commands took seconds.
Note: strangely, oc get for these resources with -w to set a watch returns the full list very quickly - on the order of what I expect.
Running oc get with --loglevel 6 shows a repeated pattern of errors with too many files open/failed to write to cache (see below in additional info).
How reproducible: Always in a cluster with large numbers of resources
Steps to Reproduce:
1. Create a cluster with 1000 projects and create 30 secrets in each project
2. oc get secrets --all-namespaces
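The reproduction steps above could be scripted roughly as follows. This is only a sketch: the project prefix `scale-test-`, the secret names, the placeholder key/value, and the `populate_cluster` helper are all assumptions made for illustration, not what was actually used to build the test cluster. Setting `OC=echo` prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: create N projects with M generic secrets in each.
# OC may be overridden (for example OC=echo) to do a dry run.
OC="${OC:-oc}"

populate_cluster() {
  pc_projects="$1"   # number of projects to create
  pc_secrets="$2"    # number of secrets per project
  pc_p=1
  while [ "$pc_p" -le "$pc_projects" ]; do
    # Project name prefix is an assumption for this sketch.
    $OC new-project "scale-test-${pc_p}"
    pc_s=1
    while [ "$pc_s" -le "$pc_secrets" ]; do
      # Secret contents are placeholders; only the object count matters here.
      $OC create secret generic "secret-${pc_s}" \
        -n "scale-test-${pc_p}" --from-literal=key=value
      pc_s=$((pc_s + 1))
    done
    pc_p=$((pc_p + 1))
  done
}

# Example usage against a real cluster (commented out on purpose):
# populate_cluster 1000 30
# time oc get secrets --all-namespaces --loglevel=6 > /dev/null
```

Timing the final `oc get` with `time` gives numbers comparable to the durations listed in the description.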
Actual results:
Command will take 40+ minutes to run.
Expected results:
Command runs in a few seconds.
Additional info:
Running with loglevel=6 shows a repeated throttling/failed to write to cache/too many files open pattern. See the attached log.
Setting TestBlocker as this makes large clusters generally unusable for normal testing at this scale.
Contact me for access to clusters displaying this behavior.
Verified on 184.108.40.206
Since this bug never reached customers, I am closing it.