Created attachment 1249446
partial log of oc get secrets with loglevel=6

Description of problem:
In OCP 3.5, in a large cluster (300 nodes, 1000 projects, 5000 pods, 7000 builds, 30000 secrets, etc.), oc get commands on resources such as these take many minutes. The commands do eventually complete successfully.

5000 pods: 6m 26s
7000 builds: 5m 31s
30000 secrets: 44m 39s

In 3.4 these commands took seconds.

Note: strangely, oc get for these resources with -w to set a watch returns the full list very quickly, on the order of what I expect.

Running oc get with --loglevel 6 shows a repeated pattern of errors with too many files open/failed to write to cache (see Additional info below).

Version: 3.5.0.17

How reproducible: Always in a cluster with large numbers of resources

Steps to Reproduce:
1. Create a cluster with 1000 projects and create 30 secrets in each project
2. oc get secrets --all-namespaces

Actual results: Command takes 40+ minutes to run

Expected results: Command runs in a few seconds

Additional info: Running with loglevel=6 shows a repeated throttling/failed to write to cache/too many files open pattern. See the attached log.
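For anyone who wants to rebuild a comparable data set, step 1 of the reproduction could be scripted roughly as below. This is only a sketch, not the script used on the original cluster: the function name populate_cluster, the scale-test-N project names, the secret-N secret names and the key=value payload are all made up for illustration, and the OC variable exists only so the oc binary can be swapped out for a dry run.

```shell
#!/bin/sh
# Sketch: create N projects with M generic secrets each.
# All names here (populate_cluster, scale-test-*, secret-*) are hypothetical.

# Binary to invoke; override with e.g. OC=echo for a dry run.
OC="${OC:-oc}"

# populate_cluster PROJECTS SECRETS_PER_PROJECT
# The body runs in a subshell so the loop counters stay local.
populate_cluster() (
  projects="$1"
  secrets_per_project="$2"
  p=1
  while [ "$p" -le "$projects" ]; do
    project="scale-test-${p}"
    # Stop at the first failure rather than continuing on a broken cluster.
    "$OC" new-project "$project" || exit 1
    s=1
    while [ "$s" -le "$secrets_per_project" ]; do
      "$OC" create secret generic "secret-${s}" \
        --from-literal=key=value -n "$project" || exit 1
      s=$((s + 1))
    done
    p=$((p + 1))
  done
)
```

Calling populate_cluster 1000 30 would match the numbers in the report; step 2 is then just timing oc get secrets --all-namespaces, optionally with --loglevel 6 to see the pattern described above. Running with OC=echo first prints the commands without touching a cluster.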
Setting TestBlocker as this makes large clusters generally unusable for normal testing at this scale.
Contact me for access to clusters displaying this behavior.
Opened PR: https://github.com/openshift/origin/pull/12938
Verified on 3.5.0.31
Since this bug never reached customers, I am closing it.