Bug 1421401 - oc get commands on large numbers (thousands) of resources take minutes on 3.5
Summary: oc get commands on large numbers (thousands) of resources take minutes on 3.5
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: David Eads
QA Contact: Mike Fiedler
URL:
Whiteboard: aos-scalability-35
Depends On:
Blocks:
 
Reported: 2017-02-12 01:35 UTC by Mike Fiedler
Modified: 2017-03-02 22:19 UTC (History)
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-02 22:19:03 UTC
Target Upstream Version:


Attachments
partial log of oc get secrets with loglevel=6 (211.82 KB, text/plain)
2017-02-12 01:35 UTC, Mike Fiedler

Description Mike Fiedler 2017-02-12 01:35:33 UTC
Created attachment 1249446 [details]
partial log of oc get secrets with loglevel=6

Description of problem:

In OCP 3.5, in a large cluster (300 nodes, 1000 projects, 5000 pods, 7000 builds, 30000 secrets, etc.), oc get commands on resources such as these take many minutes. The commands do eventually complete successfully.

5000 pods: 6m 26s
7000 builds: 5m 31s
30000 secrets: 44m 39s

In 3.4, these commands took seconds.

Note: strangely, oc get for these resources with -w (watch) returns the full list very quickly, on the order of what I expect.
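
A rough way to time the comparison above (a sketch; the redirects and resource choices are illustrative, not from the test cluster):

  time oc get pods --all-namespaces > /dev/null      # minutes on 3.5
  time oc get secrets --all-namespaces > /dev/null   # tens of minutes with 30000 secrets

  # The same listing via a watch returns almost immediately; interrupt once the list is printed.
  oc get secrets --all-namespaces -w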

Running oc get with --loglevel=6 shows a repeated pattern of "too many open files" / "failed to write to cache" errors (see Additional info below).
 
Version:  3.5.0.17


How reproducible: Always, in a cluster with large numbers of resources


Steps to Reproduce:
1. Create a cluster with 1000 projects and create 30 secrets in each project
2. oc get secrets --all-namespaces
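
A minimal shell sketch of the steps above (project and secret names are illustrative, not taken from the test cluster):

  for i in $(seq 1 1000); do
    oc new-project "scale-${i}" > /dev/null
    for j in $(seq 1 30); do
      oc create secret generic "secret-${j}" --from-literal=key=value -n "scale-${i}" > /dev/null
    done
  done

  oc get secrets --all-namespaces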


Actual results:

Command will take 40+ minutes to run

Expected results:

Command runs in a few seconds

Additional info:

Running with --loglevel=6 shows a repeating pattern of throttling, "failed to write to cache", and "too many open files" messages. See the attached log.
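
A sketch of how the verbose output can be captured (the log path and grep pattern are illustrative, and the exact message wording may differ):

  # verbose client output goes to stderr
  oc get secrets --all-namespaces --loglevel=6 2> /tmp/oc-get-secrets.log
  grep -iE 'too many open files|failed to write|throttl' /tmp/oc-get-secrets.log | head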

Comment 1 Mike Fiedler 2017-02-12 01:36:46 UTC
Setting TestBlocker as this makes large clusters generally unusable for normal testing at this scale.

Comment 2 Mike Fiedler 2017-02-12 01:38:32 UTC
Contact me for access to clusters displaying this behavior.

Comment 3 David Eads 2017-02-13 15:15:07 UTC
opened https://github.com/openshift/origin/pull/12938

Comment 4 Mike Fiedler 2017-02-20 18:13:43 UTC
Verified on 3.5.0.31

Comment 5 Troy Dawson 2017-03-02 22:19:03 UTC
Since this bug never reached customers, I am closing it.

