Bug 1470009
| Summary: | It takes long time for pre-bind if pvc is created before pv | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Wenqi He <wehe> |
| Component: | Node | Assignee: | Jordan Liggitt <jliggitt> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | DeShuai Ma <dma> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.6.0 | CC: | aos-bugs, eparis, jliggitt, jokerman, jsafrane, mmccomas, wehe, wsun |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-16 19:37:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Wenqi He
2017-07-12 09:16:00 UTC
Reproduced locally. So far I have observed it only on fresh "installations" of locally compiled openshift (tag v3.6.140-1):

    $ rm -rf openshift.local.etcd
    $ openshift start --loglevel=5 |& tee log
    $ oc create -f pvc.yaml   (any PVC is fine as long as it does not trigger dynamic provisioning)
    $ grep synchronizing log
    11:05:47.192983 12268 pv_controller.go:219] synchronizing PersistentVolumeClaim[default/nfsc]: phase: Pending, ...

This line should repeat every 15 seconds. *Sometimes* it happens that it does not repeat at all. A restart of openshift-master usually fixes the problem.

Digging deeper, I noticed these lines in the log:

    I0713 11:47:59.704986 17163 client_builder.go:233] Verified credential for persistent-volume-binder/kube-system
    ...
    W0713 11:47:59.705900 17163 shared_informer.go:298] resyncPeriod 15000000000 is smaller than resyncCheckPeriod 600000000000 and the informer has already started. Changing it to 600000000000
    W0713 11:47:59.705951 17163 shared_informer.go:298] resyncPeriod 15000000000 is smaller than resyncCheckPeriod 600000000000 and the informer has already started. Changing it to 600000000000

This is from PV controller startup, where it calls AddEventHandlerWithResyncPeriod() on the PV and PVC shared informers, requesting a resync every 15 seconds. Instead it gets 600 seconds, i.e. 10 minutes.

There seems to be some race here: the bug is reproduced when OpenShift is busy initializing everything (namespaces, roles, ...) and not reproduced (or much harder to reproduce) when it starts with an already pre-populated etcd.

Bisect revealed the culprit:

    commit f80e65ca2976add1bc994f3b65532e53408ed120 (HEAD, refs/bisect/bad)
    Author: deads2k <deads>
    Date:   Tue Jun 20 09:17:20 2017 -0400

        refactor openshift start to separate controllers and apiserver

Steps to reproduce:
1. rm -rf openshift.local.etcd (this makes the bug more reproducible, as it slows down openshift startup)
2. openshift start --loglevel=5 |& tee log
3. wait ~1 minute
4. grep resyncPeriod log

Actual result (in ~50% of cases):

    shared_informer.go:298] resyncPeriod 300000000000 is smaller than resyncCheckPeriod 600000000000 and the informer has already started. Changing it to 600000000000
    shared_informer.go:298] resyncPeriod 15000000000 is smaller than resyncCheckPeriod 600000000000 and the informer has already started. Changing it to 600000000000
    shared_informer.go:298] resyncPeriod 30000000000 is smaller than resyncCheckPeriod 600000000000 and the informer has already started. Changing it to 600000000000
    shared_informer.go:298] resyncPeriod 120000000000 is smaller than resyncCheckPeriod 600000000000 and the informer has already started. Changing it to 600000000000
    shared_informer.go:298] resyncPeriod 120000000000 is smaller than resyncCheckPeriod 600000000000 and the informer has already started. Changing it to 600000000000

Expected result:

    W0713 12:13:30.171812 20020 shared_informer.go:298] resyncPeriod 120000000000 is smaller than resyncCheckPeriod 600000000000 and the informer has already started. Changing it to 600000000000

[For some reason there was always one complaining informer, even before the patch. It is worth investigating too; maybe it breaks something.]

Looking at the code, Master.Start() starts both the API server and the controller-manager this way:

    if controllersEnabled {
        ...
        go func() {
            controllerPlug.WaitForStart()
            startControllers()
            informers.Start(utilwait.NeverStop)
        }()
    }
    if m.api {
        ...
        informers.Start(utilwait.NeverStop)
    }

So we have two goroutines starting `informers` eventually. Can the "api" one start the informers before the "controller" one initializes everything?
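For illustration, a minimal Go sketch of the resync cap described above. It assumes current client-go package paths rather than the tree vendored in this Origin release; the kubeconfig path is hypothetical and the handler is a stand-in, not the actual PV controller code:

```go
// Sketch of the ordering problem: if the shared informers are started before
// a controller registers its event handler, the requested 15s resync is
// silently bumped to the informer's resyncCheckPeriod.
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // hypothetical path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Factory-wide default resync of 10 minutes, matching the 600000000000 ns
	// resyncCheckPeriod in the warnings above.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	pvcInformer := factory.Core().V1().PersistentVolumeClaims().Informer()

	stop := make(chan struct{})

	// Wrong order, mirroring the "api" goroutine winning the race: the shared
	// informers are started before the controller registers its handlers.
	factory.Start(stop)

	// The controller then asks for a 15-second resync. Because the informer has
	// already started, client-go logs the "resyncPeriod ... is smaller than
	// resyncCheckPeriod ... and the informer has already started" warning and
	// only resyncs this handler every 10 minutes.
	pvcInformer.AddEventHandlerWithResyncPeriod(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// The real PV controller would run its claim sync here.
		},
	}, 15*time.Second)

	// Block so the informer keeps running (illustration only).
	select {}
}
```

Run against any cluster, this logs the same shared_informer.go warning quoted above, and the handler falls back to the 10-minute check period, which is why a pre-bound PVC can stay Pending for up to 10 minutes.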
Can the "api" one start the informers before the "controller" one initializes everything? Opened https://github.com/kubernetes/kubernetes/issues/48941 upstream to fix the PV controller's dependence on a 15 second resync Is this only occurring in an "all-in-one" setup (with controllers and apiserver running in the same process)? I have tried on a HA OCP which means it has controllers and apiserver service run separately. This issue is not repro. version is below: openshift v3.6.151 kubernetes v1.6.1+5115d708d7 # for num in {0..10} ; do oc get pvc ; done pvc-w8bdw Pending 19s NAME STATUS VOLUME CAPACITY ACCESSMODES STORAGECLASS AGE pvc-w8bdw Pending pv-prebind-w8bdw 0 19s NAME STATUS VOLUME CAPACITY ACCESSMODES STORAGECLASS AGE pvc-w8bdw Pending pv-prebind-w8bdw 0 20s NAME STATUS VOLUME CAPACITY ACCESSMODES STORAGECLASS AGE pvc-w8bdw Bound pv-prebind-w8bdw 1Gi RWO 20s NAME STATUS VOLUME CAPACITY ACCESSMODES STORAGECLASS AGE pvc-w8bdw Bound pv-prebind-w8bdw 1Gi RWO 20s NAME STATUS VOLUME CAPACITY ACCESSMODES STORAGECLASS AGE pvc-w8bdw Bound pv-prebind-w8bdw 1Gi RWO 20s NAME STATUS VOLUME CAPACITY ACCESSMODES STORAGECLASS AGE pvc-w8bdw Bound pv-prebind-w8bdw 1Gi RWO 20s NAME STATUS VOLUME CAPACITY ACCESSMODES STORAGECLASS AGE pvc-w8bdw Bound pv-prebind-w8bdw 1Gi RWO 21s > Is this only occurring in an "all-in-one" setup (with controllers and apiserver running in the same process)?
Yes, PV controller is fine in standalone controller-manager process.
On the other hand, I still see two informers started too early on Friday's(?) master:
github.com/openshift/origin/pkg/service/controller/servingcert.NewServiceServingCertController:
resyncPeriod 120000000000 is smaller than resyncCheckPeriod 600000000000 and the informer has already started
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/namespace.NewNamespaceController:
resyncPeriod 300000000000 is smaller than resyncCheckPeriod 600000000000 and the informer has already started.
I am not sure if it has any impact on OpenShift stability; feel free to open a separate bug if you think it's worth fixing.
I'm not concerned about those.

Fixed in 3.6.148-1.

Tested on the version below:

openshift v3.6.153
kubernetes v1.6.1+5115d708d7

This bug is fixed.

This was fixed in 3.6.0.
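For reference, a minimal sketch of the pre-bind scenario exercised above (a PVC created before the PV it names). It assumes recent k8s.io/api and client-go versions; the object names, namespace, hostPath volume, and kubeconfig path are illustrative and not taken from the QA test:

```go
// Sketch: create a PVC pre-bound to a PV that does not exist yet, then create
// the PV and watch the claim phase, similar to the `oc get pvc` loop above.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // hypothetical path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// 1. Create the claim first, pre-bound to a PV that does not exist yet.
	emptyClass := "" // avoid dynamic provisioning via a default StorageClass
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "pvc-prebind"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			VolumeName:       "pv-prebind", // pre-binding: names the PV explicitly
			StorageClassName: &emptyClass,
			Resources: corev1.VolumeResourceRequirements{
				Requests: corev1.ResourceList{corev1.ResourceStorage: resource.MustParse("1Gi")},
			},
		},
	}
	if _, err := client.CoreV1().PersistentVolumeClaims("default").Create(ctx, pvc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// 2. Create the PV afterwards. With the 15s resync working, the claim
	// should leave Pending within seconds; with the bug, it could take up to
	// 10 minutes.
	pv := &corev1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: "pv-prebind"},
		Spec: corev1.PersistentVolumeSpec{
			Capacity:    corev1.ResourceList{corev1.ResourceStorage: resource.MustParse("1Gi")},
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			PersistentVolumeSource: corev1.PersistentVolumeSource{
				HostPath: &corev1.HostPathVolumeSource{Path: "/tmp/pv-prebind"},
			},
		},
	}
	if _, err := client.CoreV1().PersistentVolumes().Create(ctx, pv, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// 3. Poll the claim phase, like the `oc get pvc` loop above.
	for i := 0; i < 30; i++ {
		got, err := client.CoreV1().PersistentVolumeClaims("default").Get(ctx, "pvc-prebind", metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s\t%s\t%s\n", got.Name, got.Status.Phase, got.Spec.VolumeName)
		if got.Status.Phase == corev1.ClaimBound {
			break
		}
		time.Sleep(2 * time.Second)
	}
}
```

With the fix, the claim should move from Pending to Bound within a few iterations of the loop, as in the QA output above; with the bug, it could stay Pending until the 10-minute resync fires.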