Bug 1418498
| Summary: | Recycling PV stuck at Released state causes master controllers process 100% CPU | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Takayoshi Kimura <tkimura> |
| Component: | Storage | Assignee: | Hemant Kumar <hekumar> |
| Status: | CLOSED ERRATA | QA Contact: | Liang Xia <lxia> |
| Severity: | medium | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.4.0 | CC: | aos-bugs, bchilds, jhou, kurktchiev, mawong, rhowe, tdawson |
| Target Milestone: | --- | | |
| Target Release: | 3.4.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-02-22 18:11:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Doc Text:

Cause: When multiple HostPath volumes with a recycling reclaim policy are created and destroyed at the same time, the recycler pod template is modified in place and reused. Different recyclers therefore share a single recycling template and trample over each other, resulting in unpredictable behavior.

Consequence: Because multiple recyclers overwrite each other's template, they can enter a non-deterministic, error-prone state.

Fix: Each recycler now clones the recycling template and creates its own copy instead of modifying a template shared with other recyclers.

Result: Recyclers no longer trample over each other's state and no longer end up using 100% CPU.
Description
Takayoshi Kimura
2017-02-02 00:40:55 UTC
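For reference, a CPU profile like the one below is typically captured with `go tool pprof` and inspected with its interactive `top` and `tree` commands; the binary path and profiling endpoint in this sketch are assumptions for illustration, not details taken from this report.

```shell
# Hypothetical capture of a 30-second CPU profile from the master controllers
# process, assuming the standard Go net/http/pprof handlers are enabled and
# reachable at this address (binary path and URL are assumptions).
go tool pprof /usr/bin/openshift https://127.0.0.1:8444/debug/pprof/profile
# Then, at the interactive prompt:
#   (pprof) top
#   (pprof) tree
```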
brief output of pprof:
29980ms of 30110ms total (99.57%)
Dropped 91 nodes (cum <= 150.55ms)
Showing top 10 nodes out of 14 (cum >= 29980ms)
flat flat% sum% cum cum%
8870ms 29.46% 29.46% 8870ms 29.46% runtime.cputicks
8590ms 28.53% 57.99% 8590ms 28.53% runtime/internal/atomic.Xchg
3550ms 11.79% 69.78% 7940ms 26.37% runtime.unlock
3020ms 10.03% 79.81% 26550ms 88.18% runtime.chanrecv
2330ms 7.74% 87.55% 29980ms 99.57% github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume.internalRecycleVolumeByWatchingPodUntilCompletion
1690ms 5.61% 93.16% 5890ms 19.56% runtime.lock
1100ms 3.65% 96.81% 27650ms 91.83% runtime.chanrecv1
830ms 2.76% 99.57% 830ms 2.76% runtime.memclr
0 0% 99.57% 29980ms 99.57% github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).reclaimVolume.func1
0 0% 99.57% 29980ms 99.57% github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).recycleVolumeOperation
(pprof) tree
29980ms of 30110ms total (99.57%)
Dropped 91 nodes (cum <= 150.55ms)
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
8870ms 100% | runtime.chanrecv
8870ms 29.46% 29.46% 8870ms 29.46% | runtime.cputicks
----------------------------------------------------------+-------------
4390ms 51.11% | runtime.unlock
4200ms 48.89% | runtime.lock
8590ms 28.53% 57.99% 8590ms 28.53% | runtime/internal/atomic.Xchg
----------------------------------------------------------+-------------
7940ms 100% | runtime.chanrecv
3550ms 11.79% 69.78% 7940ms 26.37% | runtime.unlock
4390ms 100% | runtime/internal/atomic.Xchg
----------------------------------------------------------+-------------
26550ms 100% | runtime.chanrecv1
3020ms 10.03% 79.81% 26550ms 88.18% | runtime.chanrecv
8870ms 37.70% | runtime.cputicks
7940ms 33.74% | runtime.unlock
5890ms 25.03% | runtime.lock
830ms 3.53% | runtime.memclr
----------------------------------------------------------+-------------
29980ms 100% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume.RecycleVolumeByWatchingPodUntilCompletion
2330ms 7.74% 87.55% 29980ms 99.57% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume.internalRecycleVolumeByWatchingPodUntilCompletion
27650ms 100% | runtime.chanrecv1
----------------------------------------------------------+-------------
5890ms 100% | runtime.chanrecv
1690ms 5.61% 93.16% 5890ms 19.56% | runtime.lock
4200ms 100% | runtime/internal/atomic.Xchg
----------------------------------------------------------+-------------
27650ms 100% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume.internalRecycleVolumeByWatchingPodUntilCompletion
1100ms 3.65% 96.81% 27650ms 91.83% | runtime.chanrecv1
26550ms 100% | runtime.chanrecv
----------------------------------------------------------+-------------
830ms 100% | runtime.chanrecv
830ms 2.76% 99.57% 830ms 2.76% | runtime.memclr
----------------------------------------------------------+-------------
29980ms 100% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/goroutinemap.(*goRoutineMap).Run.func1
0 0% 99.57% 29980ms 99.57% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).reclaimVolume.func1
29980ms 100% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).recycleVolumeOperation
----------------------------------------------------------+-------------
29980ms 100% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).reclaimVolume.func1
0 0% 99.57% 29980ms 99.57% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).recycleVolumeOperation
29980ms 100% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume/host_path.(*hostPathRecycler).Recycle
----------------------------------------------------------+-------------
29980ms 100% | runtime.goexit
0 0% 99.57% 29980ms 99.57% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/goroutinemap.(*goRoutineMap).Run.func1
29980ms 100% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).reclaimVolume.func1
----------------------------------------------------------+-------------
29980ms 100% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume/host_path.(*hostPathRecycler).Recycle
0 0% 99.57% 29980ms 99.57% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume.RecycleVolumeByWatchingPodUntilCompletion
29980ms 100% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume.internalRecycleVolumeByWatchingPodUntilCompletion
----------------------------------------------------------+-------------
29980ms 100% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).recycleVolumeOperation
0 0% 99.57% 29980ms 99.57% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume/host_path.(*hostPathRecycler).Recycle
29980ms 100% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume.RecycleVolumeByWatchingPodUntilCompletion
----------------------------------------------------------+-------------
0 0% 99.57% 30060ms 99.83% | runtime.goexit
29980ms 100% | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/goroutinemap.(*goRoutineMap).Run.func1
----------------------------------------------------------+-------------
OpenShift 3.4.0.39-1
Steps to Reproduce:
1. Create a new project.
2. Create an application with persistent volume claims mounted to a pod.
3. Confirm the PVCs are bound to PVs.
4. Delete the project with `oc delete project`.
   - Perform the delete while pods are running and the volumes mounted to the pods are active (see the example loop after this list).
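A minimal sketch of this loop using standard `oc` commands; the project names and the `mysql-persistent` template with its parameters are placeholders chosen for illustration, not details from this report, and the sketch assumes recyclable PVs already exist in the cluster.

```shell
# Hypothetical reproduction loop (project names and template are placeholders).
# Assumes PVs with persistentVolumeReclaimPolicy: Recycle are already available.
for i in $(seq 1 10); do
  oc new-project "recycle-test-$i"
  # Any application that mounts a PVC will do; mysql-persistent is one example.
  oc new-app --template=mysql-persistent \
    -p MYSQL_USER=user -p MYSQL_PASSWORD=pass -p MYSQL_DATABASE=db
  # Give the claim time to bind and the pod time to start and mount the volume.
  sleep 120
  oc get pvc
  # Delete the project while the pod is still running and the volume is mounted.
  oc delete project "recycle-test-$i"
done
```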
Actual results:
- PVs that were bound get stuck in the Released state; the recycler never gets run (a quick way to list the stuck PVs is shown after this list).
- While PVs are stuck in the Released state, the active controller begins to show short spikes in CPU usage. After 4 hours the CPU is constantly at 100% (400% in top, because the host has 4 cores).
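For reference, the stuck volumes can be listed with standard `oc` usage (nothing below is specific to this cluster):

```shell
# PVs whose recycling never completes remain in the Released phase.
oc get pv | grep Released
# CPU usage of the controllers process can then be watched with top on the
# active master while the volumes stay in this state.
```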
Expected results:
- No excessive CPU usage by the master controller process, even when there is a broken recycling PV.
- Successful cleanup of volumes.
MORE INFO:
- This issue was seen in 3.3 but presented differently: the volumes would be stuck in Released, but after a period of time they would become Available, even though the recycler never cleaned them up. The result was that the cluster ended up with a bunch of dirty volumes. Since the PVs became Available, the controller CPU issue never presented itself, most likely because the controller stopped trying to release the volumes.
- The volumes are PVs using host mounts (hostPath), with NFS backing the mount on the host. Each host has the NFS shares mounted.
Example mount:
nfs.example.com:/ose_prod_pv_5gb_storage on /ose_prod_pv_5gb_storage type nfs (rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1.2.3.4,mountvers=3,mountport=635,mountproto=tcp,local_lock=none,addr=1.2.3.4)
Example PV:
# oc get pv pv2d7a49f1 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
  creationTimestamp: 2016-11-09T22:46:07Z
  labels:
    size: 2Gi
    type: localnfs
  name: pv2d7a49f1
  resourceVersion: "21609206"
  selfLink: /api/v1/persistentvolumes/pv2d7a49f1
  uid: 4da1f7f2-a6ce-11e6-817a-00505691bba9
spec:
  accessModes:
  - ReadWriteOnce
  - ReadWriteMany
  capacity:
    storage: 2Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: mysql
    namespace: boris
    resourceVersion: "21606330"
    uid: ab6b72a5-e95a-11e6-955c-00505691bba9
  hostPath:
    path: /ose_prod_pv_2gb_storage/pv2d7a49f1/
  persistentVolumeReclaimPolicy: Recycle
status:
  phase: Released
This has been merged into OCP and is in OCP v3.4.1.7 or newer.

Did not see a PV stuck in Released status, and CPU usage is normal.

# openshift version
openshift v3.4.1.7
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Steps: Create 100 recycling PVs. In a loop, create a project, create a PVC, create a pod, create a file in the mounted path within the pod, then delete the project. After several hours of running, no PV was stuck and CPU usage was normal.

Just so I ask for my peace of mind: this bug generally seems to come about with NFS PVs. Did the QA test run against NFS PVs?

This bug wasn't caused by NFS PVs per se, but the user was using NFS-mounted partitions as HostPath PVs, and the PVs shared a recycler (and ended up trampling over each other).

OK, looks like the test was run against the other bug that had the same symptoms (https://bugzilla.redhat.com/show_bug.cgi?id=1415624), so I think my concerns have been addressed. Thanks guys!

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0289