Description of problem:
There is a PV with the Recycle reclaim policy that got stuck in the Released status for some reason. With this broken PV, the master controllers process climbs to 100% CPU after about 2 hours. We took a pprof profile and it shows the PV recycler code is consuming the CPU. The broken PV has already been removed and the high CPU usage is resolved, but we want a root cause analysis for the high CPU usage. We have a sosreport and pprof data captured during the issue.

Version-Release number of selected component (if applicable):
atomic-openshift-3.4.0.39-1.git.0.5f32f06.el7.x86_64

How reproducible:
Only once, and we are not sure how to reproduce it.

Steps to Reproduce:
1.
2.
3.

Actual results:
The master controller process uses 100% CPU while the broken recycling PV exists.

Expected results:
No excessive CPU usage by the master controller process, even when there is a broken recycling PV.

Additional info:
brief output of pprof:

29980ms of 30110ms total (99.57%)
Dropped 91 nodes (cum <= 150.55ms)
Showing top 10 nodes out of 14 (cum >= 29980ms)
      flat  flat%   sum%        cum   cum%
    8870ms 29.46% 29.46%     8870ms 29.46%  runtime.cputicks
    8590ms 28.53% 57.99%     8590ms 28.53%  runtime/internal/atomic.Xchg
    3550ms 11.79% 69.78%     7940ms 26.37%  runtime.unlock
    3020ms 10.03% 79.81%    26550ms 88.18%  runtime.chanrecv
    2330ms  7.74% 87.55%    29980ms 99.57%  github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume.internalRecycleVolumeByWatchingPodUntilCompletion
    1690ms  5.61% 93.16%     5890ms 19.56%  runtime.lock
    1100ms  3.65% 96.81%    27650ms 91.83%  runtime.chanrecv1
     830ms  2.76% 99.57%      830ms  2.76%  runtime.memclr
         0     0% 99.57%    29980ms 99.57%  github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).reclaimVolume.func1
         0     0% 99.57%    29980ms 99.57%  github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).recycleVolumeOperation

(pprof) tree
29980ms of 30110ms total (99.57%)
Dropped 91 nodes (cum <= 150.55ms)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                             8870ms   100% |   runtime.chanrecv
    8870ms 29.46% 29.46%     8870ms 29.46%                 | runtime.cputicks
----------------------------------------------------------+-------------
                                             4390ms 51.11% |   runtime.unlock
                                             4200ms 48.89% |   runtime.lock
    8590ms 28.53% 57.99%     8590ms 28.53%                 | runtime/internal/atomic.Xchg
----------------------------------------------------------+-------------
                                             7940ms   100% |   runtime.chanrecv
    3550ms 11.79% 69.78%     7940ms 26.37%                 | runtime.unlock
                                             4390ms   100% |   runtime/internal/atomic.Xchg
----------------------------------------------------------+-------------
                                            26550ms   100% |   runtime.chanrecv1
    3020ms 10.03% 79.81%    26550ms 88.18%                 | runtime.chanrecv
                                             8870ms 37.70% |   runtime.cputicks
                                             7940ms 33.74% |   runtime.unlock
                                             5890ms 25.03% |   runtime.lock
                                              830ms  3.53% |   runtime.memclr
----------------------------------------------------------+-------------
                                            29980ms   100% |   github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume.RecycleVolumeByWatchingPodUntilCompletion
    2330ms  7.74% 87.55%    29980ms 99.57%                 | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume.internalRecycleVolumeByWatchingPodUntilCompletion
                                            27650ms   100% |   runtime.chanrecv1
----------------------------------------------------------+-------------
                                             5890ms   100% |   runtime.chanrecv
    1690ms  5.61% 93.16%     5890ms 19.56%                 | runtime.lock
                                             4200ms   100% |   runtime/internal/atomic.Xchg
----------------------------------------------------------+-------------
                                            27650ms   100% |   github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume.internalRecycleVolumeByWatchingPodUntilCompletion
    1100ms  3.65% 96.81%    27650ms 91.83%                 | runtime.chanrecv1
                                            26550ms   100% |   runtime.chanrecv
----------------------------------------------------------+-------------
                                              830ms   100% |   runtime.chanrecv
     830ms  2.76% 99.57%      830ms  2.76%                 | runtime.memclr
----------------------------------------------------------+-------------
                                            29980ms   100% |   github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/goroutinemap.(*goRoutineMap).Run.func1
         0     0% 99.57%    29980ms 99.57%                 | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).reclaimVolume.func1
                                            29980ms   100% |   github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).recycleVolumeOperation
----------------------------------------------------------+-------------
                                            29980ms   100% |   github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).reclaimVolume.func1
         0     0% 99.57%    29980ms 99.57%                 | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).recycleVolumeOperation
                                            29980ms   100% |   github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume/host_path.(*hostPathRecycler).Recycle
----------------------------------------------------------+-------------
                                            29980ms   100% |   runtime.goexit
         0     0% 99.57%    29980ms 99.57%                 | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/goroutinemap.(*goRoutineMap).Run.func1
                                            29980ms   100% |   github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).reclaimVolume.func1
----------------------------------------------------------+-------------
                                            29980ms   100% |   github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume/host_path.(*hostPathRecycler).Recycle
         0     0% 99.57%    29980ms 99.57%                 | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume.RecycleVolumeByWatchingPodUntilCompletion
                                            29980ms   100% |   github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume.internalRecycleVolumeByWatchingPodUntilCompletion
----------------------------------------------------------+-------------
                                            29980ms   100% |   github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).recycleVolumeOperation
         0     0% 99.57%    29980ms 99.57%                 | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume/host_path.(*hostPathRecycler).Recycle
                                            29980ms   100% |   github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/volume.RecycleVolumeByWatchingPodUntilCompletion
----------------------------------------------------------+-------------
         0     0% 99.57%    30060ms 99.83%                 | runtime.goexit
                                            29980ms   100% |   github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/goroutinemap.(*goRoutineMap).Run.func1
----------------------------------------------------------+-------------
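For the RCA, the profile itself is fairly telling: essentially all samples land inside internalRecycleVolumeByWatchingPodUntilCompletion, and the time is spent in runtime.chanrecv1/chanrecv plus the lock/cputicks work a channel receive performs. runtime.chanrecv1 is the runtime helper behind a plain `v := <-ch` receive (no ok flag), so this looks like a goroutine receiving from a closed channel in a tight loop. The snippet below is not the OpenShift/Kubernetes source, only a minimal Go sketch of that pattern with hypothetical names (watchPodEvents, podEvent), to show why such a loop pins a core and how checking the ok flag avoids it:

// Minimal sketch (not the upstream code): a recycle-style wait loop that
// receives recycler pod events from a channel. All names are hypothetical.
package main

import "fmt"

type podEvent struct{ phase string }

// watchPodEvents stands in for the recycler's pod watch; here the watch
// terminates immediately and closes the channel to simulate the failure mode.
func watchPodEvents() <-chan podEvent {
	ch := make(chan podEvent)
	close(ch)
	return ch
}

func main() {
	events := watchPodEvents()
	for {
		// A plain `ev := <-events` (what runtime.chanrecv1 serves) returns the
		// zero value immediately once the channel is closed, so a loop written
		// that way spins at full speed -- matching the 100% CPU in the profile.
		ev, ok := <-events
		if !ok {
			fmt.Println("event channel closed; stop waiting for the recycler pod")
			return
		}
		fmt.Println("recycler pod phase:", ev.phase)
	}
}

With the ok check (or a select on a separate stop channel) the loop exits once the watch goes away instead of spinning forever.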
OpenShift 3.4.0.39-1

Steps to Reproduce:
1. Create a new project
2. Create an application with persistent volume claims mounted to a pod
3. Confirm the PVCs are bound to PVs
4. Delete the project (`oc delete project`) - perform the delete while the pods are running and the volumes mounted to the pods are active

Actual results:
- PVs that were bound get stuck in the Released state; the recycler never runs.
- While PVs are stuck in the Released state, the active controller begins to show short spikes in CPU. After 4 hours the CPU is constantly at 100% (400% in top because the host has 4 cores).

Expected results:
- No excessive CPU usage by the master controller process, even when there is a broken recycling PV.
- Successful cleanup of volumes.

MORE INFO:
- This issue was seen in 3.3 but presented differently: the volumes would get stuck in Released, but after a period of time they would become Available again, even though the recycler never cleaned them up. The result was that the cluster ended up with a bunch of dirty volumes. Since the PVs became Available, the controller CPU issue never presented itself, most likely because the controller stopped trying to release the volumes.
- The volumes are PVs using a host mount, with NFS backing the mount on the host. Each host has the NFS shares mounted. Example mount:

nfs.example.com:/ose_prod_pv_5gb_storage on /ose_prod_pv_5gb_storage type nfs (rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1.2.3.4,mountvers=3,mountport=635,mountproto=tcp,local_lock=none,addr=1.2.3.4)

Example PV:

# oc get pv pv2d7a49f1 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
  creationTimestamp: 2016-11-09T22:46:07Z
  labels:
    size: 2Gi
    type: localnfs
  name: pv2d7a49f1
  resourceVersion: "21609206"
  selfLink: /api/v1/persistentvolumes/pv2d7a49f1
  uid: 4da1f7f2-a6ce-11e6-817a-00505691bba9
spec:
  accessModes:
  - ReadWriteOnce
  - ReadWriteMany
  capacity:
    storage: 2Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: mysql
    namespace: boris
    resourceVersion: "21606330"
    uid: ab6b72a5-e95a-11e6-955c-00505691bba9
  hostPath:
    path: /ose_prod_pv_2gb_storage/pv2d7a49f1/
  persistentVolumeReclaimPolicy: Recycle
status:
  phase: Released
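A quick way to spot PVs sitting in this state (a sketch added here for illustration, not from the original report; it uses the current client-go API and assumes a kubeconfig in the default location):

// Sketch: list PersistentVolumes with a Recycle policy stuck in the Released phase.
// Assumes a kubeconfig at ~/.kube/config; not part of the original report.
package main

import (
	"context"
	"fmt"
	"path/filepath"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pvs, err := clientset.CoreV1().PersistentVolumes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pv := range pvs.Items {
		if pv.Status.Phase == v1.VolumeReleased &&
			pv.Spec.PersistentVolumeReclaimPolicy == v1.PersistentVolumeReclaimRecycle {
			fmt.Printf("PV %s is stuck in Released (reclaim policy Recycle)\n", pv.Name)
		}
	}
}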
The fix has been merged into OCP and is in OCP v3.4.1.7 or newer.
Did not see any PV stuck in Released status, and CPU usage is normal.

# openshift version
openshift v3.4.1.7
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Steps: Create 100 recycling PVs. In a loop: create a project, create a PVC, create a pod, create a file in the mounted path within the pod, then delete the project. After several hours of running, no PV is stuck and CPU usage is normal.
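For reference, the loop above can be scripted; the following is only a rough sketch of one way to drive it through the oc CLI. The manifest file names (pvc.yaml, pod.yaml), the pod name test-pod and the path /mnt/data/testfile are assumptions for illustration, and a real run would also wait for the pod to become Running before writing the file:

// Sketch of the verification loop, driven through the oc CLI via os/exec.
// File names, pod name and mount path below are assumptions, not from the report.
package main

import (
	"fmt"
	"os/exec"
)

func run(args ...string) error {
	out, err := exec.Command("oc", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("oc %v: %v\n%s", args, err, out)
	}
	return nil
}

func main() {
	for i := 0; i < 100; i++ {
		project := fmt.Sprintf("recycle-test-%d", i)
		steps := [][]string{
			{"new-project", project},
			{"create", "-f", "pvc.yaml", "-n", project},
			{"create", "-f", "pod.yaml", "-n", project},
			// In practice, wait here until the pod reports Running before exec'ing.
			{"exec", "test-pod", "-n", project, "--", "touch", "/mnt/data/testfile"},
			{"delete", "project", project},
		}
		for _, s := range steps {
			if err := run(s...); err != nil {
				fmt.Println(err)
				break
			}
		}
	}
}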
Just asking for my peace of mind: this bug generally seems to come about with NFS PVs. Did the QA test run against NFS PVs?
This bug wasn't caused by NFS PVs per se, but the user was using NFS-mounted partitions as HostPath PVs, and they shared a recycler (and ended up trampling over each other).
OK, it looks like the test was run against the other bug that had the same symptoms (https://bugzilla.redhat.com/show_bug.cgi?id=1415624), so I think my concerns have been addressed. Thanks, guys!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0289