Bug 1911782 - Descheduler should not evict pods that use local storage through a PVC
Summary: Descheduler should not evict pods that use local storage through a PVC
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Mike Dame
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-31 07:17 UTC by zhou ying
Modified: 2021-02-24 15:49 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:49:31 UTC
Target Upstream Version:
Embargoed:




Links
- GitHub openshift/cluster-kube-descheduler-operator pull 168 (closed): Bug 1911782: ignore PVC pods by default (last updated 2021-02-02 01:04:58 UTC)
- GitHub openshift/descheduler pull 53 (open): Bug 1911782: pull upstream master branch (last updated 2021-01-29 16:35:00 UTC)
- Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:49:54 UTC)

Description zhou ying 2020-12-31 07:17:15 UTC
Description of problem:
The descheduler should not evict pods that use local storage through a PVC (here, a PVC bound to a hostPath PV).

Version-Release number of selected component (if applicable):
[root@dhcp-140-138 ~]# oc get csv
NAME                                                   DISPLAY                     VERSION                 REPLACES   PHASE
clusterkubedescheduleroperator.4.7.0-202012260223.p0   Kube Descheduler Operator   4.7.0-202012260223.p0              Succeeded

How reproducible:
always

Steps to Reproduce:
1) Run the command below on one of the worker nodes:
# sudo mkdir /mnt/data
2) Create an index.html file in /mnt/data with the content below (one way to do this is sketched after this step):
Hello from Kubernetes storage
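
A minimal sketch of step 2, assuming direct shell access to the worker node; the path matches the hostPath directory created in step 1:
# sudo sh -c 'echo "Hello from Kubernetes storage" > /mnt/data/index.html'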

3) Create a hostPath persistent volume using the following manifest (apply it as sketched after the manifest).

apiVersion: v1
kind: PersistentVolume
metadata:
  name: task-pv-volume
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/data
4) Run the command below to check that the PV was created:
# oc get pv
NAME             CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS      CLAIM     STORAGECLASS   REASON    AGE
task-pv-volume   10Gi       RWO           Retain          Available             manual                   4s

5) Create a persistent volume claim using the following manifest (applied as sketched below).

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: task-pv-claim
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
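
As with the PV, the claim can be applied from a file (pvc.yaml is an assumed name):
# oc create -f pvc.yaml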

6) Run the command below to check that the PVC was created and bound to the PV:

# oc get pv task-pv-volume

NAME             CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS   CLAIM                   STORAGECLASS   REASON    AGE
task-pv-volume   10Gi       RWO           Retain          Bound    default/task-pv-claim   manual                   2m

7) Now create an RC that uses this PVC, as below (applied as sketched after the manifest).

[ramakasturinarra@dhcp35-60 ocp_files]$ cat rc.yaml 
apiVersion: v1
kind: ReplicationController
metadata:
  name: rcex
spec:
  replicas: 3
  selector:
    app: sise
  template:
    metadata:
      name: somename
      labels:
        app: sise
    spec:
      containers:
      - name: sise
        image: quay.io/openshifttest/hello-openshift@sha256:aaea76ff622d2f8bcb32e538e7b3cd0ef6d291953f3e7c9f556c1ba5baf47e2e
        ports:
        - containerPort: 9876
        volumeMounts:
        - mountPath: /tmp
          name: task-pv-storage
      volumes:
        - name: task-pv-storage
          persistentVolumeClaim:
            claimName: task-pv-claim
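
The RC can then be created from the rc.yaml file shown above:
# oc create -f rc.yaml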

Actual results:
[zhouying@dhcp-140-138 ~]$ oc get po -o wide
rcex-22pxc                                  1/1     Running            0          20m   10.128.2.173   yinzhou-1230-2gh6m-compute-1   <none>           <none>
rcex-cgbw2                                  1/1     Running            0          48s   10.128.2.212   yinzhou-1230-2gh6m-compute-1   <none>           <none>
rcex-tcz28                                  1/1     Running            0          48s   10.128.2.213   yinzhou-1230-2gh6m-compute-1   <none>           <none>


Checking the descheduler logs shows that pods belonging to the RC that uses the local storage are always evicted:
I1231 06:11:04.120738       1 evictions.go:117] "Evicted pod" pod="zhouy/rcex-g9t7c" reason=" (RemoveDuplicatePods)"
I1231 06:11:04.121133       1 event.go:291] "Event occurred" object="zhouy/rcex-g9t7c" kind="Pod" apiVersion="v1" type="Normal" reason="Descheduled" message="pod evicted by sigs.k8s.io/descheduler (RemoveDuplicatePods)"
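
For reference, a sketch of how these operand logs can be pulled; the deployment name "cluster" is inferred from the operand pod names that appear later in this bug:
# oc logs deployment/cluster -n openshift-kube-descheduler-operator | grep Evicted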

Expected results:
A PVC bound to a hostPath PV is also a type of local storage, so the descheduler should not evict this type of pod.

Additional info:

Comment 1 Mike Dame 2021-01-05 16:11:44 UTC
Upstream issue for this: https://github.com/kubernetes-sigs/descheduler/issues/96

Comment 2 Mike Dame 2021-01-11 16:25:04 UTC
Upstream PR: https://github.com/kubernetes-sigs/descheduler/pull/481

Comment 3 Mike Dame 2021-01-27 15:10:18 UTC
Opened https://github.com/openshift/descheduler/pull/53 to add this to our descheduler fork.

However, for consistency this is opt-in (with the default behavior still being to evict PVC pods). So we will need to update our descheduler operator to either provide this option or enable it by default.

Jan/Maciej, what do you think? I would be fine just exposing it as a bool in the CRD since it is a component-level setting that would affect all profiles. Taking the opinionated approach of enabling it by default is a break from the current behavior.

Comment 4 Jan Chaloupka 2021-01-27 15:24:04 UTC
The less the better. Let's make it true by default. It's easier to change it to false or make it configurable later rather than have it on from the start.

Comment 5 Mike Dame 2021-01-27 15:26:14 UTC
Ah right, I was thinking we had already GA'd the descheduler with the current behavior, but 4.7 is not released yet. In that case we can make it true by default.

Comment 7 RamaKasturi 2021-01-29 12:40:29 UTC
Still seeing the issue with the CSV below; will try again on Monday.

[knarra@knarra ~]$ oc get csv -n openshift-kube-descheduler-operator
NAME                                                   DISPLAY                     VERSION                 REPLACES   PHASE
clusterkubedescheduleroperator.4.7.0-202101281146.p0   Kube Descheduler Operator   4.7.0-202101281146.p0              Succeeded

I0129 12:38:46.859836       1 duplicates.go:189] "Average occurrence per node" node="ip-10-0-152-104.us-east-2.compute.internal" ownerKey={namespace:knarra kind:ReplicationController name:rcex imagesHash:quay.io/openshifttest/hello-openshift@sha256:aaea76ff622d2f8bcb32e538e7b3cd0ef6d291953f3e7c9f556c1ba5baf47e2e} avg=1
I0129 12:38:46.872496       1 evictions.go:117] "Evicted pod" pod="knarra/rcex-p4c75" reason=" (RemoveDuplicatePods)"
I0129 12:38:46.882118       1 evictions.go:117] "Evicted pod" pod="knarra/rcex-vljvz" reason=" (RemoveDuplicatePods)"

Comment 8 Mike Dame 2021-01-29 15:14:13 UTC
Could you share the output for `oc get -o yaml cm/cluster` too? That will tell us if the default setting is being properly set in the operator

Comment 9 Mike Dame 2021-01-29 16:36:06 UTC
The shared configmap data shows it's being set:
```
[knarra@knarra ~]$ oc get -o yaml cm/cluster -n openshift-kube-descheduler-operator
apiVersion: v1
data:
  policy.yaml: |
    apiVersion: descheduler/v1alpha1
    ignorePvcPods: true
    kind: DeschedulerPolicy
...
```

The reason this isn't working yet is that https://github.com/openshift/descheduler/pull/53 (which pulls the actual descheduler change into our fork) never got linked to this bug. That's been updated, and once it merges and a new descheduler build happens we should attempt to re-verify.

Comment 11 RamaKasturi 2021-02-02 15:21:54 UTC
Verified bug with the payload below and did not see the issue happening.

[knarra@knarra ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-02-01-232332   True        False         146m    Cluster version is 4.7.0-0.nightly-2021-02-01-232332

[knarra@knarra ~]$ oc get csv -n openshift-kube-descheduler-operator
NAME                                                   DISPLAY                     VERSION                 REPLACES   PHASE
clusterkubedescheduleroperator.4.7.0-202101300133.p0   Kube Descheduler Operator   4.7.0-202101300133.p0              Succeeded

[knarra@knarra ~]$ oc get -o yaml cm/cluster -n openshift-kube-descheduler-operator
apiVersion: v1
data:
  policy.yaml: |
    apiVersion: descheduler/v1alpha1
    ignorePvcPods: true
    kind: DeschedulerPolicy
    strategies:
      RemoveDuplicates:
        enabled: true
        params:
          includeSoftConstraints: false
          namespaces:
            exclude:

[knarra@knarra ~]$ oc get pods -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
rcex-5c7px   1/1     Running   0          77m   10.131.0.58   ip-10-0-160-222.us-east-2.compute.internal   <none>           <none>
rcex-lznkn   1/1     Running   0          77m   10.129.2.55   ip-10-0-205-212.us-east-2.compute.internal   <none>           <none>
rcex-wqvpm   1/1     Running   0          77m   10.128.2.40   ip-10-0-146-72.us-east-2.compute.internal    <none>           <none>

[knarra@knarra ~]$ oc logs cluster-69866c5699-8jsxt -n openshift-kube-descheduler-operator | grep "Evicted"
[knarra@knarra ~]$

Based on the above moving bug to verified state.

Comment 14 errata-xmlrpc 2021-02-24 15:49:31 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

