Bug 1911782 - Descheduler should not evict pods that use local storage through a PVC
Summary: Descheduler should not evict pods that use local storage through a PVC
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Mike Dame
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-31 07:17 UTC by zhou ying
Modified: 2021-02-24 15:49 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:49:31 UTC
Target Upstream Version:
Embargoed:




Links
- GitHub openshift/cluster-kube-descheduler-operator pull 168 (closed): Bug 1911782: ignore PVC pods by default (last updated 2021-02-02 01:04:58 UTC)
- GitHub openshift/descheduler pull 53 (open): Bug 1911782: pull upstream master branch (last updated 2021-01-29 16:35:00 UTC)
- Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:49:54 UTC)

Description zhou ying 2020-12-31 07:17:15 UTC
Description of problem:
The descheduler should not evict pods that use local storage through a PVC (here, a PVC bound to a hostPath PV).

Version-Release number of selected component (if applicable):
[root@dhcp-140-138 ~]# oc get csv
NAME                                                   DISPLAY                     VERSION                 REPLACES   PHASE
clusterkubedescheduleroperator.4.7.0-202012260223.p0   Kube Descheduler Operator   4.7.0-202012260223.p0              Succeeded

How reproducible:
always

Steps to Reproduce:
1) Run the command below on one of the worker nodes:
# sudo mkdir /mnt/data
2) Create an index.html file in /mnt/data with the content below (one way to do this is sketched after this step):
Hello from Kubernetes storage
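
A minimal sketch of step 2, assuming direct shell access to the worker node; the path matches the hostPath directory created in step 1:
# sudo sh -c 'echo "Hello from Kubernetes storage" > /mnt/data/index.html'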

3) Create a hostPath persistent volume using the following manifest (apply it as sketched after the manifest).

apiVersion: v1
kind: PersistentVolume
metadata:
  name: task-pv-volume
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/data
4) Run the command below to check that the PV was created:
# oc get pv
NAME             CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS      CLAIM     STORAGECLASS   REASON    AGE
task-pv-volume   10Gi       RWO           Retain          Available             manual                   4s

5) Create a persistent volume claim using the following manifest (applied as sketched below).

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: task-pv-claim
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
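
As with the PV, the claim can be applied from a file (pvc.yaml is an assumed name):
# oc create -f pvc.yaml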

6) Run the command below to check that the PVC was created and bound to the PV:

# oc get pv task-pv-volume

NAME             CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS   CLAIM                   STORAGECLASS   REASON    AGE
task-pv-volume   10Gi       RWO           Retain          Bound    default/task-pv-claim   manual                   2m

7) Now create an RC that uses this PVC, as below (applied as sketched after the manifest).

[ramakasturinarra@dhcp35-60 ocp_files]$ cat rc.yaml 
apiVersion: v1
kind: ReplicationController
metadata:
  name: rcex
spec:
  replicas: 3
  selector:
    app: sise
  template:
    metadata:
      name: somename
      labels:
        app: sise
    spec:
      containers:
      - name: sise
        image: quay.io/openshifttest/hello-openshift@sha256:aaea76ff622d2f8bcb32e538e7b3cd0ef6d291953f3e7c9f556c1ba5baf47e2e
        ports:
        - containerPort: 9876
        volumeMounts:
        - mountPath: /tmp
          name: task-pv-storage
      volumes:
        - name: task-pv-storage
          persistentVolumeClaim:
            claimName: task-pv-claim
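
The RC can then be created from the rc.yaml file shown above:
# oc create -f rc.yaml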

Actual results:
[zhouying@dhcp-140-138 ~]$ oc get po -o wide
rcex-22pxc                                  1/1     Running            0          20m   10.128.2.173   yinzhou-1230-2gh6m-compute-1   <none>           <none>
rcex-cgbw2                                  1/1     Running            0          48s   10.128.2.212   yinzhou-1230-2gh6m-compute-1   <none>           <none>
rcex-tcz28                                  1/1     Running            0          48s   10.128.2.213   yinzhou-1230-2gh6m-compute-1   <none>           <none>


Checking the descheduler logs shows that pods belonging to the RC that uses the local storage are always evicted:
I1231 06:11:04.120738       1 evictions.go:117] "Evicted pod" pod="zhouy/rcex-g9t7c" reason=" (RemoveDuplicatePods)"
I1231 06:11:04.121133       1 event.go:291] "Event occurred" object="zhouy/rcex-g9t7c" kind="Pod" apiVersion="v1" type="Normal" reason="Descheduled" message="pod evicted by sigs.k8s.io/descheduler (RemoveDuplicatePods)"
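
For reference, a sketch of how these operand logs can be pulled; the deployment name "cluster" is inferred from the operand pod names that appear later in this bug:
# oc logs deployment/cluster -n openshift-kube-descheduler-operator | grep Evicted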

Expected results:
A PVC bound to a hostPath PV is also a type of local storage, so the descheduler should not evict this type of pod.

Additional info:

Comment 1 Mike Dame 2021-01-05 16:11:44 UTC
Upstream issue for this: https://github.com/kubernetes-sigs/descheduler/issues/96

Comment 2 Mike Dame 2021-01-11 16:25:04 UTC
Upstream PR: https://github.com/kubernetes-sigs/descheduler/pull/481

Comment 3 Mike Dame 2021-01-27 15:10:18 UTC
Opened https://github.com/openshift/descheduler/pull/53 to add this to our descheduler fork.

However, for consistency this is opt-in (with the default behavior still being to evict PVC pods). So we will need to update our descheduler operator to either provide this option or enable it by default.

Jan/Maciej, what do you think? I would be fine just exposing it as a bool in the CRD since it is a component-level setting that would affect all profiles. Taking the opinionated approach of enabling it by default is a break from the current behavior.

Comment 4 Jan Chaloupka 2021-01-27 15:24:04 UTC
The less the better. Let's make it true by default. It's easier to change it to false or make it configurable later rather than have it on from the start.

Comment 5 Mike Dame 2021-01-27 15:26:14 UTC
Ah right, I was thinking we had already GA'd the descheduler with the current behavior, but 4.7 is not released yet. In that case we can make it true by default.

Comment 7 RamaKasturi 2021-01-29 12:40:29 UTC
Still seeing the issue with the CSV below; will try again on Monday.

[knarra@knarra ~]$ oc get csv -n openshift-kube-descheduler-operator
NAME                                                   DISPLAY                     VERSION                 REPLACES   PHASE
clusterkubedescheduleroperator.4.7.0-202101281146.p0   Kube Descheduler Operator   4.7.0-202101281146.p0              Succeeded

I0129 12:38:46.859836       1 duplicates.go:189] "Average occurrence per node" node="ip-10-0-152-104.us-east-2.compute.internal" ownerKey={namespace:knarra kind:ReplicationController name:rcex imagesHash:quay.io/openshifttest/hello-openshift@sha256:aaea76ff622d2f8bcb32e538e7b3cd0ef6d291953f3e7c9f556c1ba5baf47e2e} avg=1
I0129 12:38:46.872496       1 evictions.go:117] "Evicted pod" pod="knarra/rcex-p4c75" reason=" (RemoveDuplicatePods)"
I0129 12:38:46.882118       1 evictions.go:117] "Evicted pod" pod="knarra/rcex-vljvz" reason=" (RemoveDuplicatePods)"

Comment 8 Mike Dame 2021-01-29 15:14:13 UTC
Could you share the output for `oc get -o yaml cm/cluster` too? That will tell us if the default setting is being properly set in the operator

Comment 9 Mike Dame 2021-01-29 16:36:06 UTC
The shared configmap data shows it's being set:
```
[knarra@knarra ~]$ oc get -o yaml cm/cluster -n openshift-kube-descheduler-operator
apiVersion: v1
data:
  policy.yaml: |
    apiVersion: descheduler/v1alpha1
    ignorePvcPods: true
    kind: DeschedulerPolicy
...
```

The reason this isn't working yet is that https://github.com/openshift/descheduler/pull/53 (which pulls the actual descheduler change into our fork) never got linked to this bug. That's been updated, and once it merges and a new descheduler build happens we should attempt to re-verify.

Comment 11 RamaKasturi 2021-02-02 15:21:54 UTC
Verified bug with the payload below and did not see the issue happening.

[knarra@knarra ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-02-01-232332   True        False         146m    Cluster version is 4.7.0-0.nightly-2021-02-01-232332

[knarra@knarra ~]$ oc get csv -n openshift-kube-descheduler-operator
NAME                                                   DISPLAY                     VERSION                 REPLACES   PHASE
clusterkubedescheduleroperator.4.7.0-202101300133.p0   Kube Descheduler Operator   4.7.0-202101300133.p0              Succeeded

[knarra@knarra ~]$ oc get -o yaml cm/cluster -n openshift-kube-descheduler-operator
apiVersion: v1
data:
  policy.yaml: |
    apiVersion: descheduler/v1alpha1
    ignorePvcPods: true
    kind: DeschedulerPolicy
    strategies:
      RemoveDuplicates:
        enabled: true
        params:
          includeSoftConstraints: false
          namespaces:
            exclude:

[knarra@knarra ~]$ oc get pods -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
rcex-5c7px   1/1     Running   0          77m   10.131.0.58   ip-10-0-160-222.us-east-2.compute.internal   <none>           <none>
rcex-lznkn   1/1     Running   0          77m   10.129.2.55   ip-10-0-205-212.us-east-2.compute.internal   <none>           <none>
rcex-wqvpm   1/1     Running   0          77m   10.128.2.40   ip-10-0-146-72.us-east-2.compute.internal    <none>           <none>

[knarra@knarra ~]$ oc logs cluster-69866c5699-8jsxt -n openshift-kube-descheduler-operator | grep "Evicted"
[knarra@knarra ~]$

Based on the above moving bug to verified state.

Comment 14 errata-xmlrpc 2021-02-24 15:49:31 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

