Bug 1535940
Summary: | Cluster capacity cannot run after update for kube rebase 1.9 | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | weiwei jiang <wjiang> |
Component: | Node | Assignee: | Avesh Agarwal <avagarwa> |
Status: | CLOSED ERRATA | QA Contact: | weiwei jiang <wjiang> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 3.9.0 | CC: | aos-bugs, jokerman, mmccomas, sjenning |
Target Milestone: | --- | ||
Target Release: | 3.9.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: Inside a pod, no kubeconfig is supplied because the in-cluster configuration is used, but cluster-capacity incorrectly treated the missing kubeconfig as a fatal error and exited.
Consequence:
Cluster capacity stopped working when run in a pod.
Fix:
The absence of a kubeconfig is now ignored when running in a pod, because the in-cluster configuration is used instead.
Result:
Cluster capacity now works in a pod.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2018-03-28 14:20:40 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
weiwei jiang
2018-01-18 09:56:13 UTC
(In reply to weiwei jiang from comment #0)

> Description of problem:
> Checked with cluster capacity and found it always got "Failed to set default
> scheduler config: Error in opening default scheduler config file: open : no
> such file or directory" even when I gave an existing path for --default-config:
>
>     /bin/sh -ec /bin/cluster-capacity --default-config=/test-s/scheduler.json --podspec=/test-pod/pod.yaml --verbose
>
>     $ ls /test-*
>     /test-pod:
>     pod.yaml
>
>     /test-s:
>     scheduler.json
>
> Version-Release number of selected component (if applicable):
>
>     atomic-openshift-cluster-capacity-3.9.0-0.21.0.git.0.2a50d06.el7.x86_64
>     # openshift version
>     openshift v3.9.0-0.20.0
>     kubernetes v1.9.1+a0ce1bc657
>     etcd 3.2.8
>
> How reproducible:
> always
>
> Steps to Reproduce:
>
> 1. Create a podspec as a configmap:
>
>     oc create configmap cluster-capacity-configmap --from-file=pod.yaml=pod.yaml -n default
>
>     # cat pod.yaml
>     apiVersion: v1
>     kind: Pod
>     metadata:
>       creationTimestamp: null
>       name: cluster-capacity-stub-container
>       namespace: cluster-capacity
>     spec:
>       containers:
>       - image: gcr.io/google_containers/pause:2.0
>         imagePullPolicy: Always
>         name: cluster-capacity-stub-container
>         resources:
>           limits:
>             cpu: 200m
>             memory: 100Mi
>           requests:
>             cpu: 100m
>             memory: 80Mi
>       dnsPolicy: Default
>       nodeSelector:
>         load: high
>         region: hpc
>       restartPolicy: OnFailure
>       schedulerName: default-scheduler
>     status: {}
>
> 2. oc create secret generic sf --from-file=scheduler.json=/etc/origin/master/scheduler.json -n default

Why are you creating a secret from the scheduler config file?

> 3. Create an rc for cluster-capacity:
>
>     apiVersion: v1
>     kind: ReplicationController
>     metadata:
>       creationTimestamp: 2018-01-18T07:30:32Z
>       generation: 5
>       labels:
>         run: cluster-capacity
>       name: cluster-capacity
>       namespace: default
>       resourceVersion: "22043"
>       selfLink: /api/v1/namespaces/default/replicationcontrollers/cluster-capacity
>       uid: 778a86ed-fc21-11e7-80da-fa163ead188e
>     spec:
>       replicas: 1
>       selector:
>         run: cluster-capacity
>       template:
>         metadata:
>           creationTimestamp: null
>           labels:
>             run: cluster-capacity
>         spec:
>           containers:
>           - command:
>             - /bin/sh
>             - -ec
>             - |
>               /bin/cluster-capacity --default-config=/test-s/scheduler.json --podspec=/test-pod/pod.yaml --verbose; while true; do sleep 10; done
>             env:
>             - name: CC_INCLUSTER
>               value: "true"
>             image: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/ose-cluster-capacity
>             imagePullPolicy: Always
>             name: cluster-capacity
>             resources: {}
>             terminationMessagePath: /dev/termination-log
>             terminationMessagePolicy: File
>             volumeMounts:
>             - mountPath: /test-pod
>               name: test-volume
>             - mountPath: /test-s
>               name: ss
>           dnsPolicy: ClusterFirst
>           restartPolicy: Always
>           schedulerName: default-scheduler
>           securityContext: {}
>           serviceAccount: cluster-capacity-sa
>           serviceAccountName: cluster-capacity-sa
>           terminationGracePeriodSeconds: 30
>           volumes:
>           - configMap:
>               defaultMode: 420
>               name: cluster-capacity-configmap
>             name: test-volume
>           - name: ss
>             secret:
>               defaultMode: 420
>               secretName: sf

Why are you mounting the scheduler.json config file as a secret?

> Actual results:
>
>     # oc logs -n default -f cluster-capacity-bgjh5
>     Failed to set default scheduler config: Error in opening default scheduler config file: open : no such file or directory
>
> Expected results:
> cluster-capacity should work well
>
> Additional info:

Also in general, you need to node provide scheduler.json unless your cluster us using some customize default scheduler.
Just want to correct my comment: Also, in general, you need NOT provide scheduler.json unless your cluster is using some customized default scheduler.

Also, can you show me your scheduler.json? I just tested --default-config locally and it seemed to work. I will try in a pod now.

I can reproduce the issue. I am working on a fix.

PR: https://github.com/openshift/origin/pull/18198

In case you want to test it before, you could use this image: docker.io/aveshagarwal/cluster-capacity

Checked with atomic-openshift-cluster-capacity-3.9.0-0.24.0.git.0.fc8ad63.el7.x86_64, and it works now.

    # oc logs -f cluster-capacity-g9xfc
    cluster-capacity-stub-container pod requirements:
            - CPU: 100m
            - Memory: 80Mi
            - NodeSelector: load=high,region=hpc

    The cluster can schedule 0 instance(s) of the pod cluster-capacity-stub-container.

    Termination reason: Unschedulable: 0/5 nodes are available: 1 NodeUnschedulable, 5 MatchNodeSelector.

According to https://bugzilla.redhat.com/show_bug.cgi?id=1535940#c6, moving to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489
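For readers skimming this bug, the Doc Text boils the fix down to a config-resolution rule: when cluster-capacity runs inside a pod (in-cluster), a missing kubeconfig is no longer treated as fatal, because the in-cluster configuration is used instead; outside a pod, a kubeconfig is still required. A minimal sketch of that rule follows. It is purely illustrative and hypothetical: the real fix lives in the cluster-capacity Go code (see the PR above), and the function name and return values here are invented for the example.

```python
import os

def resolve_client_config(kubeconfig_path: str, in_cluster: bool) -> str:
    """Illustrative sketch of the fixed behavior (not the actual Go code).

    in_cluster corresponds to running inside a pod (e.g. CC_INCLUSTER=true
    in the reproducer above).
    """
    if in_cluster:
        # Fixed behavior: ignore the (possibly absent) kubeconfig and use
        # the service-account-based in-cluster configuration.
        return "in-cluster"
    if not kubeconfig_path or not os.path.exists(kubeconfig_path):
        # Outside a pod a kubeconfig is still required; this mirrors the
        # pre-fix failure mode ("open : no such file or directory").
        raise FileNotFoundError(
            f"open {kubeconfig_path}: no such file or directory"
        )
    return kubeconfig_path
```

Before the fix, the in-cluster case fell through to the missing-file check and exited; after the fix, the in-cluster branch short-circuits and the pod runs normally.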