Red Hat Bugzilla – Bug 1469037
Sometimes daemonset DESIRED=0 even though a matching node exists
Last modified: 2017-08-16 15:51 EDT
Created attachment 1295764 [details]
Description of problem:
When installing service-catalog via openshift-ansible, the service-catalog components failed to run. After SSHing into the installation to debug, I found the daemonsets had DESIRED=0 even though a matching node actually exists. Restarting the master service fixes this.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. [root@ip-172-18-0-4 ~]# oc get ds -n kube-service-catalog
NAME                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE-SELECTOR               AGE
apiserver            0         0         0       0            0           openshift-infra=apiserver   27m
controller-manager   0         0         0       0            0           openshift-infra=apiserver   27m
[root@ip-172-18-0-4 ~]# oc get no --show-labels
NAME STATUS AGE VERSION LABELS
ip-172-18-0-4.ec2.internal Ready,SchedulingDisabled 43m v1.6.1+5115d708d7 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.medium,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-0-4.ec2.internal,openshift-infra=apiserver,role=node
ip-172-18-11-233.ec2.internal Ready 43m v1.6.1+5115d708d7 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.medium,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-11-233.ec2.internal,registry=enabled,role=node,router=enabled
Created attachment 1295776 [details]
Created attachment 1295777 [details]
Eric Wolinetz is attempting to reproduce this now. He says the node labels in the original comment look correct.
Could we get the logs from the controller manager? I reviewed the node logs and they looked uneventful.
Could we also get a yaml dump of the daemon sets that were created?
The controller-manager log is attached in file atomic-openshift-master.log
daemonset.yaml: http://pastebin.test.redhat.com/501739 (note: the daemonset at that link is working correctly because I restarted the master)
Created attachment 1296100 [details]
Reproduced again. Attaching some info about the daemonsets and nodes.
We debugged a customer issue similar to this one yesterday. Can we establish:
1. Are pods being created at all for the daemon set? If so, can we get yamls and describe output for them?
2. Is there a node selector associated with the namespace? Can we get a yaml for the namespace?
In the issue we debugged today, the default node selector for the project (and later the cluster) resulted in pods being created but not scheduled on certain nodes, due to conflicts between the pod's node selector and the node labels required by the project node selector.
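To make the failure mode above concrete, here is a minimal sketch (not the actual controller code) of how a project-level node selector merged into a daemon pod's own selector can drive DESIRED to 0. The node labels and daemonset selector mirror the ones in this bug; the merge helper and function names are simplifications for illustration only.

```python
def merge_selectors(pod_selector, project_selector):
    """Merge the pod's own nodeSelector with the project node selector.
    A conflict (same key, different values) makes the pod unschedulable
    on every node, so we return None."""
    merged = dict(pod_selector)
    for key, value in project_selector.items():
        if key in merged and merged[key] != value:
            return None  # conflicting requirements
        merged[key] = value
    return merged

def desired_count(nodes, pod_selector, project_selector):
    """Count nodes whose labels satisfy the merged selector (DESIRED)."""
    merged = merge_selectors(pod_selector, project_selector)
    if merged is None:
        return 0
    return sum(
        all(labels.get(k) == v for k, v in merged.items())
        for labels in nodes
    )

# Labels trimmed from the `oc get no --show-labels` output in this bug.
nodes = [
    {"kubernetes.io/hostname": "ip-172-18-0-4.ec2.internal",
     "openshift-infra": "apiserver", "role": "node"},
    {"kubernetes.io/hostname": "ip-172-18-11-233.ec2.internal",
     "role": "node", "registry": "enabled"},
]

ds_selector = {"openshift-infra": "apiserver"}

# With no project selector, one node matches:
print(desired_count(nodes, ds_selector, {}))                   # 1
# A project default selector no node carries drops DESIRED to 0:
print(desired_count(nodes, ds_selector, {"region": "infra"}))  # 0
```

This is why question 2 above matters: the daemonset selector alone matches ip-172-18-0-4, so a DESIRED of 0 points at something being layered on top of it, such as a namespace node selector.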
If it happens again, I'll check what you suggested. To be honest, it's really hard to reproduce.
These daemonsets weren't created manually; they were created by openshift-ansible when enabling service-catalog. They are the service-catalog apiserver and controller-manager in the kube-service-catalog project.
I spoke to Eric and he is not currently using a node selector on the namespace the installer creates for the catalog components. He is going to add one in this PR: https://github.com/openshift/openshift-ansible/pull/4781
That should address this issue; I don't think we have cause to believe something else is happening. I am going to reassign this bug to Eric, and he can move it to ON_QA once that PR is merged.
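For reference, a project-level default node selector is set via an annotation on the namespace. An illustrative fragment (the value shown is an assumption, not taken from the PR) for the installer-created namespace might look like:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: kube-service-catalog
  annotations:
    # Illustrative value only; an empty string overrides any cluster-wide
    # default project node selector, so daemon pods are constrained only
    # by their own nodeSelector.
    openshift.io/node-selector: ""
```

Setting this explicitly on the namespace avoids the cluster default selector being silently merged into the daemon pods' selectors.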
Verified on openshift-ansible-3.6.162-1.git.0.50e29bd.el7.noarch.rpm.
The error no longer occurs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.