Bug 1469037

Summary: Sometimes daemonset DESIRED=0 even though a node matches its node selector
Product: OpenShift Container Platform
Reporter: DeShuai Ma <dma>
Component: Installer
Assignee: ewolinet
Status: CLOSED ERRATA
QA Contact: DeShuai Ma <dma>
Severity: medium
Priority: high
Version: 3.6.0
CC: aos-bugs, dma, eparis, jokerman, mmccomas, pruan, wmeng
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-08-10 05:31:01 UTC
Type: Bug
Attachments:
  atomic-openshift-master.log
  nod1.log
  node2.log
  ds&node info

Description DeShuai Ma 2017-07-10 10:19:41 UTC
Created attachment 1295764 [details]
atomic-openshift-master.log

Description of problem:
When installing the service catalog via openshift-ansible, the service-catalog pods failed to run. SSHing in to debug, the daemonsets show DESIRED=0 even though a node matches their node selector. Restarting the master service fixes this.
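
The workaround mentioned above, as a rough sketch (this assumes a single combined master service, as suggested by the attached atomic-openshift-master.log; HA installs split it into separate -api and -controllers units):

  # restart the master service; the reporter notes DESIRED recovers afterwards
  systemctl restart atomic-openshift-master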

Version-Release number of selected component (if applicable):
openshift v3.6.136
kubernetes v1.6.1+5115d708d7
etcd 3.2.1


How reproducible:
Sometimes

Steps to Reproduce:
1.[root@ip-172-18-0-4 ~]# oc get ds -n kube-service-catalog
NAME                 DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR               AGE
apiserver            0         0         0         0            0           openshift-infra=apiserver   27m
controller-manager   0         0         0         0            0           openshift-infra=apiserver   27m
[root@ip-172-18-0-4 ~]# 
[root@ip-172-18-0-4 ~]# 
[root@ip-172-18-0-4 ~]# 
[root@ip-172-18-0-4 ~]# 
[root@ip-172-18-0-4 ~]# oc get no --show-labels
NAME                            STATUS                     AGE       VERSION             LABELS
ip-172-18-0-4.ec2.internal      Ready,SchedulingDisabled   43m       v1.6.1+5115d708d7   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.medium,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-0-4.ec2.internal,openshift-infra=apiserver,role=node
ip-172-18-11-233.ec2.internal   Ready                      43m       v1.6.1+5115d708d7   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m3.medium,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-172-18-11-233.ec2.internal,registry=enabled,role=node,router=enabled

2.
3.
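
A few extra checks that may be worth capturing alongside step 1 (a sketch; the project name kube-service-catalog and ds name apiserver are taken from the output above):

  # is a project node selector set on the namespace?
  oc get namespace kube-service-catalog -o yaml | grep node-selector
  # Node-Selector field and Events for the daemonset
  oc describe ds apiserver -n kube-service-catalog
  # any placement/scheduling failure events in the project
  oc get events -n kube-service-catalog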

Actual results:


Expected results:
https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/17715/console



Additional info:

Comment 1 DeShuai Ma 2017-07-10 10:20:44 UTC
Created attachment 1295776 [details]
nod1.log

Comment 2 DeShuai Ma 2017-07-10 10:22:34 UTC
Created attachment 1295777 [details]
node2.log

Comment 3 Paul Morie 2017-07-10 15:49:26 UTC
Eric Wolinetz is attempting to reproduce this now.  He says the node labels in the original comment look correct.

Could we get the logs from the controller manager?  I reviewed the node logs and they looked uneventful.

Comment 4 Paul Morie 2017-07-10 15:50:12 UTC
Could we also get a yaml dump of the daemon sets that were created?

Comment 5 DeShuai Ma 2017-07-11 01:59:48 UTC
The controller-manager log is attached in file atomic-openshift-master.log

daemonset.yaml: http://pastebin.test.redhat.com/501739 (note: the daemonset in the linked dump is working correctly because I had already restarted the master)

Comment 6 DeShuai Ma 2017-07-11 07:04:39 UTC
Created attachment 1296100 [details]
ds&node info

Reproduced it again. Attaching some info about the ds and nodes.

Comment 7 Paul Morie 2017-07-13 18:38:32 UTC
We debugged a customer issue similar to this one yesterday.  Can we establish:

1.  Are pods being created at all for the daemon set?  If so, can we get yamls and describe output for them?
2.  Is there a node selector associated with the namespace? Can we get a yaml for the namespace?

In the issue we debugged, the default node selectors for the project (and later the cluster) resulted in pods being created but not scheduled on certain nodes, due to conflicts between the pod's node selector and the label requirements introduced by the project node selector.
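
To illustrate the kind of conflict described above (the label values here are invented for the example, not taken from this environment): suppose master-config.yaml sets projectConfig.defaultNodeSelector to region=primary and the kube-service-catalog namespace carries no openshift.io/node-selector annotation of its own. The default is then combined with the daemonset pod's own selector:

  # effective requirements on each catalog pod (illustrative only):
  #   region=primary              <- cluster/project default node selector
  #   openshift-infra=apiserver   <- daemonset pod template nodeSelector
  # The node labelled openshift-infra=apiserver in the report has no region=primary
  # label, so no node satisfies both and the pods cannot be placed there.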

Comment 8 DeShuai Ma 2017-07-14 16:39:43 UTC
When it happens again, I'll check what you suggested. To be honest, it's really hard to reproduce.

Comment 9 DeShuai Ma 2017-07-14 16:44:07 UTC
This daemonset wasn't created by me manually; it is created by openshift-ansible when the service catalog is enabled. The ds in question is the service-catalog apiserver and controller-manager in the kube-service-catalog project.

Comment 10 Paul Morie 2017-07-18 17:47:02 UTC
I spoke to Eric and he is not currently using a node selector on the namespace the installer creates for the catalog components.  He is going to add one in this PR: https://github.com/openshift/openshift-ansible/pull/4781

That should address this issue; I don't think we have cause to believe that anything else is happening. I am going to reassign this bug to Eric, and he can move it to ON_QA once that PR is merged.
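
For reference, a project-level node selector is just an annotation on the namespace, e.g. (an illustration of the mechanism, not necessarily the exact value the PR uses):

  oc annotate namespace kube-service-catalog openshift.io/node-selector="" --overwrite

An empty value keeps the cluster default selector from being merged into the project; a concrete value such as openshift-infra=apiserver would instead pin the project's pods to the matching nodes.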

Comment 12 DeShuai Ma 2017-07-24 06:44:05 UTC
Verified on openshift-ansible-3.6.162-1.git.0.50e29bd.el7.noarch.rpm.

The error can no longer be reproduced.

Comment 14 errata-xmlrpc 2017-08-10 05:31:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716