Bug 1579803 - Installation failed if project openshift-node has incorrect annotations
Summary: Installation failed if project openshift-node has incorrect annotations
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.10.0
Assignee: Russell Teague
QA Contact: Weihua Meng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-18 11:07 UTC by Weihua Meng
Modified: 2018-07-30 19:16 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Not all control plane pods had started before the installer proceeded to subsequent tasks. Consequence: Subsequent tasks would occasionally fail when some control plane pods were not yet available. Fix: Added a wait for all control plane pods to be available before continuing.
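In shell terms, the added wait is roughly the loop below; the static pod names and the kube-system namespace are assumptions based on the 3.10 control plane layout, not the exact check the playbook performs:

for pod in master-api master-controllers master-etcd; do
  # Poll until the static pod on this host reports Ready (names assumed)
  until oc get pod "${pod}-$(hostname)" -n kube-system \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null | grep -q True; do
    sleep 5
  done
done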
Clone Of:
Environment:
Last Closed: 2018-07-30 19:16:09 UTC
Target Upstream Version:
Embargoed:


Attachments
installation log with inventory embedded (1.27 MB, text/plain)
2018-05-29 09:24 UTC, Johnny Liu


Links:
Red Hat Product Errata RHBA-2018:1816 (last updated 2018-07-30 19:16:29 UTC)

Description Weihua Meng 2018-05-18 11:07:21 UTC
Description of problem:
If the openshift-node project does not have the openshift.io/node-selector: "" annotation, the sync pod will not be scheduled on the master hosts because of the default node selector in master-config.yaml (defaultNodeSelector: node-role.kubernetes.io/compute=true). As a result, no node-config.yaml is generated and the SDN eventually fails to set up:
# oc logs -n openshift-sdn sdn-z84bj
2018/05/18 10:22:18 socat[12998] E connect(5, AF=1 "/var/run/openshift-sdn/cni-server.sock", 40): No such file or directory
warning: Cannot find existing node-config.yaml, waiting 15s ...
warning: Cannot find existing node-config.yaml, waiting 15s ...
warning: Cannot find existing node-config.yaml, waiting 15s ...
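
A quick way to check whether a cluster hit this is to look for the annotation directly (plain oc output filtering, nothing installer-specific):

# oc get project openshift-node -o yaml | grep node-selector

On a failed installation the grep prints nothing; on a good one it prints the openshift.io/node-selector: "" line, as the two project dumps below show.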

This is the incorrect project from a failed installation:
# oc get project openshift-node -o yaml
apiVersion: project.openshift.io/v1
kind: Project
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c8,c2
    openshift.io/sa.scc.supplemental-groups: 1000060000/10000
    openshift.io/sa.scc.uid-range: 1000060000/10000
  creationTimestamp: 2018-05-18T07:26:39Z
  name: openshift-node
  resourceVersion: "898"
  selfLink: /apis/project.openshift.io/v1/projects/openshift-node
  uid: cde0d046-5a6c-11e8-be6d-0ef2415a08d4
spec:
  finalizers:
  - kubernetes
  - openshift.io/origin
status:
  phase: Active

# ll /etc/origin/node/
total 28
-rw-------. 1 root root 7636 May 17 21:06 bootstrap.kubeconfig
-rw-------. 1 root root 1719 May 17 21:04 bootstrap-node-config.yaml
drwxr-xr-x. 2 root root  132 May 17 21:10 certificates
-rw-r--r--. 1 root root 1070 May 17 21:08 client-ca.crt
-rw-------. 1 root root 7636 May 17 21:06 node.kubeconfig
drwxr-xr-x. 2 root root   68 May 17 21:08 pods
-rw-------. 1 root root   22 May 17 21:01 resolv.conf


This is the correct project from a successful installation:
# oc get project openshift-node -o yaml
apiVersion: project.openshift.io/v1
kind: Project
metadata:
  annotations:
    openshift.io/node-selector: ""
    openshift.io/sa.scc.mcs: s0:c7,c4
    openshift.io/sa.scc.supplemental-groups: 1000050000/10000
    openshift.io/sa.scc.uid-range: 1000050000/10000
  creationTimestamp: 2018-05-18T07:45:37Z
  name: openshift-node
  resourceVersion: "906"
  selfLink: /apis/project.openshift.io/v1/projects/openshift-node
  uid: 74954672-5a6f-11e8-888d-0ee46b987fb6
spec:
  finalizers:
  - kubernetes
  - openshift.io/origin
status:
  phase: Active

# ll /etc/origin/node/
total 32
-rw-------. 1 root root 7632 May 18 03:42 bootstrap.kubeconfig
-rw-------. 1 root root 1540 May 18 03:40 bootstrap-node-config.yaml
drwxr-xr-x. 2 root root  212 May 18 03:47 certificates
-rw-r--r--. 1 root root 1070 May 18 03:44 client-ca.crt
-rw-------. 1 root root 1575 May 18 06:25 node-config.yaml
-rw-------. 1 root root 7632 May 18 03:42 node.kubeconfig
drwxr-xr-x. 2 root root   68 May 18 03:44 pods
-rw-------. 1 root root   22 May 18 03:37 resolv.conf


Version-Release number of selected component (if applicable):
openshift-ansible-3.10.0-0.47.0.git.0.c018c8f.el7

How reproducible:
Sometimes (about 20% of installs)

Steps to Reproduce:
1. Install OCP 

Actual results:
Installation fails.

Expected results:
Installation succeeds.

Comment 2 Scott Dodson 2018-05-18 12:56:15 UTC
So you didn't do anything to modify the openshift-node project? Do you know why it doesn't have the proper annotation?

Comment 3 Weihua Meng 2018-05-18 13:50:15 UTC
I did not do anything with the openshift-node project.
I just observed that some installs failed and some succeeded with identical parameters.
I have no idea why this happened.
I did not find the task which creates the openshift-node project.

Can we at least make sure the openshift-node project has the correct node-selector, if there is no better solution?
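
For anyone who hits this before a fixed build, a sketch of a manual workaround (assuming cluster-admin access; this only patches the symptom and is not the installer fix):

# oc annotate namespace openshift-node openshift.io/node-selector="" --overwrite

Deleting any Pending sync pods afterwards should let the daemonset reschedule them onto the masters.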

Comment 4 Scott Dodson 2018-05-18 14:20:39 UTC
Yup, we'll look at it, thanks.

Comment 5 Wenkai Shi 2018-05-21 03:28:01 UTC
I met the same issue while deploying OCP on Azure.

Comment 6 Johnny Liu 2018-05-22 10:10:44 UTC
Today I also met the same issue while deploying OCP on OpenStack with the 3.10.0-0.50.0 build.

Comment 7 Russell Teague 2018-05-25 15:02:27 UTC
When setting up sync, the node selector is set for the openshift-node project at [1]. This task list is called from [2].

Please attach logs so we can see what is happening at this task.


[1] https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node_group/tasks/sync.yml#L2-L7
[2] https://github.com/openshift/openshift-ansible/blob/master/playbooks/openshift-master/private/config.yml#L112-L114
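
Once the selector is in place, a quick way to confirm the sync pods actually landed on the masters (the openshift-node namespace for the sync daemonset is assumed from the 3.10 layout):

# oc get pods -n openshift-node -o wide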

Comment 8 Weihua Meng 2018-05-29 07:29:51 UTC
Got it, thanks.
I have not met this recently with the latest build, 3.10.0-0.53.0, so there is no need to try the old version.
I will collect logs when I hit this again.

Comment 9 Johnny Liu 2018-05-29 09:22:25 UTC
Yesterday I met this issue twice with the 3.10.0-0.53.0 build; when I tried a third time, the installation passed.

Please refer to my install log for more details. The installation exited because the catalog API server was not running, which in turn was because the master node did not become Ready due to the same issue. (Unfortunately the broken environment is already terminated, so there is no chance to log in for debugging.)

Comment 10 Johnny Liu 2018-05-29 09:24:21 UTC
Created attachment 1445285
installation log with inventory embedded

Comment 11 Scott Dodson 2018-05-31 13:28:51 UTC
Can this be tested in the next build? We've cleaned up problems that led to the API and etcd pods being restarted unexpectedly, as well as some problems created by that scenario related to caching in the oc client.

Comment 12 Weihua Meng 2018-06-05 02:44:11 UTC
Fixed.
openshift-ansible-3.10.0-0.58.0.git.0.d8f6377.el7.noarch

  Operating System: Red Hat Enterprise Linux Atomic Host 7.5.1
            Kernel: Linux 3.10.0-862.2.3.el7.x86_64

Comment 13 Russell Teague 2018-06-06 14:51:14 UTC
Adding the PR link which fixed this issue:
https://github.com/openshift/openshift-ansible/pull/8563

Comment 15 errata-xmlrpc 2018-07-30 19:16:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

