Description of problem:

If the project openshift-node does not have the openshift.io/node-selector: "" annotation, the sync pods will not be scheduled on the master hosts because of the default node selector (defaultNodeSelector: node-role.kubernetes.io/compute=true in master-config.yaml). As a result, no node-config.yaml is generated, and the SDN eventually fails to set up:

# oc logs -n openshift-sdn sdn-z84bj
2018/05/18 10:22:18 socat[12998] E connect(5, AF=1 "/var/run/openshift-sdn/cni-server.sock", 40): No such file or directory
warning: Cannot find existing node-config.yaml, waiting 15s ...
warning: Cannot find existing node-config.yaml, waiting 15s ...
warning: Cannot find existing node-config.yaml, waiting 15s ...

This is the incorrect project in a failed installation (note the missing openshift.io/node-selector annotation):

# oc get project openshift-node -o yaml
apiVersion: project.openshift.io/v1
kind: Project
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c8,c2
    openshift.io/sa.scc.supplemental-groups: 1000060000/10000
    openshift.io/sa.scc.uid-range: 1000060000/10000
  creationTimestamp: 2018-05-18T07:26:39Z
  name: openshift-node
  resourceVersion: "898"
  selfLink: /apis/project.openshift.io/v1/projects/openshift-node
  uid: cde0d046-5a6c-11e8-be6d-0ef2415a08d4
spec:
  finalizers:
  - kubernetes
  - openshift.io/origin
status:
  phase: Active

# ll /etc/origin/node/
total 28
-rw-------. 1 root root 7636 May 17 21:06 bootstrap.kubeconfig
-rw-------. 1 root root 1719 May 17 21:04 bootstrap-node-config.yaml
drwxr-xr-x. 2 root root  132 May 17 21:10 certificates
-rw-r--r--. 1 root root 1070 May 17 21:08 client-ca.crt
-rw-------. 1 root root 7636 May 17 21:06 node.kubeconfig
drwxr-xr-x. 2 root root   68 May 17 21:08 pods
-rw-------. 1 root root   22 May 17 21:01 resolv.conf

This is the correct project in a successful installation:

# oc get project openshift-node -o yaml
apiVersion: project.openshift.io/v1
kind: Project
metadata:
  annotations:
    openshift.io/node-selector: ""
    openshift.io/sa.scc.mcs: s0:c7,c4
    openshift.io/sa.scc.supplemental-groups: 1000050000/10000
    openshift.io/sa.scc.uid-range: 1000050000/10000
  creationTimestamp: 2018-05-18T07:45:37Z
  name: openshift-node
  resourceVersion: "906"
  selfLink: /apis/project.openshift.io/v1/projects/openshift-node
  uid: 74954672-5a6f-11e8-888d-0ee46b987fb6
spec:
  finalizers:
  - kubernetes
  - openshift.io/origin
status:
  phase: Active

# ll /etc/origin/node/
total 32
-rw-------. 1 root root 7632 May 18 03:42 bootstrap.kubeconfig
-rw-------. 1 root root 1540 May 18 03:40 bootstrap-node-config.yaml
drwxr-xr-x. 2 root root  212 May 18 03:47 certificates
-rw-r--r--. 1 root root 1070 May 18 03:44 client-ca.crt
-rw-------. 1 root root 1575 May 18 06:25 node-config.yaml
-rw-------. 1 root root 7632 May 18 03:42 node.kubeconfig
drwxr-xr-x. 2 root root   68 May 18 03:44 pods
-rw-------. 1 root root   22 May 18 03:37 resolv.conf

Version-Release number of selected component (if applicable):
openshift-ansible-3.10.0-0.47.0.git.0.c018c8f.el7

How reproducible:
Sometimes (roughly 20% of installs)

Steps to Reproduce:
1. Install OCP

Actual results:
The install fails.

Expected results:
The install succeeds.
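A quick way to see the symptom described above (a diagnostic sketch added for illustration, not part of the original report; it requires a live cluster) is to list the sync pods together with their node assignments:

```shell
# With the bad defaultNodeSelector in effect, the sync daemonset pods
# are scheduled only on compute nodes; none appear on the master hosts.
oc get pods -n openshift-node -o wide
```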
So you didn't do anything to modify the openshift-node project? Do you know why it doesn't have the proper annotation?
I did not do anything with the openshift-node project. I only observed that some installs failed and some succeeded with identical parameters, and I have no idea why that happened. I could not find the task that creates the openshift-node project. If there is no better solution, can we at least make sure the project openshift-node gets the correct node-selector?
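As a manual workaround on an affected cluster (a sketch based on the standard `oc annotate` command, not something the installer does automatically), the missing annotation can be restored by hand:

```shell
# Re-add the empty node selector so pods in openshift-node are not
# constrained by the cluster's defaultNodeSelector; --overwrite makes
# the command safe to repeat if the annotation already exists.
oc annotate namespace openshift-node openshift.io/node-selector="" --overwrite
```

After this, deleting the stuck sync pods should let them reschedule onto the masters.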
Yup, we'll look at it, thanks.
I met the same issue while deploying OCP on Azure.
Today I also met the same issue while deploying OCP on OpenStack with the 3.10.0-0.50.0 build.
When setting up sync, the node selector is set for the openshift-node project at [1]. That task list is called from [2]. Please attach logs so we can see what is happening at this task.

[1] https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node_group/tasks/sync.yml#L2-L7
[2] https://github.com/openshift/openshift-ansible/blob/master/playbooks/openshift-master/private/config.yml#L112-L114
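To verify what that task actually produced on a broken cluster, the annotation can be read back directly (a diagnostic sketch for illustration; the backslash-escaped dots in the JSONPath are required because the annotation key itself contains dots):

```shell
# Print the project's node-selector annotation: expect an empty string
# on a healthy install, and no output at all when the annotation is
# missing entirely.
oc get namespace openshift-node \
  -o jsonpath='{.metadata.annotations.openshift\.io/node-selector}'
```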
Got it, thanks. I have not hit this recently with the latest build, 3.10.0-0.53.0, so there is no need to try the old version. I will collect logs when I meet this again.
Yesterday I hit this issue twice with the 3.10.0-0.53.0 build; on the third try the installation passed. Please refer to my install log for more details: the installation exited because the catalog API server was not running, since the master node did not become Ready due to this same issue. (Unfortunately the broken environment has already been terminated, so there is no chance to log in for debugging.)
Created attachment 1445285 [details] installation log with inventory embedded
Can this be tested in the next build? We've cleaned up problems that led to the API and etcd pods being restarted unexpectedly, as well as some problems created by that scenario related to caching in the oc client.
Fixed with:
openshift-ansible-3.10.0-0.58.0.git.0.d8f6377.el7.noarch
Operating System: Red Hat Enterprise Linux Atomic Host 7.5.1
Kernel: Linux 3.10.0-862.2.3.el7.x86_64
Adding the PR link that fixed this issue: https://github.com/openshift/openshift-ansible/pull/8563
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816