Description of problem:
Set up an OCP cluster on AWS via the openshift-install tool. After the cluster setup, checking the pods in the openshift-sdn project shows that some of the sdn-controller pods and the sdn pods running on master nodes have been restarted.

Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-02-11-024933

How reproducible:
Always

Steps to Reproduce:
1. Set up the OCP cluster via the openshift-install tool
# openshift-install create cluster
2. Switch to the openshift-sdn project and check the pods after cluster setup
# oc get po -o wide
NAME                   READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE
ovs-277gv              1/1     Running   0          18m     10.0.0.6       ip-10-0-0-6.us-east-2.compute.internal       <none>
ovs-8x7rh              1/1     Running   0          18m     10.0.23.247    ip-10-0-23-247.us-east-2.compute.internal    <none>
ovs-gqlsx              1/1     Running   0          9m52s   10.0.133.194   ip-10-0-133-194.us-east-2.compute.internal   <none>
ovs-jwb45              1/1     Running   0          10m     10.0.149.22    ip-10-0-149-22.us-east-2.compute.internal    <none>
ovs-l2twg              1/1     Running   0          18m     10.0.33.73     ip-10-0-33-73.us-east-2.compute.internal     <none>
ovs-rfp8w              1/1     Running   0          9m52s   10.0.173.129   ip-10-0-173-129.us-east-2.compute.internal   <none>
sdn-4vlt5              1/1     Running   0          18m     10.0.23.247    ip-10-0-23-247.us-east-2.compute.internal    <none>
sdn-controller-62hpw   1/1     Running   0          18m     10.0.0.6       ip-10-0-0-6.us-east-2.compute.internal       <none>
sdn-controller-7rxpq   1/1     Running   1          18m     10.0.33.73     ip-10-0-33-73.us-east-2.compute.internal     <none>
sdn-controller-kszpb   1/1     Running   0          18m     10.0.23.247    ip-10-0-23-247.us-east-2.compute.internal    <none>
sdn-kj8jl              1/1     Running   1          18m     10.0.0.6       ip-10-0-0-6.us-east-2.compute.internal       <none>
sdn-kpzp4              1/1     Running   0          10m     10.0.149.22    ip-10-0-149-22.us-east-2.compute.internal    <none>
sdn-qlfp9              1/1     Running   1          18m     10.0.33.73     ip-10-0-33-73.us-east-2.compute.internal     <none>
sdn-vsxwg              1/1     Running   0          9m52s   10.0.133.194   ip-10-0-133-194.us-east-2.compute.internal   <none>
sdn-wh4c4              1/1     Running   0          9m52s   10.0.173.129   ip-10-0-173-129.us-east-2.compute.internal   <none>
3. Check the events of the restarted pods
# oc describe po

Actual results:
The sdn pods get restarted during the cluster setup.

Expected results:
The sdn pods should not get restarted unexpectedly.

Additional info:
# oc describe po sdn-controller-7rxpq
-----
    State:          Running
      Started:      Mon, 11 Feb 2019 16:26:23 +0800
    Last State:     Terminated
      Reason:       Error
      Message:      1 subnets.go:133] Created HostSubnet ip-10-0-33-73.us-east-2.compute.internal (host: "ip-10-0-33-73.us-east-2.compute.internal", ip: "10.0.33.73", subnet: "10.130.0.0/23")
I0211 08:26:11.154850       1 vnids.go:115] Allocated netid 10872476 for namespace "openshift-cluster-version"
I0211 08:26:11.161586       1 vnids.go:115] Allocated netid 14266677 for namespace "openshift-config-managed"
I0211 08:26:11.167059       1 vnids.go:115] Allocated netid 8818831 for namespace "openshift-network-operator"
I0211 08:26:11.172249       1 vnids.go:115] Allocated netid 0 for namespace "default"
I0211 08:26:11.177547       1 vnids.go:115] Allocated netid 13800730 for namespace "openshift-kube-scheduler"
I0211 08:26:11.183488       1 vnids.go:115] Allocated netid 2608316 for namespace "openshift-core-operators"
I0211 08:26:11.188820       1 vnids.go:115] Allocated netid 7384650 for namespace "openshift-kube-apiserver"
I0211 08:26:11.194070       1 vnids.go:115] Allocated netid 1484940 for namespace "openshift-machine-config-operator"
I0211 08:26:11.199301       1 vnids.go:115] Allocated netid 3246404 for namespace "openshift-kube-controller-manager"
I0211 08:26:11.204842       1 vnids.go:115] Allocated netid 8275068 for namespace "openshift-kube-apiserver-operator"
I0211 08:26:11.211155       1 vnids.go:115] Allocated netid 16503818 for namespace "kube-system"
I0211 08:26:11.216712       1 vnids.go:115] Allocated netid 14024028 for namespace "openshift-config"
I0211 08:26:11.222650       1 vnids.go:115] Allocated netid 12493483 for namespace "openshift-cluster-machine-approver"
I0211 08:26:11.229501       1 vnids.go:115] Allocated netid 839991 for namespace "openshift-sdn"
I0211 08:26:11.234792       1 vnids.go:115] Allocated netid 9817333 for namespace "openshift-cluster-api"
I0211 08:26:22.882077       1 leaderelection.go:249] failed to renew lease openshift-sdn/openshift-network-controller: failed to tryAcquireOrRenew context deadline exceeded
F0211 08:26:22.882105       1 network_controller.go:82] leaderelection lost

      Exit Code:    255
      Started:      Mon, 11 Feb 2019 16:26:08 +0800
      Finished:     Mon, 11 Feb 2019 16:26:23 +0800

# oc describe po sdn-kj8jl
-----
    State:          Running
      Started:      Mon, 11 Feb 2019 16:26:11 +0800
    Last State:     Terminated
      Reason:       Error
      Message:      2019/02/11 08:26:07 socat[5717] E connect(5, AF=1 "/var/run/openshift-sdn/cni-server.sock", 40): No such file or directory
I0211 08:26:07.850354    5625 cmd.go:229] Overriding kubernetes api to https://qe-bmeng-api.qe.devcluster.openshift.com:6443
I0211 08:26:07.850444    5625 cmd.go:132] Reading node configuration from /config/sdn-config.yaml
W0211 08:26:07.851820    5625 server.go:194] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
I0211 08:26:07.851881    5625 feature_gate.go:206] feature gates: &{map[]}
I0211 08:26:07.854074    5625 node.go:147] Initializing SDN node of type "redhat/openshift-ovs-networkpolicy" with configured hostname "ip-10-0-0-6.us-east-2.compute.internal" (IP ""), iptables sync period "30s"
I0211 08:26:07.859593    5625 cmd.go:196] Starting node networking (v4.0.0-0.150.0)
I0211 08:26:07.859615    5625 node.go:266] Starting openshift-sdn network plugin
F0211 08:26:10.844129    5625 cmd.go:113] Failed to start sdn: failed to validate network configuration: master has not created a default cluster network, network plugin "redhat/openshift-ovs-networkpolicy" can not start

      Exit Code:    255
      Started:      Mon, 11 Feb 2019 16:26:07 +0800
      Finished:     Mon, 11 Feb 2019 16:26:10 +0800
I don't think these are bugs. Instead, you're seeing two things:

1. An SDN pod managed to come up before the sdn-controller. So it crashes and restarts - that's expected behavior and not a big deal.
2. The SDN controller lost its lease: that's because of the bootstrap apiserver going down and the real one(s) coming up.
(In reply to Casey Callendrello from comment #1)
> I don't think these are bugs. Instead, you're seeing two things:
>
> 1. An SDN pod managed to come up before the sdn-controller. So it crashes
> and restarts - that's expected behavior and not a big deal.
> 2. The SDN controller lost its lease: that's because of the bootstrap
> apiserver going down and the real one(s) coming up.

Yeah, regarding the SDN pod restarts: the logs show that the sdn pod started before the sdn-controller pod was ready, which is what I wanted to report. Should we control the startup order of the network services? WDYT?
I don't really see the point in changing the code to wait; it only happens once or twice very early on in the bootstrapping process. I'll ask if anyone else is concerned about this, but I don't think it's a problem.