Bug 1674384 - Some of the sdn pods get restarted during the cluster setup
Summary: Some of the sdn pods get restarted during the cluster setup
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Casey Callendrello
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-02-11 08:50 UTC by Meng Bo
Modified: 2019-03-12 14:25 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-28 15:22:49 UTC
Target Upstream Version:
Embargoed:



Description Meng Bo 2019-02-11 08:50:41 UTC
Description of problem:
Set up an OCP cluster on AWS via the openshift-install tool.

After the cluster setup, checked the pods in the openshift-sdn project and found that some of the sdn-controller pods and the sdn pods running on the master nodes had been restarted.

Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-02-11-024933

How reproducible:
always

Steps to Reproduce:
1. Set up the OCP cluster via the openshift-install tool
# openshift-install create cluster

2. Switch to the openshift-sdn project and check the pods after cluster setup
# oc get po -o wide
NAME                   READY     STATUS    RESTARTS   AGE       IP             NODE                                         NOMINATED NODE
ovs-277gv              1/1       Running   0          18m       10.0.0.6       ip-10-0-0-6.us-east-2.compute.internal       <none>
ovs-8x7rh              1/1       Running   0          18m       10.0.23.247    ip-10-0-23-247.us-east-2.compute.internal    <none>
ovs-gqlsx              1/1       Running   0          9m52s     10.0.133.194   ip-10-0-133-194.us-east-2.compute.internal   <none>
ovs-jwb45              1/1       Running   0          10m       10.0.149.22    ip-10-0-149-22.us-east-2.compute.internal    <none>
ovs-l2twg              1/1       Running   0          18m       10.0.33.73     ip-10-0-33-73.us-east-2.compute.internal     <none>
ovs-rfp8w              1/1       Running   0          9m52s     10.0.173.129   ip-10-0-173-129.us-east-2.compute.internal   <none>
sdn-4vlt5              1/1       Running   0          18m       10.0.23.247    ip-10-0-23-247.us-east-2.compute.internal    <none>
sdn-controller-62hpw   1/1       Running   0          18m       10.0.0.6       ip-10-0-0-6.us-east-2.compute.internal       <none>
sdn-controller-7rxpq   1/1       Running   1          18m       10.0.33.73     ip-10-0-33-73.us-east-2.compute.internal     <none>
sdn-controller-kszpb   1/1       Running   0          18m       10.0.23.247    ip-10-0-23-247.us-east-2.compute.internal    <none>
sdn-kj8jl              1/1       Running   1          18m       10.0.0.6       ip-10-0-0-6.us-east-2.compute.internal       <none>
sdn-kpzp4              1/1       Running   0          10m       10.0.149.22    ip-10-0-149-22.us-east-2.compute.internal    <none>
sdn-qlfp9              1/1       Running   1          18m       10.0.33.73     ip-10-0-33-73.us-east-2.compute.internal     <none>
sdn-vsxwg              1/1       Running   0          9m52s     10.0.133.194   ip-10-0-133-194.us-east-2.compute.internal   <none>
sdn-wh4c4              1/1       Running   0          9m52s     10.0.173.129   ip-10-0-173-129.us-east-2.compute.internal   <none>

3. Check the events of the restarted pods
# oc describe po
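(The termination messages quoted under "Additional info" below can also be pulled directly from the previous container instance of each restarted pod, e.g. for the pod names seen in step 2:
# oc logs --previous -n openshift-sdn sdn-controller-7rxpq
# oc logs --previous -n openshift-sdn sdn-kj8jl)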

Actual results:
The sdn pods get restarted during the cluster setup.

Expected results:
The sdn pods should not get restarted unexpectedly.

Additional info:
# oc describe po sdn-controller-7rxpq
-----
    State:       Running
      Started:   Mon, 11 Feb 2019 16:26:23 +0800
    Last State:  Terminated
      Reason:    Error
      Message:        1 subnets.go:133] Created HostSubnet ip-10-0-33-73.us-east-2.compute.internal (host: "ip-10-0-33-73.us-east-2.compute.internal", ip: "10.0.33.73", subnet: "10.130.0.0/23")
I0211 08:26:11.154850       1 vnids.go:115] Allocated netid 10872476 for namespace "openshift-cluster-version"
I0211 08:26:11.161586       1 vnids.go:115] Allocated netid 14266677 for namespace "openshift-config-managed"
I0211 08:26:11.167059       1 vnids.go:115] Allocated netid 8818831 for namespace "openshift-network-operator"
I0211 08:26:11.172249       1 vnids.go:115] Allocated netid 0 for namespace "default"
I0211 08:26:11.177547       1 vnids.go:115] Allocated netid 13800730 for namespace "openshift-kube-scheduler"
I0211 08:26:11.183488       1 vnids.go:115] Allocated netid 2608316 for namespace "openshift-core-operators"
I0211 08:26:11.188820       1 vnids.go:115] Allocated netid 7384650 for namespace "openshift-kube-apiserver"
I0211 08:26:11.194070       1 vnids.go:115] Allocated netid 1484940 for namespace "openshift-machine-config-operator"
I0211 08:26:11.199301       1 vnids.go:115] Allocated netid 3246404 for namespace "openshift-kube-controller-manager"
I0211 08:26:11.204842       1 vnids.go:115] Allocated netid 8275068 for namespace "openshift-kube-apiserver-operator"
I0211 08:26:11.211155       1 vnids.go:115] Allocated netid 16503818 for namespace "kube-system"
I0211 08:26:11.216712       1 vnids.go:115] Allocated netid 14024028 for namespace "openshift-config"
I0211 08:26:11.222650       1 vnids.go:115] Allocated netid 12493483 for namespace "openshift-cluster-machine-approver"
I0211 08:26:11.229501       1 vnids.go:115] Allocated netid 839991 for namespace "openshift-sdn"
I0211 08:26:11.234792       1 vnids.go:115] Allocated netid 9817333 for namespace "openshift-cluster-api"
I0211 08:26:22.882077       1 leaderelection.go:249] failed to renew lease openshift-sdn/openshift-network-controller: failed to tryAcquireOrRenew context deadline exceeded
F0211 08:26:22.882105       1 network_controller.go:82] leaderelection lost

      Exit Code:    255
      Started:      Mon, 11 Feb 2019 16:26:08 +0800
      Finished:     Mon, 11 Feb 2019 16:26:23 +0800


# oc describe po sdn-kj8jl
-----
    State:       Running 
      Started:   Mon, 11 Feb 2019 16:26:11 +0800
    Last State:  Terminated
      Reason:    Error   
      Message:   2019/02/11 08:26:07 socat[5717] E connect(5, AF=1 "/var/run/openshift-sdn/cni-server.sock", 40): No such file or directory
I0211 08:26:07.850354    5625 cmd.go:229] Overriding kubernetes api to https://qe-bmeng-api.qe.devcluster.openshift.com:6443
I0211 08:26:07.850444    5625 cmd.go:132] Reading node configuration from /config/sdn-config.yaml
W0211 08:26:07.851820    5625 server.go:194] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
I0211 08:26:07.851881    5625 feature_gate.go:206] feature gates: &{map[]}
I0211 08:26:07.854074    5625 node.go:147] Initializing SDN node of type "redhat/openshift-ovs-networkpolicy" with configured hostname "ip-10-0-0-6.us-east-2.compute.internal" (IP ""), iptables sync period "30s"
I0211 08:26:07.859593    5625 cmd.go:196] Starting node networking (v4.0.0-0.150.0)
I0211 08:26:07.859615    5625 node.go:266] Starting openshift-sdn network plugin
F0211 08:26:10.844129    5625 cmd.go:113] Failed to start sdn: failed to validate network configuration: master has not created a default cluster network, network plugin "redhat/openshift-ovs-networkpolicy" can not start

      Exit Code:    255  
      Started:      Mon, 11 Feb 2019 16:26:07 +0800
      Finished:     Mon, 11 Feb 2019 16:26:10 +0800

Comment 1 Casey Callendrello 2019-02-18 17:44:36 UTC
I don't think these are bugs. Instead, you're seeing two things:

1. An SDN pod managed to come up before the sdn-controller. So it crashes and restarts - that's expected behavior and not a big deal.
2. The SDN controller lost its lease: that's because of the bootstrap apiserver going down and the real one(s) coming up.
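(For reference, one way to confirm these are one-off restarts from the bootstrap phase rather than an ongoing crash loop is to check that the restart count stays put and to look at the recorded last termination of the container, e.g.:

# oc -n openshift-sdn get pod sdn-controller-7rxpq -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

which should show the same Error / exit code 255 termination quoted in the description.)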

Comment 2 Meng Bo 2019-02-22 06:21:12 UTC
(In reply to Casey Callendrello from comment #1)
> I don't think these are bugs. Instead, you're seeing two things:
> 
> 1. An SDN pod managed to come up before the sdn-controller. So it crashes
> and restarts - that's expected behavior and not a big deal.
> 2. The SDN controller lost its lease: that's because of the bootstrap
> apiserver going down and the real one(s) coming up.

Yeah, regarding the SDN pod restarts: from the logs I found that the sdn pod started before the sdn-controller pod was ready. That is what I wanted to report.
We should control the startup order of the network services.
WDYT?
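For illustration only, the kind of ordering gate being suggested here could look roughly like a wait loop (or init container) in front of the sdn node process that blocks until the sdn-controller has created the default cluster network; this is just a hypothetical sketch, not an actual patch from this bug:

# hypothetical pre-start gate for the sdn node process:
# block until the default ClusterNetwork object exists (the object the failed
# pod complained about: "master has not created a default cluster network")
until oc get clusternetwork default >/dev/null 2>&1; do
  sleep 2
done
# ...then start the sdn node process as before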

Comment 3 Casey Callendrello 2019-02-22 12:54:34 UTC
I don't really see the point in changing the code to wait; it only happens once or twice very early on in the bootstrapping process. I'll ask if anyone else is concerned about this, but I don't think it's a problem.

