Bug 1570877 - All master-controller pods stuck in crash loop claiming serviceNetworkCIDR changed when it did not
Summary: All master-controller pods stuck in crash loop claiming serviceNetworkCIDR changed when it did not
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.10.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.10.0
Assignee: Mike Fiedler
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-23 15:32 UTC by Mike Fiedler
Modified: 2018-05-01 17:00 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-01 17:00:13 UTC
Target Upstream Version:
Embargoed:


Attachments
pod and system logs (1.07 MB, application/x-gzip), 2018-04-23 15:32 UTC, Mike Fiedler
redacted inventory (7.50 KB, text/plain), 2018-04-23 15:33 UTC, Mike Fiedler
master-config.yaml (6.20 KB, text/plain), 2018-04-23 15:34 UTC, Mike Fiedler

Description Mike Fiedler 2018-04-23 15:32:50 UTC
Created attachment 1425680 [details]
pod and system logs

Description of problem:

After running the e2e conformance tests a couple of times on an HA cluster, all of the master-controller pods were stuck in CrashLoopBackOff with the following fatal error in the pod logs:

                                    
F0423 15:00:42.262989       1 controller_manager.go:184] Error starting "openshift.io/sdn" (failed to start SDN plugin controller: cannot change the serviceNetworkCIDR of an already-deployed cluster)


Checking master-config.yaml, the serviceNetworkCIDR has the same value it had when the cluster was originally installed; it is not clear why the controller thinks it has changed. Neither restarting nor rebooting the node brought it out of this condition.

Attaching the pod logs, the system journal from one master, the inventory used for the install, and master-config.yaml from one of the failed masters.
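
For reference, the fatal line above came from the previous container instance of the crash-looping static pod. Assuming the master-api pods are still running and an admin kubeconfig is available on a master (the pod name below is a placeholder), the same message can be pulled with roughly:

# oc get pods -n kube-system | grep master-controllers
# oc logs -n kube-system --previous master-controllers-<master-hostname> | grep -i serviceNetworkCIDR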

Version-Release number of selected component (if applicable): 3.10.0-0.27.0


How reproducible:  Unknown - need to reinstall and try to reproduce again.


Steps to Reproduce:
1.  Installed HA cluster in AWS:  3 master/etcd, 1 infra, 2 computes
2.  Ran a subset of conformance and EmptyDir tests:  https://github.com/openshift/svt/blob/master/conformance/svt_conformance.sh.  It ran to completion and the cluster looked OK, but I did not do any extensive vetting of it.
3.  Ran the script again and after a while the master-controller pods started crash looping.

I'll try this again to see if it is repeatable. Running these tests is a normal part of smoke testing our clusters, and they have not triggered this issue in the past.

Actual results:

master-controller pods all in CrashLoopBackOff with this message in the log:

F0423 15:00:42.262989       1 controller_manager.go:184] Error starting "openshift.io/sdn" (failed to start SDN plugin controller: cannot change the serviceNetworkCIDR of an already-deployed cluster)

Comment 1 Mike Fiedler 2018-04-23 15:33:48 UTC
Created attachment 1425681 [details]
redacted inventory

openshift_master_portal_net=172.24.0.0/14
openshift_portal_net=172.24.0.0/14
osm_cluster_network_cidr=172.20.0.0/14

These are our usual values for scalability and performance testing.
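
For context, this is roughly how openshift-ansible normally maps these inventory variables into master-config.yaml (worth double-checking against the attached files rather than taking as definitive for this build):

openshift_portal_net     -> networkConfig.serviceNetworkCIDR (and kubernetesMasterConfig.servicesSubnet)
osm_cluster_network_cidr -> networkConfig.clusterNetworkCIDR and networkConfig.clusterNetworks[0].cidr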

Comment 2 Mike Fiedler 2018-04-23 15:34:44 UTC
Created attachment 1425682 [details]
master-config.yaml

networkConfig:
  clusterNetworkCIDR: 172.20.0.0/14
  clusterNetworks:
  - cidr: 172.20.0.0/14
    hostSubnetLength: 9
  externalIPNetworkCIDRs:
  - 0.0.0.0/0
  hostSubnetLength: 9
  networkPluginName: redhat/openshift-ovs-networkpolicy
  serviceNetworkCIDR: 172.24.0.0/14
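
If the check behind the fatal error works the way the message suggests, the controller is comparing this networkConfig block against the cluster-scoped ClusterNetwork object rather than against an older copy of master-config.yaml. Assuming the API is still reachable with admin credentials, the stored values can be dumped for comparison with:

# oc get clusternetwork default -o yaml

and the serviceNetwork / clusterNetworks fields there diffed against the block above.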

Comment 3 Mike Fiedler 2018-04-23 19:39:11 UTC
This was reproduced in a second cluster and was not related to running e2e tests.

Comment 4 Hongkai Liu 2018-04-23 19:47:16 UTC
(In reply to Mike Fiedler from comment #3)
> This was reproduced in a second cluster and was not related to running e2e
> tests.

I hit this issue on a single-master cluster:

# mv /etc/origin/node/pods/controller.yaml /tmp/
and then mv it back after 1 minute
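
Spelled out as a shell sketch (the one-minute wait is just what I used; anything long enough for the static pod to be torn down should behave the same):

# mv /etc/origin/node/pods/controller.yaml /tmp/
# sleep 60
# mv /tmp/controller.yaml /etc/origin/node/pods/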

We will see the following:

# oc get pod -w -n kube-system 
NAME                                                             READY     STATUS    RESTARTS   AGE
master-api-ip-172-31-27-254.us-west-2.compute.internal           1/1       Running   0          12m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Running   0          12m
master-etcd-ip-172-31-27-254.us-west-2.compute.internal          1/1       Running   0          12m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Terminating   0         12m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Terminating   0         12m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Pending   0         0s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Error     0         2s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Error     1         3s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       CrashLoopBackOff   1         4s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Running   2         24s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Error     2         25s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       CrashLoopBackOff   2         32s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Running   3         49s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Error     3         50s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       CrashLoopBackOff   3         52s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Running   4         1m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Error     4         1m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       CrashLoopBackOff   4         1m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Running   5         2m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Error     5         2m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       CrashLoopBackOff   5         3m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Running   6         5m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Error     6         5m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       CrashLoopBackOff   6         5m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Running   7         10m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Error     7         10m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       CrashLoopBackOff   7         10m


Is there a way to recover the master controller pod in this situation?

Thanks.

Comment 5 Ben Bennett 2018-04-24 13:27:25 UTC
Weibin, can you reproduce this please, and then grab me so we can look at it? Thanks.

Comment 6 Weibin Liang 2018-04-25 18:13:09 UTC
@Mike, openshift_portal_net=172.24.0.0/14 will not take effect in oc v3.10.0-0.28.0. Could you try to reproduce this one more time? Thanks!

Comment 7 Hongkai Liu 2018-04-25 19:08:35 UTC
@Weibin,

I tried with
# yum list installed | grep openshift
atomic-openshift.x86_64       3.10.0-0.28.0.git.0.66790cb.el7

When I move the file /etc/origin/node/pods/controller.yaml out of the way, the controller pod disappears, and it recovers cleanly once the file is moved back.

Comment 8 Weibin Liang 2018-04-25 20:20:17 UTC
@Hongkai,

Yes, I tried your steps above this morning and got the same results as your comment 7.
So with this latest code, neither of us sees the controller-pod crash from your comment 4.

Comment 9 Mike Fiedler 2018-05-01 17:00:13 UTC
I've not seen this crash recently. Marking this as WORKSFORME.

