Created attachment 1425680 [details]
pod and system logs

Description of problem:

After running the e2e conformance tests a couple of times on an HA cluster, all of the master-controller pods were stuck in CrashLoopBackOff with the following fatal error in the pod logs:

F0423 15:00:42.262989 1 controller_manager.go:184] Error starting "openshift.io/sdn" (failed to start SDN plugin controller: cannot change the serviceNetworkCIDR of an already-deployed cluster)

Checking master-config.yaml, the serviceNetworkCIDR has the same value as when the cluster was originally installed; it is not clear why the controller thinks it has changed. Restarting the pods and rebooting the node did not bring the cluster out of this condition.

Attaching the pod logs, the system journal from one master, the inventory used for the install, and master-config.yaml from one of the failed masters.

Version-Release number of selected component (if applicable):
3.10.0-0.27.0

How reproducible:
Unknown - need to reinstall and try to reproduce again.

Steps to Reproduce:
1. Installed an HA cluster in AWS: 3 master/etcd, 1 infra, 2 computes.
2. Ran a subset of the conformance and EmptyDir tests: https://github.com/openshift/svt/blob/master/conformance/svt_conformance.sh. The run completed and the cluster looked OK, but I did not do any extensive vetting of it.
3. Ran the script again; after a while the master-controller pods started crash looping.

I'll try this again to see if it is repeatable. Running these tests is a normal part of smoke testing our clusters and has not triggered this issue in the past.

Actual results:

All master-controller pods in CrashLoopBackOff with this message in the log:

F0423 15:00:42.262989 1 controller_manager.go:184] Error starting "openshift.io/sdn" (failed to start SDN plugin controller: cannot change the serviceNetworkCIDR of an already-deployed cluster)
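For anyone else debugging this: as I understand it, the SDN controller compares the serviceNetworkCIDR in master-config.yaml against the network state it recorded in the cluster at install time, so a useful sanity check (a suggestion, assuming a healthy master is still serving the API and that this release exposes the clusternetwork resource) is to compare the two directly:

# grep serviceNetworkCIDR /etc/origin/master/master-config.yaml
# oc get clusternetwork default -o yaml

If the serviceNetwork value stored in the clusternetwork object differs from the value in the file, that would explain the fatal error even though the file itself never changed.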
Created attachment 1425681 [details]
redacted inventory

openshift_master_portal_net=172.24.0.0/14
openshift_portal_net=172.24.0.0/14
osm_cluster_network_cidr=172.20.0.0/14

These are our usual values for scalability and performance testing.
Created attachment 1425682 [details]
master-config.yaml

networkConfig:
  clusterNetworkCIDR: 172.20.0.0/14
  clusterNetworks:
  - cidr: 172.20.0.0/14
    hostSubnetLength: 9
  externalIPNetworkCIDRs:
  - 0.0.0.0/0
  hostSubnetLength: 9
  networkPluginName: redhat/openshift-ovs-networkpolicy
  serviceNetworkCIDR: 172.24.0.0/14
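For cross-reference with the redacted inventory attached above (this mapping is my reading of the openshift-ansible variables, not something taken from the attachments):

openshift_portal_net=172.24.0.0/14      -> serviceNetworkCIDR: 172.24.0.0/14
osm_cluster_network_cidr=172.20.0.0/14  -> clusterNetworkCIDR / clusterNetworks[0].cidr: 172.20.0.0/14

So the installed master-config.yaml does match the inventory; whatever the controller is objecting to, it is not drift between these two files.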
This was reproduced in a second cluster and was not related to running e2e tests.
(In reply to Mike Fiedler from comment #3)
> This was reproduced in a second cluster and was not related to running e2e
> tests.

I hit this issue on a one-master cluster:

# mv /etc/origin/node/pods/controller.yaml /tmp/

and then moved the file back after 1 minute. After that we see the following:

# oc get pod -w -n kube-system
NAME                                                             READY     STATUS             RESTARTS   AGE
master-api-ip-172-31-27-254.us-west-2.compute.internal           1/1       Running            0          12m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Running            0          12m
master-etcd-ip-172-31-27-254.us-west-2.compute.internal          1/1       Running            0          12m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Terminating        0          12m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Terminating        0          12m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Pending            0          0s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Error              0          2s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Error              1          3s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       CrashLoopBackOff   1          4s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Running            2          24s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Error              2          25s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       CrashLoopBackOff   2          32s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Running            3          49s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Error              3          50s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       CrashLoopBackOff   3          52s
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Running            4          1m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Error              4          1m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       CrashLoopBackOff   4          1m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Running            5          2m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Error              5          2m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       CrashLoopBackOff   5          3m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Running            6          5m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Error              6          5m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       CrashLoopBackOff   6          5m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   1/1       Running            7          10m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       Error              7          10m
master-controllers-ip-172-31-27-254.us-west-2.compute.internal   0/1       CrashLoopBackOff   7          10m

Is there a way to recover the master-controllers pod in this situation? Thanks.
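A possible recovery path, untested here: 3.10 static-pod masters ship a master-restart helper (assuming it is present on the host), which bounces a single static pod:

# master-restart controllers

Moving the manifest out of /etc/origin/node/pods/ and back, as above, should have the same effect. Note that neither helps if the underlying cause is the serviceNetworkCIDR mismatch from the original report: the pod will keep crashing on startup until the configuration disagreement is resolved.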
Weibin, can you please reproduce this and then grab me so we can take a look at it? Thanks.
@Mike, openshift_portal_net=172.24.0.0/14 will not take effect in oc v3.10.0-0.28.0. Could you try to reproduce this one more time? Thanks!
@Weibin, I tried with:

# yum list installed | grep openshift
atomic-openshift.x86_64    3.10.0-0.28.0.git.0.66790cb.el7

After moving /etc/origin/node/pods/controller.yaml out of the way, the controllers pod disappears, and it recovers cleanly when the file is moved back.
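For the record, the full verification sequence (paths as in comment 4; the 60-second wait is arbitrary, any interval long enough for the kubelet to notice the manifest change should do):

# mv /etc/origin/node/pods/controller.yaml /tmp/
# sleep 60
# mv /tmp/controller.yaml /etc/origin/node/pods/
# oc get pod -w -n kube-system

On 3.10.0-0.28.0 the master-controllers pod returns to Running instead of entering CrashLoopBackOff.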
@Hongkai, yes, I tried your steps above this morning and got the same results as your comment 7. So with this latest code, neither of us sees the controller-pod crash from your comment 4.
I've not seen this crash recently. Marking this as worksforme.