Description of problem:
After installing OpenShift on GCE, unable to create pods due to FailedScheduling.

Version-Release number of selected component (if applicable):
openshift v3.3.0.6
kubernetes v1.3.0+57fb9ac
etcd 2.3.0+git

How reproducible:
Always

Steps to Reproduce:
1. Install OpenShift with ansible on GCE with 2 nodes
2. Check the registry and router pods as admin in the default project

Actual results:
The registry and router pods remain in Pending status due to FailedScheduling.

oc get nodes
NAME                             STATUS                     AGE
qe-wehe-master-1                 Ready,SchedulingDisabled   6h
qe-wehe-node-registry-router-1   Ready                      6h

oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-6-deploy   0/1       Pending   0          6h
router-1-deploy            0/1       Pending   0          6h

oc describe pods
FirstSeen   LastSeen   Count   From                  SubobjectPath   Type      Reason             Message
---------   --------   -----   ----                  -------------   ----      ------             -------
6h          19s        1365    {default-scheduler}                   Warning   FailedScheduling   no nodes available to schedule pods

Expected results:
The pods can be scheduled and run on the nodes.

Additional info:
To check the node info:

journalctl -u atomic-openshift-master | grep qe-wehe-node-registry-router-1

Jul 18 04:45:17 qe-wehe-master-1 atomic-openshift-master[23171]: I0718 04:45:17.164555 23171 nodecontroller.go:821] Node qe-wehe-node-registry-router-1 ReadyCondition updated. Updating timestamp: {Capacity:map[alpha.kubernetes.io/nvidia-gpu:{i:{value:0 scale:0} d:{Dec:<nil>} s:0 Format:DecimalSI} cpu:{i:{value:1 scale:0} d:{Dec:<nil>} s:1 Format:DecimalSI} memory:{i:{value:3711508480 scale:0} d:{Dec:<nil>} s: Format:BinarySI} pods:{i:{value:110 scale:0} d:{Dec:<nil>} s:110 Format:DecimalSI}] Allocatable:map[alpha.kubernetes.io/nvidia-gpu:{i:{value:0 scale:0} d:{Dec:<nil>} s:0 Format:DecimalSI} cpu:{i:{value:1 scale:0} d:{Dec:<nil>} s:1 Format:DecimalSI} memory:{i:{value:3711508480 scale:0} d:{Dec:<nil>} s: Format:BinarySI} pods:{i:{value:110 scale:0} d:{Dec:<nil>} s:110 Format:DecimalSI}] Phase: Conditions:[{Type:NetworkUnavailable Status:True LastHeartbeatTime:{Time:0001-01-01 00:00:00 +0000 UTC} LastTransitionTime:{Time:2016-07-17 21:57:35 -0400 EDT} Reason:NoRouteCreated Message:Node created without a route} {Type:OutOfDisk Status:False LastHeartbeatTime:{Time:2016-07-18 04:45:04 -0400 EDT} LastTransitionTime:{Time:2016-07-17 21:57:36 -0400 EDT} Reason:KubeletHasSufficientDisk Message:kubelet has sufficient disk space available} {Type:MemoryPressure Status:False LastHeartbeatTime:{Time:2016-07-18 04:45:04 -0400 EDT} LastTransitionTime:{Time:2016-07-17 21:57:36 -0400 EDT} Reason:KubeletHasSufficientMemory Message:kubelet has sufficient memory available} {Type:Ready Status:True LastHeartbeatTime:{Time:2016-07-18 04:45:04 -0400 EDT} LastTransitionTime:{Time:2016-07-17 21:57:36 -0400 EDT} Reason:KubeletReady Message:kubelet is posting ready status}] Addresses:[{Type:InternalIP Address:10.240.0.4} {Type:ExternalIP Address:104.197.105.156}] DaemonEndpoints:{KubeletEndpoint:{Port:10250}} NodeInfo:{MachineID:4093bf66a4a4444886ac88feb9f56896 SystemUUID:452EC365-F247-2419-CF0D-E07E92D50793 BootID:29c8f130-5496-41f8-8cba-3649fa60fceb KernelVersion:3.10.0-327.el7.x86_64 OSImage:Red Hat Enterprise Linux Server 7.2 (Maipo) ContainerRuntimeVersion:docker://1.10.3 KubeletVersion:v1.3.0+5
Jul 18 04:45:19 qe-wehe-master-1 atomic-openshift-master[23171]: I0718 04:45:19.468769 23171 factory.go:448] Ignoring node qe-wehe-node-registry-router-1 with NetworkUnavailable condition status True
Jul 18 04:45:19 qe-wehe-master-1 atomic-openshift-master[23171]: I0718 04:45:19.468777 23171 listers.go:160] Node qe-wehe-node-registry-router-1 matches none of the conditions
Jul 18 04:45:20 qe-wehe-master-1 atomic-openshift-master[23171]: I0718 04:45:20.390718 23171 factory.go:448] Ignoring node qe-wehe-node-registry-router-1 with NetworkUnavailable condition status True
Jul 18 04:45:20 qe-wehe-master-1 atomic-openshift-master[23171]: I0718 04:45:20.390724 23171 listers.go:160] Node qe-wehe-node-registry-router-1 matches none of the conditions
Jul 18 04:45:21 qe-wehe-master-1 atomic-openshift-master[23171]: I0718 04:45:21.393346 23171 factory.go:448] Ignoring node qe-wehe-node-registry-router-1 with NetworkUnavailable condition status True
Jul 18 04:45:21 qe-wehe-master-1 atomic-openshift-master[23171]: I0718 04:45:21.393355 23171 listers.go:160] Node qe-wehe-node-registry-router-1 matches none of the conditions
Jul 18 04:45:23 qe-wehe-master-1 atomic-openshift-master[23171]: I0718 04:45:23.396600 23171 factory.go:448] Ignoring node qe-wehe-node-registry-router-1 with NetworkUnavailable condition status True
Jul 18 04:45:23 qe-wehe-master-1 atomic-openshift-master[23171]: I0718 04:45:23.396605 23171 listers.go:160] Node qe-wehe-node-registry-router-1 matches none of the conditions
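The "Ignoring node ... with NetworkUnavailable condition status True" lines point at the node's conditions as the thing to inspect. A quick way to look at the offending condition directly (commands are illustrative; the node name is taken from the output above):

    # Show the node's full condition list
    oc describe node qe-wehe-node-registry-router-1

    # Or pull just the NetworkUnavailable condition out of the node status
    oc get node qe-wehe-node-registry-router-1 -o yaml | grep -B 6 'type: NetworkUnavailable'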
This looks related to https://github.com/kubernetes/kubernetes/issues/26983
atomic-openshift 3.3.0.6 already includes https://github.com/kubernetes/kubernetes/pull/27525, which was the supposed fix for the GCE issue. So either that fix is incomplete or this is a different problem.
Kube 26983 and 27525 actually only apply to non-GCE setups, so they don't appear to be relevant here. Instead, this looks more like:

https://github.com/kubernetes/kubernetes/issues/27994
https://github.com/kubernetes/kubernetes/issues/27071

    - lastHeartbeatTime: null
      lastTransitionTime: 2016-07-19T05:30:08Z
      message: Node created without a route
      reason: NoRouteCreated
      status: "True"
      type: NetworkUnavailable

Wenqi He, are you sure the GCE routes to your nodes are created and correct? The route controller should create them eventually, but it does that asynchronously. If there are still no routes to the nodes after 10 or 15 minutes or so, then perhaps your permissions are wrong and the routes cannot be created. Can you look in your master-controller logs for "Could not create route" error messages?
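For reference, a minimal sketch of both checks suggested above (the grep pattern matches the error message quoted in the previous paragraph; the gcloud invocation assumes the Google Cloud SDK is configured for the project):

    # On the master: look for route-creation failures from the route controller
    journalctl -u atomic-openshift-master | grep -i "could not create route"

    # From a machine with the Google Cloud SDK: confirm a per-node route
    # actually exists for qe-wehe-node-registry-router-1
    gcloud compute routes list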
Raising the priority since this is blocking testing on GCE.
Theory: On startup for GCE, the kubelet runs this code:

    // Initially, set NodeNetworkUnavailable to true.
    if kl.providerRequiresNetworkingConfiguration() {
        node.Status.Conditions = append(node.Status.Conditions, api.NodeCondition{
            Type:               api.NodeNetworkUnavailable,
            Status:             api.ConditionTrue,
            Reason:             "NoRouteCreated",
            Message:            "Node created without a route",
            LastTransitionTime: unversioned.NewTime(kl.clock.Now()),
        })
    }

which sets NodeNetworkUnavailable on the node in some cases. Whether it applies is determined by:

    func (kl *Kubelet) providerRequiresNetworkingConfiguration() bool {
        if kl.cloud == nil || kl.cloud.ProviderName() != "gce" || kl.flannelExperimentalOverlay {
            return false
        }
        _, supported := kl.cloud.Routes()
        return supported
    }

In the case of GCE, (1) the early "return false" does not trigger and (2) routes exist for the cluster. Thus this function returns true and the node gets NodeNetworkUnavailable set on it.

This condition is supposed to be cleared by the route controller. But the route controller has this code:

    func (rc *RouteController) reconcile(nodes []api.Node, routes []*cloudprovider.Route) error {
        ... <snip>
        for _, node := range nodes {
            // Skip if the node hasn't been assigned a CIDR yet.
            if node.Spec.PodCIDR == "" {
                continue
            }
            // Check if we have a route for this node w/ the correct CIDR.
            r := routeMap[node.Name]

Thus if the node does not have a PodCIDR, it will never get the NodeNetworkUnavailable condition cleared. openshift-sdn does not use PodCIDR; it uses HostSubnet resources instead. So when running openshift-sdn on GCE, this condition is expected to happen. AWS is not affected because the kubelet code in providerRequiresNetworkingConfiguration() exempts the node from the initial network-unavailable condition. This is an expectation mismatch between upstream Kubernetes cloud support and openshift-sdn cloud support.
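A quick way to confirm this theory on the affected cluster (a sketch; it assumes the HostSubnet resource is named after the node, as openshift-sdn normally does):

    # Expected to print nothing: openshift-sdn does not populate spec.podCIDR,
    # so the route controller skips this node and never clears the condition
    oc get node qe-wehe-node-registry-router-1 -o jsonpath='{.spec.podCIDR}'

    # openshift-sdn tracks the node's subnet here instead of in PodCIDR
    oc get hostsubnet qe-wehe-node-registry-router-1

    # The NetworkUnavailable condition that consequently stays True
    oc get node qe-wehe-node-registry-router-1 -o yaml | grep -B 6 'type: NetworkUnavailable'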
Possible fix in https://github.com/openshift/origin/pull/10471. Is there any way you can build that fix in and try it? Or would it work better if you had some OpenShift RPMs that you could manually update the master with?
Discussed with the QE production team; it is very hard for QE to build and deploy OCP with only a PR fix provided. Maybe we can ask @sdodson (sdodson) to help here. Thanks.
(In reply to Wenqi He from comment #12)
> Discussed with the QE production team; it is very hard for QE to build and
> deploy OCP with only a PR fix provided. Maybe we can ask @sdodson (sdodson)
> to help here. Thanks.

I can build a new set of atomic-openshift RPMs for you with the candidate fix included if you can tell me the RPM name/version/release of what you are currently running. Then you could use these RPMs for the deployment and testing. Would that work?
(In reply to Dan Williams from comment #13)
> (In reply to Wenqi He from comment #12)
> > Discussed with the QE production team; it is very hard for QE to build and
> > deploy OCP with only a PR fix provided. Maybe we can ask @sdodson (sdodson)
> > to help here. Thanks.
>
> I can build a new set of atomic-openshift RPMs for you with the candidate
> fix included if you can tell me the RPM name/version/release of what you are
> currently running. Then you could use these RPMs for the deployment and
> testing. Would that work?

I have installed a new GCE env and updated the RPMs to include this fix, but unfortunately this issue still reproduces: all the pods are in Pending status with FailedScheduling.
Updated github PR: https://github.com/openshift/origin/pull/10545
I have tested this with containerized OCP on the version below, and this problem is fixed:

openshift v3.3.0.25+d2ac65e-dirty
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

I have not yet had a chance to verify it on an RPM installation; we hit a blocking issue today and will try to verify it tomorrow.
I have verified this on the version below; this bug is fixed:

openshift v3.3.0.25+d2ac65e-dirty
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

[root@qe-wehe-master-1 ~]# oc get pods
NAME              READY     STATUS    RESTARTS   AGE
hello-openshift   1/1       Running   0          8h
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1933