Bug 1392296 - OpenShift handing out .255 network address to a pod when HOST_SUBNET_LENGTH is 8
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.4.0
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: Dan Williams
QA Contact: Mike Fiedler
Reported: 2016-11-07 01:57 EST by Mike Fiedler
Modified: 2017-03-08 13 EST
CC: 5 users

Doc Type: No Doc Update
Last Closed: 2017-01-18 07:50:04 EST
Type: Bug


External Trackers:
  Origin (GitHub) 11815, last updated 2016-11-08 11:19 EST
  Red Hat Product Errata RHBA-2017:0066: normal, SHIPPED_LIVE, "Red Hat OpenShift Container Platform 3.4 RPM Release Advisory", last updated 2017-01-18 12:23:26 EST

Description Mike Fiedler 2016-11-07 01:57:11 EST
Description of problem:

While doing hundreds of scale-up/scale-down operations, pods "randomly" get stuck and never start. They never pass the readiness check and sit in CrashLoopBackOff. The pods that get stuck like this always have .255 IP addresses, which is the broadcast address of the node's pod subnet and should never be assigned to a pod.

networkConfig:
  clusterNetworkCIDR: 172.20.0.0/14
  hostSubnetLength: 8
  networkPluginName: redhat/openshift-ovs-multitenant
# serviceNetworkCIDR must match kubernetesMasterConfig.servicesSubnet
  serviceNetworkCIDR: 172.24.0.0/14
  externalIPNetworkCIDRs:
  - 0.0.0.0/0
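
Why .255 matters here: hostSubnetLength: 8 gives each node 2^8 = 256 addresses out of the cluster network, i.e. a /24 (32 - 8 = 24 prefix bits), and in a /24 the .255 address is the subnet broadcast address. A quick check with Python's standard ipaddress module (illustrative only; the addresses come from the config above):

import ipaddress

# Cluster pod network and per-node subnet size from the networkConfig
# above: hostSubnetLength 8 leaves 32 - 8 = 24 prefix bits, so each
# node gets a /24 (256 addresses).
cluster = ipaddress.ip_network("172.20.0.0/14")
node_subnet = next(cluster.subnets(new_prefix=32 - 8))

print(node_subnet)                     # 172.20.0.0/24
print(node_subnet.broadcast_address)   # 172.20.0.255, the kind of address the stuck pods get

# .hosts() excludes the network and broadcast addresses, which is the
# range a correct IPAM allocator should draw from:
hosts = list(node_subnet.hosts())
print(hosts[0], hosts[-1])             # 172.20.0.1 172.20.0.254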


Version-Release number of selected component (if applicable): 3.4.0.22


How reproducible: Always, when scale up/down is run long enough


Steps to Reproduce:
1.  Install a 1-master, 2-node OCP 3.4 cluster with the network configuration above
2.  oc new-app cakephp-mysql-example
3.  oc edit dc/cakephp-mysql-example and remove the 512M memory limit
4.  oc scale --replicas=200 dc/cakephp-mysql-example
5.  oc scale --replicas=0 dc/cakephp-mysql-example
6.  verify all pods are running and none are in CrashLoopBackOff
7.  repeat steps 4 through 6 until one of the pods is assigned a .255 address, gets stuck, and cannot initialize (a scripted version of this loop is sketched below)
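
Steps 4 through 6 can be scripted roughly as follows. This is a hypothetical sketch, not the actual test harness: it assumes an authenticated oc client, uses fixed sleeps instead of properly polling readiness, and greps "oc get pods -o wide" output (which includes the pod IP column) for .255 addresses:

import subprocess
import time

DC = "dc/cakephp-mysql-example"

def scale(replicas):
    subprocess.check_call(["oc", "scale", "--replicas=%d" % replicas, DC])

def stuck_255_pods():
    # Crude heuristic: look for pods in CrashLoopBackOff whose IP
    # column ends in .255.
    out = subprocess.check_output(["oc", "get", "pods", "-o", "wide"])
    return [line for line in out.decode().splitlines()
            if "CrashLoopBackOff" in line and ".255 " in line]

for i in range(1, 1000):        # "long enough", per the report
    scale(200)
    time.sleep(120)             # crude wait for pods to come up
    bad = stuck_255_pods()
    if bad:
        print("hit the bug on iteration %d:" % i)
        print("\n".join(bad))
        break
    scale(0)
    time.sleep(60)              # wait for teardown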

Actual results:

Eventually a pod gets a .255 address and becomes stuck. The network debug script location will be added shortly.

Expected results:

All pods can start. No pod is handed an address that is invalid for its subnet.
Comment 3 Dan Winship 2016-11-07 09:11:58 EST
dcbw has already fixed this in CNI upstream. Presumably he knows whether we should fully rebase CNI or just pull in those fixes.
Comment 4 Dan Williams 2016-11-07 15:29:18 EST
Origin PR that bumps CNI: https://github.com/openshift/origin/pull/11815
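
For context, the shape of the fix: an IPAM allocator must skip the subnet's network and broadcast addresses when handing out pod IPs. An illustrative check (not the actual CNI patch):

import ipaddress

def allocatable(ip, subnet):
    # Illustrative only: a pod IP is usable if it falls inside the
    # node subnet and is neither the network nor the broadcast address.
    ip = ipaddress.ip_address(ip)
    net = ipaddress.ip_network(subnet)
    return ip in net and ip not in (net.network_address, net.broadcast_address)

print(allocatable("172.20.0.254", "172.20.0.0/24"))  # True
print(allocatable("172.20.0.255", "172.20.0.0/24"))  # False, the bug's symptom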
Comment 5 Troy Dawson 2016-11-09 14:50:41 EST
This has been merged into OSE and is in OSE v3.4.0.24 or newer.
Comment 7 Mike Fiedler 2016-11-09 21:03:13 EST
Verified on 3.4.0.24. Performed hundreds of scale up/down iterations as in the original scenario and all pods became active. No pods were handed a bad IP.
Comment 9 errata-xmlrpc 2017-01-18 07:50:04 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066
