Bug 1425388

Summary:	Increase the ARP cache size in the atomic-openshift-master and atomic-openshift-node tuned profiles
Product:	OpenShift Container Platform	Reporter:	Jiří Mencák <jmencak>
Component:	Networking	Assignee:	Phil Cameron <pcameron>
Status:	CLOSED ERRATA	QA Contact:	Meng Bo <bmeng>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3.5.0	CC:	aos-bugs, eparis, pcameron, tdawson, zzhao
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	aos-scalability-35
Fixed In Version:		Doc Type:	Enhancement
Doc Text:	Feature: New default values for the ARP cache (docs PR 3803). Reason: Cluster fails with more than 1024 routes. Result: Problem does not occur.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-08-10 05:18:47 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Jiří Mencák 2017-02-21 10:18:59 UTC

Description of problem:

In OCP clusters with large numbers of routes (greater than the value of net.ipv4.neigh.default.gc_thresh3, which is 1024 by default), the ARP cache is not large enough to accommodate all the entries needed by the nodes running the router pods. While the tuning has been documented here:

https://docs.openshift.com/container-platform/3.4/install_config/router/default_haproxy_router.html#deploy-router-arp-cach-tuning-for-large-scale-clusters

I believe the larger ARP cache size should be the default in the atomic-openshift-master and atomic-openshift-node tuned profiles.
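
To check how close a node is to the limit described above, something like the following Python sketch may help. It is not part of the fix. It assumes that the number of rows in /proc/net/arp is a reasonable approximation of the IPv4 neighbour table size, and the function names and the 0.9 warning ratio are made up for illustration.

```python
from pathlib import Path

# Standard procfs locations on Linux.
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")
ARP_TABLE = Path("/proc/net/arp")


def count_arp_entries(arp_table: str) -> int:
    """Count entries in /proc/net/arp-style text (the first line is a header)."""
    lines = [ln for ln in arp_table.splitlines() if ln.strip()]
    return max(len(lines) - 1, 0)


def read_gc_thresholds(neigh_dir: Path = NEIGH_DIR) -> dict:
    """Read gc_thresh1..3 from the given sysctl directory."""
    names = ("gc_thresh1", "gc_thresh2", "gc_thresh3")
    return {name: int((neigh_dir / name).read_text()) for name in names}


def cache_pressure(entries: int, gc_thresh3: int) -> float:
    """Fraction of the hard limit (gc_thresh3) currently in use."""
    return entries / gc_thresh3


if __name__ == "__main__":
    if ARP_TABLE.exists() and (NEIGH_DIR / "gc_thresh3").exists():
        entries = count_arp_entries(ARP_TABLE.read_text())
        limit = read_gc_thresholds()["gc_thresh3"]
        ratio = cache_pressure(entries, limit)
        # 0.9 is an arbitrary illustrative threshold, not a documented value.
        status = "WARNING" if ratio >= 0.9 else "ok"
        print(f"{status}: {entries} ARP entries, gc_thresh3={limit} ({ratio:.0%})")
```

With the default gc_thresh3 of 1024, about 900 entries already puts the cache at roughly 88% of the hard limit.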

Version-Release number of selected component (if applicable):
All


How reproducible:
Always


Steps to Reproduce:
1. Create an OCP environment with around 1024 routes (I started noticing problems at around 900 routes).

Actual results:
1) Kernel messages:
[ 1738.811139] net_ratelimit: 1045 callbacks suppressed
[ 1743.823136] net_ratelimit: 293 callbacks suppressed

2) The oc client and networking in general stop working properly.
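
For anyone trying to spot the same symptom in saved kernel logs, a minimal Python sketch that extracts the suppressed-callback counts from lines like the ones above might look as follows. The function name is hypothetical and the pattern only covers the exact message format shown in this report. Note that net_ratelimit is a generic kernel rate-limiter message, so a match by itself does not prove the ARP cache is the cause.

```python
import re

# Matches lines such as:
# [ 1738.811139] net_ratelimit: 1045 callbacks suppressed
SUPPRESSED_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+net_ratelimit:\s+(\d+) callbacks suppressed"
)


def suppressed_callbacks(log_text: str) -> list:
    """Return (timestamp_seconds, suppressed_count) pairs found in the log text."""
    found = []
    for line in log_text.splitlines():
        match = SUPPRESSED_RE.match(line.strip())
        if match:
            found.append((float(match.group(1)), int(match.group(2))))
    return found


if __name__ == "__main__":
    sample = (
        "[ 1738.811139] net_ratelimit: 1045 callbacks suppressed\n"
        "[ 1743.823136] net_ratelimit: 293 callbacks suppressed\n"
    )
    events = suppressed_callbacks(sample)
    print(sum(count for _, count in events), "callbacks suppressed in total")
```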

Expected results:
None of the issues in "Actual results".

Additional info:
http://post-office.corp.redhat.com/archives/atomic-networking/2016-November/msg00082.html

Comment 6 openshift-github-bot 2017-02-22 16:52:12 UTC

Commit pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/59be5f894be526396d8b160adccc4481f489f765
Change default arp cache size on nodes

In OCP clusters with large numbers of routes (greater than the value of
net.ipv4.neigh.default.gc_thresh3, which is 1024 by default) the ARP
cache is not large enough to accommodate for all the entries needed by
the nodes running the router pods.

This change increases the cache size.

bug 1425388
https://bugzilla.redhat.com/show_bug.cgi?id=1425388

Signed-off-by: Phil Cameron <pcameron>

Comment 7 openshift-github-bot 2017-02-23 13:08:57 UTC

Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/ba842078f3bba0282d62a2c9db70ca4d9339e733
Change default arp cache size on nodes

In OCP clusters with large numbers of routes (greater than the value of
net.ipv4.neigh.default.gc_thresh3, which is 1024 by default) the ARP
cache is not large enough to accommodate for all the entries needed by
the nodes running the router pods.

This change increases the cache size.

bug 1425388
https://bugzilla.redhat.com/show_bug.cgi?id=1425388

Signed-off-by: Phil Cameron <pcameron>

Comment 9 Troy Dawson 2017-04-11 21:00:37 UTC

This has been merged into OCP and is in OCP v3.6.27 or newer.

Comment 13 errata-xmlrpc 2017-08-10 05:18:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716