1425388 – Increase the ARP cache size in the atomic-openshift-master and atomic-openshift-node tuned profiles


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	3.5.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Phil Cameron
QA Contact:	Meng Bo
Docs Contact:
URL:
Whiteboard:	aos-scalability-35
Depends On:
Blocks:

Reported:	2017-02-21 10:18 UTC by Jiří Mencák
Modified:	2017-08-16 19:51 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Enhancement
Doc Text:	Feature: New default values for the ARP cache (docs PR 3803). Reason: The cluster fails with more than 1024 routes. Result: The problem no longer occurs.
Clone Of:
Environment:
Last Closed:	2017-08-10 05:18:47 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Origin (Github)	13034	0	None	None	None	2017-02-21 19:27:50 UTC
Red Hat Product Errata	RHEA-2017:1716	0	normal	SHIPPED_LIVE	Red Hat OpenShift Container Platform 3.6 RPM Release Advisory	2017-08-10 09:02:50 UTC

Description Jiří Mencák 2017-02-21 10:18:59 UTC

Description of problem:

In OCP clusters with large numbers of routes (greater than the value of net.ipv4.neigh.default.gc_thresh3, which is 1024 by default), the ARP cache is not large enough to hold all the entries needed by the nodes running the router pods. The larger cache settings are documented here:

https://docs.openshift.com/container-platform/3.4/install_config/router/default_haproxy_router.html#deploy-router-arp-cach-tuning-for-large-scale-clusters

but I believe they should be the default in the atomic-openshift-master and atomic-openshift-node tuned profiles.
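To check whether a node is at risk, the three neighbour-table thresholds can be read directly from procfs. The following is a minimal Python sketch, assuming a Linux host with the standard /proc/sys layout; the helper names read_gc_thresholds and exceeds_hard_limit are hypothetical and are not part of OpenShift or tuned. The recorded defaults (128/512/1024) are the upstream kernel defaults for the IPv4 neighbour table.

```python
from pathlib import Path

# Upstream kernel defaults for the IPv4 neighbour (ARP) table GC thresholds.
KERNEL_DEFAULTS = {"gc_thresh1": 128, "gc_thresh2": 512, "gc_thresh3": 1024}


def read_gc_thresholds(proc_sys: Path = Path("/proc/sys")) -> dict[str, int]:
    """Read net.ipv4.neigh.default.gc_thresh{1,2,3} from procfs.

    proc_sys is a parameter (hypothetical helper) so the function can be
    pointed at a copy of the tree for testing.
    """
    base = proc_sys / "net" / "ipv4" / "neigh" / "default"
    return {name: int((base / name).read_text().strip()) for name in KERNEL_DEFAULTS}


def exceeds_hard_limit(expected_entries: int, thresholds: dict[str, int]) -> bool:
    """True if the expected neighbour entries exceed gc_thresh3, the hard maximum."""
    return expected_entries > thresholds["gc_thresh3"]


if __name__ == "__main__":
    current = read_gc_thresholds()
    print(current, exceeds_hard_limit(1100, current))
```

Since problems were observed somewhat below the hard limit, a False result from exceeds_hard_limit does not by itself guarantee that the node is safe.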

Version-Release number of selected component (if applicable):
All


How reproducible:
Always


Steps to Reproduce:
1. Create an OCP environment with around 1024 routes (I started noticing problems at around 900 routes already).
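While reproducing, it can help to watch how full the ARP table is on the node running the router pod. The following is a rough Python sketch, assuming the usual /proc/net/arp format (one header line, then one IPv4 entry per line) as seen from the host's network namespace; count_arp_entries and usage_ratio are hypothetical helper names.

```python
from pathlib import Path


def count_arp_entries(arp_table: Path = Path("/proc/net/arp")) -> int:
    """Count IPv4 neighbour entries in /proc/net/arp, skipping the header line."""
    lines = arp_table.read_text().splitlines()
    return sum(1 for line in lines[1:] if line.strip())


def usage_ratio(entries: int, gc_thresh3: int = 1024) -> float:
    """Fraction of the hard neighbour-table limit (gc_thresh3) currently in use."""
    return entries / gc_thresh3


if __name__ == "__main__":
    n = count_arp_entries()
    print(f"{n} ARP entries, {usage_ratio(n):.0%} of the default gc_thresh3")
```

At the roughly 900 routes mentioned above, the ratio against the default limit is already close to 0.9.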

Actual results:
1) Kernel messages:
[ 1738.811139] net_ratelimit: 1045 callbacks suppressed
[ 1743.823136] net_ratelimit: 293 callbacks suppressed

2) The oc client, and networking in general, stop working properly.

Expected results:
None of the issues in "Actual results".

Additional info:
http://post-office.corp.redhat.com/archives/atomic-networking/2016-November/msg00082.html

Comment 6 openshift-github-bot 2017-02-22 16:52:12 UTC

Commit pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/59be5f894be526396d8b160adccc4481f489f765
Change default arp cache size on nodes

In OCP clusters with large numbers of routes (greater than the value of
net.ipv4.neigh.default.gc_thresh3, which is 1024 by default) the ARP
cache is not large enough to accommodate for all the entries needed by
the nodes running the router pods.

This change increases the cache size.

bug 1425388
https://bugzilla.redhat.com/show_bug.cgi?id=1425388

Signed-off-by: Phil Cameron <pcameron>
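The change amounts to adding sysctl settings to the tuned profiles. The following Python sketch renders such a [sysctl] section. The threshold values are an assumption chosen to illustrate the large-cluster tuning described in the linked documentation, not a copy of what the commit shipped, and sysctl_section is a hypothetical helper.

```python
# Illustrative values (assumption): check the linked docs / commit for the real ones.
PROPOSED = {
    "net.ipv4.neigh.default.gc_thresh1": 8192,
    "net.ipv4.neigh.default.gc_thresh2": 32768,
    "net.ipv4.neigh.default.gc_thresh3": 65536,
}


def sysctl_section(values: dict[str, int]) -> str:
    """Render a tuned.conf-style [sysctl] section for the three ARP thresholds.

    Raises ValueError unless gc_thresh1 <= gc_thresh2 <= gc_thresh3.
    """
    t1, t2, t3 = (values[f"net.ipv4.neigh.default.gc_thresh{i}"] for i in (1, 2, 3))
    if not t1 <= t2 <= t3:
        raise ValueError("expected gc_thresh1 <= gc_thresh2 <= gc_thresh3")
    lines = ["[sysctl]"] + [f"{key}={value}" for key, value in sorted(values.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(sysctl_section(PROPOSED), end="")
```

The ordering check follows the kernel's semantics: gc_thresh1 is the level below which the garbage collector does not run, gc_thresh2 is the soft maximum, and gc_thresh3 is the hard maximum.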

Comment 7 openshift-github-bot 2017-02-23 13:08:57 UTC

Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/ba842078f3bba0282d62a2c9db70ca4d9339e733
Change default arp cache size on nodes

In OCP clusters with large numbers of routes (greater than the value of
net.ipv4.neigh.default.gc_thresh3, which is 1024 by default) the ARP
cache is not large enough to accommodate for all the entries needed by
the nodes running the router pods.

This change increases the cache size.

bug 1425388
https://bugzilla.redhat.com/show_bug.cgi?id=1425388

Signed-off-by: Phil Cameron <pcameron>

Comment 9 Troy Dawson 2017-04-11 21:00:37 UTC

This has been merged into OCP and is included in OCP v3.6.27 and newer.

Comment 13 errata-xmlrpc 2017-08-10 05:18:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
