Bug 1425388 - Increase the ARP cache size in the atomic-openshift-master and atomic-openshift-node tuned profiles
Summary: Increase the ARP cache size in the atomic-openshift-master and atomic-openshi...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Phil Cameron
QA Contact: Meng Bo
URL:
Whiteboard: aos-scalability-35
Depends On:
Blocks:
Reported: 2017-02-21 10:18 UTC by Jiří Mencák
Modified: 2017-08-16 19:51 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: New default values for the ARP cache (docs PR 3803).
Reason: The cluster fails with more than 1024 routes.
Result: The problem does not occur.
Clone Of:
Environment:
Last Closed: 2017-08-10 05:18:47 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Origin (Github) | 13034 | 0 | None | None | None | 2017-02-21 19:27:50 UTC
Red Hat Product Errata | RHEA-2017:1716 | 0 | normal | SHIPPED_LIVE | Red Hat OpenShift Container Platform 3.6 RPM Release Advisory | 2017-08-10 09:02:50 UTC

Description Jiří Mencák 2017-02-21 10:18:59 UTC
Description of problem:

In OCP clusters with large numbers of routes (greater than the value of net.ipv4.neigh.default.gc_thresh3, which is 1024 by default), the ARP cache is not large enough to accommodate all the entries needed by the nodes running the router pods. While this has been documented here:

https://docs.openshift.com/container-platform/3.4/install_config/router/default_haproxy_router.html#deploy-router-arp-cach-tuning-for-large-scale-clusters

I believe this should be the default in the atomic-openshift-master and atomic-openshift-node tuned profiles.
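
As an illustration of the condition described above, the following is a minimal Python sketch that compares the number of ARP entries on a node with net.ipv4.neigh.default.gc_thresh3. It is not part of the fix. The function names and the 0.85 warning ratio are assumptions made for this example; the /proc paths are the standard Linux locations for the ARP table and the neighbor-table sysctls.

```python
from pathlib import Path

# Standard Linux locations; everything else below is illustrative.
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")
ARP_TABLE = Path("/proc/net/arp")


def count_arp_entries(arp_text: str) -> int:
    """Count entries in /proc/net/arp-style text (the first line is a header)."""
    lines = [line for line in arp_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def cache_pressure(entries: int, gc_thresh3: int, warn_ratio: float = 0.85) -> str:
    """Classify ARP cache usage relative to the hard limit gc_thresh3.

    warn_ratio is a hypothetical threshold chosen for this sketch.
    """
    if entries >= gc_thresh3:
        return "full"
    if entries >= warn_ratio * gc_thresh3:
        return "near-limit"
    return "ok"


def read_thresholds(neigh_dir: Path = NEIGH_DIR) -> dict:
    """Read gc_thresh1..3 from the sysctl directory under /proc."""
    names = ("gc_thresh1", "gc_thresh2", "gc_thresh3")
    return {name: int((neigh_dir / name).read_text()) for name in names}


def report() -> str:
    """Summarize ARP cache usage on the local node (Linux only)."""
    thresholds = read_thresholds()
    entries = count_arp_entries(ARP_TABLE.read_text())
    state = cache_pressure(entries, thresholds["gc_thresh3"])
    return f"{entries} ARP entries, gc_thresh3={thresholds['gc_thresh3']}: {state}"
```

Calling report() on a node running router pods gives a one-line summary. With the default limit of 1024, around 900 entries would be classified as "near-limit" under the assumed ratio.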

Version-Release number of selected component (if applicable):
All


How reproducible:
Always


Steps to Reproduce:
1. Create an OCP environment with around 1024 routes (I started noticing problems at around 900 routes).

Actual results:
1) Kernel messages:
[ 1738.811139] net_ratelimit: 1045 callbacks suppressed
[ 1743.823136] net_ratelimit: 293 callbacks suppressed

2) The oc client and networking in general stop working properly.

Expected results:
None of the issues in "Actual results".

Additional info:
http://post-office.corp.redhat.com/archives/atomic-networking/2016-November/msg00082.html

Comment 6 openshift-github-bot 2017-02-22 16:52:12 UTC
Commit pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/59be5f894be526396d8b160adccc4481f489f765
Change default arp cache size on nodes

In OCP clusters with large numbers of routes (greater than the value of
net.ipv4.neigh.default.gc_thresh3, which is 1024 by default) the ARP
cache is not large enough to accommodate for all the entries needed by
the nodes running the router pods.

This change increases the cache size.

bug 1425388
https://bugzilla.redhat.com/show_bug.cgi?id=1425388

Signed-off-by: Phil Cameron <pcameron>

Comment 7 openshift-github-bot 2017-02-23 13:08:57 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/ba842078f3bba0282d62a2c9db70ca4d9339e733
Change default arp cache size on nodes

In OCP clusters with large numbers of routes (greater than the value of
net.ipv4.neigh.default.gc_thresh3, which is 1024 by default) the ARP
cache is not large enough to accommodate for all the entries needed by
the nodes running the router pods.

This change increases the cache size.

bug 1425388
https://bugzilla.redhat.com/show_bug.cgi?id=1425388

Signed-off-by: Phil Cameron <pcameron>

Comment 9 Troy Dawson 2017-04-11 21:00:37 UTC
This has been merged into ocp and is in OCP v3.6.27 or newer.

Comment 13 errata-xmlrpc 2017-08-10 05:18:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716

