Description of problem:
In OCP clusters with a large number of routes (more than the value of net.ipv4.neigh.default.gc_thresh3, which is 1024 by default), the ARP cache is not large enough to hold all the entries needed by the nodes running the router pods. This is documented here:
https://docs.openshift.com/container-platform/3.4/install_config/router/default_haproxy_router.html#deploy-router-arp-cach-tuning-for-large-scale-clusters
However, I believe the larger cache size should be the default in the atomic-openshift-master and atomic-openshift-node tuned profiles.

Version-Release number of selected component (if applicable):
All

How reproducible:
Always

Steps to Reproduce:
1. Create an OCP environment with around 1024 routes (I personally started noticing problems at around 900 routes).

Actual results:
1) Kernel messages:
[ 1738.811139] net_ratelimit: 1045 callbacks suppressed
[ 1743.823136] net_ratelimit: 293 callbacks suppressed
2) The oc client and networking in general stop working properly.

Expected results:
None of the issues in "Actual results".

Additional info:
http://post-office.corp.redhat.com/archives/atomic-networking/2016-November/msg00082.html
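To see how close a node is to the limit described above, something like the following could be run on a router node. This is only a rough sketch and not part of the original report: it assumes the standard Linux procfs locations (/proc/net/arp and /proc/sys/net/ipv4/neigh/default/gc_thresh3), and the function names are hypothetical helpers made up for illustration. /proc/net/arp lists only the IPv4 neighbor entries visible in the current network namespace, so the count is an approximation.

```python
from pathlib import Path

# Assumed standard procfs locations on a Linux node.
ARP_TABLE_PATH = Path("/proc/net/arp")
GC_THRESH3_PATH = Path("/proc/sys/net/ipv4/neigh/default/gc_thresh3")


def count_arp_entries(arp_table_text: str) -> int:
    """Count entries in /proc/net/arp-style text (the first line is a header)."""
    lines = [line for line in arp_table_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def arp_cache_headroom(entries: int, gc_thresh3: int) -> int:
    """Return how many more entries fit before the hard limit is reached."""
    return gc_thresh3 - entries


if __name__ == "__main__":
    if ARP_TABLE_PATH.exists() and GC_THRESH3_PATH.exists():
        entries = count_arp_entries(ARP_TABLE_PATH.read_text())
        limit = int(GC_THRESH3_PATH.read_text().strip())
        print(f"ARP entries: {entries}, gc_thresh3: {limit}, "
              f"headroom: {arp_cache_headroom(entries, limit)}")
    else:
        print("procfs ARP files not found; this sketch expects a Linux host")
```

A small or negative headroom on a node running router pods would be consistent with the symptoms in "Actual results", though it does not by itself prove the ARP cache is the cause.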
Commit pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/59be5f894be526396d8b160adccc4481f489f765
Change default arp cache size on nodes

In OCP clusters with large numbers of routes (greater than the value of net.ipv4.neigh.default.gc_thresh3, which is 1024 by default) the ARP cache is not large enough to accommodate for all the entries needed by the nodes running the router pods.

This change increases the cache size.

bug 1425388
https://bugzilla.redhat.com/show_bug.cgi?id=1425388

Signed-off-by: Phil Cameron <pcameron>
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/ba842078f3bba0282d62a2c9db70ca4d9339e733
Change default arp cache size on nodes

In OCP clusters with large numbers of routes (greater than the value of net.ipv4.neigh.default.gc_thresh3, which is 1024 by default) the ARP cache is not large enough to accommodate for all the entries needed by the nodes running the router pods.

This change increases the cache size.

bug 1425388
https://bugzilla.redhat.com/show_bug.cgi?id=1425388

Signed-off-by: Phil Cameron <pcameron>
This has been merged into OCP and is in v3.6.27 or newer.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716