Bug 1524808 - VMs not able to reach metadata server because of l3 agent is throwing errors
Summary: VMs not able to reach metadata server because of l3 agent is throwing errors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: async
Target Release: 10.0 (Newton)
Assignee: Brian Haley
QA Contact: Federico Ressi
URL:
Whiteboard:
Depends On:
Blocks: 1558197 1763193
Reported: 2017-12-12 06:43 UTC by PURANDHAR SAIRAM MANNIDI
Modified: 2022-08-16 11:34 UTC

Fixed In Version: openstack-tripleo-heat-templates-5.3.10-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1558197
Environment:
Last Closed: 2018-06-27 23:30:46 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1690087 0 None None None 2018-01-10 19:11:28 UTC
OpenStack gerrit 463937 0 None MERGED Optimize kernel neighbour table for large scale environments 2020-09-22 14:41:34 UTC
Red Hat Issue Tracker OSP-4797 0 None None None 2022-08-16 11:34:12 UTC
Red Hat Product Errata RHBA-2018:2101 0 None None None 2018-06-27 23:32:36 UTC

Description PURANDHAR SAIRAM MANNIDI 2017-12-12 06:43:03 UTC
Description of problem:
VMs are not able to reach the metadata server because the L3 agent is throwing these errors:

ERROR neutron.agent.l3-dvr_local_router [-] DVR: Failed updating arp entry
ERROR neutron.agent.l3-dvr_local_router Traceback (most recent call last):
ERROR neutron.agent.l3-dvr_local_router File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr_local_router.py", line 253, in _update_arp_entry
ERROR neutron.agent.l3-dvr_local_router device.neigh.add(ip,mac)

...(Python errors continued)

Stderr: RTNETLINK answers: No buffer space available

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform 10

How reproducible:
Intermittently

Steps to Reproduce:
1. Create 5-6 VMs through heat
2. Half of them work and others fail
3. L3 agent logs throw RTNETLINK answers: No buffer space available


Actual results:
Instances failed to connect to metadata server

Expected results:
Instances spawning and applying metadata

Additional info:

Comment 1 Brian Haley 2017-12-14 17:32:49 UTC
Can you please attach the l3-agent.log file and/or an sosreport so this can be further diagnosed?

From the "No buffer space available" error, it looks like the neighbour table on the system is full, which shouldn't normally happen.  One place to look for failures is /var/log/messages, which could show that the table is full.  Also, looking at the net.ipv4.neigh.default.gc_thresh* sysctl settings would be useful, since they show how many entries the table is configured for.
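
A minimal sketch of the two checks suggested above, assuming a Linux host where the thresholds are exposed under /proc/sys/net/ipv4/neigh/default/ and the kernel log is written to /var/log/messages. The function names and their optional path arguments are made up for illustration and are not part of any tool mentioned in this bug. The log match is kept loose because the exact wording of the kernel's overflow warning varies between kernel versions.

```shell
# Print the three neighbour-table thresholds.
# The optional directory argument is a testing convenience (hypothetical);
# by default the values are read from procfs.
show_gc_thresh() {
    dir="${1:-/proc/sys/net/ipv4/neigh/default}"
    for n in 1 2 3; do
        printf 'gc_thresh%s=%s\n' "$n" "$(cat "$dir/gc_thresh$n")"
    done
}

# Count log lines that mention a neighbour table overflow.
# The match is deliberately loose because the exact kernel wording
# differs between kernel versions.
count_overflow_messages() {
    grep -ci 'table overflow' "${1:-/var/log/messages}"
}

# Example (as root, so /var/log/messages is readable):
#   show_gc_thresh
#   count_overflow_messages
```

A non-zero count together with low thresholds would be consistent with the table filling up, but it is not proof on its own.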

Comment 4 Brian Haley 2017-12-18 17:36:30 UTC
Sai,

Thanks for the info.  From the l3-agent log it looks like the neighbour table has overflowed, since adding an ARP entry is failing.  Can the customer check /var/log/messages for any related warnings?  Also, can they provide the output of 'sysctl -a | grep gc_thresh'?  It may just be that the table needs to be larger for their workload.

Comment 5 Brian Haley 2018-01-03 15:11:21 UTC
Just checking if additional info on my previous comment is available.

Comment 7 Brian Haley 2018-01-10 19:10:35 UTC
Those numbers for gc_thresh* are very low if this is a large deployment.  I found a bug and a patch from less than a year ago that increased these values; I'll link them in this bug.

That said, on the affected node they could try increasing these values and see if the problem persists:

# sysctl -w net.ipv4.neigh.default.gc_thresh1=1024
# sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
# sysctl -w net.ipv4.neigh.default.gc_thresh3=4096

Those are the new default values.

If that works then someone could look at backporting the changes.
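
Note that 'sysctl -w' only changes the running kernel, so the values above are lost on reboot. As a rough sketch of keeping them in place while testing, assuming a host that reads /etc/sysctl.d/ at boot, the settings could be written to a drop-in file. The function name and the file name 98-neigh-gc-thresh.conf are made up for illustration; the fix tracked in this bug is delivered through openstack-tripleo-heat-templates, not through a hand-written file.

```shell
# Write the three thresholds quoted above to a sysctl drop-in file.
# The target path is an argument so the function can be tried on a scratch file.
write_gc_thresh_conf() {
    conf="$1"
    cat > "$conf" <<'EOF'
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096
EOF
}

# Example (as root):
#   write_gc_thresh_conf /etc/sysctl.d/98-neigh-gc-thresh.conf
#   sysctl --system    # reload all sysctl configuration files
```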

Comment 8 Brent Eagles 2018-01-10 19:36:50 UTC
The upstream backport of the OSPd/TripleO fix for this to the newton branch is https://review.openstack.org/#/c/532612/

Comment 12 Brian Haley 2018-06-14 13:57:16 UTC
I don't think you have to boot 5-6 instances; verifying that the gc_thresh settings are correct should be enough, since it shows that the size of the neighbour table has been increased.
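
A minimal sketch of that verification, assuming the thresholds are read from /proc/sys/net/ipv4/neigh/default/ and that the expected values are the 1024/2048/4096 quoted in comment 7. The function name and its optional directory argument are hypothetical, and treating "at least" the expected value as a pass is a choice made here, not something stated in this bug.

```shell
# Compare the running thresholds with the values quoted in comment 7.
# Returns 0 when every threshold is at least the expected value, 1 otherwise.
check_gc_thresh() {
    dir="${1:-/proc/sys/net/ipv4/neigh/default}"
    rc=0
    for pair in 1:1024 2:2048 3:4096; do
        n="${pair%%:*}"        # threshold index: 1, 2 or 3
        want="${pair##*:}"     # expected minimum value
        have="$(cat "$dir/gc_thresh$n")"
        if [ "$have" -lt "$want" ]; then
            echo "gc_thresh$n is $have, expected at least $want"
            rc=1
        fi
    done
    return "$rc"
}

# Example:
#   check_gc_thresh && echo "neighbour table thresholds look correct"
```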

Comment 14 errata-xmlrpc 2018-06-27 23:30:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2101

