Description of problem:
VMs are not able to reach the metadata server because the l3 agent is throwing these errors:

ERROR neutron.agent.l3.dvr_local_router [-] DVR: Failed updating arp entry
ERROR neutron.agent.l3.dvr_local_router Traceback (most recent call last):
ERROR neutron.agent.l3.dvr_local_router   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr_local_router.py", line 253, in _update_arp_entry
ERROR neutron.agent.l3.dvr_local_router     device.neigh.add(ip, mac)
...(Python errors continued)
Stderr: RTNETLINK answers: No buffer space available

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform 10

How reproducible:
Intermittently

Steps to Reproduce:
1. Create 5-6 VMs through heat
2. Half of them work and the others fail
3. L3 agent logs throw "RTNETLINK answers: No buffer space available"

Actual results:
Instances fail to connect to the metadata server

Expected results:
Instances spawn and apply metadata

Additional info:
Can you please attach the l3-agent.log file and/or an sosreport so this can be diagnosed further? The "No buffer space available" error suggests the neighbour table on the system is full, which shouldn't happen normally. One place to look for failures is /var/log/messages - it could show that the table is full. Also, looking at the net.ipv4.neigh.default.gc_thresh* sysctl settings would be useful, since they show how many entries the table is configured to hold.
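Something like the following, run on the affected node, should capture the relevant data (the exact kernel log wording varies between versions, so the grep pattern here is only approximate):

# grep -i 'neighbour table overflow' /var/log/messages
# ip neigh show | wc -l
# sysctl -a 2>/dev/null | grep gc_thresh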
Sai, thanks for the info. From the l3-agent log it looks like the neighbour table has overflowed, since adding an ARP entry is failing. Can the customer check /var/log/messages for any related warnings? Also, can they provide the output of 'sysctl -a | grep gc_thresh'? It may simply be that the table needs to be sized up for their workload.
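For comparison, the stock kernel defaults are typically:

net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024

If the output still shows values in that range on a deployment with many ports and routers, the table filling up would not be surprising.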
Just checking if additional info on my previous comment is available.
Those numbers for gc_thresh* are very low if this is a large deployment. I actually found a bug and patch that increased these values done less than a year ago, I'll link it in the bug. That said, on the affected node they could try increasing these values and see if the problem persists:

# sysctl -w net.ipv4.neigh.default.gc_thresh1=1024
# sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
# sysctl -w net.ipv4.neigh.default.gc_thresh3=4096

Those are the new default values. If that works then someone could look at backporting the changes.
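Note that sysctl -w only changes the running kernel; to carry the values across a reboot while testing, they could also go into a sysctl drop-in, for example (the file name here is just an arbitrary choice):

# cat > /etc/sysctl.d/99-neigh-gc-thresh.conf <<'EOF'
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096
EOF
# sysctl -p /etc/sysctl.d/99-neigh-gc-thresh.conf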
The upstream backport of the OSPd/TripleO fix for this on the newton branch is https://review.openstack.org/#/c/532612/
I don't think you have to boot 5-6 instances; verifying that the gc_thresh settings are correct should be enough, since that shows the size of the neighbour table has been increased.
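For example, something like this on a compute or network node should do (entries created inside the qrouter- namespaces should count against the same kernel neighbour table, and can be inspected with ip netns exec if needed):

# sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3
# ip neigh show | wc -l

If the gc_thresh values match the increased defaults from the fix, the table has been resized as expected.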
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2101