Bug 1524808

Summary: VMs not able to reach metadata server because the l3 agent is throwing errors
Product: Red Hat OpenStack
Reporter: PURANDHAR SAIRAM MANNIDI <pmannidi>
Component: openstack-tripleo-heat-templates
Assignee: Brian Haley <bhaley>
Status: CLOSED ERRATA
QA Contact: Federico Ressi <fressi>
Severity: high
Priority: high
Version: 10.0 (Newton)
CC: amuller, beagles, bhaley, chrisw, jlibosva, lbezdick, mburns, nyechiel, pmannidi, ragiman, rhel-osp-director-maint, srevivo
Target Milestone: async
Keywords: Triaged, ZStream
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-5.3.10-4.el7ost
Clones: 1558197 (view as bug list)
Last Closed: 2018-06-27 23:30:46 UTC
Type: Bug
Bug Blocks: 1558197, 1763193    

Description PURANDHAR SAIRAM MANNIDI 2017-12-12 06:43:03 UTC
Description of problem:
VMs are not able to reach the metadata server because the l3 agent is throwing these errors:

ERROR neutron.agent.l3.dvr_local_router [-] DVR: Failed updating arp entry
ERROR neutron.agent.l3.dvr_local_router Traceback (most recent call last):
ERROR neutron.agent.l3.dvr_local_router   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr_local_router.py", line 253, in _update_arp_entry
ERROR neutron.agent.l3.dvr_local_router     device.neigh.add(ip, mac)

...(Python errors continued)

Stderr: RTNETLINK answers: No buffer space available

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform 10

How reproducible:
Intermittently

Steps to Reproduce:
1. Create 5-6 VMs through Heat
2. Half of them work and the others fail
3. The l3-agent log shows "RTNETLINK answers: No buffer space available"


Actual results:
Instances fail to connect to the metadata server

Expected results:
Instances spawn and apply metadata successfully

Additional info:
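For reference, the failing device.neigh.add(ip, mac) call above is roughly equivalent to running ip(8) inside the router namespace, e.g. (the namespace ID, device name, and addresses below are made-up examples, not taken from this deployment):

# ip netns exec qrouter-<router-uuid> ip -4 neigh replace 192.0.2.10 lladdr fa:16:3e:12:34:56 dev qr-example0

Once the kernel neighbour table is over its limit, such an add fails with "RTNETLINK answers: No buffer space available".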

Comment 1 Brian Haley 2017-12-14 17:32:49 UTC
Can you please attach the l3-agent.log file and/or an sosreport so this can be further diagnosed?

From the "No buffer space available" it looks like the neighbour table on the system is full, which shouldn't happen normally.  One place to look for failures is /var/log/messages - it could show the table is full.  Also, looking at the net.ipv4.gc_thresh* sysctl settings would be useful, since it would tell you how many entries the table is configured for.

Comment 4 Brian Haley 2017-12-18 17:36:30 UTC
Sai,

Thanks for the info.  From the l3-agent log it looks like the neighbour table has overflowed since adding an ARP entry is failing.  Can the customer check in /var/log/messages for any related warnings?  Also, can they provide the output of 'sysctl -a | grep gc_thresh' ?  It could just be the need to increase the size of the table for their workload.

Comment 5 Brian Haley 2018-01-03 15:11:21 UTC
Just checking if additional info on my previous comment is available.

Comment 7 Brian Haley 2018-01-10 19:10:35 UTC
Those numbers for gc_thresh* are very low if this is a large deployment.  I actually found a bug and patch from less than a year ago that increased these values; I'll link it in this bug.

That said, on the affected node they could try increasing these values and see if the problem persists:

# sysctl -w net.ipv4.neigh.default.gc_thresh1=1024
# sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
# sysctl -w net.ipv4.neigh.default.gc_thresh3=4096

Those are the new default values.

If that works then someone could look at backporting the changes.
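If raising the values helps, they can be made persistent across reboots with a sysctl drop-in file until a fixed package is installed (the file name here is just an example):

# cat > /etc/sysctl.d/98-neigh-gc-thresh.conf <<EOF
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096
EOF
# sysctl -p /etc/sysctl.d/98-neigh-gc-thresh.conf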

Comment 8 Brent Eagles 2018-01-10 19:36:50 UTC
The upstream backport of the OSPd/TripleO fix for this on the Newton branch is https://review.openstack.org/#/c/532612/
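For anyone who needs the change before it is packaged, the same values could be pushed to all overcloud nodes with an extra environment file, assuming the ExtraSysctlSettings parameter is available in the deployed tripleo-heat-templates version (file name and path here are examples):

# cat > ~/neigh-gc-thresh.yaml <<EOF
parameter_defaults:
  ExtraSysctlSettings:
    net.ipv4.neigh.default.gc_thresh1:
      value: 1024
    net.ipv4.neigh.default.gc_thresh2:
      value: 2048
    net.ipv4.neigh.default.gc_thresh3:
      value: 4096
EOF

and then include it in the deploy command, e.g. openstack overcloud deploy --templates ... -e ~/neigh-gc-thresh.yaml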

Comment 12 Brian Haley 2018-06-14 13:57:16 UTC
I don't think you have to boot 5-6 instances; verifying that the gc_thresh settings are correct should be enough, as it shows the size of the neighbour table has been increased.
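For example, something like this on each node should show the increased values (1024, 2048 and 4096 with the fix applied):

# sysctl -a | grep gc_thresh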

Comment 14 errata-xmlrpc 2018-06-27 23:30:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2101