Description of problem:
VMs are not able to reach the metadata server because the l3 agent is throwing these errors:

ERROR neutron.agent.l3.dvr_local_router [-] DVR: Failed updating arp entry
ERROR neutron.agent.l3.dvr_local_router Traceback (most recent call last):
ERROR neutron.agent.l3.dvr_local_router   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/dvr_local_router.py", line 253, in _update_arp_entry
ERROR neutron.agent.l3.dvr_local_router     device.neigh.add(ip, mac)
...(Python errors continued)
Stderr: RTNETLINK answers: No buffer space available

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform 10

How reproducible:
Intermittently

Steps to Reproduce:
1. Create 5-6 VMs through heat
2. Half of them work and the others fail
3. L3 agent logs throw "RTNETLINK answers: No buffer space available"

Actual results:
Instances fail to connect to the metadata server

Expected results:
Instances spawn and apply metadata

Additional info:
Can you please attach the l3-agent.log file and/or an sosreport so this can be diagnosed further? The "No buffer space available" error suggests the neighbour table on the system is full, which shouldn't happen normally. One place to look for failures is /var/log/messages - it could show that the table is full. Also, looking at the net.ipv4.neigh.default.gc_thresh* sysctl settings would be useful, since they show how many entries the table is configured to hold.
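Something like the following, run on the affected node, should capture the relevant data (the exact kernel log wording varies between versions, so the grep pattern here is only approximate):

# grep -i 'neighbour table overflow' /var/log/messages
# ip neigh show | wc -l
# sysctl -a 2>/dev/null | grep gc_thresh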
Sai, thanks for the info. From the l3-agent log it looks like the neighbour table has overflowed, since adding an ARP entry is failing. Can the customer check /var/log/messages for any related warnings? Also, can they provide the output of 'sysctl -a | grep gc_thresh'? It may simply be that the table needs to be sized up for their workload.
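For comparison, the stock kernel defaults are typically:

net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024

If the output still shows values in that range on a deployment with many ports and routers, the table filling up would not be surprising.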
Just checking if additional info on my previous comment is available.
Those numbers for gc_thresh* are very low if this is a large deployment. I actually found a bug and patch that increased these values done less than a year ago, I'll link it in the bug. That said, on the affected node they could try increasing these values and see if the problem persists:

# sysctl -w net.ipv4.neigh.default.gc_thresh1=1024
# sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
# sysctl -w net.ipv4.neigh.default.gc_thresh3=4096

Those are the new default values. If that works then someone could look at backporting the changes.
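Note that sysctl -w only changes the running kernel; to carry the values across a reboot while testing, they could also go into a sysctl drop-in, for example (the file name here is just an arbitrary choice):

# cat > /etc/sysctl.d/99-neigh-gc-thresh.conf <<'EOF'
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096
EOF
# sysctl -p /etc/sysctl.d/99-neigh-gc-thresh.conf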
The upstream backport of the OSPd/TripleO fix for this on the newton branch is https://review.openstack.org/#/c/532612/
I don't think you have to boot 5-6 instances; verifying that the gc_thresh settings are correct should be enough, since that shows the size of the neighbour table has been increased.
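For example, something like this on a compute or network node should do (entries created inside the qrouter- namespaces should count against the same kernel neighbour table, and can be inspected with ip netns exec if needed):

# sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3
# ip neigh show | wc -l

If the gc_thresh values match the increased defaults from the fix, the table has been resized as expected.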
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2101