Bug 1291111 - DHCP Stops working in RHOS 7 tenants
Summary: DHCP Stops working in RHOS 7 tenants
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 7.0 (Kilo)
Assignee: Miguel Angel Ajo
QA Contact: Ofer Blaut
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-12-14 01:58 UTC by Graeme Gillies
Modified: 2019-09-12 09:33 UTC
CC List: 17 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-05 23:05:35 UTC
Target Upstream Version:
Embargoed:


Attachments
Log and packet capture for rabbitmq + dhcp_agent + rabbitmq server until trace happens (10.91 MB, application/x-bzip)
2015-12-14 16:33 UTC, Miguel Angel Ajo
dhcp-agent_errors.log (19.16 KB, text/plain)
2015-12-15 09:00 UTC, Nir Magnezi
dhcp-agent to Rabbitmq connection trace from the previous .cap file (1.75 MB, application/pdf)
2015-12-15 15:05 UTC, Miguel Angel Ajo

Description Graeme Gillies 2015-12-14 01:58:51 UTC
We have a stable RHOS 7.0.2 environment deployed via Director 7.1, with 3 HA controllers, roughly 30 compute nodes, and 3 Ceph storage nodes.

Networking in the environment works for a period of time, after which DHCP in certain private networks stops working altogether. Instances in those networks become unreachable, and newly launched instances fail to get an IP and time out connecting to the metadata service.

There doesn't seem to be any pattern as to how long it takes for this to happen, why it happens, or who is affected. Deleting and recreating the affected network seems to fix the problem.

The problem is difficult to reproduce, but we currently have an environment that is experiencing it, and we are able to work with you on a debugging session if needed.

Comment 3 Miguel Angel Ajo 2015-12-14 16:33:39 UTC
Created attachment 1105610 [details]
Log and packet capture for rabbitmq + dhcp_agent + rabbitmq server until trace happens

Comment 4 Graeme Gillies 2015-12-15 01:03:52 UTC
Hi,

So a full service restart, including rabbitmq, galera, mongo, memcached, redis, all OpenStack services, etc. did *not* fix the issue.

However, doing a reboot of the controller nodes did reboot the issue.

I am wondering if this is a problem with the underlying operating system components, e.g.:

A problem with openvswitch
A problem with the network bonding interfaces used for tenant traffic
A problem with sysctl settings that are applied to the controller hosts (buffers etc filling up?)

The only custom sysctl settings we have applied to the environment are the following

net.core.netdev_max_backlog = 10000
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-ip6tables = 0

I hope these aren't the cause of the issue

Regards,

Graeme

Comment 5 Graeme Gillies 2015-12-15 01:07:07 UTC
(In reply to Graeme Gillies from comment #4)
> Hi,
> 
> So a full service restart, including rabbitmq, galera, mongo, memcached,
> redis, all Openstack services, etc. did *not* fix the issue.
> 
> However, doing a reboot of the controller nodes did reboot the issue.

Meant to say system reboot of controllers did fix the issue

> 
> I am wondering if this is a problem with the underlying operating system
> components e.g.
> 
> A problem with openvswitch
> A problem with the network bonding interfaces used for tenant traffic
> A problem with sysctl settings that are applied to the controller hosts
> (buffers etc filling up?)
> 
> The only custom sysctl settings we have applied to the environment are the
> following
> 
> net.core.netdev_max_backlog = 10000
> net.bridge.bridge-nf-call-arptables = 0
> net.bridge.bridge-nf-call-iptables = 0
> net.bridge.bridge-nf-call-ip6tables = 0

Correction: the bridge sysctl settings are applied to compute nodes only, and do not seem to be the cause, as we have not rebooted the computes and they are working fine. The only change on the controllers is the netdev_max_backlog setting.

> 
> I hope these aren't the cause of the issue
> 
> Regards,
> 
> Graeme

Comment 6 Nir Magnezi 2015-12-15 09:00:15 UTC
Created attachment 1105928 [details]
dhcp-agent_errors.log

When this issue occurs, I noticed that for some reason both the neutron-dhcp and neutron-metadata agents failed to reach the rabbitmq server.

Snippet from the metadata agent log on the ctrl-1 node:
2015-12-10 15:03:09.383 141211 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.16.104.15:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.

Snippet from the dhcp agent log on the ctrl-1 node:
error: [Errno 104] Connection reset by peer
2015-12-14 03:53:04.522 10293 ERROR oslo_messaging._drivers.impl_rabbit [-] Failed to consume message from queue: 
2015-12-14 03:55:02.911 41736 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying. (See the attached 'dhcp-agent_errors.log', which consists of multiple error snippets taken from the original log, across multiple dates.)

Also, on all 3 rabbitmq servers we noticed that the dhcp_agent queues occupied a large portion of memory (which kept growing):
4119968	dhcp_agent.rhqe-bare-ctrl-1.localdomain
4119968	dhcp_agent.rhqe-bare-ctrl-2.localdomain
4120048	dhcp_agent.rhqe-bare-ctrl-0.localdomain

* Side note: to extract this (thanks petro): rabbitmqctl list_queues memory name | sort -n

This information is pasted here for (possible) later reference. It is not actionable at the moment, since we have not yet reached the root cause of this, as Graeme stated in comment #5.
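
As a side reference, here is a minimal sketch of how the figures above could be collected periodically, assuming rabbitmqctl is available on the controller and Python 3 is used (the 60-second polling interval and the dhcp_agent name filter are arbitrary choices, not anything from this bug):

import subprocess
import time

def dhcp_queue_memory():
    # Runs the same command as the side note above and keeps only the
    # dhcp_agent queues, returning (memory_bytes, queue_name) pairs.
    out = subprocess.check_output(
        ["rabbitmqctl", "list_queues", "memory", "name"], text=True)
    rows = []
    for line in out.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit() and "dhcp_agent" in parts[1]:
            rows.append((int(parts[0]), parts[1]))
    return sorted(rows, reverse=True)

while True:
    # Print the queues largest-first; steadily growing numbers here would
    # match the symptom described in this comment.
    for mem, name in dhcp_queue_memory():
        print(mem, name)
    time.sleep(60)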

Comment 7 Miguel Angel Ajo 2015-12-15 14:57:44 UTC
@nmagnezi, please note: that's not a failure to reach the server, it's rabbitmq actively closing the TCP connection (if you look at the rabbitmq log messages you will find that out).

Comment 8 Miguel Angel Ajo 2015-12-15 15:04:54 UTC
@fpercoco, could you look at my attachment?

It contains a tcpdump capture of the conversation with rabbitmq during:

1) start of dhcp-agent
2) wait for dhcp-agent to sync on networks, etc...
3) dhcp-agent starts trying to consume agent notifications, but rabbitmq has closed the connection.

It seems that the dhcp-agent (or oslo.messaging) does not gracefully recover from that; the messages which arrive via fanout are not consumed anymore.

(I will add another PDF; see page 698 and beyond, which is a filtered stream of the conversation between the dhcp-agent and the rabbitmq server.)

It seems to me that rabbitmq sends the data for a notification, but the agent host's TCP stack keeps responding that the receive window is full [TCP ZeroWindow]. I guess that means the application didn't consume the data from the TCP stack. TCP keepalives keep happening, still indicating ZeroWindow, and finally the connection is RST by rabbitmq (page 718). That's where you see the TRACE error on the dhcp-agent, and from there on, no more notifications (fanout method calls) get consumed.
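
To make the ZeroWindow observation more concrete, here is a toy, purely local Python sketch (not RabbitMQ-specific; the port, chunk size and 5-second timeout are illustrative stand-ins) of what happens when the consuming application stops reading from its socket: the receive window closes and the sender eventually gives up on the connection.

import socket
import threading
import time

def broker(listener):
    # Stand-in for rabbitmq: keeps pushing notifications to the agent.
    conn, _ = listener.accept()
    conn.settimeout(5.0)  # crude stand-in for the 5 s server-side timeout
    sent = 0
    try:
        while True:
            sent += conn.send(b"x" * 65536)
    except socket.timeout:
        # The peer stopped draining its socket (ZeroWindow); a real broker
        # with TCP_USER_TIMEOUT set would reset the connection here.
        print("broker: gave up after %d unconsumed bytes" % sent)
        conn.close()

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=broker, args=(listener,), daemon=True).start()

# Stand-in for the stuck dhcp-agent: connects but never calls recv(), so its
# TCP receive window fills up and eventually advertises ZeroWindow.
agent = socket.create_connection(listener.getsockname())
time.sleep(10)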

Comment 9 Miguel Angel Ajo 2015-12-15 15:05:58 UTC
Created attachment 1106064 [details]
dhcp-agent to Rabbitmq connection trace from the previous .cap file

Comment 10 John Eckersberg 2015-12-15 15:29:04 UTC
A bit of background.  We've got RabbitMQ configured to time out TCP sessions after 5 seconds.  This is on purpose, so if a host drops offline we quickly notice, tear down the channel, and any inflight messages get requeued to be consumed by another host.

Looking at the PDF, I see:

- Page 716, No 234268, the RabbitMQ host sends a packet at time 116.167559 with sequence 360627

- Page 717, No 243903, the client ACKs number 360627 at time 121.176581, which is 5.009 seconds RTT

It came really close to making it in time, but not quite, so the kernel sends the RST on the last page and the connection gets torn down.

So, for some reason that ACK was a bit slow coming back.  Either network congestion or the client host being overloaded or something similar.

Comment 11 Flavio Percoco 2015-12-15 17:18:17 UTC
John can provide better insights when it comes to rabbitmq and tcpdumps coming from it (as he already did).

I'll ping Petr as well to get his thoughts on this.

Comment 12 Assaf Muller 2015-12-15 17:58:32 UTC
(In reply to John Eckersberg from comment #10)
> A bit of background.  We've got RabbitMQ configured to time out TCP sessions
> after 5 seconds.  This is on purpose, so if a host drops offline we quickly
> notice, tear down the channel, and any inflight messages get requeued to be
> consumed by another host.

In this case if I understand correctly a message was not ACKed in time, so that channel got closed, and the agent never recovers. This doesn't seem like ideal behavior, especially considering that in this case there is no other host that should be consuming the message.

Comment 13 John Eckersberg 2015-12-15 18:08:08 UTC
(In reply to Assaf Muller from comment #12)
> In this case if I understand correctly a message was not ACKed in time, so
> that channel got closed, and the agent never recovers. This doesn't seem
> like ideal behavior, especially considering that in this case there is no
> other host that should be consuming the message.

Right.  There's also a bug somewhere in the client (maybe neutron, more likely oslo.messaging) preventing the agent from reconnecting like it should.  The server-side strategy of killing off slow/unresponsive clients only works when the clients reconnect once they are healthy again.
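
As a rough illustration of the expected client behavior described here (a schematic sketch only, not the actual neutron or oslo.messaging code; connect() and consume_one() are hypothetical placeholders supplied by the caller):

import time

def consume_forever(connect, consume_one, retry_delay=1.0):
    # Schematic consumer loop: if the broker resets the connection (for
    # example because TCP_USER_TIMEOUT expired on the server side), rebuild
    # the connection and resume consuming instead of silently leaving the
    # fanout queue unconsumed.
    conn = connect()
    while True:
        try:
            consume_one(conn)
        except OSError:  # includes ECONNRESET ("Connection reset by peer")
            time.sleep(retry_delay)
            conn = connect()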

Comment 17 Miguel Angel Ajo 2015-12-17 12:23:20 UTC
Even with tcp_keepalive_time=60 I get into:

2015-12-17 07:20:06.330 98126 WARNING neutron.openstack.common.loopingcall [req-d19b4fa4-06d2-4c87-ae36-09adef207c09 ] task <bound method DhcpAgentWithStateReport._report_state of <neutron.agent.dhcp.agent.DhcpAgentWithStateReport object at 0x2a2d8d0>> run outlasted interval by 459.15 sec
2015-12-17 07:20:06.351 98126 WARNING neutron.agent.dhcp.agent [req-62e91c60-2c5c-48e9-9020-4001c3c4e12f ] Network 87baaa6c-bf70-411e-b399-55d1a9637b49 has been deleted.
2015-12-17 07:20:06.353 98126 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2015-12-17 07:20:06.353 98126 TRACE root Traceback (most recent call last):
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func
2015-12-17 07:20:06.353 98126 TRACE root     return infunc(*args, **kwargs)
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread
2015-12-17 07:20:06.353 98126 TRACE root     self._dispatch(incoming)
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch
2015-12-17 07:20:06.353 98126 TRACE root     spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool)
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with
2015-12-17 07:20:06.353 98126 TRACE root     callback = ctxt.__enter__()
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2015-12-17 07:20:06.353 98126 TRACE root     return self.gen.next()
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__
2015-12-17 07:20:06.353 98126 TRACE root     incoming.acknowledge()
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge
2015-12-17 07:20:06.353 98126 TRACE root     self.acknowledge_callback()
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 168, in acknowledge
2015-12-17 07:20:06.353 98126 TRACE root     self._raw_message.ack()
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2015-12-17 07:20:06.353 98126 TRACE root     self.channel.basic_ack(self.delivery_tag)
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2015-12-17 07:20:06.353 98126 TRACE root     self._send_method((60, 80), args)
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2015-12-17 07:20:06.353 98126 TRACE root     self.channel_id, method_sig, args, content,
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2015-12-17 07:20:06.353 98126 TRACE root     write_frame(1, channel, payload)
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2015-12-17 07:20:06.353 98126 TRACE root     frame_type, channel, size, payload, 0xce,
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 376, in sendall
2015-12-17 07:20:06.353 98126 TRACE root     tail = self.send(data, flags)
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2015-12-17 07:20:06.353 98126 TRACE root     total_sent += fd.send(data[total_sent:], flags)
2015-12-17 07:20:06.353 98126 TRACE root error: [Errno 104] Connection reset by peer

Comment 18 John Eckersberg 2015-12-17 14:31:06 UTC
It's not the keepalive that is dumping you, it's the TCP_USER_TIMEOUT we've set on the listening socket.  If you want to increase it, look for "5000" in /etc/rabbitmq/rabbitmq-env.conf, that's the timeout value in milliseconds.  Try bumping it to 10000 or greater, but be aware that there are probably underlying network issues causing packet loss.

Comment 19 Miguel Angel Ajo 2015-12-17 17:17:03 UTC
Following our conversation via bluejeans, I've commented this out on ctrl-0 to disable the TCP_USER_TIMEOUT for now:

#RABBITMQ_SERVER_ERL_ARGS="+K true +A30 +P 1048576 -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<5000:64/native>>}] -kernel inet_default_listen_options [{raw,6,18,<<5000:64/native>>}]"

But it still needs to be commented out on the other hosts.
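
For reference, on Linux the raw option tuple in that line, {raw,6,18,<<5000:64/native>>}, corresponds to protocol level 6 (IPPROTO_TCP), option 18 (TCP_USER_TIMEOUT), and a value of 5000 ms. A minimal Python sketch of the equivalent socket call (illustrative only, not how RabbitMQ applies it internally):

import socket

# TCP_USER_TIMEOUT is exposed by Python 3.6+ on Linux; fall back to the raw
# option number 18 used in the rabbitmq-env.conf line above.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Drop the connection if transmitted data stays unacknowledged for more than
# 5000 ms; raising this (e.g. to 10000, as suggested in comment 18) gives
# slow clients more time to ACK before the kernel resets the connection.
sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, 5000)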

Comment 20 Flavio Percoco 2015-12-17 17:49:30 UTC
This bug looks very much like rhbz#1270353

There's a package built for that bug `python-oslo-messaging.1.8-3` that should fix this issue. Would it be possible to install this package in the environment and try to reproduce this issue?

Comment 21 Will Foster 2015-12-17 18:03:56 UTC
(In reply to Flavio Percoco from comment #20)
> This bug looks very much like rhbz#1270353
> 
> There's a package built for that bug `python-oslo-messaging.1.8-3` that
> should fix this issue. Would it be possible to install this package in the
> environment and try to reproduce this issue?

Yes, we can install anything required to test this out.
Currently this is what's on the controllers:

python-oslo-policy-0.3.1-1.el7ost.noarch
python-oslo-rootwrap-1.6.0-1.el7ost.noarch
python-oslo-i18n-1.5.0-3.el7ost.noarch
python-oslo-concurrency-1.8.2-1.el7ost.noarch
python-oslo-middleware-1.0.0-2.el7ost.noarch
python-oslo-context-0.2.0-5.el7ost.noarch
python-oslo-serialization-1.4.0-1.el7ost.noarch
python-oslo-vmware-0.11.1-1.el7ost.noarch
python-oslo-config-1.9.3-1.el7ost.noarch
python-oslo-messaging-1.8.3-1.el7ost.noarch
python-oslo-versionedobjects-0.1.1-1.el7ost.noarch
python-oslo-log-1.0.0-1.el7ost.noarch
python-oslo-utils-1.4.0-1.el7ost.noarch
python-oslo-db-1.7.1-1.el7ost.noarch

Comment 22 Will Foster 2015-12-17 18:07:57 UTC
(In reply to Will Foster from comment #21)
> (In reply to Flavio Percoco from comment #20)
> > This bug looks very much like rhbz#1270353
> > 
> > There's a package built for that bug `python-oslo-messaging.1.8-3` that
> > should fix this issue. Would it be possible to install this package in the
> > environment and try to reproduce this issue?
> 
> Yes, we can install anything required to test this out.
> Currently this is what's on the controllers:
> 
> python-oslo-policy-0.3.1-1.el7ost.noarch
> python-oslo-rootwrap-1.6.0-1.el7ost.noarch
> python-oslo-i18n-1.5.0-3.el7ost.noarch
> python-oslo-concurrency-1.8.2-1.el7ost.noarch
> python-oslo-middleware-1.0.0-2.el7ost.noarch
> python-oslo-context-0.2.0-5.el7ost.noarch
> python-oslo-serialization-1.4.0-1.el7ost.noarch
> python-oslo-vmware-0.11.1-1.el7ost.noarch
> python-oslo-config-1.9.3-1.el7ost.noarch
> python-oslo-messaging-1.8.3-1.el7ost.noarch
> python-oslo-versionedobjects-0.1.1-1.el7ost.noarch
> python-oslo-log-1.0.0-1.el7ost.noarch
> python-oslo-utils-1.4.0-1.el7ost.noarch
> python-oslo-db-1.7.1-1.el7ost.noarch

According to a comment in the bug Flavio mentioned, if it's BZ1270353 it's fixed in:

python-oslo-messaging-1.8.3-3.el7ost

https://bugzilla.redhat.com/show_bug.cgi?id=1270353

Comment 23 Vincent S. Cojot 2015-12-17 18:47:55 UTC
Hi Flavio,
If it's BZ1270353, would that explain the missing namespaces?
Thanks,

Comment 24 Kambiz Aghaiepour 2015-12-17 20:42:57 UTC
The recommended changes have been implemented and the python-oslo-messaging-1.8.3-3.el7ost rpm installed on the controllers.  The switch configs have been re-audited by network engineers and set to default MTU settings.  The recommended rabbit configuration option was commented out, as per Miguel in comment #19 above.

Things are currently running, and we ran some small Rally scenarios verifying that floating IPs are working and SSH access to launched guests works at 100% (for 100 VMs launched with a concurrency of 10). Current Jenkins workloads that previously triggered this issue have been initiated by Yair Fried.

Comment 25 Miguel Angel Ajo 2015-12-18 08:48:52 UTC
Hi, can we also get the package updated on the compute nodes? It's not happening now, but a nova-compute or neutron-openvswitch-agent could certainly get disconnected on a message ack eventually and stay off the messaging hook (leaving the compute node broken).

Comment 26 Graeme Gillies 2015-12-18 08:58:38 UTC
(In reply to Miguel Angel Ajo from comment #25)
> Hi, can we get the package also updated in the compute nodes?, It's not
> happening now, but it could certainly happen to a nova compute, or
> neutron-openvswitch-agent, to eventually get disconnected on a message ack,
> and stay off the messaging hook (compute node broken..).

Hi,

I can confirm the package update was applied to all nodes, and all services were restarted to make sure they picked up the new code.

Regards,

Graeme

Comment 27 Will Foster 2015-12-18 11:52:59 UTC
We've completed a 200-instance, 5x-concurrency Rally run that tests everything with 100% success:

+------------------------------------------------------------------------------------------------------+
|                                         Response Times (sec)                                         |
+------------------------------+--------+--------+--------+--------+--------+--------+---------+-------+
| action                       | min    | median | 90%ile | 95%ile | max    | avg    | success | count |
+------------------------------+--------+--------+--------+--------+--------+--------+---------+-------+
| neutronPlugin.create_router  | 0.918  | 1.26   | 1.587  | 1.681  | 2.229  | 1.279  | 100.0%  | 200   |
| neutron.list_networks        | 0.484  | 0.77   | 1.014  | 1.084  | 1.483  | 0.778  | 100.0%  | 200   |
| neutron.create_network       | 0.139  | 0.209  | 0.266  | 0.309  | 0.452  | 0.223  | 100.0%  | 200   |
| neutron.create_subnet        | 0.197  | 0.276  | 0.37   | 0.439  | 1.334  | 0.304  | 100.0%  | 200   |
| neutron.add_interface_router | 0.245  | 0.379  | 0.485  | 0.56   | 0.766  | 0.395  | 100.0%  | 200   |
| nova.boot_server             | 35.542 | 36.721 | 37.686 | 38.665 | 44.41  | 36.953 | 100.0%  | 200   |
| vm.attach_floating_ip        | 0.946  | 1.319  | 1.883  | 2.196  | 2.476  | 1.417  | 100.0%  | 200   |
| vm.wait_for_ping             | 2.605  | 4.852  | 6.095  | 6.097  | 7.222  | 5.029  | 100.0%  | 200   |
| vm.wait_for_ssh              | 2.238  | 8.284  | 9.63   | 9.951  | 11.577 | 6.746  | 100.0%  | 200   |
| vm.run_command_over_ssh      | 2.21   | 2.237  | 2.244  | 2.251  | 2.28   | 2.239  | 100.0%  | 200   |
| total                        | 48.324 | 55.301 | 57.717 | 58.404 | 63.535 | 54.639 | 100.0%  | 200   |
+------------------------------+--------+--------+--------+--------+--------+--------+---------+-------+

Things seem stable now after applying the updated oslo package. However, we will need some longer-running workloads to fully verify that the dhcp-agent issues are solved.

Comment 29 lpeer 2015-12-21 08:13:48 UTC
will - can we close this bug?

Comment 30 Will Foster 2015-12-21 15:25:01 UTC
(In reply to lpeer from comment #29)
> will - can we close this bug?

We are still seeing issues here with pacemaker and the dhcp-agent.

We saw some further instability with the dhcp-agent processes after throwing more load onto the environment over the weekend.

I don't believe this issue is fixed; however, the oslo-messaging erratum seemed to at least get things running again.

