We have a stable RHOS 7.0.2 environment deployed via Director 7.1, with 3 HA controllers, roughly 30 compute nodes, and 3 Ceph storage nodes. Networking in the environment works for a period of time, after which we notice that DHCP in certain private networks stops working altogether; any instances in those networks become unreachable, and newly launched instances fail to get an IP and time out connecting to the metadata service, etc. There doesn't seem to be any pattern as to how long it takes for this to happen, why it happens, or who is affected. Deleting and recreating the affected network seems to fix the problem. The problem is difficult to reproduce, but we currently do have an environment that is experiencing it, and we are able to work with you on a debugging session if needed.
Created attachment 1105610 [details]
Log and packet capture for rabbitmq + dhcp_agent + rabbitmq server until trace happens
Hi,

So a full service restart, including rabbitmq, galera, mongo, memcached, redis, all Openstack services, etc. did *not* fix the issue.

However, doing a reboot of the controller nodes did reboot the issue.

I am wondering if this is a problem with the underlying operating system components, e.g.:

A problem with openvswitch
A problem with the network bonding interfaces used for tenant traffic
A problem with sysctl settings that are applied to the controller hosts (buffers etc filling up?)

The only custom sysctl settings we have applied to the environment are the following:

net.core.netdev_max_backlog = 10000
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-ip6tables = 0

I hope these aren't the cause of the issue.

Regards,

Graeme
(In reply to Graeme Gillies from comment #4)
> Hi,
>
> So a full service restart, including rabbitmq, galera, mongo, memcached,
> redis, all Openstack services, etc. did *not* fix the issue.
>
> However, doing a reboot of the controller nodes did reboot the issue.

Meant to say system reboot of controllers did fix the issue.

> I am wondering if this is a problem with the underlying operating system
> components e.g.
>
> A problem with openvswitch
> A problem with the network bonding interfaces used for tenant traffic
> A problem with sysctl settings that are applied to the controller hosts
> (buffers etc filling up?)
>
> The only custom sysctl settings we have applied to the environment are the
> following
>
> net.core.netdev_max_backlog = 10000
> net.bridge.bridge-nf-call-arptables = 0
> net.bridge.bridge-nf-call-iptables = 0
> net.bridge.bridge-nf-call-ip6tables = 0

Correction: the bridge sysctl settings are applied to compute nodes only, and do not seem to be the cause, as we have not rebooted the computes and they are now working fine. The only change to the controllers is the netdev_max_backlog change.

> I hope these aren't the cause of the issue
>
> Regards,
>
> Graeme
Created attachment 1105928 [details]
dhcp-agent_errors.log

When this issue occurs, I noticed that for some reason both the neutron-dhcp and neutron-metadata agents failed to reach the rabbitmq server.

Snippet from the metadata agent on the ctrl-1 node:

2015-12-10 15:03:09.383 141211 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 172.16.104.15:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.

Snippet from the dhcp agent on the ctrl-1 node:

error: [Errno 104] Connection reset by peer
2015-12-14 03:53:04.522 10293 ERROR oslo_messaging._drivers.impl_rabbit [-] Failed to consume message from queue:
2015-12-14 03:55:02.911 41736 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.

(see the attached 'dhcp-agent_errors.log', which consists of multiple errors snipped from the original log, from multiple dates)

Also, on all 3 rabbitmq servers we noticed that the dhcp_agent queues occupied a large portion of memory (which kept growing):

4119968 dhcp_agent.rhqe-bare-ctrl-1.localdomain
4119968 dhcp_agent.rhqe-bare-ctrl-2.localdomain
4120048 dhcp_agent.rhqe-bare-ctrl-0.localdomain

* Side note: to extract this (thanks petro): rabbitmqctl list_queues memory name | sort -n

This information is pasted here for (possible) later reference. It is not actionable at the moment, since we have still not reached the root cause of this, as Graeme stated in comment #5.
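For later checks, the queue-memory extraction above can be scripted instead of eyeballed. A minimal sketch, assuming output in the `<memory> <name>` format produced by `rabbitmqctl list_queues memory name`; the sample text is illustrative, not from a live broker:

```python
# Sort "rabbitmqctl list_queues memory name" output by memory usage,
# largest first, so a growing dhcp_agent queue stands out.
sample = """\
4119968 dhcp_agent.rhqe-bare-ctrl-1.localdomain
4120048 dhcp_agent.rhqe-bare-ctrl-0.localdomain
4119968 dhcp_agent.rhqe-bare-ctrl-2.localdomain
"""

def queues_by_memory(text):
    """Return (memory_bytes, queue_name) pairs, largest first."""
    rows = []
    for line in text.strip().splitlines():
        mem, name = line.split(None, 1)
        rows.append((int(mem), name))
    return sorted(rows, reverse=True)

for mem, name in queues_by_memory(sample):
    print(mem, name)
```

Running this periodically and diffing the top entries would show whether a particular agent queue keeps growing between samples.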
@nmagnezi, please note, that's not a failure to reach rabbitmq; it's rabbitmq actively closing the TCP connection (if you look at the rabbitmq log messages you will find that out).
@fpercoco, could you look at my attachment? It contains a tcpdump capture of the conversation with rabbitmq during:

1) start of dhcp-agent
2) wait for dhcp-agent to sync on networks, etc.
3) dhcp-agent starts to try consuming agent notifications, but rabbitmq has closed the connection

It seems that dhcp-agent (/oslo.messaging) does not gracefully recover from that; the messages which arrive via fanout are not consumed anymore. (I will add another pdf; see page 698 and beyond, it's a filtered stream of the conversation between the dhcp-agent and the rabbitmq server.)

It seems to me that rabbitmq sends the data for a notification, the agent host's TCP stack keeps responding that the TCP window is full [TCP ZeroWindow] (I guess that means the application didn't consume the data from the TCP stack), TCP keepalives keep happening with the ZeroWindow indication, and finally the connection is RST by rabbitmq (page 718). That's where you see the TRACE error on the dhcp-agent, and from there on, no more notifications (fanout method calls) get consumed.
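The failure signature described above (repeated zero-window ACKs followed by an RST) can also be detected programmatically once the capture is exported. A rough sketch; the packet records below are invented for illustration, not taken from the attached .cap:

```python
# Flag a connection that advertises a zero receive window repeatedly
# before being reset -- the pattern seen in the attached capture.
# These records are made up; real ones would come from a capture export.
packets = [
    {"time": 110.0, "win": 14600, "flags": "ACK"},
    {"time": 116.2, "win": 0,     "flags": "ACK"},  # [TCP ZeroWindow]
    {"time": 118.4, "win": 0,     "flags": "ACK"},  # keepalive probe reply
    {"time": 120.9, "win": 0,     "flags": "ACK"},
    {"time": 121.2, "win": 0,     "flags": "RST"},  # broker gives up
]

def stalled_before_reset(pkts, min_zero_windows=2):
    """True if at least min_zero_windows zero-window ACKs precede an RST."""
    zeros = 0
    for p in pkts:
        if "RST" in p["flags"]:
            return zeros >= min_zero_windows
        if p["win"] == 0:
            zeros += 1
    return False

print(stalled_before_reset(packets))
```

A True result here means the receiver stopped draining its socket buffer well before the broker reset the connection, i.e. the stall was on the consumer side.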
Created attachment 1106064 [details] dhcp-agent to Rabbitmq connection trace from the previous .cap file
A bit of background. We've got RabbitMQ configured to time out TCP sessions after 5 seconds. This is on purpose, so if a host drops offline we quickly notice, tear down the channel, and any inflight messages get requeued to be consumed by another host.

Looking at the PDF, I see:

- Page 716, No 234268: the RabbitMQ host sends a packet at time 116.167559 with sequence 360627
- Page 717, No 243903: the client ACKs number 360627 at time 121.176581, which is 5.009 seconds RTT

It came really close to making it in time, but not quite, so the kernel sends the RST on the last page and the connection gets torn down. So, for some reason that ACK was a bit slow coming back. Either network congestion or the client host being overloaded or something similar.
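The 5.009 s figure follows directly from the two timestamps, and the same check applies to any send/ACK pair pulled from the capture. A small sketch using the numbers quoted from the PDF (packets 234268 and 243903) against the server's 5-second timeout:

```python
# Check whether an ACK came back within the broker's timeout window.
# Timestamps are the ones quoted from the PDF pages above.
TIMEOUT_S = 5.0  # RabbitMQ's configured TCP session timeout

def ack_overdue(sent_at, acked_at, timeout=TIMEOUT_S):
    """Return (rtt_seconds, True if the ACK missed the timeout)."""
    rtt = acked_at - sent_at
    return rtt, rtt > timeout

rtt, overdue = ack_overdue(116.167559, 121.176581)
print(round(rtt, 3), overdue)  # 5.009 True -> kernel sends the RST
```

The margin here is about 9 ms, which is why the failure looks so intermittent: most ACKs squeak in under the limit and only an occasionally-loaded client trips it.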
John can provide better insights when it comes to rabbitmq and tcpdumps coming from it (as he already did). I'll ping Petr as well to get his thoughts on this.
(In reply to John Eckersberg from comment #10)
> A bit of background. We've got RabbitMQ configured to time out TCP sessions
> after 5 seconds. This is on purpose, so if a host drops offline we quickly
> notice, tear down the channel, and any inflight messages get requeued to be
> consumed by another host.

In this case if I understand correctly a message was not ACKed in time, so that channel got closed, and the agent never recovers. This doesn't seem like ideal behavior, especially considering that in this case there is no other host that should be consuming the message.
(In reply to Assaf Muller from comment #12)
> In this case if I understand correctly a message was not ACKed in time, so
> that channel got closed, and the agent never recovers. This doesn't seem
> like ideal behavior, especially considering that in this case there is no
> other host that should be consuming the message.

Right. There's also a bug somewhere in the client (maybe neutron, more likely oslo.messaging) preventing the agent from reconnecting like it should. The server-side strategy of killing off slow/unresponsive clients only works when the clients will reconnect when they are healthy again.
Even with tcp_keepalive_time=60 I get into:

2015-12-17 07:20:06.330 98126 WARNING neutron.openstack.common.loopingcall [req-d19b4fa4-06d2-4c87-ae36-09adef207c09 ] task <bound method DhcpAgentWithStateReport._report_state of <neutron.agent.dhcp.agent.DhcpAgentWithStateReport object at 0x2a2d8d0>> run outlasted interval by 459.15 sec
2015-12-17 07:20:06.351 98126 WARNING neutron.agent.dhcp.agent [req-62e91c60-2c5c-48e9-9020-4001c3c4e12f ] Network 87baaa6c-bf70-411e-b399-55d1a9637b49 has been deleted.
2015-12-17 07:20:06.353 98126 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2015-12-17 07:20:06.353 98126 TRACE root Traceback (most recent call last):
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 95, in inner_func
2015-12-17 07:20:06.353 98126 TRACE root     return infunc(*args, **kwargs)
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 98, in _executor_thread
2015-12-17 07:20:06.353 98126 TRACE root     self._dispatch(incoming)
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 86, in _dispatch
2015-12-17 07:20:06.353 98126 TRACE root     spawn_with(ctxt=self.dispatcher(incoming), pool=self._greenpool)
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_eventlet.py", line 53, in spawn_with
2015-12-17 07:20:06.353 98126 TRACE root     callback = ctxt.__enter__()
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2015-12-17 07:20:06.353 98126 TRACE root     return self.gen.next()
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 135, in __call__
2015-12-17 07:20:06.353 98126 TRACE root     incoming.acknowledge()
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 79, in acknowledge
2015-12-17 07:20:06.353 98126 TRACE root     self.acknowledge_callback()
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 168, in acknowledge
2015-12-17 07:20:06.353 98126 TRACE root     self._raw_message.ack()
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2015-12-17 07:20:06.353 98126 TRACE root     self.channel.basic_ack(self.delivery_tag)
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2015-12-17 07:20:06.353 98126 TRACE root     self._send_method((60, 80), args)
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2015-12-17 07:20:06.353 98126 TRACE root     self.channel_id, method_sig, args, content,
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2015-12-17 07:20:06.353 98126 TRACE root     write_frame(1, channel, payload)
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2015-12-17 07:20:06.353 98126 TRACE root     frame_type, channel, size, payload, 0xce,
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 376, in sendall
2015-12-17 07:20:06.353 98126 TRACE root     tail = self.send(data, flags)
2015-12-17 07:20:06.353 98126 TRACE root   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
2015-12-17 07:20:06.353 98126 TRACE root     total_sent += fd.send(data[total_sent:], flags)
2015-12-17 07:20:06.353 98126 TRACE root error: [Errno 104] Connection reset by peer
It's not the keepalive that is dumping you, it's the TCP_USER_TIMEOUT we've set on the listening socket. If you want to increase it, look for "5000" in /etc/rabbitmq/rabbitmq-env.conf, that's the timeout value in milliseconds. Try bumping it to 10000 or greater, but be aware that there are probably underlying network issues causing packet loss.
According to our conversation via bluejeans, I've commented this out on ctrl-0 to disable the TCP_USER_TIMEOUT for now:

#RABBITMQ_SERVER_ERL_ARGS="+K true +A30 +P 1048576 -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<5000:64/native>>}] -kernel inet_default_listen_options [{raw,6,18,<<5000:64/native>>}]"

But it still needs to be commented out on the other hosts.
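For reference, the `{raw,6,18,<<5000:64/native>>}` tuple above is a raw setsockopt: level 6 is IPPROTO_TCP and option 18 is TCP_USER_TIMEOUT on Linux, with the value in milliseconds. A sketch of the same setting applied to a single socket from Python (Linux-only; the 10000 ms value is just the bump suggested earlier, not our current config):

```python
import socket

# Linux socket option number for TCP_USER_TIMEOUT; this matches the
# "18" in the Erlang raw tuple (level 6 = IPPROTO_TCP).
TCP_USER_TIMEOUT = 18

# Equivalent of -kernel inet_default_listen_options
# [{raw,6,18,<<10000:64/native>>}], for one socket rather than all of them.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, 10000)  # milliseconds
print(s.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT))
s.close()
```

With this option set, the kernel resets the connection if transmitted data remains unacknowledged for longer than the timeout, which is exactly the RST behavior seen in the capture.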
This bug looks very much like rhbz#1270353 There's a package built for that bug `python-oslo-messaging.1.8-3` that should fix this issue. Would it be possible to install this package in the environment and try to reproduce this issue?
(In reply to Flavio Percoco from comment #20)
> This bug looks very much like rhbz#1270353
>
> There's a package built for that bug `python-oslo-messaging.1.8-3` that
> should fix this issue. Would it be possible to install this package in the
> environment and try to reproduce this issue?

Yes, we can install anything required to test this out. Currently this is what's on the controllers:

python-oslo-policy-0.3.1-1.el7ost.noarch
python-oslo-rootwrap-1.6.0-1.el7ost.noarch
python-oslo-i18n-1.5.0-3.el7ost.noarch
python-oslo-concurrency-1.8.2-1.el7ost.noarch
python-oslo-middleware-1.0.0-2.el7ost.noarch
python-oslo-context-0.2.0-5.el7ost.noarch
python-oslo-serialization-1.4.0-1.el7ost.noarch
python-oslo-vmware-0.11.1-1.el7ost.noarch
python-oslo-config-1.9.3-1.el7ost.noarch
python-oslo-messaging-1.8.3-1.el7ost.noarch
python-oslo-versionedobjects-0.1.1-1.el7ost.noarch
python-oslo-log-1.0.0-1.el7ost.noarch
python-oslo-utils-1.4.0-1.el7ost.noarch
python-oslo-db-1.7.1-1.el7ost.noarch
(In reply to Will Foster from comment #21)
> (In reply to Flavio Percoco from comment #20)
> > This bug looks very much like rhbz#1270353
> >
> > There's a package built for that bug `python-oslo-messaging.1.8-3` that
> > should fix this issue. Would it be possible to install this package in the
> > environment and try to reproduce this issue?
>
> Yes, we can install anything required to test this out.
> Currently this is what's on the controllers:
>
> python-oslo-policy-0.3.1-1.el7ost.noarch
> python-oslo-rootwrap-1.6.0-1.el7ost.noarch
> python-oslo-i18n-1.5.0-3.el7ost.noarch
> python-oslo-concurrency-1.8.2-1.el7ost.noarch
> python-oslo-middleware-1.0.0-2.el7ost.noarch
> python-oslo-context-0.2.0-5.el7ost.noarch
> python-oslo-serialization-1.4.0-1.el7ost.noarch
> python-oslo-vmware-0.11.1-1.el7ost.noarch
> python-oslo-config-1.9.3-1.el7ost.noarch
> python-oslo-messaging-1.8.3-1.el7ost.noarch
> python-oslo-versionedobjects-0.1.1-1.el7ost.noarch
> python-oslo-log-1.0.0-1.el7ost.noarch
> python-oslo-utils-1.4.0-1.el7ost.noarch
> python-oslo-db-1.7.1-1.el7ost.noarch

Comment from the bug Flavio mentioned: if it's BZ1270353, it's fixed in:

python-oslo-messaging-1.8.3-3.el7ost

https://bugzilla.redhat.com/show_bug.cgi?id=1270353
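Comparing the installed release (1.8.3-1) against the fixed one (1.8.3-3) is what tells us the controllers need the update. A rough sketch of that comparison; a real check would use rpm's own version logic, and the hypothetical `needs_update` helper below only handles the plain numeric `version-release.dist` shapes seen in this bug:

```python
# Naive version-release comparison for the oslo package strings above.
# Real RPM comparison is more involved (epochs, alpha segments, etc.);
# this sketch assumes "X.Y.Z-N.el7ost"-shaped strings only.
def needs_update(installed, fixed):
    def key(vr):
        version, _, release = vr.partition("-")
        rel_num = release.split(".")[0]  # "3" out of "3.el7ost"
        return [int(x) for x in version.split(".")] + [int(rel_num)]
    return key(installed) < key(fixed)

print(needs_update("1.8.3-1.el7ost", "1.8.3-3.el7ost"))  # True: needs the fix
```

Under these assumptions, the controllers running python-oslo-messaging-1.8.3-1.el7ost are below the 1.8.3-3 build that carries the fix.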
Hi Flavio, If it's BZ1270353, would that explain the missing namespaces? Thanks,
The recommended changes have been implemented and the python-oslo-messaging-1.8.3-3.el7ost rpm installed on the controllers. The switch configs have been re-audited by network engineers and set to default MTU settings. The recommended rabbitmq configuration option was commented out per comment #19 above, as per Miguel.

Things are currently running, and we ran some small Rally scenarios verifying that floating IPs are working and ssh access to launched guests is working at 100% (for 100 VMs launched with a concurrency of 10). Current jenkins workloads that previously triggered this issue have been initiated by Yair Fried.
Hi, can we get the package also updated on the compute nodes? It's not happening now, but it could certainly happen to a nova-compute or neutron-openvswitch-agent to eventually get disconnected on a message ack and stay off the messaging hook (leaving the compute node broken).
(In reply to Miguel Angel Ajo from comment #25)
> Hi, can we get the package also updated in the compute nodes?, It's not
> happening now, but it could certainly happen to a nova compute, or
> neutron-openvswitch-agent, to eventually get disconnected on a message ack,
> and stay off the messaging hook (compute node broken..).

Hi,

I can confirm the package update was applied to all nodes, and all services had a restart to make sure they picked up the new code.

Regards,

Graeme
We've completed a 200 instance, 5 x concurrency Rally run that tests everything with 100% success:

+------------------------------------------------------------------------------------------------------+
|                                         Response Times (sec)                                          |
+------------------------------+--------+--------+--------+--------+--------+--------+---------+-------+
| action                       | min    | median | 90%ile | 95%ile | max    | avg    | success | count |
+------------------------------+--------+--------+--------+--------+--------+--------+---------+-------+
| neutronPlugin.create_router  | 0.918  | 1.26   | 1.587  | 1.681  | 2.229  | 1.279  | 100.0%  | 200   |
| neutron.list_networks        | 0.484  | 0.77   | 1.014  | 1.084  | 1.483  | 0.778  | 100.0%  | 200   |
| neutron.create_network       | 0.139  | 0.209  | 0.266  | 0.309  | 0.452  | 0.223  | 100.0%  | 200   |
| neutron.create_subnet        | 0.197  | 0.276  | 0.37   | 0.439  | 1.334  | 0.304  | 100.0%  | 200   |
| neutron.add_interface_router | 0.245  | 0.379  | 0.485  | 0.56   | 0.766  | 0.395  | 100.0%  | 200   |
| nova.boot_server             | 35.542 | 36.721 | 37.686 | 38.665 | 44.41  | 36.953 | 100.0%  | 200   |
| vm.attach_floating_ip        | 0.946  | 1.319  | 1.883  | 2.196  | 2.476  | 1.417  | 100.0%  | 200   |
| vm.wait_for_ping             | 2.605  | 4.852  | 6.095  | 6.097  | 7.222  | 5.029  | 100.0%  | 200   |
| vm.wait_for_ssh              | 2.238  | 8.284  | 9.63   | 9.951  | 11.577 | 6.746  | 100.0%  | 200   |
| vm.run_command_over_ssh      | 2.21   | 2.237  | 2.244  | 2.251  | 2.28   | 2.239  | 100.0%  | 200   |
| total                        | 48.324 | 55.301 | 57.717 | 58.404 | 63.535 | 54.639 | 100.0%  | 200   |
+------------------------------+--------+--------+--------+--------+--------+--------+---------+-------+

Things seem stable now after applying the updated oslo package. We will need some longer-running workloads to fully vet that the dhcp-agent issues are solved, however.
will - can we close this bug?
We are still seeing issues here with pacemaker and dhcp-agent.

(In reply to lpeer from comment #29)
> will - can we close this bug?

We saw some further instability with the dhcp-agent processes after throwing more load onto the environment over the weekend. I don't believe this issue is fixed; however, the oslo-messaging errata seemed to at least get things running again.