Bug 1302387
| Field | Value |
| --- | --- |
| Summary | [HA] Hypervisor reported as down after failover of one of the controllers |
| Product | Red Hat OpenStack |
| Component | python-oslo-messaging |
| Version | 8.0 (Liberty) |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Reporter | Marian Krcmarik <mkrcmari> |
| Assignee | John Eckersberg <jeckersb> |
| QA Contact | Marian Krcmarik <mkrcmari> |
| CC | apevec, berrange, dasmith, eglynn, fdinitto, jeckersb, jschluet, kchamart, lhh, mkrcmari, oblaut, sbauza, sclewis, sferdjao, sgordon, srevivo, vromanso, vstinner |
| Keywords | AutomationBlocker, TestBlocker, ZStream |
| Target Milestone | async |
| Target Release | 8.0 (Liberty) |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | python-oslo-messaging-2.5.0-7.el7ost |
| Doc Type | Bug Fix |
| Type | Bug |
| Clones | 1306708, 1322903 (view as bug list) |
| Bug Blocks | 1306708, 1322903 |
| Last Closed | 2016-06-29 13:48:03 UTC |
Description (Marian Krcmarik, 2016-01-27 16:56:21 UTC)
This appears to be using the OSP 8 core puddle from December. I managed to reproduce the bug with the following steps:

1. Have three controllers and one compute node, with nova-compute processes connected to the AMQP servers on the controllers.
2. Drop communication from the compute node to one of the controllers with iptables.

As a result, the hypervisor was reported as down for over 15 minutes, and the RabbitMQ client did not detect that the AMQP server on the affected controller was unreachable.

Note: It is noticeably easier to reproduce with a hard reset of a controller than with just an iptables traffic drop. A sketch of the traffic drop from step 2 is shown below.
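The traffic drop in step 2 can be scripted. The following is a rough sketch rather than the reporter's exact commands; it assumes it runs as root on the compute node, and 172.16.20.13 (an address taken from the logs below) stands in for the controller being isolated:

```python
# Sketch of the reproduction step: drop all traffic from the compute node
# to one controller, simulating a non-graceful controller failure.
# Assumptions: run as root on the compute node; CONTROLLER is hypothetical
# and should be replaced with the address of the controller to isolate.
import subprocess

CONTROLLER = "172.16.20.13"
RULE = ["OUTPUT", "-d", CONTROLLER, "-j", "DROP"]

def drop_traffic():
    # Append the DROP rule to the OUTPUT chain.
    subprocess.check_call(["iptables", "-A"] + RULE)

def restore_traffic():
    # Delete the same rule to end the failure simulation.
    subprocess.check_call(["iptables", "-D"] + RULE)

if __name__ == "__main__":
    drop_traffic()  # then watch "nova hypervisor-list" from a controller
```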
Dan Smith (comment #5): Marian, can you confirm that you have reproduced this on the latest available code for all components per comment #2?

(In reply to Dan Smith from comment #5) I can confirm that; I tried with the latest bits available for RHOS 8 as of today. Moreover, I am changing the component to oslo.messaging: the bug affects the neutron agents in the same way, and hopefully it will get the right attention there.

Latest reproduction, on an RHOS 8 HA setup with 3 controller nodes and 1 compute node:

1. `nova stop overcloud-controller-0` from the undercloud at around 10:20:00.
2. Give it a minute for the VIP to recover.
3. Run "nova hypervisor-list" and "neutron agent-list": all neutron agents are reported as down, and the hypervisor is reported as down as well.
4. Regularly reissue the commands from step 3 until the agents on the surviving controllers and the hypervisor are reported as up.
5. For the hypervisor this succeeded after 7 minutes; for some of the neutron agents, after 16 minutes.

Example from the neutron l3 agent. 10:36:06 is the time when the agent started to be reported as alive again, which means it took 16 minutes:

2016-02-11 10:20:23.609 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-02-11 10:20:23.895 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-02-11 10:20:29.645 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 10:20:29.912 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 10:22:03.103 26015 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.16.20.15:5672
2016-02-11 10:23:30.956 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Socket closed
2016-02-11 10:23:30.957 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 172.16.20.14:5672 after None tries: Socket closed
2016-02-11 10:23:30.957 26015 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2016-02-11 10:23:30.957 26015 ERROR root Traceback (most recent call last):
2016-02-11 10:23:30.957 26015 ERROR root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 205, in inner_func
2016-02-11 10:23:30.957 26015 ERROR root     return infunc(*args, **kwargs)
2016-02-11 10:23:30.957 26015 ERROR root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_pooledexecutor.py", line 95, in _runner
2016-02-11 10:23:30.957 26015 ERROR root     incoming = self.listener.poll()
2016-02-11 10:23:30.957 26015 ERROR root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 196, in poll
2016-02-11 10:23:30.957 26015 ERROR root     self.conn.consume(timeout=timeout)
2016-02-11 10:23:30.957 26015 ERROR root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 899, in consume
2016-02-11 10:23:30.957 26015 ERROR root     error_callback=_error_callback)
2016-02-11 10:23:30.957 26015 ERROR root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 704, in ensure
2016-02-11 10:23:30.957 26015 ERROR root     raise exceptions.MessageDeliveryFailure(msg)
2016-02-11 10:23:30.957 26015 ERROR root MessageDeliveryFailure: Unable to connect to AMQP server on 172.16.20.14:5672 after None tries: Socket closed
2016-02-11 10:23:30.957 26015 ERROR root
2016-02-11 10:23:31.976 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.14:5672 is unreachable: connection already closed. Trying again in 1 seconds.
2016-02-11 10:27:18.307 26015 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.16.20.14:5672
2016-02-11 10:35:53.429 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-02-11 10:35:59.444 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 10:36:05.460 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 10:36:06.481 26015 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.16.20.14:5672
2016-02-11 10:36:06.492 26015 WARNING oslo.service.loopingcall [-] Function 'neutron.agent.l3.agent.L3NATAgentWithStateReport._report_state' run outlasted interval by 911.02 sec

Another example, from a different try on RHOS 8. Controller-1 was taken down; I am pasting the log of the metadata agent on controller-2. It again took 15 minutes for the metadata agent on controller-2 to be reported as alive after controller-1 was taken down non-gracefully:

2016-02-11 11:05:56.111 23849 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.14:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-02-11 11:06:02.124 23849 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.14:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 11:08:08.144 23849 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.16.20.13:5672
2016-02-11 11:21:20.173 23849 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.14:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-02-11 11:21:26.186 23849 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.14:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 11:21:27.238 23849 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.16.20.13:5672
2016-02-11 11:21:27.259 23849 WARNING oslo.service.loopingcall [-] Function 'neutron.agent.metadata.agent.UnixDomainMetadataProxy._report_state' run outlasted interval by 909.09 sec

Is 15 minutes some known timeout for connections?
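As the next comment works out, this is the Linux TCP retransmission budget. With the defaults net.ipv4.tcp_retries2 = 15, TCP_RTO_MIN = 200 ms, and TCP_RTO_MAX = 120 s, the kernel keeps retransmitting unacknowledged data on an established connection for about 924.6 seconds before declaring it dead (the figure documented in the kernel's ip-sysctl.txt), which matches the ~911 s and ~909 s "run outlasted interval" warnings above. A back-of-envelope sketch of the arithmetic, assuming the RTO has settled at its 200 ms floor as on a LAN:

```python
# Why ~15 minutes: cumulative exponential backoff of TCP retransmissions
# with the Linux defaults (tcp_retries2 = 15, RTO doubling from TCP_RTO_MIN
# up to a cap of TCP_RTO_MAX).
TCP_RTO_MIN = 0.2    # seconds
TCP_RTO_MAX = 120.0  # seconds
TCP_RETRIES2 = 15    # net.ipv4.tcp_retries2 default

rto, total = TCP_RTO_MIN, 0.0
for _ in range(TCP_RETRIES2 + 1):   # until the next retransmit would fire
    total += rto                    # wait before the next retransmission
    rto = min(rto * 2, TCP_RTO_MAX)
print("connection declared dead after ~%.1f s (%.1f min)" % (total, total / 60))
# -> connection declared dead after ~924.6 s (15.4 min)
```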
Some new observations after intensive testing: all my debugging was focused on the wrong side of the connection, where I wasted too much time; in the end the problem is on the RabbitMQ "client" side, not on the server side. I am taking down the controllers (the rabbitmq servers) non-gracefully, not the clients. When I look (to keep it simple) at the TCP RabbitMQ connections (to remote port 5672) on the compute nodes at the time of the controller failover, I can see that most likely some data is sent but never acknowledged, so the retransmission timer kicks in and keeps the connection established for up to 15 minutes. The socket still appears to be alive even though it is in reality dead, until the retransmission timer expires. This causes all the agents and hypervisors whose RabbitMQ connection was established to the failed server to appear dead. So I can either limit that system-wide with the net.ipv4.tcp_retries2 option, which makes the retransmission timer expire sooner, or set TCP_USER_TIMEOUT on the socket in the py-amqp code, something like:

```diff
--- /usr/lib/python2.7/site-packages/amqp/transport.py.bkp	2016-02-29 09:11:02.561000000 -0500
+++ /usr/lib/python2.7/site-packages/amqp/transport.py	2016-02-29 09:11:24.367000000 -0500
@@ -97,6 +97,8 @@
         try:
             self.sock.settimeout(None)
             self.sock.setsockopt(SOL_TCP, socket.TCP_NODELAY, 1)
+            socket.TCP_USER_TIMEOUT = 18
+            self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 10000)
             self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

             self._setup_transport()
```

(Python 2.7's socket module does not define TCP_USER_TIMEOUT, hence the hand-defined Linux option number 18; the timeout value is in milliseconds.) These two options seem to help and speed up failover recovery. Interesting article: http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html

My initial thought is that heartbeats should be catching this but obviously aren't, so I'll start trying to reproduce and look into what's going on there.

OK, I've tracked this down. The rough sequence is:

- The controller is powered off.
- The compute node calls _report_state.
- The connection lock is obtained.
- A Publisher object is created, which sends exchange.declare over the wire.
- The exchange.declare packet sits in the send queue because there is no route to the host.
- py-amqp calls recv() to read exchange.declare_ok, but never gets a response since the other end is gone.

Meanwhile:

- The heartbeat thread spins in heartbeat_acquire because the other thread still holds the connection lock.

So the end result is that the heartbeat thread can't progress far enough to realize the heartbeat has been missed and kill the connection. I'll file a more detailed bug upstream and get this sorted out. Stay tuned. A simplified illustration of the lock interaction follows.
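A minimal sketch of that lock starvation, illustrative only and not oslo.messaging's actual code:

```python
# The publisher thread holds the shared connection lock while blocked in a
# recv() that can never return, so the heartbeat thread never gets the lock,
# never notices the missed heartbeats, and never kills the connection.
import threading
import time

connection_lock = threading.Lock()

def report_state():
    # Stands in for _report_state: it acquires the connection lock, sends
    # exchange.declare (which sits in the socket send queue), then blocks
    # waiting for an exchange.declare_ok that never arrives.
    with connection_lock:
        time.sleep(10)  # stand-in for recv() stuck behind TCP retransmission

def heartbeat_thread():
    # Stands in for the heartbeat thread spinning in heartbeat_acquire:
    # it cannot check or expire the connection without the lock.
    for _ in range(5):
        if connection_lock.acquire(False):
            connection_lock.release()
            print("got the lock; heartbeat check would run")
            return
        print("lock held by publisher; heartbeat check starved")
        time.sleep(1)

threading.Thread(target=report_state).start()
time.sleep(0.1)  # let the publisher grab the lock first
heartbeat_thread()
```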
Sorry for the delay; a package is now available for testing. FYI, backporting the final fix (TCP_USER_TIMEOUT) was not trivial and required backporting 4 other (simple) changes. John prepared a patch series; I approved (and merged) it and built a package. At least unit tests and functional tests pass. The python-oslo-messaging-2.5.0-7.el7ost package contains 9 changes (including the TCP_USER_TIMEOUT fix) compared to the previous version (python-oslo-messaging-2.5.0-6.el7ost).

Verified by Marian on OSP 8 & 9.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1352
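As a closing note, the effect of the TCP_USER_TIMEOUT socket option that the fix relies on can be spot-checked on any Linux host. This standalone snippet is illustrative and is not part of the advisory:

```python
# Set and read back TCP_USER_TIMEOUT the way the patch above does
# (Linux only; option number 18, since Python 2.7's socket module
# does not define socket.TCP_USER_TIMEOUT).
import socket

TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, 10000)  # milliseconds
print(s.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT))  # -> 10000
s.close()
```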