Bug 1302387 - [HA] Hypervisor reported as down after failover of one of the controllers [NEEDINFO]
[HA] Hypervisor reported as down after failover of one of the controllers
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-oslo-messaging (Show other bugs)
8.0 (Liberty)
Unspecified Unspecified
urgent Severity urgent
: async
: 8.0 (Liberty)
Assigned To: John Eckersberg
Marian Krcmarik
: AutomationBlocker, TestBlocker, ZStream
Depends On:
Blocks: 1306708 1322903
  Show dependency treegraph
 
Reported: 2016-01-27 11:56 EST by Marian Krcmarik
Modified: 2016-06-29 09:48 EDT (History)
19 users (show)

See Also:
Fixed In Version: python-oslo-messaging-2.5.0-7.el7ost
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1306708 1322903 (view as bug list)
Environment:
Last Closed: 2016-06-29 09:48:03 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
oblaut: needinfo? (jeckersb)
lhh: needinfo? (jeckersb)


Attachments (Terms of Use)
nova-compute debug log (229.46 KB, text/plain)
2016-01-27 11:56 EST, Marian Krcmarik
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
OpenStack gerrit 294074 None None None 2016-03-17 11:36 EDT

  None (edit)
Description Marian Krcmarik 2016-01-27 11:56:21 EST
Created attachment 1118832 [details]
nova-compute debug log

Description of problem:
RHOS8 Setup in HA mode -> 3 controllers nodes and 1 compute node.

The short story:
Sometimes after one of the controllers is being reset through ipmi to test HA functionality, "nova hypervisor-list" reports hypervisor to be down and new instances cannot be booted due to reported error: "No valid host was found. There are not enough hosts available." for long time (several minutes) or until reset controller is back online.

The long story:
I am hitting several issues while testing HA mostly related to rabbitmq (I guess). One of the issues I could observe is: https://bugs.launchpad.net/oslo.messaging/+bug/1519851, so I apply the patch to my RHOS8 setup which solved one problem of rabbitmq being reconnected faster to different AMQP server once the one where the connection was made to is down. But I can still see following problem where It looks like unaccessiblity of controller is not detected correctly.

My controller nodes are:
computer     10.35.169.12
controller-0 10.35.169.13
controller-1 10.35.169.14
controller-2 10.35.169.15

If I looked at the netstat output of estabslished connection from compute node to AMQP servers on controllers I can see following:
# netstat -putna | grep 10.35.169.13
tcp        8      0 10.35.169.12:53011      10.35.169.13:5672       ESTABLISHED 77035/python2       
tcp        0      0 10.35.169.12:52984      10.35.169.13:5672       ESTABLISHED 3369/python2

# netstat -putna | grep 10.35.169.14
tcp        0      0 10.35.169.12:45993      10.35.169.14:5672       ESTABLISHED 77035/python2       
tcp        0      0 10.35.169.12:45990      10.35.169.14:5672       ESTABLISHED 1868/python2

netstat -putna | grep 10.35.169.15
tcp        0      0 10.35.169.12:57938      10.35.169.15:5672       ESTABLISHED 77035/python2       
tcp        0      0 10.35.169.12:57924      10.35.169.15:5672       ESTABLISHED 3369/python2        
tcp        0      0 10.35.169.12:57887      10.35.169.15:5672       ESTABLISHED 3369/python2        
tcp        0      0 10.35.169.12:57886      10.35.169.15:5672       ESTABLISHED 77035/python2       
tcp        0      0 10.35.169.12:57883      10.35.169.15:5672       ESTABLISHED 3369/python2        
tcp        0      0 10.35.169.12:57926      10.35.169.15:5672       ESTABLISHED 3369/python2        
tcp        0      0 10.35.169.12:57891      10.35.169.15:5672       ESTABLISHED 3369/python2        
tcp        0      0 10.35.169.12:57892      10.35.169.15:5672       ESTABLISHED 3369/python2        
tcp        0      0 10.35.169.12:57923      10.35.169.15:5672       ESTABLISHED 3369/python2        
tcp        0      0 10.35.169.12:57940      10.35.169.15:5672       ESTABLISHED 77035/python2       
tcp        0      0 10.35.169.12:57920      10.35.169.15:5672       ESTABLISHED 3369/python2        
tcp        0      0 10.35.169.12:57885      10.35.169.15:5672       ESTABLISHED 77035/python2

where:
ps ax | grep 77035
 77035 ?        Ssl    4:52 /usr/bin/python2 /usr/bin/nova-compute
# ps ax | grep 3369
  3369 ?        Ss    73:37 /usr/bin/python2 /usr/bin/neutron-openvswitch-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-openvswitch-agent --log-file /var/log/neutron/openvswitch-agent.log

I do not understand how exactly things work here but for reproducer it's important that these connections to AMQP servers from compute node from nova-compute are spread to more controllers, in the outputs I pasted above, there is a connection to all controllers from computer node.

10:36:19 IPMI reset is triggered to controller-1
After that nothing is changing in netstat outputs I pasted above and "nova hypervisor-list start reporting hypervisor down.
10:38:39 Attempt to boot an instance with msg "No valid host was found. There are not enough hosts available."
"nova hypervisor-list" keeps reporting the hypervisor down.
10:40:00 The node which was reset comes back online
10:40:40 nova-compute log reports for first time that something is wrong with controller-1 (10.35.169.14)
2016-01-27 10:40:40.126 77035 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server 10.35.169.14:5672 closed the connection. Check login credentials: Socket closed
2016-01-27 10:40:42.172 77035 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 10.35.169.15:5672
and "nova hypervisor-list" starts to report hypervisor to be up and instance can be boot.

It took 4 minutes to get recovered and I assume only coz socket was reporetd as closed once the controller-1 came back online.

Version-Release number of selected component (if applicable):
# rpm -qa | grep openstack
openstack-swift-plugin-swift3-1.9-1.el7ost.noarch
openstack-ceilometer-common-5.0.1-1.el7ost.noarch
openstack-nova-compute-12.0.0-4.el7ost.noarch
openstack-nova-console-12.0.0-4.el7ost.noarch
openstack-manila-1.0.0-2.el7ost.noarch
openstack-ceilometer-alarm-5.0.1-1.el7ost.noarch
openstack-selinux-0.6.50-1.el7ost.noarch
openstack-neutron-metering-agent-7.0.1-2.el7ost.noarch
openstack-ceilometer-compute-5.0.1-1.el7ost.noarch
openstack-utils-2014.2-1.el7ost.noarch
openstack-puppet-modules-7.0.1-2.el7ost.noarch
python-django-openstack-auth-2.0.1-1.2.el7ost.noarch
openstack-dashboard-8.0.0-8.el7ost.noarch
openstack-ceilometer-notification-5.0.1-1.el7ost.noarch
openstack-ceilometer-polling-5.0.1-1.el7ost.noarch
openstack-neutron-7.0.1-2.el7ost.noarch
openstack-glance-11.0.0-1.el7ost.noarch
openstack-nova-scheduler-12.0.0-4.el7ost.noarch
openstack-nova-api-12.0.0-4.el7ost.noarch
openstack-neutron-ml2-7.0.1-2.el7ost.noarch
openstack-heat-engine-5.0.0-1.el7ost.noarch
openstack-ceilometer-central-5.0.1-1.el7ost.noarch
python-openstackclient-1.7.1-1.el7ost.noarch
openstack-swift-2.5.0-1.el7ost.noarch
openstack-neutron-common-7.0.1-2.el7ost.noarch
openstack-nova-common-12.0.0-4.el7ost.noarch
openstack-neutron-lbaas-7.0.0-2.el7ost.noarch
openstack-nova-cert-12.0.0-4.el7ost.noarch
openstack-nova-novncproxy-12.0.0-4.el7ost.noarch
openstack-neutron-openvswitch-7.0.1-2.el7ost.noarch
openstack-heat-api-cfn-5.0.0-1.el7ost.noarch
openstack-ceilometer-collector-5.0.1-1.el7ost.noarch
openstack-swift-proxy-2.5.0-1.el7ost.noarch
openstack-swift-account-2.5.0-1.el7ost.noarch
openstack-heat-common-5.0.0-1.el7ost.noarch
openstack-manila-share-1.0.0-2.el7ost.noarch
openstack-keystone-8.0.0-1.el7ost.noarch
openstack-heat-api-cloudwatch-5.0.0-1.el7ost.noarch
openstack-swift-object-2.5.0-1.el7ost.noarch
openstack-swift-container-2.5.0-1.el7ost.noarch
openstack-dashboard-theme-8.0.0-8.el7ost.noarch
openstack-cinder-7.0.0-2.el7ost.noarch
openstack-nova-conductor-12.0.0-4.el7ost.noarch
openstack-heat-api-5.0.0-1.el7ost.noarch
openstack-ceilometer-api-5.0.1-1.el7ost.noarch

How reproducible:
I can reproduce constantly unless I can see connections spread from compute node to more controllers AMQ servers.

Additional info:
I have no idea what component to check or what logs to look at more, so please advice/change component. It was just my assumption It can be related to rabbitmq but I am not sure.
Note: To observe the problem more easily I delay the boot of controller in GRUB (since in default I am using fencing).
Comment 2 Jon Schlueter 2016-01-29 11:35:09 EST
This appears to be using OSP 8 core puddle from December
Comment 4 Marian Krcmarik 2016-02-02 09:50:54 EST
I managed to reproduce the bug in following steps:
1. Having three controllers and one compute node, multiple nove-compute processes connected to AMQP servers on the controllers.
2. Dropping communication from compute node to one of the controllers with iptables.

AS result I hit that hypervisor was being reported as down for over 15 minutes and rabbitmq did not detect that AMQP server on one of the controllers was unreachable.

Note: It's somehow much easier to reproduce when doing hard reset of a controller then just iptables traffic drop.
Comment 5 Dan Smith 2016-02-08 11:22:06 EST
Marian, can you confirm that you have reproduced this on the latest available code for all components per comment #2?
Comment 7 Marian Krcmarik 2016-02-11 11:01:44 EST
(In reply to Dan Smith from comment #5)
> Marian, can you confirm that you have reproduced this on the latest
> available code for all components per comment #2?

I can confirm that, I did try with latest bits for RHOS8 available by today.
Moreover I am changing the component to oslo.messaging, It effects the same way neutron agents and hopefully It will get right attention there.

Latest reproduction:
- RHOS8 HA setup with 3 controller nodes and 1 compute node.

1. nova stop overcloud-controller-0 from undercloud at around 10:20:00
2. give it a minute for VIP to recover.
3. Enter "nova hypervisor-list" and "neutron agent-list" -> all neutron agents reported as down, hypervisor reported as down as well.
4. Regulary issue commands from 3. until It reports agents up on alive controllers and hypervisor up
5. For hypervisor It succeed after 7 minutes, for some neutron agents after 16 minutes.

Example from neutron l3 agent 10:36:06 is the time when agent started to be reported as alive, which means It took 16 minutes:
2016-02-11 10:20:23.609 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-02-11 10:20:23.895 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-02-11 10:20:29.645 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 10:20:29.912 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 10:22:03.103 26015 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.16.20.15:5672
2016-02-11 10:23:30.956 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Socket closed
2016-02-11 10:23:30.957 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 172.16.20.14:5672 after None tries: Socket closed
2016-02-11 10:23:30.957 26015 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2016-02-11 10:23:30.957 26015 ERROR root Traceback (most recent call last):
2016-02-11 10:23:30.957 26015 ERROR root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 205, in inner_func
2016-02-11 10:23:30.957 26015 ERROR root     return infunc(*args, **kwargs)
2016-02-11 10:23:30.957 26015 ERROR root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_pooledexecutor.py", line 95, in _runner
2016-02-11 10:23:30.957 26015 ERROR root     incoming = self.listener.poll()
2016-02-11 10:23:30.957 26015 ERROR root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 196, in poll
2016-02-11 10:23:30.957 26015 ERROR root     self.conn.consume(timeout=timeout)
2016-02-11 10:23:30.957 26015 ERROR root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 899, in consume
2016-02-11 10:23:30.957 26015 ERROR root     error_callback=_error_callback)
2016-02-11 10:23:30.957 26015 ERROR root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 704, in ensure
2016-02-11 10:23:30.957 26015 ERROR root     raise exceptions.MessageDeliveryFailure(msg)
2016-02-11 10:23:30.957 26015 ERROR root MessageDeliveryFailure: Unable to connect to AMQP server on 172.16.20.14:5672 after None tries: Socket closed
2016-02-11 10:23:30.957 26015 ERROR root 
2016-02-11 10:23:31.976 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.14:5672 is unreachable: connection already closed. Trying again in 1 seconds.
2016-02-11 10:27:18.307 26015 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.16.20.14:5672
2016-02-11 10:35:53.429 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-02-11 10:35:59.444 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 10:36:05.460 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 10:36:06.481 26015 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.16.20.14:5672
2016-02-11 10:36:06.492 26015 WARNING oslo.service.loopingcall [-] Function 'neutron.agent.l3.agent.L3NATAgentWithStateReport._report_state' run outlasted interval by 911.02 sec
Comment 8 Marian Krcmarik 2016-02-11 11:23:59 EST
Another example from different try on RHOS8.

Controller-1 was taken down, I am pasting log for metadata agent on controller-2, It took again 15 minutes for metadata neutron agent on controller2 to be reported alive after controller-1 was takne down nongracefully.

2016-02-11 11:05:56.111 23849 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.14:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-02-11 11:06:02.124 23849 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.14:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 11:08:08.144 23849 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.16.20.13:5672
2016-02-11 11:21:20.173 23849 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.14:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-02-11 11:21:26.186 23849 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.14:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 11:21:27.238 23849 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.16.20.13:5672
2016-02-11 11:21:27.259 23849 WARNING oslo.service.loopingcall [-] Function 'neutron.agent.metadata.agent.UnixDomainMetadataProxy._report_state' run outlasted interval by 909.09 sec

Is 15 minutes some known timeout for connections?
Comment 9 Marian Krcmarik 2016-02-29 14:10:37 EST
Some new observations after intensive testing:
All my debugging was focused on the wrong side of connection where I wasted too much time, in the end, the problem is on the rabbit "client" side and not on the server side.
I am actually taking down controllers non-gracefully (rabbitmq server), not the clients. When I look (to make it simple) at TCP rabbit connections (to remote 5672 port) on the computes nodes at the time of controller failover I can see that most likely some data tries to get send which are not acknowledged so retransmission timer kicks in which keeps the connection established up to 15 mins. The socket still appears to be alive even though in reality dead until retransmission timer expires. This causing all the agents and hypervisors which established rabbit connection to server in failover to appear dead. So I can either limit that systemwise with net.ipv4.tcp_retries2 option which would speed up retransmission timer to expired or I tried to set TCP_USER_TIMEOUT in pyaqmp code to the socket, something like:

--- /--- /usr/lib/python2.7/site-packages/amqp/transport.py.bkp	2016-02-29 09:11:02.561000000 -0500
+++ /usr/lib/python2.7/site-packages/amqp/transport.py	2016-02-29 09:11:24.367000000 -0500
@@ -97,6 +97,8 @@
         try:
             self.sock.settimeout(None)
             self.sock.setsockopt(SOL_TCP, socket.TCP_NODELAY, 1)
+            socket.TCP_USER_TIMEOUT = 18
+            self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 10000)
             self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
 
             self._setup_transport()
             self._setup_transport()

These two options seem to help and speed up failover recovery.
Comment 10 Victor Stinner 2016-03-07 11:30:45 EST
Interesting article: http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html
Comment 11 John Eckersberg 2016-03-15 09:49:44 EDT
My initial thought is that heartbeats should be catching this but obviously isn't, so I'll start trying to reproduce and look into what's going on there.
Comment 13 John Eckersberg 2016-03-16 16:09:33 EDT
OK, I've tracked this down.  The rough sequence is:

- controller is powered off

- compute calls _report_state

- connection lock is obtained

- Publisher object is created which sends exchange.declare over the wire

- The exchange.declare packet sits in the send-q due to no route to host

- py-amqp calls recv() to read exchange.declare_ok but never gets response since other end is gone

Meanwhile...

- heartbeat thread spins in heartbeat_acquire because the other thread still holds the connection lock

So the end result is the heartbeat thread can't progress far enough to realize the heartbeat has been missed and kill the connection.

I'll file a more detailed bug upstream and get this sorted out.  Stay tuned.
Comment 14 John Eckersberg 2016-03-17 11:36:24 EDT
https://review.openstack.org/#/c/294074
Comment 21 Victor Stinner 2016-05-31 16:50:57 EDT
Sorry for the delay, a package is now available for testing.

FYI backporting the final fix (TCP_USER_TIMEOUT) was not trivial and required to backport 4 other (simple) changes. John prepared a patch serie, I just approved (and merged) it and built a package. At least, unit tests and functional tests pass.

The python-oslo-messaging-2.5.0-7.el7ost package contains 9 changes (including the TCP_USER_TIMEOUT fix) compared to the previous version (python-oslo-messaging-2.5.0-6.el7ost).
Comment 23 Ofer Blaut 2016-06-07 04:46:03 EDT
Verified by Marian
Comment 24 Ofer Blaut 2016-06-07 04:46:58 EDT
Verified on osp 8 & 9
Comment 26 errata-xmlrpc 2016-06-29 09:48:03 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1352

Note You need to log in before you can comment on or make changes to this bug.