Created attachment 1118832 [details]
nova-compute debug log

Description of problem:

RHOS8 setup in HA mode -> 3 controller nodes and 1 compute node.

The short story:
Sometimes after one of the controllers is reset through IPMI to test HA functionality, "nova hypervisor-list" reports the hypervisor as down and new instances cannot be booted, failing with "No valid host was found. There are not enough hosts available." This lasts for a long time (several minutes) or until the reset controller is back online.

The long story:
I am hitting several issues while testing HA, mostly related to rabbitmq (I guess). One of the issues I could observe is https://bugs.launchpad.net/oslo.messaging/+bug/1519851, so I applied the patch to my RHOS8 setup, which solved one problem: rabbitmq clients now reconnect faster to a different AMQP server once the server the connection was made to goes down. But I can still see the following problem, where it looks like the unreachability of a controller is not detected correctly.

My nodes are:
compute       10.35.169.12
controller-0  10.35.169.13
controller-1  10.35.169.14
controller-2  10.35.169.15

Looking at the netstat output of established connections from the compute node to the AMQP servers on the controllers, I can see the following:

# netstat -putna | grep 10.35.169.13
tcp 8 0 10.35.169.12:53011 10.35.169.13:5672 ESTABLISHED 77035/python2
tcp 0 0 10.35.169.12:52984 10.35.169.13:5672 ESTABLISHED 3369/python2

# netstat -putna | grep 10.35.169.14
tcp 0 0 10.35.169.12:45993 10.35.169.14:5672 ESTABLISHED 77035/python2
tcp 0 0 10.35.169.12:45990 10.35.169.14:5672 ESTABLISHED 1868/python2

# netstat -putna | grep 10.35.169.15
tcp 0 0 10.35.169.12:57938 10.35.169.15:5672 ESTABLISHED 77035/python2
tcp 0 0 10.35.169.12:57924 10.35.169.15:5672 ESTABLISHED 3369/python2
tcp 0 0 10.35.169.12:57887 10.35.169.15:5672 ESTABLISHED 3369/python2
tcp 0 0 10.35.169.12:57886 10.35.169.15:5672 ESTABLISHED 77035/python2
tcp 0 0 10.35.169.12:57883 10.35.169.15:5672 ESTABLISHED 3369/python2
tcp 0 0 10.35.169.12:57926 10.35.169.15:5672 ESTABLISHED 3369/python2
tcp 0 0 10.35.169.12:57891 10.35.169.15:5672 ESTABLISHED 3369/python2
tcp 0 0 10.35.169.12:57892 10.35.169.15:5672 ESTABLISHED 3369/python2
tcp 0 0 10.35.169.12:57923 10.35.169.15:5672 ESTABLISHED 3369/python2
tcp 0 0 10.35.169.12:57940 10.35.169.15:5672 ESTABLISHED 77035/python2
tcp 0 0 10.35.169.12:57920 10.35.169.15:5672 ESTABLISHED 3369/python2
tcp 0 0 10.35.169.12:57885 10.35.169.15:5672 ESTABLISHED 77035/python2

where:

# ps ax | grep 77035
77035 ? Ssl 4:52 /usr/bin/python2 /usr/bin/nova-compute

# ps ax | grep 3369
3369 ? Ss 73:37 /usr/bin/python2 /usr/bin/neutron-openvswitch-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-openvswitch-agent --log-file /var/log/neutron/openvswitch-agent.log

I do not understand exactly how things work here, but for the reproducer it is important that the nova-compute connections to the AMQP servers are spread across multiple controllers; in the output pasted above, the compute node has a connection to every controller.

10:36:19 - IPMI reset is triggered on controller-1.

After that, nothing changes in the netstat output pasted above, and "nova hypervisor-list" starts reporting the hypervisor as down.

10:38:39 - An attempt to boot an instance fails with "No valid host was found. There are not enough hosts available."
"nova hypervisor-list" keeps reporting the hypervisor down. 10:40:00 The node which was reset comes back online 10:40:40 nova-compute log reports for first time that something is wrong with controller-1 (10.35.169.14) 2016-01-27 10:40:40.126 77035 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server 10.35.169.14:5672 closed the connection. Check login credentials: Socket closed 2016-01-27 10:40:42.172 77035 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 10.35.169.15:5672 and "nova hypervisor-list" starts to report hypervisor to be up and instance can be boot. It took 4 minutes to get recovered and I assume only coz socket was reporetd as closed once the controller-1 came back online. Version-Release number of selected component (if applicable): # rpm -qa | grep openstack openstack-swift-plugin-swift3-1.9-1.el7ost.noarch openstack-ceilometer-common-5.0.1-1.el7ost.noarch openstack-nova-compute-12.0.0-4.el7ost.noarch openstack-nova-console-12.0.0-4.el7ost.noarch openstack-manila-1.0.0-2.el7ost.noarch openstack-ceilometer-alarm-5.0.1-1.el7ost.noarch openstack-selinux-0.6.50-1.el7ost.noarch openstack-neutron-metering-agent-7.0.1-2.el7ost.noarch openstack-ceilometer-compute-5.0.1-1.el7ost.noarch openstack-utils-2014.2-1.el7ost.noarch openstack-puppet-modules-7.0.1-2.el7ost.noarch python-django-openstack-auth-2.0.1-1.2.el7ost.noarch openstack-dashboard-8.0.0-8.el7ost.noarch openstack-ceilometer-notification-5.0.1-1.el7ost.noarch openstack-ceilometer-polling-5.0.1-1.el7ost.noarch openstack-neutron-7.0.1-2.el7ost.noarch openstack-glance-11.0.0-1.el7ost.noarch openstack-nova-scheduler-12.0.0-4.el7ost.noarch openstack-nova-api-12.0.0-4.el7ost.noarch openstack-neutron-ml2-7.0.1-2.el7ost.noarch openstack-heat-engine-5.0.0-1.el7ost.noarch openstack-ceilometer-central-5.0.1-1.el7ost.noarch python-openstackclient-1.7.1-1.el7ost.noarch openstack-swift-2.5.0-1.el7ost.noarch openstack-neutron-common-7.0.1-2.el7ost.noarch openstack-nova-common-12.0.0-4.el7ost.noarch openstack-neutron-lbaas-7.0.0-2.el7ost.noarch openstack-nova-cert-12.0.0-4.el7ost.noarch openstack-nova-novncproxy-12.0.0-4.el7ost.noarch openstack-neutron-openvswitch-7.0.1-2.el7ost.noarch openstack-heat-api-cfn-5.0.0-1.el7ost.noarch openstack-ceilometer-collector-5.0.1-1.el7ost.noarch openstack-swift-proxy-2.5.0-1.el7ost.noarch openstack-swift-account-2.5.0-1.el7ost.noarch openstack-heat-common-5.0.0-1.el7ost.noarch openstack-manila-share-1.0.0-2.el7ost.noarch openstack-keystone-8.0.0-1.el7ost.noarch openstack-heat-api-cloudwatch-5.0.0-1.el7ost.noarch openstack-swift-object-2.5.0-1.el7ost.noarch openstack-swift-container-2.5.0-1.el7ost.noarch openstack-dashboard-theme-8.0.0-8.el7ost.noarch openstack-cinder-7.0.0-2.el7ost.noarch openstack-nova-conductor-12.0.0-4.el7ost.noarch openstack-heat-api-5.0.0-1.el7ost.noarch openstack-ceilometer-api-5.0.1-1.el7ost.noarch How reproducible: I can reproduce constantly unless I can see connections spread from compute node to more controllers AMQ servers. Additional info: I have no idea what component to check or what logs to look at more, so please advice/change component. It was just my assumption It can be related to rabbitmq but I am not sure. Note: To observe the problem more easily I delay the boot of controller in GRUB (since in default I am using fencing).
This appears to be using OSP 8 core puddle from December
I managed to reproduce the bug with the following steps:

1. Have three controllers and one compute node, with multiple nova-compute processes connected to the AMQP servers on the controllers.
2. Drop communication from the compute node to one of the controllers with iptables.

As a result, the hypervisor was reported as down for over 15 minutes, and rabbitmq did not detect that the AMQP server on one of the controllers was unreachable.

Note: It is somehow much easier to reproduce with a hard reset of a controller than with just an iptables traffic drop.
Marian, can you confirm that you have reproduced this on the latest available code for all components per comment #2?
(In reply to Dan Smith from comment #5)
> Marian, can you confirm that you have reproduced this on the latest
> available code for all components per comment #2?

I can confirm that; I tried with the latest bits for RHOS8 available as of today. Moreover, I am changing the component to oslo.messaging: it affects neutron agents in the same way, and hopefully it will get the right attention there.

Latest reproduction:
- RHOS8 HA setup with 3 controller nodes and 1 compute node.

1. Run "nova stop overcloud-controller-0" from the undercloud at around 10:20:00.
2. Give it a minute for the VIP to recover.
3. Run "nova hypervisor-list" and "neutron agent-list" -> all neutron agents are reported as down, and the hypervisor is reported as down as well.
4. Regularly issue the commands from step 3 until the agents on the surviving controllers and the hypervisor are reported as up.
5. For the hypervisor this succeeded after 7 minutes, for some neutron agents after 16 minutes.

Example from the neutron l3 agent. 10:36:06 is the time when the agent started to be reported as alive, which means it took 16 minutes:

2016-02-11 10:20:23.609 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-02-11 10:20:23.895 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-02-11 10:20:29.645 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 10:20:29.912 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 10:22:03.103 26015 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.16.20.15:5672
2016-02-11 10:23:30.956 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Socket closed
2016-02-11 10:23:30.957 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 172.16.20.14:5672 after None tries: Socket closed
2016-02-11 10:23:30.957 26015 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2016-02-11 10:23:30.957 26015 ERROR root Traceback (most recent call last):
2016-02-11 10:23:30.957 26015 ERROR root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 205, in inner_func
2016-02-11 10:23:30.957 26015 ERROR root     return infunc(*args, **kwargs)
2016-02-11 10:23:30.957 26015 ERROR root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_executors/impl_pooledexecutor.py", line 95, in _runner
2016-02-11 10:23:30.957 26015 ERROR root     incoming = self.listener.poll()
2016-02-11 10:23:30.957 26015 ERROR root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 196, in poll
2016-02-11 10:23:30.957 26015 ERROR root     self.conn.consume(timeout=timeout)
2016-02-11 10:23:30.957 26015 ERROR root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 899, in consume
2016-02-11 10:23:30.957 26015 ERROR root     error_callback=_error_callback)
2016-02-11 10:23:30.957 26015 ERROR root   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 704, in ensure
2016-02-11 10:23:30.957 26015 ERROR root     raise exceptions.MessageDeliveryFailure(msg)
2016-02-11 10:23:30.957 26015 ERROR root MessageDeliveryFailure: Unable to connect to AMQP server on 172.16.20.14:5672 after None tries: Socket closed
2016-02-11 10:23:30.957 26015 ERROR root
2016-02-11 10:23:31.976 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.14:5672 is unreachable: connection already closed. Trying again in 1 seconds.
2016-02-11 10:27:18.307 26015 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.16.20.14:5672
2016-02-11 10:35:53.429 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-02-11 10:35:59.444 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 10:36:05.460 26015 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.13:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 10:36:06.481 26015 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.16.20.14:5672
2016-02-11 10:36:06.492 26015 WARNING oslo.service.loopingcall [-] Function 'neutron.agent.l3.agent.L3NATAgentWithStateReport._report_state' run outlasted interval by 911.02 sec
Another example from a different try on RHOS8. Controller-1 was taken down; I am pasting the log of the metadata agent on controller-2. It again took 15 minutes for the neutron metadata agent on controller-2 to be reported as alive after controller-1 was taken down non-gracefully.

2016-02-11 11:05:56.111 23849 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.14:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-02-11 11:06:02.124 23849 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.14:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 11:08:08.144 23849 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.16.20.13:5672
2016-02-11 11:21:20.173 23849 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.14:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-02-11 11:21:26.186 23849 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.16.20.14:5672 is unreachable: timed out. Trying again in 1 seconds.
2016-02-11 11:21:27.238 23849 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.16.20.13:5672
2016-02-11 11:21:27.259 23849 WARNING oslo.service.loopingcall [-] Function 'neutron.agent.metadata.agent.UnixDomainMetadataProxy._report_state' run outlasted interval by 909.09 sec

Is 15 minutes some known timeout for connections?
Some new observations after intensive testing:

All my debugging was focused on the wrong side of the connection, where I wasted too much time; in the end, the problem is on the rabbit "client" side, not on the server side. I am actually taking down the controllers (the rabbitmq servers) non-gracefully, not the clients.

When I look (to keep it simple) at the TCP rabbit connections (to remote port 5672) on the compute nodes at the time of the controller failover, I can see that most likely some data is sent but never acknowledged, so the TCP retransmission timer kicks in and keeps the connection in ESTABLISHED state for up to 15 minutes. The socket still appears to be alive, even though in reality it is dead, until the retransmission timer expires. This causes all the agents and hypervisors that had established a rabbit connection to the failed server to appear dead.

So I can either limit this system-wide with the net.ipv4.tcp_retries2 option, which makes the retransmission timer expire sooner, or set TCP_USER_TIMEOUT on the socket in the py-amqp code, something like:

--- /usr/lib/python2.7/site-packages/amqp/transport.py.bkp 2016-02-29 09:11:02.561000000 -0500
+++ /usr/lib/python2.7/site-packages/amqp/transport.py 2016-02-29 09:11:24.367000000 -0500
@@ -97,6 +97,8 @@
         try:
             self.sock.settimeout(None)
             self.sock.setsockopt(SOL_TCP, socket.TCP_NODELAY, 1)
+            socket.TCP_USER_TIMEOUT = 18
+            self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 10000)
             self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
             self._setup_transport()

These two options seem to help and speed up failover recovery.
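For illustration only (this is not the py-amqp or oslo.messaging code), here is a minimal standalone Python sketch of the same socket options on a plain socket; the broker address is just one of the controllers from this setup, and on Python 2.7 the Linux value of TCP_USER_TIMEOUT (18) has to be supplied by hand, as the transport.py patch above does:

import socket

# TCP_USER_TIMEOUT is not exposed by Python 2.7's socket module, so fall back
# to the Linux constant (18), matching the transport.py patch above.
TCP_USER_TIMEOUT = getattr(socket, 'TCP_USER_TIMEOUT', 18)

sock = socket.create_connection(('172.16.20.14', 5672))
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Abort the connection if data sitting in the send queue stays unacknowledged
# for more than 10 seconds, instead of waiting out the ~15 minute default
# retransmission back-off.
sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, 10000)

With this set, the kernel tears the connection down shortly after the peer stops acknowledging data, so the client sees an error and reconnects to another AMQP server instead of waiting on a dead socket.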
Interesting article: http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html
My initial thought is that heartbeats should be catching this but obviously isn't, so I'll start trying to reproduce and look into what's going on there.
OK, I've tracked this down. The rough sequence is:

- controller is powered off
- compute calls _report_state
- connection lock is obtained
- Publisher object is created, which sends exchange.declare over the wire
- the exchange.declare packet sits in the send-q due to no route to host
- py-amqp calls recv() to read exchange.declare_ok but never gets a response, since the other end is gone

Meanwhile...

- the heartbeat thread spins in heartbeat_acquire because the other thread still holds the connection lock

So the end result is that the heartbeat thread can't progress far enough to realize the heartbeat has been missed and kill the connection. I'll file a more detailed bug upstream and get this sorted out. Stay tuned.
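To make the contention easier to picture, here is a simplified sketch (not the oslo.messaging code; the function names report_state and heartbeat_check are made up for illustration) of how one thread holding the connection lock while blocked on a read starves the heartbeat check:

import threading
import time

# Stands in for the kombu/py-amqp connection lock shared by both threads.
connection_lock = threading.Lock()

def report_state():
    # The _report_state path grabs the connection lock, sends exchange.declare
    # (which sits unacknowledged in the send-q because the peer is gone), then
    # blocks waiting for exchange.declare_ok that will never arrive.
    with connection_lock:
        time.sleep(15 * 60)  # stands in for the blocked recv()

def heartbeat_check():
    # The heartbeat thread cannot acquire the lock, so it never reaches the
    # code that would notice the missed heartbeat and tear down the connection.
    while not connection_lock.acquire(False):
        time.sleep(1)
    try:
        pass  # heartbeat bookkeeping would happen here
    finally:
        connection_lock.release()

Only once the kernel gives up on the socket (retransmission timeout or TCP_USER_TIMEOUT) does the blocked thread get an error, release the lock, and let recovery proceed.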
https://review.openstack.org/#/c/294074
https://code.engineering.redhat.com/gerrit/71195
https://code.engineering.redhat.com/gerrit/71199
Sorry for the delay, a package is now available for testing.

FYI, backporting the final fix (TCP_USER_TIMEOUT) was not trivial and required backporting 4 other (simple) changes. John prepared a patch series; I just approved (and merged) it and built a package. At least the unit tests and functional tests pass.

The python-oslo-messaging-2.5.0-7.el7ost package contains 9 changes (including the TCP_USER_TIMEOUT fix) compared to the previous version (python-oslo-messaging-2.5.0-6.el7ost).
Verified by Marian
Verified on osp 8 & 9
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1352