Bug 1231885 - Nodes reconnection to the broker not working properly. Compute nodes are not registered again
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: y2
Target Release: 7.0 (Kilo)
Assignee: Jay Dobies
QA Contact: Amit Ugol
URL:
Whiteboard:
Duplicates: 1231881 (view as bug list)
Depends On:
Blocks: 1185030 1191185 1251948 1261487 1269995 1271347
 
Reported: 2015-06-15 14:35 UTC by Miguel Angel Ajo
Modified: 2023-09-18 00:11 UTC
CC List: 19 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1269995 1271347 (view as bug list)
Environment:
Last Closed: 2015-12-21 16:46:55 UTC
Target Upstream Version:
Embargoed:


Attachments
Messaging errors in HA environment (61.55 KB, text/plain)
2015-06-15 14:40 UTC, Miguel Angel Ajo
messaging-errors-non-ha-nova-compute-only (NON HA) (142.07 KB, text/plain)
2015-06-15 14:44 UTC, Miguel Angel Ajo
messaging-errors-non-ha-openvswitch-agent-only (NON HA) (140.66 KB, text/plain)
2015-06-15 14:44 UTC, Miguel Angel Ajo
Controller 1 (HA) /var/log (1.49 MB, application/x-gzip)
2015-06-16 13:32 UTC, Miguel Angel Ajo
Controller 2 (HA) /var/log (1.50 MB, application/x-gzip)
2015-06-16 13:33 UTC, Miguel Angel Ajo
Controller 3 (HA) /var/log (1.52 MB, application/x-gzip)
2015-06-16 13:33 UTC, Miguel Angel Ajo
Compute 1 (HA) /var/log (214.74 KB, application/x-gzip)
2015-06-16 13:39 UTC, Miguel Angel Ajo
Compute 2 (HA) /var/log (228.12 KB, application/x-gzip)
2015-06-16 13:39 UTC, Miguel Angel Ajo
latest logs (2.04 MB, text/plain)
2015-06-18 09:40 UTC, Fabio Massimo Di Nitto


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1460652 0 None None None Never
Launchpad 1466007 0 None None None Never
OpenStack gerrit 203060 0 None MERGED Set rabbitmq heartbeat timeout threshold to 60 2020-12-02 13:35:00 UTC
Red Hat Product Errata RHSA-2015:2650 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise Linux OpenStack Platform 7 director update 2015-12-21 21:44:54 UTC

Description Miguel Angel Ajo 2015-06-15 14:35:56 UTC
Description of problem:


Version-Release number of selected component (if applicable):

python-oslo-messaging-1.8.2-1.el7ost.noarch

How reproducible:

Always

Steps to Reproduce:
1. Start an instance
2. Restart all controllers (in HA); this probably also reproduces with a single controller and a single compute node.
3. Try to fire an instance.

Actual results:

Agents reconnect to rabbitmq after a while, when it's back up.

The Nova hypervisor appears to be down, and the nova-compute logs complain about message timeouts (see attached logs).


Expected results:

Everything stays working across restarts.

Additional info:

The first attached logs are from an HA deployment; the second set is from a single-controller/single-compute setup where the controller was rebooted. They look slightly different, but messaging is equally broken in both (it could be two different bugs).
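
For anyone trying to confirm the symptom described under "Actual results", a minimal check from a controller and a compute node might look like the following. These commands are illustrative, assuming the standard Kilo clients and default log paths; they are not taken from the attached logs:

# on a controller: nova-compute stays "down" after the controllers come back
nova service-list
neutron agent-list
# on a compute node: the RPC timeouts show up in the nova-compute log
grep -i 'MessagingTimeout' /var/log/nova/nova-compute.log | tail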

Comment 3 Miguel Angel Ajo 2015-06-15 14:40:22 UTC
Created attachment 1039075 [details]
Messaging errors in HA environment

Comment 5 Miguel Angel Ajo 2015-06-15 14:44:36 UTC
Created attachment 1039076 [details]
messaging-errors-non-ha-nova-compute-only  (NON HA)

Comment 6 Miguel Angel Ajo 2015-06-15 14:44:59 UTC
Created attachment 1039077 [details]
messaging-errors-non-ha-openvswitch-agent-only (NON HA)

Comment 7 John Eckersberg 2015-06-15 16:06:38 UTC
Can you provide the NVR for python-kombu and python-amqp as well?  Just want to make sure I'm looking at the right builds of those.

Comment 8 Fabio Massimo Di Nitto 2015-06-15 17:03:49 UTC
rpm -q -a |grep python-kombu
python-kombu-3.0.23-1.el7ost.noarch

[root@rdo7-node2 rpms(keystone_admin)]$ rpm -q -a |grep python-amqp
python-amqp-1.4.6-1.el7ost.noarch
python-amqplib-1.0.2-7.el7ost.noarch

Comment 9 Miguel Angel Ajo 2015-06-16 07:50:21 UTC
Hi Fabio, thanks for responding.

A side note: the 1 controller / 1 compute test was done on RDO; the 3 controller (HA) / 3 compute test was done with OSP7.

Comment 10 Miguel Angel Ajo 2015-06-16 07:51:24 UTC
For the RDO/1 controller/1 compute, those are the versions:
[root@rdocomp1 ~]# rpm -qa | grep kombu
python-kombu-2.5.16-1.el7.noarch
[root@rdocomp1 ~]# rpm -qa | grep amqp
python-amqp-1.4.6-1.el7.noarch

In this bug, please ignore that as it's probably noise and deserves a separate bug, sorry.

Comment 12 Miguel Angel Ajo 2015-06-16 13:32:41 UTC
Created attachment 1039491 [details]
Controller 1 (HA) /var/log

Comment 13 Miguel Angel Ajo 2015-06-16 13:33:10 UTC
Created attachment 1039492 [details]
Controller 2 (HA) /var/log

Comment 14 Miguel Angel Ajo 2015-06-16 13:33:45 UTC
Created attachment 1039493 [details]
Controller 3 (HA) /var/log

Comment 15 Miguel Angel Ajo 2015-06-16 13:39:35 UTC
Created attachment 1039496 [details]
Compute 1 (HA) /var/log

Comment 16 Miguel Angel Ajo 2015-06-16 13:39:57 UTC
Created attachment 1039498 [details]
Compute 2 (HA) /var/log

Comment 17 Flavio Percoco 2015-06-17 09:12:42 UTC
So, after some extra debugging, I think I've narrowed the issue down to oslo.messaging not re-declaring the "reply_q" exchange/queue when reconnects happen. It's possible to see this from these 2 links[0][1]. The former shows the `on_reconnection` code and the latter shows the method that returns the `reply_q` where replies should be sent. The `on_reconnection` function calls the `declare` method on the consumer, but the `Consumer`'s `declare` method does not re-create the exchange instance[2].

[0] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L660-L670
[1] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/amqpdriver.py#L287-L302
[2] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L205-L209

Comment 18 John Eckersberg 2015-06-17 13:17:28 UTC
Good catch, that would certainly explain it!

Comment 19 Flavio Percoco 2015-06-17 15:37:48 UTC
Actually, this issue goes beyond that. Exchanges are being correctly re-declared as part of the queues' re-declaration. The issue here is that the `RetryOnMissingExchangePublisher`[0] blocks for 60 seconds, which is far longer than the timeout on the consumer side.

This is being worked on upstream.

https://github.com/openstack/oslo.messaging/blob/stable/kilo/oslo_messaging/_drivers/impl_rabbit.py#L458

Comment 20 Miguel Angel Ajo 2015-06-18 06:35:52 UTC
Good catches, so we have two scenarios with two upstream bugs here in the end?

Comment 21 Fabio Massimo Di Nitto 2015-06-18 09:39:34 UTC
Bumping severity and asking for blocker?

I was able to hit this bug in the HA environment by killing just one controller node. One of the compute nodes has tracebacks with message timeouts.

Comment 22 Fabio Massimo Di Nitto 2015-06-18 09:40:32 UTC
Created attachment 1040388 [details]
latest logs

The log is big; the relevant bits are at the end of the file (the last 30 minutes or even less).

Comment 23 Mehdi ABAAKOUK 2015-06-19 07:12:24 UTC
Proposed upstream fix: https://review.openstack.org/#/c/193037/

Comment 24 Joe Donohue 2015-06-19 13:35:42 UTC
*** Bug 1209979 has been marked as a duplicate of this bug. ***

Comment 25 Fabio Massimo Di Nitto 2015-06-20 05:34:35 UTC
Moving from email to bz (from John):

At the request of Fabio, I went ahead and did the backport of the
in-progress patch so we can start trying it out.  Here's a scratch build
of the patched python-oslo-messaging:

https://brewweb.devel.redhat.com/taskinfo?taskID=9396068

This pulls in a new library, cachetools, so here's a scratch build of
python-cachetools to go along with it:

https://brewweb.devel.redhat.com/taskinfo?taskID=9395248

Comment 26 Fabio Massimo Di Nitto 2015-06-20 05:37:19 UTC
I am currently running with the packages from the scratch build and the problem still persists.

I have tested the "full HA controller backplane" reboot and the compute nodes were unable to start talking to the controllers again.

I'll test the single controller node reboot over the weekend.

Comment 27 Fabio Massimo Di Nitto 2015-06-20 06:55:28 UTC
This patch actually behaves worse than before.

pre-patch:

reboot controller nodes
controller nodes come up
compute nodes stop talking to controller nodes

restarting the daemons on the compute nodes restores connectivity

post-patch:

reboot controller nodes
controller nodes come up
compute nodes stop talking to controller nodes

restarting the daemons on the compute nodes does NOT restore connectivity.

A full cluster stop, node reboots, cluster start is required to recover.
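
For reference, "restarting the daemons on the compute nodes" here means something like the following, assuming the standard OSP 7 / RHEL 7 unit names for the agents whose logs are attached to this bug:

systemctl restart openstack-nova-compute
systemctl restart neutron-openvswitch-agent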

Comment 28 Flavio Percoco 2015-06-20 08:11:31 UTC
Fabio,

Could you please provide access to your test environment? It's entirely possible that our environments are not configured properly and having access to yours will ease the debug/test of this work from the engineering side.

Also, could you upload the logs of your latest tests?

Thanks

Comment 29 Fabio Massimo Di Nitto 2015-06-20 09:05:17 UTC
(In reply to Flavio Percoco from comment #28)
> Fabio,
> 
> Could you please provide access to your test environment?

No, not for at least another week from today.

> It's entirely
> possible that our environments are not configured properly and having access
> to yours will ease the debug/test of this work from the engineering side.
> 
> Also, could you upload the logs of your latest tests?


sosreports here:

http://mrg-01.mpc.lab.eng.bos.redhat.com/sosreports/

please make sure to download them ASAP because they might disappear within a few days and they are too big to attach to BZ.

rdo7-* are controller nodes
mrg-* are compute node.

Don't get fooled by the host names :) the installation is running the latest RHEL7 and OSP7.

Due to the size of the logs I recommend starting from the bottom and looking at errors from around 01:00 AM today.

Comment 30 Miguel Angel Ajo 2015-06-22 10:04:56 UTC
*** Bug 1231881 has been marked as a duplicate of this bug. ***

Comment 31 Flavio Percoco 2015-06-24 13:31:04 UTC
Update on this bug,

The issue is being investigated and tested further. The patch packaged in that scratch build doesn't seem to do the whole job. I will report back as soon as I have more info or a new scratch build.

Comment 32 Flavio Percoco 2015-07-06 13:50:32 UTC
Ok, after further testing and investigation (with the amazing help from Mehdi's sharp eye), we've found that the heartbeat was disabled in the environment where this bug was found.

By uncommenting the `heartbeat_timeout_threshold` option in `nova.conf`, the issue seems to have gone away. Is it possible that the config file was generated by an older oslo.messaging version?

With heartbeat enabled, the applied patch shouldn't be needed, although the optimizations it introduces would be good to have in this release.

Comment 33 Fabio Massimo Di Nitto 2015-07-06 15:22:44 UTC
(In reply to Flavio Percoco from comment #32)
> Ok, after further testing and investigation (with the amazing help from
> Mehdi's sharp eye), we've found that the heartbeat was disabled in the
> environment where this bug was found.
> 
> By uncommenting out the `heartbeat_timeout_threshold` in `nova.conf` the
> issue seems to have gone away. Is it possible that the config file was
> generated by an older oslo.messaging version?
> 
> By enabling heartbeat, the applied patch shouldn't be enabled, although the
> optimizations it introduces would good to have in this release.

As I explained, this config was generated manually, by installing OSP7 packages and applying the config settings as documented here:

https://github.com/beekhof/osp-ha-deploy/blob/master/pcmk/nova.scenario
https://github.com/beekhof/osp-ha-deploy/blob/master/pcmk/compute-common.scenario

If that option, "heartbeat_timeout_threshold", is required for a functional environment, then it should be on by default in oslo.messaging, since a similar problem can affect all services (even though you used nova to debug because it was more evident when compute <-> controller communication was interrupted).

Agreed?

Comment 35 Flavio Percoco 2015-07-07 06:58:44 UTC
Agreed!

The reason it was not enabled by default is that it requires `kombu>=3.0.7`, which is not part of Kilo's requirements (the version installed on these nodes is 3.0.23). However, this was a good reminder that we should make it the default in Liberty, which I just did[0].

[0] https://review.openstack.org/#/c/198986/

Let me know how your tests go

Comment 37 Miguel Angel Ajo 2015-07-07 09:08:30 UTC
(In reply to Flavio Percoco from comment #32)
> Ok, after further testing and investigation (with the amazing help from
> Mehdi's sharp eye), we've found that the heartbeat was disabled in the
> environment where this bug was found.
> 
> By uncommenting out the `heartbeat_timeout_threshold` in `nova.conf` the
> issue seems to have gone away. Is it possible that the config file was
> generated by an older oslo.messaging version?
> 
> By enabling heartbeat, the applied patch shouldn't be enabled, although the
> optimizations it introduces would good to have in this release.

Have we checked the behaviour with neutron?

Neutron was getting stuck in the timeout loop too; shall we enable heartbeat there as well and check?

Comment 38 Fabio Massimo Di Nitto 2015-07-08 15:52:39 UTC
(In reply to Miguel Angel Ajo from comment #37)
> (In reply to Flavio Percoco from comment #32)
> > Ok, after further testing and investigation (with the amazing help from
> > Mehdi's sharp eye), we've found that the heartbeat was disabled in the
> > environment where this bug was found.
> > 
> > By uncommenting out the `heartbeat_timeout_threshold` in `nova.conf` the
> > issue seems to have gone away. Is it possible that the config file was
> > generated by an older oslo.messaging version?
> > 
> > By enabling heartbeat, the applied patch shouldn't be enabled, although the
> > optimizations it introduces would good to have in this release.
> 
> Have we checked the behaviour with neutron?, 
> 
> Neutron was getting stuck on the Timeout loop too, shall we enable heartbeat
> there too and check?

We will need to check all services. I think it's much safer to enable the option in one spot (the right one) vs having to configure it across the board.

Comment 39 Flavio Percoco 2015-07-08 17:13:49 UTC
(In reply to Fabio Massimo Di Nitto from comment #38)
> (In reply to Miguel Angel Ajo from comment #37)
> > 
> > Neutron was getting stuck on the Timeout loop too, shall we enable heartbeat
> > there too and check?

We do have to!

> 
> We will need to check all services. I think it's much safer to enable the
> option in one spot (the right one) vs having to configure it across the
> board.

Agreed. Although, for OSP7, this will have to be enabled manually. The patch is not going to be backported upstream because it changes the default value of an option, which goes against the stable backport policies.

Comment 41 Fabio Massimo Di Nitto 2015-07-08 17:18:27 UTC
(In reply to Flavio Percoco from comment #39)
> (In reply to Fabio Massimo Di Nitto from comment #38)
> > (In reply to Miguel Angel Ajo from comment #37)
> > > 
> > > Neutron was getting stuck on the Timeout loop too, shall we enable heartbeat
> > > there too and check?
> 
> We do have to!
> 
> > 
> > We will need to check all services. I think it's much safer to enable the
> > option in one spot (the right one) vs having to configure it across the
> > board.
> 
> Agreed. Although, for OSP7, this will have to be enabled manually. The patch
> is not going to be backported upstream because it changes the default value
> of a option, which goes against the stable backport policies.

OK, then we will use this bug to track it for OSP8; we will need a clone for the installer to make sure it is enabled by default in all config files.

Is heartbeat_timeout_threshold available in the DEFAULT section or some other part of the config files? So we can also update the manual HA arch.

Comment 42 Flavio Percoco 2015-07-09 07:22:33 UTC
(In reply to Fabio Massimo Di Nitto from comment #41)
> (In reply to Flavio Percoco from comment #39)
> > 
> > Agreed. Although, for OSP7, this will have to be enabled manually. The patch
> > is not going to be backported upstream because it changes the default value
> > of a option, which goes against the stable backport policies.
> 
> Ok then we will use this bug to track it for OSP8, we will need a clone for
> the installer to make sure it is enabled by default on all config files.

Sounds good to me. Before doing so, though, please release Fabio's hammer on this "fix" and see if it survives.

> 
> Is heartbeat_timeout_threshold available in DEFAULT section or some other
> part of config files? so we can also update the manual HA arch.

It's under the `oslo_messaging_rabbit` section.
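
For reference, with the value that was later tested in comments 48 and 49, the relevant snippet in a service's config file would look roughly like this:

[oslo_messaging_rabbit]
heartbeat_timeout_threshold = 60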

Comment 43 Fabio Massimo Di Nitto 2015-07-09 07:27:03 UTC
(In reply to Flavio Percoco from comment #42)
> (In reply to Fabio Massimo Di Nitto from comment #41)
> > (In reply to Flavio Percoco from comment #39)
> > > 
> > > Agreed. Although, for OSP7, this will have to be enabled manually. The patch
> > > is not going to be backported upstream because it changes the default value
> > > of a option, which goes against the stable backport policies.
> > 
> > Ok then we will use this bug to track it for OSP8, we will need a clone for
> > the installer to make sure it is enabled by default on all config files.
> 
> Sounds good to me. Before doing so, though, please release Fabio's hammer on
> this "fix" and see if it survives.

I'll do so next week when I am back from PTO. Did you deploy this fix across all nodes/services on the mrg cluster?

Are the packages on the mrg cluster the same as in the OSP channel? just to make sure there are no leftovers of your debugging code around while testing.

> 
> > 
> > Is heartbeat_timeout_threshold available in DEFAULT section or some other
> > part of config files? so we can also update the manual HA arch.
> 
> It's under the `oslo_messaging_rabbit` section.

ack.

Comment 44 Flavio Percoco 2015-07-09 08:01:37 UTC
(In reply to Fabio Massimo Di Nitto from comment #43)
> I'll do so next week when I am back from PTO. Did you deploy this fix across
> all nodes/services on the mrg cluster?
> 
> Are the packages on the mrg cluster the same as in the OSP channel? just to
> make sure there are no leftovers of your debugging code around while testing.
> 

I'll be on PTO next week so you may want to ping sileht (Mehdi) if you need support on the o.m side.

The packages installed still have the upstream patch and the config has been set just on the `nova.conf` files. I'll set it on the neutron configs as well and let you downgrade the packages to whatever is on the  OSP channel.

Comment 45 Fabio Massimo Di Nitto 2015-07-09 09:49:23 UTC
(In reply to Flavio Percoco from comment #44)
> (In reply to Fabio Massimo Di Nitto from comment #43)
> > I'll do so next week when I am back from PTO. Did you deploy this fix across
> > all nodes/services on the mrg cluster?
> > 
> > Are the packages on the mrg cluster the same as in the OSP channel? just to
> > make sure there are no leftovers of your debugging code around while testing.
> > 
> 
> I'll be on PTO next week so you may want to ping sileht (Mehdi) if you need
> support on the o.m side.

Ok.

> 
> The packages installed still have the upstream patch and the config has been
> set just on the `nova.conf` files. I'll set it on the neutron configs as
> well and let you downgrade the packages to whatever is on the  OSP channel.

That won't be necessary; we will have to change the config for all services that use rabbitmq. Neutron and nova are not enough.

Comment 48 Fabio Massimo Di Nitto 2015-07-16 04:29:33 UTC
(In reply to Flavio Percoco from comment #44)
> (In reply to Fabio Massimo Di Nitto from comment #43)
> > I'll do so next week when I am back from PTO. Did you deploy this fix across
> > all nodes/services on the mrg cluster?
> > 
> > Are the packages on the mrg cluster the same as in the OSP channel? just to
> > make sure there are no leftovers of your debugging code around while testing.
> > 
> 
> I'll be on PTO next week so you may want to ping sileht (Mehdi) if you need
> support on the o.m side.
> 
> The packages installed still have the upstream patch and the config has been
> set just on the `nova.conf` files. I'll set it on the neutron configs as
> well and let you downgrade the packages to whatever is on the  OSP channel.

Update: I am currently redeploying the environment with the latest puddle and have applied:

- oslo_messaging_rabbit heartbeat_timeout_threshold 60
  to all config files for all services
- updated rabbitmq resource agent
  to include the "kick the node after fencing" fix

Once the environment is up and running, I'll run a few tests and let you know.
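
A sketch of how the first item above can be applied across services, assuming crudini is available; the list of config files is illustrative and depends on which services are deployed on each node:

for f in /etc/nova/nova.conf /etc/neutron/neutron.conf /etc/cinder/cinder.conf /etc/glance/glance-api.conf /etc/heat/heat.conf; do
    crudini --set "$f" oslo_messaging_rabbit heartbeat_timeout_threshold 60
done
# then restart the affected services so they pick up the new setting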

Comment 49 Fabio Massimo Di Nitto 2015-07-17 05:55:25 UTC
I can confirm that enabling oslo_messaging_rabbit heartbeat_timeout_threshold 60 solves the problem without requiring the tcp keepalive workaround as noted here: https://bugzilla.redhat.com/show_bug.cgi?id=1231384#c13

The environment has been deployed _without_ the proposed optimization and with stock packages from the puddle.

Comment 51 John Eckersberg 2015-07-21 13:34:28 UTC
epel7 update for python-cachetools - https://admin.fedoraproject.org/updates/python-cachetools-1.0.3-1.el7

Comment 57 Mike Burns 2015-10-19 12:29:07 UTC
Giulio,

What do we need to fix here? This is POST, but the patches listed are for oslo.messaging, and there is a comment

Comment 58 Giulio Fidente 2015-10-19 18:39:20 UTC
Mike, as you pointed out, we only had to port the changes from https://review.openstack.org/203060; thanks for updating the status and the links!

Comment 61 Amit Ugol 2015-12-09 13:56:37 UTC
I am unsure what the root cause was here, but I am unable to reproduce this issue with the latest builds.

Comment 63 errata-xmlrpc 2015-12-21 16:46:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2650

Comment 64 Red Hat Bugzilla 2023-09-18 00:11:33 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

