Description of problem:
After restarting the controller(s), agents reconnect to rabbitmq but RPC messaging stays broken (message timeouts).

Version-Release number of selected component (if applicable):
python-oslo-messaging-1.8.2-1.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Start an instance.
2. Restart all controllers (in HA); this probably also happens with a single controller and a single compute.
3. Try to boot another instance.

Actual results:
Agents reconnect to rabbitmq after a while, when it's back up. The nova hypervisor shows as down, and nova-compute logs complain about message timeouts. (See attached logs.)

Expected results:
Everything keeps working across restarts.

Additional info:
The first attached logs are from an HA deployment; the second set of logs is from a single controller / single compute setup with the controller rebooted. They look slightly different, but messaging is equally broken (it could be two different bugs).
Created attachment 1039075 [details] Messaging errors in HA environment
Created attachment 1039076 [details] messaging-errors-non-ha-nova-compute-only (NON HA)
Created attachment 1039077 [details] messaging-errors-non-ha-openvswitch-agent-only (NON HA)
Can you provide the NVR for python-kombu and python-amqp as well? Just want to make sure I'm looking at the right builds of those.
rpm -q -a |grep python-kombu
python-kombu-3.0.23-1.el7ost.noarch

[root@rdo7-node2 rpms(keystone_admin)]$ rpm -q -a |grep python-amqp
python-amqp-1.4.6-1.el7ost.noarch
python-amqplib-1.0.2-7.el7ost.noarch
Hi Fabio, thanks for responding. As a side note, the 1 controller / 1 compute test was done with RDO; the 3 controller (HA) / 3 compute test was done with OSP7.
For the RDO / 1 controller / 1 compute setup, these are the versions:

[root@rdocomp1 ~]# rpm -qa | grep kombu
python-kombu-2.5.16-1.el7.noarch
[root@rdocomp1 ~]# rpm -qa | grep amqp
python-amqp-1.4.6-1.el7.noarch

In this bug, please ignore that setup as it's probably noise and deserves a separate bug, sorry.
Created attachment 1039491 [details] Controller 1 (HA) /var/log
Created attachment 1039492 [details] Controller 2 (HA) /var/log
Created attachment 1039493 [details] Controller 3 (HA) /var/log
Created attachment 1039496 [details] Compute 1 (HA) /var/log
Created attachment 1039498 [details] Compute 2 (HA) /var/log
So, after some extra debugging, I think I've narrowed the issue down to oslo.messaging not re-declaring the "reply_q" exchange/queue when a reconnect happens. This can be seen from these two links[0][1]. The former shows the `on_reconnection` code and the latter shows the method that returns the `reply_q` where replies should be sent. The `on_reconnection` function calls the `declare` method on the consumer, but the `Consumer`'s `declare` method does not re-create the exchange instance[2].

[0] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L660-L670
[1] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/amqpdriver.py#L287-L302
[2] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L205-L209
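For illustration, here is a minimal kombu sketch of what a full re-declaration of the reply exchange/queue on a fresh channel looks like - the names, the broker URL and the `redeclare_on_reconnect` helper are made up for this example, this is not the oslo.messaging code:

# Illustrative sketch only, NOT the oslo.messaging implementation.
# The broker URL and the exchange/queue names are assumptions for the example.
from kombu import Connection, Exchange, Queue

reply_exchange = Exchange('reply_abc123', type='direct', durable=False)
reply_queue = Queue('reply_abc123', exchange=reply_exchange,
                    routing_key='reply_abc123', durable=False)

def redeclare_on_reconnect(channel):
    # Binding the Queue object to the fresh channel and calling declare()
    # re-creates the exchange, the queue and the binding. This is what needs
    # to happen after the broker comes back; otherwise replies get published
    # to an exchange that no longer exists on the restarted broker.
    bound = reply_queue(channel)
    bound.declare()

with Connection('amqp://guest:guest@localhost:5672//') as conn:
    redeclare_on_reconnect(conn.channel())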
Good catch, that would certainly explain it!
Actually, this issue goes beyond that. Exchanges are being correctly re-declared as part of the queue's re-declaration. The issue here is that the `RetryOnMissingExchangePublisher`[0] blocks for 60 seconds, which is much longer than the timeout on the consumer side. This is being worked on upstream.

[0] https://github.com/openstack/oslo.messaging/blob/stable/kilo/oslo_messaging/_drivers/impl_rabbit.py#L458
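To make the mismatch concrete, a tiny illustration - the numbers are assumptions for the example, not the values configured in this environment:

# Illustrative only: the constants are assumptions, not this deployment's settings.
PUBLISH_RETRY_WINDOW = 60  # how long the reply publisher keeps retrying a missing exchange
CALLER_TIMEOUT = 30        # how long the RPC caller waits for its reply

# If the reply exchange is missing right after a reconnect, the reply may only be
# published up to PUBLISH_RETRY_WINDOW seconds later. A caller with a shorter
# timeout has already given up by then, so the reply is effectively lost.
if PUBLISH_RETRY_WINDOW > CALLER_TIMEOUT:
    print("the reply can arrive after the caller has already timed out")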
Good catches, so we have two scenarios with two upstream bugs here in the end?
Bumping severity and requesting the blocker flag. I was able to hit this bug in the HA environment by killing just one controller node. One of the compute nodes has tracebacks with message timeouts.
Created attachment 1040388 [details] latest logs

The log is big; the relevant bits are at the end of the file (the last 30 minutes or even less).
Proposed upstream fix: https://review.openstack.org/#/c/193037/
*** Bug 1209979 has been marked as a duplicate of this bug. ***
Moving from email to bz (from John):

At the request of Fabio, I went ahead and did the backport of the in-progress patch so we can start trying it out. Here's a scratch build of the patched python-oslo-messaging:

https://brewweb.devel.redhat.com/taskinfo?taskID=9396068

This pulls in a new library, cachetools, so here's a scratch build of python-cachetools to go along with it:

https://brewweb.devel.redhat.com/taskinfo?taskID=9395248
I am currently running with the packages from the scratch build and the problem still persists. I have tested the "full HA controller backplane" reboot and the compute nodes were unable to start talking to the controllers again. I'll test the single controller node reboot over the weekend.
This patch actually makes the behaviour worse than before.

Pre-patch:
- reboot controller nodes
- controller nodes come up
- compute nodes stop talking to controller nodes
- restarting the daemons on the compute nodes restores connectivity

Post-patch:
- reboot controller nodes
- controller nodes come up
- compute nodes stop talking to controller nodes
- restarting the daemons on the compute nodes does NOT restore connectivity; a full cluster stop, node reboots, and cluster start is required to recover
Fabio, Could you please provide access to your test environment? It's entirely possible that our environments are not configured properly and having access to yours will ease the debug/test of this work from the engineering side. Also, could you upload the logs of your latest tests? Thanks
(In reply to Flavio Percoco from comment #28)
> Fabio,
> 
> Could you please provide access to your test environment?

No, not for at least another week from today.

> It's entirely
> possible that our environments are not configured properly and having access
> to yours will ease the debug/test of this work from the engineering side.
> 
> Also, could you upload the logs of your latest tests?

sosreports are here: http://mrg-01.mpc.lab.eng.bos.redhat.com/sosreports/

Please make sure to download them ASAP because they might disappear within a few days, and they are too big to attach to the BZ.

rdo7-* are controller nodes, mrg-* are compute nodes. Don't get fooled by the host names :)

The installation is running RHEL7 and the latest OSP7. Due to the size of the logs I recommend starting from the bottom and looking at the errors from around 01:00am today.
*** Bug 1231881 has been marked as a duplicate of this bug. ***
Update on this bug: the issue is being investigated and tested further. The patch packaged in that scratch build doesn't seem to do the whole job needed. I will report back as soon as I have more info or a new scratch build.
Ok, after further testing and investigation (with the amazing help from Mehdi's sharp eye), we've found that the heartbeat was disabled in the environment where this bug was found.

By uncommenting the `heartbeat_timeout_threshold` option in `nova.conf`, the issue seems to have gone away. Is it possible that the config file was generated by an older oslo.messaging version?

With heartbeat enabled, the applied patch shouldn't be needed, although the optimizations it introduces would be good to have in this release.
(In reply to Flavio Percoco from comment #32)
> Ok, after further testing and investigation (with the amazing help from
> Mehdi's sharp eye), we've found that the heartbeat was disabled in the
> environment where this bug was found.
> 
> By uncommenting the `heartbeat_timeout_threshold` option in `nova.conf`, the
> issue seems to have gone away. Is it possible that the config file was
> generated by an older oslo.messaging version?
> 
> With heartbeat enabled, the applied patch shouldn't be needed, although the
> optimizations it introduces would be good to have in this release.

As I explained, this config was generated manually, by installing the OSP7 packages and applying the config settings as documented here:

https://github.com/beekhof/osp-ha-deploy/blob/master/pcmk/nova.scenario
https://github.com/beekhof/osp-ha-deploy/blob/master/pcmk/compute-common.scenario

If that option ("heartbeat_timeout_threshold") is required for a functional environment, then it should be on by default in oslo.messaging, since a similar problem can affect all services (even though you used nova to debug, as it was more evident there when compute <-> controller communication was interrupted). Agreed?
Agreed! The reason it was not enabled by default is that it requires `kombu>=3.0.7` - which is not part of Kilo's requirements - although the version installed on these nodes is 3.0.23. However, this was a good reminder that we should make it the default in Liberty, which I just did[0].

[0] https://review.openstack.org/#/c/198986/

Let me know how your tests go
(In reply to Flavio Percoco from comment #32)
> Ok, after further testing and investigation (with the amazing help from
> Mehdi's sharp eye), we've found that the heartbeat was disabled in the
> environment where this bug was found.
> 
> By uncommenting the `heartbeat_timeout_threshold` option in `nova.conf`, the
> issue seems to have gone away. Is it possible that the config file was
> generated by an older oslo.messaging version?
> 
> With heartbeat enabled, the applied patch shouldn't be needed, although the
> optimizations it introduces would be good to have in this release.

Have we checked the behaviour with neutron?

Neutron was getting stuck on the Timeout loop too, shall we enable heartbeat there too and check?
(In reply to Miguel Angel Ajo from comment #37)
> (In reply to Flavio Percoco from comment #32)
> > Ok, after further testing and investigation (with the amazing help from
> > Mehdi's sharp eye), we've found that the heartbeat was disabled in the
> > environment where this bug was found.
> > 
> > By uncommenting the `heartbeat_timeout_threshold` option in `nova.conf`,
> > the issue seems to have gone away. Is it possible that the config file was
> > generated by an older oslo.messaging version?
> > 
> > With heartbeat enabled, the applied patch shouldn't be needed, although
> > the optimizations it introduces would be good to have in this release.
> 
> Have we checked the behaviour with neutron?
> 
> Neutron was getting stuck on the Timeout loop too, shall we enable heartbeat
> there too and check?

We will need to check all services. I think it's much safer to enable the option in one spot (the right one) vs having to configure it across the board.
(In reply to Fabio Massimo Di Nitto from comment #38)
> (In reply to Miguel Angel Ajo from comment #37)
> > 
> > Neutron was getting stuck on the Timeout loop too, shall we enable heartbeat
> > there too and check?

We do have to!

> > 
> > We will need to check all services. I think it's much safer to enable the
> > option in one spot (the right one) vs having to configure it across the
> > board.

Agreed. Although, for OSP7, this will have to be enabled manually. The patch is not going to be backported upstream because it changes the default value of an option, which goes against the stable backport policies.
(In reply to Flavio Percoco from comment #39)
> (In reply to Fabio Massimo Di Nitto from comment #38)
> > (In reply to Miguel Angel Ajo from comment #37)
> > > 
> > > Neutron was getting stuck on the Timeout loop too, shall we enable heartbeat
> > > there too and check?
> 
> We do have to!
> 
> > 
> > We will need to check all services. I think it's much safer to enable the
> > option in one spot (the right one) vs having to configure it across the
> > board.
> 
> Agreed. Although, for OSP7, this will have to be enabled manually. The patch
> is not going to be backported upstream because it changes the default value
> of an option, which goes against the stable backport policies.

Ok, then we will use this bug to track it for OSP8; we will need a clone for the installer to make sure it is enabled by default in all config files.

Is heartbeat_timeout_threshold available in the DEFAULT section or in some other part of the config files, so we can also update the manual HA arch?
(In reply to Fabio Massimo Di Nitto from comment #41)
> (In reply to Flavio Percoco from comment #39)
> > 
> > Agreed. Although, for OSP7, this will have to be enabled manually. The patch
> > is not going to be backported upstream because it changes the default value
> > of an option, which goes against the stable backport policies.
> 
> Ok, then we will use this bug to track it for OSP8; we will need a clone for
> the installer to make sure it is enabled by default in all config files.

Sounds good to me. Before doing so, though, please release Fabio's hammer on this "fix" and see if it survives.

> 
> Is heartbeat_timeout_threshold available in the DEFAULT section or in some
> other part of the config files, so we can also update the manual HA arch?

It's under the `oslo_messaging_rabbit` section.
(In reply to Flavio Percoco from comment #42)
> (In reply to Fabio Massimo Di Nitto from comment #41)
> > (In reply to Flavio Percoco from comment #39)
> > > 
> > > Agreed. Although, for OSP7, this will have to be enabled manually. The patch
> > > is not going to be backported upstream because it changes the default value
> > > of an option, which goes against the stable backport policies.
> > 
> > Ok, then we will use this bug to track it for OSP8; we will need a clone for
> > the installer to make sure it is enabled by default in all config files.
> 
> Sounds good to me. Before doing so, though, please release Fabio's hammer on
> this "fix" and see if it survives.

I'll do so next week when I am back from PTO. Did you deploy this fix across all nodes/services on the mrg cluster?

Are the packages on the mrg cluster the same as in the OSP channel? just to make sure there are no leftovers of your debugging code around while testing.

> > 
> > Is heartbeat_timeout_threshold available in the DEFAULT section or in some
> > other part of the config files, so we can also update the manual HA arch?
> 
> It's under the `oslo_messaging_rabbit` section.

ack.
(In reply to Fabio Massimo Di Nitto from comment #43)
> I'll do so next week when I am back from PTO. Did you deploy this fix across
> all nodes/services on the mrg cluster?
> 
> Are the packages on the mrg cluster the same as in the OSP channel? just to
> make sure there are no leftovers of your debugging code around while testing.

I'll be on PTO next week so you may want to ping sileht (Mehdi) if you need support on the o.m side.

The packages installed still have the upstream patch and the config has been set just on the `nova.conf` files. I'll set it on the neutron configs as well and let you downgrade the packages to whatever is on the OSP channel.
(In reply to Flavio Percoco from comment #44)
> (In reply to Fabio Massimo Di Nitto from comment #43)
> > I'll do so next week when I am back from PTO. Did you deploy this fix across
> > all nodes/services on the mrg cluster?
> > 
> > Are the packages on the mrg cluster the same as in the OSP channel? just to
> > make sure there are no leftovers of your debugging code around while testing.
> 
> I'll be on PTO next week so you may want to ping sileht (Mehdi) if you need
> support on the o.m side.

Ok.

> 
> The packages installed still have the upstream patch and the config has been
> set just on the `nova.conf` files. I'll set it on the neutron configs as
> well and let you downgrade the packages to whatever is on the OSP channel.

That won't be necessary; we will have to change the config for all services that use rabbitmq. Neutron and nova alone are not enough.
(In reply to Flavio Percoco from comment #44)
> (In reply to Fabio Massimo Di Nitto from comment #43)
> > I'll do so next week when I am back from PTO. Did you deploy this fix across
> > all nodes/services on the mrg cluster?
> > 
> > Are the packages on the mrg cluster the same as in the OSP channel? just to
> > make sure there are no leftovers of your debugging code around while testing.
> 
> I'll be on PTO next week so you may want to ping sileht (Mehdi) if you need
> support on the o.m side.
> 
> The packages installed still have the upstream patch and the config has been
> set just on the `nova.conf` files. I'll set it on the neutron configs as
> well and let you downgrade the packages to whatever is on the OSP channel.

Update: I am currently redeploying the environment with the latest puddle and have applied:

- oslo_messaging_rabbit heartbeat_timeout_threshold 60 to all config files for all services
- an updated rabbitmq resource agent that includes the "kick the node after fencing" fix

Once the environment is up and running, I'll run a few tests and let you know.
I can confirm that enabling oslo_messaging_rabbit heartbeat_timeout_threshold 60 solves the problem, without requiring the tcp keepalive workaround as noted here:

https://bugzilla.redhat.com/show_bug.cgi?id=1231384#c13

The environment has been deployed _without_ the proposed optimization and with stock packages from the puddle.
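For reference, the relevant snippet in each service's config file looks like this (illustrative; the 60-second value is the one used in this test, and the rest of each file is omitted):

[oslo_messaging_rabbit]
heartbeat_timeout_threshold = 60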
epel7 update for python-cachetools - https://admin.fedoraproject.org/updates/python-cachetools-1.0.3-1.el7
Giulio, what do we need to fix here? This bug is in POST, but the patches listed are against oslo.messaging, and there is a comment
Mike, as you pointed out, we only had to port the changes from https://review.openstack.org/203060; thanks for updating the status and the links!
I am unsure what the root cause was here, but I am unable to reproduce this issue with the latest builds.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2650
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.