Bug 1440680

Summary: OSP10 -> OSP11 upgrade: compute node upgrade could get stuck with nova-compute trying to reach rabbitmq server on its loopback address
Product: Red Hat OpenStack Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templatesAssignee: Marios Andreou <mandreou>
Status: CLOSED ERRATA QA Contact: Marius Cornea <mcornea>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 11.0 (Ocata)CC: dbecker, jcoufal, jschluet, mandreou, mburns, morazi, rbartal, rhel-osp-director-maint, rhos-flags, sathlang
Target Milestone: rcKeywords: Triaged
Target Release: 11.0 (Ocata)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-6.0.0-7.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-05-17 20:19:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Attachments:
Description Flags
openstack-nova-compute
none
nova-compute.log none

Description Marius Cornea 2017-04-10 09:16:22 UTC
Description of problem:
OSP10 -> OSP11 upgrade: the compute node upgrade could get stuck with nova-compute trying to reach rabbitmq server on the loopback address.

Note that this is a sporadic issue, I am not able to always reproduce it. Nevertheless I am asking for a blocker here since this could get the upgrade process stuck without pointing to any error and without timing out.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-6.0.0-3.el7ost.noarch

How reproducible:
not always

Steps to Reproduce:
1. Deploy OSP10
2. Run upgrade major-upgrade-composable-steps
3. Upgrade compute

Actual results:
Compute upgrade gets stuck, we can see the following as the last line in the compute upgrade log:

Notice: /Stage[main]/Nova::Deps/Anchor[nova::service::begin]: Triggered 'refresh' from 1 events

Expected results:
Compute upgrade completes fine.

Additional info:
Checking /var/log/nova/nova-compute.log we can see:

2017-04-10 09:09:46.784 69560 ERROR oslo.messaging._drivers.impl_rabbit [req-9df0d868-1d3d-4d40-84e5-e4482f89d1bd - - - - -] [75f28dfe-8341-42a9-b23d-e81c9a062dd7] AMQP server on 127.0.0.1:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: None


[root@compute-0 heat-admin]# grep rabbit /etc/nova/nova.conf
rpc_backend=rabbit
transport_url=rabbit://guest:NsAtPrgwhQAc2WGEtwBsHDU22@controller-0.internalapi.localdomain:5672,guest:NsAtPrgwhQAc2WGEtwBsHDU22@controller-1.internalapi.localdomain:5672,guest:NsAtPrgwhQAc2WGEtwBsHDU22@controller-2.internalapi.localdomain:5672/?ssl=0
[oslo_messaging_rabbit]
rabbit_userid=guest
rabbit_password=NsAtPrgwhQAc2WGEtwBsHDU22
rabbit_use_ssl=False
rabbit_port=5672

Comment 2 Marios Andreou 2017-04-10 10:01:18 UTC
Is this with composable roles Marius, or is it manual upgrade of the compute nodes? If the controller upgrade completed rabbit should be up. I personally haven't seen this one - might be some kind of timing/race issue? (are you running the compute upgrade immediately after?).

Comment 3 Marius Cornea 2017-04-10 10:09:13 UTC
(In reply to marios from comment #2)
> Is this with composable roles Marius, or is it manual upgrade of the compute
> nodes? If the controller upgrade completed rabbit should be up. I personally
> haven't seen this one - might be some kind of timing/race issue? (are you
> running the compute upgrade immediately after?).

This shows up during the manual upgrade of a compute node so the controller upgrade has already completed. The issue looks to me that the nova compute service is trying to reach rabbitmq on the loopback address while it should try to reach it on the controllers IP addresses. This is not always reproducible so it could be a timing issue. I've got the environment available for investigation.

Comment 4 Marios Andreou 2017-04-10 13:21:15 UTC
hey mcornea I spent a while on the overcloud nodes trying to make sense of the sequencing. AFAICS the controller ansible steps happened at ~ Apr  9 18:06:25  where I can see step5 [0] . One thing I noticed on controller0 is [1] "Error: unable to push cib" but the cluster _is_ started fine afaics [2] on all controllers - sanity check mcornea the controller upgrade completed fine before you upgraded the computes?

I'm not sure if the error you reference in comment #0 is the root issue yet, since it also happens on controller 1 except it manages to connect OK a few moments later [3].

As we just discussed, lets try to see if restart on nova-compute will fix the issue so we can have it in docs/release notes/bz doctext etc if necessary. 

[0]
Apr  9 18:06:25 host-192-168-24-17 os-collect-config: [2017-04-09 22:06:25,408] (heat-config) [DEBUG] [2017-04-09 22:05:34,521] (heat-config) [DEBUG] Running ansible        -playbook -i localhost, --module-path /usr/share/ansible-modules --tags common,step5 /var/lib/heat-config/heat-config-ansible/0758e466-dd14-444f-9a8a-40f512fbd023_pl        aybook.yaml --extra-vars @/var/lib/heat-config/heat-config-ansible/0758e466-dd14-444f-9a8a-40f512fbd023_variables.json

[1]
Apr  9 18:11:08 host-192-168-24-17 os-collect-config: (at /etc/puppet/modules/stdlib/lib/puppet/functions/deprecation.rb:25:in `deprecation')#033[0m
Apr  9 18:11:08 host-192-168-24-17 os-collect-config: Error: unable to push cib
Apr  9 18:11:08 host-192-168-24-17 os-collect-config: Call cib_apply_diff failed (-205): Update was older than existing configuration

[2]
Apr  9 17:57:33 host-192-168-24-17 pacemakerd[2885]:  notice: Shutting down Pacemaker
Apr  9 17:57:33 host-192-168-24-17 pacemakerd[2885]:  notice: Stopping crmd
Apr  9 17:59:50 host-192-168-24-17 yum[373888]: Updated: puppet-rabbitmq-5.6.0-3.03b8592git.el7ost.noarch
Apr  9 17:59:48 host-192-168-24-17 cinder-volume: 2017-04-09 21:59:48.396 100509 ERROR oslo.messaging._drivers.impl_rabbit [-] [91f832f5-3b77-42ca-95b8-24036a0e13b4] AMQP server on 172.17.1.17:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: None
...
Apr  9 18:03:53 host-192-168-24-17 systemd: Starting Pacemaker High Availability Cluster Manager...

[3]
(control0):
2017-04-09 21:49:56.090 58583 ERROR oslo.messaging._drivers.impl_rabbit [-] [51763e48-4ac2-4f3f-8f3f-fcb9d472a226] AMQP server on 172.17.1.19:5672 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 1 seconds. Client port: None
(control1):
2017-04-09 21:49:56.063 58843 ERROR oslo.messaging._drivers.impl_rabbit [-] [693c0db6-2dc4-41a6-ab1b-36a32443cc5f] AMQP server on 172.17.1.19:5672 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 1 seconds. Client port: None
...
2017-04-09 21:50:18.854 58843 INFO oslo.messaging._drivers.impl_rabbit [-] [9d802aad-e0cf-463e-8278-19abe1a98d12] Reconnected to AMQP server on 172.17.1.16:5672 via [amqp] clientwith port 35828.

Comment 5 Marius Cornea 2017-04-10 14:01:22 UTC
(In reply to marios from comment #4)
> hey mcornea I spent a while on the overcloud nodes trying to make sense of
> the sequencing. AFAICS the controller ansible steps happened at ~ Apr  9
> 18:06:25  where I can see step5 [0] . One thing I noticed on controller0 is
> [1] "Error: unable to push cib" but the cluster _is_ started fine afaics [2]
> on all controllers - sanity check mcornea the controller upgrade completed
> fine before you upgraded the computes?
> 
> I'm not sure if the error you reference in comment #0 is the root issue yet,
> since it also happens on controller 1 except it manages to connect OK a few
> moments later [3].
> 
> As we just discussed, lets try to see if restart on nova-compute will fix
> the issue so we can have it in docs/release notes/bz doctext etc if
> necessary. 


I've just tried restarting nova-compute on the node and it allowed the upgrade to move forward. So a potential workaround is to SSH to the compute node and run systemctl restart openstack-nova-compute.

Comment 6 Marios Andreou 2017-04-13 13:05:05 UTC
marius next time this happens can we please get the upgrade-non-controller.sh logs logs from the compute node  (there should be a logfile like upgrade-non-controller.sh-overcloud-compute-0 in the directory the upgrade was executed from) as well.

I just checked comment #0 again and it partially answers my question - so this is happening during one of the puppet steps of the compute upgrade (and puppet-nova looks like). We don't explicitly restart nova-compute ourselves as part of the workflow, at least not at this stage (we do earlier) and generally let puppet do the service config/start as happens on deploy.

Having said that, if we need to we can add this restart into the workflow, before the puppet run happens to avoid the operator needing to do that (like here for example https://github.com/openstack/tripleo-heat-templates/blob/25983882c2f7a8e8f8fb83bd967a67d008a556a4/extraconfig/tasks/tripleo_upgrade_node.sh#L50 ).

However I am reluctant since I don't want to inadvertently introduce a regression, trying to fix something we aren't even sure about yet (in terms of trigger/and fix).

Comment 7 Raviv Bar-Tal 2017-04-18 14:29:09 UTC
Created attachment 1272354 [details]
openstack-nova-compute

nova compute fails to connect rabitmq

Comment 8 Raviv Bar-Tal 2017-04-18 14:40:31 UTC
Created attachment 1272357 [details]
nova-compute.log

 openstack-nova-compute from /var/log/compute

Comment 9 Marios Andreou 2017-04-18 14:48:32 UTC
Thanks raviv - as per IRC just now you said that restarting openstack-nova-compute fixes the issue for you too allowing you to re-run the upgrade to completion?  That's at least a good data point for how reliable this 'workaround' is at least is also seemed to work for mcornea.

I will go afk in a while but will cycle back to this tomorrow morning

Comment 10 Sofer Athlan-Guyot 2017-04-18 21:56:10 UTC
Hi,

so I the original error is the service trying to connect to 127.0.0.1 which cannot work.  So I think that we are restarting the service before it has the proper transport_url configured.


2017-04-09 22:41:56.047 69388 ERROR oslo_service.service PlacementNotConfigured: This compute is not configured to talk to the placement service. Configure the [placement] section of nova.conf and restart the service.

This is due to the yum upgrade command:

~/Work/redhat/1440680/sosreport-compute-0.localdomain-20170410093018$ less sos_commands/logs/journalctl_--no-pager_--boot

Apr 09 22:37:25 compute-0.localdomain yum[63583]: Updated: 1:openstack-nova-compute-15.0.2-1.el7ost.noarch

Apr 09 22:40:24 compute-0.localdomain systemd[1]: Stopping OpenStack Nova Compute Server...
Apr 09 22:40:52 compute-0.localdomain systemd[1]: Starting OpenStack Nova Compute Server...
Apr 09 22:40:55 compute-0.localdomain systemd[1]: Started OpenStack Nova Compute Server.
Apr 09 22:40:55 compute-0.localdomain systemd[1]: openstack-nova-compute.service holdoff time over, scheduling restart.
Apr 09 22:40:55 compute-0.localdomain systemd[1]: Starting OpenStack Nova Compute Server...

...

As a matter of fact we do not apply the upgrade_rules defined in https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/nova-compute.yaml#L157-L172

One of those commands is a stop of the nova-compute service.

I think we should stop the nova-compute before doing the yum upgrade so that we avoid having concurrently :
 - new version of nova-compute;
 - old configuration/partly modified configuration (by puppet)
 - restart of nova-compute (by systemd since the yum upgrade command)
 - rpc to auto;

I think we can reach a concurrency problem between puppet configuration modification and continuous systemd restart.

We should just let puppet restart the service when the configuration is right. 

If I'm not misleading this should have the added benefit of making sure that no operation happens on this node during the upgrade.

Comment 11 Sofer Athlan-Guyot 2017-04-18 22:05:32 UTC
Adding code to explain what could be a solution.  This is very much to open the discussion and a WIP.

Comment 12 Marios Andreou 2017-04-19 07:27:14 UTC
(In reply to Sofer Athlan-Guyot from comment #10)
> Hi,
> 
> so I the original error is the service trying to connect to 127.0.0.1 which
> cannot work.  So I think that we are restarting the service before it has
> the proper transport_url configured.
> 
> 
> 2017-04-09 22:41:56.047 69388 ERROR oslo_service.service
> PlacementNotConfigured: This compute is not configured to talk to the
> placement service. Configure the [placement] section of nova.conf and
> restart the service.
> 
... SNIP
> 
> I think we can reach a concurrency problem between puppet configuration
> modification and continuous systemd restart.
> 
> We should just let puppet restart the service when the configuration is
> right. 

I wonder if this will cause any issues for your existing vms? I mean between us stopping nova-compute and the puppet run could be as high as 20 mins + depending on what packages are updated etc.

Besides that though, and especially if there are no issues for existing vms, I agree with the let puppet do it since that is how we are dealing with most other services (besides the core which are started at the final ansible steps before puppet). 

One thought... I was wondering if that nova placement issue was to do with https://github.com/openstack/tripleo-heat-templates/blob/299b9f532377a3a0c16ba9cb4fe92c637fc38eeb/puppet/major_upgrade_steps.j2.yaml#L63 ... we are configuring nova placemenent (at least we intend to, with that) and then restarting the nova-compute service too. This is happening on the nova-compute nodes during the upgrade of the controllers (ansible steps) so could it be an ordering issue still (like is that nova compute restart happening at the right time wrt with the controllers?)

Comment 13 Sofer Athlan-Guyot 2017-04-19 11:17:07 UTC
(In reply to marios from comment #12)
> (In reply to Sofer Athlan-Guyot from comment #10)
> > We should just let puppet restart the service when the configuration is
> > right. 
> 
> I wonder if this will cause any issues for your existing vms? I mean between
> us stopping nova-compute and the puppet run could be as high as 20 mins +
> depending on what packages are updated etc.

So having nova-compute down when the node is upgraded shouldn't be a problem as it should be already out of the pool of the available hypervisors, with all the vm evacuated and no new vw created.


> Besides that though, and especially if there are no issues for existing vms,
> I agree with the let puppet do it since that is how we are dealing with most
> other services (besides the core which are started at the final ansible
> steps before puppet). 
> 
> One thought... I was wondering if that nova placement issue was to do with
> https://github.com/openstack/tripleo-heat-templates/blob/
> 299b9f532377a3a0c16ba9cb4fe92c637fc38eeb/puppet/major_upgrade_steps.j2.
> yaml#L63 ... we are configuring nova placemenent (at least we intend to,
> with that) and then restarting the nova-compute service too. This is
> happening on the nova-compute nodes during the upgrade of the controllers
> (ansible steps) so could it be an ordering issue still (like is that nova
> compute restart happening at the right time wrt with the controllers?)

So what we are observing here is:
 - 1. we restart during the controller phase after adding the placement parameter: the restart is useless but not harmful
 - 2. now nova conf has wrong placement parameter, see[1]
 - 3. during the non-controller upgrade script, yum upgrade openstack-nova-compute and then try to restart it and fails because of wrong placement parameter, it goes one restarting it continuously;
 - 4. puppet happens and upgrade parameters in nova.conf one at a time (like sequential crudini) with a purge (recreate all the configuration)
 - 5. systemd restart openstack-nova-compute at the wrong time and get default value for rabbitmq (127.0.0.1) and the openstack-nova-compute is stuck at boot
 - 6. puppet hangs as it cannot restart the service as it's already in the "starting" state.

So I think that making sure we get the placement parameter right should solve the issue.  Having openstack-nova-compute down before yum upgrade shouldn't be necessary.  It's just a matter of having the parameters right.

We could also remove entirely this section (placement parameter configuration during controller upgrade) as it's irrelevant for newton and let puppet do its jobs.





[1] https://review.openstack.org/#/c/457965/

Comment 14 Sofer Athlan-Guyot 2017-04-19 11:36:23 UTC
Adding DFG:Compute to make sure we can:
 - remove the restart of the newton nova compute ;
 - get all the placement parameter right.  Especially we're expecting confirmation on this os_interface value[1]

[1] https://review.openstack.org/#/c/457965/1/puppet/major_upgrade_steps.j2.yaml@54

Comment 15 Marios Andreou 2017-04-19 13:48:32 UTC
(In reply to Sofer Athlan-Guyot from comment #14)
> Adding DFG:Compute to make sure we can:
>  - remove the restart of the newton nova compute ;

I thought we would hold on that ^^^ especially if we set the right config it should proceed OK or this was written before the irc chat :D

>  - get all the placement parameter right.  Especially we're expecting
> confirmation on this os_interface value[1]
> 
> [1]
> https://review.openstack.org/#/c/457965/1/puppet/major_upgrade_steps.j2.
> yaml@54

Comment 16 Marios Andreou 2017-04-20 08:57:59 UTC
(In reply to marios from comment #15)
> (In reply to Sofer Athlan-Guyot from comment #14)
> > Adding DFG:Compute to make sure we can:
> >  - remove the restart of the newton nova compute ;
> 
> I thought we would hold on that ^^^ especially if we set the right config it
> should proceed OK or this was written before the irc chat :D
> 
> >  - get all the placement parameter right.  Especially we're expecting
> > confirmation on this os_interface value[1]
> > 
> > [1]
> > https://review.openstack.org/#/c/457965/1/puppet/major_upgrade_steps.j2.
> > yaml@54

FWIW on my env from last night with 457965 applied I cannot see any "PlacementNotConfigured" in journal/nova-compute logs

Comment 17 Sofer Athlan-Guyot 2017-04-20 11:27:01 UTC
On upstream stable/ocata.

Comment 23 Marius Cornea 2017-04-27 16:19:30 UTC
I wasn't able to reproduce this bug anymore nor find any such messages in nova-compute log after upgrade:

2017-04-09 22:41:56.047 69388 ERROR oslo_service.service PlacementNotConfigured: This compute is not configured to talk to the placement service. Configure the [placement] section of nova.conf and restart the service.

Moving to verified.

Comment 24 errata-xmlrpc 2017-05-17 20:19:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245