Bug 1447422 - OSP 10 -> 11 upgrade failing when keystone is running in a separate node
Summary: OSP 10 -> 11 upgrade failing when keystone is running in a separate node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-gnocchi
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z6
Target Release: 10.0 (Newton)
Assignee: Julien Danjou
QA Contact: Sasha Smolyak
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-05-02 16:56 UTC by Rodrigo Duarte
Modified: 2017-11-15 13:51 UTC (History)

Fixed In Version: openstack-gnocchi-3.0.7-1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-15 13:51:18 UTC
Target Upstream Version:
Embargoed:


Attachments
/var/log/gnocchi/statsd.log file (137.90 KB, text/plain)
2017-05-02 18:18 UTC, Rodrigo Duarte


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 462071 0 None None None 2017-05-03 11:48:39 UTC
Red Hat Product Errata RHBA-2017:3230 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 10 Bug Fix and Enhancement Advisory 2017-11-15 18:39:20 UTC

Description Rodrigo Duarte 2017-05-02 16:56:26 UTC
The OSP 10 -> 11 upgrade fails when keystone is running on a separate node. The console output can be found at [1].

- Checking overcloud failures:

$ openstack stack failures list overcloud
overcloud.AllNodesDeploySteps.ControllerUpgrade_Step0.2:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 3594da32-b7ca-4671-8821-df29642f4296
  status: CREATE_FAILED
  status_reason: |
    Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
    TASK [Check if gnocchi_statsd is deployed] *************************************
    changed: [localhost]
    
    TASK [PreUpgrade step0,validation: Check service openstack-gnocchi-statsd is running] ***
    fatal: [localhost]: FAILED! => {"changed": true, "cmd": "/usr/bin/systemctl show 'openstack-gnocchi-statsd' --property ActiveState | grep '\\bactive\\b'", "delta": "0:00:00.007118", "end": "2017-05-02 16:30:53.660109", "failed": true, "rc": 1, "start": "2017-05-02 16:30:53.652991", "stderr": "", "stdout": "", "stdout_lines": [], "warnings": []}
    	to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/90e57c29-f7c1-466f-8f9d-b5bc4febc04c_playbook.retry
    
    PLAY RECAP *********************************************************************
    localhost                  : ok=41   changed=37   unreachable=0    failed=1   
    
    (truncated, view all with --long)
  deploy_stderr: |

- /var/log/gnocchi/statsd.log displays the following: http://paste.openstack.org/show/608617/

Manually restarting "openstack-gnocchi-statsd" works after the failure, so it might be the case that an ansible step is needed to restart statsd.

[1] https://rhos-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/RHOS/view/RHOS11/job/qe-DFG-security-poc-upgrades-10-11-rhel-7.3-virt-3cont_3keystone_1comp-ipv4-vxlan-ceph-ussl-yes-ossl-no/3/consoleFull

Comment 1 Emilien Macchi 2017-05-02 17:59:07 UTC
I see 2 scenarios:


Scenario 001
============

From what I've seen in your logs, it's failing at this command:
/usr/bin/systemctl show 'openstack-gnocchi-statsd' --property ActiveState | grep '\bactive\b'

Which clearly means Gnocchi Statsd wasn't running well before the upgrade.
Could you confirm that the service was running well before? Could you provide all Gnocchi logs from /var/log/gnocchi?
Even sosreport would be super useful.
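
As a rough illustration of what that validation command checks (this is not the playbook's own code), the Python sketch below applies the same `\bactive\b` pattern to the output of `systemctl show --property ActiveState`. The function names `is_active` and `query_active_state` are hypothetical, and the sample strings in the note below are assumed examples of what systemctl prints.

```python
import re
import subprocess

# Same pattern the failing task passes to grep: "active" as a whole word.
ACTIVE_RE = re.compile(r"\bactive\b")


def is_active(show_output: str) -> bool:
    """Return True if the systemctl output reports the unit as active."""
    return ACTIVE_RE.search(show_output) is not None


def query_active_state(unit: str) -> str:
    """Run the same systemctl query as the task and return its stdout."""
    result = subprocess.run(
        ["/usr/bin/systemctl", "show", unit, "--property", "ActiveState"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

With this pattern, a line such as `ActiveState=active` matches, while `ActiveState=inactive` or `ActiveState=failed` does not. In the non-matching case grep prints nothing and exits with rc 1, which is what the task output above shows.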



Scenario 002
============

It's possible that during the upgrade process:

1) httpd is stopped on a host (where Keystone is running under WSGI)
2) Gnocchi Statsd is started on another host in the same step, and can't reach the Keystone endpoints on the other nodes because httpd hasn't been started yet.

It could be a race condition in the upgrade process if both tasks happen in the same upgrade step. Or it could just be an adjustment to make to the steps.
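
If this second scenario is what happens, one way to tolerate it would be for the dependent service to retry until Keystone answers. The Python sketch below shows such a wait loop. It assumes a caller-supplied `check` callable (hypothetical, for example a probe of the Keystone endpoint) and is only an illustration, not what the upgrade tasks or Gnocchi actually do.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    attempts: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries.

    Returns True as soon as check() succeeds, or False if every attempt fails.
    `check` is a hypothetical probe supplied by the caller.
    """
    for attempt in range(attempts):
        if check():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

The `sleep` parameter exists only so the loop can be exercised without real delays. A fixed delay is used here for simplicity; a backoff would be a reasonable alternative.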

Comment 2 Rodrigo Duarte 2017-05-02 18:18:36 UTC
Created attachment 1275726 [details]
/var/log/gnocchi/statsd.log file

Here are the complete logs for statsd. After the failure, the logs show everything working after the service was manually restarted. Also, the service seemed to be receiving errors while trying to authenticate against keystone before the error in the object-store endpoint.

Comment 3 Emilien Macchi 2017-05-02 18:21:27 UTC
Looking at the logs:

2017-05-02 14:29:21.891 96273 ERROR gnocchi ClientException: Endpoint for object-store not found - have you specified a region?


It's pretty clear that Gnocchi Statsd wasn't working well before the upgrade process.

Comment 4 Emilien Macchi 2017-05-02 19:00:11 UTC
After a little investigation, this is not a bug in the upgrade but in OSP10.

When Keystone and Gnocchi Statsd are deployed on 2 different nodes, there is
a race condition in the deployment where Gnocchi could be started
before the Keystone endpoints are created, within step 5.
See:
https://github.com/openstack/puppet-tripleo/blob/stable/newton/manifests/profile/base/gnocchi/statsd.pp#L31-L35
https://github.com/openstack/puppet-tripleo/blob/stable/newton/manifests/profile/base/keystone.pp#L128

That's why when you restarted Gnocchi Statsd, it worked well afterward.
Note: the bug has been fixed in OSP11, since we now manage Keystone
resources at step 3:
https://github.com/openstack/puppet-tripleo/blob/stable/ocata/manifests/profile/base/keystone.pp#L210

In other words:

1) OSP10 has a bug where Gnocchi Statsd doesn't work when Keystone is
not colocated.
2) OSP11 fails to upgrade when Gnocchi Statsd and Keystone are not
colocated because of 1).
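
The step ordering described in this comment can be modelled with a small Python sketch. It assumes that a deployment step finishes on every node before the next step begins, and that nothing orders resources across nodes within a single step. The step numbers for Keystone come from this comment. The Gnocchi Statsd step under Ocata is an assumption (taken to be unchanged at 5), and the names `STEPS` and `endpoints_may_be_missing` are hypothetical.

```python
# Step at which each resource is applied, per release.
# Keystone values are from the comment above; the Ocata
# gnocchi_statsd value is an assumption (unchanged from Newton).
STEPS = {
    "OSP10 (Newton)": {"keystone_endpoints": 5, "gnocchi_statsd": 5},
    "OSP11 (Ocata)": {"keystone_endpoints": 3, "gnocchi_statsd": 5},
}


def endpoints_may_be_missing(steps: dict) -> bool:
    """True if statsd can start before the Keystone endpoints exist.

    Ordering is only guaranteed when the endpoints are created in a
    strictly earlier step than the one that starts statsd.
    """
    return steps["keystone_endpoints"] >= steps["gnocchi_statsd"]


for release, steps in STEPS.items():
    print(release, "race possible:", endpoints_may_be_missing(steps))
```

Under these assumptions the model flags OSP10 and not OSP11, which is consistent with the explanation above that managing Keystone resources at step 3 removed the race.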

Comment 5 Pradeep Kilambi 2017-05-03 11:49:23 UTC
The upstream fix in gnocchi is under review. It needs to be backported to the 3.0 branch once it merges.

Comment 7 Dustin Schoenbrun 2017-06-07 02:15:42 UTC
For what it's worth, I was able to hit this same exact issue with an upgrade I was attempting on the Manila side of things. This was an Infrared deployment with 3 controllers and 2 compute nodes with the only extra configuration being from setting up the NetApp cDOT Manila Driver during the OSP-10 Overcloud deployment. I'll see if applying the patch referenced in Gerrit fixes the issues that I'm seeing in Gnocchi and allows the upgrade to succeed.

Comment 10 Lon Hohberger 2017-10-10 18:09:40 UTC
According to our records, this should be resolved by openstack-gnocchi-3.0.14-1.el7ost.  This build is available now.

Comment 13 errata-xmlrpc 2017-11-15 13:51:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3230

