Bug 1374076 - OSP9/10 Ceph osd node upgrade fails.
Summary: OSP9/10 Ceph osd node upgrade fails.
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Target Release: 10.0 (Newton)
Assignee: Giulio Fidente
QA Contact: Yogev Rabl
Depends On:
Blocks: 1337794
Reported: 2016-09-07 21:43 UTC by Sofer Athlan-Guyot
Modified: 2016-12-29 16:55 UTC
CC List: 6 users

Fixed In Version: openstack-tripleo-heat-templates-5.0.0-0.20160929150845.4cdc4fc.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2016-12-14 15:58:09 UTC


System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2016:2948 normal SHIPPED_LIVE Red Hat OpenStack Platform 10 enhancement update 2016-12-14 19:55:27 UTC
OpenStack gerrit 370830 None None None 2016-09-15 13:36:16 UTC
Launchpad 1623942 None None None 2016-09-15 13:30:18 UTC

Description Sofer Athlan-Guyot 2016-09-07 21:43:45 UTC
Description of problem: During an OSP9 to OSP10 upgrade, after the controller upgrade, running:

    . stackrc
    upgrade-non-controller.sh --upgrade overcloud-cephstorage-0


Version-Release number of selected component (if applicable): 

How reproducible:

Actual results: The OSDs cannot restart:

19:01:56 2016-09-07 17:01:55.040934 7f09f4389700  0 -- :/4004684784 >> pipe(0x7f09e40088e0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f09e400d040).fault
19:01:59 2016-09-07 17:01:57.993955 7f09f7401700  0 monclient(hunting): authenticate timed out after 300
19:01:59 2016-09-07 17:01:57.994108 7f09f7401700  0 librados: client.admin authentication error (110) Connection timed out

After verification, all the mons are stopped on the 3 controller nodes.
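
A quick way to confirm this on one of the controllers (a hedged sketch using standard Ceph and systemd commands, not taken from the original report):

    # Check which ceph units are active on the controller:
    sudo systemctl list-units --all '*ceph*'
    # With no mon running, any client command hangs until it times out,
    # matching the "authenticate timed out" / error 110 output above:
    sudo timeout 60 ceph -s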

Expected results:

Additional info:

Packages installed on the controller:

ceph-base.x86_64 1:10.2.2-38.el7cp @rhelosp-10.0-ceph-2.0-mon
ceph-common.x86_64 1:10.2.2-38.el7cp @rhelosp-10.0-ceph-2.0-mon
ceph-mon.x86_64 1:10.2.2-38.el7cp @rhelosp-10.0-ceph-2.0-mon
ceph-osd.x86_64 1:10.2.2-38.el7cp @rhelosp-10.0-ceph-2.0-osd
ceph-selinux.x86_64 1:10.2.2-38.el7cp @rhelosp-10.0-ceph-2.0-mon
libcephfs1.x86_64 1:10.2.2-38.el7cp @rhelosp-10.0-ceph-2.0-mon
puppet-ceph.noarch 2.0.0-0.20160823145734.4e36628.1.el7ost @rhelosp-10.0-brew        
python-cephfs.x86_64 1:10.2.2-38.el7cp @rhelosp-10.0-ceph-2.0-mon

The same packages are installed on the cephstorage node.

Comment 2 Sofer Athlan-Guyot 2016-09-09 15:17:43 UTC

The problem occurs during the controller and block storage upgrade step, i.e. when using the environments/major-upgrade-pacemaker.yaml template.

The Ceph mons are not properly updated from ~0.9 to 2.0.  This is the state of the mons:

systemctl list-units --all *ceph* 
ceph-mon.target loaded active active ceph target allowing to start/stop all ceph-mon@.service instances at once
ceph-osd.target loaded active active ceph target allowing to start/stop all ceph-osd@.service instances at once
ceph.target     loaded active active ceph target allowing to start/stop all ceph*@.service instances at once

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

3 loaded units listed.
To show all installed unit files use 'systemctl list-unit-files'.

No ceph-mon service instance is started.
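
For reference, with the jewel packages (ceph-mon 10.2.x) each mon is expected to run as a per-host systemd instance; a minimal check on a controller looks like this (a sketch only; using the short hostname as the instance name is the usual default, but it is an assumption here):

    # An upgraded node should list something like ceph-mon@overcloud-controller-0:
    sudo systemctl list-units --all 'ceph-mon@*'
    sudo systemctl status ceph-mon@$(hostname -s)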

This is a big version jump and a lot has changed; the upgrade process has to take this into account.
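
As an illustration only of the kind of per-controller mon step the upgrade is missing (this is not the actual fix, which is in the gerrit review linked above; the exact commands, ordering, and paths are assumptions):

    # Stop the pre-jewel mon started through the old init script:
    sudo service ceph stop mon
    # Update to the 2.0 (jewel) packages:
    sudo yum -y update 'ceph*'
    # Jewel daemons run as the ceph user, so existing data has to be re-owned:
    sudo chown -R ceph:ceph /var/lib/ceph /var/log/ceph
    # Enable and start the new per-host systemd instance:
    sudo systemctl enable ceph-mon@$(hostname -s)
    sudo systemctl start ceph-mon@$(hostname -s)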

Comment 5 Omri Hochman 2016-11-18 13:51:01 UTC
Changing to DFG:DF-Lifecycle to help with verification.

Comment 6 Omri Hochman 2016-11-21 22:09:13 UTC
Verified with openstack-tripleo-heat-templates-5.1.0-3.el7ost.noarch.

On controller: 
[root@controller-0 ~]# rpm -qa | grep ceph

Comment 8 errata-xmlrpc 2016-12-14 15:58:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

