Bug 1465776

Summary: OSP8 -> OSP9 -> OSP10 upgrade: major-upgrade-pacemaker-converge.yaml fails restarting openstack-cinder-scheduler: ServiceTooOld: One of the services is in Liberty version. We do not provide backward compatibility with Liberty now, you need to upgrade
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Sofer Athlan-Guyot <sathlang>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Priority: high
Version: 10.0 (Newton)
CC: aludwar, augol, dbecker, geguileo, jjoyce, jschluet, mbultel, mburns, morazi, rhel-osp-director-maint, samccann, sathlang, slinaber
Target Milestone: ---
Keywords: Regression, Triaged, ZStream
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-5.3.3-1.el7ost
Last Closed: 2017-11-15 13:45:13 UTC
Type: Bug
Attachments: cinder-scheduler.log

Description Marius Cornea 2017-06-28 07:45:03 UTC
Created attachment 1292577 [details]
cinder-scheduler.log

Description of problem:
OSP8 -> OSP9 -> OSP10 upgrade: major-upgrade-pacemaker-converge.yaml fails restarting openstack-cinder-scheduler:  ServiceTooOld: One of the services is in Liberty version. We do not provide backward compatibility with Liberty now, you need to upgrade to Mitaka first.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-5.2.0-20.el7ost.noarch
openstack-cinder-9.1.4-3.el7ost.noarch
puppet-cinder-9.5.0-1.el7ost.noarch
python-cinder-9.1.4-3.el7ost.noarch
python-cinderclient-1.9.0-6.el7ost.noarch

How reproducible:
1/1

Steps to Reproduce:
1. Deploy OSP8
2. Upgrade to OSP9
3. Upgrade to OSP10

Actual results:
During the OSP9 -> OSP10 upgrade, major-upgrade-pacemaker-converge.yaml fails because the cinder-scheduler service cannot be restarted.

Expected results:
major-upgrade-pacemaker-converge.yaml completes successfully.

Additional info:
Attaching scheduler.log

Comment 2 Sofer Athlan-Guyot 2017-06-28 11:08:28 UTC
Hi,

so the problem happens during the OSP8 -> OSP9 upgrade:

 - Jun 27 19:28:46 -> end of controller upgrade to osp9
 - First error in volume:
   - 2017-06-27 19:25:09.142 6574 ERROR cinder.cmd.volume ServiceTooOld: One of the services is in Liberty version. We do not provide backward compatibility with Liberty now, you need to upgrade to Mitaka first.

 - Then error with the scheduler:
   - 2017-06-27 19:28:06.474 4167 CRITICAL cinder [req-3f310722-468e-48e4-99be-6668872890c5 - - - - -] ServiceTooOld: One of the services is in Liberty version. We do not provide backward compatibility with Liberty now, you need to upgrade to Mitaka first.
 - Jun 27 19:55:30 -> start osp9 convergence.
 - Jun 28 08:47:40 -> start upgrade to osp10

But we only catch the problem during the restart in the OSP10 upgrade.

This looks like an orchestration issue between the cinder services.  In the Mitaka release notes[1] we can find this:

  "As cinder-backup was strongly reworked in this release, the recommended upgrade order when executing live (rolling) upgrade is c-api->c-sch->c-vol->c-bak."

We don't use cinder-backup, but in the log[3] we can see that c-vol is restarted before c-sch, and that must be the root cause.

There is also this launchpad bug[2] that seems to confirm this.

Looking at the code for confirmation.


[1]: https://docs.openstack.org/releasenotes/cinder/mitaka.html#id6
[2]: https://bugs.launchpad.net/devstack/+bug/1612781
[3]: c-vol fails to start at 19:25 and c-sch fails to start at 19:28
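
For anyone retracing the timeline, the failures can be pulled out of the logs on a controller with something like the command below. This is only a sketch; the paths are the default cinder log locations, adjust them if your deployment differs.

# List the ServiceTooOld failures, with their timestamps, for both services,
# to see which of c-vol / c-sch hit the error first.
sudo grep -H 'ServiceTooOld' /var/log/cinder/volume.log /var/log/cinder/scheduler.log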

Comment 4 Sofer Athlan-Guyot 2017-06-28 15:58:43 UTC
Hi,

so my previous timeline is wrong, just discard it.

So the cause of the error is the second row in the services table:

INSERT INTO `services` VALUES
('2017-06-27 15:44:41','2017-06-27 19:13:24',NULL,0,2,'hostgroup','cinder-scheduler','cinder-scheduler',3040,0,'nova',NULL,NULL,'2.0','1.3','not-capable',0,NULL,NULL),
('2017-06-27 15:44:41','2017-06-27 16:59:01',NULL,0,5,'hostgroup','cinder-scheduler','cinder-scheduler',440,0,'nova',NULL,NULL,NULL,NULL,'not-capable',0,NULL,NULL),
('2017-06-27 15:45:23','2017-06-28 15:11:30',NULL,0,8,'hostgroup@tripleo_ceph','cinder-volume','cinder-volume',1199,0,'nova',NULL,NULL,'3.0','1.11','disabled',0,NULL,NULL);

This entry has NULL versions, while the other two have 1.3 and 1.11.  Because of this, the OSP10 cinder-volume node (version 8.1.1) and cinder-scheduler (8.1.1) fail with:

 "ServiceTooOld: One of the services is in Liberty version. We do not provide backward compatibility with Liberty now, you need to upgrade to Mitaka first."

This second database entry was created before the OSP8 -> OSP9 upgrade and has not been updated since the first service stop that occurred during that upgrade.

On OSP10 the services keep restarting continuously.

We would need help from dfg:storage to get to the bottom of it.
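
For reference, the stale row can be spotted directly in the database with something like the query below. This is only a sketch: it assumes the mysql client on the controller can reach the cinder database, and the version column names are inferred from the dump above, so double check them with DESCRIBE services first.

# Sketch only: list the cinder services and the versions they last reported.
sudo mysql cinder -e 'SELECT id, host, `binary`, rpc_current_version,
    object_current_version, updated_at FROM services WHERE deleted = 0;'
# A row with NULL in both version columns is the pre-Mitaka leftover that
# makes the newer cinder services abort with ServiceTooOld.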

Comment 5 Gorka Eguileor 2017-06-28 16:39:10 UTC
It looks like for some reason the DB had two scheduler entries for host hostgroup on OSP8. This meant that when it was upgraded to OSP9 only one of them was updated (the one with id=2), so when OSP10 checks it sees one entry with NULL in the versions (Liberty) and complains.

We cannot have duplicate entries or obsolete entries (from services that will not exist after the upgrade).

Comment 6 Sofer Athlan-Guyot 2017-06-28 17:04:23 UTC
Hi,

so after discussing with Gorka, we came to the conclusion that, as long as we don't do rolling upgrades, we can just delete the entries to avoid duplicates.

I've tested this one-liner:


sudo cinder-manage service list | awk '/^cinder/{print $1  " "  $2}' | while read service host; do sudo cinder-manage service remove $service $host; done

and the cinder-scheduler/volume could start.

We need to add that at the right time during upgrade.
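
For readability, here is the same cleanup as the one-liner above, just spread out with comments. Deleting the rows is acceptable here (no rolling upgrade) because each service recreates its own entry, with current versions, the next time it starts.

# Same cleanup as the one-liner above, broken out for readability.
sudo cinder-manage service list \
  | awk '/^cinder/ {print $1 " " $2}' \
  | while read service host; do
      # drop the stale registration; the service re-registers on its next start
      sudo cinder-manage service remove "$service" "$host"
    done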

Comment 10 Marius Cornea 2017-09-22 16:52:10 UTC
This issue is still reproducible with the latest puddle:

[root@controller-0 heat-admin]# cinder-manage service list 
Option "logdir" from group "DEFAULT" is deprecated. Use option "log-dir" from group "DEFAULT".
Binary           Host                                 Zone             Status     State Updated At           RPC Version  Object Version  Cluster                             
cinder-scheduler hostgroup                            nova             enabled    XXX   2017-09-22 15:21:34  2.0          1.3                                                 
cinder-scheduler hostgroup                            nova             enabled    XXX   2017-09-22 10:41:24  None         None                                                
cinder-volume    hostgroup@tripleo_ceph               nova             enabled    XXX   2017-09-22 15:36:10  3.0          1.11  

[stack@undercloud-0 ~]$ rpm -qa | grep tripleo-heat-templates
openstack-tripleo-heat-templates-compat-2.0.0-58.el7ost.noarch
openstack-tripleo-heat-templates-5.3.0-6.el7ost.noarch
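
The stale entry is the second cinder-scheduler row, still reporting None for both the RPC and Object versions. A small check along these lines flags any such leftover (a sketch; the field positions follow the output layout above, where "Updated At" spans two fields):

# Print any cinder service row whose RPC version is still unset (None),
# i.e. a leftover pre-Mitaka registration that will trip ServiceTooOld.
sudo cinder-manage service list | awk '/^cinder/ && $8 == "None"'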

Comment 13 Sofer Athlan-Guyot 2017-09-25 16:22:28 UTC
Hi,

so the cinder-manage service list ... cleanup wasn't triggered at the right time.  Galera was already taken down, so the command only wasted cycles with:

2017-09-25 16:19:48.680 153294 WARNING oslo_db.sqlalchemy.engines [req-d33c77cd-e238-44aa-b764-55bbe2b69a13 - - - - -] SQL connection failed. -1 attempts left.
2017-09-25 16:19:58.692 153294 WARNING oslo_db.sqlalchemy.engines [req-d33c77cd-e238-44aa-b764-55bbe2b69a13 - - - - -] SQL connection failed. -2 attempts left.
2017-09-25 16:20:08.707 153294 WARNING oslo_db.sqlalchemy.engines [req-d33c77cd-e238-44aa-b764-55bbe2b69a13 - - - - -] SQL connection failed. -3 attempts left.
2017-09-25 16:20:18.722 153294 WARNING oslo_db.sqlalchemy.engines [req-d33c77cd-e238-44aa-b764-55bbe2b69a13 - - - - -] SQL connection failed. -4 attempts left.

... and so on

The new review makes sure that the command is triggered at a time when cinder-volume is down and Galera is up.
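
In other words, the cleanup has to be ordered roughly as below. The guard is only a sketch of the constraint, not the exact code from the review; it assumes the mysql client on the controller can reach the cinder database.

# Wait until Galera / the cinder DB answers again before touching the services
# table, otherwise cinder-manage just burns its connection retries as shown above.
until sudo mysql cinder -e 'SELECT 1;' >/dev/null 2>&1; do
    sleep 10
done
# cinder-volume must still be stopped at this point; then run the cleanup:
sudo cinder-manage service list | awk '/^cinder/ {print $1 " " $2}' \
  | while read service host; do sudo cinder-manage service remove "$service" "$host"; done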

Comment 17 errata-xmlrpc 2017-11-15 13:45:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3231