Bug 1465776 - OSP8 -> OSP9 -> OSP10 upgrade: major-upgrade-pacemaker-converge.yaml fails restarting openstack-cinder-scheduler: ServiceTooOld: One of the services is in Liberty version. We do not provide backward compatibility with Liberty now, you need to upgrade
Status: POST
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 10.0 (Newton)
Assigned To: Sofer Athlan-Guyot
QA Contact: Marius Cornea
Keywords: Regression, Triaged, ZStream
Reported: 2017-06-28 03:45 EDT by Marius Cornea
Modified: 2017-10-22 11:45 EDT

Type: Bug

Attachments
cinder-scheduler.log (3.69 MB, text/plain)
2017-06-28 03:45 EDT, Marius Cornea


External Trackers
Tracker                            ID       Last Updated
Launchpad                          1701259  2017-06-29 09:03 EDT
Red Hat Knowledge Base (Solution)  3221581  2017-10-22 11:37 EDT
OpenStack gerrit                   478922   2017-06-29 09:04 EDT
OpenStack gerrit                   507188   2017-09-25 12:22 EDT

Description Marius Cornea 2017-06-28 03:45:03 EDT
Created attachment 1292577 [details]
cinder-scheduler.log

Description of problem:
OSP8 -> OSP9 -> OSP10 upgrade: major-upgrade-pacemaker-converge.yaml fails restarting openstack-cinder-scheduler:  ServiceTooOld: One of the services is in Liberty version. We do not provide backward compatibility with Liberty now, you need to upgrade to Mitaka first.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-5.2.0-20.el7ost.noarch
openstack-cinder-9.1.4-3.el7ost.noarch
puppet-cinder-9.5.0-1.el7ost.noarch
python-cinder-9.1.4-3.el7ost.noarch
python-cinderclient-1.9.0-6.el7ost.noarch

How reproducible:
1/1

Steps to Reproduce:
1. Deploy OSP8
2. Upgrade to OSP9
3. Upgrade to OSP10

Actual results:
During the OSP9 -> OSP10 upgrade, major-upgrade-pacemaker-converge.yaml fails because the openstack-cinder-scheduler service cannot be restarted.

Expected results:
major-upgrade-pacemaker-converge.yaml completes successfully.

Additional info:
Attaching cinder-scheduler.log.
Comment 2 Sofer Athlan-Guyot 2017-06-28 07:08:28 EDT
Hi,

so the problem happens during osp8->osp9 upgrade:

 - Jun 27 19:28:46 -> end of controller upgrade to osp9
 - First error in volume:
   - 2017-06-27 19:25:09.142 6574 ERROR cinder.cmd.volume ServiceTooOld: One of the services is in Liberty version. We do not provide backward compatibility with Liberty now, you need to upgrade to Mitaka first.

 - Then error with the scheduler:
   - 2017-06-27 19:28:06.474 4167 CRITICAL cinder [req-3f310722-468e-48e4-99be-6668872890c5 - - - - -] ServiceTooOld: One of the services is in Liberty version. We do not provide backward compatibility with Liberty now, you need to upgrade to Mitaka first.
 - Jun 27 19:55:30 -> start osp9 convergence.
 - Jun 28 08:47:40 -> start upgrade to osp10

But we only catch the problem during the service restart in the OSP10 upgrade.

This seems to be an orchestration issue between the cinder services.  In the Mitaka release notes[1] we can find this:

  "As cinder-backup was strongly reworked in this release, the recommended upgrade order when executing live (rolling) upgrade is c-api->c-sch->c-vol->c-bak."

We don't use cinder-backup, but in the log[3] we can see that c-vol is restarted before c-sch, and that must be the root cause.
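
For reference, a minimal sketch of that ordering, assuming systemd-managed services with the usual RDO unit names (on the pacemaker-managed controllers the services are driven through pcs instead, so this is only illustrative, not what the upgrade templates do):

# Illustrative only: restart the cinder services in the Mitaka-recommended
# order c-api -> c-sch -> c-vol -> c-bak. Unit names assume the usual RDO
# packaging and are not taken from the upgrade templates.
systemctl restart openstack-cinder-api
systemctl restart openstack-cinder-scheduler
systemctl restart openstack-cinder-volume
systemctl restart openstack-cinder-backup   # not deployed here, listed for completeness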

There is also this launchpad bug[2] that seems to confirm this.

Looking at the code for confirmation.


[1]: https://docs.openstack.org/releasenotes/cinder/mitaka.html#id6
[2]: https://bugs.launchpad.net/devstack/+bug/1612781
[3]: c-vol fails to start at 19:25 and c-sch fails to start at 19:28
Comment 4 Sofer Athlan-Guyot 2017-06-28 11:58:43 EDT
Hi,

so my previous timeline is wrong; just discard it.

So the cause of the error is the second row in the services table:

INSERT INTO `services` VALUES 

('2017-06-27 15:44:41','2017-06-27 19:13:24',NULL,0,2,'hostgroup','cinder-scheduler','cinder-scheduler',3040,0,'nova',NULL,NULL,'2.0','1.3','not-capable',0,NULL,NULL),
('2017-06-27 15:44:41','2017-06-27 16:59:01',NULL,0,5,'hostgroup','cinder-scheduler','cinder-scheduler',440,0,'nova',NULL,NULL,NULL,NULL,'not-capable',0,NULL,NULL),
('2017-06-27 15:45:23','2017-06-28 15:11:30',NULL,0,8,'hostgroup@tripleo_ceph','cinder-volume','cinder-volume',1199,0,'nova',NULL,NULL,'3.0','1.11','disabled',0,NULL,NULL);

This second entry has NULL version columns, while the other two have 1.3 and 1.11.  This triggers an error: the OSP10 cinder-volume node (version 8.1.1) and cinder-scheduler (8.1.1) fail with

 "ServiceTooOld: One of the services is in Liberty version. We do not provide backward compatibility with Liberty now, you need to upgrade to Mitaka first."

This second database entry was created before the OSP8 -> OSP9 upgrade and has not been updated since the first service stop that occurred for that upgrade.

On OSP10 the services keep restarting continuously.
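
For anyone hitting this, a hedged way to spot the stale row straight in the database (the version column names below are inferred from the dump above, and connecting to the cinder database as root on the controller is an assumption of this sketch):

# Illustrative query only: a row with NULL in both version columns is the
# Liberty-era leftover that makes the OSP10 services raise ServiceTooOld.
mysql cinder -e "SELECT id, host, \`binary\`, rpc_current_version, object_current_version FROM services WHERE deleted = 0;"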

We would need help from dfg:storage to get to the bottom of it.
Comment 5 Gorka Eguileor 2017-06-28 12:39:10 EDT
It looks like for some reason the DB had 2 scheduler entries for host hostgroup on OSP8. This meant that when it was upgraded to OSP9 only one of them was updated (the one with id=2), but then when OSP10 checks it sees that there is one entry with NULL in the versions (Liberty) and complains.

We cannot have duplicate entries or obsolete entries (from services that will not exist after the upgrade).
Comment 6 Sofer Athlan-Guyot 2017-06-28 13:04:23 EDT
Hi,

so after a discussion with Gorka, we came to the conclusion that, as long as we don't do rolling upgrades, we can just delete the entries to avoid duplicates.

I've tested this one-liner:


sudo cinder-manage service list | awk '/^cinder/{print $1  " "  $2}' | while read service host; do sudo cinder-manage service remove $service $host; done

and the cinder-scheduler/volume could start.
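
For readability, here is the same cleanup expanded into a commented form (identical commands, nothing new beyond the formatting):

# Same cleanup as the one-liner above: drop every registered cinder service
# record so the services re-register with their current versions on the next
# start. Only acceptable because we do not do rolling upgrades.
sudo cinder-manage service list \
  | awk '/^cinder/ {print $1 " " $2}' \
  | while read -r service host; do
      sudo cinder-manage service remove "$service" "$host"
    done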

We need to run that at the right time during the upgrade.
Comment 10 Marius Cornea 2017-09-22 12:52:10 EDT
This issue is still reproducible with the latest puddle:

[root@controller-0 heat-admin]# cinder-manage service list 
Option "logdir" from group "DEFAULT" is deprecated. Use option "log-dir" from group "DEFAULT".
Binary           Host                                 Zone             Status     State Updated At           RPC Version  Object Version  Cluster                             
cinder-scheduler hostgroup                            nova             enabled    XXX   2017-09-22 15:21:34  2.0          1.3                                                 
cinder-scheduler hostgroup                            nova             enabled    XXX   2017-09-22 10:41:24  None         None                                                
cinder-volume    hostgroup@tripleo_ceph               nova             enabled    XXX   2017-09-22 15:36:10  3.0          1.11  

[stack@undercloud-0 ~]$ rpm -qa | grep tripleo-heat-templates
openstack-tripleo-heat-templates-compat-2.0.0-58.el7ost.noarch
openstack-tripleo-heat-templates-5.3.0-6.el7ost.noarch
Comment 13 Sofer Athlan-Guyot 2017-09-25 12:22:28 EDT
Hi,

so the cinder-manage service list ... command wasn't triggered at the right time.  Galera was already taken down, so the command just wasted cycles with:

2017-09-25 16:19:48.680 153294 WARNING oslo_db.sqlalchemy.engines [req-d33c77cd-e238-44aa-b764-55bbe2b69a13 - - - - -] SQL connection failed. -1 attempts left.
2017-09-25 16:19:58.692 153294 WARNING oslo_db.sqlalchemy.engines [req-d33c77cd-e238-44aa-b764-55bbe2b69a13 - - - - -] SQL connection failed. -2 attempts left.
2017-09-25 16:20:08.707 153294 WARNING oslo_db.sqlalchemy.engines [req-d33c77cd-e238-44aa-b764-55bbe2b69a13 - - - - -] SQL connection failed. -3 attempts left.
2017-09-25 16:20:18.722 153294 WARNING oslo_db.sqlalchemy.engines [req-d33c77cd-e238-44aa-b764-55bbe2b69a13 - - - - -] SQL connection failed. -4 attempts left.

... and so on

The new review makes sure that the command is triggered at a time when cinder-volume is down and galera is up.
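
As a rough sketch of that ordering constraint (not the code from the review; the pacemaker resource name and the retry loop below are assumptions):

# Illustrative guard only:
# 1) wait until galera answers before touching the cinder database,
# 2) make sure cinder-volume is stopped so it cannot re-register mid-cleanup,
# 3) then drop the stale service records.
until mysql -e 'SELECT 1;' >/dev/null 2>&1; do
    echo "waiting for galera to accept connections..."
    sleep 10
done
pcs resource disable openstack-cinder-volume --wait=300   # resource name assumed
cinder-manage service list | awk '/^cinder/ {print $1 " " $2}' | \
  while read -r service host; do
    cinder-manage service remove "$service" "$host"
  done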
