Bug 1465776
Summary: | OSP8 -> OSP9 -> OSP10 upgrade: major-upgrade-pacemaker-converge.yaml fails restarting openstack-cinder-scheduler: ServiceTooOld: One of the services is in Liberty version. We do not provide backward compatibility with Liberty now, you need to upgrade | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea> | ||||
Component: | openstack-tripleo-heat-templates | Assignee: | Sofer Athlan-Guyot <sathlang> | ||||
Status: | CLOSED ERRATA | QA Contact: | Marius Cornea <mcornea> | ||||
Severity: | urgent | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 10.0 (Newton) | CC: | aludwar, augol, dbecker, geguileo, jjoyce, jschluet, mbultel, mburns, morazi, rhel-osp-director-maint, samccann, sathlang, slinaber | ||||
Target Milestone: | --- | Keywords: | Regression, Triaged, ZStream | ||||
Target Release: | 10.0 (Newton) | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Fixed In Version: | openstack-tripleo-heat-templates-5.3.3-1.el7ost | Doc Type: | If docs needed, set a value | ||||
Last Closed: | 2017-11-15 13:45:13 UTC | Type: | Bug | ||||
Attachments: |
|
Hi, so the problem happens during the osp8->osp9 upgrade:

- Jun 27 19:28:46 -> end of controller upgrade to osp9
- First error, in volume:
  - 2017-06-27 19:25:09.142 6574 ERROR cinder.cmd.volume ServiceTooOld: One of the services is in Liberty version. We do not provide backward compatibility with Liberty now, you need to upgrade to Mitaka first.
- Then an error with the scheduler:
  - 2017-06-27 19:28:06.474 4167 CRITICAL cinder [req-3f310722-468e-48e4-99be-6668872890c5 - - - - -] ServiceTooOld: One of the services is in Liberty version. We do not provide backward compatibility with Liberty now, you need to upgrade to Mitaka first.
- Jun 27 19:55:30 -> start osp9 convergence.
- Jun 28 08:47:40 -> start upgrade to osp10

But we only catch the problem during the restart in the osp10 upgrade.

This seems to be an orchestration issue between the cinder services. The Mitaka release notes[1] say: "As cinder-backup was strongly reworked in this release, the recommended upgrade order when executing live (rolling) upgrade is c-api->c-sch->c-vol->c-bak." We don't use cinder-backup, but in the log[3] we can see that c-vol is restarted before c-sch, and that must be the root cause. There is also a launchpad bug[2] that seems to confirm this. Looking at the code for confirmation.

[1]: https://docs.openstack.org/releasenotes/cinder/mitaka.html#id6
[2]: https://bugs.launchpad.net/devstack/+bug/1612781
[3]: c-vol fails to start at 19:25 and c-sch fails to start at 19:28

Hi, so my previous timeline is wrong, just discard it.
So the cause of the error is the second row in the database:

    INSERT INTO `services` VALUES
    ('2017-06-27 15:44:41','2017-06-27 19:13:24',NULL,0,2,'hostgroup','cinder-scheduler','cinder-scheduler',3040,0,'nova',NULL,NULL,'2.0','1.3','not-capable',0,NULL,NULL),
    ('2017-06-27 15:44:41','2017-06-27 16:59:01',NULL,0,5,'hostgroup','cinder-scheduler','cinder-scheduler',440,0,'nova',NULL,NULL,NULL,NULL,'not-capable',0,NULL,NULL),
    ('2017-06-27 15:45:23','2017-06-28 15:11:30',NULL,0,8,'hostgroup@tripleo_ceph','cinder-volume','cinder-volume',1199,0,'nova',NULL,NULL,'3.0','1.11','disabled',0,NULL,NULL);

This entry (id=5) has NULL in its version columns, while the other two have object versions 1.3 and 1.11. This triggers an error: the osp10 cinder-volume node (version 8.1.1) and cinder-scheduler (8.1.1) fail with "ServiceTooOld: One of the services is in Liberty version. We do not provide backward compatibility with Liberty now, you need to upgrade to Mitaka first."

This second database entry was created before the osp8/9 upgrade and has not been updated since the first stop that occurred for the osp8/9 upgrade. On osp10 the services keep restarting continuously. We would need help from dfg:storage to get to the bottom of it.

It looks like, for some reason, the DB had 2 scheduler entries for host hostgroup on OSP8. This meant that when it was upgraded to OSP9 only 1 of them was updated (the one with id=2), and when OSP10 checks, it sees one entry with NULL in the versions (which it treats as Liberty) and complains. We cannot have duplicate entries or obsolete entries (from services that will not exist after the upgrade).

Hi, so after talking with Gorka, we came to the conclusion that, as long as we don't do a rolling upgrade, we can just delete the entries to avoid duplicates. I've tested this one-liner:

    sudo cinder-manage service list | awk '/^cinder/{print $1 " " $2}' | while read service host; do sudo cinder-manage service remove $service $host; done

and the cinder-scheduler/volume could start.
We need to add that at the right time during the upgrade.

This issue is still reproducible with the latest puddle:

    [root@controller-0 heat-admin]# cinder-manage service list
    Option "logdir" from group "DEFAULT" is deprecated. Use option "log-dir" from group "DEFAULT".
    Binary            Host                    Zone  Status   State  Updated At           RPC Version  Object Version  Cluster
    cinder-scheduler  hostgroup               nova  enabled  XXX    2017-09-22 15:21:34  2.0          1.3
    cinder-scheduler  hostgroup               nova  enabled  XXX    2017-09-22 10:41:24  None         None
    cinder-volume     hostgroup@tripleo_ceph  nova  enabled  XXX    2017-09-22 15:36:10  3.0          1.11

    [stack@undercloud-0 ~]$ rpm -qa | grep tripleo-heat-templates
    openstack-tripleo-heat-templates-compat-2.0.0-58.el7ost.noarch
    openstack-tripleo-heat-templates-5.3.0-6.el7ost.noarch

Hi, so the cinder-manage service list ... command wasn't triggered at the right time. Galera had already been taken down, so the command only wasted cycles with:

    2017-09-25 16:19:48.680 153294 WARNING oslo_db.sqlalchemy.engines [req-d33c77cd-e238-44aa-b764-55bbe2b69a13 - - - - -] SQL connection failed. -1 attempts left.
    2017-09-25 16:19:58.692 153294 WARNING oslo_db.sqlalchemy.engines [req-d33c77cd-e238-44aa-b764-55bbe2b69a13 - - - - -] SQL connection failed. -2 attempts left.
    2017-09-25 16:20:08.707 153294 WARNING oslo_db.sqlalchemy.engines [req-d33c77cd-e238-44aa-b764-55bbe2b69a13 - - - - -] SQL connection failed. -3 attempts left.
    2017-09-25 16:20:18.722 153294 WARNING oslo_db.sqlalchemy.engines [req-d33c77cd-e238-44aa-b764-55bbe2b69a13 - - - - -] SQL connection failed. -4 attempts left.
    ... and so on

The new review makes sure that the command is triggered at a time when cinder-volume is down and galera is up.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2017:3231 |
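The diagnosis in the comments above (a duplicated cinder-scheduler entry whose version columns are empty) can be checked from the `cinder-manage service list` output before deciding what to remove. The following is a minimal Python sketch, not part of the shipped fix. The names `SAMPLE`, `parse_services`, `stale_entries` and `duplicated_keys` are hypothetical, and the column positions are an assumption taken from the output pasted in this report (in particular, that "Updated At" always holds a date and a time).

```python
from collections import Counter

# Sample text copied from the `cinder-manage service list` output in this report.
SAMPLE = """\
Option "logdir" from group "DEFAULT" is deprecated. Use option "log-dir" from group "DEFAULT".
Binary            Host                    Zone  Status   State  Updated At           RPC Version  Object Version  Cluster
cinder-scheduler  hostgroup               nova  enabled  XXX    2017-09-22 15:21:34  2.0          1.3
cinder-scheduler  hostgroup               nova  enabled  XXX    2017-09-22 10:41:24  None         None
cinder-volume     hostgroup@tripleo_ceph  nova  enabled  XXX    2017-09-22 15:36:10  3.0          1.11
"""


def parse_services(text):
    """Return one dict per service row.

    Only lines starting with "cinder" are kept, the same filter as the
    awk one-liner in the comments. Assumed layout (hypothetical, based on
    the sample): binary, host, zone, status, state, date, time, rpc, object.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 9 or not fields[0].startswith("cinder"):
            continue
        rows.append(
            {
                "binary": fields[0],
                "host": fields[1],
                "rpc": fields[7],
                "obj": fields[8],
            }
        )
    return rows


def stale_entries(rows):
    """Rows whose RPC or object version is printed as None (NULL in the DB)."""
    return [r for r in rows if r["rpc"] == "None" or r["obj"] == "None"]


def duplicated_keys(rows):
    """(binary, host) pairs that appear more than once."""
    counts = Counter((r["binary"], r["host"]) for r in rows)
    return sorted(key for key, n in counts.items() if n > 1)


if __name__ == "__main__":
    services = parse_services(SAMPLE)
    for entry in stale_entries(services):
        print("stale entry:", entry["binary"], entry["host"])
    for binary, host in duplicated_keys(services):
        print("duplicated:", binary, host)
```

The sketch only reports; it does not call `cinder-manage service remove`. Both duplicate rows share the same binary and host, so the report cannot single one out by those two fields alone, which is consistent with the approach in the comments of removing the entries outright when no rolling upgrade is involved.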
Created attachment 1292577: cinder-scheduler.log

Description of problem:
OSP8 -> OSP9 -> OSP10 upgrade: major-upgrade-pacemaker-converge.yaml fails restarting openstack-cinder-scheduler: ServiceTooOld: One of the services is in Liberty version. We do not provide backward compatibility with Liberty now, you need to upgrade to Mitaka first.

Version-Release number of selected component (if applicable):

- openstack-tripleo-heat-templates-5.2.0-20.el7ost.noarch
- openstack-cinder-9.1.4-3.el7ost.noarch
- puppet-cinder-9.5.0-1.el7ost.noarch
- python-cinder-9.1.4-3.el7ost.noarch
- python-cinderclient-1.9.0-6.el7ost.noarch

How reproducible:
1/1

Steps to Reproduce:

1. Deploy OSP8
2. Upgrade to OSP9
3. Upgrade to OSP10

Actual results:
major-upgrade-pacemaker-converge.yaml during the OSP9 -> OSP10 upgrade fails because the cinder-scheduler service is not able to restart.

Expected results:
major-upgrade-pacemaker-converge.yaml completes fine.

Additional info:
Attaching scheduler.log