Description of problem:
FFU: post upgrade, attaching a cinder volume to an instance fails with:

ServiceTooOld: One of cinder-volume services is too old to accept attachment_update request. Required RPC API version is 3.9. Are you running mixed versions of cinder-volumes?

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.0-0.20180227121938.e0f59ee.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers + 2 computes
2. Run through the FFU process, including pending FFU patches:
   https://review.openstack.org/#/q/topic:bp/fast-forward-upgrades+(status:open+OR+status:merged)
3. Run tempest after the FFU upgrade process has finished.

Actual results:
Unable to attach a Cinder volume to a nova instance.

Expected results:
Attaching Cinder volumes to instances works fine.

Additional info:

Cinder services are reported as up:

(overcloud) [stack@undercloud-0 ~]$ cinder service-list
+------------------+-------------------------+------+---------+-------+----------------------------+-----------------+
| Binary           | Host                    | Zone | Status  | State | Updated_at                 | Disabled Reason |
+------------------+-------------------------+------+---------+-------+----------------------------+-----------------+
| cinder-scheduler | hostgroup               | nova | enabled | up    | 2018-03-11T16:14:21.000000 | -               |
| cinder-volume    | hostgroup@tripleo_iscsi | nova | enabled | up    | 2018-03-11T16:14:23.000000 | -               |
+------------------+-------------------------+------+---------+-------+----------------------------+-----------------+

Containers are up on all 3 controllers:

[heat-admin@controller-0 ~]$ sudo -s
[root@controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
Last updated: Sun Mar 11 16:16:57 2018
Last change: Sun Mar 11 07:30:46 2018 by root via cibadmin on controller-0

12 nodes configured
37 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 ip-10.0.0.109    (ocf::heartbeat:IPaddr2):    Started controller-0
 ip-172.17.1.10   (ocf::heartbeat:IPaddr2):    Started controller-1
 ip-172.17.4.16   (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.3.14   (ocf::heartbeat:IPaddr2):    Started controller-0
 ip-172.17.1.14   (ocf::heartbeat:IPaddr2):    Started controller-1
 ip-192.168.24.13 (ocf::heartbeat:IPaddr2):    Started controller-2
 Docker container set: rabbitmq-bundle [rhos-qe-mirror-tlv.usersys.redhat.com:5000/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):    Started controller-0
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):    Started controller-1
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):    Started controller-2
 Docker container set: galera-bundle [rhos-qe-mirror-tlv.usersys.redhat.com:5000/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0    (ocf::heartbeat:galera):    Master controller-0
   galera-bundle-1    (ocf::heartbeat:galera):    Master controller-1
   galera-bundle-2    (ocf::heartbeat:galera):    Master controller-2
 Docker container set: redis-bundle [rhos-qe-mirror-tlv.usersys.redhat.com:5000/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0    (ocf::heartbeat:redis):    Master controller-0
   redis-bundle-1    (ocf::heartbeat:redis):    Slave controller-1
   redis-bundle-2    (ocf::heartbeat:redis):    Slave controller-2
 Docker container set: haproxy-bundle [rhos-qe-mirror-tlv.usersys.redhat.com:5000/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0    (ocf::heartbeat:docker):    Started controller-0
   haproxy-bundle-docker-1    (ocf::heartbeat:docker):    Started controller-1
   haproxy-bundle-docker-2    (ocf::heartbeat:docker):    Started controller-2
 Docker container: openstack-cinder-volume [rhos-qe-mirror-tlv.usersys.redhat.com:5000/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0    (ocf::heartbeat:docker):    Started controller-0

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@controller-0 ~]# docker ps --format "{{.Names}}: {{.Status}}" | grep cinder
openstack-cinder-volume-docker-0: Up 8 hours
cinder_api_cron: Up 8 hours
cinder_scheduler: Up 8 hours (healthy)
cinder_api: Up 8 hours

[root@controller-1 ~]# docker ps --format "{{.Names}}: {{.Status}}" | grep cinder
cinder_api_cron: Up 8 hours
cinder_scheduler: Up 8 hours (healthy)
cinder_api: Up 8 hours

[root@controller-2 ~]# docker ps --format "{{.Names}}: {{.Status}}" | grep cinder
cinder_api_cron: Up 8 hours
cinder_scheduler: Up 8 hours (healthy)
cinder_api: Up 8 hours
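The "Required RPC API version is 3.9" message comes from the RPC/versioned-object pins that each Cinder service records in the database. A quick way to confirm that stale service rows are behind the error is to dump those pins. This is a diagnostic sketch only; the galera container name and passwordless (socket-based) root access inside it are assumptions that may differ per deployment:

# Dump the version pins recorded for every Cinder service
# (run on the controller hosting the galera bundle):
sudo docker exec -it galera-bundle-docker-0 \
    mysql cinder -e 'SELECT host, `binary`, rpc_current_version, object_current_version, deleted FROM services;'

Rows reporting an rpc_current_version below 3.9, or leftover rows for services that no longer exist, keep the computed pin low and lead to the ServiceTooOld error even though "cinder service-list" shows everything as up.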
This will be is_bootstrap|bool in cinder FFU tasks.
This is a problem with how the services are started/restarted, and it comes from Cinder's rolling upgrade mechanism. The upstream upgrade documentation [1] does not appear to be 100% accurate, and following it will lead to this error. The issue is basically that the API and Scheduler services need to be restarted twice, whether it is a normal upgrade or a rolling upgrade, for the services to pick up the right RPC and Versioned Objects pinning. So you upgrade, start the APIs, then the Schedulers, then the Volume service, and then restart the Schedulers and APIs (order not important); if you have more than one volume service you will also have to restart all of them except the last one that was restarted. If we want to avoid the trouble of restarting the services twice, we can use the `cinder-manage service remove` command to remove all the services present in the DB, after which we can restart all services in any order. We can even run a SQL command that deletes all entries in the Cinder services table: "delete from services;"

[1] https://docs.openstack.org/cinder/pike/upgrade.html
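For reference, a rough sketch of the two alternatives described above, using the host names from the service-list output earlier in this report (on this containerized deployment the commands would be run inside the corresponding cinder/galera containers):

# Option A: after upgrading, start APIs -> Schedulers -> Volume services,
# then restart the Schedulers and APIs a second time so they re-read the
# (now current) version pins from the services table.

# Option B: purge the old service records first, then start everything once:
cinder-manage service remove cinder-scheduler hostgroup
cinder-manage service remove cinder-volume hostgroup@tripleo_iscsi
# ...or wholesale, directly against the cinder database:
# mysql cinder -e "delete from services;"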
I have been thinking a bit more about this upgrade issue, and I believe removing the services from the table (via SQL or the cinder-manage command) is not the best solution: doing so could unintentionally re-enable a service that was disabled during the upgrade, and it would create problems if a volume service is in a failover state. The other solution, a second restart of the Cinder services, will work as expected (as long as we have purged any service that will no longer be running, and as long as the volume service has no problem on start), but it is bad in terms of user experience and a big headache for the FFU flow. So I have created an upstream Cinder bug and proposed a solution that I believe will be easy to integrate into the FFU flow. The idea of the patch is that the FFU flow only has to pass a new parameter to the db sync command, "cinder-manage db sync --bump-versions", and with that all services will run as expected when we start them after the upgrade and we will no longer see the ServiceTooOld issue. Will clone this BZ to Cinder to keep track of the backport to OSP13.
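To make the proposed flow concrete, a minimal sketch of the intended sequence, assuming the new option lands as described in the upstream patch:

# With all Cinder services stopped, sync the schema and bump the recorded
# service versions to the target release in one step:
cinder-manage db sync --bump-versions
# Then start the API, Scheduler and Volume services once, in any order;
# no second restart and no deletion of service rows is needed.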
https://review.openstack.org/#/q/I5676132be477695838c59a0d59c62e09e335a8f0

Workaround applied; the fix is tracked in a separate bug.
Now that we have the "--bump-versions" option in db sync (rhbz #1557331), we should replace the workaround and use it instead, as it has the added benefit of not losing the status of the services (disabled, failed over, etc.).
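A sketch of what the replacement looks like in the cinder FFU step (illustrative only, not the literal tripleo-heat-templates task; it replaces whichever cleanup/double-restart the current workaround performs):

# Run on the bootstrap controller during the FFU database step:
cinder-manage db sync --bump-versions
# Existing rows in the services table are kept, so services that were
# disabled or failed over retain that status instead of being recreated.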
This item has been properly Triaged and planned for the OSP13 release, and is being tagged for tracking. For details, see https://url.corp.redhat.com/1851efd
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086