Bug 1554122

Summary: FFU: post upgrade attaching cinder volume to instance fails with: ServiceTooOld: One of cinder-volume services is too old to accept attachment_update request. Required RPC API version is 3.9. Are you running mixed versions of cinder-volumes?
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Lukas Bezdicka <lbezdick>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Priority: high
Version: 13.0 (Queens)
CC: dbecker, geguileo, jfrancoa, jschluet, lbezdick, mbracho, mburns, morazi, rhel-osp-director-maint
Target Milestone: beta
Keywords: Triaged
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-8.0.2-0.20180416194362.29a5ad5.el7ost
Last Closed: 2018-06-27 13:35:18 UTC
Type: Bug
Bug Depends On: 1557331

Description Marius Cornea 2018-03-11 16:22:16 UTC
Description of problem:
FFU: post upgrade attaching cinder volume to instance fails with: ServiceTooOld: One of cinder-volume services is too old to accept attachment_update request. Required RPC API version is 3.9. Are you running mixed versions of cinder-volumes?

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.0-0.20180227121938.e0f59ee.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers + 2 computes
2. Run through the FFU process, including pending FFU patches:
https://review.openstack.org/#/q/topic:bp/fast-forward-upgrades+(status:open+OR+status:merged) 
3. Run tempest after the FFU upgrade process has finished.

Actual results:
Unable to attach Cinder volume to nova instance.

Expected results:
Attaching Cinder volumes to instances works fine.

Additional info:
Cinder services are reported as up:

(overcloud) [stack@undercloud-0 ~]$ cinder service-list
+------------------+-------------------------+------+---------+-------+----------------------------+-----------------+
| Binary           | Host                    | Zone | Status  | State | Updated_at                 | Disabled Reason |
+------------------+-------------------------+------+---------+-------+----------------------------+-----------------+
| cinder-scheduler | hostgroup               | nova | enabled | up    | 2018-03-11T16:14:21.000000 | -               |
| cinder-volume    | hostgroup@tripleo_iscsi | nova | enabled | up    | 2018-03-11T16:14:23.000000 | -               |
+------------------+-------------------------+------+---------+-------+----------------------------+-----------------+

Containers are up on all 3 controllers:

[heat-admin@controller-0 ~]$ sudo -s
[root@controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
Last updated: Sun Mar 11 16:16:57 2018
Last change: Sun Mar 11 07:30:46 2018 by root via cibadmin on controller-0

12 nodes configured
37 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 ip-10.0.0.109	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.1.10	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.4.16	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.3.14	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.1.14	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-192.168.24.13	(ocf::heartbeat:IPaddr2):	Started controller-2
 Docker container set: rabbitmq-bundle [rhos-qe-mirror-tlv.usersys.redhat.com:5000/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Started controller-0
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Started controller-1
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Started controller-2
 Docker container set: galera-bundle [rhos-qe-mirror-tlv.usersys.redhat.com:5000/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Master controller-0
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-1
   galera-bundle-2	(ocf::heartbeat:galera):	Master controller-2
 Docker container set: redis-bundle [rhos-qe-mirror-tlv.usersys.redhat.com:5000/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0	(ocf::heartbeat:redis):	Master controller-0
   redis-bundle-1	(ocf::heartbeat:redis):	Slave controller-1
   redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-2
 Docker container set: haproxy-bundle [rhos-qe-mirror-tlv.usersys.redhat.com:5000/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Started controller-0
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Started controller-1
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Started controller-2
 Docker container: openstack-cinder-volume [rhos-qe-mirror-tlv.usersys.redhat.com:5000/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0	(ocf::heartbeat:docker):	Started controller-0

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


[root@controller-0 ~]# docker ps --format "{{.Names}}: {{.Status}}" | grep cinder
openstack-cinder-volume-docker-0: Up 8 hours
cinder_api_cron: Up 8 hours
cinder_scheduler: Up 8 hours (healthy)
cinder_api: Up 8 hours


[root@controller-1 ~]# docker ps --format "{{.Names}}: {{.Status}}" | grep cinder
cinder_api_cron: Up 8 hours
cinder_scheduler: Up 8 hours (healthy)
cinder_api: Up 8 hours

[root@controller-2 ~]#  docker ps --format "{{.Names}}: {{.Status}}" | grep cinder
cinder_api_cron: Up 8 hours
cinder_scheduler: Up 8 hours (healthy)
cinder_api: Up 8 hours

Comment 4 Lukas Bezdicka 2018-03-12 13:35:54 UTC
This will be an is_bootstrap|bool condition in the cinder FFU tasks.

Comment 5 Gorka Eguileor 2018-03-12 13:47:39 UTC
This is a problem with the start/restart of the services that comes from Cinder's rolling upgrade mechanism.

Looking at the upstream documentation for upgrades [1], it appears it's not 100% accurate, and following it will lead to this error.

The core issue is that the API and Scheduler services must be restarted twice, whether it's a normal upgrade or a rolling upgrade, for the services to pick up the right RPC and Versioned Objects pinning. So you upgrade, start the APIs, then the Schedulers, then the Volume service; then you restart the Schedulers and APIs (order not important), and if you have more than one volume service you also have to restart all of them except the last one that was restarted.

If we want to avoid the trouble of restarting the services twice, we can use the `cinder-manage service remove` command to remove all the services present in the DB, after which we can restart all services in any order. We could even run a SQL statement that deletes all entries in the Cinder services table: "delete from services;"


[1] https://docs.openstack.org/cinder/pike/upgrade.html
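The double-restart sequence described above can be sketched roughly as follows. This is a hedged, dry-run illustration (each step is only echoed, not executed), and the service names and restart commands are assumptions; on a containerized OSP13 deployment the actual restarts go through docker/pcs as shown in the logs above:

```shell
#!/bin/sh
# Dry-run sketch of the double-restart flow from comment 5.
# run() only echoes the command; on a real deployment it would execute it.
run() { echo "+ $*"; }

# First pass: start the upgraded services -- APIs, then Schedulers,
# then the Volume service.
run docker restart cinder_api
run docker restart cinder_scheduler
run pcs resource restart openstack-cinder-volume

# Second pass: restart Schedulers and APIs (order not important) so they
# drop the old RPC/Versioned Objects version pins now that every service
# reports the new version.
run docker restart cinder_scheduler
run docker restart cinder_api
```

With more than one cinder-volume service, this sketch would also need a second restart of every volume service except the last one restarted, per the comment above.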

Comment 6 Gorka Eguileor 2018-03-16 12:39:22 UTC
I have been thinking a bit more about this upgrade issue, and I think removing the services from the table (via SQL or the cinder-manage command) is not the best solution: doing so could unintentionally re-enable a service that was disabled during the upgrade, and it would create problems if a volume service is in a failover state.

The other solution, doing a second restart of the Cinder services, will work as expected (as long as we have purged any service that will no longer be running, and as long as the volume service doesn't have any problem on start), but it is bad in terms of user experience and a big headache for the FFU flow.
So I have created an upstream Cinder bug and proposed a solution that I believe will be easy to integrate in the FFU flow.

The idea of this patch is that in the FFU flow we just pass a new parameter to the db sync command: "cinder-manage db sync --bump-versions". With that, all services will run as expected when we start them after the upgrade, and we'll no longer see the ServiceTooOld issue.
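The proposed flow can be sketched as below. The `--bump-versions` option is the one named in this comment; the restart commands are illustrative assumptions, and the script only echoes each step (dry-run) rather than touching a live cloud:

```shell
#!/bin/sh
# Dry-run sketch of the proposed FFU flow using the new db sync option.
# run() only echoes the command; container/resource names are illustrative.
run() { echo "+ $*"; }

# A single migration call bumps the stored RPC/Versioned Objects versions,
# so services come up unpinned -- no second restart pass is needed.
run cinder-manage db sync --bump-versions

# Afterwards all Cinder services can be (re)started in any order.
run docker restart cinder_api cinder_scheduler
run pcs resource restart openstack-cinder-volume
```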

Will clone this BZ to Cinder to keep track of the backport to OSP13.

Comment 7 Lukas Bezdicka 2018-04-23 14:31:20 UTC
Workaround applied: https://review.openstack.org/#/q/I5676132be477695838c59a0d59c62e09e335a8f0 The fix is tracked in a different bug.

Comment 10 Gorka Eguileor 2018-04-27 10:13:35 UTC
Now that we have the "--bump-versions" option in db sync (rhbz #1557331), we should replace the workaround with it, as it has the added benefit of not losing the status of the services (disabled, failed over, etc.).

Comment 11 Scott Lewis 2018-04-30 14:59:49 UTC
This item has been properly Triaged and planned for the OSP13 release, and is being tagged for tracking. For details, see https://url.corp.redhat.com/1851efd

Comment 13 errata-xmlrpc 2018-06-27 13:35:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086