Bug 1889879 - Cinder A/A cannot recover from network failures and the service does not go down
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Gorka Eguileor
QA Contact: Tzach Shefi
Docs Contact: Chuck Copello
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-20 19:08 UTC by bkopilov
Modified: 2020-11-06 04:34 UTC

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Description bkopilov 2020-10-20 19:08:12 UTC
Hi,
RHOS 16.1 setup: an edge deployment with a 3-node cinder A/A cluster.

When tripleo_cinder_volume.service is stopped on one of the cluster nodes, cinder correctly directs create requests to the other (active) nodes.


However, when we drop the traffic from a cluster node, the cinder service on that node is still reported as up (in cinder service-list), and the cinder create action stays stuck forever in "creating":


(central) [stack@site-undercloud-0 ~]$ cinder list
+--------------------------------------+-----------+------+------+-------------+----------+-------------+
| ID                                   | Status    | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+------+------+-------------+----------+-------------+
| 47e796b2-74ab-44fe-9242-5f08104a619b | available | -    | 1    | tripleo     | false    |             |
| a93993a0-ca08-4a00-a348-3bc10f1348cc | creating  | -    | 1    | tripleo     | false    |             |
| c799d1c3-7ce4-46e6-bd4d-cf62ded98a52 | creating  | -    | 1    | tripleo     | false    |             |
+--------------------------------------+-----------+------+------+-------------+----------+-------------+
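A minimal sketch for picking the stuck volumes out of the table above. The helper names `parse_cinder_list` and `stuck_volumes` are hypothetical and not part of the cinder client; the code only parses the ASCII table as printed by `cinder list`.

```python
def parse_cinder_list(table_text):
    """Parse the ASCII table printed by `cinder list` into a list of dicts."""
    rows = []
    header = None
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +---+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first data line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def stuck_volumes(table_text, status="creating"):
    """Return the IDs of all volumes whose Status column equals `status`."""
    return [row["ID"] for row in parse_cinder_list(table_text)
            if row["Status"] == status]


SAMPLE = """
+--------------------------------------+-----------+------+------+-------------+----------+-------------+
| ID                                   | Status    | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+------+------+-------------+----------+-------------+
| 47e796b2-74ab-44fe-9242-5f08104a619b | available | -    | 1    | tripleo     | false    |             |
| a93993a0-ca08-4a00-a348-3bc10f1348cc | creating  | -    | 1    | tripleo     | false    |             |
| c799d1c3-7ce4-46e6-bd4d-cf62ded98a52 | creating  | -    | 1    | tripleo     | false    |             |
+--------------------------------------+-----------+------+------+-------------+----------+-------------+
"""

for volume_id in stuck_volumes(SAMPLE):
    print(volume_id)
```

Run periodically, a check like this would show whether a volume ever leaves "creating"; here both a93993a0 and c799d1c3 remain in that state.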



How to reproduce:
#1 Go to one of the cinder cluster nodes.
#2 Block traffic with iptables:

/sbin/iptables -I OUTPUT  1 -p tcp --sport 6800:7300 -j DROP
/sbin/iptables -I OUTPUT  2 -p tcp --dport 6800:7300 -j DROP

#3 Stop tripleo_cinder_volume.service on the other nodes.

#4 Try to create a volume --> it stays stuck in "creating".
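The iptables part of the steps above can be sketched as a small script that builds the two DROP rules and also the matching delete (`-D`) form to restore traffic afterwards. The cleanup step is an addition that is not in the original reproduction steps, and the helper names `iptables_rules` and `run_rules` are hypothetical. The port range 6800:7300 appears to correspond to the default Ceph daemon port range, which is an assumption about why it was chosen. By default the script only prints the commands; actually applying them requires root.

```python
import subprocess

PORT_RANGE = "6800:7300"  # taken from step #2 of the report


def iptables_rules(action):
    """Build the two OUTPUT rules; action is "-I" (insert) or "-D" (delete)."""
    rules = []
    for position, port_flag in enumerate(("--sport", "--dport"), start=1):
        cmd = ["/sbin/iptables", action, "OUTPUT"]
        if action == "-I":
            cmd.append(str(position))  # rule position, as in the report
        cmd += ["-p", "tcp", port_flag, PORT_RANGE, "-j", "DROP"]
        rules.append(cmd)
    return rules


def run_rules(rules, dry_run=True):
    """Print the commands, or execute them when dry_run is False (needs root)."""
    for cmd in rules:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


run_rules(iptables_rules("-I"))  # block traffic (step #2)
run_rules(iptables_rules("-D"))  # remove the same rules after the test
```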

Comment 1 bkopilov 2020-10-20 19:09:06 UTC
Here is the log:
2020-10-20 18:30:18.722 35 DEBUG cinder.volume.flows.manager.create_volume [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Volume reschedule parameters: True retry: {'num_attempts': 1, 'backends': ['az-dcn2@tripleo_ceph#tripleo_ceph'], 'hosts': ['az-dcn2@tripleo_ceph#tripleo_ceph']} get_flow /usr/lib/python3.6/site-packages/cinder/volume/flows/manager/create_volume.py:1265
2020-10-20 18:30:18.743 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Flow 'volume_create_manager' (ae2454ef-d783-46da-8828-7b108388c807) transitioned into state 'RUNNING' from state 'PENDING' _flow_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:145
2020-10-20 18:30:18.747 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeRefTask;volume:create' (e61a02d4-8d24-4b55-bcbd-ac88c8b91e37) transitioned into state 'RUNNING' from state 'PENDING' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:194
2020-10-20 18:30:22.192 35 DEBUG oslo_db.sqlalchemy.engines [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/lib/python3.6/site-packages/oslo_db/sqlalchemy/engines.py:307
2020-10-20 18:30:22.420 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeRefTask;volume:create' (e61a02d4-8d24-4b55-bcbd-ac88c8b91e37) transitioned into state 'SUCCESS' from state 'RUNNING' with result 'Volume(_name_id=None,admin_metadata={},attach_status='detached',availability_zone='az-dcn2',bootable=False,cluster=<?>,cluster_name='az-dcn2@tripleo_ceph#tripleo_ceph',consistencygroup=<?>,consistencygroup_id=None,created_at=2020-10-20T18:30:17Z,deleted=False,deleted_at=None,display_description=None,display_name=None,ec2_id=None,encryption_key_id=None,glance_metadata=<?>,group=<?>,group_id=None,host='dcn2-computehci2-0@tripleo_ceph#tripleo_ceph',id=a93993a0-ca08-4a00-a348-3bc10f1348cc,launched_at=None,metadata={},migration_status=None,multiattach=False,previous_status=None,project_id='50ca03ad7c7c4ce7a254853a28d43a3d',provider_auth=None,provider_geometry=None,provider_id=None,provider_location=None,replication_driver_data=None,replication_extended_status=None,replication_status=None,scheduled_at=2020-10-20T18:30:18Z,service_uuid=None,shared_targets=True,size=1,snapshot_id=None,snapshots=<?>,source_volid=None,status='creating',terminated_at=None,updated_at=2020-10-20T18:30:18Z,user_id='9e7f8616dfd84ead800742850fada80f',volume_attachment=<?>,volume_type=VolumeType(68c38efd-0d03-4a89-95c1-53f31bd28360),volume_type_id=68c38efd-0d03-4a89-95c1-53f31bd28360)' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:183
2020-10-20 18:30:22.423 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.OnFailureRescheduleTask;volume:create' (c0930960-141a-4e5b-a3a2-ac8e8cea7c88) transitioned into state 'RUNNING' from state 'PENDING' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:194
2020-10-20 18:30:22.424 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.OnFailureRescheduleTask;volume:create' (c0930960-141a-4e5b-a3a2-ac8e8cea7c88) transitioned into state 'SUCCESS' from state 'RUNNING' with result 'None' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:183
2020-10-20 18:30:22.426 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeSpecTask;volume:create' (2802a73e-2ecf-4615-a914-68242d6b6579) transitioned into state 'RUNNING' from state 'PENDING' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:194
2020-10-20 18:30:22.428 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeSpecTask;volume:create' (2802a73e-2ecf-4615-a914-68242d6b6579) transitioned into state 'SUCCESS' from state 'RUNNING' with result '{'status': 'creating', 'type': 'raw', 'volume_id': 'a93993a0-ca08-4a00-a348-3bc10f1348cc', 'volume_name': 'volume-a93993a0-ca08-4a00-a348-3bc10f1348cc', 'volume_size': 1}' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:183
2020-10-20 18:30:22.429 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.NotifyVolumeActionTask;volume:create, create.start' (dd8b380a-d936-4665-af5a-09cae784b1c0) transitioned into state 'RUNNING' from state 'PENDING' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:194
2020-10-20 18:30:22.970 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.NotifyVolumeActionTask;volume:create, create.start' (dd8b380a-d936-4665-af5a-09cae784b1c0) transitioned into state 'SUCCESS' from state 'RUNNING' with result 'None' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:183
2020-10-20 18:30:22.973 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.CreateVolumeFromSpecTask;volume:create' (435cc619-3203-43e5-827f-e7e46470d652) transitioned into state 'RUNNING' from state 'PENDING' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:194
2020-10-20 18:30:22.973 35 INFO cinder.volume.flows.manager.create_volume [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Volume a93993a0-ca08-4a00-a348-3bc10f1348cc: being created as raw with specification: {'status': 'creating', 'volume_name': 'volume-a93993a0-ca08-4a00-a348-3bc10f1348cc', 'volume_size': 1}
2020-10-20 18:30:22.974 35 DEBUG cinder.volume.drivers.rbd [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] creating volume 'volume-a93993a0-ca08-4a00-a348-3bc10f1348cc' create_volume /usr/lib/python3.6/site-packages/cinder/volume/drivers/rbd.py:941
2020-10-20 18:30:22.975 35 DEBUG cinder.volume.drivers.rbd [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] connecting to openstack@dcn2 (conf=/etc/ceph/dcn2.conf, timeout=-1). _do_conn /usr/lib/python3.6/site-packages/cinder/volume/drivers/rbd.py:431
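The log stops at the RBD connection attempt, which is logged with timeout=-1. A minimal sketch for extracting that value from such a line follows. It assumes the logged timeout reflects the driver's `rados_connect_timeout` option, where a negative value means no explicit connect timeout is set; that interpretation would be consistent with the request hanging here, but it is not confirmed in this report. The helper name `parse_rbd_connect` is hypothetical.

```python
import re

# Matches the "connecting to <user>@<cluster> (conf=..., timeout=N)" debug message.
_CONNECT_RE = re.compile(
    r"connecting to (?P<user>[^@\s]+)@(?P<cluster>\S+) "
    r"\(conf=(?P<conf>[^,]+), timeout=(?P<timeout>-?\d+)\)"
)


def parse_rbd_connect(line):
    """Return connection details from an RBD 'connecting to' log line, or None."""
    match = _CONNECT_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["timeout"] = int(info["timeout"])
    # Assumption: a negative value means the connect call has no explicit timeout.
    info["bounded"] = info["timeout"] >= 0
    return info


LOG_LINE = (
    "2020-10-20 18:30:22.975 35 DEBUG cinder.volume.drivers.rbd "
    "connecting to openstack@dcn2 (conf=/etc/ceph/dcn2.conf, timeout=-1). "
    "_do_conn /usr/lib/python3.6/site-packages/cinder/volume/drivers/rbd.py:431"
)

print(parse_rbd_connect(LOG_LINE))
```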

Comment 2 Gorka Eguileor 2020-11-05 09:43:03 UTC
Improvements to the down detection were discussed at the PTG [1], and it's something we'll be working on.
From the Active-Active perspective there is an additional dimension: besides detecting that the service is down, the service must also stop listening for new requests (close the RabbitMQ cluster message queue).

[1]: https://wiki.openstack.org/wiki/CinderWallabyPTGSummary#Improvements_to_service_down
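For context, a minimal sketch of heartbeat-based down detection: a service counts as up while its last heartbeat is recent enough. The 60-second threshold is assumed to match the default of cinder's `service_down_time` option, and `is_service_up` is a hypothetical helper, not cinder's actual code. Under this model, a node that cannot reach the storage backend but can still write its heartbeat would keep showing as up, which matches the behaviour reported here.

```python
from datetime import datetime, timedelta, timezone

SERVICE_DOWN_TIME = 60  # seconds; assumed default of service_down_time


def is_service_up(last_heartbeat, now, service_down_time=SERVICE_DOWN_TIME):
    """Return True while the last heartbeat is within service_down_time seconds."""
    return (now - last_heartbeat) <= timedelta(seconds=service_down_time)


now = datetime(2020, 10, 20, 18, 31, 0, tzinfo=timezone.utc)

# The heartbeat still arrives although the backend is unreachable: reported up.
print(is_service_up(now - timedelta(seconds=10), now))

# No heartbeat for two minutes: reported down.
print(is_service_up(now - timedelta(seconds=120), now))
```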

