This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at Red Hat Issue Tracker.
Bug 1889879 - Cinder A/A can not recover from network failures and service does not go to down
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Gorka Eguileor
QA Contact: Evelina Shames
Docs Contact: RHOS Documentation Team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-20 19:08 UTC by bkopilov
Modified: 2024-01-05 11:12 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-01-05 11:11:04 UTC
Target Upstream Version:
Embargoed:


Links
System ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker OSP-2216 | 0 | None | None | None | 2024-01-05 11:11:03 UTC
Red Hat Issue Tracker OSP-31061 | 0 | None | None | None | 2024-01-05 11:12:36 UTC

Description bkopilov 2020-10-20 19:08:12 UTC
Hi,
RHOS 16.1 setup: edge deployment with a 3-node cinder A/A cluster.

When tripleo_cinder_volume.service is stopped on one of the cluster nodes, cinder correctly directs the create action to the other (active) nodes.


However, if we drop the traffic from a cluster node instead, the cinder service on that node is still reported as up and running (in cinder service-list), and the cinder create action stays stuck forever in "creating".


(central) [stack@site-undercloud-0 ~]$ cinder list
+--------------------------------------+-----------+------+------+-------------+----------+-------------+
| ID                                   | Status    | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+------+------+-------------+----------+-------------+
| 47e796b2-74ab-44fe-9242-5f08104a619b | available | -    | 1    | tripleo     | false    |             |
| a93993a0-ca08-4a00-a348-3bc10f1348cc | creating  | -    | 1    | tripleo     | false    |             |
| c799d1c3-7ce4-46e6-bd4d-cf62ded98a52 | creating  | -    | 1    | tripleo     | false    |             |
+--------------------------------------+-----------+------+------+-------------+----------+-------------+



How to reproduce:
#1 Log in to one of the cinder cluster nodes.
#2 Block traffic on that node with iptables:

/sbin/iptables -I OUTPUT 1 -p tcp --sport 6800:7300 -j DROP
/sbin/iptables -I OUTPUT 2 -p tcp --dport 6800:7300 -j DROP

#3 Stop tripleo_cinder_volume.service on the other nodes.

#4 Try to create a volume --> it stays stuck in "creating".
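
A minimal sketch, in Python, of how the stuck volumes in the `cinder list` output above could be picked out automatically when repeating this reproduction. The helper names (`parse_cinder_list`, `volumes_in_status`) are hypothetical and not part of any cinder client; the parser assumes only the ASCII table layout shown above.

```python
def parse_cinder_list(table_text):
    """Parse the ASCII table printed by `cinder list` into a list of dicts.

    Hypothetical helper: it relies only on the '|'-delimited layout shown
    in this report, not on any cinder client API.
    """
    header = None
    rows = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first '|' line is the column header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def volumes_in_status(table_text, status="creating"):
    """Return the IDs of all volumes whose Status column equals `status`."""
    return [row["ID"] for row in parse_cinder_list(table_text)
            if row["Status"] == status]
```

Run against the table in this report, `volumes_in_status` would return the two volume IDs that are still in `creating`. The output alone does not say how long a volume has been in that state, so a real check would need to be repeated over time.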

Comment 1 bkopilov 2020-10-20 19:09:06 UTC
Here is the log:
2020-10-20 18:30:18.722 35 DEBUG cinder.volume.flows.manager.create_volume [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Volume reschedule parameters: True retry: {'num_attempts': 1, 'backends': ['az-dcn2@tripleo_ceph#tripleo_ceph'], 'hosts': ['az-dcn2@tripleo_ceph#tripleo_ceph']} get_flow /usr/lib/python3.6/site-packages/cinder/volume/flows/manager/create_volume.py:1265
2020-10-20 18:30:18.743 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Flow 'volume_create_manager' (ae2454ef-d783-46da-8828-7b108388c807) transitioned into state 'RUNNING' from state 'PENDING' _flow_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:145
2020-10-20 18:30:18.747 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeRefTask;volume:create' (e61a02d4-8d24-4b55-bcbd-ac88c8b91e37) transitioned into state 'RUNNING' from state 'PENDING' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:194
2020-10-20 18:30:22.192 35 DEBUG oslo_db.sqlalchemy.engines [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/lib/python3.6/site-packages/oslo_db/sqlalchemy/engines.py:307
2020-10-20 18:30:22.420 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeRefTask;volume:create' (e61a02d4-8d24-4b55-bcbd-ac88c8b91e37) transitioned into state 'SUCCESS' from state 'RUNNING' with result 'Volume(_name_id=None,admin_metadata={},attach_status='detached',availability_zone='az-dcn2',bootable=False,cluster=<?>,cluster_name='az-dcn2@tripleo_ceph#tripleo_ceph',consistencygroup=<?>,consistencygroup_id=None,created_at=2020-10-20T18:30:17Z,deleted=False,deleted_at=None,display_description=None,display_name=None,ec2_id=None,encryption_key_id=None,glance_metadata=<?>,group=<?>,group_id=None,host='dcn2-computehci2-0@tripleo_ceph#tripleo_ceph',id=a93993a0-ca08-4a00-a348-3bc10f1348cc,launched_at=None,metadata={},migration_status=None,multiattach=False,previous_status=None,project_id='50ca03ad7c7c4ce7a254853a28d43a3d',provider_auth=None,provider_geometry=None,provider_id=None,provider_location=None,replication_driver_data=None,replication_extended_status=None,replication_status=None,scheduled_at=2020-10-20T18:30:18Z,service_uuid=None,shared_targets=True,size=1,snapshot_id=None,snapshots=<?>,source_volid=None,status='creating',terminated_at=None,updated_at=2020-10-20T18:30:18Z,user_id='9e7f8616dfd84ead800742850fada80f',volume_attachment=<?>,volume_type=VolumeType(68c38efd-0d03-4a89-95c1-53f31bd28360),volume_type_id=68c38efd-0d03-4a89-95c1-53f31bd28360)' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:183
2020-10-20 18:30:22.423 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.OnFailureRescheduleTask;volume:create' (c0930960-141a-4e5b-a3a2-ac8e8cea7c88) transitioned into state 'RUNNING' from state 'PENDING' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:194
2020-10-20 18:30:22.424 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.OnFailureRescheduleTask;volume:create' (c0930960-141a-4e5b-a3a2-ac8e8cea7c88) transitioned into state 'SUCCESS' from state 'RUNNING' with result 'None' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:183
2020-10-20 18:30:22.426 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeSpecTask;volume:create' (2802a73e-2ecf-4615-a914-68242d6b6579) transitioned into state 'RUNNING' from state 'PENDING' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:194
2020-10-20 18:30:22.428 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeSpecTask;volume:create' (2802a73e-2ecf-4615-a914-68242d6b6579) transitioned into state 'SUCCESS' from state 'RUNNING' with result '{'status': 'creating', 'type': 'raw', 'volume_id': 'a93993a0-ca08-4a00-a348-3bc10f1348cc', 'volume_name': 'volume-a93993a0-ca08-4a00-a348-3bc10f1348cc', 'volume_size': 1}' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:183
2020-10-20 18:30:22.429 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.NotifyVolumeActionTask;volume:create, create.start' (dd8b380a-d936-4665-af5a-09cae784b1c0) transitioned into state 'RUNNING' from state 'PENDING' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:194
2020-10-20 18:30:22.970 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.NotifyVolumeActionTask;volume:create, create.start' (dd8b380a-d936-4665-af5a-09cae784b1c0) transitioned into state 'SUCCESS' from state 'RUNNING' with result 'None' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:183
2020-10-20 18:30:22.973 35 DEBUG cinder.volume.manager [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Task 'cinder.volume.flows.manager.create_volume.CreateVolumeFromSpecTask;volume:create' (435cc619-3203-43e5-827f-e7e46470d652) transitioned into state 'RUNNING' from state 'PENDING' _task_receiver /usr/lib/python3.6/site-packages/taskflow/listeners/logging.py:194
2020-10-20 18:30:22.973 35 INFO cinder.volume.flows.manager.create_volume [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] Volume a93993a0-ca08-4a00-a348-3bc10f1348cc: being created as raw with specification: {'status': 'creating', 'volume_name': 'volume-a93993a0-ca08-4a00-a348-3bc10f1348cc', 'volume_size': 1}
2020-10-20 18:30:22.974 35 DEBUG cinder.volume.drivers.rbd [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] creating volume 'volume-a93993a0-ca08-4a00-a348-3bc10f1348cc' create_volume /usr/lib/python3.6/site-packages/cinder/volume/drivers/rbd.py:941
2020-10-20 18:30:22.975 35 DEBUG cinder.volume.drivers.rbd [req-f2c2b32d-05f7-4cb0-946d-06bbda45c2f0 9e7f8616dfd84ead800742850fada80f 50ca03ad7c7c4ce7a254853a28d43a3d - default default] connecting to openstack@dcn2 (conf=/etc/ceph/dcn2.conf, timeout=-1). _do_conn /usr/lib/python3.6/site-packages/cinder/volume/drivers/rbd.py:431
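
The last log line shows the RBD driver connecting with `timeout=-1`, which suggests the connection attempt has no upper bound and can block indefinitely once the Ceph ports are dropped. The following is a generic Python sketch of bounding a blocking call with a deadline; it is not the Cinder driver's code, and `call_with_deadline` is a hypothetical name used only for illustration.

```python
import concurrent.futures


def call_with_deadline(func, timeout_s, *args, **kwargs):
    """Run a blocking call in a worker thread, giving up after timeout_s seconds.

    Raises the built-in TimeoutError if no result arrives in time. The worker
    thread is not killed (Python cannot do that); it is only abandoned, so the
    underlying call may still be blocked in the background.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"call did not finish within {timeout_s}s") from None
    finally:
        # Do not wait for the (possibly stuck) worker when leaving.
        executor.shutdown(wait=False)
```

With a wrapper of this kind, a connection attempt that never returns would surface as an error the caller can act on, instead of leaving the volume in `creating`. Whether that is the right fix for this bug is an open question; the sketch only illustrates the difference between an unbounded and a bounded wait.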

Comment 2 Gorka Eguileor 2020-11-05 09:43:03 UTC
Improvements to the down detection were discussed at the PTG [1], and it's something we'll be working on.
From the Active-Active perspective there is an additional dimension: besides detecting that the service is down, the service must also stop listening for new requests (close the RabbitMQ cluster message queue).

[1]: https://wiki.openstack.org/wiki/CinderWallabyPTGSummary#Improvements_to_service_down
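
To illustrate why the service stays "up" in this scenario, here is a simplified Python sketch of heartbeat-based down detection. It assumes, as a model of the current behaviour, that a service counts as up while its last heartbeat recorded in the database is no older than `service_down_time` (taken here as 60 seconds). The function and constant names are illustrative, not Cinder's actual code.

```python
from datetime import datetime, timedelta, timezone

# Assumed value, modelled on Cinder's `service_down_time` option.
SERVICE_DOWN_TIME = 60  # seconds


def is_up(last_heartbeat, now=None, service_down_time=SERVICE_DOWN_TIME):
    """Return True if the last heartbeat is recent enough.

    This check only looks at the heartbeat timestamp; it knows nothing
    about whether the service can actually reach its storage backend.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - last_heartbeat) <= timedelta(seconds=service_down_time)
```

In the reproduction above only TCP ports 6800:7300 are dropped, so the node can presumably still reach the database and keep refreshing its heartbeat. Under this model `is_up` keeps returning True even though every request that touches the backend hangs, which matches what `cinder service-list` reported.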

Comment 4 Yaniv Kaul 2022-05-11 12:53:11 UTC
This BZ has urgent severity and high priority, but has not seen much activity for ~1.5 years. Can you provide the latest status?

Comment 5 Gorka Eguileor 2022-06-23 10:05:32 UTC
Unfortunately, the issue is not currently being actively worked on and will need to be added to the planning.

