Description of problem:
When the CephFS Native driver cannot connect to the backing CephFS cluster, it reports a connection error to the Manila share log and then immediately attempts to reconnect, where it will most likely fail again. There is seemingly no limit on the number of retries, which causes the Manila share log to grow exceptionally quickly.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Set up OSP-10 using Packstack, ensuring that Manila is installed.
2. Configure the CephFS Native driver but ensure that the driver cannot connect to the backing CephFS cluster.
3. Observe that the driver cannot connect to the backing CephFS cluster and that the Manila Share log is flooded with error messages.
Actual results:
The driver attempts to reconnect continuously and floods the Manila share log with error messages.
Expected results:
The driver should retry only a limited number of times before giving up, or should space the retries out over a longer period of time.
We should investigate whether this is a CephFS driver-specific issue or whether any Manila backend that fails to connect to its external storage behaves the same way. If it is the latter, is this also a problem in Cinder?
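The retry limiting suggested above can be sketched as a capped exponential backoff loop. This is an illustrative sketch only, not Manila's actual driver code; `connect` here is a hypothetical stand-in for whatever setup call the driver makes against the CephFS cluster:

```python
import logging
import time

LOG = logging.getLogger(__name__)


def connect_with_backoff(connect, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry `connect` with capped exponential backoff instead of a tight loop.

    `connect` is any zero-argument callable that raises on failure.
    Returns True on success, False once max_retries attempts have failed.
    """
    delay = base_delay
    for attempt in range(1, max_retries + 1):
        try:
            connect()
            return True
        except Exception as exc:
            # One log line per attempt, with a bounded number of attempts,
            # instead of an unbounded stream of identical errors.
            LOG.error("Connect attempt %d/%d failed: %s",
                      attempt, max_retries, exc)
            if attempt == max_retries:
                return False
            time.sleep(delay)
            delay = min(delay * 2, max_delay)
    return False
```

With the defaults above, an unreachable cluster would produce at most five error lines, spaced roughly 1, 2, 4, and 8 seconds apart, rather than flooding the log.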
Targeting 10z, but if this is very problematic then consider bringing it back.
upstream bug: https://bugs.launchpad.net/manila/+bug/1640169
Created attachment 1222498 [details]
head of /var/log/manila/share.log after native cephfs driver deployed w/o actual cephfs backend
I used OSPd to deploy the native cephfs backend for manila via '-e /usr/share/openstack-tripleo-heat-templates/environments/manila-cephfsnative-config.yaml' using the latest rhos10 puddle, core_puddle=2016-11-19.4.
Results are in https://bugzilla.redhat.com/attachment.cgi?id=1222498, where one can readily see that the manila share service correctly determines that it cannot interact with the backend. Instead of retrying in a quick loop as reported in this BZ and in https://bugs.launchpad.net/manila/+bug/1640169, the share service instead declares:
2016-11-21 22:39:54.682 113290 ERROR oslo_service.periodic_task DriverNotInitialized: Share driver 'CephFSNativeDriver' not initialized.
This message is seen again on periodic task updates that require interaction with the driver:
2016-11-21 22:40:54.682 113290 ERROR oslo_service.periodic_task DriverNotInitialized: Share driver 'CephFSNativeDriver' not initialized.
In other words, the current log shows behavior consistent with other backends, and not the tight infinite loop of connection retries to the CephFS cluster reported in this bug.
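The once-per-minute DriverNotInitialized errors above are consistent with a periodic-task guard of roughly this shape. This is a hedged sketch of the pattern, not Manila's actual share-manager code; the class and method names here are illustrative:

```python
class DriverNotInitialized(Exception):
    """Raised when a periodic task runs before the driver's backend setup succeeded."""


class ShareManager:
    """Minimal sketch: periodic tasks fail fast if the driver never came up."""

    def __init__(self, driver_name):
        self.driver_name = driver_name
        # Set to True only after the backend connection succeeds at startup.
        self.initialized = False

    def _ensure_driver_initialized(self):
        if not self.initialized:
            raise DriverNotInitialized(
                "Share driver '%s' not initialized." % self.driver_name)

    def periodic_update(self):
        # Invoked once per periodic interval (e.g. every 60 seconds). Raising
        # here yields one error line per interval, rather than a tight retry
        # loop against the unreachable backend.
        self._ensure_driver_initialized()
        # ... gather share stats from the backend ...
```

This matches the observed log: one ERROR line per periodic interval while the driver stays uninitialized, instead of continuous reconnect attempts.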
Thanks for having a look at this, Tom! Looks good to me. Marking the bug as VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.