Bug 1380482

Summary: The CephFS Native Manila Driver will Flood the Share Log with Errors when it Cannot Connect to Backing CephFS Cluster
Product: Red Hat OpenStack Reporter: Dustin Schoenbrun <dschoenb>
Component: openstack-manila Assignee: Jan Provaznik <jprovazn>
Status: CLOSED ERRATA QA Contact: Dustin Schoenbrun <dschoenb>
Severity: unspecified Docs Contact: Don Domingo <ddomingo>
Priority: unspecified    
Version: 10.0 (Newton) CC: dschoenb, jjoyce, jprovazn, jschluet, mlopes, pgrist, tbarron
Target Milestone: rc Keywords: Triaged
Target Release: 10.0 (Newton)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-manila-3.0.0-5.el7ost Doc Type: Bug Fix
Doc Text:
Prior to this update, the Manila CephFS driver did not check whether it could connect to the Ceph server. Consequently, if the connection to the Ceph server failed, the `manila-share` service kept crashing or respawning without any timeout. With this update, the driver checks that the Ceph connection works during driver initialization. As a result, if the check fails, the driver is not initialized and no further steps are performed.
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-12-14 16:06:11 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description: head of /var/log/manila/share.log after native cephfs driver deployed w/o actual cephfs backend
Flags: none

Description Dustin Schoenbrun 2016-09-29 17:49:59 UTC
Description of problem:
When the CephFS Native Driver cannot connect to the backing CephFS cluster, it reports an error to the Manila Share log saying that it cannot connect. It then immediately attempts to reconnect to the CephFS cluster, where it will most likely fail again. There appears to be no limit on the number of connection retries, which causes the Manila Share log to grow exceptionally quickly.

Version-Release number of selected component (if applicable):
openstack-manila-3.0.0-0.20160903135125.7a16eb6.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Set up OSP-10 using Packstack, ensuring that Manila is installed.
2. Configure the CephFS Native driver but ensure that the driver cannot connect to the backing CephFS cluster.
3. Observe that the driver cannot connect to the backing CephFS cluster and that the Manila Share log is flooded with error messages. 

Actual results:
The driver appears to attempt to reconnect continuously and will flood the Manila Share log with error messages.

Expected results:
The driver should only retry a certain number of times before giving up or should space out the retries over a longer period of time.
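
For illustration only, the expected behavior could look roughly like the following Python sketch, which caps the number of attempts and doubles the wait between them. This is not code from Manila or from the CephFS driver; the function name `connect_with_backoff` and its parameters (`max_attempts`, `base_delay`, `max_delay`, `sleep`) are hypothetical and chosen just for this example.

```python
import time


def connect_with_backoff(connect, max_attempts=5, base_delay=1.0,
                         max_delay=60.0, sleep=time.sleep):
    """Call connect() until it succeeds or max_attempts is used up.

    Hypothetical helper for illustration; not part of Manila.
    The wait between attempts doubles each time, capped at max_delay.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        try:
            return connect()
        except Exception as exc:  # a real driver would catch narrower errors
            last_error = exc
            # Sleep only if another attempt will follow.
            if attempt < max_attempts - 1:
                sleep(min(base_delay * 2 ** attempt, max_delay))
    # Give up: report once instead of looping forever.
    raise ConnectionError(
        "giving up after %d attempts: %s" % (max_attempts, last_error))
```

With the defaults above, a backend that never answers would produce five attempts and four waits (1, 2, 4, and 8 seconds) followed by a single final error, rather than an unbounded stream of log entries.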

Comment 1 Tom Barron 2016-09-30 14:31:47 UTC
We should investigate whether this is a CephFS driver-specific issue or whether any manila backend that fails to connect to external storage will do the same thing.  And if the latter, is this a problem also in cinder?

Comment 2 Paul Grist 2016-10-14 18:05:29 UTC
Targeting 10z, but if this is very problematic then consider bringing it back.

Comment 3 Jan Provaznik 2016-11-08 13:58:15 UTC
upstream bug: https://bugs.launchpad.net/manila/+bug/1640169

Comment 5 Tom Barron 2016-11-21 23:20:21 UTC
Created attachment 1222498
head of /var/log/manila/share.log after native cephfs driver deployed w/o actual cephfs backend

Comment 6 Tom Barron 2016-11-21 23:27:33 UTC
I used OSPd to deploy the native cephfs backend for manila via '-e /usr/share/openstack-tripleo-heat-templates/environments/manila-cephfsnative-config.yaml' using the latest rhos10 puddle, core_puddle=2016-11-19.4.

Results are in https://bugzilla.redhat.com/attachment.cgi?id=1222498, where one can readily see that the manila share log shows the manila share service correctly determining that it cannot interact with the backend. Rather than retrying in a quick loop, as reported in this BZ and in https://bugs.launchpad.net/manila/+bug/1640169, the share service declares:

2016-11-21 22:39:54.682 113290 ERROR oslo_service.periodic_task DriverNotInitialized: Share driver 'CephFSNativeDriver' not initialized.

This message is seen again on periodic task updates that require interaction
with the driver:

2016-11-21 22:40:54.682 113290 ERROR oslo_service.periodic_task DriverNotInitialized: Share driver 'CephFSNativeDriver' not initialized.

In other words, the current log shows behavior consistent with other backends,
and not the tight infinite loop of retries to connect to the CephFS cluster
as reported in this bug.
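
As a rough illustration of the behavior described above (a single connection check at driver init, then a "not initialized" error on each periodic task), here is a minimal Python sketch. It is an assumption-laden stand-in, not the actual Manila code: the class `SketchDriver`, its methods `do_setup` and `periodic_task`, and the local `DriverNotInitialized` class are hypothetical names that only mirror what the log shows.

```python
class DriverNotInitialized(Exception):
    """Local stand-in for the exception name seen in the share log."""


class SketchDriver:
    """Hypothetical driver wrapper; not the real CephFSNativeDriver."""

    def __init__(self, connect):
        self._connect = connect      # callable that reaches the backend
        self.initialized = False

    def do_setup(self):
        # Check the connection exactly once; no retry loop on failure.
        try:
            self._connect()
        except Exception:            # a real driver would catch narrower errors
            return
        self.initialized = True

    def periodic_task(self):
        # Tasks that need the driver fail fast without touching the backend.
        if not self.initialized:
            raise DriverNotInitialized(
                "Share driver 'SketchDriver' not initialized.")
        return "stats"
```

In this sketch, a failed check leaves `initialized` set to False, so each periodic run raises one error and never retries the connection, which would match the one-message-per-minute pattern in the log excerpt.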

Comment 8 Dustin Schoenbrun 2016-11-22 22:22:58 UTC
Thanks for having a look at this, Tom! Looks good to me. Marking the bug as VERIFIED.

Comment 10 errata-xmlrpc 2016-12-14 16:06:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html