| Summary: | Enhancement (cloudfs): allow itable to be virtualized | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Jeff Darcy <jdarcy> |
| Component: | core | Assignee: | Anand Avati <aavati> |
| Status: | CLOSED WONTFIX | QA Contact: | |
| Severity: | low | Docs Contact: | |
| Priority: | medium | | |
| Version: | mainline | CC: | chrisw, gluster-bugs |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| URL: | http://git.fedorahosted.org/git/?p=CloudFS.git;a=blob;f=patches/root_inode.patch;h=b78ef2e019be631685d1c5157032d9b8189f621c;hb=HEAD | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Anand Avati
2011-01-21 15:28:38 UTC
(In reply to comment #1)

> Even though the patch by itself is fine, it is not complete. There are many places where the inode table pointer is picked up by dereferencing subvol->itable.

I looked for such references, but I'm not sure this concern applies to many of them. For a start, many of those references are in client code, which is unaffected by any of this. A few are in debug or deprecated code; I checked those anyway, and I think they're OK. The protocol/server code all uses state->itable, which is now being picked up from the right place.

The only area that I think is affected is NFS, which uses constructs like cs->vol->itable in its path-resolution code. This could be changed to work more like protocol/server, e.g. by adding nfs3_call_state_t.itable (much like server_state_t.itable) and invoking the get_itbl interface to populate it, fairly easily. I'm not sure how much that would buy us, though, since there are other issues (e.g. authentication) that also need to be dealt with before the multi-tenancy work will function through NFS.

> I need to understand if your translator needs to 'kick in' and do something on every fop (other than redirecting fops to the appropriate subvolume). If redirecting is all that cloud does for fops, then we can figure out an interface where it can ask protocol server to 'reset' the bound_xl to the appropriate xlator.

I actually tried to implement things this way, but the changes became even more invasive than the patch I've given. For it to work, the authentication and "re-binding" have to be among the very first things done when the connection is established, even before the very first lookup that uses the itable. That would have required fundamental changes to the server connection-setup code, which we could still do if you think that's the right way, but a patch based on that approach seemed likely to be even more troubling than this one.

> Would such a scheme come in your way for the feature set you have in mind?
> Do you have a doc which lists the goals that the cloud xlator is expected to achieve?

The main descriptions are at Fedora (https://fedoraproject.org/wiki/Features/CloudFS) and my own project site (http://cloudfs.org/?p=11).

In a nutshell, the multi-tenant requirement addresses the case where I'm a user in BarnesAndNobleCloud, and BANC has decided to offer "filesystem as a service" to differentiate themselves from their main competitor. That would let me connect my VMs to their big shared filesystem in such a way that I see my own "universe" in terms of both namespace and UID/GID space (since I can make up not just paths but also UIDs/GIDs within my VMs). They get one filesystem to manage; tenants get the appearance of one filesystem each.

For this to work, something on the provider side has to map per-tenant paths and IDs into their global equivalents. That means knowing which "universe" each request belongs to, which in turn depends on strong authentication of the system/connection, not of the individual UID/user. That does not currently exist and might well be implemented as part of the same translator. That's the "cloud" translator. It can already do weak authentication and (with this patch or the one you've proposed) namespace mapping; in the near future it will do stronger authentication and UID/GID mapping as well. You can view the code for the translator itself at http://git.fedorahosted.org/git/?p=CloudFS.git if you're interested.

---

For the multi-tenancy feature of cloudfs, it is necessary to redirect requests to specific subvolumes based on tenant identity. Setting up multiple protocol/server instances within one process is not only cumbersome but incompatible with the volfile-fetching and port-mapping mechanisms, which assume a 1:1:1 identity between processes, ports, and volfiles.
For the most part this can be worked around by putting the tenant-specific subvolumes under a single translator, which binds to protocol/server to handle both subvolume dispatch and other functions (e.g. UID mapping) based on the authenticated identity associated with a connection. Where this fails is in server-resolve, which attempts lookups before any regular translator ever sees the request, using the inode table associated with its single bound_xl.

The patch at the attached URL addresses this by giving the bound_xl a chance to "virtualize" the inode table for each tenant before lookups are done, allowing each tenant a truly separate namespace. Every translator other than the cloudfs multi-tenant translator transparently defaults to a function that preserves existing behavior, so there should be no disruption to existing translators; this functionality is critical for cloudfs and impossible to implement without even more invasive changes to other subsystems. There are some secondary issues to do with the special xattrs used by DHT (and possibly AFR), but the patch as-is handles the essentials. More refinements can be added later.

---

I gave Anand Avati's alternate approach another try and was able to make things work, so long as the reassignment of bound_xl is done early: not necessarily first, but before anything other than an initial root-inode lookup. Thus, this patch is no longer needed, but for posterity I'll note two related issues.

(1) It would still be nice to have an interface to reassign bound_xl. The way the cloud translator does it now is a bit of a layering violation.

(2) This still doesn't address the issues with NFS, which IMO needs something like bound_xl instead of using cs->vol->itable (or similar). As noted, though, this is hardly the only issue that needs to be addressed between NFS and CloudFS.
Despite these remaining issues, many thanks to Anand Avati for pointing the way to a solution that's much cleaner (and more robust) overall. CloudFS doesn't need this any more. Withdrawn.