Having just worked on some of the transport code, this scares me. I won't comment on what the RDMA code is doing since I haven't looked at it, but right now the socket transport relies on a process-wide polling thread. With patches I've proposed, there will be one polling thread per connection. If we issue a synchronous request when we're already in one of those threads, the results are likely to be disastrous, so we should outright forbid that. If we issue a synchronous request from some other worker thread, then we'll need to deal with the fact that the reply will come in on one of the polling threads, so that thread will have to signal the one that's waiting. Worse, that polling thread - in both transport-specific and generic RPC code - will have to forgo freeing resources until the synchronous worker thread is done with them. We can't block the polling thread while we wait for yet another signal and context switch, so we'd need to refactor the cleanup code to let the synchronous-request code call it from within the worker thread. I've already hit concurrency/reentrancy bugs in the generic RPC code, so I'm not at all sure it will hold up well in this even more challenging thread environment.

One thing I've thought about, looking at things like DHT lookups and AFR transactions (or, for that matter, the locked read-modify-write cycles required by CloudFS's "crypt" translator), is that there's a lot of repetitive code to allocate/free this->private and frame->local, look stuff up in them, deal with call counts and inode locking, etc. One filesystem I worked on in a previous life dealt with this issue using a domain-specific language extension to convert a linear specification of request handling into a series of callbacks with local variables moved to a shared structure (i.e. much like the translator API as it exists now) and then execute those callbacks within a separate thread/polling framework (i.e. much like the transport/RPC code as it exists now). Developing this language extension required a significant one-time investment, but it paid off pretty well over time. Maybe it's worth considering something like that as a way to simplify both the new rebalance code and the old DHT/AFR/crypt code that has to deal with similar issues.
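To make the signalling problem concrete, here's a minimal sketch of the handshake that would be needed between a worker thread issuing a synchronous request and the polling thread that receives the reply - pure pthreads with hypothetical names, not actual GlusterFS code:

#include <pthread.h>

struct sync_waiter {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             done;   /* set by the polling thread */
        int             op_ret; /* reply status */
};

/* Called from a worker thread: issue the request, then sleep
 * until the polling thread signals completion. */
int
sync_request_wait (struct sync_waiter *w)
{
        pthread_mutex_lock (&w->lock);
        while (!w->done)
                pthread_cond_wait (&w->cond, &w->lock);
        pthread_mutex_unlock (&w->lock);
        /* Request state must still be valid here, so the polling
         * thread cannot have freed it yet. */
        return w->op_ret;
}

/* Called from the polling thread when the reply arrives. */
void
sync_request_complete (struct sync_waiter *w, int op_ret)
{
        pthread_mutex_lock (&w->lock);
        w->op_ret = op_ret;
        w->done   = 1;
        pthread_cond_signal (&w->cond);
        pthread_mutex_unlock (&w->lock);
        /* Cleanup is deferred: the waiter, not this thread, must
         * free the request state after it wakes up - which is
         * exactly the refactoring burden described above. */
}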
(In reply to comment #1)
> Having just worked on some of the transport code, this scares me. I won't
> comment on what the RDMA code is doing since I haven't looked at it, but right
> now the socket transport relies on a process-wide polling thread. With patches
> I've proposed, there will be one polling thread per connection. If we issue a
> synchronous request when we're already in one of those threads, the results
> are likely to be disastrous, so we should outright forbid that. If we issue a
> synchronous request from some other worker thread, then we'll need to deal with
> the fact that the reply will come in on one of the polling threads, so that
> thread will have to signal the one that's waiting. Worse, that polling thread
> - in both transport-specific and generic RPC code - will have to forgo freeing
> resources until the synchronous worker thread is done with them. We can't
> block the polling thread while we wait for yet another signal and context
> switch, so we'd need to refactor the cleanup code to let the
> synchronous-request code call it from within the worker thread. I've already
> hit concurrency/reentrancy bugs in the generic RPC code, so I'm not at all sure
> it will hold up well in this even more challenging thread environment.

The syncop framework in Gluster is written very much with this kind of starvation and deadlock in mind - exactly what can happen if the regular polling threads were to perform the synchronous operations. Everything mentioned above has been addressed in the design of the syncop framework. Please look into https://github.com/gluster/glusterfs/commit/fac3ff8bfb3958a3bdc34dc9bff7cb281597e40f - especially the 'IMPORTANT NOTES' comment - and let us know if you think something has been missed.

syncops are already being used in Gluster during the replace-brick operation to implement a recursive crawler within the filesystem. It is a perfectly good framework for implementing rebalance as well.

> One thing I've thought about, looking at things like DHT lookups and AFR
> transactions (or, for that matter, the locked read-modify-write cycles required
> by CloudFS's "crypt" translator), is that there's a lot of repetitive code to
> allocate/free this->private and frame->local, look stuff up in them, deal with
> call counts and inode locking, etc. One filesystem I worked on in a previous
> life dealt with this issue using a domain-specific language extension to
> convert a linear specification of request handling into a series of callbacks
> with local variables moved to a shared structure (i.e. much like the translator
> API as it exists now) and then execute those callbacks within a separate
> thread/polling framework (i.e. much like the transport/RPC code as it exists
> now). Developing this language extension required a significant one-time
> investment, but it paid off pretty well over time. Maybe it's worth
> considering something like that as a way to simplify both the new rebalance
> code and the old DHT/AFR/crypt code that has to deal with similar issues.

You might consider syncops for your crypt code as well. There is a very strong reason why we are not using syncops in DHT/AFR: synchronous operations take away parallelism. Even though we explicitly use call counts and such for merging callbacks, that approach works best from a performance point of view.

Avati
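For illustration, here is a rough sketch of how an operation runs under the framework. The syncop_* and synctask_* names follow the commit above, but the exact signatures here are assumptions, not copied from the tree:

/* Rough sketch (signatures assumed): the crawl body runs inside a
 * synctask, so only this task yields while a reply is pending -
 * the polling threads stay free to deliver it. */
static int
crawl_task (void *opaque)
{
        xlator_t    *subvol = opaque;
        loc_t        loc    = {0, };
        struct iatt  iatt   = {0, };

        loc.path = "/";
        /* Looks synchronous here, but blocks only this task, not
         * the thread it happens to be scheduled on. */
        if (syncop_lookup (subvol, &loc, &iatt) < 0)
                return -1;
        /* ... descend with further syncop_* calls (opendir,
         * readdir, etc.) ... */
        return 0;
}

static int
crawl_done (int ret, call_frame_t *frame, void *opaque)
{
        gf_log ("crawl", GF_LOG_INFO, "crawl returned %d", ret);
        return 0;
}

/* Launched from ordinary (non-polling) context, e.g.:
 *   synctask_new (env, crawl_task, crawl_done, opaque);
 */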
Thanks, Avati. I had a look, and it does seem to address the problems I mentioned for Amar's use case. It's quite elegant, in fact. As we discussed on IRC, though, it might not be the right thing for the crypt code, where we're right in the normal read/write path and already tend to become CPU-bound on 10GbE.
We should be able to do file operations (like open/read/write/create) synchronously inside glusterfs translators. This is very much needed for the proposed rebalance enhancement.
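As a concrete picture of the goal, something like the following linear sequence inside a translator - none of these syncop_* signatures are real; they are hypothetical and only illustrate the kind of code this enhancement would allow:

/* Hypothetical sketch of the desired usage: copy a file between
 * subvolumes with plain sequential code instead of callbacks.
 * All syncop_* signatures below are illustrative assumptions. */
static int
copy_one_file (xlator_t *from, xlator_t *to, loc_t *loc,
               fd_t *src, fd_t *dst)
{
        struct iovec *iov   = NULL;
        int           count = 0;
        off_t         off   = 0;
        int           ret   = 0;

        if (syncop_open (from, loc, O_RDONLY, src) < 0)
                return -1;
        if (syncop_create (to, loc, O_WRONLY, 0600, dst) < 0)
                return -1;

        /* Read a chunk, write it out, advance - the loop blocks
         * only the task running it, never a polling thread. */
        while ((ret = syncop_readv (from, src, 128 * 1024, off,
                                    &iov, &count)) > 0) {
                if (syncop_writev (to, dst, iov, count, off) < 0)
                        return -1;
                off += ret;
        }
        return ret;
}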
PATCH: http://patches.gluster.com/patch/7631 in master (libglusterfs: added syncop_* functions)
More granular enhancements may be required, but the set of syncops that exists after the patch went in is enough to handle the current rebalance.
PATCH: http://patches.gluster.com/patch/7651 in master (libglusterfs/syncop: add more functions)
Since it's an internal enhancement, we just need to check whether the required syncops are included in mainline or not. It's verified.
CHANGE: http://review.gluster.com/3667 (libglusterfs: syncop for flush ()) merged in master by Anand Avati (avati)