Bug 763973 (GLUSTER-2241)

Summary: GlusterFS Stat Actions Degrade During I/O
Product: [Community] GlusterFS
Reporter: Idan Shinberg <idan>
Component: core
Assignee: tcp
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Version: 3.1.1
CC: aavati, amarts, gluster-bugs, saurabh, vijay, vijaykumar
Hardware: All
OS: Linux
Doc Type: Bug Fix
Verified Versions: release-3.1.3

Description Idan Shinberg 2010-12-20 13:25:20 UTC
While performing different write I/O benchmarks using iozone, accessing the GlusterFS mount point via stat actions (cd, ls, tab completion and so on) freezes for short periods. This may be acceptable from a user-experience standpoint, but not for HPC/real-time platforms. This was not seen on 3.0.x versions.

Steps to reproduce:
--------------------
- Generate heavy I/O load on the Gluster volume through the GlusterFS mount point
- Access the mount point via ls, cd, etc.

Thanks,

Idan Shinberg
Superfish

Comment 1 Anand Avati 2011-01-25 17:37:37 UTC
Pavan,
 I already had a fix for this in my private repository and have submitted it -

http://patches.gluster.com/patch/6051/
http://patches.gluster.com/patch/6052/
http://patches.gluster.com/patch/6053/

However, for a more comprehensive fix, the same classification and prioritization should happen in the socket queues as well (rpc/rpc-transport/socket/src/socket.c).

Avati

Comment 2 tcp 2011-01-27 08:03:46 UTC
I'm thinking this flips the problem around, and makes the slow IO really slow.
I mean, the test case could be something like:

1. untar the linux kernel onto the test volume.
2. Once that is complete, run a find on the top level directory.
3. In parallel to 2 above, run "touch newfile"

I'll refer to the current implementation as the baseline. With the new implementation, I think touch will return only after a very long time in comparison with the baseline. I don't have statistics, though.

A suggestion:
We could have one dedicated thread for each of the priority queues, and also keep the thread pool that continues to work as in your implementation. This way, we avoid some starvation.
The locks could also be revisited. Currently, the conf lock has to be obtained by every thread irrespective of the queue it is serving. Though I see that it is held only for the duration of dequeuing a call stub, it might be worthwhile to see the effects of splitting the lock.
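For illustration, a minimal sketch of that suggestion, assuming one dedicated pthread per priority queue with its own lock; all names and structures here are hypothetical and not taken from the io-threads code:

#include <pthread.h>
#include <unistd.h>

#define NQUEUES 3               /* fast, normal, slow */

typedef struct {
    pthread_mutex_t lock;       /* per-queue lock (the "split lock" idea) */
    int pending;                /* stand-in for a list of call stubs */
} prio_queue_t;

static prio_queue_t queues[NQUEUES];

/* Dedicated worker: serves exactly one queue, so that queue keeps making
 * progress even while the shared pool is busy draining another one. */
static void *dedicated_worker(void *arg)
{
    prio_queue_t *q = arg;

    for (;;) {
        pthread_mutex_lock(&q->lock);
        if (q->pending > 0)
            q->pending--;       /* "execute" one stub */
        pthread_mutex_unlock(&q->lock);
        usleep(1000);           /* placeholder for a proper condition wait */
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[NQUEUES];

    for (int i = 0; i < NQUEUES; i++) {
        pthread_mutex_init(&queues[i].lock, NULL);
        queues[i].pending = 10;
        pthread_create(&tids[i], NULL, dedicated_worker, &queues[i]);
    }
    /* A shared pool (not shown) would still pick the highest-priority
     * non-empty queue on top of this, as in the submitted patches. */
    sleep(1);
    return 0;
}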

Comments?

Comment 3 Anand Avati 2011-01-27 15:09:07 UTC
(In reply to comment #2)
> I'm thinking this flips the problem around, and makes the slow IO really slow.
> I mean, the test case could be something like:
> 
> 1. untar the linux kernel onto the test volume.
> 2. Once that is complete, run a find on the top level directory.
> 3. In parallel to 2 above, run "touch newfile"
> 
> I'll refer to the current implementation as the baseline. With the new
> implementation, I think touch will return only after a very long time in
> comparison with the baseline. I don't have statistics, though.


With the test case you described, I doubt you will see any difference at all made by the patch. The effectiveness of io-threads shows only when there are multiple system calls operating in the graph _in parallel_. A find is a single-threaded job, and by the nature of its operation only one system call will be in the graph at any time. Different scheduling policies in io-threads will have a negligible effect.

The problem being attempted here is to make things more responsive (for the user experience) while there is heavy I/O. When multiple applications are writing heavily, because of the existence of write-behind there will be multiple write fops in the io-threads queue. If you scale up the window-size configuration of write-behind or have multiple instances of writer threads (parallel dd?), then the io-threads queue will have outstanding requests. This is the only time when changes in scheduling policy can make any difference at all. The goal of the patch is to make user-perceivable operations ("triggered via cd, ls, tab completion and so on") - which means system calls like lookup/opendir/readdir - get a higher priority over bulk operations.

One important note - the slow/normal/fast classification in the patch is NOT related to how long the fop takes to complete in storage/posix. Rather, it is about the expectation, from a user's point of view, of which of these fops he would LIKE to get completed sooner or later. Probably renaming them to iot_schedule_{soon,late,normal} might make that clearer.
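As a minimal sketch (not the actual patch code), the classification described above could look roughly like this; the exact fop-to-class mapping and the names are assumptions based on this description:

#include <stdio.h>

typedef enum { FOP_LOOKUP, FOP_OPENDIR, FOP_READDIR, FOP_STAT,
               FOP_READ, FOP_WRITE, FOP_CREATE, FOP_UNLINK } fop_t;

typedef enum { IOT_PRI_FAST, IOT_PRI_NORMAL, IOT_PRI_SLOW } iot_pri_t;

static iot_pri_t iot_classify(fop_t fop)
{
    switch (fop) {
    case FOP_LOOKUP:
    case FOP_OPENDIR:
    case FOP_READDIR:
    case FOP_STAT:
        return IOT_PRI_FAST;    /* what "ls", "cd", tab completion issue */
    case FOP_READ:
    case FOP_WRITE:
        return IOT_PRI_SLOW;    /* bulk data; piled up by write-behind */
    default:
        return IOT_PRI_NORMAL;  /* create, unlink, ... */
    }
}

int main(void)
{
    printf("WRITE -> %d, LOOKUP -> %d\n",
           iot_classify(FOP_WRITE), iot_classify(FOP_LOOKUP));
    return 0;
}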



> A suggestion:
> We could have one dedicated thread for each of the priority queues. And, also
> have the thread pool that continues to work as in your implementation. This
> way, we avoid some starvation.
> A revisit of locks can also be made. Currently, the conf lock is to be obtained
> by every thread immaterial of the queue it is serving. Though I see that it is
> only for the duration of dequeing a call stub, it might be worthwhile to see
> the effects of splitting the lock.

You can give it a shot, but splitting locks in this case only makes things overly complex, forcing you to forego the benefits of granular locking. Shehjar did attempt this once but soon realized that you can never avoid having a central lock (when working with a thread pool) and still be guaranteed to be race free, even while more granular locks exist. So the current scheme has just one central lock (which cannot be avoided anyway) but keeps the critical section as minimal as possible.
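An illustrative sketch of that scheme, with placeholder names rather than the real io-threads structures: each worker takes the single central lock only long enough to unlink one call stub from the highest-priority non-empty queue, and executes it outside the lock.

#include <pthread.h>
#include <stdio.h>

enum { PRI_FAST, PRI_NORMAL, PRI_SLOW, PRI_MAX };

typedef struct stub { int id; struct stub *next; } stub_t;

typedef struct {
    pthread_mutex_t mutex;       /* the single central lock */
    stub_t *queue[PRI_MAX];      /* one FIFO head per priority */
} iot_conf_t;

static stub_t *iot_dequeue(iot_conf_t *conf)
{
    stub_t *stub = NULL;

    pthread_mutex_lock(&conf->mutex);
    /* Critical section kept to the bare minimum: unlink one stub from
     * the highest-priority non-empty queue, nothing more. */
    for (int pri = 0; pri < PRI_MAX; pri++) {
        if (conf->queue[pri]) {
            stub = conf->queue[pri];
            conf->queue[pri] = stub->next;
            break;
        }
    }
    pthread_mutex_unlock(&conf->mutex);

    return stub;   /* resumed/executed by the worker outside the lock */
}

int main(void)
{
    iot_conf_t conf = { PTHREAD_MUTEX_INITIALIZER, { NULL } };
    stub_t write_stub = { 1, NULL }, lookup_stub = { 2, NULL };

    conf.queue[PRI_SLOW] = &write_stub;    /* queued by a heavy writer */
    conf.queue[PRI_FAST] = &lookup_stub;   /* queued by "ls" */

    /* The lookup is dequeued first even though the write arrived earlier. */
    printf("first stub: %d\n", iot_dequeue(&conf)->id);
    return 0;
}

The dequeue order (fast before normal before slow) is what lets stat-type calls jump ahead of queued bulk writes.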

Avati

Comment 4 tcp 2011-01-28 03:16:40 UTC
(In reply to comment #3)

> 
> With the test case you described I doubt if you will see any difference at all
> made by the patch. The effectiveness of io-threads is only when there are
> multiple system calls operating in the graph _in parallel_. A find is a single
> threaded job and by nature of its operation only one system call will be in the
> graph at any time. Different scheduling policies in io-threads will have
> literally negligible effect.

I brought in a wrong test case. My bad.

> 
> The problem being attempted to solve here is to make things more responsive
> (for a user experience) while there is heavy I/O. When multiple applications
> writing heavy, because of the existence of write-behind, there will be multiple
> write fops in the io-threads queue. If you scale up with window-size
> configuration of write-behind or have multiple instances of writer threads
> (parallel dd?) then the io-threads queue will have outstanding requests. This

My point is that there can be a similar "parallel stat", in which case the higher priority queue will be full. Or is such a case unnecessary to handle?

> is the only time when changes in scheduling policy can have any difference at
> all. The goal of the patch is to make user perceivable operations ("triggered
> via cd, ls, tab completion and so on") - which means system calls like
> lookup/opendir/readdir get a higher priority over bulk operations.

Right. And that translates to which requests we are handling first. With the new patch, we direct all threads in the pool to serve the higher priority queue *until it is empty*. Theoretically, there can be a shortage of threads to serve the other two queues. It all boils down to whether that is acceptable or not *and of course*, whether the complexity involved in fixing it is worth it.

> One important note - the slow/normal/fast classification in the patch is NOT
> related to how long the fop takes to complete in storage/posix. However it is
> about the expectancy from a user's point of view of which of these fops he
> LIKES to get completed soon or late. Probably renaming them to
> iot_schedule_{soon,late,normal} might make it more clear.

Not necessary. The current nomenclature is quite clear (at least to me). I did not mean time either. That said, the time aspect stays out of the picture only as long as the scheduling policy does not affect the user's expectation. Understood that it is an indication of the user's preference, but even when the user's understanding is that create is a slow (lower-priority) operation, there is (I think) a certain expectation of *how slow* it can be.

Pavan

Comment 5 Anand Avati 2011-02-22 07:11:07 UTC
PATCH: http://patches.gluster.com/patch/6051 in master (io-threads: whitespace cleanup)

Comment 6 Anand Avati 2011-02-22 07:11:13 UTC
PATCH: http://patches.gluster.com/patch/6052 in master (io-threads: implement bulk and priority queues)

Comment 7 Anand Avati 2011-02-22 07:11:19 UTC
PATCH: http://patches.gluster.com/patch/6053 in master (io-threads: use slow/normal/fast classification of fops)

Comment 8 Saurabh 2011-03-14 09:35:48 UTC
Tried to execute iozone -a along with ls, and also used dd to create 1GB files with 10 similar processes running while ls ran on another terminal. The ls command finished within a few seconds on both 3.1.2 and 3.1.3, with slight differences in microseconds.

Comment 9 Vijaykumar 2011-08-25 06:36:51 UTC
I couldn't reproduce the bug using only one client, so I tried a different setup.
I created a distributed-replicate volume with two bricks on one machine and the other two on another machine, and mounted it over NFS on a separate machine at four mount points. I ran iozone, dbench and dd on different mount points while accessing files using ls on the other mount points.
With glusterfs 3.1.1 the results I got were:
real	0m15.741s
user	0m0.000s
sys	0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file  file1  file2  file3  iozone.tmp

real	0m8.327s
user	0m0.000s
sys	0m0.010s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file  file1  file2  file3  iozone.tmp

real	0m14.702s
user	0m0.000s
sys	0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file  file1  file2  file3  iozone.tmp

real	0m8.335s
user	0m0.000s
sys	0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file  file1  file2  file3  iozone.tmp

real	0m8.091s
user	0m0.000s
sys	0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file  file1  file2  file3  iozone.tmp

real	0m8.438s
user	0m0.000s
sys	0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file  file1  file2  file3  iozone.tmp

real	0m14.588s
user	0m0.000s
sys	0m0.010s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file  file1  file2  file3

With glusterfs 3.1.3 I got these results:
real	0m4.762s
user	0m0.000s
sys	0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file1  file2

real	0m3.746s
user	0m0.000s
sys	0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file1  file2  iozone.tmp

real	0m2.607s
user	0m0.010s
sys	0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file1  file2

real	0m3.978s
user	0m0.000s
sys	0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file1  file2

real	0m11.542s
user	0m0.000s
sys	0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file1  file2

real	0m2.062s
user	0m0.000s
sys	0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file1  file2

real	0m2.932s
user	0m0.000s
sys	0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file1  file2

real	0m4.794s
user	0m0.000s
sys	0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file1  file2  iozone.tmp

real	0m6.994s
user	0m0.000s
sys	0m0.000s

Observing the results, I don't think there is a significant difference. Let me know if I am on the right track or if I have to do any other tests.

Comment 10 Vijaykumar 2011-08-25 08:31:39 UTC
For this particular scenario we can observe an improvement in performance.