| Summary: | GlusterFS Stat Actions Degrade During I/O | ||
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Idan Shinberg <idan> |
| Component: | core | Assignee: | tcp |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 3.1.1 | CC: | aavati, amarts, gluster-bugs, saurabh, vijay, vijaykumar |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | release-3.1.3 | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
Idan Shinberg
2010-12-20 13:25:20 UTC
Pavan, I already had a fix for this in my private repository and have submitted it:

http://patches.gluster.com/patch/6051/
http://patches.gluster.com/patch/6052/
http://patches.gluster.com/patch/6053/

However, for a more comprehensive fix, the same classification and prioritization should happen in the socket queues as well (rpc/rpc-transport/socket/src/socket.c).

Avati

I'm thinking this flips the problem around and makes the slow I/O really slow. I mean, the test case could be something like:

1. Untar the Linux kernel onto the test volume.
2. Once that is complete, run a find on the top-level directory.
3. In parallel with 2 above, run "touch newfile".

I'll refer to the current implementation as the baseline. With the new implementation, I think touch will return only after a very long time in comparison with the baseline. I don't have statistics, though.

A suggestion: we could have one dedicated thread for each of the priority queues, and also keep the thread pool that continues to work as in your implementation. This way, we avoid some starvation.

A revisit of the locks can also be made. Currently, the conf lock has to be obtained by every thread irrespective of the queue it is serving. Though I see that it is held only for the duration of dequeuing a call stub, it might be worthwhile to see the effects of splitting the lock.

Comments?

(In reply to comment #2)
> I'm thinking this flips the problem around and makes the slow I/O really slow.
> I mean, the test case could be something like:
>
> 1. Untar the Linux kernel onto the test volume.
> 2. Once that is complete, run a find on the top-level directory.
> 3. In parallel with 2 above, run "touch newfile".
>
> I'll refer to the current implementation as the baseline. With the new
> implementation, I think touch will return only after a very long time in
> comparison with the baseline. I don't have statistics, though.

With the test case you described, I doubt you will see any difference at all from the patch. io-threads is effective only when multiple system calls are operating in the graph _in parallel_. A find is a single-threaded job, and by the nature of its operation only one system call will be in the graph at any time, so different scheduling policies in io-threads will have a literally negligible effect.

The problem being solved here is to make things more responsive (for the user experience) while there is heavy I/O. When multiple applications are writing heavily, then because of write-behind there will be multiple write fops in the io-threads queue. If you scale up the window-size configuration of write-behind, or have multiple writer threads (parallel dd?), then the io-threads queue will have outstanding requests. This is the only time a change in scheduling policy can make any difference at all. The goal of the patch is to let user-perceivable operations (triggered via cd, ls, tab completion and so on), which means system calls like lookup/opendir/readdir, get a higher priority than bulk operations.

One important note: the slow/normal/fast classification in the patch is NOT related to how long the fop takes to complete in storage/posix. It is about which of these fops the user would LIKE to see completed sooner or later. Renaming them to iot_schedule_{soon,late,normal} might make that clearer.
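[Editorial note: to make the classification idea concrete, here is an illustrative sketch of a fop-to-priority mapping along these lines. The fop list is a local stand-in rather than the real glusterfs_fop_t enum, and the exact assignments in patch 6053 may differ.]

```c
/* Illustrative sketch only: map "user-perceivable" fops to a faster
 * scheduling class than bulk data fops.  The fop enum below is a local
 * stand-in, not the real glusterfs_fop_t. */
enum iot_pri { IOT_PRI_FAST, IOT_PRI_NORMAL, IOT_PRI_SLOW };

enum fop {
        FOP_LOOKUP, FOP_STAT, FOP_OPENDIR, FOP_READDIR,   /* cd, ls, tab completion */
        FOP_OPEN, FOP_CREATE, FOP_UNLINK,                  /* other metadata fops */
        FOP_READ, FOP_WRITE, FOP_FSYNC                     /* bulk data fops */
};

static enum iot_pri
iot_classify_fop(enum fop fop)
{
        switch (fop) {
        case FOP_LOOKUP:
        case FOP_STAT:
        case FOP_OPENDIR:
        case FOP_READDIR:
                return IOT_PRI_FAST;    /* the user wants these back soon */
        case FOP_READ:
        case FOP_WRITE:
        case FOP_FSYNC:
                return IOT_PRI_SLOW;    /* bulk transfers can wait */
        default:
                return IOT_PRI_NORMAL;  /* everything else */
        }
}
```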
> A suggestion: we could have one dedicated thread for each of the priority
> queues, and also keep the thread pool that continues to work as in your
> implementation. This way, we avoid some starvation.
>
> A revisit of the locks can also be made. Currently, the conf lock has to be
> obtained by every thread irrespective of the queue it is serving. Though I see
> that it is held only for the duration of dequeuing a call stub, it might be
> worthwhile to see the effects of splitting the lock.

You can give it a shot, but splitting locks in this case only makes things overly complex and forces you to forego the benefits of granular locking. Shehjar did attempt this once, but soon realized that when working with a thread pool you can never avoid having a central lock if you want to be guaranteed race-free, even while more granular locks exist. So the current scheme has just one central lock (which cannot be avoided anyway) but keeps the critical section as small as possible.

Avati

(In reply to comment #3)
> With the test case you described, I doubt you will see any difference at all
> from the patch. io-threads is effective only when multiple system calls are
> operating in the graph _in parallel_. A find is a single-threaded job, and by
> the nature of its operation only one system call will be in the graph at any
> time, so different scheduling policies in io-threads will have a literally
> negligible effect.

I brought in a wrong test case. My bad.

> The problem being solved here is to make things more responsive (for the user
> experience) while there is heavy I/O. When multiple applications are writing
> heavily, then because of write-behind there will be multiple write fops in the
> io-threads queue. If you scale up the window-size configuration of
> write-behind, or have multiple writer threads (parallel dd?), then the
> io-threads queue will have outstanding requests.

My point is that there can be a similar "parallel stat", in which case the higher-priority queue will be full. Or is such a case unnecessary to handle?

> This is the only time a change in scheduling policy can make any difference at
> all. The goal of the patch is to let user-perceivable operations (triggered via
> cd, ls, tab completion and so on), which means system calls like
> lookup/opendir/readdir, get a higher priority than bulk operations.

Right. And that translates to which requests we handle first. With the new patch, we direct all threads in the pool to serve the higher-priority queue *until it is empty*. Theoretically, there can be a shortage of threads to serve the other two queues. It all boils down to whether that is acceptable or not *and of course* whether the complexity involved in fixing it is worth it.

> One important note: the slow/normal/fast classification in the patch is NOT
> related to how long the fop takes to complete in storage/posix. It is about
> which of these fops the user would LIKE to see completed sooner or later.
> Renaming them to iot_schedule_{soon,late,normal} might make that clearer.

Not necessary; the current nomenclature is quite clear (at least to me). I did not mean time either. That said, the time aspect stays out of the picture only as long as the scheduling policy does not affect the user's expectation. Understood that it is an indication of the user's preference, but even when the user's understanding is that create is a slow (lower-priority) operation, there is (I think) a certain expectation of *how slow* it could be.

Pavan
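[Editorial note: a minimal sketch, not the actual io-threads code, of the scheme under discussion: three FIFO queues, a single central lock with a small critical section, and workers that always drain the highest-priority non-empty queue first. The starvation concern raised above is visible in the dequeue loop, since slow work only runs while the faster queues are empty.]

```c
#include <pthread.h>
#include <stdlib.h>

enum iot_pri { IOT_PRI_FAST = 0, IOT_PRI_NORMAL, IOT_PRI_SLOW, IOT_PRI_MAX };

struct stub {                      /* stand-in for a GlusterFS call stub */
        void (*fn)(void *data);
        void *data;
        struct stub *next;
};

struct iot_conf {
        pthread_mutex_t lock;      /* the single central lock */
        pthread_cond_t  cond;
        struct stub    *head[IOT_PRI_MAX];
        struct stub    *tail[IOT_PRI_MAX];
};

/* enqueue a stub on the queue matching its priority class */
static void
iot_schedule(struct iot_conf *conf, struct stub *s, enum iot_pri pri)
{
        pthread_mutex_lock(&conf->lock);
        s->next = NULL;
        if (conf->tail[pri])
                conf->tail[pri]->next = s;
        else
                conf->head[pri] = s;
        conf->tail[pri] = s;
        pthread_cond_signal(&conf->cond);
        pthread_mutex_unlock(&conf->lock);
}

/* worker thread: keep the critical section down to the dequeue itself and
 * always pick from the fastest non-empty queue */
static void *
iot_worker(void *arg)
{
        struct iot_conf *conf = arg;

        for (;;) {
                struct stub *s = NULL;

                pthread_mutex_lock(&conf->lock);
                while (!s) {
                        for (int p = IOT_PRI_FAST; p < IOT_PRI_MAX; p++) {
                                if (conf->head[p]) {
                                        s = conf->head[p];
                                        conf->head[p] = s->next;
                                        if (!conf->head[p])
                                                conf->tail[p] = NULL;
                                        break;
                                }
                        }
                        if (!s)
                                pthread_cond_wait(&conf->cond, &conf->lock);
                }
                pthread_mutex_unlock(&conf->lock);

                s->fn(s->data);    /* run the fop outside the lock */
                free(s);
        }
        return NULL;
}
```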
PATCH: http://patches.gluster.com/patch/6051 in master (io-threads: whitespace cleanup)
PATCH: http://patches.gluster.com/patch/6052 in master (io-threads: implement bulk and priority queues)
PATCH: http://patches.gluster.com/patch/6053 in master (io-threads: use slow/normal/fast classification of fops)

Tried to execute iozone -a along with ls, and also used dd to create 1 GB files with 10 similar processes running while issuing ls on another terminal. The ls command finished within a few seconds on both 3.1.2 and 3.1.3, with only slight differences in microseconds. I couldn't reproduce the bug using only one client, so I tried a different setup: I created a distributed-replicate volume with two bricks on one machine and the other two on another machine, mounted it over NFS on a third machine on four mount points, ran iozone, dbench and dd on different mount points, and accessed files with ls on the other mount points.

With glusterfs 3.1.1 the results I got were:

real 0m15.741s  user 0m0.000s  sys 0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file  file1  file2  file3  iozone.tmp
real 0m8.327s  user 0m0.000s  sys 0m0.010s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file  file1  file2  file3  iozone.tmp
real 0m14.702s  user 0m0.000s  sys 0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file  file1  file2  file3  iozone.tmp
real 0m8.335s  user 0m0.000s  sys 0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file  file1  file2  file3  iozone.tmp
real 0m8.091s  user 0m0.000s  sys 0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file  file1  file2  file3  iozone.tmp
real 0m8.438s  user 0m0.000s  sys 0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file  file1  file2  file3  iozone.tmp
real 0m14.588s  user 0m0.000s  sys 0m0.010s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file  file1  file2  file3

But with glusterfs 3.1.3 I got:

real 0m4.762s  user 0m0.000s  sys 0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file1  file2
real 0m3.746s  user 0m0.000s  sys 0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file1  file2  iozone.tmp
real 0m2.607s  user 0m0.010s  sys 0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file1  file2
real 0m3.978s  user 0m0.000s  sys 0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file1  file2
real 0m11.542s  user 0m0.000s  sys 0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file1  file2
real 0m2.062s  user 0m0.000s  sys 0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file1  file2
real 0m2.932s  user 0m0.000s  sys 0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file1  file2
real 0m4.794s  user 0m0.000s  sys 0m0.000s
root@ubuntu10-template:/mnt/vol3# time ls
clients  file1  file2  iozone.tmp
real 0m6.994s  user 0m0.000s  sys 0m0.000s

By observing the results, I don't think there is a significant difference. Let me know if I am on the right path or whether I have to do any other tests.
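[Editorial note: a rough sketch of the kind of load used in this test. It is illustrative only; the mount point, number of writers, and file size are assumptions taken from the description above, not the exact commands that were run. It spawns parallel dd writers on the mount and times ls while they run.]

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <time.h>

#define NWRITERS 10                      /* parallel dd writers, as in the test above */

int main(void)
{
        const char *mnt = "/mnt/vol3";   /* assumed mount point from the report */
        pid_t pids[NWRITERS];
        char cmd[256];

        /* spawn parallel writers, each producing a 1 GB file with dd */
        for (int i = 0; i < NWRITERS; i++) {
                pids[i] = fork();
                if (pids[i] == 0) {
                        snprintf(cmd, sizeof(cmd),
                                 "dd if=/dev/zero of=%s/file%d bs=1M count=1024", mnt, i);
                        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
                        _exit(1);
                }
        }

        /* while the writers run, time a directory listing a few times */
        for (int i = 0; i < 5; i++) {
                struct timespec t0, t1;

                snprintf(cmd, sizeof(cmd), "ls %s > /dev/null", mnt);
                clock_gettime(CLOCK_MONOTONIC, &t0);
                system(cmd);
                clock_gettime(CLOCK_MONOTONIC, &t1);
                printf("ls took %.3f s\n",
                       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
                sleep(2);
        }

        for (int i = 0; i < NWRITERS; i++)
                waitpid(pids[i], NULL, 0);
        return 0;
}
```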
For this particular scenario we can observe an improvement in performance.