Bug 1349953
Summary:          thread CPU saturation limiting throughput on write workloads
Product:          [Community] GlusterFS
Component:        fuse
Version:          3.8
Hardware:         x86_64
OS:               Linux
Status:           CLOSED NEXTRELEASE
Severity:         unspecified
Priority:         unspecified
Reporter:         Manoj Pillai <mpillai>
Assignee:         Pranith Kumar K <pkarampu>
CC:               bugs, dblack, iesool, jahernan, jbyers, pkarampu, rtalur
Keywords:         Performance
Target Milestone: ---
Target Release:   ---
Doc Type:         If docs needed, set a value
Story Points:     ---
Type:             Bug
Regression:       ---
Mount Type:       ---
Documentation:    ---
Category:         ---
oVirt Team:       ---
Cloudforms Team:  ---
Bug Blocks:       1361678
Last Closed:      2017-08-31 06:54:50 UTC
Description
Manoj Pillai
2016-06-24 15:59:04 UTC
Simpler reproduction with a single client:

Created the 12x(4+2) volume and fuse mounted it on the client.

Ran the iozone single-client write test with 8 threads/files:
iozone -i 0 -w -+n -c -C -e -s 10g -r 64k -t 8 -F /mnt/glustervol/f{1..8}.ioz

Output:
throughput for 8 initial writers = 179084.75 kB/sec

Output from top -bH -d 10:

top - 13:55:21 up 191 days, 12:36,  6 users,  load average: 0.92, 0.32, 0.55
Threads: 292 total,   3 running, 289 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.3 us,  4.9 sy,  0.0 ni, 84.7 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 65728904 total, 54453352 free,   861572 used, 10413980 buff/cache
KiB Swap: 32964604 total, 32963420 free,     1184 used. 64082548 avail Mem

  PID USER  PR  NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
 8839 root  20   0  892360 239612  3572 R  99.8  0.4  0:53.34 glusterfs
 8836 root  20   0  892360 239612  3572 S  19.6  0.4  0:10.37 glusterfs
 8835 root  20   0  892360 239612  3572 S  19.0  0.4  0:10.18 glusterfs
 8834 root  20   0  892360 239612  3572 S  18.8  0.4  0:10.17 glusterfs
 8833 root  20   0  892360 239612  3572 R  18.6  0.4  0:10.49 glusterfs
 8853 root  20   0   47432  18796   168 S   2.2  0.0  0:01.25 iozone
 8855 root  20   0   47432  18800   168 S   2.2  0.0  0:00.99 iozone
[...]

--- Comment ---

Thanks for this bug, Manoj. I think we need to test these cases with "gluster volume set <volname> client.io-threads on" so that the fuse thread won't do the encoding in EC volumes.

--- Comment ---

(In reply to Manoj Pillai from comment #1)
> Ran the iozone single-client write test with 8 threads/files:
> iozone -i 0 -w -+n -c -C -e -s 10g -r 64k -t 8 -F /mnt/glustervol/f{1..8}.ioz
>
> Output:
> throughput for 8 initial writers = 179084.75 kB/sec

Same test with the additional option performance.client-io-threads on:
throughput for 8 initial writers = 773070.43 kB/sec

That's more like it! I'll look at some more results, including from replica-2 runs, before closing this bug.
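For reference, the options being compared in these runs are ordinary volume options. A minimal sketch of setting them, assuming the volume is named glustervol as in the mount path above (the option names are the stock GlusterFS ones discussed in this thread):

# let a pool of io-threads in the client-side graph do work that would
# otherwise stay on the single fuse reader thread
gluster volume set glustervol performance.client-io-threads on

# epoll (event) thread counts used in the later runs
gluster volume set glustervol client.event-threads 4
gluster volume set glustervol server.event-threads 4

# confirm the options the volume is currently carrying
gluster volume info glustervol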
--- Comment #5 from Manoj Pillai ---

Results for the distributed iozone test, with 6 clients. Command line:
iozone -+m ${IOZONE_CONF} -i 0 -w -+n -c -C -e -s 10g -r 64k -t 48

Volume same as in comment #0: 6 servers, 12x(4+2). Volume options set:
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4

iozone output:
Children see throughput for 48 initial writers = 1092648.42 kB/sec

Now, adding the option performance.client-io-threads on:
iozone output:
Children see throughput for 48 initial writers = 3746335.84 kB/sec

So enabling client-io-threads gave us a ~3.5x bump in throughput here.

top output (on a client) without client-io-threads is given in comment #0. Here's what it looks like after:

top - 12:07:08 up 11:37,  5 users,  load average: 1.66, 0.40, 0.17
Threads: 314 total,   7 running, 307 sleeping,   0 stopped,   0 zombie
%Cpu(s): 27.1 us, 15.9 sy,  0.0 ni, 54.5 id,  0.0 wa,  0.0 hi,  2.4 si,  0.0 st
KiB Mem : 65728904 total, 45210228 free,   973552 used, 19545124 buff/cache
KiB Swap: 32964604 total, 32964604 free,        0 used. 64288252 avail Mem

  PID USER  PR  NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
22074 root  20   0 1168900 239896  3708 R  88.0  0.4  0:21.89 glusterfs
22078 root  20   0 1168900 239896  3708 R  88.0  0.4  0:21.68 glusterfs
22076 root  20   0 1168900 239896  3708 S  87.9  0.4  0:21.62 glusterfs
22077 root  20   0 1168900 239896  3708 R  87.9  0.4  0:21.68 glusterfs
22081 root  20   0 1168900 239896  3708 R  75.5  0.4  0:17.93 glusterfs
22075 root  20   0 1168900 239896  3708 S  19.1  0.4  0:04.56 glusterfs
22273 root  20   0 1168900 239896  3708 R  18.9  0.4  0:04.48 glusterfs
22272 root  20   0 1168900 239896  3708 R  18.8  0.4  0:04.56 glusterfs
22274 root  20   0 1168900 239896  3708 S  18.7  0.4  0:04.48 glusterfs
22199 root  20   0   53752  19484   816 S   6.2  0.0  0:01.63 iozone
22186 root  20   0   53752  18536   820 S   6.1  0.0  0:01.52 iozone
[...]

The single hot thread is gone. Overall CPU utilization has gone up quite a bit (idle is down to ~54%), and that may be a concern for some deployments.

Network stats (below is from a client) for the run with client-io-threads enabled show that the 10GbE link is now being pushed to near its limit, which is what we want:

12:07:07 PM     IFACE   rxpck/s    txpck/s    rxkB/s      txkB/s  rxcmp/s  txcmp/s  rxmcst/s
12:07:17 PM      p3p1  76644.80  150502.30  15037.46  1197410.19     0.00     0.00      0.00
12:07:17 PM      p3p2      0.00       0.00      0.00        0.00     0.00     0.00      0.00
12:07:17 PM        lo      0.00       0.00      0.00        0.00     0.00     0.00      0.00
12:07:17 PM       em3      0.00       0.00      0.00        0.00     0.00     0.00      0.00
12:07:17 PM       em1      6.80       0.00      0.45        0.00     0.00     0.00      3.00
12:07:17 PM       em2      0.00       0.00      0.00        0.00     0.00     0.00      0.00
12:07:17 PM       em4      0.00       0.00      0.00        0.00     0.00     0.00      0.00
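As a sanity check on "near its limit": the txkB/s figure above converts to roughly 10GbE line rate. A quick back-of-the-envelope conversion (assuming sar reports kB as 1024 bytes):

awk 'BEGIN { txkB = 1197410.19; printf "%.1f Gbit/s\n", txkB * 1024 * 8 / 1e9 }'
# prints ~9.8 Gbit/s against a 10 Gbit/s link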
--- Comment #6 from Pranith Kumar K ---

(In reply to Manoj Pillai from comment #5)
> The single hot thread is gone. Overall CPU utilization has gone up quite a
> bit (idle is down to ~54%), and that may be a concern for some deployments.

CPU utilization will go up because more threads are doing the encoding now. Do you suggest we add a way to throttle that? I have also CCed Xavi on the bug to get his inputs.

> Network stats (below is from a client) for the run with client-io-threads
> enabled show that the 10GbE link is now being pushed to near its limit,
> which is what we want:
> [...]

--- Comment #7 from Manoj Pillai ---

(In reply to Pranith Kumar K from comment #6)
> CPU utilization will go up because more threads are doing the encoding now.
> Do you suggest we add a way to throttle that? I have also CCed Xavi on the
> bug to get his inputs.

Right, CPU utilization is expected to go up as you scale to higher throughput. I'm just saying that client-side CPU utilization can be a concern depending on how CPU-hungry the applications are -- client-side CPU is primarily meant for them.

Do we need an ability to cap CPU utilization from within gluster (rather than using something like cgroups)? Take a look at a run with client event-threads set to 2, instead of 4 as in the runs above:

iozone output:
Children see throughput for 48 initial writers = 2778952.81 kB/sec
[lower throughput compared to 3.7 GB/s with client event-threads=4]

top output:

top - 00:26:56 up 1 day, 23:57,  5 users,  load average: 1.38, 0.45, 0.38
Threads: 312 total,   3 running, 309 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15.8 us,  9.8 sy,  0.0 ni, 73.6 id,  0.0 wa,  0.0 hi,  0.8 si,  0.0 st
KiB Mem : 65728904 total, 40787784 free,   985512 used, 23955608 buff/cache
KiB Swap: 32964604 total, 32964604 free,        0 used. 64218968 avail Mem

  PID USER  PR  NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
 6003 root  20   0  954872 240892  3708 R  99.9  0.4  0:47.30 glusterfs
 6005 root  20   0  954872 240892  3708 R  99.9  0.4  0:47.21 glusterfs
 6008 root  20   0  954872 240892  3708 S  48.5  0.4  0:22.06 glusterfs
 6199 root  20   0  954872 240892  3708 S  13.5  0.4  0:06.77 glusterfs
 6200 root  20   0  954872 240892  3708 S  13.5  0.4  0:06.41 glusterfs
 6004 root  20   0  954872 240892  3708 S  13.4  0.4  0:07.30 glusterfs
 6126 root  20   0   53752  19488   816 S   5.8  0.0  0:02.61 iozone
[...]

[lower overall CPU utilization as well, compared to event-threads=4]

The 2 top threads by CPU utilization (99.9% each) seem to be the epoll threads, based on pstack output. The third is the fuse thread.

In comment #0, IIRC, the epoll threads were not doing most of the encoding. With client-io-threads on, the epoll threads seem to be doing most of the encoding and using most of the CPU. Is that true and expected? And by varying the number of client event-threads, I seem to be able to limit CPU utilization at the expense of throughput.
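For reference, the thread roles above were identified by pairing per-thread CPU numbers with stack traces. A sketch of that procedure with standard tools (the pgrep pattern is a placeholder for however you locate the glusterfs client process; in this glusterfs version the threads carry no names, so the backtrace is what tells an epoll thread from the fuse thread or an io-thread):

# per-thread CPU usage of the fuse client process
top -bH -d 10 -p "$(pgrep -f 'glusterfs.*glustervol' | head -1)"

# take the TID of a hot thread from the PID column above and dump its stack;
# event_dispatch_epoll_worker => epoll thread, fuse_thread_proc => fuse thread,
# iot_worker => io-thread
pstack <TID>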
--- Comment ---

(In reply to Manoj Pillai from comment #7)
> The 2 top threads by CPU utilization (99.9% each) seem to be the epoll
> threads, based on pstack output. The third is the fuse thread.
>
> In comment #0, IIRC, the epoll threads were not doing most of the encoding.
> With client-io-threads on, the epoll threads seem to be doing most of the
> encoding and using most of the CPU. Is that true and expected? And by
> varying the number of client event-threads, I seem to be able to limit CPU
> utilization at the expense of throughput.

There is a way to find out what thread is doing what; I believe Raghavendra Talur showed it to me once, so let me CC him. It is not clear why the epoll threads would do most of the encoding if all we are doing is writes; I would expect that to happen in the io-threads. The first write on a file, which has to acquire the lock, will be encoded by an epoll thread, after which the encoding burden should shift to the io-threads.

--- Comment from Xavier Hernandez ---

I'm not sure what exactly is happening here. Enabling client-io-threads should not move the encoding load to the epoll threads for most of the requests; the decoding load, however, should be handled mainly by the epoll threads. In your test you are doing writes, right? In that case the epoll threads should only be handling the answers, with no encoding/decoding.

2.78 GB/s in records of 64k means a total of 43,421 requests per second. A 4+2 EC volume multiplies this by 6, for a total of 260,526 requests of 16k per second. On average, each client is receiving 43,421 answers per second (assuming a balanced load). Could it be that the communications layer cannot handle so many answers?

Having 4 epoll threads means each thread handles 10,855 requests/s, while 2 threads need to handle 21,711 requests/s each. Not sure if this is too many or not.

Performing a pstack on the PID should show enough of a backtrace for us to determine which thread is causing the CPU overload. Can we have the pstack output, please?
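Spelling out the arithmetic above as a quick script (the inputs are the event-threads=2 throughput and the 4+2 layout already described in this thread; nothing else is assumed):

awk 'BEGIN {
    tput_kB    = 2778952.81      # aggregate iozone throughput, kB/s
    fops       = tput_kB / 64    # 64k application writes   -> ~43421 fops/s
    wire       = fops * 6        # 4+2 EC sends 6 fragments -> ~260526 wire writes/s
    per_client = wire / 6        # 6 clients                -> ~43421 answers/s each
    printf "fops/s=%.0f  wire writes/s=%.0f  answers per client/s=%.0f\n",
           fops, wire, per_client
    printf "per epoll thread: %.0f/s with 4 threads, %.0f/s with 2 threads\n",
           per_client / 4, per_client / 2
}'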
--- Comment from Manoj Pillai ---

Back to single-client runs for simplicity.

iozone command:
./iozone -i 0 -w -+n -c -C -e -s 10g -r 64k -t 8 -F /mnt/glustervol/f{1..8}.ioz

gluster vol info output:
[...]
Options Reconfigured:
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
performance.client-io-threads: on
transport.address-family: inet
performance.readdir-ahead: on

top output:

  PID USER  PR  NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
12160 root  20   0 1301004 249764  3868 R  90.4  0.4  1:57.51 glusterfs
12162 root  20   0 1301004 249764  3868 R  90.4  0.4  1:58.38 glusterfs
12158 root  20   0 1301004 249764  3868 R  90.2  0.4  1:58.66 glusterfs
12161 root  20   0 1301004 249764  3868 R  90.2  0.4  1:58.13 glusterfs
12165 root  20   0 1301004 249764  3868 S  72.9  0.4  1:36.63 glusterfs
12178 root  20   0 1301004 249764  3868 S  15.3  0.4  0:19.57 glusterfs
12159 root  20   0 1301004 249764  3868 S  14.3  0.4  0:19.14 glusterfs
12177 root  20   0 1301004 249764  3868 S  14.3  0.4  0:19.39 glusterfs
12175 root  20   0 1301004 249764  3868 S  14.1  0.4  0:19.01 glusterfs
12179 root  20   0 1301004 249764  3868 S  14.1  0.4  0:19.43 glusterfs
12246 root  20   0   47432  18796   168 S   6.8  0.0  0:01.55 iozone

pstack 12160:

Thread 1 (process 12160):
#0  0x00007f675ac9ed89 in gf8_muladd_04 () from /usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#1  0x00007f675acb4d77 in ec_method_encode () from /usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#2  0x00007f675ac9954d in ec_wind_writev () from /usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#3  0x00007f675ac7dc78 in ec_dispatch_mask () from /usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#4  0x00007f675ac9bbde in ec_manager_writev () from /usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#5  0x00007f675ac7d50b in __ec_manager () from /usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#6  0x00007f675ac7d6e8 in ec_resume () from /usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#7  0x00007f675ac7f6df in ec_lock_resume_shared () from /usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#8  0x00007f675ac80b83 in ec_lock_reuse () from /usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#9  0x00007f675ac9bd08 in ec_manager_writev () from /usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#10 0x00007f675ac7d50b in __ec_manager () from /usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#11 0x00007f675ac7d6e8 in ec_resume () from /usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#12 0x00007f675ac7d80f in ec_complete () from /usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#13 0x00007f675ac9917c in ec_inode_write_cbk () from /usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#14 0x00007f675aeef442 in client3_3_writev_cbk () from /usr/lib64/glusterfs/3.8.0/xlator/protocol/client.so
#15 0x00007f6768b26b30 in rpc_clnt_handle_reply () from /lib64/libgfrpc.so.0
#16 0x00007f6768b26def in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#17 0x00007f6768b22923 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#18 0x00007f675d3ec9d4 in socket_event_poll_in () from /usr/lib64/glusterfs/3.8.0/rpc-transport/socket.so
#19 0x00007f675d3ef614 in socket_event_handler () from /usr/lib64/glusterfs/3.8.0/rpc-transport/socket.so
#20 0x00007f6768db4590 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#21 0x00007f6767bbddf5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007f67675041ad in clone () from /lib64/libc.so.6
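One way to quantify how much of the encoding lands on each thread is to sample stacks repeatedly and count how often the encoder shows up, poor-man's-profiler style. A rough sketch (<PID> is the glusterfs client process; pstack's per-snapshot thread numbering is only meaningful while the thread set stays the same):

for i in $(seq 20); do
    pstack <PID> | awk '/^Thread/ { t = $2 } /ec_method_encode/ { print t }'
    sleep 0.5
done | sort | uniq -c | sort -rn
# output: how many samples caught each thread inside ec_method_encode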
--- Comment #12 from Pranith Kumar K ---

Thanks for this stack trace, Manoj. I think I understand what is happening now. While a write is in progress, any further writes that come in are put in a queue, which is picked up as soon as the previous write completes. Because the epoll thread resumes the queued write, it ends up doing the encoding.

So yes, you are correct: in this case it is happening in the epoll thread. Do you think this is bad? Is it becoming a bottleneck?

Pranith

--- Comment ---

(In reply to Pranith Kumar K from comment #12)
> So yes, you are correct: in this case it is happening in the epoll thread.
> Do you think this is bad? Is it becoming a bottleneck?

Nothing bad, I think; we just need to make sure that client.event-threads is also tuned to a good number (>= 4), in addition to having client-io-threads on. We are seeing that both are important for good EC write performance.

--- Comment #14 from Xavier Hernandez ---

Thanks Manoj for the testing. I was wrong. As Pranith says, writes need to be serialized in EC, causing the queued writes to be resumed inside the epoll threads.

In this case I'm not sure why enabling client-io-threads improves performance so much. Maybe it allows fuse to process requests faster, causing the saturation of the epoll threads later... In any case I think it would be interesting to move the execution of all these operations to other threads, maybe using a thread pool or synctasks.

Another parameter that I tweak that gives a small extra boost in my tests is rpc.outstanding-rpc-limit. The default is 64. Using 128, 256 or 512 gives some more performance (probably at the cost of more CPU usage).

--- Comment from Pranith Kumar K ---

(In reply to Xavier Hernandez from comment #14)
> Another parameter that I tweak that gives a small extra boost in my tests
> is rpc.outstanding-rpc-limit. The default is 64. Using 128, 256 or 512
> gives some more performance (probably at the cost of more CPU usage).

Xavi, in the latest releases (even in 3.7.x) the outstanding-rpc-limit is infinity, i.e. value '0', for servers. It is enabled only for NFS servers. These are the patches:
http://review.gluster.org/10267
http://review.gluster.org/10268

So I would say don't enable anything?

Pranith

--- Comment from Xavier Hernandez ---

I didn't know about this change... Maybe I did the tests with 3.6.x, I'm not sure. In any case I agree that it will be useless with the latest versions.
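For anyone who wants to check what their own build is actually using, the effective value can be inspected. A sketch (the volume name is a placeholder; the settable knob is usually exposed as server.outstanding-rpc-limit, but grepping avoids guessing the exact name):

# on a server node: what the generated volfiles contain
grep -r outstanding-rpc-limit /var/lib/glusterd/vols/glustervol/

# or, if your build exposes it to volume get:
gluster volume get glustervol server.outstanding-rpc-limit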
--- Comment from Manoj Pillai ---

Results with a 3x2 volume (6 servers), with bricks on 12-disk RAID-6.

Benchmark: distributed iozone (write test) on 6 clients:
iozone -+m ${IOZONE_CONF} -i 0 -w -+n -c -C -e -s 10g -r 64k -t 48

With the standard tuning options:
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4

iozone output:
throughput for 48 initial writers = 2052985.17 kB/sec

iozone output after adding the option performance.client-io-threads on:
throughput for 48 initial writers = 3180628.89 kB/sec

So we see a big bump from client-io-threads in the replica-2 case as well. We really want this option on irrespective of volume type.

Without client-io-threads, we see the following in top output on the clients:

top - 02:49:26 up 21:08,  0 users,  load average: 0.24, 0.47, 0.47
Threads: 296 total,   4 running, 292 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.5 us,  9.2 sy,  0.0 ni, 82.6 id,  0.0 wa,  0.0 hi,  0.8 si,  0.0 st
KiB Mem : 65728904 total, 59729480 free,   758744 used,  5240680 buff/cache
KiB Swap: 32964604 total, 32964604 free,        0 used. 64684248 avail Mem

  PID USER  PR  NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
 5156 root  20   0  736788  62396  3512 R  99.7  0.1  0:10.38 glusterfs
 5152 root  20   0  736788  62396  3512 R  20.3  0.1  0:02.10 glusterfs
 5150 root  20   0  736788  62396  3512 S  20.2  0.1  0:02.16 glusterfs
 5153 root  20   0  736788  62396  3512 S  19.9  0.1  0:02.08 glusterfs
 5151 root  20   0  736788  62396  3512 S  19.4  0.1  0:02.03 glusterfs
 5198 root  20   0   53752  19484   816 S   4.4  0.0  0:00.45 iozone
 5184 root  20   0   53752  19488   816 S   4.3  0.0  0:00.44 iozone
[...]

pstack shows that the hot thread is the fuse thread:

[...]
#37 0x00007eff9f18c3c0 in meta_writev () from /usr/lib64/glusterfs/3.8.0/xlator/meta.so
#38 0x00007effa90aea55 in fuse_write_resume () from /usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#39 0x00007effa90a55d5 in fuse_resolve_all () from /usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#40 0x00007effa90a52e8 in fuse_resolve () from /usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#41 0x00007effa90a561e in fuse_resolve_all () from /usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#42 0x00007effa90a4903 in fuse_resolve_continue () from /usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#43 0x00007effa90a5266 in fuse_resolve () from /usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#44 0x00007effa90a55fe in fuse_resolve_all () from /usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#45 0x00007effa90a5640 in fuse_resolve_and_resume () from /usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#46 0x00007effa90bb7be in fuse_thread_proc () from /usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#47 0x00007effb1659df5 in start_thread () from /lib64/libpthread.so.0
#48 0x00007effb0fa01ad in clone () from /lib64/libc.so.6

--- Comment ---

All 3.8.x bugs are now reported against version 3.8 (without the .x). For more information, see http://www.gluster.org/pipermail/gluster-devel/2016-September/050859.html

--- Comment ---

I think the final dependency for this issue, https://review.gluster.org/15036, is merged upstream. We can close this now.