Bug 829913
| Summary: | single glusterfs (FUSE client) thread is bottleneck for UFO multi-thread PUT workload |
|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage |
| Component: | gluster-swift |
| Status: | CLOSED ERRATA |
| Severity: | low |
| Priority: | low |
| Version: | 2.0 |
| Target Release: | RHGS 2.1.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | FutureFeature |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Reporter: | Ben England <bengland> |
| Assignee: | Luis Pabón <lpabon> |
| QA Contact: | pushpesh sharma <psharma> |
| CC: | aavati, bbandari, gluster-bugs, madam, perfbz, pportant, rhs-bugs, sdharane, vagarwal, vbellur |
| Last Closed: | 2013-09-23 22:32:17 UTC |
| Attachments: | strace of hot threads now that performance.client-io-threads is enabled (attachment 590497) |
Description (Ben England, 2012-06-07 19:40:15 UTC)
Both container-server and object-server are issuing filesystem calls, which may increase load on glusterfs. During DELETEs, when the glusterfs thread is the hot thread as well, here is what it is doing over a few-second interval. One problem is that the open, fstat, mmap, munmap and close calls are part of reading /proc/pid/status to get group information, which could be cached, a known problem (Brian Foster already filed a bz on this).

```
[root@gprfs014 ~]# strace -p 10221 -c
Process 10221 attached - interrupt to quit
^CProcess 10221 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 44.26    0.009011           1      7130           read
 22.93    0.004667           0     11791           writev
 15.03    0.003059           0      6965           munmap
  5.31    0.001081           0      6964           open
  3.49    0.000711           0      8714           readv
  2.86    0.000582           0      1469       194 futex
  2.84    0.000578           0      6964           mmap
  1.69    0.000345           0      6965           close
  1.59    0.000323           0      6964           fstat
------ ----------- ----------- --------- --------- ----------------
100.00    0.020357                 63926       194 total
```

I think it should mostly just be doing reads and writes on /dev/fuse and on sockets connected to glusterfsd processes, not much else. It should also only be doing 2 reads to receive an RPC message and 1 write to transmit. Here is the cycle that repeats forever in an HTTP PUT strace of the glusterfs hot thread:

```
read(11, "P\254[!{\344\202kk\\\20\243?'&\235", 16) = 16
open("/proc/32452/status", O_RDONLY) = 12
fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f733e834000
read(12, "Name:\tswift-object-se\nState:\tS ("..., 1024) = 903
close(12) = 0
munmap(0x7f733e834000, 4096) = 0
writev(9, [{"\200\0\1@", 4}, {"\0\36Q\n\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\33\0\5\363\227\0\0\0\34"..., 68}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\4r/\356\223eFQ\207H1\375\21\261\266\375"..., 252}], 3) = 324
readv(7, [{",\0\0\0\1\0\0\0\332\230%\0\0\0\0\0@AJ9s\177\0\0\0\0\0\0\0\0\0\0"..., 80}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072}], 2) = 44
```

Note that we could get rid of the open/fstat/mmap/read/close/munmap sequence (calls 2-7) entirely by just caching the group information from /proc/pid/status. After that it doesn't look so bad.
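For illustration only, here is a minimal sketch of the caching idea in Python. It is purely hypothetical: the actual fix would live in the FUSE client's C code, and the TTL value is an assumption, not a GlusterFS tunable.

```python
import time

# Illustrative only: cache a PID's supplementary groups so repeated
# requests don't re-run the open/fstat/mmap/read/close/munmap sequence
# on /proc/<pid>/status. The 2-second TTL is an arbitrary assumption.
_CACHE_TTL = 2.0
_group_cache = {}  # pid -> (expires_at, [gid, ...])

def groups_for_pid(pid):
    now = time.monotonic()
    entry = _group_cache.get(pid)
    if entry and entry[0] > now:
        return entry[1]          # cache hit: zero system calls
    groups = []
    with open("/proc/%d/status" % pid) as f:  # one open+read on a miss
        for line in f:
            if line.startswith("Groups:"):
                groups = [int(g) for g in line.split()[1:]]
                break
    _group_cache[pid] = (now + _CACHE_TTL, groups)
    return groups
```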
If I include elapsed times, it looks like this:

```
20:07:59.608918 readv(7, [{"@\0\0\0\31\0\0\0T\262@\0\0\0\0\0d\224n9s\177\0\0\0\0\0\0\0\0\0\0"..., 80}, {"ent-Lengthq\2U\0071048576q\3U\4ETagq\4U"..., 131072}], 2) = 64 <0.000012>
20:07:59.608994 open("/proc/32456/status", O_RDONLY) = 12 <0.000015>
20:07:59.609041 fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 <0.000009>
20:07:59.609097 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f733e834000 <0.000011>
20:07:59.609139 read(12, "Name:\tswift-object-se\nState:\tS ("..., 1024) = 903 <0.000041>
20:07:59.609214 close(12) = 0 <0.000011>
20:07:59.609252 munmap(0x7f733e834000, 4096) = 0 <0.000014>
20:07:59.609314 writev(9, [{"\200\0\0`", 4}, {"\0002dO\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\17\0\5\363\227\0\0\0\34"..., 68}, {"\0\210p\21\301lD\310\243\201\276>*\305^\214\0\0\0\0\0\0\0h\0\0\0\0", 28}], 3) = 100 <0.000017>
20:07:59.609395 writev(7, [{"\20\0\0\0\0\0\0\0T\262@\0\0\0\0\0", 16}], 1) = 16 <0.000013>
20:07:59.610109 readv(7, [{"E\0\0\0\26\0\0\0V\262@\0\0\0\0\0\0\372(9s\177\0\0\0\0\0\0\0\0\0\0"..., 80}, {"ent-Lengthq\2U\0071048576q\3U\4ETagq\4U"..., 131072}], 2) = 69 <0.000012>
20:07:59.610223 open("/proc/32446/status", O_RDONLY) = 12 <0.000019>
20:07:59.610291 fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 <0.000009>
20:07:59.610347 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f733e834000 <0.000013>
20:07:59.610398 read(12, "Name:\tswift-container\nState:\tS ("..., 1024) = 901 <0.000041>
20:07:59.610478 close(12) = 0 <0.000011>
20:07:59.610524 munmap(0x7f733e834000, 4096) = 0 <0.000013>
20:07:59.610590 writev(9, [{"\200\0\0t", 4}, {"\0002dQ\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\22\0\5\363\227\0\0\0\34"..., 68}, {"z2{ViTG^\240\302\177\2202\22\341\210\0\0\0\1\0\0\0\24user.swi"..., 48}], 3) = 120 <0.000019>
[...]PRIVATE, 2, NULL) = 0 <0.000020>
20:07:59.582046 futex(0x14d1a50, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000010>
20:07:59.582086 futex(0x14c1bf0, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000011>
```

You can ignore the futexes, since those appear to be triggered by the act of stracing with elapsed times. But it appears that at least 50% of the elapsed time in system calls is spent in unnecessary activity.

Can you observe the effects of turning on io-threads in the client?

```
# gluster volume set <name> performance.client-io-threads on
```

Avati

As Avati mentioned, we must try the testing with client-io-threads on. If the performance improves, it means that the network transactions GlusterFS makes with glusterfsd were the bottleneck, which io-threads will solve; otherwise we can infer that FUSE is the bottleneck.

Created attachment 590497 [details]
strace of hot threads now that performance.client-io-threads is enabled
I think I now know why UFO is slow. With performance.client-io-threads enabled, the behavior is different. When I run
```
./parallel_curl.py PUT 128 256 1024 gprfs013 gprfs013-10ge
```
I see 2 hot glusterfs threads instead of 1. One of them just does system calls to read /proc/pid/status, which has already been discussed elsewhere.
The other thread does mostly readv and writev; see the attachment. It should be idle, but it is working very hard because the I/O sizes are all small: the average size was 115 bytes over the 1000-message sample (a sketch of how such an average can be measured follows below). Does this mean UFO is not aggregating data transfers to the filesystem? Can we get it to do bigger reads and writes? We know Gluster doesn't do well with tiny reads and writes.
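As an aside on reproducibility, here is a hypothetical helper (not part of the tooling used in this bug) showing one way such an average transfer size could be computed from an `strace -T` capture:

```python
import re
import sys

# Hypothetical helper: estimate the average readv/writev transfer size
# from strace output fed on stdin, e.g.
#   python avg_xfer.py < strace.log
# Matches lines like:  readv(7, [...], 2) = 44 <0.000012>
PAT = re.compile(r'\b(readv|writev)\(.*\)\s*=\s*(\d+)')

def main():
    sizes = []
    for line in sys.stdin:
        m = PAT.search(line)
        if m:
            sizes.append(int(m.group(2)))  # bytes actually transferred
    if sizes:
        print("calls=%d avg_bytes=%.1f" % (len(sizes), sum(sizes) / len(sizes)))

if __name__ == "__main__":
    main()
```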
Also, this thread is doing the same readv sequence every time:
readv(27, [{"\200\0\0$", 4}], 1) = 4
readv(27, [{"\0$\235r\0\0\0\1", 8}], 1) = 8
readv(27, [{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", N}], 1) = N
If we could just collapse the 4- and 8-byte reads together, this would improve efficiency (a sketch of the coalescing idea follows below), but I think the bigger problem is that the average transfer size from UFO to the native FUSE client is too small.
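The 4-byte read appears to be the standard SUN RPC record-marking fragment header (here 0x80000024: last-fragment bit plus a 36-byte length), and the 8-byte read appears to be the XID and message type. Under those assumptions, here is a minimal Python sketch contrasting the per-field reads in the trace with a buffered, coalesced read; the real client is C, so this illustrates the technique only and is not GlusterFS code:

```python
import struct

def _read_exact(sock, n):
    """recv() until exactly n bytes have arrived."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("peer closed connection")
        buf += chunk
    return buf

def read_record_naive(sock):
    """Three system calls per record, mirroring the trace above:
    4-byte fragment header, 8-byte XID+type, then the payload."""
    hdr = _read_exact(sock, 4)
    length = struct.unpack(">I", hdr)[0] & 0x7FFFFFFF  # strip last-frag bit
    head = _read_exact(sock, 8)
    body = _read_exact(sock, length - 8)
    return head + body

def read_record_buffered(rfile):
    """Coalesced version: rfile is sock.makefile('rb'), so for small
    records the header and payload come out of one underlying recv()."""
    hdr = rfile.read(4)
    length = struct.unpack(">I", hdr)[0] & 0x7FFFFFFF
    return rfile.read(length)
```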
Correction to my previous statement -- there might not be small I/O sizes coming from the UFO server after all. When I looked at the object-server threads, they were doing large enough writes (64 KB), but there were a ton of other system calls happening as well, which is what threw me off. Looks like I'll be spending some quality time with strace ;-)

Ben, what is the latest observation related to this issue?

Based on my current understanding of the code, there are still a lot of inefficiencies in the UFO code in terms of the system calls made. This is due in part to both the OpenStack Swift code and the UFO glue code that we wrote. It would be worth our time to make sure we can account for all the system calls made by our UFO stack outside of GlusterFS before trying to nail down inefficiencies in the GlusterFS layer.

RHS 2.0 UFO bugs are being set to low priority.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html