829913 – single glusterfs (FUSE client) thread is bottleneck for UFO multi-thread PUT workload

Bug 829913 - single glusterfs (FUSE client) thread is bottleneck for UFO multi-thread PUT workload

Summary: single glusterfs (FUSE client) thread is bottleneck for UFO multi-thread PUT ...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	gluster-swift
Sub Component:
Version:	2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	RHGS 2.1.0
Assignee:	Luis Pabón
QA Contact:	pushpesh sharma
Docs Contact:
URL:
Whiteboard:	FutureFeature
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2012-06-07 19:40 UTC by Ben England
Modified:	2016-11-08 22:24 UTC (History)
CC List:	10 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-09-23 22:32:17 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
strace of hot threads now that performance.client-io-threads is enabled (57.18 KB, application/x-gzip) 2012-06-08 19:37 UTC, Ben England	no flags	Details
View All

Description Ben England 2012-06-07 19:40:15 UTC

Description of problem:

UFO PUT throughput for 1 MB is 1/4 XFS throughput, even when there is no network in use and aggregate CPU utilization is 50% on average, and there is no replication.

Version-Release number of selected component (if applicable):

RHS 2.0 RC1 (20120530)

How reproducible:

Just configure a single-brick with 12-drive 256-KB-stripe RAID6 LUN, install RC1, and run parallel curl threads within this one server. There is no network interaction.

Steps to Reproduce:
1. configure UFO on a single server with local 1-brick volume, using 12 swift proxy-server and object-server worker threads. Need 2 Westmere sockets, 15 GB mem, and megaraid.
2. run workload below on it, consisting of 128 threads, each thread writes 256 1-MB objects to its container.
3. with "top" utility, press "H" to see per-thread CPU consumption.

Actual results:

Throughput = 125 files/sec = 125 MB/s, hottest thread is in glusterfs. well above 90% consistently. So we have a CPU bottleneck. The storage is not a bottleneck, XFS can deliver 525 files/sec, 4x throughput for same workload using files instead of objects. Writing same files to gluster mountpoint delivers 400 MB/s, about 80% of XFS throughput. So there must be something that UFO is doing differently that causes glusterfs thread to work harder (xattrs?).

Expected results:

We want to see an I/O bottleneck. We are well below 10-GbE network speeds.

Additional info:

VERBOSE=1 ./parallel_curl.py PUT 128 256 1024 localhost localhost

benchmark available at
http://perf1.lab.bos.redhat.com/bengland/laptop/matte/ufo/parallel_curl.py

doing PUT 128 threads 256 objects/thread 1024 KB/object
clients: ['localhost']
servers: ['localhost']
authentication tokens done after 0.00 sec
upload files ready after 0.10 sec
container put finished after 42.67 sec
started to create curl object threads after 42.67 sec
curl threads started after 43.71 sec
curl threads stopped after 304.81 sec
thread startup time = 0.40%
elapsed time = 262.15 sec
throughput = 125.00 objs/sec
transfer rate = 125.00 MB/s

Will add more detail on what glusterfs thread is doing later.

Comment 2 Ben England 2012-06-07 20:23:02 UTC

Both container-server and object-server are issuing filesystem calls, this may increase load on glusterfs.

during DELETEs, when it's the hot thread as well, here's what it is doing in a few-second interval.  One problem is that the open, fstat, mmap, munmap and close are reading /proc/pid/status to get group information which could be cached, a known problem (Brian Foster already put in bz on this).

[root@gprfs014 ~]# strace -p 10221 -c 
Process 10221 attached - interrupt to quit
^CProcess 10221 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 44.26    0.009011           1      7130           read
 22.93    0.004667           0     11791           writev
 15.03    0.003059           0      6965           munmap
  5.31    0.001081           0      6964           open
  3.49    0.000711           0      8714           readv
  2.86    0.000582           0      1469       194 futex
  2.84    0.000578           0      6964           mmap
  1.69    0.000345           0      6965           close
  1.59    0.000323           0      6964           fstat
------ ----------- ----------- --------- --------- ----------------
100.00    0.020357                 63926       194 total

I think that mostly it should just be doing reads and writes on /dev/fuse and sockets connected to glusterfsd processes, not much else.  It should also only be doing 2 reads to receive an RPC message and 1 write to transmit.

Here's the cycle that repeats forever in a HTTP PUT strace of glusterfs hot thread:

read(11, "P\254[!{\344\202kk\\\20\243?'&\235", 16) = 16
open("/proc/32452/status", O_RDONLY)    = 12
fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f733e834000
read(12, "Name:\tswift-object-se\nState:\tS ("..., 1024) = 903
close(12)                               = 0
munmap(0x7f733e834000, 4096)            = 0
writev(9, [{"\200\0\1@", 4}, {"\0\36Q\n\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\33\0\5\363\227\0\0\0\34"..., 68}, {"\0\0\0\0\
0\0\0\0\0\0\0\0\0\0\0\0\4r/\356\223eFQ\207H1\375\21\261\266\375"..., 252}], 3) = 324
readv(7, [{",\0\0\0\1\0\0\0\332\230%\0\0\0\0\0@AJ9s\177\0\0\0\0\0\0\0\0\0\0"..., 80}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\
0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072}], 2) = 44

Note we can get rid entirely of calls 2-6 (5 system calls) by just caching group information from /proc/pid status.  After that it doesn't look so bad.  If I include elapsed times it looks like this:

20:07:59.608918 readv(7, [{"@\0\0\0\31\0\0\0T\262@\0\0\0\0\0d\224n9s\177\0\0\0\0\0\0\0\0\0\0"..., 80}, {"ent-Lengthq\2U\00710
48576q\3U\4ETagq\4U"..., 131072}], 2) = 64 <0.000012>
20:07:59.608994 open("/proc/32456/status", O_RDONLY) = 12 <0.000015>
20:07:59.609041 fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 <0.000009>
20:07:59.609097 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f733e834000 <0.000011>
20:07:59.609139 read(12, "Name:\tswift-object-se\nState:\tS ("..., 1024) = 903 <0.000041>
20:07:59.609214 close(12)               = 0 <0.000011>
20:07:59.609252 munmap(0x7f733e834000, 4096) = 0 <0.000014>
20:07:59.609314 writev(9, [{"\200\0\0`", 4}, {"\0002dO\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\17\0\5\363\227\0\0\0\34"..., 6
8}, {"\0\210p\21\301lD\310\243\201\276>*\305^\214\0\0\0\0\0\0\0h\0\0\0\0", 28}], 3) = 100 <0.000017>
20:07:59.609395 writev(7, [{"\20\0\0\0\0\0\0\0T\262@\0\0\0\0\0", 16}], 1) = 16 <0.000013>

20:07:59.610109 readv(7, [{"E\0\0\0\26\0\0\0V\262@\0\0\0\0\0\0\372(9s\177\0\0\0\0\0\0\0\0\0\0"..., 80}, {"ent-Lengthq\2U\0071
048576q\3U\4ETagq\4U"..., 131072}], 2) = 69 <0.000012>
20:07:59.610223 open("/proc/32446/status", O_RDONLY) = 12 <0.000019>
20:07:59.610291 fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 <0.000009>
20:07:59.610347 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f733e834000 <0.000013>
20:07:59.610398 read(12, "Name:\tswift-container\nState:\tS ("..., 1024) = 901 <0.000041>
20:07:59.610478 close(12)               = 0 <0.000011>
20:07:59.610524 munmap(0x7f733e834000, 4096) = 0 <0.000013>
20:07:59.610590 writev(9, [{"\200\0\0t", 4}, {"\0002dQ\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\22\0\5\363\227\0\0\0\34"..., 6
8}, {"z2{ViTG^\240\302\177\2202\22\341\210\0\0\0\1\0\0\0\24user.swi"..., 48}], 3) = 120 <0.000019>
PRIVATE, 2, NULL) = 0 <0.000020>
20:07:59.582046 futex(0x14d1a50, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000010>
20:07:59.582086 futex(0x14c1bf0, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000011>

You can ignore the futexes since these appear to be triggered by the act of stracing with elapsed times.  But it appears that at least 50% of elapsed time in system calls is spent in unnecessary activity.

Comment 3 Anand Avati 2012-06-07 20:46:03 UTC

Can you observe the effects of turning on io-threads in the client -

# gluster volume set <name> performance.client-io-threads on

Avati

Comment 4 Junaid 2012-06-08 07:13:04 UTC

As Avati mentioned, we must try the testing with client-io-threads on. If the performance improves then it means that network transactions made by GlusterFS with GlusterFSd was the bottleneck which the io-threads will solve the problem else we can imply that fuse is the bottleneck.

Comment 5 Ben England 2012-06-08 19:37:03 UTC

Created attachment 590497 [details]
strace of hot threads now that performance.client-io-threads is enabled

Now I think I know why UFO is slow.  Now that I enabled performance.client-io-threads, behavior is different.  When I run 

./parallel_curl.py PUT 128 256 1024 gprfs013 gprfs013-10ge

I see 2 hot glusterfs threads instead of 1.  The 2nd thread just does system calls to read /proc/pid/status, and this has already been discussed elsewhere.

The second thread does mostly readv and writev, see attachment.  It should be idle but it is working very hard because the I/O sizes are all small, average size was 115 bytes for the 1000 message sample.    Does this mean UFO is not aggregating data transfers to filesystem?  Can we get it to do bigger reads and writes?  We know Gluster doesn't do well at tiny reads and writes.

Also, this thread is doing the same readv sequence every time:

readv(27, [{"\200\0\0$", 4}], 1)        = 4
readv(27, [{"\0$\235r\0\0\0\1", 8}], 1) = 8
readv(27, [{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", N}], 1) = N

If we could just collapse the 4- and 8-byte reads together, this would improve efficiency, but I think the big problem is that the average transfer size from UFO to native FUSE client is too small.

Comment 6 Ben England 2012-06-08 20:12:22 UTC

Correction to my previous statement -- there might not be small I/O sizes coming from UFO server.  When I looked at object-server threads they were doing large enough writes (64-KB), but there were a ton of other system calls happening as well, this is what is threw me off.  Looks like I'll be spending some quality time with strace ;-)

Comment 7 Junaid 2012-09-12 05:07:06 UTC

Ben,

What is the latest observation related to this issue?

Comment 8 Peter Portante 2012-11-19 19:56:31 UTC

Based on my current understanding of the code, there are still a lot of inefficiencies in the UFO code in terms of the system calls made. This is due in part to both the OpenStack Swift code and the UFO glue code that we wrote. It would be worth our time to make sure we can account for all the system calls being made by our UFO stack outside of GlusterFS before trying to nail down inefficiencies in the GlusterFS layer.

Comment 9 Luis Pabón 2013-07-17 01:01:32 UTC

RHS 2.0 UFO Bugs are being set to low priority.

Comment 13 Scott Haines 2013-09-23 22:32:17 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html

Note You need to log in before you can comment on or make changes to this bug.