Created attachment 706885 [details] Full annotation of the entire sequence to service one PUT operation It appears that we need to consider the system calls being made by the posix translator. Here is an edited/annotated strace output for the sequence of system calls the posix translator makes to create one file in response to an open() system call: <unfinis> [object-server] 184140 open("/m/g/A/g/.name.random", O_WRONLY|O_CREAT|O_EXCL|O_CLOEXEC, 0777 0.000161 [glusterfs-000] 184191 <... readv resumed> [{"\\\0\0"..., 80}, {"8e5ba"..., 131072}], 2) = 92 0.000011 [glusterfs-000] 184239 read(8, "g\\k~\323@U}\210\306~\2003<\270_", 16) = 16 #+ # posix_create() # MAKE_ENTRY_HANDLE() - posix-handle.h # posix_istat(), line 112 of MAKE_ENTRY_HANDLE() 0.000048 [glusterfs-000] 184280 lstat("/b/s/p/g/.name.random", 0x7f7183ffe8b0) = -1 ENOENT # MAKE_HANDLE_PATH() 0.000010 [glusterfs-000] 184357 lstat("/b/s/p/g/.name.random", 0x7f7183ffe900) = -1 ENOENT 0.000009 [glusterfs-000] 184394 lstat("/b/s/p/g/.name.random", 0x7f7183ffe770) = -1 ENOENT # MAKE_HANDLE_PATH() 0.000008 [glusterfs-000] 184429 lstat("/b/s/p/g", {st_mode=S_IFDIR|0755, st_size=48, ...}) = 0 # ... not sure of next three ... 0.000009 [glusterfs-000] 184469 writev(7, [{"\20\0\0"..., 16}], 1) = 16 0.000034 [glusterfs-000] 184503 readv(7, [{"l\0\0"..., 80}, {"d37c"..., 131072}], 2) = 108 0.000009 [glusterfs-000] 184572 read(8, "6\355\365\212lD\263b!\264h\335Fx\353f", 16) = 16 # ... not sure, but probably part of MAKE_ENTRY_HANDLE() somehow 0.000009 [glusterfs-000] 184606 lstat("/b/s/p/g/.name.random", 0x7f7183ffe820) = -1 ENOENT # Line 1691 of posix_create() in posix.c - posix_pstat() 0.000007 [glusterfs-000] 184641 lstat("/b/s/p/g", {st_mode=S_IFDIR|0755, st_size=48, ...}) = 0 # Line 1711 of posix_create() in posix.c - posix_pstat() 0.000008 [glusterfs-000] 184679 lstat("/b/s/p/g/.name.random", 0x7f7183ffe820) = -1 ENOENT # Line 1719 of posix_create() in posix.c 0.000117 [glusterfs-000] 184713 open("/b/s/p/g/.name.random", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE|O_CLOEXEC, 0100755) = 10 #+ # posix_gfid_set() - called on Line 1733 of posix_create() in posix.c # Line 488 of posix_gfid_set() in posix-helpers.c 0.000009 [glusterfs-000] 184856 lstat("/b/s/p/g/.name.random", {st_mode=S_IFREG|0755, st_size=0, ...}) = 0 # Line 491 of posix_gfid_set() in posix-helpers.c 0.000009 [glusterfs-000] 184901 lgetxattr("/b/s/p/g/.name.random", "trusted.gfid", 0x7f7183ffe900, 16) = -1 ENODATA # Line 505 of posix_gfid_set() in posix-helpers.c 0.000013 [glusterfs-000] 184940 lsetxattr("/b/s/p/g/.name.random", "trusted.gfid", "\x1c\x0b"..., 16, XATTR_CREATE) = 0 #+ # posix_handle_hard() # Line 214 of posix_handle_path() in posix-handle.c from macro MAKE_HANDLE_PATH, # first invocation with length 0 0.046522 [glusterfs-000] 184989 lstat("/b/s/p/.glusterfs/1c/0b/1c0b1d78-5368-4967-9fd2-fbe76b1592a6", 0x7f7183ffe680) = -1 ENOENT # Line 214 of posix_handle_path() in posix-handle.c from macro MAKE_HANDLE_PATH, # first invocation with length previously generated 0.000061 [glusterfs-000] 231594 lstat("/b/s/p/.glusterfs/1c/0b/1c0b1d78-5368-4967-9fd2-fbe76b1592a6", 0x7f7183ffe620) = -1 ENOENT # Line 441 of posix_handle_hard() in posix-handle.c 0.000099 [glusterfs-000] 231722 lstat("/b/s/p/.glusterfs/1c/0b/1c0b1d78-5368-4967-9fd2-fbe76b1592a6", 0x7f7183ffe790) = -1 ENOENT # Line 449 of posix_handle_hard() in posix-handle.c, call to posix_handle_mkdir_hashes() 0.000041 [glusterfs-000] 231937 mkdir("/b/s/p/.glusterfs/1c", 0700) = -1 EEXIST 0.000016 [glusterfs-000] 232021 mkdir("/b/s/p/.glusterfs/1c/0b", 0700) = -1 EEXIST # Line 457 of posix_handle_hard() in posix-handle.c 0.013147 [glusterfs-000] 232072 link("/b/s/p/g/.name.random", "/b/s/p/.glusterfs/1c/0b/1c0b1d78-5368-4967-9fd2-fbe76b1592a6") = 0 # Line 465 of posix_handle_hard() in posix-handle.c 0.000037 [glusterfs-000] 245286 lstat("/b/s/p/.glusterfs/1c/0b/1c0b1d78-5368-4967-9fd2-fbe76b1592a6", {st_mode=S_IFREG|0755, st_size=0, ...}) = 0 # Line 474 of posix_handle_hard() in posix-handle.c 0.000063 [glusterfs-000] 245434 lstat("/b/s/p/.glusterfs/1c/0b/1c0b1d78-5368-4967-9fd2-fbe76b1592a6", {st_mode=S_IFREG|0755, st_size=0, ...}) = 0 # posix_handle_hard() #- # posix_gfid_set() #- # Line 1740 of posix_create() in posix.c 0.000072 [glusterfs-000] 245608 chown("/b/s/p/g/.name.random", 0, 0) = 0 # Line 1749 of posix_create() in posix.c, posix_acl_xattr_set() 0.000063 [glusterfs-000] 245753 lstat("/b/s/p/g/.name.random", {st_mode=S_IFREG|0755, st_size=0, ...}) = 0 # Line 1764 of posix_create() in posix.c, posix_fdstat() 0.000011 [glusterfs-000] 245888 fstat(10, {st_mode=S_IFREG|0755, st_size=0, ...}) = 0 0.000009 [glusterfs-000] 245940 fgetxattr(10, "trusted.gfid", "\x1c\x0b"..., 16) = 16 # Line 1772 of posix_create() in posix.c, posix_pstat() 0.000009 [glusterfs-000] 245969 lstat("/b/s/p/g", {st_mode=S_IFDIR|0755, st_size=106, ...}) = 0 # posix_create() #- 0.000011 [glusterfs-000] 246014 writev(7, [{"\240\0"..., 16}, {"|\302"..., 128}, {"\34\34"..., 16}], 3) = 160 <unfinis> [glusterfs-000] 246062 readv(7, 0.061993 [object-server] 246168 <... open resumed> ) = 15 The first column is the time strace measured for the system call to take. The second column is the name of the process traced. Here the UFO object server is show invoking the open() system call on a FUSE mount point, and the glusterfs process handling it. The third column is the absolute timestamp in microseconds within a given second. The indented comments are an attempt at matching which routine in the posix translator is responsible for the system call (not entirely accurate, but a start). As you can see, there are a number of system calls the are repeated for the same value, which could be eliminated entirely. Even though they are still "short", when running millions of these for file creations removing them will save a lot of time. Additionally, one will note the fact that the creation of the gfid file in the .glusterfs directory always performs 3 stats, 2 mkdirs for directories that already exist, the link, and then another stat. This is probably the most significant piece to consider, given that a simple link() system call could be made first, optimistically assuming the hash directories are already there, and then creating them if they are not. We did a similar thing in the UFO code to avoid making unnecessary system calls at that level. See also gist: https://gist.github.com/portante/5029518
updating to mainline, as it's (still) applicable to every release
because of the large number of bugs filed against mainline version\ is ambiguous and about to be removed as a choice. If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.