Bug 1194446

Summary: GFS2: mkfs.gfs2 scalability issue on large devices
Product: Red Hat Enterprise Linux 7
Component: gfs2-utils
Version: 7.1
Reporter: Nate Straz <nstraz>
Assignee: Andrew Price <anprice>
QA Contact: cluster-qe <cluster-qe>
CC: cluster-maint, gfs2-maint
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: Bug Fix
Fixed In Version: gfs2-utils-3.1.8-1.el7
Last Closed: 2015-11-19 03:53:50 UTC
Bug Depends On: 1184482
Bug Blocks: 1111393, 1497636
Attachments: Patch submitted upstream

Description Nate Straz 2015-02-19 20:08:22 UTC
Description of problem:

Running mkfs.gfs2 on a 250TB device took 7 hours last night and spent nearly all of that time on the CPU.

# /usr/bin/time mkfs -t gfs2 -p lock_nolock -O /dev/XL/gfs2
24723.05user 59.37system 7:08:46elapsed 96%CPU (0avgtext+0avgdata 417080maxresident)k

I collected perf data on file systems from 100GB to 100TB and found that lgfs2_rgrps_append quickly dominates all other functions.

==> perf-100G.txt <==
# Overhead    Command                        Symbol
# ........  .........  ............................
     3.15%  mkfs.gfs2  [.] gfs2_disk_hash
     0.30%  mkfs.gfs2  [.] gfs2_meta_header_out_bh

==> perf-1T.txt <==
# Overhead    Command                        Symbol
# ........  .........  ............................
    17.48%  mkfs.gfs2  [.] lgfs2_rgrps_append
     1.29%  mkfs.gfs2  [.] gfs2_disk_hash
     0.12%  mkfs.gfs2  [.] lgfs2_rgrp_write

==> perf-10T.txt <==
# Overhead    Command                        Symbol
# ........  .........  ............................
    84.46%  mkfs.gfs2  [.] lgfs2_rgrps_append
     0.03%  mkfs.gfs2  [.] lgfs2_rgrp_bitbuf_alloc

==> perf-100T.txt <==
# Overhead    Command                        Symbol
# ........  .........  ............................
    98.85%  mkfs.gfs2  [.] lgfs2_rgrps_append
     0.00%  mkfs.gfs2  [.] __errno_location@plt

perf annotate:

Sorted summary for file /usr/sbin/mkfs.gfs2
----------------------------------------------

   97.15 /usr/src/debug/gfs2-utils-3.1.7/gfs2/libgfs2/../../gfs2/include/osi_tree.h:320
    2.81 /usr/src/debug/gfs2-utils-3.1.7/gfs2/libgfs2/../../gfs2/include/osi_tree.h:321


313 static inline struct osi_node *osi_last(struct osi_root *root)
...
320         while (n->osi_right)
321                 n = n->osi_right;
322         return n;

It looks like the tree being used isn't balanced and has probably degenerated into a list, which is traversed from the root every time an RG is added. That list ends up 400k entries deep on a 100TB file system with 256MB RGs, so appending all N resource groups costs O(N^2) node visits (on the order of 8x10^10 pointer dereferences for N = 400k), which matches the CPU-bound profile above.
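For illustration, here is a minimal C sketch of that degenerate case and one plausible remedy: cache a pointer to the last resource group so each append is O(1) instead of re-walking the right spine. The *_sketch names are hypothetical, the real osi_tree.h is a red-black tree with rebalancing, and this may differ from the actual upstream fix:

#include <stddef.h>

struct osi_node {
	struct osi_node *osi_left;
	struct osi_node *osi_right;
};

/* The hot path from the profile: walk to the rightmost node.
 * On a right-degenerate tree this is O(n) per call. */
static struct osi_node *osi_last_sketch(struct osi_node *root)
{
	struct osi_node *n = root;

	while (n && n->osi_right)
		n = n->osi_right;
	return n;
}

/* Hypothetical container that caches the rightmost node so the
 * per-append osi_last() walk is no longer needed. */
struct rgrps_sketch {
	struct osi_node *root;
	struct osi_node *last;	/* cached rightmost node */
};

static void rgrps_append_sketch(struct rgrps_sketch *rgs, struct osi_node *newn)
{
	newn->osi_left = NULL;
	newn->osi_right = NULL;
	if (rgs->last == NULL)
		rgs->root = newn;	/* first resource group */
	else
		rgs->last->osi_right = newn;
	rgs->last = newn;	/* O(1) append, no tree walk */
}

With the cached tail, building N resource groups costs O(N) instead of O(N^2) node visits.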


Version-Release number of selected component (if applicable):
gfs2-utils-3.1.7-6.el7.x86_64

How reproducible:
Easily

Steps to Reproduce:
1. perf record mkfs.gfs2 -O -p lock_nolock -j 1 /dev/foo
2. perf report --stdio -d mkfs.gfs2 | head -n 30

Actual results:
mkfs.gfs2 is CPU-bound in lgfs2_rgrps_append and takes hours on large devices (7 hours at ~96% CPU on a 250TB device).

Expected results:
mkfs.gfs2 completes in minutes, limited by device I/O rather than CPU time.

Comment 1 Andrew Price 2015-02-20 00:36:47 UTC
Created attachment 993801
Patch submitted upstream

With this patch I'm seeing much better performance and far lower CPU usage on a 250T volume:

Before: 13034.77user 41.25system 3:47:21elapsed 95%CPU (0avgtext+0avgdata 416248maxresident)k
2840inputs+41337136outputs (0major+449613minor)pagefaults 0swaps

After: 7.07user 32.58system 29:16.12elapsed 2%CPU (0avgtext+0avgdata 416308maxresident)k
3368inputs+41337136outputs (1major+105705minor)pagefaults 0swaps
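For scale: user CPU time drops from ~13035s to ~7s (roughly 1800x less), elapsed time falls from 3h47m to about 29 minutes, and the run becomes I/O-bound (2% CPU) instead of CPU-bound (95% CPU).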

Comment 3 Andrew Price 2015-02-24 16:15:00 UTC
Patch is now upstream and will land in RHEL7 with the gfs2-utils rebase.

Comment 6 Nate Straz 2015-08-25 11:26:38 UTC
BEFORE with gfs2-utils-3.1.7-6.el7.x86_64

[root@dash-02 ~]# /usr/bin/time blkdiscard /dev/fsck/large
0.00user 66.44system 5:10.35elapsed 21%CPU (0avgtext+0avgdata 632maxresident)k
40inputs+0outputs (1major+198minor)pagefaults 0swaps
[root@dash-02 ~]# /usr/bin/time mkfs.gfs2 -O -p lock_nolock -j 1 -K /dev/fsck/large
/dev/fsck/large is a symbolic link to /dev/dm-16
This will destroy any data on /dev/dm-16
Device:                    /dev/fsck/large
Block size:                4096
Device size:               256000.00 GB (67108866048 blocks)
Filesystem size:           255999.98 GB (67108860932 blocks)
Journals:                  1
Resource groups:           1023251
Locking protocol:          "lock_nolock"
Lock table:                ""
UUID:                      ec8d8197-816a-fafc-330f-1d2e3543dfb3
20854.75user 21.14system 5:48:04elapsed 99%CPU (0avgtext+0avgdata 416772maxresident)k
2960inputs+41387248outputs (0major+576524minor)pagefaults 0swaps
[root@dash-02 ~]# /usr/bin/time mkfs.gfs2 -O -p lock_nolock -j 1 -K /dev/fsck/large
It appears to contain an existing filesystem (gfs2)
/dev/fsck/large is a symbolic link to /dev/dm-16
This will destroy any data on /dev/dm-16
Device:                    /dev/fsck/large
Block size:                4096
Device size:               256000.00 GB (67108866048 blocks)
Filesystem size:           255999.98 GB (67108860932 blocks)
Journals:                  1
Resource groups:           1023251
Locking protocol:          "lock_nolock"
Lock table:                ""
UUID:                      8e15a54a-dc63-acb5-c92b-1a16059f7218
21262.27user 20.98system 5:54:51elapsed 99%CPU (0avgtext+0avgdata 416812maxresident)k
2952inputs+41387248outputs (0major+633204minor)pagefaults 0swaps


AFTER with gfs2-utils-3.1.8-4.el7.x86_64

[root@dash-03 ~]# blkdiscard /dev/fsck/large
[root@dash-03 ~]# /usr/bin/time mkfs.gfs2 -O -p lock_nolock -j 1 -K /dev/fsck/large
/dev/fsck/large is a symbolic link to /dev/dm-16
This will destroy any data on /dev/dm-16
Device:                    /dev/fsck/large
Block size:                4096
Device size:               256000.00 GB (67108866048 blocks)
Filesystem size:           255999.98 GB (67108860932 blocks)
Journals:                  1
Resource groups:           1023251
Locking protocol:          "lock_nolock"
Lock table:                ""
UUID:                      feb57a73-338d-70bf-eacc-212d200340b1
2.39user 24.40system 3:07.09elapsed 14%CPU (0avgtext+0avgdata 416800maxresident)k
2960inputs+41387248outputs (0major+170243minor)pagefaults 0swaps
[root@dash-03 ~]# /usr/bin/time mkfs.gfs2 -O -p lock_nolock -j 1 -K /dev/fsck/large
It appears to contain an existing filesystem (gfs2)
/dev/fsck/large is a symbolic link to /dev/dm-16
This will destroy any data on /dev/dm-16
Device:                    /dev/fsck/large
Block size:                4096
Device size:               256000.00 GB (67108866048 blocks)
Filesystem size:           255999.98 GB (67108860932 blocks)
Journals:                  1
Resource groups:           1023251
Locking protocol:          "lock_nolock"
Lock table:                ""
UUID:                      9d17c0bc-57b7-dcca-fe5a-667ae8e98ced
2.31user 24.08system 1:13.30elapsed 36%CPU (0avgtext+0avgdata 416816maxresident)k
2952inputs+41387248outputs (0major+193953minor)pagefaults 0swaps
[root@dash-03 ~]# /usr/bin/time mkfs.gfs2 -O -p lock_nolock -j 1 -K /dev/fsck/large
It appears to contain an existing filesystem (gfs2)
/dev/fsck/large is a symbolic link to /dev/dm-16
This will destroy any data on /dev/dm-16
Device:                    /dev/fsck/large
Block size:                4096
Device size:               256000.00 GB (67108866048 blocks)
Filesystem size:           255999.98 GB (67108860932 blocks)
Journals:                  1
Resource groups:           1023251
Locking protocol:          "lock_nolock"
Lock table:                ""
UUID:                      9f8bb1f9-7a75-13ab-8df9-dadfefa59650
2.47user 21.97system 1:06.49elapsed 36%CPU (0avgtext+0avgdata 416808maxresident)k
2952inputs+41387248outputs (0major+204423minor)pagefaults 0swaps
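For comparison across the two builds: the 3.1.7-6 runs took 5:48:04 and 5:54:51 elapsed at 99% CPU (~20900s of user time each), while the 3.1.8-4 runs took 3:07, 1:13 and 1:06 elapsed at 14-36% CPU (~2.4s of user time), so user CPU time dropped by roughly four orders of magnitude and the runs became I/O-bound.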

Comment 8 errata-xmlrpc 2015-11-19 03:53:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2178.html