1440269 – mkfs.gfs2 is slow on ppc64le



Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	gfs2-utils
Sub Component:
Version:	7.4
Hardware:	ppc64le
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Andrew Price
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-04-07 17:36 UTC by Nate Straz
Modified:	2017-08-01 21:57 UTC
CC List:	3 users
Fixed In Version:	gfs2-utils-3.1.10-3.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-08-01 21:57:28 UTC
Target Upstream Version:
Embargoed:

Attachments
iowatcher graph comparison (358.04 KB, image/png) 2017-04-18 14:56 UTC, Andrew Price

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2017:2226	0	normal	SHIPPED_LIVE	gfs2-utils bug fix and enhancement update	2017-08-01 18:43:08 UTC

Description Nate Straz 2017-04-07 17:36:11 UTC

Description of problem:

When I try to mkfs.gfs2 a 300TB device, it takes over an hour to run.

[root@gfs-p8-02-lp01 ~]# /usr/bin/time mkfs.gfs2 -O -p lock_nolock -j 1 -r 2048 -K -q /dev/fsck/perf
/dev/fsck/perf is a symbolic link to /dev/dm-9
This will destroy any data on /dev/dm-9
2.37user 15.53system 1:59:25elapsed 0%CPU (0avgtext+0avgdata 163264maxresident)k
58144896inputs+58141696outputs (1major+4779minor)pagefaults 0swaps


Version-Release number of selected component (if applicable):
gfs2-utils-3.1.10-2.el7.ppc64le

How reproducible:
Easily

Steps to Reproduce:
1. /usr/bin/time mkfs.gfs2 -O -p lock_nolock -j 1 -r 2048 -K -q /dev/fsck/perf

Actual results:


Expected results:


Additional info:
This was on dev's ppc test bed.  I haven't re-run this on our QE system yet.  For comparison, gfs2-utils-3.1.9-3.el7 took 3:21.49 to mkfs.gfs2 256TB.  I'm going to re-run on QE as soon as possible.
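
To put the two elapsed times quoted so far on a common scale, here is a minimal sketch that converts the elapsed field printed by /usr/bin/time into seconds. The helper name parse_elapsed is made up for illustration, and the two runs covered different device sizes (300TB and 256TB), so the ratio is only a rough indication.

```python
def parse_elapsed(text: str) -> float:
    """Convert an elapsed field such as '1:59:25' or '3:21.49' to seconds."""
    seconds = 0.0
    # Each ':' separated field is worth 60 times the one to its right.
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


slow = parse_elapsed("1:59:25")  # gfs2-utils-3.1.10-2.el7 on ppc64le, 300TB
fast = parse_elapsed("3:21.49")  # gfs2-utils-3.1.9-3.el7, 256TB
print(f"{slow:.0f}s vs {fast:.2f}s, about {slow / fast:.1f}x slower")
```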

Comment 2 Nate Straz 2017-04-08 12:50:31 UTC

I ran this on the QE test systems and it completed in 90 seconds.

[root@dash-02 ~]# /usr/bin/time mkfs.gfs2 -O -K -p lock_nolock -j 1 -r 2048 /dev/dash/fsck
Warning: device is not properly aligned. This may harm performance.
/dev/dash/fsck is a symbolic link to /dev/dm-15
This will destroy any data on /dev/dm-15
Adding journals: Done
Building resource groups: Done
Creating quota file: Done
Writing superblock and syncing: Done
Device:                    /dev/dash/fsck
Block size:                4096
Device size:               307199.97 GB (80530630656 blocks)
Filesystem size:           307198.75 GB (80530309220 blocks)
Journals:                  1
Resource groups:           153605
Locking protocol:          "lock_nolock"
Lock table:                ""
UUID:                      40829345-5fda-4dda-af04-70f63475a2df
1.20user 18.84system 1:30.52elapsed 22%CPU (0avgtext+0avgdata 164424maxresident)k
4304inputs+40845320outputs (1major+73500minor)pagefaults 0swaps
[root@dash-02 ~]# rpm -q gfs2-utils
gfs2-utils-3.1.10-2.el7.x86_64
[root@dash-02 ~]# mkfs.gfs2 -V
mkfs.gfs2 master (built Mar 29 2017 09:51:30)
Copyright (C) Red Hat, Inc.  2004-2010  All rights reserved.

It may be because of the amount of memory on the system.

[root@dash-02 sts-rhel7.4]# free -h
              total        used        free      shared  buff/cache   available
Mem:           125G        979M        103G        9.2M         20G        123G
Swap:          4.0G          0B        4.0G

That 20G of buffers is all from mkfs.gfs2.  The ppc system I ran on in the original description only has 10G RAM.

Comment 3 Andrew Price 2017-04-10 11:18:08 UTC

The gfs-p8-01-lp* nodes have 32G memory each. I think it would be better to do the 300TB testing on those.

That said, it seems odd that kernel buffers would make much of a difference since the same amount of i/o needs to be completed. I'm wondering if the difference is related to the larger page size on these systems, but I can't think why it would add a whole hour to the run time.
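
One way to check the page-size idea is to compare the page size and the block layer's I/O hints on both machines. The following is a rough sketch, assuming a Linux system that exposes the queue limits under /sys/block; the device name dm-15 is only an example taken from the transcripts in this bug.

```python
import os
from pathlib import Path
from typing import Optional


def page_size_bytes() -> int:
    """Return the kernel page size in bytes."""
    return os.sysconf("SC_PAGE_SIZE")


def minimum_io_size_bytes(dev_name: str) -> Optional[int]:
    """Read queue/minimum_io_size for a block device such as 'dm-15'.

    Returns None when the sysfs attribute is missing or unreadable.
    """
    attr = Path("/sys/block") / dev_name / "queue" / "minimum_io_size"
    try:
        return int(attr.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("page size:", page_size_bytes())
    # 'dm-15' is an example name; substitute the device under test.
    print("minimum io size:", minimum_io_size_bytes("dm-15"))
```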

Comment 4 Robert Peterson 2017-04-10 12:19:37 UTC

Nate: Andy's right. I failed to notice that the gfs-p8-01
series of LPARs has 32GB of RAM but the gfs-p8-02 series
has only 10GB. I'll build you a new 5-node cluster to use
instead, and move the big storage over there once it's up.

Comment 5 Andrew Price 2017-04-11 09:59:25 UTC

Comparison against a sparse file on the root partition (5TB fs):

[root@gfs-p8-01-lp08 ~/gfs2-utils]# truncate -s 5T testvol
[root@gfs-p8-01-lp08 ~/gfs2-utils]# /usr/bin/time -v gfs2/mkfs/mkfs.gfs2 -Oqp lock_nolock -o align=0 testvol 1342177280
This will destroy any data on testvol
	Command being timed: "gfs2/mkfs/mkfs.gfs2 -Oqp lock_nolock -o align=0 testvol 1342177280"
	User time (seconds): 0.10
	System time (seconds): 1.44
	Percent of CPU this job got: 34%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.49  <--------
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 12992
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 258
	Voluntary context switches: 280
	Involuntary context switches: 5
	Swaps: 0
	File system inputs: 0
	File system outputs: 3545344
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 65536
	Exit status: 0
[root@gfs-p8-01-lp08 ~/gfs2-utils]# /usr/bin/time -v gfs2/mkfs/mkfs.gfs2 -Oqp lock_nolock -o align=0 /dev/fsck/perf 1342177280
It appears to contain an existing filesystem (gfs2)
/dev/fsck/perf is a symbolic link to /dev/dm-15
This will destroy any data on /dev/dm-15
	Command being timed: "gfs2/mkfs/mkfs.gfs2 -Oqp lock_nolock -o align=0 /dev/fsck/perf 1342177280"
	User time (seconds): 0.18
	System time (seconds): 0.81
	Percent of CPU this job got: 1%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:55.48   <--------
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 7552
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 260
	Voluntary context switches: 27992
	Involuntary context switches: 212
	Swaps: 0
	File system inputs: 3546496
	File system outputs: 3545600
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 65536
	Exit status: 0

Maybe this is a storage config problem or something like that?

Comment 6 Andrew Price 2017-04-11 10:22:19 UTC

I've just noticed this:

	File system inputs: 0
	File system outputs: 3545344
vs
	File system inputs: 3546496
	File system outputs: 3545600

Perhaps that's something to do with it.
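
For anyone who wants to collect the same two counters from a script, the sketch below reads them through getrusage(2). To my understanding these are the fields that /usr/bin/time prints as "File system inputs" and "File system outputs", and on Linux they are usually counted in 512-byte units, but treat both points as assumptions. The function name block_io_counts is made up for illustration.

```python
import resource
import subprocess
import sys


def block_io_counts(argv):
    """Run argv and return (inputs, outputs) charged to the child process.

    The values are the deltas of ru_inblock and ru_oublock for waited-for
    children, measured around the subprocess call.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(argv, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return (after.ru_inblock - before.ru_inblock,
            after.ru_oublock - before.ru_oublock)


if __name__ == "__main__":
    # Harmless placeholder command; replace it with the command under test.
    inputs, outputs = block_io_counts([sys.executable, "-c", "pass"])
    print(f"inputs={inputs} outputs={outputs}")
```

A non-zero input count for a command that should only be writing is the kind of hint that stood out in the comparison above.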

Comment 7 Andrew Price 2017-04-11 10:48:48 UTC

Comparing against an iSCSI-backed LV on x86_64:

[root@gfs-p8-01-lp08 ~/gfs2-utils]# /usr/bin/time gfs2/mkfs/mkfs.gfs2 -Oqp lock_nolock -o align=0 -K /dev/fsck/perf 244058112
It appears to contain an existing filesystem (gfs2)
/dev/fsck/perf is a symbolic link to /dev/dm-15
This will destroy any data on /dev/dm-15
0.03user 0.24system 0:30.05elapsed 0%CPU (0avgtext+0avgdata 4416maxresident)k
775936inputs+775040outputs (0major+155minor)pagefaults 0swaps

[root@curie-01 gfs2-utils]# /usr/bin/time gfs2/mkfs/mkfs.gfs2 -Oqp lock_nolock -o align=0 -K /dev/vg_test/lv_test 244058112
It appears to contain an existing filesystem (gfs2)
/dev/vg_test/lv_test is a symbolic link to /dev/dm-3
This will destroy any data on /dev/dm-3
0.02user 0.36system 0:02.16elapsed 18%CPU (0avgtext+0avgdata 2608maxresident)k
4112inputs+414632outputs (0major+965minor)pagefaults 0swaps

The "inputs" are non-zero this time but much lower, and there's a lot less io overall than on ppc64le.

Comment 8 Andrew Price 2017-04-13 15:53:53 UTC

I've managed to track this down to read-modify-write issues caused by a combination of two factors:

1. There's a bug in the resource group creation code which causes resource groups to be misaligned after the initial ones that contain journals. This is a one-line fix and it could potentially increase performance for gfs2 itself as well as mkfs.gfs2 in some cases.

2. The page size on these machines is 64K, which is also the "minimum io size" limit in this case. The resource groups are being written out in single-block I/Os, so with a 4K block size, when we issue a write the kernel also wants to read in the remaining 60K before we can modify the pages. The solution here is to issue writes in multiples of the minimum io size.
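
The second factor can be illustrated with a small sketch. This is not the upstream patch, only a Python model of the idea under stated assumptions: rmw_read_bytes and AlignedWriter are hypothetical names, writes are sequential, and the starting offset is already a multiple of the I/O unit. The first helper reproduces the 60K figure above; the class buffers small writes and only issues them in multiples of the I/O unit.

```python
import os


def rmw_read_bytes(write_len: int, offset: int, io_unit: int) -> int:
    """Bytes outside the write that share its io_unit-sized chunks.

    These are the bytes that have to be read before a partial chunk
    can be modified and written back.
    """
    start = (offset // io_unit) * io_unit
    end = -(-(offset + write_len) // io_unit) * io_unit  # round up
    return (end - start) - write_len


class AlignedWriter:
    """Buffer sequential writes and flush them in io_unit multiples."""

    def __init__(self, fd: int, io_unit: int, start: int = 0):
        self.fd = fd
        self.io_unit = io_unit
        self.offset = start      # assumed to be a multiple of io_unit
        self.buf = bytearray()
        self.flushes = []        # sizes of the writes actually issued

    def write(self, data: bytes) -> None:
        self.buf += data
        full = (len(self.buf) // self.io_unit) * self.io_unit
        if full:
            self._flush(full)

    def _flush(self, nbytes: int) -> None:
        chunk = bytes(self.buf[:nbytes])
        done = 0
        while done < nbytes:     # pwrite may write less than requested
            done += os.pwrite(self.fd, chunk[done:], self.offset + done)
        self.flushes.append(nbytes)
        self.offset += nbytes
        del self.buf[:nbytes]

    def close(self) -> None:
        # Whatever is left is shorter than one io_unit.
        if self.buf:
            self._flush(len(self.buf))


print(rmw_read_bytes(4096, 0, 65536))   # one 4K block in a 64K unit
print(rmw_read_bytes(65536, 0, 65536))  # a full, aligned 64K write
```

With a 64K unit, sixteen consecutive 4K block writes passed to AlignedWriter.write() result in a single 64K write, so no chunk is only partially covered.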

I'm testing some patches at the moment and the speed improvement is looking promising. My most recent 300T test looks like:

1.54user 11.68system 2:52.32elapsed 7%CPU (0avgtext+0avgdata 163584maxresident)k
296872inputs+115986176outputs (2major+1059794minor)pagefaults 0swaps

Comment 9 Andrew Price 2017-04-13 17:02:06 UTC

I have sent three patches upstream that provide the aforementioned performance improvement. They include a new test case for the resource group alignment bug.
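
As an illustration of what an alignment check can look like (this is not the upstream test case), the sketch below rounds an address up to an alignment boundary and lists any start addresses that miss it. The function names, the sample addresses and the 16-block alignment (64K divided by a 4K block size) are all assumptions made for the example.

```python
def align_up(addr: int, alignment: int) -> int:
    """Round addr up to the next multiple of alignment."""
    return -(-addr // alignment) * alignment


def misaligned(starts, alignment: int):
    """Return the start addresses that are not multiples of alignment."""
    return [addr for addr in starts if addr % alignment != 0]


# Hypothetical resource group start addresses, in filesystem blocks.
starts = [0, 16, 33, 48]
print(misaligned(starts, 16))
print(align_up(33, 16))
```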

Comment 12 Andrew Price 2017-04-18 14:56:07 UTC

Created attachment 1272360
iowatcher graph comparison

This bz makes for a nice iowatcher graph side-by-side comparison so I'm attaching it for posterity.

Comment 13 Nate Straz 2017-05-11 18:08:51 UTC

[root@gfs-p8-01-lp06 ~]# rpm -q gfs2-utils
gfs2-utils-3.1.10-3.el7.ppc64le

=== mkfs.gfs2 300T ===
/dev/fsck/perf is a symbolic link to /dev/dm-15
This will destroy any data on /dev/dm-15
Adding journals: Done
Building resource groups: Done
Creating quota file: Done
Writing superblock and syncing: Done
Device:                    /dev/fsck/perf
Block size:                4096
Device size:               307200.00 GB (80530636800 blocks)
Filesystem size:           307198.75 GB (80530309325 blocks)
Journals:                  1
Resource groups:           153605
Locking protocol:          "lock_nolock"
Lock table:                ""
UUID:                      6039e7d4-d4d5-4923-a131-9cc0323e6059
1.28user 13.49system 2:44.06elapsed 9%CPU (0avgtext+0avgdata 166720maxresident)k
294160inputs+118262272outputs (1major+1080567minor)pagefaults 0swaps

Comment 14 errata-xmlrpc 2017-08-01 21:57:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2226
