Bug 1440269
| Summary: | mkfs.gfs2 is slow on ppc64le | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Nate Straz <nstraz> |
| Component: | gfs2-utils | Assignee: | Andrew Price <anprice> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Priority: | unspecified |
| Version: | 7.4 | CC: | cluster-maint, gfs2-maint, rpeterso |
| Target Milestone: | rc | Hardware: | ppc64le |
| OS: | Unspecified | Fixed In Version: | gfs2-utils-3.1.10-3.el7 |
| Last Closed: | 2017-08-01 21:57:28 UTC | Type: | Bug |
Description
Nate Straz
2017-04-07 17:36:11 UTC
I ran this on the QE test systems and it completed in 90 seconds.
[root@dash-02 ~]# /usr/bin/time mkfs.gfs2 -O -K -p lock_nolock -j 1 -r 2048 /dev/dash/fsck
Warning: device is not properly aligned. This may harm performance.
/dev/dash/fsck is a symbolic link to /dev/dm-15
This will destroy any data on /dev/dm-15
Adding journals: Done
Building resource groups: Done
Creating quota file: Done
Writing superblock and syncing: Done
Device: /dev/dash/fsck
Block size: 4096
Device size: 307199.97 GB (80530630656 blocks)
Filesystem size: 307198.75 GB (80530309220 blocks)
Journals: 1
Resource groups: 153605
Locking protocol: "lock_nolock"
Lock table: ""
UUID: 40829345-5fda-4dda-af04-70f63475a2df
1.20user 18.84system 1:30.52elapsed 22%CPU (0avgtext+0avgdata 164424maxresident)k
4304inputs+40845320outputs (1major+73500minor)pagefaults 0swaps
[root@dash-02 ~]# rpm -q gfs2-utils
gfs2-utils-3.1.10-2.el7.x86_64
[root@dash-02 ~]# mkfs.gfs2 -V
mkfs.gfs2 master (built Mar 29 2017 09:51:30)
Copyright (C) Red Hat, Inc. 2004-2010 All rights reserved.
It may be because of the amount of memory on the system.
[root@dash-02 sts-rhel7.4]# free -h
              total        used        free      shared  buff/cache   available
Mem:           125G        979M        103G        9.2M         20G        123G
Swap:          4.0G          0B        4.0G
That 20G of buffers is all from mkfs.gfs2. The ppc system I ran on in the original description only has 10G RAM.
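As a rough cross-check of the buffer figure above, the "outputs" count from /usr/bin/time can be converted to a byte volume. This is only a sketch: it assumes the count is in 512-byte units (as Linux rusage accounting is generally understood to report it), and the helper name is made up for illustration.

```python
# Rough cross-check: does the "outputs" count reported by /usr/bin/time
# account for the ~20G of buff/cache shown by free -h?
# Assumption: the count is in 512-byte units (Linux rusage accounting).
UNIT_BYTES = 512

outputs = 40845320  # from "4304inputs+40845320outputs" in the run above


def blocks_to_gib(blocks, unit_bytes=UNIT_BYTES):
    """Convert an rusage block count to GiB (hypothetical helper)."""
    return blocks * unit_bytes / 2**30


written_gib = blocks_to_gib(outputs)
print(f"written by mkfs.gfs2: {written_gib:.1f} GiB")
```

Under that assumption the run wrote about 19.5 GiB, which is in line with the 20G shown in the buff/cache column.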
The gfs-p8-01-lp* nodes have 32G memory each. I think it would be better to do the 300TB testing on those. That said, it seems odd that kernel buffers would make much of a difference since the same amount of i/o needs to be completed. I'm wondering if the difference is related to the larger page size on these systems, but I can't think why it would add a whole hour to the run time.
Nate: Andy's right. I failed to notice that the gfs-p8-01 series of lpars has 32GB of ram but the gfs-p8-02 series has only 10GB. I'll build you a new 5-node cluster to use instead, and move the big storage over there once it's up.
Comparison against a sparse file on the root partition (5TB fs):
[root@gfs-p8-01-lp08 ~/gfs2-utils]# truncate -s 5T testvol
[root@gfs-p8-01-lp08 ~/gfs2-utils]# /usr/bin/time -v gfs2/mkfs/mkfs.gfs2 -Oqp lock_nolock -o align=0 testvol 1342177280
This will destroy any data on testvol
Command being timed: "gfs2/mkfs/mkfs.gfs2 -Oqp lock_nolock -o align=0 testvol 1342177280"
User time (seconds): 0.10
System time (seconds): 1.44
Percent of CPU this job got: 34%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.49 <--------
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 12992
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 258
Voluntary context switches: 280
Involuntary context switches: 5
Swaps: 0
File system inputs: 0
File system outputs: 3545344
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 65536
Exit status: 0
[root@gfs-p8-01-lp08 ~/gfs2-utils]# /usr/bin/time -v gfs2/mkfs/mkfs.gfs2 -Oqp lock_nolock -o align=0 /dev/fsck/perf 1342177280
It appears to contain an existing filesystem (gfs2)
/dev/fsck/perf is a symbolic link to /dev/dm-15
This will destroy any data on /dev/dm-15
Command being timed: "gfs2/mkfs/mkfs.gfs2 -Oqp lock_nolock -o align=0 /dev/fsck/perf 1342177280"
User time (seconds): 0.18
System time (seconds): 0.81
Percent of CPU this job got: 1%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:55.48 <--------
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 7552
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 260
Voluntary context switches: 27992
Involuntary context switches: 212
Swaps: 0
File system inputs: 3546496
File system outputs: 3545600
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 65536
Exit status: 0
Maybe this is a storage config problem or something like that?
I've just noticed this:
File system inputs: 0
File system outputs: 3545344
vs
File system inputs: 3546496
File system outputs: 3545600
Perhaps that's something to do with it.
Comparing against an iscsi-backed lv on x86_64:
[root@gfs-p8-01-lp08 ~/gfs2-utils]# /usr/bin/time gfs2/mkfs/mkfs.gfs2 -Oqp lock_nolock -o align=0 -K /dev/fsck/perf 244058112
It appears to contain an existing filesystem (gfs2)
/dev/fsck/perf is a symbolic link to /dev/dm-15
This will destroy any data on /dev/dm-15
0.03user 0.24system 0:30.05elapsed 0%CPU (0avgtext+0avgdata 4416maxresident)k
775936inputs+775040outputs (0major+155minor)pagefaults 0swaps
[root@curie-01 gfs2-utils]# /usr/bin/time gfs2/mkfs/mkfs.gfs2 -Oqp lock_nolock -o align=0 -K /dev/vg_test/lv_test 244058112
It appears to contain an existing filesystem (gfs2)
/dev/vg_test/lv_test is a symbolic link to /dev/dm-3
This will destroy any data on /dev/dm-3
0.02user 0.36system 0:02.16elapsed 18%CPU (0avgtext+0avgdata 2608maxresident)k
4112inputs+414632outputs (0major+965minor)pagefaults 0swaps
The "inputs" are non-zero this time but much lower, and there's a lot less io overall than on ppc64le.
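To make the inputs/outputs comparison above easier to read, here is a small sketch that puts the counters from the two 5T runs and the two smaller runs side by side. It assumes the counters are in 512-byte units (the usual Linux rusage convention); the run labels and helper names are made up for illustration.

```python
# Counters copied from the /usr/bin/time output above.
# Assumption: "inputs"/"outputs" are rusage block counts in 512-byte units.
UNIT_BYTES = 512

runs = {
    "ppc64le sparse file (5T)":    {"inputs": 0,       "outputs": 3545344},
    "ppc64le /dev/fsck/perf (5T)": {"inputs": 3546496, "outputs": 3545600},
    "ppc64le /dev/fsck/perf":      {"inputs": 775936,  "outputs": 775040},
    "x86_64 /dev/vg_test/lv_test": {"inputs": 4112,    "outputs": 414632},
}


def to_mib(blocks):
    """Convert a block count to MiB under the 512-byte assumption."""
    return blocks * UNIT_BYTES / 2**20


def read_write_ratio(run):
    """Bytes read per byte written for one run."""
    return run["inputs"] / run["outputs"]


for name, run in runs.items():
    print(f"{name:30s} read {to_mib(run['inputs']):8.1f} MiB, "
          f"wrote {to_mib(run['outputs']):8.1f} MiB, "
          f"ratio {read_write_ratio(run):.3f}")
```

On the ppc64le block device the read volume is roughly equal to the write volume in both runs, whereas the sparse-file and x86_64 runs read little or nothing. That pattern would be consistent with the writes themselves triggering reads.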
I've managed to track this down to read-modify-write issues caused by a combination of two factors:
1. There's a bug in the resource group creation code which causes resource groups to be misaligned after the initial ones that contain journals. This is a one-line fix and it could potentially increase performance for gfs2 itself as well as mkfs.gfs2 in some cases.
2. The page size on these machines is 64K which is also the "minimum io size" limit in this case. The resource groups are being written out in single-block I/Os so with a 4K block size, when we issue a write the kernel also wants to read in the remaining 60K before we can modify the pages. The solution here is to issue writes in multiples of the minimum io size.
I'm testing some patches at the moment and the speed improvement is looking promising. My most recent 300T test looks like:
1.54user 11.68system 2:52.32elapsed 7%CPU (0avgtext+0avgdata 163584maxresident)k
296872inputs+115986176outputs (2major+1059794minor)pagefaults 0swaps
I have sent three patches upstream that provide the aforementioned performance improvement. They include a new test case for the resource group alignment bug.
Created attachment 1272360: iowatcher graph comparison
This bz makes for a nice iowatcher graph side-by-side comparison so I'm attaching it for posterity.
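The read-modify-write effect behind this bug can be illustrated with a little arithmetic. The sketch below is a simplified model, not the gfs2-utils code: it assumes every I/O-size chunk that a write only partially covers must first be read in full, and it ignores data that may already be cached. The function names are made up for illustration.

```python
PAGE_SIZE = 64 * 1024   # page size / minimum io size on the ppc64le hosts
BLOCK_SIZE = 4 * 1024   # gfs2 block size used in these runs


def align_down(value, alignment):
    """Round value down to a multiple of alignment."""
    return value - (value % alignment)


def align_up(value, alignment):
    """Round value up to a multiple of alignment."""
    return align_down(value + alignment - 1, alignment)


def rmw_read_bytes(offset, length, io_size=PAGE_SIZE):
    """Bytes that must be read to complete the io_size chunks touched by a write.

    Simplified model: only the first and last chunk can be partially
    covered, so the bytes to read are the chunk span minus the bytes written.
    """
    start = align_down(offset, io_size)
    end = align_up(offset + length, io_size)
    return (end - start) - length


# A single 4K block written into a 64K page: the other 60K is read first.
print(rmw_read_bytes(0, BLOCK_SIZE))          # 61440
# An aligned 64K write covers the whole page: nothing to read.
print(rmw_read_bytes(0, PAGE_SIZE))           # 0
# A 64K write that starts 4K off alignment touches two pages.
print(rmw_read_bytes(BLOCK_SIZE, PAGE_SIZE))  # 65536
```

The third case shows why both factors matter in this model: issuing writes in multiples of the minimum io size only avoids the extra reads when the start of the write is aligned as well.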
[root@gfs-p8-01-lp06 ~]# rpm -q gfs2-utils
gfs2-utils-3.1.10-3.el7.ppc64le
=== mkfs.gfs2 300T ===
/dev/fsck/perf is a symbolic link to /dev/dm-15
This will destroy any data on /dev/dm-15
Adding journals: Done
Building resource groups: Done
Creating quota file: Done
Writing superblock and syncing: Done
Device: /dev/fsck/perf
Block size: 4096
Device size: 307200.00 GB (80530636800 blocks)
Filesystem size: 307198.75 GB (80530309325 blocks)
Journals: 1
Resource groups: 153605
Locking protocol: "lock_nolock"
Lock table: ""
UUID: 6039e7d4-d4d5-4923-a131-9cc0323e6059
1.28user 13.49system 2:44.06elapsed 9%CPU (0avgtext+0avgdata 166720maxresident)k
294160inputs+118262272outputs (1major+1080567minor)pagefaults 0swaps
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2017:2226