1837640 – gfs2_jadd doesn't clean up if it runs out of space

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	gfs2-utils
Sub Component:
Version:	7.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	7.9
Assignee:	Abhijith Das
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:	1833141
Blocks:	1834456

Reported:	2020-05-19 18:04 UTC by Abhijith Das
Modified:	2020-09-29 20:33 UTC
CC List:	8 users
Fixed In Version:	gfs2-utils-3.1.10-11.el7
Doc Text:
Clone Of:	1833141
Environment:
Last Closed:	2020-09-29 20:33:28 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:4008	0	None	None	None	2020-09-29 20:33:33 UTC

Description Abhijith Das 2020-05-19 18:04:55 UTC

+++ This bug was initially created as a clone of Bug #1833141 +++

Description of problem:

If we run out of space on the file system while gfs2_jadd is trying to create new journals, it exits, leaving files behind and gfs2meta mounted.


Version-Release number of selected component (if applicable):
gfs2-utils-3.2.0-7.el8.x86_64

How reproducible:
easily

Steps to Reproduce:
1. Run gfs2_jadd to add more large journals than will fit on a small file system.

Actual results:
[root@host-028 ~]# lvcreate -L 1g -n brawl0 brawl
WARNING: gfs2 signature detected on /dev/brawl/brawl0 at offset 65536. Wipe it? [y/n]: y
  Wiping gfs2 signature on /dev/brawl/brawl0.
  Logical volume "brawl0" created.
[root@host-028 ~]# mkfs.gfs2 -p lock_nolock -j 1 -J 128 -O /dev/brawl/brawl0
/dev/brawl/brawl0 is a symbolic link to /dev/dm-2
This will destroy any data on /dev/dm-2
Discarding device contents (may take a while on large devices): Done
Adding journals: Done
Building resource groups: Done
Creating quota file: Done
Writing superblock and syncing: Done
Device:                    /dev/brawl/brawl0
Block size:                4096
Device size:               1.00 GB (262144 blocks)
Filesystem size:           1.00 GB (262142 blocks)
Journals:                  1
Journal size:              128MB
Resource groups:           5
Locking protocol:          "lock_nolock"
Lock table:                ""
UUID:                      914a1553-6647-4df2-8721-3ffc30f56d20
[root@host-028 ~]# mount /dev/brawl/brawl0 /mnt/brawl
[root@host-028 ~]# df /mnt/brawl
Filesystem               1K-blocks   Used Available Use% Mounted on
/dev/mapper/brawl-brawl0   1048400 132404    915996  13% /mnt/brawl
[root@host-028 ~]# gfs2_jadd -j 10 -J 128 /mnt/brawl
add_j: No space left on device
[root@host-028 ~]# df
Filesystem                      1K-blocks    Used Available Use% Mounted on
devtmpfs                           965440       0    965440   0% /dev
tmpfs                              982048   51696    930352   6% /dev/shm
tmpfs                              982048   16892    965156   2% /run
tmpfs                              982048       0    982048   0% /sys/fs/cgroup
/dev/mapper/rhel_host--028-root   6486016 4849696   1636320  75% /
/dev/vda1                         1038336  320660    717676  31% /boot
tmpfs                              196352       0    196352   0% /run/user/0
/dev/mapper/brawl-brawl0          1048400 1048272       128 100% /mnt/brawl
[root@host-028 ~]# mount
...
/dev/mapper/brawl-brawl0 on /mnt/brawl type gfs2 (rw,relatime,seclabel,localflocks)
/mnt/brawl on /tmp/.gfs2meta.eSuC5g type gfs2 (rw,relatime,seclabel,meta,localflocks)
[root@host-028 ~]# ls /tmp/.gfs2meta.eSuC5g/ -l
total 120516
-rw-------. 1 root root         8 May  7 16:17 inum
drwx------. 2 root root      3864 May  7 16:19 jindex
-rw-------. 1 root root 123129856 May  7 16:19 new_inode
drwx------. 2 root root      3864 May  7 16:19 per_node
-rw-------. 1 root root       176 May  7 16:17 quota
-rw-------. 1 root root       480 May  7 16:17 rindex
-rw-------. 1 root root        24 May  7 16:17 statfs
[root@host-028 ~]# ls /tmp/.gfs2meta.eSuC5g/jindex -l
total 919376
-rw-------. 1 root root 134217728 May  7 16:17 journal0
-rw-------. 1 root root 134217728 May  7 16:18 journal1
-rw-------. 1 root root 134217728 May  7 16:18 journal2
-rw-------. 1 root root 134217728 May  7 16:19 journal3
-rw-------. 1 root root 134217728 May  7 16:19 journal4
-rw-------. 1 root root 134217728 May  7 16:19 journal5
-rw-------. 1 root root 134217728 May  7 16:19 journal6


Expected results:
gfs2_jadd should unmount gfs2meta and clean up any journals it was not able to completely create (or not create them to begin with).

Additional info:

--- Additional comment from Nate Straz on 2020-05-07 21:32:06 UTC ---

fsck.gfs2 cannot fix these file systems either.

[root@host-028 ~]# fsck.gfs2 -y /dev/brawl/brawl0
Initializing fsck
Validating resource group index.
Level 1 resource group check: Checking if all rgrp and rindex values are good.
(level 1 passed)
File system journal "journal7" is missing or corrupt: pass1 will try to recreate it.

Journal recovery complete.
Starting pass1
Invalid or missing journal7 system inode (is 'free', should be 'inode').
Rebuilding system file "journal7"
get_file_buf
[root@host-028 ~]# echo $?
1

--- Additional comment from Abhijith Das on 2020-05-11 03:20:17 UTC ---

If gfs2_jadd runs out of disk space while adding journals, it does
not exit gracefully. It partially does its job and bails out when
it hits -ENOSPC. This leaves the metafs mounted and most likely a
corrupted filesystem that even fsck.gfs2 can't fix.

This patch adds a pre-check that ensures that the journals requested
will fit in the available space before proceeding. Note that this is
not foolproof because gfs2_jadd operates on a mounted filesystem.
While it is required that the filesystem be idle (and mounted on only
one node) while gfs2_jadd is being run, there is nothing stopping a
user from having some I/O process competing with gfs2_jadd for disk
blocks and consequently crashing it.

This patch also does some cleanup of data structures when gfs2_jadd
exits due to errors.
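
As a rough illustration of the pre-check idea described above, the sketch below compares the blocks a set of new journals would need against the blocks currently free on the mounted filesystem. It is written in Python for brevity rather than in the C of gfs2_jadd, and the function names and the per-journal `overhead_blocks` parameter are assumptions made for illustration, not the code in the attached patch. As the comment says, a check like this is only advisory: another writer can consume blocks between the check and the allocation.

```python
import os


def blocks_required(njournals, journal_blocks, overhead_blocks=0):
    """Blocks needed for njournals journals.

    overhead_blocks is a hypothetical per-journal allowance for metadata;
    the real tool derives its own figure.
    """
    return njournals * (journal_blocks + overhead_blocks)


def journals_fit(free_blocks, njournals, journal_blocks, overhead_blocks=0):
    """Return True if the requested journals fit in free_blocks."""
    return blocks_required(njournals, journal_blocks, overhead_blocks) <= free_blocks


def precheck(mountpoint, njournals, journal_bytes, overhead_blocks=0):
    """Advisory check: refuse early instead of failing halfway with ENOSPC."""
    st = os.statvfs(mountpoint)
    # f_bfree is counted in units of f_frsize; round the journal size up.
    journal_blocks = -(-journal_bytes // st.f_frsize)
    return journals_fit(st.f_bfree, njournals, journal_blocks, overhead_blocks)
```

With the figures from the reproduction above (915996 1K-blocks available, which is 228999 blocks of 4096 bytes), ten 128MB journals need at least 327680 blocks before any overhead, so `journals_fit(228999, 10, 32768)` returns False and the request would be refused up front.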

--- Additional comment from Abhijith Das on 2020-05-11 03:33:04 UTC ---

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28472231

Here's a scratch build with the above patch.

--- Additional comment from Andrew Price on 2020-05-11 11:17:35 UTC ---

(In reply to Abhijith Das from comment #2)
> Created attachment 1687127 [details]
> gfs2_jadd-out-of-space-issues-and-other-fixes

> Note that this is
> not foolproof because gfs2_jadd operates on a mounted filesystem.

I think it would be better to add error handling to the write() and close() calls, as currently write() errors cause an exit() without cleanup, and the close()es aren't checked at all. It probably needs fsync()s too, to make sure all error cases get flagged up. At the least it should unmount the metafs before bailing out in all cases.
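
The suggestion above can be sketched roughly as follows: check every write, fsync and close, remove a half-written file on failure, and run the unmount step on every exit path. This is a Python illustration using the `os` wrappers around the same system calls, not the C code in main_jadd.c; `write_journal_file`, `add_journals` and the `unmount` callable are hypothetical names introduced here.

```python
import os


def write_journal_file(path, chunks):
    """Write chunks to a new file at path, checking write, fsync and close.

    Returns 0 on success and -1 on error; a partially written file is
    removed. Errors from os.open propagate, since nothing was created yet.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    ok = False
    try:
        for chunk in chunks:
            view = memoryview(chunk)
            while view:                   # os.write may write fewer bytes than asked
                written = os.write(fd, view)
                view = view[written:]
        os.fsync(fd)                      # surface deferred I/O errors before closing
        ok = True
    except OSError:
        ok = False
    finally:
        try:
            os.close(fd)                  # close can report an error too
        except OSError:
            ok = False
        if not ok:
            try:
                os.unlink(path)           # do not leave a half-written file behind
            except OSError:
                pass
    return 0 if ok else -1


def add_journals(files, unmount):
    """Create each (path, chunks) pair; always call unmount() before returning."""
    try:
        for path, chunks in files:
            if write_journal_file(path, chunks) != 0:
                return -1
        return 0
    finally:
        unmount()                         # hypothetical hook standing in for the metafs unmount
```

Returning a plain -1 rather than an errno value keeps the contract simple for callers that only test for failure.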

Nate, could you open a separate bz for the fsck.gfs2 failure?

--- Additional comment from Abhijith Das on 2020-05-11 16:30:51 UTC ---



--- Additional comment from Abhijith Das on 2020-05-11 16:32:42 UTC ---

Andy, when you get a chance, could you go over this (and the previous) patch and let me know if it looks ok to you?

--- Additional comment from Andrew Price on 2020-05-11 17:23:49 UTC ---

(In reply to Abhijith Das from comment #6)
> Created attachment 1687380 [details]
> First bash at error handling fixes

Just some minor style/maintainability nits - using "close" as a label can make it difficult to search for close() calls later, and returning errno where the caller isn't expecting an errno value can be confusing... returning -1 will probably be safer in those cases. Other than that it looks like a good improvement, thanks Abhi.

--- Additional comment from Abhijith Das on 2020-05-11 23:07:49 UTC ---

This version has Andy's suggested fixes. Here's a build with this patch and the previous bugfix patch: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28499297

--- Additional comment from Andrew Price on 2020-05-12 09:31:54 UTC ---

I've pushed the patches upstream (commits deb620675, bb22cafab) with a slight tweak to the second patch to clear up these warnings:

main_jadd.c:518:12: warning: unused variable ‘blk_addr’ [-Wunused-variable]
  518 |   uint64_t blk_addr = 0;
      |            ^~~~~~~~
main_jadd.c:618:19: warning: too many arguments for format [-Wformat-extra-args]
  618 |   fprintf(stderr, "%s: not a mounted gfs2 file system\n",
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let me know when you're happy for them to be added to the RHEL package (move the bug to POST) and please make any further fixes on top of the master branch. Thanks!

Comment 2 Abhijith Das 2020-05-19 18:15:46 UTC

When gfs2_jadd is run on an fs that is low on disk space, it can fail after doing only part of its job, leaving the filesystem in an inconsistent state. The meta filesystem may remain mounted as well. The resulting inconsistent fs will not mount, and fsck.gfs2 is unable to fix it.

gfs2_jadd is used rarely, and the odds of someone running it on an almost-full fs are even lower. We even recommend that the fs be backed up before attempting gfs2_jadd. However, the failure renders the fs unusable, so the impact is severe.

Requesting blocker+ for this bug. We have a fix for this upstream and in RHEL8.

Comment 4 Steve Whitehouse 2020-05-21 10:09:47 UTC

You shouldn't need blocker to get this in yet. Unless this is important enough that we need to make it a blocker anyway?

Comment 6 Abhijith Das 2020-05-27 15:23:05 UTC

Cancelling needinfo as this was not deemed a blocker.

Comment 11 Nate Straz 2020-07-09 19:02:40 UTC

BEFORE - gfs2-utils-3.1.10-10.el7.x86_64

SCENARIO - [jadd_no_space]
Creating 1G LV jadded on host-027
WARNING: gfs2 signature detected on /dev/fsck/jadded at offset 65536. Wipe it? [y/n]: [n]
  Aborted wiping of gfs2.
  1 existing signature left on the device.
Creating file system on /dev/fsck/jadded with options '-p lock_nolock -j 1 -J 128' on host-027
It appears to contain an existing filesystem (gfs2)
/dev/fsck/jadded is a symbolic link to /dev/dm-2
This will destroy any data on /dev/dm-2
Discarding device contents (may take a while on large devices): Done
Adding journals: Done
Building resource groups: Done
Creating quota file: Done
Writing superblock and syncing: Done
Device:                    /dev/fsck/jadded
Block size:                4096
Device size:               1.00 GB (262144 blocks)
Filesystem size:           1.00 GB (262142 blocks)
Journals:                  1
Journal size:              128MB
Resource groups:           5
Locking protocol:          "lock_nolock"
Lock table:                ""
UUID:                      e6aeee2d-5721-4fd1-acff-e76c319167ce
Mounting gfs2 /dev/fsck/jadded on host-027 with opts ''
Filling some space
Try to add more journals than there is space
add_j: No space left on device
Unexpected gfs2meta mounted after gfs2_jadd
[root@host-025 sts-rhel7.9]# echo $?
1


AFTER - gfs2-utils-3.1.10-11.el7.x86_64

SCENARIO - [jadd_no_space]
Creating 1G LV jadded on host-027
WARNING: gfs2 signature detected on /dev/fsck/jadded at offset 65536. Wipe it? [y/n]: [n]
  Aborted wiping of gfs2.
  1 existing signature left on the device.
Creating file system on /dev/fsck/jadded with options '-p lock_nolock -j 1 -J 128' on host-027
It appears to contain an existing filesystem (gfs2)
/dev/fsck/jadded is a symbolic link to /dev/dm-2
This will destroy any data on /dev/dm-2
Discarding device contents (may take a while on large devices): Done
Adding journals: Done
Building resource groups: Done
Creating quota file: Done
Writing superblock and syncing: Done
Device:                    /dev/fsck/jadded
Block size:                4096
Device size:               1.00 GB (262144 blocks)
Filesystem size:           1.00 GB (262142 blocks)
Journals:                  1
Journal size:              128MB
Resource groups:           5
Locking protocol:          "lock_nolock"
Lock table:                ""
UUID:                      8daf9acf-5cac-4d75-b5cb-1f5cbd1b470b
Mounting gfs2 /dev/fsck/jadded on host-027 with opts ''
Filling some space
Try to add more journals than there is space
Failed to add journals: No space left on device

Insufficient space on the device to add 5 128MB journals (1MB QC size)

Required space  :     165465 blks (33093 blks per journal)
Available space :     100745 blks

Good, no gfs2meta mounts found
Unmounting /mnt/fsck on host-027
Removing LV jadded on host-027


[root@host-025 sts-rhel7.9]# echo $?
0
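
The numbers in the AFTER output can be cross-checked with a little arithmetic. The sketch below only recomputes figures printed in the log (block size, journal size, blocks per journal, available blocks); the function name is made up for illustration, and the log does not break down what the extra blocks per journal are used for, so they are reported only as a difference.

```python
def summarize(requested, blocks_per_journal, available, journal_bytes, block_size):
    """Recompute the space figures reported by the pre-check."""
    payload = journal_bytes // block_size           # blocks of journal data
    required = requested * blocks_per_journal
    return {
        "payload_blocks": payload,
        "extra_blocks_per_journal": blocks_per_journal - payload,
        "required": required,
        "shortfall": max(0, required - available),
        "max_journals_that_fit": available // blocks_per_journal,
    }


# Values taken from the log: 5 journals of 128MB, 4096-byte blocks,
# 33093 blocks per journal, 100745 blocks available.
report = summarize(5, 33093, 100745, 128 * 1024 * 1024, 4096)
for key, value in report.items():
    print(f"{key}: {value}")
```

The required total matches the 165465 blocks in the log, and at 33093 blocks each only three of the five requested journals would have fit in the 100745 available blocks.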

Comment 13 errata-xmlrpc 2020-09-29 20:33:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (gfs2-utils bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4008
