Description of problem:
After creating and removing a large file on a gfs2 filesystem using statfs_slow, the filesystem withdraws. This happened on a fresh filesystem for both me and a customer, where nothing had been written other than this one file. I could not reproduce the issue when mounting with just one node, nor when using a smaller (700 MB) file. I was able to reproduce the issue on the first try (using the procedure below), but unfortunately I can't seem to anymore.

Version-Release number of selected component (if applicable):
kernel-2.6.18-128.1.10.el5, gfs2-utils-0.1.53-1.el5_3.3

How reproducible:
Sometimes

Steps to Reproduce:
# mkfs.gfs2 -p lock_dlm -t jrummy5:gfs2-a -j 2 /dev/clust/gsf2-a
# mount /dev/clust/gfs2-a /mnt/gfs2-a   <-- on both nodes
# gfs2_tool settune /mnt/gfs2-a/ statfs_slow 1   <-- on both nodes
# cd /mnt/gfs2-a/
# dd if=/dev/zero of=./bigfile bs=1M count=3800
# df -k | grep lv1
# rm bigfile
# df -k | grep lv1

Actual results:
I/O errors received and withdraw seen in logs:

Jun 10 16:09:18 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: fatal: filesystem consistency error
Jun 10 16:09:18 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: RG = 65388
Jun 10 16:09:18 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: function = gfs2_rgrp_verify, file = fs/gfs2/rgrp.c, line = 274
Jun 10 16:09:18 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: about to withdraw this file system
Jun 10 16:09:18 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: telling LM to withdraw
Jun 10 16:09:19 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: withdrawn
Jun 10 16:09:19 jrummy5-1 kernel:
Jun 10 16:09:19 jrummy5-1 kernel: Call Trace:
Jun 10 16:09:19 jrummy5-1 kernel: [<ffffffff88508526>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
Jun 10 16:09:19 jrummy5-1 kernel: [<ffffffff800255df>] find_or_create_page+0x1e/0x75
Jun 10 16:09:19 jrummy5-1 kernel: [<ffffffff80063097>] thread_return+0x62/0xfe
Jun 10 16:09:19 jrummy5-1 kernel: [<ffffffff8851b6cf>] :gfs2:gfs2_consist_rgrpd_i+0x34/0x39
Jun 10 16:09:19 jrummy5-1 kernel: [<ffffffff88517977>] :gfs2:gfs2_rgrp_verify+0x176/0x230
Jun 10 16:09:19 jrummy5-1 kernel: [<ffffffff88519742>] :gfs2:gfs2_statfs_slow+0xc6/0x19b
Jun 10 16:09:19 jrummy5-1 kernel: [<ffffffff88517ff9>] :gfs2:gfs2_rindex_hold+0x32/0x153
Jun 10 16:09:19 jrummy5-1 kernel: [<ffffffff885139b0>] :gfs2:gfs2_statfs+0x46/0xa0
Jun 10 16:09:19 jrummy5-1 kernel: [<ffffffff800db2c3>] vfs_statfs+0x63/0x7f
Jun 10 16:09:19 jrummy5-1 kernel: [<ffffffff800db422>] vfs_statfs_native+0x13/0x34
Jun 10 16:09:19 jrummy5-1 kernel: [<ffffffff800db4ea>] sys_statfs+0x3f/0x79
Jun 10 16:09:19 jrummy5-1 kernel: [<ffffffff80066bcd>] do_page_fault+0x4fe/0x830
Jun 10 16:09:19 jrummy5-1 kernel: [<ffffffff8000de64>] do_mmap_pgoff+0x66c/0x7d7
Jun 10 16:09:19 jrummy5-1 kernel: [<ffffffff800b46ab>] audit_syscall_entry+0x16e/0x1a1
Jun 10 16:09:19 jrummy5-1 kernel: [<ffffffff8005d229>] tracesys+0x71/0xe0
Jun 10 16:09:19 jrummy5-1 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jun 10 16:09:19 jrummy5-1 kernel:
Jun 10 16:09:19 jrummy5-1 kernel: GFS2: fsid=jrummy5:gfs2-a.0: used data mismatch: 65326 != 65326

Expected results:
File is removed, space is reclaimed.

Additional info:
I'll keep trying to reproduce this and see whether there are repeatable steps that can be taken.

Ben, does this look familiar from your recent work on statfs?
Sounds more like file system corruption to me, detected when the slow statfs traverses the rgrp linked list.
I have finally recreated this myself on a 3-node x86 cluster. However, I needed to do multiple dd's, removes, and df's on different nodes to hit it.
Created attachment 350981
Change to statfs_slow code to fix withdraws

This change in the statfs_slow code fixes the issue for me. Both linked and unlinked inodes are counted by rgd->rd_dinodes, so it makes no sense to count unlinked inodes with the used data blocks (the first check that I changed). It makes sense to count them with the linked inodes (the second check), and it makes no sense to care whether there are more unlinked inodes than linked ones.
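
For readers without the attachment, the reasoning in the comment above can be modelled roughly as follows. This is only a sketch in Python, not the kernel patch: the class names, field names, and the return strings other than "used data mismatch" (which appears in the log) are assumptions made for illustration, and the per-state counts are assumed to come from a scan of the resource group bitmap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BitmapCounts:
    """Blocks found in each state when scanning the bitmap (hypothetical model)."""
    free: int
    used: int       # used data blocks
    unlinked: int   # unlinked inodes
    linked: int     # linked inodes


@dataclass
class RgrpHeader:
    """Counters stored for the resource group (hypothetical model)."""
    data: int       # total blocks in the resource group
    free: int
    dinodes: int    # linked AND unlinked inodes, per the comment above


def check_before(c: BitmapCounts, rg: RgrpHeader) -> Optional[str]:
    """Checks as described before the change; returns the first failure."""
    expected_used = rg.data - rg.free - rg.dinodes
    if c.used + c.unlinked != expected_used:   # unlinked counted as data
        return "used data mismatch"
    if c.linked != rg.dinodes:                 # unlinked left out of dinodes
        return "dinode mismatch"
    if c.unlinked > c.linked:                  # the check the change drops
        return "unlinked mismatch"
    return None


def check_after(c: BitmapCounts, rg: RgrpHeader) -> Optional[str]:
    """Checks as described after the change; returns the first failure."""
    expected_used = rg.data - rg.free - rg.dinodes
    if c.used != expected_used:                # data blocks only
        return "used data mismatch"
    if c.linked + c.unlinked != rg.dinodes:    # both kinds of inode
        return "dinode mismatch"
    return None


# A consistent group holding one linked and one unlinked inode.
counts = BitmapCounts(free=90, used=8, unlinked=1, linked=1)
header = RgrpHeader(data=100, free=90, dinodes=2)
print(check_before(counts, header))
print(check_after(counts, header))
```

In this model a perfectly consistent resource group fails the old first check as soon as it holds a single unlinked inode, for example while a removed file is waiting to be deallocated. That would also fit the log line showing two equal numbers if the message reports only the used-block count, though that is an inference and not something stated in this report.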
patch posted on 7/10/09: 3:38 PM; set dev ack.
in kernel-2.6.18-158.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html