Description of problem:

Two issues reported:

1. GFS stuck in gfs_releasepage() with gulm_Cb_Handler using >24% of a CPU:

Jun 9 15:17:38 Kaukasian kernel: GFS: fsid=Automatoi:home.4: stuck in gfs_releasepage()...
Jun 9 15:17:38 Kaukasian kernel: GFS: fsid=Automatoi:home.4: blkno = 7161047, bh->b_count = 2
Jun 9 15:17:38 Kaukasian kernel: GFS: fsid=Automatoi:home.4: bh->b_journal_head = !NULL
Jun 9 15:17:38 Kaukasian kernel: GFS: fsid=Automatoi:home.4: gl = (4, 7155705)
Jun 9 15:17:38 Kaukasian kernel: GFS: fsid=Automatoi:home.4: bd_new_le.le_trans = NULL
Jun 9 15:17:38 Kaukasian kernel: GFS: fsid=Automatoi:home.4: bd_incore_le.le_trans = NULL
Jun 9 15:17:38 Kaukasian kernel: GFS: fsid=Automatoi:home.4: bd_frozen = NULL
Jun 9 15:17:38 Kaukasian kernel: GFS: fsid=Automatoi:home.4: bd_pinned = 0
Jun 9 15:17:38 Kaukasian kernel: GFS: fsid=Automatoi:home.4: bd_ail_tr = NULL
Jun 9 15:17:38 Kaukasian kernel: GFS: fsid=Automatoi:home.4: ip = 7155705/7155705
Jun 9 15:17:38 Kaukasian kernel: GFS: fsid=Automatoi:home.4: ip->i_count = 1, ip->i_vnode = !NULL
Jun 9 15:17:38 Kaukasian kernel: GFS: fsid=Automatoi:home.4: ip->i_arch.i_cache[0] = NULL
Jun 9 15:17:38 Kaukasian kernel: GFS: fsid=Automatoi:home.4: ip->i_arch.i_cache[1] = NULL
Jun 9 15:17:38 Kaukasian kernel: GFS: fsid=Automatoi:home.4: ip->i_arch.i_cache[2] = NULL
Jun 9 15:17:38 Kaukasian kernel: GFS: fsid=Automatoi:home.4: ip->i_arch.i_cache[3] = NULL
Jun 9 15:17:38 Kaukasian login(pam_unix)[4341]: session opened for user pkgcath7 by (uid=0)
Jun 9 15:17:38 Kaukasian kernel: GFS: fsid=Automatoi:home.4: ip->i_arch.i_cache[4] = NULL
Jun 9 15:17:39 Kaukasian kernel: GFS: fsid=Automatoi:home.4: ip->i_arch.i_cache[5] = NULL
Jun 9 15:17:39 Kaukasian kernel: GFS: fsid=Automatoi:home.4: ip->i_arch.i_cache[6] = NULL
Jun 9 15:17:39 Kaukasian kernel: GFS: fsid=Automatoi:home.4: ip->i_arch.i_cache[7] = NULL
Jun 9 15:17:39 Kaukasian kernel: GFS: fsid=Automatoi:home.4: ip->i_arch.i_cache[8] = NULL
Jun 9 15:17:39 Kaukasian kernel: GFS: fsid=Automatoi:home.4: ip->i_arch.i_cache[9] = NULL

2. After the node is off-line, gfs_fsck complained "Extended attributes indirect block out of range...removing", followed by fixing the bitmaps on block numbers 16470-16501 inclusively:

06/12/05 06:01 adingman@Prometheus:~$ sudo gfs_fsck /dev/pool/automatoi_data
Initializing fsck
Starting pass1
Pass1 complete
Starting pass1b
Pass1b complete
Starting pass1c
Extended attributes indirect block out of range...removing
Pass1c complete
Starting pass2
Pass2 complete
Starting pass3
Pass3 complete
Starting pass4
Pass4 complete
Starting pass5
ondisk and fsck bitmaps differ at block 16470
Fix bitmap for block 16470? (y/n) y
Succeeded.
ondisk and fsck bitmaps differ at block 16471
Fix bitmap for block 16471? (y/n) y
Succeeded.
ondisk and fsck bitmaps differ at block 16472
Fix bitmap for block 16472? (y/n) y
Succeeded.
ondisk and fsck bitmaps differ at block 16473
Fix bitmap for block 16473? (y/n) y
Succeeded.
ondisk and fsck bitmaps differ at block 16474
Fix bitmap for block 16474? (y/n) y
Succeeded.
ondisk and fsck bitmaps differ at block 16475
Fix bitmap for block 16475? (y/n) y
Succeeded.
ondisk and fsck bitmaps differ at block 16476
Fix bitmap for block 16476? (y/n) y
Succeeded.
ondisk and fsck bitmaps differ at block 16477
Fix bitmap for block 16477? (y/n) y
Succeeded.
ondisk and fsck bitmaps differ at block 16478
Fix bitmap for block 16478? (y/n) y
Succeeded.
ondisk and fsck bitmaps differ at block 16479
Fix bitmap for block 16479? (y/n) y
Succeeded.
ondisk and fsck bitmaps differ at block 16480
Fix bitmap for block 16480? (y/n) y
Succeeded.
ondisk and fsck bitmaps differ at block 16481
Fix bitmap for block 16481? (y/n) y
Succeeded.
ondisk and fsck bitmaps differ at block 16482
Fix bitmap for block 16482? (y/n) y
Succeeded.

Version-Release number of selected component (if applicable):
* 2.4.21-32.0.1.ELsmp
* GFS-6.0.2.20-2-i686

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Looks like the extended attributes metadata is getting consistently corrupted, since the very same problem has been occurring on different partitions (filesystems) and different disks. Each time, gfs_fsck tries to fix the same fixed set of block numbers (16470-16501). Note that the filesystems are mounted with acl on:

# GFS and network volumes
#/dev/pool/automatoi_conf /automatoi gfs defaults,acl 0 0
#/dev/pool/automatoi_home /home gfs defaults,acl 0 0
#/dev/pool/automatoi_data /data gfs defaults,acl 0 0
#/dev/pool/automatoi_slashShared /shared gfs defaults,acl 0 0
#/dev/pool/automatoi_sanBackup /mnt/sanBackup gfs defaults,acl,noatime 0 0
Created attachment 115444 [details] log 4-2
Does the fsck make it so the filesystem works again? Is samba the process using the ACLs?
This bug is on RHEL3, not RHEL4.
I believe the fsck *does* fix the issue. However, after the filesystem is back online again (or even moved to another disk), the stuck-fence-fsck cycle repeats, and each and every time the fsck tries to fix the very same block numbers. Looks to me like something to do with extended attributes. Haven't heard back from John yet - he is on site today.
Cloned bug to bug #160525 for the gfs portion of this - turns out it's a red herring - the fsck was mishandling extended attributes, so it is a bug, but there's nothing wrong with the actual fs metadata. See bug #160525 for more information.
what type of load is being run to trigger this?
Now looking at the logs for Talos: at 19:19:38, Kaukasian was instantly expired, with no missing heartbeats. Was it manually expired?
(GPS on behalf of the client) Yes, it was manually expired.

The client reports that in the cases where fs "corruption" is suspected, the GFS client attempting to access files on the GFS filesystem builds up a large number of IO-blocked processes, which can only be cleared by rebooting the node. When the system is in this state, any current or new process that attempts to access the GFS fs in question will block on IO.

Is it your guess that there is actually no filesystem corruption occurring, but that the gfs_fsck command is erroneously reporting that items require fixing? If that is the case, then why does it no longer report the same errors after "fixing," given the argument that nothing was done in the first place?

We're basically looking for a next step here. The client has pretty much had it with their application's (jBase) interactions with GFS, and we are considering migrating to ext3 on pool. In that case, we will have a test cluster with the current configuration of jBase on GFS, which we can then use to attempt replication of the problem. Currently, however, we have been unable to replicate the problem in production. After disabling ACLs, the failures have decreased from every other day to once a week or so.
No, the fsck is converting free metadata to free data - it's just shifting internal types around. So it is changing things on the filesystem, but it's just adjusting types that GFS would eventually get to anyway. I have updated the fsck to print a different message in this case - see bug #160835 for more details.
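For what it's worth, here is a rough sketch of what "converting free metadata to free data" amounts to in a 2-bit-per-block allocation bitmap. This is illustrative only, not the actual gfs_fsck code; the state names, values, and bit layout below are assumptions made for the example:

/* Illustrative only -- a minimal sketch of demoting "free metadata"
 * blocks to plain free blocks in a 2-bit-per-block bitmap.  The state
 * values are assumptions for this example, not taken from GFS headers. */
#include <stdint.h>
#include <stddef.h>

#define BLKST_FREE      0x0  /* free, usable as data */
#define BLKST_USED      0x1  /* allocated data block */
#define BLKST_FREEMETA  0x2  /* freed block still typed as metadata */
#define BLKST_USEDMETA  0x3  /* allocated metadata block */

/* Each byte holds four 2-bit block states. */
static unsigned get_state(const uint8_t *bitmap, size_t blk)
{
        return (bitmap[blk / 4] >> ((blk % 4) * 2)) & 0x3;
}

static void set_state(uint8_t *bitmap, size_t blk, unsigned state)
{
        unsigned shift = (blk % 4) * 2;

        bitmap[blk / 4] &= ~(0x3 << shift);
        bitmap[blk / 4] |= (state & 0x3) << shift;
}

/* Reclassify "free metadata" blocks as plain free blocks.  Only the
 * type bits of already-free blocks change; nothing in use is touched. */
static void demote_free_meta(uint8_t *bitmap, size_t nblocks)
{
        for (size_t blk = 0; blk < nblocks; blk++)
                if (get_state(bitmap, blk) == BLKST_FREEMETA)
                        set_state(bitmap, blk, BLKST_FREE);
}

The point is that only type bits on blocks that are already free get rewritten, which is why this "fix" is harmless to the actual filesystem contents.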
How nicely does JBase play with others? Does it need lots of memory? You're running gulm embedded, and that requires quite a bit of memory depending on the situation - it looks like there's 4G of memory on the systems. Have you tried this with gulm running external?
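For reference, running gulm external just means the lock servers listed in cluster.ccs are dedicated nodes that don't mount GFS or run jBase themselves. From memory (the node names below are made up -- please check the exact syntax against the GFS 6.0 documentation), the relevant piece would look something like:

cluster {
        name = "Automatoi"
        lock_gulm {
                servers = ["lock01", "lock02", "lock03"]
        }
}

The servers list would name machines other than the GFS clients, and an odd number of them, so the lock servers can still establish quorum if one dies.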
Got the hotfix -- thanks.

As for jBase, it can be hard to tell how much memory it actually uses. We're currently looking into a way to determine memory usage. It appears that the main jBase daemons are merely lock arbitrators, and the main "heavy lifting" of the program is done by the user daemons that are started when a user logs in. At peak times, there are 300 users logged onto the system.

In terms of memory usage, there's 2.5 GiB free on a loaded system, about 2 GiB cached right now, and around 30 MiB of buffers. All in all, the memory situation looks rosy. I'd be more inclined to think that we're bumping into a network problem of some type.

We have not tried running stand-alone locking, although it is certainly something to try. Since we ensure that the master is *not* actually located on the jBase system, this shouldn't really be a huge issue, but it's certainly something that we can, and will, try.
Something else to consider is the CPU time consumed by jBase versus the opportunities GuLM gets to run. Separating the two onto different nodes might provide some insight into that as well.
Could some of the latest latency changes be the solution for this bugzilla? Have the jBase issues been resolved?