Description of problem:
While working on bug #606468 on an ia64 box I discovered that
my latest and greatest fsck.gfs2 produced multiple unaligned
access errors during execution. I backtracked it to a
regression as described in:
The regression is here:
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. fsck.gfs2 /dev/device
fsck.gfs2(10238): unaligned access to 0x600000000006930b, ip=0x400000000004c730
None of these messages should be received
I have a patch to fix the problem
Requesting ack flags to get this into 6.0.
Created attachment 426982 [details]
Proposed patch for STABLE3
To fix the problem, I simply ported Steve Whitehouse's kernel
version of the latest gfs2_bitfit function back to user space.
I tested it on system a1 and it works properly.
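For readers unfamiliar with the word-at-a-time approach, here is a minimal sketch of the general technique of searching a 64-bit chunk of a 2-bit-per-block bitmap for a given state. It is not the attached patch and not the kernel's gfs2_bitfit; the names match_fields and first_match are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Low bit of every 2-bit field in a 64-bit word. */
#define LOW_BITS 0x5555555555555555ULL

/*
 * Hypothetical helper: return a word in which the low bit of each
 * 2-bit field is set if that field equals 'state' (0..3).
 */
static uint64_t match_fields(uint64_t word, unsigned state)
{
    assert(state <= 3);
    /* Replicate the 2-bit state across all 32 fields. */
    uint64_t pattern = LOW_BITS * (uint64_t)state;
    /* Fields equal to 'state' become 11, all others have a 0 bit. */
    uint64_t x = ~(word ^ pattern);
    /* Keep only the fields in which both bits are set. */
    return x & (x >> 1) & LOW_BITS;
}

/*
 * Hypothetical helper: index (0..31) of the first field equal to
 * 'state', counting from the least significant bits, or -1 if none.
 */
static int first_match(uint64_t word, unsigned state)
{
    uint64_t m = match_fields(word, state);
    int i;

    for (i = 0; i < 32; i++)
        if (m & (1ULL << (2 * i)))
            return i;
    return -1;
}
```

Scanning from the least significant field upward means matches within a word come back in ascending order.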
Why is fixing a bug that only shows on a platform we don't ship a blocker?
Because the patch to fix the problem affects all platforms
and the patch that introduced the problem affected all platforms
and is a regression?
OK. It just didn't seem like a particularly relevant regression, if it only affects ia64.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release. Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release. This request is not yet committed for
I pushed the patch to the master branch of the gfs2-utils git
tree and the STABLE3 and RHEL6 branches of the cluster git tree
for inclusion into 6.0. Changing status to POST until it gets
built. This was tested on system a1 for ia64 and on roth-08
Ran into an issue where fsck.gfs2 got stuck in an infinite loop reading the same block.
<bob> refried: This is a bug with gfs2's bitfit algorithm. For 608154 I ported it from kernel space to user space. The problem is that the algorithm doesn't return blocks in ascending order.
<bob> I'm calling the algorithm to "get the next bitmap block higher than 0x2016" and it comes back with 0x2015.
<bob> I believe I noticed the non-sequential issue in the kernel code months ago and swhiteho_ohnl and I discussed it.
<bob> So I guess the proper thing to do is to mark 608154 as FAILS_QA because that's what I'm going to have to rework
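The expectation in the log above can be stated as a simple invariant: a search starting at a goal block must never return a block below that goal. A slow reference scan like the sketch below is one way to check a faster implementation against it. The function names and the NO_BLOCK sentinel are hypothetical; the sketch assumes 2 bits per block with the lowest-numbered block in the low bits of each byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCKS_PER_BYTE 4          /* 2 bits of state per block */
#define NO_BLOCK UINT32_MAX        /* hypothetical "not found" value */

/* Read the 2-bit state of block 'blk' from the bitmap. */
static unsigned block_state(const unsigned char *bitmap, size_t blk)
{
    return (bitmap[blk / BLOCKS_PER_BYTE] >>
            ((blk % BLOCKS_PER_BYTE) * 2)) & 0x3;
}

/*
 * Reference search: return the first block at or after 'goal' whose
 * state equals 'state', or NO_BLOCK. By construction the result is
 * never lower than 'goal'.
 */
static uint32_t next_block_in_state(const unsigned char *bitmap,
                                    size_t len, uint32_t goal,
                                    unsigned state)
{
    size_t nblocks = len * BLOCKS_PER_BYTE;
    size_t blk;

    assert(bitmap != NULL);
    for (blk = goal; blk < nblocks; blk++)
        if (block_state(bitmap, blk) == state)
            return (uint32_t)blk;
    return NO_BLOCK;
}
```

Comparing the fast routine's answer with this one for every goal in a resource group would flag a result such as 0x2015 for a goal above 0x2016.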
Created attachment 434885 [details]
Okay, mystery solved. I enhanced a private copy of gfs2_edit
so that it would tell me the block allocations as I walked the
bitmaps. That enabled me to figure out what the proper values
should be, and that, in turn, enabled me to figure out the problem.
The problem, as it turns out, is a "thinko" in this patch
that only affects i386. I was using sizeof(unsigned long) when
I should have been using sizeof(unsigned long long). The two
sizes are the same on x86_64 (8 bytes) but differ on 32-bit
machines like the one that failed, where unsigned long is only
4 bytes. That bad size caused a miscalculation of the
shift point, which threw everything off. The infinite loop was
repeatedly returning the same block due to this bad shift point.
All in all that's good news because it means the original
algorithm in the kernel is sound and doesn't need changing.
This addendum patch should take care of the problem for user space.
I'll push it out shortly.
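To make the size mismatch concrete, here is a small sketch of how a shift point derived from the word size changes when the word is assumed to be 4 bytes instead of 8. The function shift_point is hypothetical and only models the arithmetic; it is not the code from the addendum patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: bit offset of block 'goal' inside the word that
 * contains it, for a bitmap with 2 bits per block and words of
 * 'word_bytes' bytes.
 */
static unsigned shift_point(uint32_t goal, size_t word_bytes)
{
    assert(word_bytes == 4 || word_bytes == 8);
    return (unsigned)((goal << 1) & (8 * word_bytes - 1));
}
```

For goal 20, an 8-byte word gives a shift of 40 bits, while a 4-byte word gives 8 bits. If the data is still read in 64-bit chunks, the second value points at the wrong block. Passing sizeof(unsigned long) reproduces the problem only on platforms where that type is 4 bytes, which is why x86_64 looked fine.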
I tested the new patch on system morph-01. The patch was pushed
to the master branch of the gfs2-utils git repository and
the STABLE3 and RHEL6 branches of the cluster git repository.
Changing status to POST until this gets built into a new
Is it not better to use fixed size types here? The kernel one always uses chunks of u64 even on 32-bit arches. It's possible that it's a bit slower (due to the smaller register set) but I doubt it makes a great deal of difference overall.
Well, actually that was my first thought, to use declarations
such as "uint64_t" but I was leery of doing that because the
previous algorithm did that and got into trouble with the
unaligned access messages on ia64.
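For what it's worth, a fixed-size type and alignment safety do not have to conflict. The unaligned access messages come from dereferencing a 64-bit pointer at an address that is not 8-byte aligned, not from the type itself. One common way around that, sketched below with a hypothetical helper name, is to copy the bytes into a properly aligned local with memcpy. This is an illustration only, not what either patch does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: load 8 bytes from any address into a uint64_t
 * without requiring the address to be 8-byte aligned. The bytes are
 * taken in host byte order; no endian conversion is done here.
 */
static uint64_t load_u64(const unsigned char *p)
{
    uint64_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof(v));
    return v;
}
```

Compilers typically turn the memcpy into a plain load where the platform allows it, so the cost on x86 should be small.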
I can get through gfs_fsck_stress on i686 again.
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.