Description of Problem:

We have been experiencing some problems trying to use kernel modules with kernels that are compiled with different versions of gcc. On our kernel build machine (where we compile our kernel modules) we have gcc 2.91.66 (I believe the preferred kernel compiler, according to Documentation/Changes); RedHat 7.1 ships with gcc 2.96.

Now, the problem is that RedHat also apparently compiles (at least its newer) kernels with gcc 2.96. Unfortunately, there appears to be a structure misalignment problem in gcc 2.96. One particular instance of this problem that we are running into is in the raid1.o module in the 2.4.3 kernel. The structure alignment problem is causing our gcc 2.91.66-compiled raid1 module to malfunction. (raid1.o compiled from the same source with gcc 2.96 works fine.)

We've traced the problem down to the following assembly code generated by gcc 2.96 and gcc 2.91.66 respectively (parameter setup and call to __alloc_pages within raid1_grow_buffers):

2.96:

    movl $contig_page_data_Rsmp_cef82582+3800, %eax
    call __alloc_pages_Rsmp_decacc2f

2.91.66:

    movl $contig_page_data_Rsmp_cef82582+3884,%eax
    call __alloc_pages_Rsmp_decacc2f

gcc 2.91.66 is padding out the zone_t structure by 28 bytes. With an array of 3 of those before our field in question, that equals the 84 byte difference in offset in the above assembler code (3884 - 3800). The 28 byte padding is because gcc 2.91.66 is trying to 32 byte align this structure. The reason for this is that the first submember of zone_t (per_cpu_t) is explicitly defined as 32 byte aligned.
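To make the 28 and 84 byte figures concrete, here is a minimal sketch of the padding arithmetic. The demo_* types are hypothetical stand-ins, not the real per_cpu_t and zone_t from mmzone.h, and the sketch assumes a 4-byte int and a compiler that honors the aligned attribute as gcc 2.91.66 does in the report.

```c
#include <stddef.h>

/* Hypothetical stand-in for per_cpu_t: 8 bytes of data, 32-byte aligned. */
typedef struct demo_per_cpu {
    int a;
    int b;
} __attribute__((aligned(32))) demo_per_cpu_t;

/* Hypothetical stand-in for zone_t: aligned first member, 4 trailing bytes. */
typedef struct demo_zone {
    demo_per_cpu_t cpu_pages[1];
    int tail;
} demo_zone_t;

/* Bytes occupied by the members, before any tail padding. */
size_t demo_used_bytes(void)
{
    return offsetof(demo_zone_t, tail) + sizeof(int);
}

/* Tail padding added so that sizeof is a multiple of the 32-byte alignment. */
size_t demo_tail_padding(void)
{
    return sizeof(demo_zone_t) - demo_used_bytes();
}

/* Extra offset of a field that follows an array of n such structures,
   compared with a layout that adds no tail padding. */
size_t demo_shift_after(size_t n)
{
    return n * demo_tail_padding();
}
```

Under those assumptions demo_tail_padding() is 28 and demo_shift_after(3) is 84, which matches the gap between the two movl offsets. The real zone_t has more members, so only the remainder modulo 32 is meant to correspond.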
So, gcc 2.91.66 is (properly) aligning the per_cpu_t structure on a 32 byte boundary as specified by the __attribute__((aligned(32))) directive in that structure's definition:

    (gdb) p &((pg_data_t *)0)->node_zones[1].cpu_pages[0]
    $22 = (per_cpu_t *) 0x4e0
    (gdb) p &((pg_data_t *)0)->node_zones[1].cpu_pages[1]
    $23 = (per_cpu_t *) 0x500
    (gdb) p 0x500 % 32
    $24 = 0
    (gdb) p 0x4e0 % 32
    $25 = 0

gcc 2.96 is not properly aligning this structure:

    (gdb) p &((pg_data_t *)0)->node_zones[1].cpu_pages[0]
    $32 = (per_cpu_t *) 0x4c4
    (gdb) p &((pg_data_t *)0)->node_zones[1].cpu_pages[1]
    $33 = (per_cpu_t *) 0x4e4
    (gdb) p 0x4c4 % 32
    $34 = 4
    (gdb) p 0x4e4 % 32
    $35 = 4

So, in order for our raid1 modules to work properly with a kernel compiled by gcc 2.96, we must also use (the broken) 2.96 to compile our module.

Version-Release number of selected component (if applicable):

    # rpm -q gcc
    gcc-2.96-81

How Reproducible:

Compile the raid1.o kernel module with gcc 2.91.66 and attempt to run it on the RH 2.4.3-12 kernel (compiled with gcc 2.96). The module fails to work properly: data is not resynchronized when the software array is created, as it is supposed to be. (The root cause is a failure in the call to the __alloc_pages kernel function due to structure misalignment problems.)

Steps to Reproduce:
1.
2.
3.

Actual Results:

Expected Results:

Additional Information:
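The gdb checks in the description above can also be done from C with offsetof. This is a rough sketch only: the demo_* structures are hypothetical and much smaller than the kernel's pg_data_t, so the absolute offsets differ from the report, and the nested array designators in offsetof rely on GCC-compatible compilers.

```c
#include <stddef.h>

/* Hypothetical stand-in for per_cpu_t. */
typedef struct demo_cpu {
    int a;
    int b;
} __attribute__((aligned(32))) demo_cpu_t;

/* Hypothetical stand-in for zone_t. */
typedef struct demo_zone2 {
    demo_cpu_t cpu_pages[2];
    int free_pages;
} demo_zone2_t;

/* Hypothetical stand-in for pg_data_t. */
typedef struct demo_pgdat {
    demo_zone2_t node_zones[3];
    int nr_zones;
} demo_pgdat_t;

/* Same check as "p <offset> % 32" in gdb: 0 means 32-byte aligned. */
size_t demo_misalignment(size_t offset)
{
    return offset % 32;
}

/* Same idea as "p &((pg_data_t *)0)->node_zones[1].cpu_pages[slot]". */
size_t demo_cpu_pages_offset(int slot)
{
    return slot == 0
        ? offsetof(demo_pgdat_t, node_zones[1].cpu_pages[0])
        : offsetof(demo_pgdat_t, node_zones[1].cpu_pages[1]);
}
```

With a compiler that lays the structures out correctly, demo_misalignment(demo_cpu_pages_offset(slot)) is 0 for both slots. Feeding it the reported 2.96 offsets 0x4c4 and 0x4e4 gives 4, as in the gdb session.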
Can you please attach the exact preprocessed source which shows this? I've tried

    typedef struct x { int a; int b; } __attribute__((aligned(32))) X;
    typedef struct y { X x; int c; } Y;
    Y y[3];

which models about what I can see in 2.4.7's mmzone.h, and y has the same size and alignment both with all 2.96-RH's I've tried and with egcs 1.1.2.
Created attachment 29274: pre-processed source file which demonstrates the misalignment problem
Simplified testcase:

    #include <stdlib.h>

    typedef struct x { int a; int b; } __attribute__((aligned(32))) X;
    typedef struct y { X x[32]; int c; } Y;
    Y y[3];

    int main(void)
    {
      if (sizeof (y) != 3168)
        abort ();
      exit (0);
    }

(The X in an array is important; changing it to

    typedef struct y { X x; X y[31]; int c; } Y;

fixes it.)

This testcase works with gcc < 2.96-RH on all architectures, and with 2.96-RH and later (incl. 3.0, 3.0.1, 3.1) on non-IA-32 architectures (tested alpha, IA-64, sparc). It fails on IA-32 with 2.96-RH, 3.0, 3.0.1 and 3.1. Apparently some config/i386 alignment issue; will debug this tomorrow.
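For reference, the expected value 3168 in the testcase follows from rounding each size up to the 32-byte alignment. The sketch below works it out by hand; the demo_* helpers are hypothetical names introduced here, and a 4-byte int is assumed.

```c
#include <stddef.h>

/* Round size up to the next multiple of align (align must be non-zero). */
size_t demo_round_up(size_t size, size_t align)
{
    return (size + align - 1) / align * align;
}

/* Expected sizeof(y) for the testcase's "Y y[3]", assuming 4-byte int. */
size_t demo_expected_sizeof_y(void)
{
    size_t x_size = demo_round_up(2 * sizeof(int), 32); /* X: 8 bytes -> 32 */
    size_t y_raw  = 32 * x_size + sizeof(int);          /* X x[32] plus int c: 1028 */
    size_t y_size = demo_round_up(y_raw, 32);           /* padded to 1056 */
    return 3 * y_size;                                  /* three elements: 3168 */
}
```

A compiler that omits the tail padding on Y produces a smaller total, which is the condition the abort () in the testcase catches.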
s/tomorrow/today/. Analysis with a patch is at http://gcc.gnu.org/ml/gcc-patches/2001-09/msg00072.html but I'm not sure I want to apply this, since it would mean binary incompatibility between modules created with gcc < 2.96-98 and modules created with gcc >= 2.96-98, which is a worse thing than binary incompatibility with egcs 1.1.2.
gcc 3.2 should have resolved all these issues