Bug 762588 (GLUSTER-856)

Summary: GCC Compilation error in 4 Node Distributed Native NFS Configuration for infiniband transport
Product: [Community] GlusterFS
Reporter: Sampath <sampath.kumar>
Component: nfs
Assignee: Shehjar Tikoo <shehjart>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: medium
Docs Contact:
Priority: urgent
Version: nfs-beta
CC: gluster-bugs, lakshmipathi
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: ---
Regression: RTP
Mount Type: nfs
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:

Description Sampath 2010-04-26 02:51:27 EDT
The NFS server log file has been copied to the /share/tickets/856/ folder on dev.gluster.com.
Comment 1 Sampath 2010-04-26 05:46:08 EDT
Getting the following error when running make:

make[5]: Leaving directory `/mnt/nfs-share/build/x86_64-unknown-linux-gnu/32/boehm-gc'
make[4]: Leaving directory `/mnt/nfs-share/build/x86_64-unknown-linux-gnu/32/boehm-gc'
make[3]: Leaving directory `/mnt/nfs-share/build/x86_64-unknown-linux-gnu/boehm-gc'
make[2]: Leaving directory `/mnt/nfs-share/build/x86_64-unknown-linux-gnu/boehm-gc'
make[1]: Leaving directory `/mnt/nfs-share/build/x86_64-unknown-linux-gnu/boehm-gc'
make[1]: Entering directory `/mnt/nfs-share/build/x86_64-unknown-linux-gnu/libjava'
deps.mk:1: libltdl/ltdl.d: No such file or directory

make[1]: *** No rule to make target `libltdl/ltdl.d'.  Stop.
make[1]: Leaving directory `/mnt/nfs-share/build/x86_64-unknown-linux-gnu/libjava'
make: *** [all-target-libjava] Error 2
Comment 2 Sampath 2010-04-26 06:13:07 EDT
The strace output of the make command has been copied to the /share/tickets/856/ folder on dev.gluster.com.
Comment 3 Shehjar Tikoo 2010-04-26 21:20:52 EDT
Sampath, do you get the same error when running with strace?
Comment 4 Sampath 2010-04-27 03:52:39 EDT
Getting the following gcc make error:

make[5]: *** [ltdl.lo] Error 1
make[5]: Leaving directory `/mnt/nfs-share/build/x86_64-unknown-linux-gnu/32/libjava/libltdl'
make[4]: *** [all] Error 2
make[4]: Leaving directory `/mnt/nfs-share/build/x86_64-unknown-linux-gnu/32/libjava/libltdl'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/mnt/nfs-share/build/x86_64-unknown-linux-gnu/32/libjava'
make[2]: *** [multi-do] Error 1
make[2]: Leaving directory `/mnt/nfs-share/build/x86_64-unknown-linux-gnu/libjava'
make[1]: *** [all-multi] Error 2
make[1]: Leaving directory `/mnt/nfs-share/build/x86_64-unknown-linux-gnu/libjava'
make: *** [all-target-libjava] Error 2

The strace output has been copied to the /share/tickets/856 folder on dev.gluster.com.
Comment 5 Shehjar Tikoo 2010-04-28 00:54:47 EDT
The gcc build has been verified to run over Ethernet. The problem occurs only over IB and is most probably related to the NFS performance degradation over IB.
Comment 6 Shehjar Tikoo 2010-05-06 22:38:37 EDT
From Chida:

The compile on local disk completes in 32 minutes. The compile on a 4-node distribute setup results in an error.

Run 1:
checking for strerror... yes
checking for unistd.h... (cached) yes
updating cache ./config.cache
creating ./config.status
creating Makefile
make[1]: Entering directory `/mnt/distribute/src/build/zlib'
make[1]: *** No rule to make target `adler32.*', needed by `libz.a'.  Stop.
make[1]: Leaving directory `/mnt/distribute/src/build/zlib'
make: *** [all-zlib] Error 2

real    0m42.237s
user    0m14.662s
sys     0m9.793s
[root@client01 build]# 

Run 2:
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
creating libtool
updating cache ./config.cache
configure: loading cache ./config.cache
checking how to run the C++ preprocessor... (cached)  /mnt/synopsys/build/gcc/xgcc -shared-libgcc -B/mnt/synopsys/build/gcc/ -nostdinc++ -L/
configure: error: C++ preprocessor " /mnt/synopsys/build/gcc/xgcc -shared-libgcc -B/mnt/synopsys/build/gcc/ -nostdinc++ -L/mnt/synopsys/buil
See `config.log' for more details.
make: *** [configure-target-libstdc++-v3] Error 1

real    6m13.037s
user    4m57.355s
sys     0m26.728s


Run 3:
checking for time... yes
checking for ftime... /mnt/distribute/src/gcc-3.4.6/libjava/configure: line 6295: test: conftest.*: binary operator expected
no
checking for memmove... yes
checking for memcpy... /mnt/distribute/src/gcc-3.4.6/libjava/configure: line 6412: test: conftest.*: binary operator expected
no
configure: error: memcpy is required
make: *** [configure-target-libjava] Error 1

real    40m39.667s
user    18m19.311s
sys     9m16.740s
[root@client01 buils]#
Comment 7 Shehjar Tikoo 2010-05-07 02:21:00 EDT
This is the first time I have been able to see a message related to an error number:

[root@domU-12-31-39-0E-B1-16 gcc-3.4.6]# make >/dev/null
configure: WARNING:
*** Makeinfo is missing or too old.
*** Info documentation will not be built.
<b>nm: conftest.o: Input/output error
nm: conftest.o: Input/output error
</b>
./configure: line 10583: test: too many arguments
checking for the document directory.
Links are now set up to build a native compiler for i686-pc-linux-gnu.
In file included from ./../include/xregex.h:26,
                 from regex.c:195:
./../include/xregex2.h:548: warning: ISO C90 does not support ‘static’ or type qualifiers in parameter array declarators
In file included from regex.c:649:
regex.c: In function ‘byte_compile_range’:
regex.c:4548: warning: signed and unsigned type in conditional expression
regex.c:4558: warning: signed and unsigned type in conditional expression
regex.c:4558: warning: signed and unsigned type in conditional expression
regex.c: In function ‘xregcomp’:
regex.c:8043: warning: signed and unsigned type in conditional expression
regex.c: In function ‘xregerror’:
regex.c:8178: warning: unused parameter ‘preg’
concat.c: In function ‘concat_length’:
concat.c:112: warning: traditional C rejects ISO C style function definitions
concat.c: In function ‘concat_copy’:
concat.c:127: warning: traditional C rejects ISO C style function definitions
concat.c: In function ‘concat_copy2’:
concat.c:146: warning: traditional C rejects ISO C style function definitions
concat.c: In function ‘concat’:
concat.c:157: warning: traditional C rejects ISO C style function definitions
concat.c: In function ‘reconcat’:
concat.c:194: warning: traditional C rejects ISO C style function definitions
<b>make[1]: *** Makefile: Input/output error.  Stop.</b>
make: *** [all-gcc] Error 2
Comment 8 Shehjar Tikoo 2010-05-07 02:28:34 EDT
NFS client syslog gives the following messages:
call_verify: XDR representation not a multiple of 4 bytes: 0x756
call_verify: XDR representation not a multiple of 4 bytes: 0x756
call_verify: XDR representation not a multiple of 4 bytes: 0x756
call_verify: XDR representation not a multiple of 4 bytes: 0x756
call_verify: XDR representation not a multiple of 4 bytes: 0x949
call_verify: XDR representation not a multiple of 4 bytes: 0x949
call_verify: XDR representation not a multiple of 4 bytes: 0x949
call_verify: XDR representation not a multiple of 4 bytes: 0x949
call_verify: XDR representation not a multiple of 4 bytes: 0x949
call_verify: XDR representation not a multiple of 4 bytes: 0x949
call_verify: XDR representation not a multiple of 4 bytes: 0x75a
call_verify: XDR representation not a multiple of 4 bytes: 0x75a
call_verify: XDR representation not a multiple of 4 bytes: 0x75a
call_verify: XDR representation not a multiple of 4 bytes: 0x75a
call_verify: XDR representation not a multiple of 4 bytes: 0x989
call_verify: XDR representation not a multiple of 4 bytes: 0x989
call_verify: XDR representation not a multiple of 4 bytes: 0x989
call_verify: XDR representation not a multiple of 4 bytes: 0x8b5
call_verify: XDR representation not a multiple of 4 bytes: 0x8b5
call_verify: XDR representation not a multiple of 4 bytes: 0x8b5
call_verify: XDR representation not a multiple of 4 bytes: 0x8f9
call_verify: XDR representation not a multiple of 4 bytes: 0x8f9
call_verify: XDR representation not a multiple of 4 bytes: 0x8f9
call_verify: XDR representation not a multiple of 4 bytes: 0x8f9
call_verify: XDR representation not a multiple of 4 bytes: 0x997
call_verify: XDR representation not a multiple of 4 bytes: 0x997
call_verify: XDR representation not a multiple of 4 bytes: 0x997


From the Linux kernel file net/sunrpc/clnt.c:
static __be32 *
call_verify(struct rpc_task *task)
{
        struct kvec *iov = &task->tk_rqstp->rq_rcv_buf.head[0];
        int len = task->tk_rqstp->rq_rcv_buf.len >> 2;
        __be32  *p = iov->iov_base;
        u32 n;
        int error = -EACCES;

        if ((task->tk_rqstp->rq_rcv_buf.len & 3) != 0) {
                /* RFC-1014 says that the representation of XDR data must be a
                 * multiple of four bytes
                 * - if it isn't pointer subtraction in the NFS client may give
                 *   undefined results
                 */
                dprintk("RPC: %5u %s: XDR representation not a multiple of"
                       " 4 bytes: 0x%x\n", task->tk_pid, __FUNCTION__,
                       task->tk_rqstp->rq_rcv_buf.len);
                goto out_eio;
        }

This could be the reason for the EIO returned to applications through a syscall.
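
As a quick sanity check, the alignment test that call_verify applies can be run against the reply lengths logged in the client syslog above. This is only an illustrative sketch: the helper name is_xdr_aligned is made up here, and the lengths are copied by hand from the log lines.

```python
# Reply lengths copied from the distinct "call_verify" syslog lines above.
logged_lengths = [0x756, 0x949, 0x75a, 0x989, 0x8b5, 0x8f9, 0x997]


def is_xdr_aligned(length: int) -> bool:
    """Return True when length is a multiple of 4.

    This mirrors the kernel check `(len & 3) != 0`, which sends the
    task to out_eio when the low two bits of the length are set.
    """
    return (length & 3) == 0


for length in logged_lengths:
    print(
        f"0x{length:x} = {length} bytes, "
        f"remainder {length % 4}, aligned: {is_xdr_aligned(length)}"
    )
```

Every length in the log leaves a non-zero remainder when divided by 4, which is consistent with the client rejecting each of those replies.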
Comment 9 Shehjar Tikoo 2010-05-11 01:42:09 EDT
Bugs that were identified during gcc building with releases before rc5 have been fixed in rc5. The gcc build now completes consistently on an FC8 AWS instance over a 4-node distribute setup, with and without all performance translators.

On the other hand, I've also tested gcc building on CentOS AWS instances, since the US cluster is running CentOS. This build fails consistently even over an Ethernet link, unlike what I had reported here earlier.

The error messages reported during the CentOS failure point to some lines in an Ada source file which contain blank lines. I've verified that the contents of this file are the same on the FC8 instance.

I believe the failure is caused by an old Ada compiler, i.e. the gnat compiler, on the CentOS distros. To test this theory, I deleted the blank lines from that particular file and re-ran the build. This time, it built that file correctly and failed on a different file, again reporting problems with blank lines in the Ada source.

This is corroborated by the difference in gnat versions on:
FC8 - GNAT 4.1.2 20070925 (Red Hat 4.1.2-33)
CentOS - GNAT 4.1.1 20070105 (Red Hat 4.1.1-52)

Add to this the fact that the build on the CentOS machine fails not just on the NFS mount point but also on the local file system.

Sac/Chida

If there is nothing more to add, feel free to close this bug.
Comment 10 Shehjar Tikoo 2010-05-11 01:53:15 EDT
I see others also facing similar problems.

See
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22413

and 

http://www.linuxquestions.org/questions/linux-server-73/multiple-gcc-versions-691302/

These look more like setup/environment problems than filesystem-related ones.
Please re-open if doubts remain.
Comment 11 Sampath 2010-05-12 05:56:19 EDT
Getting the following gcc compilation error on local disk
======================================================

s-traent.ads:61:01: (style) blank lines not allowed at end of file
a-exexda.adb:346:01: (style) multiple blank lines
a-exextr.adb:216:01: (style) multiple blank lines
make[1]: *** [ada/a-except.o] Error 1
make[1]: Leaving directory `/tmp/build/gcc'
make: *** [all-gcc] Error 2

real    2m20.460s
user    1m22.799s
sys     0m51.179s

[root@client7 build]# rpm -qa |grep gcc
libgcc-4.1.2-46.el5_4.2
gcc-c++-4.1.2-46.el5_4.2
gcc-4.1.2-46.el5_4.2
gcc-gnat-4.1.2-46.el5_4.2
libgcc-4.1.2-46.el5_4.2


If gcc-gnat is not installed, gcc compilation is successful on local disk.
Comment 12 Shehjar Tikoo 2010-05-12 22:45:45 EDT
Hi Sampath,
On which distro's local disk was the previous test run? CentOS? On AWS or on the US cluster?

For the record, the local disk build succeeds because, in the absence of the GNAT compiler, the configure scripts must be disabling the building of Ada sources.
Comment 13 Shehjar Tikoo 2010-05-26 04:36:37 EDT
Regression Test Info:
RT is required to check against IO error/EIO received by applications. See comments 7 and 8.

Analysis:
The problem is caused by NFSx returning unaligned read replies. In RPC, every message length must be aligned to a 4-byte boundary. For read requests that ask for data lengths not aligned to a 4-byte boundary, NFSx must still send replies with enough padding bytes to align the final RPC message length to 4 bytes. NFSx did not do that, hence the EIO returned by the NFS client to the application.

Patch for this bug was submitted as part of bz 902.
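
The padding rule described in the analysis can be illustrated with a small sketch. This is not the NFSx code or the actual patch. The function names xdr_pad_len and pad_opaque are hypothetical, and the sketch only shows how many zero bytes are needed to round a read payload up to a 4-byte boundary.

```python
def xdr_pad_len(data_len: int) -> int:
    """Number of padding bytes needed to reach the next multiple of 4."""
    return (4 - data_len % 4) % 4


def pad_opaque(data: bytes) -> bytes:
    """Append zero bytes so the encoded payload length is 4-byte aligned."""
    return data + b"\x00" * xdr_pad_len(len(data))


# A 5-byte read payload needs 3 padding bytes; an 8-byte one needs none.
for payload in (b"abcde", b"abcdefgh"):
    padded = pad_opaque(payload)
    print(len(payload), "->", len(padded), "bytes on the wire")
```

The reported data length stays the unpadded value. Only the encoded message grows, so the total RPC message length passes the client's multiple-of-4 check.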

Test case:
1. Use storage/posix as an NFS export. 

2. Create a large file of say 1G.

3. In your test tool, create a collection of (offset, len) pairs. We'll be performing a read op for each one of these pairs. Either the offset or the len should not be aligned to a 4-byte boundary. Each len must be less than 64k; otherwise, the NFS client will simply send requests larger than 64k as properly aligned 64k read requests.

4. Do a read from the file for each one of these (offset, len) pairs. The file must be re-opened for each read, and the mount point must be re-mounted, to avoid cached data being returned from the NFS client cache. We want to force the server to return a reply.

5. Before starting the test, run the following command to enable logging from the kernel NFS client module.
           $ echo "65535" > /proc/sys/sunrpc/nfs_debug
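
Steps 3 and 4 above could be sketched roughly as follows. This is a minimal sketch and not the regression test that was actually written. The function names and the fixed random seed are made up for illustration, and the re-mount between reads is left out because it depends on the local setup and needs root.

```python
import os
import random

MAX_LEN = 64 * 1024  # each len must stay below 64k (step 3)


def make_unaligned_pairs(file_size, count, seed=0):
    """Build (offset, len) pairs where the offset or the len is not a multiple of 4."""
    if file_size < MAX_LEN:
        raise ValueError("file must be at least 64k long")
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < count:
        length = rng.randrange(1, MAX_LEN)
        offset = rng.randrange(0, file_size - length)
        if offset % 4 != 0 or length % 4 != 0:
            pairs.append((offset, length))
    return pairs


def read_pairs(path, pairs):
    """Re-open the file for every pair, read it, and collect any failures."""
    failures = []
    for offset, length in pairs:
        try:
            fd = os.open(path, os.O_RDONLY)
            try:
                data = os.pread(fd, length, offset)
            finally:
                os.close(fd)
            if len(data) != length:
                failures.append((offset, length, f"short read: {len(data)} bytes"))
        except OSError as exc:
            # An unaligned reply would surface here as EIO (Input/output error).
            failures.append((offset, length, exc.strerror))
    return failures
```

A run against the 1G file from step 2 might call make_unaligned_pairs(os.path.getsize(path), 1000) and then read_pairs(path, pairs), where path points at the file on the NFS mount. An empty result means no read failed.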
Comment 14 Shehjar Tikoo 2010-07-14 04:51:16 EDT
The patches submitted in comments 3 and 4 of bug 762634 are the fix for bug 762588. To reproduce 902, use the commit before all of the patches below.

To reproduce 856, use the commit before the patches in comments 3 and 4.
Comment 15 Lakshmipathi G 2010-07-14 07:50:27 EDT
Fixed. Verified with nfs-beta-rc9.
Comment 16 Lakshmipathi G 2010-07-14 22:39:19 EDT
Regression test  - http://test.gluster.com/show_bug.cgi?id=79