Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 747496

Summary:

cthon test hang on nfs

Product:

Red Hat Enterprise Linux 6

Reporter:

Jian Li <jiali>

Component:

kernel

Assignee:

J. Bruce Fields <bfields>

Status:

CLOSED WONTFIX

QA Contact:

Red Hat Kernel QE team <kernel-qe>

Severity:

unspecified

Docs Contact:

Priority:

medium

Version:

6.2

CC:

bfields, dhowells, dpal, jburke, jiali, nmurray, pbunyan, rwheeler, sprabhu, steved, swhiteho, torel, yanwang

Target Milestone:

Keywords:

Reopened

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2017-12-06 12:38:47 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
call trace	none
rpcdebug -m rpc	none
rpcdebug -m nfs	none
rpcdebug -m nfsd	none
print gss_unwrap errors	none
print more gss_unwrap errors	none

Description Jian Li 2011-10-20 03:37:48 UTC

Description of problem:
Test is designed to check nfs exportfs's option(such as: sync,no_wdelay,sec=*) and mount.nfs's option(a|sync,tcp|udp,r|wsize=*) would cooperate well. cthon will run in all combination of these options.

When server uses such options:  (path=/mnt/exportdir)

*****
$path *(rw,sync,no_wdelay,insecure,fsid=0x111111,no_root_squash)  
*****

And client run such test:

*****
iosizes="1024 4096 65536 1048576"
synctypes="sync async"
for size in $iosize ; do
  for syn in $synctypes ; do 
    option='nfsvers=4,udp,rsize=$size,wsize=$size,$syn
    **** run cthon04's "special" test with $option, 
    **** this task always hangs on 'bigfile' test.
  done
done

When the test hangs, the mountpoint could not be accessed (ls /mnt/$server would also hang), and this test will block system's reboot|shutdown.

Version-Release number of selected component (if applicable):
2.6.32-209.el6.x86_64  (rhel6.2-20111019...)

How reproducible:
test case /tests/kernel/filesystems/nfs/cthon-export-mount
 
Actual results:

Expected results:
Cthon should pass.

Additional info:
1. With wireshark, such message are got just before hang (from host and client):
 19.491408  10.16.65.45 -> 10.16.45.34  NFS [RPC retransmission of #268798]V4 COMP Call <EMPTY> PUTFH;WRITE WRITE;GETATTR GETATTR
*******
 19.491673  10.16.45.34 -> 10.16.65.45  NFS V4 COMP Reply (Call In 268795) <EMPTY> PUTFH;WRITE WRITE;GETATTR GETATTR
 19.491703  10.16.45.34 -> 10.16.65.45  NFS V4 COMP Reply (Call In 268798) <EMPTY> PUTFH;WRITE WRITE;GETATTR GETATTR
 19.491728  10.16.45.34 -> 10.16.65.45  NFS [RPC duplicate of #268806]V4 COMP Reply (Call In 268798) <EMPTY> PUTFH;WRITE WRITE;GETATTR GETATTR

2. Call trace from (echo w > /proc/sysrq-trigger) is attached.

Comment 1 Jian Li 2011-10-20 03:39:44 UTC

Created attachment 529178 [details]
call trace

Comment 3 Steve Dickson 2011-10-21 13:25:32 UTC

(In reply to comment #0)
> Description of problem:
> Test is designed to check nfs exportfs's option(such as: sync,no_wdelay,sec=*)
> and mount.nfs's option(a|sync,tcp|udp,r|wsize=*) would cooperate well. cthon
> will run in all combination of these options.
> 
> When server uses such options:  (path=/mnt/exportdir)
> 
> *****
> $path *(rw,sync,no_wdelay,insecure,fsid=0x111111,no_root_squash)  
> *****
> 
> And client run such test:
> 
> *****
> iosizes="1024 4096 65536 1048576"
> synctypes="sync async"
> for size in $iosize ; do
>   for syn in $synctypes ; do 
>     option='nfsvers=4,udp,rsize=$size,wsize=$size,$syn
This is not valid. It is mandated  in the spec that 
v4 has to use TCP... Please Change this to tcp and rerun
the tests.

Comment 4 RHEL Program Management 2011-10-25 05:48:31 UTC

Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 5 Jian Li 2011-11-02 08:23:19 UTC

Ok, thanks for tips.

now I meet problem again, when using sec=krb5p & s390x, details:
[root@ibm-z10-03 basic]# uname -a
Linux ibm-z10-03.rhts.eng.bos.redhat.com 2.6.32-214.el6.s390x #1 SMP Tue Oct 25 20:00:08 EDT 2011 s390x s390x s390x GNU/Linux

SERVER:
/mnt/exportdir *(sec=krb5p,rw,sync,no_wdelay,secure,fsid=0x111111,no_root_squash)

CLIENT:
#cat /proc/mount | grep nfs
ibm-z10-04.rhts.eng.bos.redhat.com:/mnt/exportdir /mnt/ibm-z10-04.rhts.eng.bos.redhat.com nfs rw,sync,relatime,vers=2,rsize=1024,wsize=1024,namlen=255,hard,proto=udp,timeo=11,retrans=3,sec=krb5p,mountaddr=10.16.66.195,mountvers=1,mountport=38260,mountproto=udp,local_lock=none,addr=10.16.66.195 0 0

when running cthon's basic test(test5), always get EIO. strace result:

[root@ibm-z10-03 basic]# export NFSTESTDIR="/mnt/ibm-z10-04.rhts.eng.bos.redhat.com/test"
[root@ibm-z10-03 basic]# ./test5 -t
./test5: read and write
/mnt/ibm-z10-04.rhts.eng.bos.redhat.com/test, bigfile
	./test5: (/mnt/ibm-z10-04.rhts.eng.bos.redhat.com/test) 'bigfile' write failed : Input/output error
[root@ibm-z10-03 basic]# strace ./test5 -t
execve("./test5", ["./test5", "-t"], [/* 36 vars */]) = 11
brk(0)                                  = 0xac62a000
** snip **
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD (Child exited) @ 0 (0) ---
mkdir("/mnt/ibm-z10-04.rhts.eng.bos.redhat.com/test", 0777) = 0
chdir("/mnt/ibm-z10-04.rhts.eng.bos.redhat.com/test") = 0
getcwd("/mnt/ibm-z10-04.rhts.eng.bos.redhat.com/test", 1024) = 45
write(1, "/mnt/ibm-z10-04.rhts.eng.bos.red"..., 54/mnt/ibm-z10-04.rhts.eng.bos.redhat.com/test, bigfile
) = 54
open("bigfile", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
stat("bigfile", {st_mode=S_IFREG|0666, st_size=0, ...}) = 0
write(3, "\0\0\0\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7"..., 8192) = 8192
write(3, "\0\0\0\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7"..., 8192) = 8192
** snip **
write(3, "\0\0\0\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7"..., 8192) = 8192
write(3, "\0\0\0\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7"..., 8192) = 8192
write(3, "\0\0\0\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7"..., 8192) = 8192
write(3, "\0\0\0\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7"..., 8192) = -1 EIO (Input/output error)
getcwd("/mnt/ibm-z10-04.rhts.eng.bos.redhat.com/test", 4096) = 45
write(2, "\t./test5: (/mnt/ibm-z10-04.rhts."..., 57	./test5: (/mnt/ibm-z10-04.rhts.eng.bos.redhat.com/test) ) = 57
write(2, "'bigfile' write failed", 22'bigfile' write failed)  = 22
write(2, " : Input/output error\n", 22 : Input/output error
) = 22
exit_group(1)                           = ?

Comment 6 Jian Li 2011-11-02 09:02:17 UTC

And EIO appears when nfsv2/3/4 * tcp/udp are used.

Comment 7 Jian Li 2011-11-02 11:19:35 UTC

There is EIO again when I running cthon in such situation(krb5i,upd,nfsv3,x86):

SERVER:
/mnt/exportdir *(sec=krb5i,rw,sync,no_wdelay,secure,fsid=0x111111,no_root_squash)
CLIENT:
#mount | grep nfs
hp-z220-03.lab.bos.redhat.com:/mnt/exportdir on /mnt/hp-z220-03.lab.bos.redhat.com type nfs (rw,sync,sec=krb5i,resvport,nfsvers=3,udp,rsize=4096,wsize=4096,addr=10.16.42.225)

TEST: bigfile
#strace ./bigfile test

[root@hp-xw4550-01 hp-z220-03.lab.bos.redhat.com]# strace ./bigfile test
execve("./bigfile", ["./bigfile", "test"], [/* 37 vars */]) = 0
brk(0)                                  = 0x7c5000
** snip **
munmap(0x7fe6b7af3000, 45194)           = 0
open("test", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
brk(0)                                  = 0x7c5000
brk(0x7e8000)                           = 0x7e8000
write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 8192) = 8192
** snip **
write(3, "jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj"..., 8192) = 8192
write(3, "kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk"..., 8192) = -1 EIO (Input/output error)
write(2, "Error: ", 7Error: )                  = 7
write(2, "write to test failed: Input/outp"..., 41write to test failed: Input/output error
) = 41
exit_group(1)                           = ?

When I using tcp, there is no EIO.

Comment 8 J. Bruce Fields 2011-11-02 13:44:31 UTC

If this is only reproduceable over udp, then I think we should close the bug as WONTFIX.  There are known bugs with rpcsec_gss over udp.  Perhaps we should find some way to fail (or at least warn about) that combination at mount time.

Comment 9 Steve Dickson 2011-11-02 14:26:03 UTC

(In reply to comment #8)
> If this is only reproduceable over udp, then I think we should close the bug as
> WONTFIX.  There are known bugs with rpcsec_gss over udp.  Perhaps we should
> find some way to fail (or at least warn about) that combination at mount time.

I agree...

Comment 10 Jian Li 2011-11-03 08:02:22 UTC

Hi, problem also remain when using tcp, but this error couldn't be reproduced 100%(90% maybe)

#strace ./bigfile testfile
** snip **
read(3, "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"..., 8192) = 8192
lseek(3, 31326208, SEEK_SET)            = 31326208
read(3, "cccccccccccccccccccccccccccccccc"..., 8192) = 8192
lseek(3, 31334400, SEEK_SET)            = 31334400
read(3, "dddddddddddddddddddddddddddddddd"..., 8192) = 4096
write(2, "short read (4096) to testfile\n", 30short read (4096) to testfile
) = 30
exit_group(1)                           = ?
[root@ibm-z10-03 test]# uname -a
Linux ibm-z10-03.rhts.eng.bos.redhat.com 2.6.32-214.el6.s390x #1 SMP Tue Oct 25 20:00:08 EDT 2011 s390x s390x s390x GNU/Linux
[root@ibm-z10-03 test]# cat /proc/mounts | grep nfs
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
ibm-z10-04.rhts.eng.bos.redhat.com:/mnt/exportdir/ /mnt/ibm-z10-04.rhts.eng.bos.redhat.com nfs4 rw,relatime,vers=4,rsize=4096,wsize=4096,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=krb5p,clientaddr=10.16.66.194,minorversion=0,local_lock=none,addr=10.16.66.195 0 0

SERVER:
[root@ibm-z10-04 ~]# cat /etc/exports 
/mnt/exportdir *(sec=krb5p,rw,sync,no_wdelay,secure,fsid=0x111111,no_root_squash)
and confused message appear in server:
svc: 10.16.66.194, port=685: failed to decode args  (10.16.66.194: client's ip)
svc: 10.16.66.194, port=685: failed to decode args


When testing, such errors were met:
#strace ./bigfile testfile 
** snip **
lseek(3, 31440896, SEEK_SET)            = 31440896
read(3, "qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq"..., 8192) = 8192
lseek(3, 31449088, SEEK_SET)            = 31449088
read(3, "rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr"..., 8192) = 8192
ftruncate(3, 0)                         = 0
ftruncate(3, 31457280)                  = 0
mmap(NULL, 31457280, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x3fffb62f000
msync(0x3fffb62f000, 31457280, MS_SYNC|MS_INVALIDATE) = -1 EIO (Input/output error)
write(2, "Error: ", 7Error: )                  = 7
write(2, "can't msync testfile: Input/outp"..., 41can't msync testfile: Input/output error
) = 41
exit_group(1)                           = ?

#strace ./bigfile testfile 
** snip **
read(3, "pppppppppppppppppppppppppppppppp"..., 8192) = 8192
lseek(3, 31440896, SEEK_SET)            = 31440896
read(3, "qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq"..., 8192) = 8192
lseek(3, 31449088, SEEK_SET)            = 31449088
read(3, "rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr"..., 8192) = 8192
ftruncate(3, 0)                         = 0
ftruncate(3, 31457280)                  = 0
mmap(NULL, 31457280, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x3fffb683000
--- SIGBUS (Bus error) @ 0 (0) ---
+++ killed by SIGBUS (core dumped) +++
Bus error (core dumped)



(In reply to comment #9)
> (In reply to comment #8)
> > If this is only reproduceable over udp, then I think we should close the bug as
> > WONTFIX.  There are known bugs with rpcsec_gss over udp.  Perhaps we should
> > find some way to fail (or at least warn about) that combination at mount time.
> 
> I agree...

Comment 11 J. Bruce Fields 2011-11-03 13:33:40 UTC

"problem also remain when using tcp"

OK, it's likely a real krb5p bug in that case.

Next might be to see if we can locate more precisely where those errors are originating.  Wireshark won't help thanks to the encryption.

If you turn on rpc debugging on client and server, can you still reproduce the problem, and what do the last lines of output before the failure look like?

Comment 12 Jian Li 2011-11-06 04:12:45 UTC

Hi, I reproduce the case in one machine (both as SERVER and CLIENT), and fetch debug info with rpcdebug command.

rpcdebug -m rpc -s all; 
for ((;;)) ; do
  dmesg -c > /dev/null
  strace ./bigfile testfile
  if [ $? -ne 0 ] ; then
    dmesg -c > ~/rpcdebug_rpc.txt
    break;
  fi
done

rpcdebug -m rpc -c all; 
rpcdebug -m nfs -s all; 
for ((;;)) ; do
**

Comment 13 Jian Li 2011-11-06 04:13:32 UTC

Created attachment 531920 [details]
rpcdebug -m rpc

Comment 14 Jian Li 2011-11-06 04:13:54 UTC

Created attachment 531921 [details]
rpcdebug -m nfs

Comment 15 Jian Li 2011-11-06 04:14:13 UTC

Created attachment 531922 [details]
rpcdebug -m nfsd

Comment 16 Jian Li 2011-11-06 04:17:10 UTC

And pls notice such message:
> svc: 10.16.66.194, port=685: failed to decode args  (10.16.66.194: client's ip)
> svc: 10.16.66.194, port=685: failed to decode args

Comment 17 J. Bruce Fields 2011-11-09 17:44:45 UTC

"And pls notice such message"

I wonder why that message doesn't show up in any of those dmesg's?  Neither does "server saw garbage", which the client should be dprintk'ing in the rpc code if it gets a garbage_args error back from the server.

Comment 18 J. Bruce Fields 2011-11-09 17:55:57 UTC

Created attachment 532625 [details]
print gss_unwrap errors

Here's an attempt to get a little more information about gss_unwrap failures; no need to turn on rpc debugging this time, just run until you get the "failed to decode args" error and see if there's also one of these new errors before it.

Comment 19 J. Bruce Fields 2011-11-09 17:57:23 UTC

By the way, do you have a previous kernel version where you know this error did *not* happen?

Comment 20 Jian Li 2011-11-10 01:09:21 UTC

> I wonder why that message doesn't show up in any of those dmesg's?  Neither
I do many times of test, these messages may be printed sometime.

(In reply to comment #19)
> By the way, do you have a previous kernel version where you know this error did
> *not* happen?
I run the test on a old kernel(forget the exact NVR), the bug also exist.

Testing the patch.

Comment 21 Jian Li 2011-11-14 08:59:33 UTC

Hi Bruce:
I test your patch three times.

1st test:
[root@ibm-z10-03 test]# dmesg -c > /dev/null
[root@ibm-z10-03 test]# ./bigfile testfile
[root@ibm-z10-03 test]# dmesg
gss_unwrap returned d0000
svc: 10.16.66.194, port=779: failed to decode args
gss_unwrap returned d0000
svc: 10.16.66.194, port=779: failed to decode args
gss_unwrap returned d0000
** repeat **
gss_unwrap returned d0000

2nd test:
[root@ibm-z10-03 test]# dmesg -c > /dev/null
[root@ibm-z10-03 test]# ./bigfile testfile
Bus error (core dumped)
[root@ibm-z10-03 test]# dmesg
gss_unwrap returned d0000
svc: 10.16.66.194, port=779: failed to decode args
gss_unwrap returned d0000
svc: 10.16.66.194, port=779: failed to decode args
gss_unwrap returned d0000
svc: 10.16.66.194, port=779: failed to decode args
gss_unwrap returned d0000
svc: 10.16.66.194, port=779: failed to decode args
gss_unwrap returned d0000
** repeat **
gss_unwrap returned d0000

3rd test:
[root@ibm-z10-03 test]# dmesg -c > /dev/null
[root@ibm-z10-03 test]# ./bigfile testfile
short read (4096) to testfile
[root@ibm-z10-03 test]# dmesg -c
gss_unwrap returned d0000
svc: 10.16.66.194, port=779: failed to decode args
gss_unwrap returned d0000
svc: 10.16.66.194, port=779: failed to decode args
gss_unwrap returned d0000
svc: 10.16.66.194, port=779: failed to decode args
gss_unwrap returned d0000
svc: 10.16.66.194, port=779: failed to decode args
gss_unwrap returned d0000
svc: 10.16.66.194, port=779: failed to decode args
gss_unwrap returned d0000
svc: 10.16.66.194, port=779: failed to decode args
gss_unwrap returned d0000
svc: 10.16.66.194, port=779: failed to decode args
gss_unwrap returned d0000
svc: 10.16.66.194, port=779: failed to decode args
gss_unwrap returned d0000
** repeat **
gss_unwrap returned d0000
gss_unwrap returned d0000

Comment 22 J. Bruce Fields 2011-11-14 16:55:19 UTC

Looking at include/linux/sunrpc/gss_err.h, 0xd0000 is GSS_S_FAILURE.

So either crypto_alloc_blkcipher() or make_checksum() (which also does some allocation) failed.

So maybe this is an allocation failure.  I can make another patch to confirm that (might not get to it today).

We really shouldn't have allocations in that code at all; I thought we'd gotten rid of them....  So I'll need to work on fixing that regardless.

Comment 23 J. Bruce Fields 2011-11-15 21:47:38 UTC

Created attachment 533851 [details]
print more gss_unwrap errors

Sorry, those aren't the only two places where it could fail....

Here are some more printk's to try.

Comment 24 Jian Li 2011-11-16 02:29:14 UTC

Ok, testing
(In reply to comment #23)
> Created attachment 533851 [details]

Comment 25 Jian Li 2011-11-16 05:18:39 UTC

[root@ibm-z10-03 test]# ./bigfile testfile 
short read (4096) to testfile
[root@ibm-z10-03 test]# dmesg -c
gss_unwrap_kerberos_v2: error 393216 from decrypt_v2
gss_unwrap returned d0000
svc: 10.16.66.194, port=753: failed to decode args
gss_unwrap_kerberos_v2: error 393216 from decrypt_v2
gss_unwrap returned d0000
svc: 10.16.66.194, port=753: failed to decode args
gss_unwrap_kerberos_v2: error 393216 from decrypt_v2
gss_unwrap returned d0000
svc: 10.16.66.194, port=753: failed to decode args
gss_unwrap_kerberos_v2: error 393216 from decrypt_v2
gss_unwrap returned d0000
svc: 10.16.66.194, port=753: failed to decode args
gss_unwrap_kerberos_v2: error 393216 from decrypt_v2
gss_unwrap returned d0000
** repeat **


[root@ibm-z10-03 test]# ./bigfile testfile 
Error: read from testfile failed: Input/output error
[root@ibm-z10-03 test]# dmesg -c
gss_unwrap_kerberos_v2: error 393216 from decrypt_v2
gss_unwrap returned d0000
svc: 10.16.66.194, port=753: failed to decode args
gss_unwrap_kerberos_v2: error 393216 from decrypt_v2
gss_unwrap returned d0000
svc: 10.16.66.194, port=753: failed to decode args
gss_unwrap_kerberos_v2: error 393216 from decrypt_v2
gss_unwrap returned d0000
svc: 10.16.66.194, port=753: failed to decode args
gss_unwrap_kerberos_v2: error 393216 from decrypt_v2
gss_unwrap returned d0000
svc: 10.16.66.194, port=753: failed to decode args
gss_unwrap_kerberos_v2: error 393216 from decrypt_v2
gss_unwrap returned d0000
** repeat **
(In reply to comment #23)
> Created attachment 533851 [details]
> print more gss_unwrap errors
> 
> Sorry, those aren't the only two places where it could fail....
> 
> Here are some more printk's to try.

Comment 26 J. Bruce Fields 2011-11-22 23:01:25 UTC

OK, so I was wrong, allocation failures aren't the problem.  It's gss_krb5_aes_decrypt returning GSS_S_BAD_SIG, suggesting the memcmp() at the end of gss_krb5_aes_decrypt is failing.  That could be a bug in the encoding or decoding, or perhaps a network that's corruption data somehow.

Aie.  What to try next?

Well, it did attempt to decrypt this, so perhaps looking at the result would tell us something.  So, we could add more printk()'s, to:
  1. Dump the decrypted data.  We can take a look at it and see if it seems like
     a sensible xdr-encoded rpc request, or like random data, or if perhaps it
     starts well and then turns into random data at the end.
  2. Dump the two MAC's that we're memcmp'ing (pkt_hmac and our_hmac).  Probably
     they'll both look random and unrelated, but maybe it's worth looking at
     just in case.

I wonder whether it would also be worth running a tcpdump, both on the client and on the server, and then comparing the two data streams to see if we can catch the network flipping a bit or something.  Seems unlikely.

Comment 27 J. Bruce Fields 2011-11-22 23:11:12 UTC

The client also knows when we hit this failure since it gets an error back.  So on the client side we should be able to dump the data that it fed into the encryption routines, and on the server dump what we decrypted, and see if the results always differ in some predictable way.

Comment 28 J. Bruce Fields 2011-11-23 15:34:31 UTC

Also, I know you said you'd tried one older kernel version, but it might be worth experimenting a little more with kernel versions, to see if there are any where this works (in which case we can bisect to find the regression (if it's an older kernel) or bugfix (if it's a newer one).

It would be worth trying the latest upstream kernel, at least.

Comment 32 Steve Dickson 2012-01-06 15:22:04 UTC

*** Bug 749655 has been marked as a duplicate of this bug. ***

Comment 35 yanfu,wang 2012-05-28 06:57:05 UTC

Pasted my test result on 2.6.32-274.el6 of RHEL6.3, cthon 'Special' test failed on krb5p/krb5i mounts with UDP. Pls check bug 822189 comment #26 - comment #29.

Comment 36 Jian Li 2012-08-10 01:36:51 UTC

cthon fail on rhel5.9, cause machine panic.

failed command: 
cthon04 nfs -s:special nfsvers=2_udp_krb5p intel-s3e37-01.rhts.eng.rdu.redhat.com /mnt/testarea -onfsvers=2,udp,sec=krb5p

dmesg:
kernel BUG at arch/i386/mm/highmem.c:64! 
invalid opcode: 0000 [#1] 
SMP  
last sysfs file: /devices/pci0000:00/0000:00:19.0/irq 
Modules linked in: md5 testmgr_cipher testmgr aead crypto_blkcipher crypto_algapi des rpcsec_gss_krb5 auth_rpcgss nfs nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq mperf be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi ac lp snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd sr_mod cdrom e1000e tpm_tis parport_serial tpm tpm_bios parport_pc parport sg pcspkr soundcore i2c_i801 i2c_core dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd 
CPU:    3 
EIP:    0060:[<c041dc92>]    Not tainted VLI 
EFLAGS: 00010287   (2.6.18-333.el5 #1)  
EIP is at kunmap_atomic+0x38/0x66 
eax: fff63000   ebx: fff62000   ecx: 00000058   edx: 0009c000 
esi: 00000003   edi: f648bb65   ebp: 00000000   esp: f648bb3c 
ds: 007b   es: 007b   ss: 0068 
Process rewind (pid: 21220, ti=f648b000 task=dffac000 task.ti=f648b000) 
Stack: f648bbcc c04db8d2 f648bbcc f648bbb4 f648bb60 f648bb94 c04dbcaf 00000000  
       c06a2300 01000000 c045f100 00000044 00000000 00020220 00000000 c16d3200  
       ffffffff f648bc34 f648bb68 f6990000 00000010 f648bbb4 00000010 c04dbe6e  
Call Trace: 
 [<c04db8d2>] scatterwalk_copychunks+0x49/0x7c 
 [<c04dbcaf>] crypt_slow+0x45/0x85 
 [<c045f100>] __alloc_pages+0x10/0x2cf 
 [<c04dbe6e>] crypt+0x17f/0x1f1 
 [<c04dc06d>] crypt_iv_unaligned+0x9b/0xa4 
 [<c04db9be>] cbc_process_encrypt+0x0/0x85 
 [<c04dc0f1>] cbc_encrypt_iv+0x3b/0x42 
 [<f90c4c36>] des_encrypt+0x0/0x20a [des] 
 [<c04db9be>] cbc_process_encrypt+0x0/0x85 
 [<f90bb980>] encryptor+0xfe/0x165 [auth_rpcgss] 
 [<f90bb73f>] process_xdr_buf+0x133/0x144 [auth_rpcgss] 
 [<f90bbb8a>] gss_encrypt_xdr_buf+0x87/0x92 [auth_rpcgss] 
 [<f90bb882>] encryptor+0x0/0x165 [auth_rpcgss] 
 [<f9046d88>] gss_wrap_kerberos+0x2df/0x30b [rpcsec_gss_krb5] 
 [<f90b97f1>] gss_wrap+0xd/0x10 [auth_rpcgss] 
 [<f90b8f94>] gss_wrap_req+0x29d/0x36d [auth_rpcgss] 
 [<f90b8cf7>] gss_wrap_req+0x0/0x36d [auth_rpcgss] 
 [<f90f9f1b>] nfs_xdr_writeargs+0x0/0x78 [nfs] 
 [<f906f90e>] rpcauth_wrap_req+0x4b/0x60 [sunrpc] 
 [<f90f9f1b>] nfs_xdr_writeargs+0x0/0x78 [nfs] 
 [<f906a35b>] call_transmit+0x171/0x1ca [sunrpc] 
 [<f906f05c>] __rpc_execute+0x80/0x26a [sunrpc] 
 [<c042ee54>] sigprocmask+0xb0/0xce 
 [<f90fdf44>] nfs_execute_write+0x35/0x49 [nfs] 
 [<f90ff309>] nfs_flush_one+0xb8/0x10d [nfs] 
 [<f90ff251>] nfs_flush_one+0x0/0x10d [nfs] 
 [<f90fdbca>] nfs_flush_list+0x6f/0x132 [nfs] 
 [<f90fdcd2>] nfs_flush_inode+0x45/0x53 [nfs] 
 [<f90ff571>] nfs_writepages+0x4c/0x88 [nfs] 
 [<c045fb01>] do_writepages+0x20/0x32 
 [<c045b5eb>] __filemap_fdatawrite_range+0x66/0x72 
 [<c045b7e5>] filemap_fdatawrite+0x12/0x16 
 [<f90f5da1>] nfs_file_flush+0x66/0x78 [nfs] 
 [<c0476860>] filp_close+0x2f/0x54 
 [<c0477ad8>] sys_close+0x71/0xa0 
 [<c0404f4b>] syscall_call+0x7/0xb

Comment 37 J. Bruce Fields 2012-08-10 15:29:21 UTC

We expect krb5/udp to fail sometimes, but we don't expect it to trigger a BUG like this.  Is this reproduceable?

It looks like we passed something inconsistent to kunmap_atomic().

Possibly there's some bug in the crypto code, or perhaps we're setting up the something incorrectly in gss_wrap_req_priv.

Comment 38 Jian Li 2012-08-13 01:11:28 UTC

(In reply to comment #37)
> We expect krb5/udp to fail sometimes, but we don't expect it to trigger a
> BUG like this.  Is this reproduceable?
2 automatic jobs all panic machine.  
> 
> It looks like we passed something inconsistent to kunmap_atomic().
> 
> Possibly there's some bug in the crypto code, or perhaps we're setting up
> the something incorrectly in gss_wrap_req_priv.
Ok, I will test again, try to fetch a vmcore!

Comment 40 Tore H. Larsen 2012-10-31 05:48:23 UTC

cc

Comment 43 J. Bruce Fields 2015-12-21 22:16:47 UTC

There's an odd mixture of bugs reported here, I'm not convinced they're all the same (e.g., comment 36 is on rhel5, and for a crash not an EIO?).

For the reports on s390, it would be worth checking whether this is a dup of 1003528, in which case this should be fixed as of 2.6.32-491.el6 or so according to that bug.

Comment 44 J. Bruce Fields 2016-01-05 15:56:27 UTC

(In reply to J. Bruce Fields from comment #43)
> For the reports on s390, it would be worth checking whether this is a dup of
> 1003528, in which case this should be fixed as of 2.6.32-491.el6 or so
> according to that bug.

Jian Li, could you check this?  I'm thinking for example about the problem you reported in comment 10.

Comment 45 Jan Kurik 2017-12-06 12:38:47 UTC

Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com/

Comment 46 Red Hat Bugzilla 2023-09-14 01:25:55 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days