Bug 207737

Summary: An ISS scan killed NFS servers
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.4
Hardware: i686
OS: Linux
Severity: high
Priority: medium
Status: CLOSED DUPLICATE

Reporter: Shing-Shong Shei <shei>
Assignee: Jeff Layton <jlayton>
QA Contact: Ben Levenson <benl>
CC: buckh, e4glez, jwhelland, laurent.deniel, shl1, sputhenp, steved, tjp

Doc Type: Bug Fix
Last Closed: 2010-03-17 12:24:59 UTC

Attachments:
- Procedures to reproduce the crash
- database to reproduce the crash (see the attached procedure)
- Test case

Description Shing-Shong Shei 2006-09-22 19:46:28 UTC
Description of problem: Today we asked our
IT Security Office to perform an ISS security scan
on some of our servers running RHEL 4 Update 4,
and it killed several machines with the error
messages attached at the end of this message.

Version-Release number of selected component (if applicable):


How reproducible: Not sure.


Steps to Reproduce:
1.
2.
3.
  
Actual results:
Machines crashed

Expected results:
Do not crash. :-)

Additional info:

Sep 22 13:38:17 newman kernel: lockd: cannot monitor 129.79.246.27
Sep 22 13:38:17 newman kernel: lockd: failed to monitor 129.79.246.27
Sep 22 13:38:56 newman kernel: lockd: server 129.79.247.1 OK
Sep 22 13:40:28 newman kernel: ------------[ cut here ]------------
Sep 22 13:40:28 newman kernel: kernel BUG at fs/locks.c:1798!
Sep 22 13:40:28 newman kernel: invalid operand: 0000 [#1]
Sep 22 13:40:28 newman kernel: SMP
Sep 22 13:40:28 newman kernel: Modules linked in: ipmi_devintf ipmi_si
ipmi_msghandler mptctl mptbase nfs nfsd exportfs lockd nfs_acl md5 ipv6
parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc iptable_filter ip_tables
dm_mirror dm_mod button battery ac uhci_hcd ehci_hcd hw_random e1000 floppy sg
ext3 jbd megaraid_mbox megaraid_mm sd_mod scsi_mod
Sep 22 13:40:28 newman kernel: CPU:    0
Sep 22 13:40:28 newman kernel: EIP:    0060:[<c016e904>]    Not tainted VLI
Sep 22 13:40:28 newman kernel: EFLAGS: 00010246   (2.6.9-42.0.2.ELsmp)
Sep 22 13:40:28 newman kernel: EIP is at locks_remove_flock+0xa1/0xe1
Sep 22 13:40:28 newman kernel: eax: f6434eac   ebx: e74af73c   ecx: 00000000   edx: 00000001
Sep 22 13:40:28 newman kernel: esi: 00000000   edi: e74af694   ebp: f4c6b480   esp: cfaf8f2c
Sep 22 13:40:28 newman kernel: ds: 007b   es: 007b   ss: 0068
Sep 22 13:40:28 newman kernel: Process bogofilter (pid: 29111, threadinfo=cfaf8000 task=f5d10eb0)
Sep 22 13:40:28 newman kernel: Stack: f4c6b480 f920543a cfaf8f44 f9205e2a f9277fb7 c016e85c 00000000 00000000
Sep 22 13:40:28 newman kernel:        00000000 0b9cc0ed 00000000 f6179380 000071b7 45141e72 00000000 45141e72
Sep 22 13:40:28 newman kernel:        00000000 f4c6b480 00000201 00000000 00000000 00000246 00000000 f4c6b480
Sep 22 13:40:28 newman kernel: Call Trace:
Sep 22 13:40:28 newman kernel:  [<f920543a>] nlm_put_lockowner+0x11/0x49 [lockd]
Sep 22 13:40:28 newman kernel:  [<f9205e2a>] nlmclnt_locks_release_private+0xb/0x14 [lockd]
Sep 22 13:40:28 newman kernel:  [<f9277fb7>] nfs_lock+0x0/0xc7 [nfs]
Sep 22 13:40:28 newman kernel:  [<c016e85c>] locks_remove_posix+0x130/0x137
Sep 22 13:40:28 newman kernel:  [<c015bbc2>] __fput+0x41/0x100
Sep 22 13:40:28 newman kernel:  [<c015a7f5>] filp_close+0x59/0x5f
Sep 22 13:40:28 newman kernel:  [<c02d47bf>] syscall_call+0x7/0xb
Sep 22 13:40:28 newman kernel: Code: 38 39 68 2c 75 2d 0f b6 50 30 f6 c2 02 74 09 89 d8 e8 9f df ff ff eb 1d f6 c2 20 74 0e ba 02 00 00 00 89 d8 e8 ba ec ff ff eb 0a <0f> 0b 06 07 65 98 2e c0 89 c3 8b 03 eb c4 b8 00 f0 ff ff 21 e0
Sep 22 13:40:28 newman kernel:  <0>Fatal exception: panic in 5 seconds

Comment 1 Steve Dickson 2006-09-23 00:25:41 UTC
So what exactly is a security ISS scan, and where can we get one...

Comment 2 Shing-Shong Shei 2006-09-25 12:30:40 UTC
Steve,

ISS is a commercial security scanning package that our IT Security Office
uses to scan all machines on campus at Indiana University.  I do
not know how you can get one, but I can try to find out.

Thanks,
Bruce

Comment 3 Steve Dickson 2006-09-25 15:38:24 UTC
Please do... since it would be good to know if we have the
same problem with our RHEL5 product line....

tia.... 

Comment 4 Shing-Shong Shei 2006-09-25 15:52:04 UTC
I have asked our ITSO staff for more information.  The alternative
is to give me the beta version of RHEL5 so that I can install it
here and ask them to scan the machine as they do now.  --Bruce


Comment 5 Shing-Shong Shei 2006-09-25 17:45:52 UTC
Here is more information about the state of things when this happened:

1) The machine that crashed, newman, is our mail server,
   and it NFS-mounts users' home directories using autofs.
   (This is so that sendmail can access users' home directories
   for, say, .forward files.)
2) The machine serving users' home directories, frog, had its
   lockd/statd die due to the ISS scan mentioned above.  (We
   had to run 'service nfs stop; service nfslock restart; service
   nfs start' later, after we found out what was going on.  At
   that point, newman had already crashed.)

My guess is that newman could not lock users' .forward files
because of 2).  (A sketch of the lock request involved follows
below.)

Bruce
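
For reference, here is a minimal sketch of the kind of lock request
involved: an fcntl() read lock on a user's .forward file over NFS, which
the client kernel turns into an NLM call to the server's lockd.  The path
and error handling are illustrative assumptions, not sendmail's actual code:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    struct flock fl;
    /* hypothetical path; any file in an NFS-mounted home dir works */
    int fd = open("/home/user/.forward", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    memset(&fl, 0, sizeof(fl));
    fl.l_type   = F_RDLCK;    /* shared read lock */
    fl.l_whence = SEEK_SET;   /* l_start/l_len of 0 = whole file */

    /* On NFS this becomes an NLM LOCK request to the server's lockd;
     * with lockd/statd on the server dead, it blocks or fails. */
    if (fcntl(fd, F_SETLKW, &fl) < 0)
        perror("fcntl(F_SETLKW)");

    /* close() tears the lock state down; in the oopses above, that
     * cleanup path (locks_remove_posix -> nlm_put_lockowner) is
     * where the BUG at fs/locks.c:1798 fires. */
    close(fd);
    return 0;
}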


Comment 6 Steven Lee 2006-09-28 13:21:18 UTC
FYI, we just encountered exactly the same crash on one of our (CS department
at Cornell University) main compute servers.  At the time of the crash, it
probably had a couple dozen users logged on.  The load was not particularly
high (< 4 on a dual-Xeon Dell PowerEdge 2650), according to our Bigbro
monitor.  We use amd to auto-mount user home directories, so at the time
probably about half a dozen NFS shares were mounted from four file servers.
We are running RHEL 4 U4 kernel 2.6.9-42.0.2.ELsmp.

If you need any more info, please let me know.  We've been running RHEL 4 on
this particular server for more than a year.  I believe this is the first
time we have encountered this crash.

---
Sep 27 16:04:34 lion kernel: ------------[ cut here ]------------
Sep 27 16:04:34 lion kernel: kernel BUG at fs/locks.c:1798!
Sep 27 16:04:34 lion kernel: invalid operand: 0000 [#1]
Sep 27 16:04:34 lion kernel: SMP
Sep 27 16:04:34 lion kernel: Modules linked in: loop nfsd exportfs md5 ipv6
parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc joydev
button battery ac ohci_hcd tg3 floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd
dm_mod qla6312 qla2xxx scsi_transport_fc aic7xxx sd_mod scsi_mod
Sep 27 16:04:34 lion kernel: CPU:    0
Sep 27 16:04:34 lion kernel: EIP:    0060:[<c016e904>]    Not tainted VLI
Sep 27 16:04:34 lion kernel: EFLAGS: 00010246   (2.6.9-42.0.2.ELsmp)
Sep 27 16:04:34 lion kernel: EIP is at locks_remove_flock+0xa1/0xe1
Sep 27 16:04:34 lion kernel: eax: f515424c   ebx: d84694a4   ecx: 00000000   edx: 00000001
Sep 27 16:04:34 lion kernel: esi: 00000000   edi: d84693fc   ebp: f498b180   esp: e6d94f2c
Sep 27 16:04:34 lion kernel: ds: 007b   es: 007b   ss: 0068
Sep 27 16:04:34 lion kernel: Process lt-sqlite3 (pid: 1350, threadinfo=e6d94000 task=d117b130)
Sep 27 16:04:34 lion kernel: Stack: f498b180 f8c3843a e6d94f44 f8c38e2a f8cb1fb7 c016e85c 00000000 00000000
Sep 27 16:04:34 lion kernel:        00000000 00ebc056 00000000 f2545a80 00000546 451ad939 00000000 451ad939
Sep 27 16:04:34 lion kernel:        00000000 f498b180 00000201 00000000 00000000 00000246 00000000 f498b180
Sep 27 16:04:34 lion kernel: Call Trace:
Sep 27 16:04:34 lion kernel:  [<f8c3843a>] nlm_put_lockowner+0x11/0x49 [lockd]
Sep 27 16:04:34 lion kernel:  [<f8c38e2a>] nlmclnt_locks_release_private+0xb/0x14 [lockd]
Sep 27 16:04:34 lion kernel:  [<f8cb1fb7>] nfs_lock+0x0/0xc7 [nfs]
Sep 27 16:04:34 lion kernel:  [<c016e85c>] locks_remove_posix+0x130/0x137
Sep 27 16:04:34 lion kernel:  [<c015bbc2>] __fput+0x41/0x100
Sep 27 16:04:34 lion kernel:  [<c015a7f5>] filp_close+0x59/0x5f
Sep 27 16:04:34 lion kernel:  [<c02d47bf>] syscall_call+0x7/0xb
Sep 27 16:04:34 lion kernel: Code: 38 39 68 2c 75 2d 0f b6 50 30 f6 c2 02 74 09 89 d8 e8 9f df ff ff eb 1d f6 c2 20 74 0e ba 02 00 00 00 89 d8 e8 ba ec ff ff eb 0a <0f> 0b 06 07 65 98 2e c0 89 c3 8b 03 eb c4 b8 00 f0 ff ff 21 e0
Sep 27 16:04:34 lion kernel:  <0>Fatal exception: panic in 5 seconds
---

Comment 7 Steven Lee 2006-09-28 13:32:46 UTC
Sorry, I forgot to mention: this compute server, in addition to mounting NFS
shares, also serves an NFS share (used as short-term storage space) to other
compute servers.

Comment 8 Steven Lee 2006-09-28 19:07:08 UTC
I just got another report from a user who has been able to consistently crash
the servers :-)  Hopefully this info can help you guys debug.

Here is an interesting point: basically, he accessed the database file on 2
different NFS servers (web8 and panda, where his home directory resides).  He
was able to crash multiple NFS clients (all of which are running RHEL 4 Update
4) by accessing the file on web8, but not the same file in his home directory
on panda.  web8 is running RHEL 4 Update 4 whereas panda is still running
RHEL 4 Update 3.

It seems to me something is wrong with nfs/lockd in nfs-utils-1.0.6-70 in
Update 4...


---
But I don't know why.

I just crashed cfs03 again just now.  Here's exactly what I did.

login (via ssh)

cd misc/rsrch/mpqa/fa04   # one of my research directories

sqlite3 ~/misc/web8/cs474/hmm.db

# this is a soft-link to /cucs/web/w8/ebreck/cs474/hmm.db
# I run the command-line interface to SQLite, a serverless SQL engine,
# on a database stored on web8 - it's a database of student submissions
# for a homework assignment; I'm TAing for Prof Cardie's CS474 class.

# now within sqlite

> .tables

# this command just lists the tables in the database
# it runs for a long time, so I hit Ctrl-C, and sqlite
# responds with

Error: database is locked

# now I attempt to exit sqlite

.exit

# sqlite is still hung somehow, so I hit Ctrl-C again
# and down goes the machine.

I think it must have something to do with trying to write to the filesystem
on web8, because running sqlite3 on the identical file copied to my home
directory works fine.
---
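
Under the hood, sqlite3 is doing POSIX fcntl() locking on the database
file, and on an NFS mount those locks go through lockd/NLM.  Here is a
minimal sketch that mimics the same sequence -- block on a contended lock,
get interrupted (the user's Ctrl-C), then close the file.  The file name
and timings are illustrative assumptions:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

static void on_alarm(int sig) { (void)sig; /* just interrupt fcntl() */ }

int main(void)
{
    struct flock fl;
    pid_t child;
    int fd = open("hmm.db", O_RDWR | O_CREAT, 0644); /* on the NFS mount */
    if (fd < 0) { perror("open"); return 1; }

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;          /* whole-file write lock */

    child = fork();
    if (child == 0) {             /* child plays the other lock holder */
        fcntl(fd, F_SETLKW, &fl); /* POSIX locks are per-process */
        sleep(10);                /* hold the lock for a while */
        _exit(0);
    }

    sleep(1);                     /* let the child take the lock first */

    signal(SIGALRM, on_alarm);    /* F_SETLKW returns EINTR when a
                                     caught signal interrupts it */
    alarm(2);                     /* stand-in for the user's Ctrl-C */
    if (fcntl(fd, F_SETLKW, &fl) < 0)
        perror("fcntl(F_SETLKW)"); /* expect EINTR */

    /* close() after the interrupted NLM lock attempt walks the same
     * cleanup path shown in the oops (locks_remove_posix and friends). */
    close(fd);
    waitpid(child, NULL, 0);
    return 0;
}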

Comment 9 Shing-Shong Shei 2006-10-19 13:04:17 UTC
So, is anybody looking at this problem now?  It's pretty
severe for us -- we want to ask our ITSO (IT Security
Office) to perform security scans on all of our servers
regularly, but we don't want our servers to go down because
of the scans.  Currently we have to disable the
security scans because of this bug.  Please escalate this
call if you can.  Much appreciated.

Thanks,
Bruce Shei

Comment 10 Peter Staubach 2006-10-19 14:46:13 UTC
Yes, I am looking at this issue.  I am looking at it in conjunction
with another bugzilla, 211092.  This other bugzilla shows some
similarities in the stack traces, although the scenario to recreate
the situation is very different.

It would help to have a testcase which can reproduce the problem.
Perhaps it would be possible to simulate the ISS scan by watching
the network traffic which causes the problem and then writing a
program which generates the same sort of traffic?

Comment 11 Shing-Shong Shei 2006-10-19 14:55:52 UTC
Thanks, Peter, for the quick response.  I will try to contact our
ITSO staff to see how much they can help.  Unfortunately, it's
kind of out of our control.  But it looks like Steven at Cornell
has described a reliable way to reproduce it?  At least, this is
what he indicated in his posting.  Thanks and have a great day.
--Bruce

Comment 12 Peter Staubach 2006-10-19 15:08:22 UTC
Any information that you can provide would be appreciated.

Steven, what is the sqlite3 command, and would it be possible to get
a copy of hmm.db or some other database which can be used to reproduce
the hang?

Comment 13 Steven Lee 2006-10-19 15:58:53 UTC
I'll contact the user to see if he is willing to release his code for testing
purposes.

Comment 14 Steven Lee 2006-10-23 17:24:04 UTC
Created attachment 139148 [details]
Procedures to reproduce the crash

Comment 15 Steven Lee 2006-10-23 17:25:11 UTC
Created attachment 139149 [details]
database to reproduce the crash (See the attached procedure)

Comment 16 Steven Lee 2006-10-23 17:30:09 UTC
OK guys, I attached the procedure as well as the database that caused the crash
for us.  sqlite3 is from http://www.sqlite.org

Please note, this database is not exactly the same one that caused the
crash.  But according to the user, it should be similar enough.  We'd
rather not crash our compute server to prove it, though...

I hope this helps your debugging.

Comment 17 Peter Staubach 2006-10-23 18:00:12 UTC
Thanx!  I'll take a peek using this stuff and see what I can find.

Comment 18 Laurent Deniel 2006-12-04 17:36:47 UTC
Any news on this?  I have the same problem (RHEL4U4):

Dec  4 18:22:19 storm kernel:  [<e0b3c43a>] nlm_put_lockowner+0x11/0x49 [lockd]
Dec  4 18:22:19 storm kernel:  [<e0b3ce2a>] nlmclnt_locks_release_private+0xb/0x14 [lockd]
Dec  4 18:22:19 storm kernel:  [<e0bb5fb7>] nfs_lock+0x0/0xc7 [nfs]
Dec  4 18:22:19 storm kernel:  [<c016e85c>] locks_remove_posix+0x130/0x137
Dec  4 18:22:19 storm kernel:  [<c015bbc2>] __fput+0x41/0x100
Dec  4 18:22:19 storm kernel:  [<c015a7f5>] filp_close+0x59/0x5f
Dec  4 18:22:19 storm kernel:  [<c02d4703>] syscall_call+0x7/0xb

The test case is simply running a gcov-instrumented binary on NFS (tcp,
hard): NFS locking is used when the gcov data is written at program
exit... (a sketch of this follows below)
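
For context, a gcov-instrumented build writes its .gcda counter files at
exit(), and libgcov (at least in gcc versions of this era) takes an fcntl()
write lock on each .gcda file before updating it, so that concurrent runs
don't corrupt the counters; over NFS that lock goes through lockd/NLM.  A
trivial instrumented program should therefore be enough, assuming the
working directory is on the NFS mount:

/* cover.c -- build and run from a directory on the NFS mount:
 *    gcc -fprofile-arcs -ftest-coverage -o cover cover.c
 *    ./cover
 * At exit, libgcov writes cover.gcda, taking an fcntl() write lock
 * on it first; on NFS that becomes an NLM lock request. */
#include <stdio.h>

int main(void)
{
    printf("hello, gcov\n");
    return 0;   /* .gcda is written (and locked) during process exit */
}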


Comment 19 Peter Staubach 2006-12-04 18:04:22 UTC
No news yet.

If you can distill the gcov behavior down to a simple testcase, it would
be appreciated.

Comment 20 Buck Huppmann 2007-03-02 12:45:01 UTC
me, too. crash with EIP at exactly the same instruction

have opened service req. 1119828 (going on 3 months old)

in my circumstance, i believe the crash occurred when the client was holding
(or maybe trying to get rid of) locks on a server that was down (or was maybe
in the process of coming back up, i'm not sure)

Comment 21 Laurent Deniel 2007-04-10 11:26:13 UTC
Referring to http://lkml.org/lkml/2005/12/21/334

Test case:

#include <unistd.h>
#include <fcntl.h>
#include <string.h>    /* memset */
#include <sys/stat.h>  /* fchmod */

int main(void)
{
   int          fd;
   struct flock lck;

   fd = open("file_on_nfs", O_RDWR | O_CREAT, 0644);

   /* take a whole-file write lock (an NLM lock on NFS) */
   memset(&lck, 0, sizeof(lck));
   lck.l_type = F_WRLCK;
   fcntl(fd, F_SETLK, &lck);

   /* setgid without group-exec marks the file for mandatory
    * locking, changing how the lock is torn down at close() */
   fchmod(fd, 02644);

   close(fd);
   return 0;
}
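
Presumably this is compiled with a plain gcc and run from a directory on
an NFS mount.  The interesting part is the fchmod() to 02644: setgid
without group execute is the classic mandatory-locking marker, so the mode
change happens while an NLM lock is still outstanding, which appears to be
the combination the lkml thread above discusses.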

Comment 22 Laurent Deniel 2007-04-10 11:33:59 UTC
Yeah, I saw that Peter did respond on lkml.  I don't care about a
quick & dirty patch, but I do care about the kernel crash!

Thx.

Comment 23 Buck Huppmann 2007-04-11 13:17:16 UTC
@laurent.deniel:

when i run the reproducer in your comment#21, i get a different backtrace:

kernel BUG at fs/locks.c:1798!
invalid operand: 0000 [#1]

EFLAGS: 00010246   (2.6.9-42.0.10.ELsmp) 
EIP is at locks_remove_flock+0xa1/0xe1

Call Trace:
 [<f8d31fb7>] nfs_lock+0x0/0xc7 [nfs]
 [<c016e787>] locks_remove_posix+0x8f/0x137
 [<c015bb0a>] __fput+0x41/0x100
 [<c015a73d>] filp_close+0x59/0x5f
 [<c02d4903>] syscall_call+0x7/0xb

i.e., no NLM client stuff. i think comment#21 is germane to bz#218777

for the record, speaking of NLM, the only way i could get the reproducer to
crash the system was to restart rpc.statd (by condrestart-ing nfslock) first.
after a fresh boot, the reproducer wouldn't work; i'd only get

lockd: cannot monitor 172.31.206.130
lockd: failed to monitor 172.31.206.130

where that IP address is the address of the local machine, and the file
would be created and chmod-ed without causing a crash. i guess i should be
thankful that statd usually doesn't work

Comment 24 Laurent Deniel 2007-04-11 17:45:59 UTC
Right, the problem in #21 is more related to bz#218777.
Comment 25 Laurent Deniel 2007-04-16 08:29:49 UTC
Created attachment 152672 [details]
Test case

README included

Comment 26 Laurent Deniel 2007-05-30 19:18:23 UTC
Any news since the last test case?
Shall I open a service request to speed things up?

Comment 27 Peter Staubach 2007-05-30 19:29:05 UTC
Sorry, I haven't had a chance to look at this yet.  I will try to take
a peek at it and see what I can find.

Comment 28 Peter Staubach 2007-05-30 19:34:25 UTC
Has this test program been tried with the proposed patch included in bz211092?

Comment 29 Laurent Deniel 2007-05-30 19:57:37 UTC
Sorry but I don't have access to 211092 ...

Comment 30 Jason Helland 2008-05-27 17:36:18 UTC
Just had this same problem on RHEL4U5.  Any updates?

Comment 31 Peter Staubach 2008-05-27 17:44:34 UTC
Actually, I knew which version of RHEL it was from the version
field, so that change wasn't really needed.

But no, no updates yet.  I have been working on a data corruption issue
and a different system crash in the Sun RPC code.

Comment 33 Jeff Layton 2010-03-17 12:24:59 UTC
The patch referenced in the above discussion went into RHEL4.5.  Comment #30 mentions seeing this issue on a RHEL4.5 kernel; I suspect, however, that it is a different problem that simply manifested itself in the same way.  We've had a number of different fixes go in for problems that look similar to this one but are different.

I'm going to close this bug as a duplicate of the one that added the patch under discussion above. If anyone is able to reproduce this on more recent kernels, please try to reproduce it on the kernels here:

    http://people.redhat.com/vgoyal/rhel4/RPMS.kernel/

...and reopen this bug if you are able to do so.

*** This bug has been marked as a duplicate of bug 218777 ***