Bug 207737
Summary: An ISS scan killed NFS servers

| | | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Shing-Shong Shei <shei> |
| Component: | kernel | Assignee: | Jeff Layton <jlayton> |
| Status: | CLOSED DUPLICATE | QA Contact: | Ben Levenson <benl> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.4 | CC: | buckh, e4glez, jwhelland, laurent.deniel, shl1, sputhenp, steved, tjp |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i686 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2010-03-17 12:24:59 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Shing-Shong Shei
2006-09-22 19:46:28 UTC
So what exactly is a security ISS scan, and where can we get one...

Steve, ISS is a commercial security scanning package our IT Security Office uses to scan all machines on campus at Indiana University. I do not know how you can get one, but I can try to find out. Thanks, Bruce

Please do... since it would be good to know if we have the same problem with our RHEL5 product line.... tia....

I have asked our ITSO staff for more information. The alternative is to give me the beta version of RHEL5 so that I can install it here and ask them to scan the machine as they do now. --Bruce

Here is more information about the time when this happened:

1) The machine that crashed, newman, is our mail server, and it NFS-mounted users' home directories using autofs. (This is so that sendmail can access users' home directories for, say, .forward files.)

2) The machine serving users' home directories, frog, had its lockd/statd die due to the ISS scan mentioned above. (We had to do 'service nfs stop; service nfslock restart; service nfs start' later, after we found out what was going on. At that point, newman had already crashed.)

My guess is that newman probably could not lock users' .forward files due to 2). Bruce

FYI, we just encountered exactly the same crash on one of our (CS department at Cornell University) main compute servers. At the time of the crash, it probably had a couple dozen users logged on. The load was not particularly high (< 4 on a dual-Xeon Dell PowerEdge 2650), according to our Bigbro monitor. We use amd to auto-mount user home directories, so at the time probably about half a dozen NFS shares were mounted from four file servers. We are running RHEL 4 U4, kernel 2.6.9-42.0.2.ELsmp. If you need any more info, please let me know. We've been running RHEL 4 on this particular server for more than a year; I believe this is the first time we encountered this crash.
---
Sep 27 16:04:34 lion kernel: ------------[ cut here ]------------
Sep 27 16:04:34 lion kernel: kernel BUG at fs/locks.c:1798!
Sep 27 16:04:34 lion kernel: invalid operand: 0000 [#1]
Sep 27 16:04:34 lion kernel: SMP
Sep 27 16:04:34 lion kernel: Modules linked in: loop nfsd exportfs md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc joydev button battery ac ohci_hcd tg3 floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla6312 qla2xxx scsi_transport_fc aic7xxx sd_mod scsi_mod
Sep 27 16:04:34 lion kernel: CPU: 0
Sep 27 16:04:34 lion kernel: EIP: 0060:[<c016e904>] Not tainted VLI
Sep 27 16:04:34 lion kernel: EFLAGS: 00010246 (2.6.9-42.0.2.ELsmp)
Sep 27 16:04:34 lion kernel: EIP is at locks_remove_flock+0xa1/0xe1
Sep 27 16:04:34 lion kernel: eax: f515424c ebx: d84694a4 ecx: 00000000 edx: 00000001
Sep 27 16:04:34 lion kernel: esi: 00000000 edi: d84693fc ebp: f498b180 esp: e6d94f2c
Sep 27 16:04:34 lion kernel: ds: 007b es: 007b ss: 0068
Sep 27 16:04:34 lion kernel: Process lt-sqlite3 (pid: 1350, threadinfo=e6d94000 task=d117b130)
Sep 27 16:04:34 lion kernel: Stack: f498b180 f8c3843a e6d94f44 f8c38e2a f8cb1fb7 c016e85c 00000000 00000000
Sep 27 16:04:34 lion kernel:        00000000 00ebc056 00000000 f2545a80 00000546 451ad939 00000000 451ad939
Sep 27 16:04:34 lion kernel:        00000000 f498b180 00000201 00000000 00000000 00000246 00000000 f498b180
Sep 27 16:04:34 lion kernel: Call Trace:
Sep 27 16:04:34 lion kernel:  [<f8c3843a>] nlm_put_lockowner+0x11/0x49 [lockd]
Sep 27 16:04:34 lion kernel:  [<f8c38e2a>] nlmclnt_locks_release_private+0xb/0x14 [lockd]
Sep 27 16:04:34 lion kernel:  [<f8cb1fb7>] nfs_lock+0x0/0xc7 [nfs]
Sep 27 16:04:34 lion kernel:  [<c016e85c>] locks_remove_posix+0x130/0x137
Sep 27 16:04:34 lion kernel:  [<c015bbc2>] __fput+0x41/0x100
Sep 27 16:04:34 lion kernel:  [<c015a7f5>] filp_close+0x59/0x5f
Sep 27 16:04:34 lion kernel:  [<c02d47bf>] syscall_call+0x7/0xb
Sep 27 16:04:34 lion kernel: Code: 38 39 68 2c 75 2d 0f b6 50 30 f6 c2 02 74 09 89 d8 e8 9f df ff ff eb 1d f6 c2 20 74 0e ba 02 00 00 00 89 d8 e8 ba ec ff ff eb 0a <0f> 0b 06 07 65 98 2e c0 89 c3 8b 03 eb c4 b8 00 f0 ff ff 21 e0
Sep 27 16:04:34 lion kernel: <0>Fatal exception: panic in 5 seconds
---

Sorry, I forgot to mention: this compute server, in addition to mounting NFS shares, also serves an NFS share (used as short-term storage space) to other compute servers.

I just got another report from a user who has been able to consistently crash
the servers :-) Hopefully this info can help you guys debug:
Here is an interesting point: Basically, he accessed the database file on 2
different NFS servers (web8 and panda, where his home directory resides). He
was able to crash multiple NFS clients (all of which are running RHEL 4 Update
4) by accessing the file on web8, but not the same file in his home directory on
panda. web8 is running RHEL 4 Update 4 whereas panda is still running RHEL 4
Update 3.
It seems to me something is bad with nfs/lockd in nfs-utils-1.0.6-70 in Update
4.....
---
But I don't know why.
I just crashed cfs03 again just now. Here's exactly what I did.
login (via ssh)
cd misc/rsrch/mpqa/fa04 # one of my research directories
sqlite3 ~/misc/web8/cs474/hmm.db
# this is a soft link to /cucs/web/w8/ebreck/cs474/hmm.db
# I run the command-line interface to SQLite, a serverless SQL engine,
# on a database stored on web8 - it's a database of student submissions
# for a homework assignment; I'm TAing for Prof Cardie's CS474 class.
# now within sqlite
> .tables
# this command just lists the tables in the database
# it runs for a long time, so I hit Ctrl-C, and sqlite
# responds with
Error: database is locked
# now I attempt to exit sqlite
.exit
# sqlite is still hung somehow, so I hit Ctrl-C again
# and down goes the machine.
I think it must have something to do with trying to write to the filesystem on web8, because running sqlite3 on the identical file copied to my home directory works fine.
So is anybody looking at this problem now? It's pretty severe for us -- we want to ask our ITSO (IT Security Office) to perform regular security scans on all of our servers, but we don't want our servers to die because of the scans. Currently we have had to disable the security scans because of this bug. Please escalate this call if you can. Much appreciated. Thanks, Bruce Shei

Yes, I am looking at this issue, in conjunction with another bugzilla, 211092. That bugzilla shows some similarities in the stack traces, although the scenario to recreate the situation is very different. It would help to have a testcase which can reproduce the problem. Perhaps it would be possible to simulate the ISS scan by watching the network traffic which causes the problem and then writing a program which generates the same sort of traffic?

Thanks, Peter, for the quick response. I will try to contact our ITSO staff to see how much they can help. Unfortunately, it's kind of out of our control. But it looks like Steven at Cornell has described a reliable way to reproduce it? At least that is what he indicated in his posting. Thanks and have a great day. --Bruce

Any information that you can provide would be appreciated.

Steven, what is the sqlite3 command, and would it be possible to get a copy of hmm.db or some other database which can be used to reproduce the hang?

I'll contact the user to see if he is willing to release his code for testing purposes.

Created attachment 139148 [details]
Procedures to reproduce the crash
Created attachment 139149 [details]
database to reproduce the crash (See the attached procedure)
OK guys, I attached the procedure as well as the database that caused the crash for us. sqlite3 is from http://www.sqlite.org Please note, this database is not exactly the same one that caused the crash, but according to the user it should be similar enough. We'd rather not crash our compute server to prove it though.... I hope this helps your debugging. Thanx!

I'll take a peek using this stuff and see what I can find.

Any news on this? I have the same problem (RHEL4U4):

Dec 4 18:22:19 storm kernel:  [<e0b3c43a>] nlm_put_lockowner+0x11/0x49 [lockd]
Dec 4 18:22:19 storm kernel:  [<e0b3ce2a>] nlmclnt_locks_release_private+0xb/0x14 [lockd]
Dec 4 18:22:19 storm kernel:  [<e0bb5fb7>] nfs_lock+0x0/0xc7 [nfs]
Dec 4 18:22:19 storm kernel:  [<c016e85c>] locks_remove_posix+0x130/0x137
Dec 4 18:22:19 storm kernel:  [<c015bbc2>] __fput+0x41/0x100
Dec 4 18:22:19 storm kernel:  [<c015a7f5>] filp_close+0x59/0x5f
Dec 4 18:22:19 storm kernel:  [<c02d4703>] syscall_call+0x7/0xb

The test case is simply running a gcov-instrumented binary on NFS (tcp, hard), so NFS locks are taken to write the gcov data at program exit...

No news yet. If you can distill the gcov behavior down to a simple testcase, it would be appreciated.

Me too. Crash with EIP at exactly the same instruction. Have opened service req. 1119828 (going on 3 months old). In my circumstance, I believe the crash occurred when the client was holding (or maybe trying to get rid of) locks on a server that was down (or was maybe in the process of coming back up, I'm not sure).

Referring to http://lkml.org/lkml/2005/12/21/334, here is a test case:

    #include <fcntl.h>
    #include <string.h>      /* for memset(); missing in the original */
    #include <sys/stat.h>    /* for fchmod(); missing in the original */
    #include <unistd.h>

    int main()
    {
        int fd;
        struct flock lck;

        fd = open("file_on_nfs", O_RDWR | O_CREAT, 0644);
        memset(&lck, 0, sizeof(lck));
        lck.l_type = F_WRLCK;
        fcntl(fd, F_SETLK, &lck);
        fchmod(fd, 02644);
        close(fd);
        return 0;
    }

Yeah, I saw that Peter did respond on lkml. I don't care to have a quick & dirty patch, but I do care about the kernel crash! Thx.
@laurent.deniel: when I run the reproducer in your comment#21, I get a different backtrace:

kernel BUG at fs/locks.c:1798!
invalid operand: 0000 [#1]
EFLAGS: 00010246 (2.6.9-42.0.10.ELsmp)
EIP is at locks_remove_flock+0xa1/0xe1
Call Trace:
 [<f8d31fb7>] nfs_lock+0x0/0xc7 [nfs]
 [<c016e787>] locks_remove_posix+0x8f/0x137
 [<c015bb0a>] __fput+0x41/0x100
 [<c015a73d>] filp_close+0x59/0x5f
 [<c02d4903>] syscall_call+0x7/0xb

i.e., no NLM client stuff. I think comment#21 is germane to bz#218777.

For the record, speaking of NLM, the only way I could get the reproducer to crash the system was to restart rpc.statd (by condrestart-ing nfslock) first. After a fresh boot, the reproducer wouldn't work; I'd only get

lockd: cannot monitor 172.31.206.130
lockd: failed to monitor 172.31.206.130

where that IP address is the address of the local machine, and the file would be created and chmod-ed without causing a crash. I guess I should be thankful that statd usually doesn't work.

Right, the problem in #21 is more related to bz#218777.

Created attachment 152672 [details]
Test case
README included
Any news since the last test case? Shall I open a service request to speed things up?

Sorry, I haven't had a chance to look at this yet. I will try to take a peek at it and see what I can find. Has this test program been tried with the proposed patch included in bz211092?

Sorry, but I don't have access to 211092...

Just had this same problem on RHEL4U5. Any updates?

Actually, I knew which version of RHEL it was from the version field, so that change wasn't really needed. But no, no updates yet. I have been working on a data corruption issue and a different system crash in the SunRPC code.

The patch referenced in the above discussion went into RHEL4.5. Comment #30 mentions seeing this issue on a RHEL4.5 kernel. I suspect, however, that that is a different problem that simply manifested itself in the same way. We've had a number of different fixes go in for problems that look similar to this one but are different. I'm going to close this bug as a duplicate of the one that added the patch under discussion above.

If anyone is able to reproduce this on more recent kernels, please try to reproduce it on the kernels here:

http://people.redhat.com/vgoyal/rhel4/RPMS.kernel/

...and reopen this bug if you are able to do so.

*** This bug has been marked as a duplicate of bug 218777 ***