Description of problem:
Today we asked our IT Security Office to perform an ISS security scan against some of our servers running RHEL 4, Update 4, and it killed several machines with the error messages attached at the end of this message.

Version-Release number of selected component (if applicable):

How reproducible:
Not sure.

Steps to Reproduce:
1.
2.
3.

Actual results:
Machines crashed

Expected results:
Do not crash. :-)

Additional info:
Sep 22 13:38:17 newman kernel: lockd: cannot monitor 129.79.246.27
Sep 22 13:38:17 newman kernel: lockd: failed to monitor 129.79.246.27
Sep 22 13:38:56 newman kernel: lockd: server 129.79.247.1 OK
Sep 22 13:40:28 newman kernel: ------------[ cut here ]------------
Sep 22 13:40:28 newman kernel: kernel BUG at fs/locks.c:1798!
Sep 22 13:40:28 newman kernel: invalid operand: 0000 [#1]
Sep 22 13:40:28 newman kernel: SMP
Sep 22 13:40:28 newman kernel: Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler mptctl mptbase nfs nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc iptable_filter ip_tables dm_mirror dm_mod button battery ac uhci_hcd ehci_hcd hw_random e1000 floppy sg ext3 jbd megaraid_mbox megaraid_mm sd_mod scsi_mod
Sep 22 13:40:28 newman kernel: CPU: 0
Sep 22 13:40:28 newman kernel: EIP: 0060:[<c016e904>] Not tainted VLI
Sep 22 13:40:28 newman kernel: EFLAGS: 00010246 (2.6.9-42.0.2.ELsmp)
Sep 22 13:40:28 newman kernel: EIP is at locks_remove_flock+0xa1/0xe1
Sep 22 13:40:28 newman kernel: eax: f6434eac ebx: e74af73c ecx: 00000000 edx: 00000001
Sep 22 13:40:28 newman kernel: esi: 00000000 edi: e74af694 ebp: f4c6b480 esp: cfaf8f2c
Sep 22 13:40:28 newman kernel: ds: 007b es: 007b ss: 0068
Sep 22 13:40:28 newman kernel: Process bogofilter (pid: 29111, threadinfo=cfaf8000 task=f5d10eb0)
Sep 22 13:40:28 newman kernel: Stack: f4c6b480 f920543a cfaf8f44 f9205e2a f9277fb7 c016e85c 00000000 00000000
Sep 22 13:40:28 newman kernel: 00000000 0b9cc0ed 00000000 f6179380 000071b7 45141e72 00000000 45141e72
Sep 22 13:40:28 newman kernel: 00000000 f4c6b480 00000201 00000000 00000000 00000246 00000000 f4c6b480
Sep 22 13:40:28 newman kernel: Call Trace:
Sep 22 13:40:28 newman kernel: [<f920543a>] nlm_put_lockowner+0x11/0x49 [lockd]
Sep 22 13:40:28 newman kernel: [<f9205e2a>] nlmclnt_locks_release_private+0xb/0x14 [lockd]
Sep 22 13:40:28 newman kernel: [<f9277fb7>] nfs_lock+0x0/0xc7 [nfs]
Sep 22 13:40:28 newman kernel: [<c016e85c>] locks_remove_posix+0x130/0x137
Sep 22 13:40:28 newman kernel: [<c015bbc2>] __fput+0x41/0x100
Sep 22 13:40:28 newman kernel: [<c015a7f5>] filp_close+0x59/0x5f
Sep 22 13:40:28 newman kernel: [<c02d47bf>] syscall_call+0x7/0xb
Sep 22 13:40:28 newman kernel: Code: 38 39 68 2c 75 2d 0f b6 50 30 f6 c2 02 74 09 89 d8 e8 9f df ff ff eb 1d f6 c2 20 74 0e ba 02 00 00 00 89 d8 e8 ba ec ff ff eb 0a <0f> 0b 06 07 65 98 2e c0 89 c3 8b 03 eb c4 b8 00 f0 ff ff 21 e0
Sep 22 13:40:28 newman kernel: <0>Fatal exception: panic in 5 seconds
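For anyone reading the panic: the BUG() fires in locks_remove_flock() when, at close time, a lock is still attached to the inode that the flock/lease cleanup path does not know how to dispose of; here it appears to be an NLM/POSIX lock that locks_remove_posix() (visible one frame up) failed to remove. Roughly, paraphrased from memory of the 2.6.9-era upstream code (not a verbatim copy of the RHEL4 source, which may differ in detail):

/* Paraphrased sketch of the 2.6.9-era cleanup that hits the BUG above. */
void locks_remove_flock(struct file *filp)
{
        struct inode *inode = filp->f_dentry->d_inode;
        struct file_lock *fl;
        struct file_lock **before;

        lock_kernel();
        before = &inode->i_flock;
        while ((fl = *before) != NULL) {
                if (fl->fl_file == filp) {
                        if (IS_FLOCK(fl)) {             /* BSD flock() lock: just drop it */
                                locks_delete_lock(before);
                                continue;
                        }
                        if (IS_LEASE(fl)) {             /* lease: revoke it */
                                lease_modify(before, F_UNLCK);
                                continue;
                        }
                        /* anything else, e.g. a leftover POSIX/NLM lock that
                         * locks_remove_posix() should already have cleaned up */
                        BUG();                          /* fs/locks.c:1798 */
                }
                before = &fl->fl_next;
        }
        unlock_kernel();
}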
So what exactly is an ISS security scan, and where can we get one...
Steve, ISS is a commercial security scan package our IT Security Office uses to scan all machines on campus at Indiana University. I do not know how you can get one but I can try to find out. Thanks, Bruce
Please do... since it would be good to know whether we have the same problem with our RHEL5 product line.... tia....
I have asked our ITSO staff for more information. The alternative is to give me the beta version of RHEL5 so that I can install it here and ask them to scan the machine as they do now. --Bruce
Here is more information about what was happening at the time: 1) The machine that crashed, newman, is our mail server, and it NFS-mounts users' home directories using autofs. (This is so that sendmail can access users' home directories for, say, .forward files.) 2) The machine serving the users' home directories, frog, had its lockd/statd die due to the ISS scan mentioned above. (We had to do 'service nfs stop; service nfslock restart; service nfs start' later, after we found out what was going on. At that point, newman had already crashed.) My guess is that newman probably could not lock users' .forward files because of 2). Bruce
FYI, we just encountered exactly the same crash on one of our (CS department at Cornell University) main compute servers. At the time of the crash, it probably had a couple dozen users logged on. The load was not particularly high (< 4 on a dual Xeon Dell PowerEdge 2650), according to our Bigbro monitor. We use amd to auto-mount user home directories, so at the time probably about half a dozen NFS shares were mounted from four file servers. We are running the RHEL 4 U4 kernel, 2.6.9-42.0.2.ELsmp. If you need any more info, please let me know. We've been running RHEL 4 on this particular server for more than a year; I believe this is the first time we have encountered this crash.
---
Sep 27 16:04:34 lion kernel: ------------[ cut here ]------------
Sep 27 16:04:34 lion kernel: kernel BUG at fs/locks.c:1798!
Sep 27 16:04:34 lion kernel: invalid operand: 0000 [#1]
Sep 27 16:04:34 lion kernel: SMP
Sep 27 16:04:34 lion kernel: Modules linked in: loop nfsd exportfs md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc joydev button battery ac ohci_hcd tg3 floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla6312 qla2xxx scsi_transport_fc aic7xxx sd_mod scsi_mod
Sep 27 16:04:34 lion kernel: CPU: 0
Sep 27 16:04:34 lion kernel: EIP: 0060:[<c016e904>] Not tainted VLI
Sep 27 16:04:34 lion kernel: EFLAGS: 00010246 (2.6.9-42.0.2.ELsmp)
Sep 27 16:04:34 lion kernel: EIP is at locks_remove_flock+0xa1/0xe1
Sep 27 16:04:34 lion kernel: eax: f515424c ebx: d84694a4 ecx: 00000000 edx: 00000001
Sep 27 16:04:34 lion kernel: esi: 00000000 edi: d84693fc ebp: f498b180 esp: e6d94f2c
Sep 27 16:04:34 lion kernel: ds: 007b es: 007b ss: 0068
Sep 27 16:04:34 lion kernel: Process lt-sqlite3 (pid: 1350, threadinfo=e6d94000 task=d117b130)
Sep 27 16:04:34 lion kernel: Stack: f498b180 f8c3843a e6d94f44 f8c38e2a f8cb1fb7 c016e85c 00000000 00000000
Sep 27 16:04:34 lion kernel: 00000000 00ebc056 00000000 f2545a80 00000546 451ad939 00000000 451ad939
Sep 27 16:04:34 lion kernel: 00000000 f498b180 00000201 00000000 00000000 00000246 00000000 f498b180
Sep 27 16:04:34 lion kernel: Call Trace:
Sep 27 16:04:34 lion kernel: [<f8c3843a>] nlm_put_lockowner+0x11/0x49 [lockd]
Sep 27 16:04:34 lion kernel: [<f8c38e2a>] nlmclnt_locks_release_private+0xb/0x14 [lockd]
Sep 27 16:04:34 lion kernel: [<f8cb1fb7>] nfs_lock+0x0/0xc7 [nfs]
Sep 27 16:04:34 lion kernel: [<c016e85c>] locks_remove_posix+0x130/0x137
Sep 27 16:04:34 lion kernel: [<c015bbc2>] __fput+0x41/0x100
Sep 27 16:04:34 lion kernel: [<c015a7f5>] filp_close+0x59/0x5f
Sep 27 16:04:34 lion kernel: [<c02d47bf>] syscall_call+0x7/0xb
Sep 27 16:04:34 lion kernel: Code: 38 39 68 2c 75 2d 0f b6 50 30 f6 c2 02 74 09 89 d8 e8 9f df ff ff eb 1d f6 c2 20 74 0e ba 02 00 00 00 89 d8 e8 ba ec ff ff eb 0a <0f> 0b 06 07 65 98 2e c0 89 c3 8b 03 eb c4 b8 00 f0 ff ff 21 e0
Sep 27 16:04:34 lion kernel: <0>Fatal exception: panic in 5 seconds
---
Sorry, I forgot to mention: this compute server, in addition to mounting NFS shares, also serves an NFS share (used as short-term storage space) to other compute servers.
I just got another report from a user who has been able to consistently crash the servers :-) Hopefully this info can help you guys debug. Here is an interesting point: basically, he accessed the database file on two different NFS servers (web8 and panda, where his home directory resides). He was able to crash multiple NFS clients (all of which are running RHEL 4 Update 4) by accessing the file on web8, but not the same file in his home directory on panda. web8 is running RHEL 4 Update 4 whereas panda is still running RHEL 4 Update 3. It seems to me something is bad with nfs/lockd in nfs-utils-1.0.6-70 in Update 4.....
---
But I don't know why. I just crashed cfs03 again just now. Here's exactly what I did.

login (via ssh)
cd misc/rsrch/mpqa/fa04    # one of my research directories
sqlite3 ~/misc/web8/cs474/hmm.db
# this is a soft-link to /cucs/web/w8/ebreck/cs474/hmm.db
# I run the command-line interface to SQLite, a serverless SQL engine,
# on a database stored on web8 - it's a database of student submissions
# for a homework assignment; I'm TAing for Prof Cardie's CS474 class.

# now within sqlite
> .tables
# this command just lists the tables in the database
# it runs for a long time, so I hit Ctrl-C, and sqlite
# responds with Error: database is locked

# now I attempt to exit sqlite
.exit
# sqlite is still hung somehow, so I hit Ctrl-C again
# and down goes the machine.

I think it must have something to do with trying to write to the filesystem on web8, because running sqlite3 on the identical file copied to my home directory works fine.
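As background on why sqlite3 shows up in the crash at all: SQLite's unix backend locks its database files with POSIX fcntl() byte-range locks, and on an NFS mount those requests are forwarded to the server through lockd/NLM, which is the code path in the panic above. Below is a minimal sketch of that locking pattern, with a hypothetical path standing in for the real database file; it is an illustration of the primitive involved, not SQLite's actual code.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        struct flock lck;
        /* hypothetical path; point it at a file on the NFS mount in question */
        int fd = open("/nfs/home/user/test.db", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* byte-range lock, the same primitive SQLite uses on its db files */
        memset(&lck, 0, sizeof(lck));
        lck.l_type = F_RDLCK;
        lck.l_whence = SEEK_SET;
        lck.l_start = 0;
        lck.l_len = 1;
        if (fcntl(fd, F_SETLK, &lck) < 0)
                perror("fcntl(F_SETLK)");

        /* closing the descriptor while the NLM-backed lock is still held is
         * the path that ends up in locks_remove_posix()/locks_remove_flock() */
        close(fd);
        return 0;
}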
So is anybody looking at this problem now? It's pretty severe for us -- we want to ask our ITSO (IT Security Office) to perform security scans on all of our servers regularly, but we don't want our servers to go down because of the scans. Currently we have had to disable the security scans because of this bug. Please escalate this call if you can. Much appreciated. Thanks, Bruce Shei
Yes, I am looking at this issue. I am looking at it in conjunction with another bugzilla, 211092. This other bugzilla shows some similarities in the stack traces, although the scenario to recreate the situation is very different. It would help to have a testcase which can reproduce the problem. Perhaps it would be possible to simulate the ISS scan by watching the network traffic which causes the problem and then writing a program which generates the same sort of traffic?
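One hedged way to approximate the scan, assuming the relevant part of the ISS scan is simply probing the server's ports (the real scanner also sends RPC-level payloads, so this sketch may well be too weak to upset statd/lockd), is a plain port sweep like the hypothetical helper below:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Crude port sweep: TCP connect() plus a junk UDP datagram per port.
 * Usage: ./sweep <server-ip>  (hypothetical helper, not the ISS scanner) */
int main(int argc, char **argv)
{
        struct sockaddr_in sa;
        int port;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <server-ip>\n", argv[0]);
                return 1;
        }

        for (port = 1; port <= 1024; port++) {
                int t = socket(AF_INET, SOCK_STREAM, 0);
                int u = socket(AF_INET, SOCK_DGRAM, 0);

                memset(&sa, 0, sizeof(sa));
                sa.sin_family = AF_INET;
                sa.sin_port = htons(port);
                sa.sin_addr.s_addr = inet_addr(argv[1]);

                if (t >= 0 && connect(t, (struct sockaddr *)&sa, sizeof(sa)) == 0)
                        printf("tcp port %d open\n", port);
                if (u >= 0)
                        sendto(u, "\x00", 1, 0, (struct sockaddr *)&sa, sizeof(sa));

                if (t >= 0) close(t);
                if (u >= 0) close(u);
        }
        return 0;
}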
Thanks, Peter, for the quick response. I will try to contact our ITSO staff to see how much they can help. Unfortunately, it's kind of out of our control. But it looks like Steven at Cornell has described a reliable way to reproduce it? At least, that is what he indicated in his posting. Thanks and have a great day. --Bruce
Any information that you can provide would be appreciated. Steven, what is the sqlite3 command, and would it be possible to get a copy of hmm.db or some other database which can be used to reproduce the hang?
I'll contact the user to see if he is willing to release his code for testing purposes.
Created attachment 139148 [details] Procedures to reproduce the crash
Created attachment 139149 [details] database to reproduce the crash (See the attached procedure)
OK guys, I attached the procedure as well as the database that caused the crash for us. sqlite3 is from http://www.sqlite.org Please note, this database is not exactly the same one that caused the crash, but according to the user it should be similar enough. We'd rather not crash our compute server to prove it, though.... I hope this helps your debugging.
Thanx! I'll take a peek using this stuff and see what I can find.
Any news on this? I have the same problem (RHEL4U4):

Dec 4 18:22:19 storm kernel: [<e0b3c43a>] nlm_put_lockowner+0x11/0x49 [lockd]
Dec 4 18:22:19 storm kernel: [<e0b3ce2a>] nlmclnt_locks_release_private+0xb/0x14 [lockd]
Dec 4 18:22:19 storm kernel: [<e0bb5fb7>] nfs_lock+0x0/0xc7 [nfs]
Dec 4 18:22:19 storm kernel: [<c016e85c>] locks_remove_posix+0x130/0x137
Dec 4 18:22:19 storm kernel: [<c015bbc2>] __fput+0x41/0x100
Dec 4 18:22:19 storm kernel: [<c015a7f5>] filp_close+0x59/0x5f
Dec 4 18:22:19 storm kernel: [<c02d4703>] syscall_call+0x7/0xb

The test case is simply running a gcov-instrumented binary on NFS (tcp, hard), so that NFS locking is used to write the gcov data at program exit ...
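For what it's worth, the gcov case fits the same pattern: at program exit libgcov opens the .gcda file, takes an fcntl() lock on it before merging counters, writes, and closes. A rough stand-in for that exit-time sequence (hypothetical filename; this mimics the pattern and is not libgcov's actual code):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* stand-in for the libgcov exit path: lock, write, close a file on NFS */
int main(void)
{
        struct flock lck;
        /* hypothetical filename; place it on the NFS mount */
        int fd = open("test.gcda", O_RDWR | O_CREAT, 0666);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        memset(&lck, 0, sizeof(lck));
        lck.l_type = F_WRLCK;                   /* whole-file write lock */
        lck.l_whence = SEEK_SET;
        if (fcntl(fd, F_SETLKW, &lck) < 0)
                perror("fcntl(F_SETLKW)");

        write(fd, "counters\n", 9);

        /* close while the NLM-backed lock is still held */
        close(fd);
        return 0;
}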
No news yet. If you can distill the gcov behavior down to a simple testcase, it would be appreciated.
Me too: a crash with EIP at exactly the same instruction. I have opened service req. 1119828 (going on 3 months old). In my case, I believe the crash occurred when the client was holding (or maybe trying to get rid of) locks on a server that was down (or was maybe in the process of coming back up; I'm not sure).
Referring to http://lkml.org/lkml/2005/12/21/334

Test case:

#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        int fd;
        struct flock lck;

        fd = open("file_on_nfs", O_RDWR | O_CREAT, 0644);  /* file on an NFS mount */
        memset(&lck, 0, sizeof(lck));
        lck.l_type = F_WRLCK;                               /* whole-file write lock */
        fcntl(fd, F_SETLK, &lck);
        fchmod(fd, 02644);                                  /* change mode (setgid bit) while the lock is held */
        close(fd);
        return 0;
}
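A hedged usage note (my addition, not from the original report): compile the test case with gcc and run it with file_on_nfs pointing at a file on an NFS mount to exercise the lock-then-fchmod-then-close path; per a later comment in this thread, whether it actually brings the machine down may depend on rpc.statd having been restarted first.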
Yeah, I saw that Peter did respond on lkml. I don't care about getting a quick & dirty patch, but I do care about the kernel crash! Thx.
@laurent.deniel: When I run the reproducer in your comment #21, I get a different backtrace:

kernel BUG at fs/locks.c:1798!
invalid operand: 0000 [#1]
EFLAGS: 00010246 (2.6.9-42.0.10.ELsmp)
EIP is at locks_remove_flock+0xa1/0xe1
Call Trace:
[<f8d31fb7>] nfs_lock+0x0/0xc7 [nfs]
[<c016e787>] locks_remove_posix+0x8f/0x137
[<c015bb0a>] __fput+0x41/0x100
[<c015a73d>] filp_close+0x59/0x5f
[<c02d4903>] syscall_call+0x7/0xb

i.e., no NLM client stuff. I think comment #21 is germane to bz#218777.

For the record, speaking of NLM, the only way I could get the reproducer to crash the system was to restart rpc.statd (by condrestart-ing nfslock) first. After a fresh boot, the reproducer wouldn't work; I'd only get

lockd: cannot monitor 172.31.206.130
lockd: failed to monitor 172.31.206.130

where that IP address is the address of the local machine, and the file would be created and chmod-ed without causing a crash. I guess I should be thankful that statd usually doesn't work.
Right, the problem in #21 is more closely related to bz#218777.
Created attachment 152672 [details] Test case, README included
Any news since the last test case? Shall I open a service request to speed things up?
Sorry, I haven't had a chance to look at this yet. I will try to take a peek at it and see what I can find.
Has this test program been tried with the proposed patch included in bz211092?
Sorry but I don't have access to 211092 ...
Just hit this same problem on RHEL4U5. Any updates?
Actually, I knew which version of RHEL it was from the version field, so that change wasn't really needed. But no, no updates yet. I have been working on a data corruption issue and a different system crash in the Sun RPC code.
The patch referenced in the above discussion went into RHEL4.5. Comment #30 mentions seeing this issue in a RHEL4.5 kernel. I suspect however that that is a different problem that simply manifested itself in the same way. We've had a number of different fixes go in for problems that look similar to this one but are different. I'm going to close this bug as a duplicate of the one that added the patch under discussion above. If anyone is able to reproduce this on more recent kernels, please try to reproduce it on the kernels here: http://people.redhat.com/vgoyal/rhel4/RPMS.kernel/ ...and reopen this bug if you are able to do so. *** This bug has been marked as a duplicate of bug 218777 ***