From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3a) Gecko/20021110 Phoenix/0.4 Description of problem: When running on a loaded system, a SCP of a 3g file results in the process taking up 99.9% in system time over all processors. the system responds VERY slowly, but we can not kill that process. for some reason it only happens at the very end of the SCP. it also only happens when the machine is getting loaded (not to hard) Version-Release number of selected component (if applicable): How reproducible: Always Steps to Reproduce: 1. load the machine. we are using something which uses mmap heavily 2. run the scp once or twice 3. grab a coffee. Actual Results: See attached kernel dump. it looks like something in ext3 is caught in a spin lock Expected Results: not to have the machine hang ;-) (but you knew that) Additional info: Tested on. 1. HW Raid (with/without) 2. BIGmem kernel & SMP kernel 2.4.18-10 with 6G installed mem 3. 2 x Pentium IV Xeon procsseors (with hyperthreading enabled) 4. it has occured on 3 seperate machines so far I can provide any information you guys need, but I can't provide remote access.
Created attachment 84564 [details] kernel sysreq output (gzipped)
please try to reproduce this without binary only kernel modules loaded
[root@cn-sfo1-pma-sneezy root]# lsmod Module Size Used by Not tainted tg3 45568 1 usb-ohci 21472 0 (unused) usbcore 75904 1 [usb-ohci] ext3 70944 2 jbd 53792 2 [ext3] aic7xxx 128672 2 sd_mod 12992 4 scsi_mod 113284 2 [aic7xxx sd_mod] what is the binary module?
your trace showed the cpqevt bin only module. Can you positively confirm that it still happens without that binary only module?
sorry.. the dump was from a different machine. I can confirm that it happens without the binary modules. I'm waiting on getting a console on that box (without the binary modules) so I can get the sysreq output. if you know of a method of doing it via a normal ssh I'll get the sysreq output that way.
can you boot with the kernel parameters nmi_watchdog=1 profile=1 next time you boot and see if readprofile -m /boot/System.map | sort -n | tail -20 shows any functions that have a extreme high number ? (you can reset the counters with "readprofile -r"; if you can do that just before starting rsync the profile best reflects the problem case)
If it's happening just at the end of the scp, that's a sign that the expensive operation is the truncate() where we're deleting the old file and installing the new one. That fits with the trace showing scp stuck in truncate_list_pages(). Does the box eventually unstick itself? How long does the "locked" state last for?
[root@cn-sfo1-pma-sneezy root]# readprofile -r [root@cn-sfo1-pma-sneezy root]# df -k Filesystem 1k-blocks Used Available Use% Mounted on /dev/sdb1 17631216 16315344 420252 98% / /dev/sda1 101107 18976 76910 20% /boot none 1934464 0 1934464 0% /dev/shm [root@cn-sfo1-pma-sneezy root]# readprofile -m /boot/System.map-2.4.18-10smp | sort -n | tail -20 481 do_generic_file_read 0.4062 484 create_bounce 1.4578 587 __generic_copy_to_user 9.1719 732 schedule 0.7379 820 fget 12.8125 912 tcp_sendmsg 0.1952 1227 .text.lock.sched 2.0450 1240 system_call 22.1429 1425 .text.lock.tty_io 4.3712 1469 statm_pgd_range 3.2790 1513 file_read_actor 5.9102 1566 do_anonymous_page 5.1513 4643 .text.lock.fcntl 36.8492 7845 .text.lock.namei 6.3780 12581 .text.lock.inode 22.3464 12681 .text.lock.read_write 105.6750 15236 .text.lock.filemap 21.5502 28045 truncate_list_pages 46.1266 35631 default_idle 556.7344 140587 total 0.1138 [root@cn-sfo1-pma-sneezy root]# we had a box running for ~2 days with it stuck. BTW. .this is not a rsync.. just a normal scp (this is the output from a top if it helps) the searchD process is the custom code we wrote. 1:42pm up 35 min, 3 users, load average: 9.13, 7.81, 5.41 129 processes: 120 sleeping, 9 running, 0 zombie, 0 stopped CPU0 states: 0.0% user, 99.36% system, 0.0% nice, 0.14% idle CPU1 states: 0.0% user, 99.24% system, 0.1% nice, 0.25% idle CPU2 states: 0.0% user, 99.43% system, 0.0% nice, 0.7% idle CPU3 states: 0.0% user, 99.38% system, 0.4% nice, 0.8% idle Mem: 3868928K av, 3668596K used, 200332K free, 0K shrd, 17700K buff Swap: 1807000K av, 0K used, 1807000K free 3198684K cached PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND 1336 ianh 25 0 804 804 640 R 99.2 0.0 6:11 scp 1206 root 25 0 1020 1020 784 R 63.4 0.0 4:37 top 1154 searchd 22 5 26940 26M 2076 R N 16.9 0.6 1:05 searchd 1185 searchd 20 5 296M 296M 2160 R N 14.5 7.8 6:39 searchd 1190 searchd 23 5 296M 296M 2160 R N 11.4 7.8 6:54 searchd 1184 searchd 21 5 296M 296M 2160 R N 11.0 7.8 6:26 searchd 1155 searchd 20 5 26940 26M 2076 S N 9.3 0.6 0:38 searchd 1153 searchd 21 5 26940 26M 2076 S N 8.8 0.6 0:51 searchd 1187 searchd 22 5 296M 296M 2160 S N 8.6 7.8 6:47 searchd 1186 searchd 21 5 296M 296M 2160 S N 8.2 7.8 6:48 searchd 1191 searchd 20 5 296M 296M 2160 S N 8.0 7.8 6:43 searchd 1188 searchd 20 5 296M 296M 2160 S N 7.7 7.8 6:25 searchd
confirmed that box doesn't seem to 'unstick' itself. I left it running overnight and it is still @ 99% utilization.
BTW.. this is also happening with ext2. not sure if this was known to you guys. do you have any recommendations on what we can do to be able to SCP files?
I wrote up a problem (sys_ftruncate taking hours), which Stephen Tweedie said looks similar to this bug (77669). When I did a search, I thought my problem was more related to bug 71157, so I've added my observations to that bug. But I wanted to tie it in here also - they may all be related. Thanks!
FWIW I managed to grab redhat's AS2.1 kernel (the one on the ftp site) and installed that on the machine... and I can't reproduce the problem with that kernel.
I notice this references a 2.4.18-10 kernel. Have you applied the latest erratta fixes? The most recent kernel for the 7.x platform is 2.4.18-18.7.x. While this type of issue is not specifically listed in the changelog, this kernel does contain many changes, some of which are also in the AS 2.1 kernels.
hi ksonney. it looks like it the file in question has changed (mm/filemap.c) in the latest rev. I'll go get this kernel installed, and see if it still fails.
Possible cause found, patch available in bug #71157.
*** This bug has been marked as a duplicate of 71157 ***
Hi. I finally tried the latest kernel 2.4.18-18.7.x the problem is when I try to recreate the problem, it freezes the machine, before the scp is even attempted. for some reason the serial i/f is dead on the box, so I can't see what went wrong, I'll try again on monday, but it looks like this kernel is actually worse :(
This bug was closed as a duplicate of #71157. Please file updates there so that we can keep all of the information in one place. A non-bootable kernel is definitely a problem, but a different one --- please file a separate bug if you are able to confirm that it is a kernel fault. *** This bug has been marked as a duplicate of 71157 ***