From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3a) Gecko/20021110
Description of problem:
When running on a loaded system,
a SCP of a 3g file results in the process taking up 99.9% in system time over
the system responds VERY slowly, but we can not kill that process.
for some reason it only happens at the very end of the SCP.
it also only happens when the machine is getting loaded (not to hard)
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. load the machine. we are using something which uses mmap heavily
2. run the scp once or twice
3. grab a coffee.
Actual Results: See attached kernel dump.
it looks like something in ext3 is caught in a spin lock
Expected Results: not to have the machine hang ;-) (but you knew that)
1. HW Raid (with/without)
2. BIGmem kernel & SMP kernel 2.4.18-10 with 6G installed mem
3. 2 x Pentium IV Xeon procsseors (with hyperthreading enabled)
4. it has occured on 3 seperate machines so far
I can provide any information you guys need, but I can't provide remote access.
Created attachment 84564 [details]
kernel sysreq output (gzipped)
please try to reproduce this without binary only kernel modules loaded
[root@cn-sfo1-pma-sneezy root]# lsmod
Module Size Used by Not tainted
tg3 45568 1
usb-ohci 21472 0 (unused)
usbcore 75904 1 [usb-ohci]
ext3 70944 2
jbd 53792 2 [ext3]
aic7xxx 128672 2
sd_mod 12992 4
scsi_mod 113284 2 [aic7xxx sd_mod]
what is the binary module?
your trace showed the cpqevt bin only module.
Can you positively confirm that it still happens without that binary only module?
sorry.. the dump was from a different machine.
I can confirm that it happens without the binary modules.
I'm waiting on getting a console on that box (without the binary modules) so I
can get the sysreq output. if you know of a method of doing it via a normal ssh
I'll get the sysreq output that way.
can you boot with the kernel parameters
next time you boot and see if
readprofile -m /boot/System.map | sort -n | tail -20
shows any functions that have a extreme high number ?
(you can reset the counters with "readprofile -r"; if you can do that just
before starting rsync the profile best reflects the problem case)
If it's happening just at the end of the scp, that's a sign that the expensive
operation is the truncate() where we're deleting the old file and installing the
new one. That fits with the trace showing scp stuck in truncate_list_pages().
Does the box eventually unstick itself? How long does the "locked" state last for?
[root@cn-sfo1-pma-sneezy root]# readprofile -r
[root@cn-sfo1-pma-sneezy root]# df -k
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sdb1 17631216 16315344 420252 98% /
/dev/sda1 101107 18976 76910 20% /boot
none 1934464 0 1934464 0% /dev/shm
[root@cn-sfo1-pma-sneezy root]# readprofile -m /boot/System.map-2.4.18-10smp |
sort -n | tail -20
481 do_generic_file_read 0.4062
484 create_bounce 1.4578
587 __generic_copy_to_user 9.1719
732 schedule 0.7379
820 fget 12.8125
912 tcp_sendmsg 0.1952
1227 .text.lock.sched 2.0450
1240 system_call 22.1429
1425 .text.lock.tty_io 4.3712
1469 statm_pgd_range 3.2790
1513 file_read_actor 5.9102
1566 do_anonymous_page 5.1513
4643 .text.lock.fcntl 36.8492
7845 .text.lock.namei 6.3780
12581 .text.lock.inode 22.3464
12681 .text.lock.read_write 105.6750
15236 .text.lock.filemap 21.5502
28045 truncate_list_pages 46.1266
35631 default_idle 556.7344
140587 total 0.1138
we had a box running for ~2 days with it stuck.
BTW. .this is not a rsync.. just a normal scp
(this is the output from a top if it helps)
the searchD process is the custom code we wrote.
1:42pm up 35 min, 3 users, load average: 9.13, 7.81, 5.41
129 processes: 120 sleeping, 9 running, 0 zombie, 0 stopped
CPU0 states: 0.0% user, 99.36% system, 0.0% nice, 0.14% idle
CPU1 states: 0.0% user, 99.24% system, 0.1% nice, 0.25% idle
CPU2 states: 0.0% user, 99.43% system, 0.0% nice, 0.7% idle
CPU3 states: 0.0% user, 99.38% system, 0.4% nice, 0.8% idle
Mem: 3868928K av, 3668596K used, 200332K free, 0K shrd, 17700K buff
Swap: 1807000K av, 0K used, 1807000K free 3198684K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
1336 ianh 25 0 804 804 640 R 99.2 0.0 6:11 scp
1206 root 25 0 1020 1020 784 R 63.4 0.0 4:37 top
1154 searchd 22 5 26940 26M 2076 R N 16.9 0.6 1:05 searchd
1185 searchd 20 5 296M 296M 2160 R N 14.5 7.8 6:39 searchd
1190 searchd 23 5 296M 296M 2160 R N 11.4 7.8 6:54 searchd
1184 searchd 21 5 296M 296M 2160 R N 11.0 7.8 6:26 searchd
1155 searchd 20 5 26940 26M 2076 S N 9.3 0.6 0:38 searchd
1153 searchd 21 5 26940 26M 2076 S N 8.8 0.6 0:51 searchd
1187 searchd 22 5 296M 296M 2160 S N 8.6 7.8 6:47 searchd
1186 searchd 21 5 296M 296M 2160 S N 8.2 7.8 6:48 searchd
1191 searchd 20 5 296M 296M 2160 S N 8.0 7.8 6:43 searchd
1188 searchd 20 5 296M 296M 2160 S N 7.7 7.8 6:25 searchd
confirmed that box doesn't seem to 'unstick' itself. I left it running overnight
and it is still @ 99% utilization.
BTW.. this is also happening with ext2.
not sure if this was known to you guys.
do you have any recommendations on what we can do to be able to SCP files?
I wrote up a problem (sys_ftruncate taking hours), which Stephen Tweedie said
looks similar to this bug (77669). When I did a search, I thought my problem was
more related to bug 71157, so I've added my observations to that bug.
But I wanted to tie it in here also - they may all be related. Thanks!
FWIW I managed to grab redhat's AS2.1 kernel (the one on the ftp site) and
installed that on the machine...
and I can't reproduce the problem with that kernel.
I notice this references a 2.4.18-10 kernel. Have you applied the latest erratta
fixes? The most recent kernel for the 7.x platform is 2.4.18-18.7.x.
While this type of issue is not specifically listed in the changelog, this
kernel does contain many changes, some of which are also in the AS 2.1 kernels.
it looks like it the file in question has changed (mm/filemap.c) in the latest rev.
I'll go get this kernel installed, and see if it still fails.
Possible cause found, patch available in bug #71157.
*** This bug has been marked as a duplicate of 71157 ***
I finally tried the latest kernel 2.4.18-18.7.x
the problem is when I try to recreate the problem, it freezes the machine,
before the scp is even attempted.
for some reason the serial i/f is dead on the box, so I can't see what went
wrong, I'll try again on monday, but it looks like this kernel is actually worse :(
This bug was closed as a duplicate of #71157. Please file updates there so that
we can keep all of the information in one place.
A non-bootable kernel is definitely a problem, but a different one --- please
file a separate bug if you are able to confirm that it is a kernel fault.
*** This bug has been marked as a duplicate of 71157 ***