Bug 77669 - SCP takes up 99% of cpu when copying file.
SCP takes up 99% of cpu when copying file.
Status: CLOSED DUPLICATE of bug 71157
Product: Red Hat Linux
Classification: Retired
Component: kernel (Show other bugs)
7.3
i686 Linux
medium Severity high
: ---
: ---
Assigned To: Arjan van de Ven
Brian Brock
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2002-11-11 14:57 EST by Ian Holsman
Modified: 2007-04-18 12:48 EDT (History)
2 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2002-11-29 18:28:50 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
kernel sysreq output (gzipped) (5.38 KB, text/plain)
2002-11-11 14:58 EST, Ian Holsman
no flags Details

  None (edit)
Description Ian Holsman 2002-11-11 14:57:23 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3a) Gecko/20021110
Phoenix/0.4

Description of problem:
When running on a loaded system, 
a SCP of a 3g file results in the process taking up 99.9% in system time over
all processors. 
the system responds VERY slowly, but we can not kill that process.
for some reason it only happens at the very end of the SCP.

it also only happens when the machine is getting loaded (not to hard) 

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. load the machine. we are using something which uses mmap heavily
2. run the scp once or twice
3. grab a coffee.
	

Actual Results:  See attached kernel dump.
it looks like something in ext3 is caught in a spin lock

Expected Results:  not to have the machine hang ;-) (but you knew that)

Additional info:

Tested on.
1. HW Raid (with/without)
2. BIGmem kernel & SMP kernel 2.4.18-10  with 6G installed mem
3. 2 x Pentium IV Xeon procsseors (with hyperthreading enabled)
4. it has occured on 3 seperate machines so far

I can provide any information you guys need, but I can't provide remote access.
Comment 1 Ian Holsman 2002-11-11 14:58:10 EST
Created attachment 84564 [details]
kernel sysreq output (gzipped)
Comment 2 Arjan van de Ven 2002-11-11 15:02:28 EST
please try to reproduce this without binary only kernel modules loaded
Comment 3 Ian Holsman 2002-11-11 15:33:22 EST
 [root@cn-sfo1-pma-sneezy root]# lsmod
Module                  Size  Used by    Not tainted
tg3                    45568   1
usb-ohci               21472   0  (unused)
usbcore                75904   1  [usb-ohci]
ext3                   70944   2
jbd                    53792   2  [ext3]
aic7xxx               128672   2
sd_mod                 12992   4
scsi_mod              113284   2  [aic7xxx sd_mod]

what is the binary module?
Comment 4 Arjan van de Ven 2002-11-11 15:36:56 EST
your trace showed the cpqevt bin only module.
Can you positively confirm that it still happens without that binary only module?
Comment 5 Ian Holsman 2002-11-11 15:40:44 EST
sorry.. the dump was from a different machine. 

I can confirm that it happens without the binary modules.
I'm waiting on getting a console on that box (without the binary modules) so I
can get the sysreq output. if you know of a method of doing it via a normal ssh
I'll get the sysreq output that way.
Comment 6 Arjan van de Ven 2002-11-11 15:51:45 EST
can you boot with the kernel parameters
nmi_watchdog=1 profile=1
next time you boot and see if
readprofile -m /boot/System.map | sort -n | tail -20
shows any functions that have a extreme high number ?
(you can reset the counters with "readprofile -r"; if you can do that just
before starting rsync the profile best reflects the problem case)
Comment 7 Stephen Tweedie 2002-11-11 16:22:15 EST
If it's happening just at the end of the scp, that's a sign that the expensive
operation is the truncate() where we're deleting the old file and installing the
new one.  That fits with the trace showing scp stuck in truncate_list_pages().

Does the box eventually unstick itself?  How long does the "locked" state last for?
Comment 8 Ian Holsman 2002-11-11 16:44:14 EST
[root@cn-sfo1-pma-sneezy root]# readprofile -r
[root@cn-sfo1-pma-sneezy root]# df -k
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sdb1             17631216  16315344    420252  98% /
/dev/sda1               101107     18976     76910  20% /boot
none                   1934464         0   1934464   0% /dev/shm
[root@cn-sfo1-pma-sneezy root]# readprofile -m /boot/System.map-2.4.18-10smp |
sort -n | tail -20
   481 do_generic_file_read                       0.4062
   484 create_bounce                              1.4578
   587 __generic_copy_to_user                     9.1719
   732 schedule                                   0.7379
   820 fget                                      12.8125
   912 tcp_sendmsg                                0.1952
  1227 .text.lock.sched                           2.0450
  1240 system_call                               22.1429
  1425 .text.lock.tty_io                          4.3712
  1469 statm_pgd_range                            3.2790
  1513 file_read_actor                            5.9102
  1566 do_anonymous_page                          5.1513
  4643 .text.lock.fcntl                          36.8492
  7845 .text.lock.namei                           6.3780
 12581 .text.lock.inode                          22.3464
 12681 .text.lock.read_write                    105.6750
 15236 .text.lock.filemap                        21.5502
 28045 truncate_list_pages                       46.1266
 35631 default_idle                             556.7344
140587 total                                      0.1138
[root@cn-sfo1-pma-sneezy root]#


we had a box running for ~2 days with it stuck.
BTW. .this is not a rsync.. just a normal scp

(this is the output from a top if it helps)
the searchD process is the custom code we wrote.
  1:42pm  up 35 min,  3 users,  load average: 9.13, 7.81, 5.41
129 processes: 120 sleeping, 9 running, 0 zombie, 0 stopped
CPU0 states:  0.0% user, 99.36% system,  0.0% nice,  0.14% idle
CPU1 states:  0.0% user, 99.24% system,  0.1% nice,  0.25% idle
CPU2 states:  0.0% user, 99.43% system,  0.0% nice,  0.7% idle
CPU3 states:  0.0% user, 99.38% system,  0.4% nice,  0.8% idle
Mem:  3868928K av, 3668596K used,  200332K free,       0K shrd,   17700K buff
Swap: 1807000K av,       0K used, 1807000K free                 3198684K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 1336 ianh      25   0   804  804   640 R    99.2  0.0   6:11 scp
 1206 root      25   0  1020 1020   784 R    63.4  0.0   4:37 top
 1154 searchd   22   5 26940  26M  2076 R N  16.9  0.6   1:05 searchd
 1185 searchd   20   5  296M 296M  2160 R N  14.5  7.8   6:39 searchd
 1190 searchd   23   5  296M 296M  2160 R N  11.4  7.8   6:54 searchd
 1184 searchd   21   5  296M 296M  2160 R N  11.0  7.8   6:26 searchd
 1155 searchd   20   5 26940  26M  2076 S N   9.3  0.6   0:38 searchd
 1153 searchd   21   5 26940  26M  2076 S N   8.8  0.6   0:51 searchd
 1187 searchd   22   5  296M 296M  2160 S N   8.6  7.8   6:47 searchd
 1186 searchd   21   5  296M 296M  2160 S N   8.2  7.8   6:48 searchd
 1191 searchd   20   5  296M 296M  2160 S N   8.0  7.8   6:43 searchd
 1188 searchd   20   5  296M 296M  2160 S N   7.7  7.8   6:25 searchd

Comment 9 Ian Holsman 2002-11-12 14:44:31 EST
confirmed that box doesn't seem to 'unstick' itself. I left it running overnight
and it is still @ 99% utilization.
Comment 10 Ian Holsman 2002-11-12 18:40:15 EST
BTW.. this is also happening with ext2.
not sure if this was known to you guys.
do you have any recommendations on what we can do to be able to SCP files?
Comment 11 Neal McBurnett 2002-11-15 15:40:54 EST
I wrote up a problem (sys_ftruncate taking hours), which Stephen Tweedie said
looks similar to this bug (77669).  When I did a search, I thought my problem was
more related to bug 71157, so I've added my observations to that bug.

But I wanted to tie it in here also - they may all be related.  Thanks!
Comment 12 Ian Holsman 2002-11-20 19:49:00 EST
FWIW I managed to grab redhat's AS2.1 kernel (the one on the ftp site) and
installed that on the machine...
and I can't reproduce the problem with that kernel.
Comment 13 Kevin Sonney 2002-11-25 18:12:31 EST
I notice this references a 2.4.18-10 kernel. Have you applied the latest erratta
fixes? The most recent kernel for the 7.x platform is 2.4.18-18.7.x. 

While this type of issue is not specifically listed in the changelog, this
kernel does contain many changes, some of which are also in the AS 2.1 kernels.
Comment 14 Ian Holsman 2002-11-26 11:42:04 EST
hi ksonney.

it looks like it the file in question has changed (mm/filemap.c) in the latest rev.
I'll go get this kernel installed, and see if it still fails.
Comment 15 Stephen Tweedie 2002-11-27 12:55:39 EST
Possible cause found, patch available in bug #71157.
Comment 16 Stephen Tweedie 2002-11-27 12:56:09 EST

*** This bug has been marked as a duplicate of 71157 ***
Comment 17 Ian Holsman 2002-11-29 18:28:43 EST
Hi.
I finally tried the latest kernel  2.4.18-18.7.x 
the problem is when I try to recreate the problem, it freezes the machine,
before the scp is even attempted.

for some reason the serial i/f is dead on the box, so I can't see what went
wrong, I'll try again on monday, but it looks like this kernel is actually worse :(
Comment 18 Stephen Tweedie 2002-12-02 15:21:32 EST
This bug was closed as a duplicate of #71157.  Please file updates there so that
we can keep all of the information in one place.  

A non-bootable kernel is definitely a problem, but a different one --- please
file a separate bug if you are able to confirm that it is a kernel fault.

*** This bug has been marked as a duplicate of 71157 ***

Note You need to log in before you can comment on or make changes to this bug.