Bug 71157

Summary:

gimp process is unkillable

Product:

[Retired] Red Hat Linux

Reporter:

Jeremy Sanders <jss>

Component:

kernel

Assignee:

Arjan van de Ven <arjanv>

Status:

CLOSED ERRATA

QA Contact:

Brian Brock <bbrock>

Severity:

medium

Priority:

medium

Version:

7.3

CC:

chris, neal, sct

Hardware:

athlon

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2003-01-11 00:13:40 UTC

Attachments:

Description	Flags
Fix for truncate hang	none

Description Jeremy Sanders 2002-08-09 13:24:58 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1b) Gecko/20020722

Description of problem:
I can (almost) reproducibly get a gimp process into an unkillable state with
kernel 2.4.18-5 on an Athlon. The gimp version is gimp-1.2.3-4. When I select an
area for trimming on a large TIFF image, the process gets stuck in the "R" state:

000   914 20587 18770  25   0 45768 43228 -     R    pts/4      5:18 gimp mccarthy.tiff
000   914 20588 20587  16   0  6492 3008 schedu S    pts/4      0:00 /usr/lib/gimp/1.2/plug-ins/script-fu -gimp 9 8 -run 0

Neither a normal kill nor a kill -9 gets rid of the process. I've also tried
virtually every other signal, but none of them has any effect. The only way I've
been able to remove the process is to go to /proc/pid and write random data from
/dev/urandom into the memory space of the process.

Pressing ctrl+scroll lock gives the following trace on the process:

gimp          R current      0 20587  18770 20588               (NOTLB)
Call Trace: [<c0129886>] truncate_list_pages [kernel] 0x1e6 
[<c01298db>] truncate_inode_pages [kernel] 0x3b 
[<c012742b>] vmtruncate [kernel] 0x9b 
[<c014dd06>] inode_setattr [kernel] 0x26 
[<e0820f16>] ext3_setattr [ext3] 0x1c6 
[<c0127ce5>] vm_enough_memory [kernel] 0x35 
[<c014de8e>] notify_change [kernel] 0x5e 
[<c0138926>] do_truncate [kernel] 0x46 
[<c0116542>] __wake_up [kernel] 0x42 
[<e331c2a6>] es1371_interrupt [es1371] 0x66 
[<c0138c39>] sys_ftruncate [kernel] 0x129 
[<c0108913>] system_call [kernel] 0x33 

script-fu     S D6E00000     0 20588  20587                     (NOTLB)
Call Trace: [<c0120b14>] schedule_timeout [kernel] 0x14 
[<c0147dc6>] do_select [kernel] 0x206 
[<c0148169>] sys_select [kernel] 0x339 
[<c0108913>] system_call [kernel] 0x33 
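
The frames in traces like the two above follow a fixed "[<address>] symbol [module] offset" layout, so they can be split into fields mechanically. The following is a minimal sketch in Python, assuming that layout; parse_frames and the FRAME pattern are illustrative names, not part of any kernel tool. Keep in mind that 2.4 traces may include stale stack entries alongside the live call chain, so not every frame is necessarily an active caller.

```python
import re

# One frame looks like: [<c0129886>] truncate_list_pages [kernel] 0x1e6
# (layout assumed from the traces quoted in this report)
FRAME = re.compile(r"\[<([0-9a-f]{8})>\]\s+(\S+)\s+\[(\S+)\]\s+(0x[0-9a-f]+)")


def parse_frames(text):
    """Return (address, symbol, module, offset) tuples in the order printed."""
    return [
        (int(addr, 16), symbol, module, int(offset, 16))
        for addr, symbol, module, offset in FRAME.findall(text)
    ]


if __name__ == "__main__":
    sample = (
        "Call Trace: [<c0129886>] truncate_list_pages [kernel] 0x1e6 \n"
        "[<c01298db>] truncate_inode_pages [kernel] 0x3b \n"
        "[<e0820f16>] ext3_setattr [ext3] 0x1c6 \n"
    )
    for addr, symbol, module, offset in parse_frames(sample):
        print(f"{addr:08x} {module:8s} {symbol}+{offset:#x}")
```

The first tuple returned is the innermost frame, which here is truncate_list_pages.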



Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Open this particular TIFF image (I can supply it; it is copyrighted but
downloadable material)
2. Select a figure with the trim tool
3. Click trim


Actual Results:  Process hangs. No kill works.

Expected Results:  Process shouldn't hang, but should be killable if it does.

Additional info:

Comment 1 Jeremy Sanders 2002-08-09 13:27:19 UTC

Here are also the contents of /proc/pid/status:

Name:	gimp
State:	R (running)
Tgid:	20587
Pid:	20587
PPid:	18770
TracerPid:	0
Uid:	914	914	914	914
Gid:	15	15	15	15
FDSize:	32
Groups:	15 0 
VmSize:	   45768 kB
VmLck:	       0 kB
VmRSS:	   43228 kB
VmData:	   38832 kB
VmStk:	      88 kB
VmExe:	    1648 kB
VmLib:	    4220 kB
SigPnd:	0000000000004100
SigBlk:	0000000000000000
SigIgn:	8000000000001000
SigCgt:	00000000000144e7
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
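
The Sig* fields above are hexadecimal bitmasks in which bit n-1 stands for signal n. The following is a small sketch for decoding them, assuming the Linux i386 signal numbering; decode_sigmask and the SIGNAL_NAMES table are illustrative names and cover only a subset of signals. Decoding the SigPnd value shows signals 9 (SIGKILL) and 15 (SIGTERM) pending, which is consistent with the kills having been queued but not yet acted on.

```python
# Partial name table, assuming Linux i386 signal numbers.
SIGNAL_NAMES = {
    1: "SIGHUP", 2: "SIGINT", 3: "SIGQUIT", 6: "SIGABRT", 7: "SIGBUS",
    8: "SIGFPE", 9: "SIGKILL", 11: "SIGSEGV", 13: "SIGPIPE",
    15: "SIGTERM", 17: "SIGCHLD",
}


def decode_sigmask(hex_mask):
    """Return the signal numbers whose bits are set in a /proc status mask."""
    mask = int(hex_mask, 16)
    # Bit 0 corresponds to signal 1, bit 1 to signal 2, and so on.
    return [bit + 1 for bit in range(64) if (mask >> bit) & 1]


def describe(hex_mask):
    """Map each set signal number to a name, or to its number if unnamed."""
    return [SIGNAL_NAMES.get(n, str(n)) for n in decode_sigmask(hex_mask)]


if __name__ == "__main__":
    print("SigPnd:", describe("0000000000004100"))  # ['SIGKILL', 'SIGTERM']
    print("SigIgn:", describe("8000000000001000"))
```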

Comment 2 Christopher Wong 2002-08-22 20:13:32 UTC

For what it's worth, I have seen the same behavior after running transcode (from freshrpms.net) 
for some hours to convert a few GB of DV to Divx. It completes, but stays hung. Transcode 
sees and acknowledges the kill signal when I hit Ctrl-C, but does not terminate.

Comment 3 Neal McBurnett 2002-11-15 20:37:19 UTC

I've noted something very similar on the ext3 list:
 https://listman.redhat.com/pipermail/ext3-users/2002-August/003923.html

But if I wait long enough (once, 17 hours....) the kernel does
return from the sys_ftruncate call and the process does resume
working just fine.

I'm running Red Hat 7.3, kernel 2.4.18-3.

Stephen Tweedie noted a similar complaint with kernel 2.4.18-10,
reported as bugzilla bug 77669, in which high system load made the
problem easier to reproduce.  I don't know whether system load was an
issue in my cases, and I haven't experimented with that.  But mine have
been harder to replicate.

Here are the details for my situation.

Several times recently my "mutt" email program has looped for
hours at a time in the middle of a sys_ftruncate call.  This happens
when I use the "$" command to write changes out to my mailbox.  It
does eventually return from the call and everything seems to have
worked ok.  But in the meantime the CPU is pegged, $MAIL
is locked so I can't receive new mail, and signals to the program
(like kill -9) don't take effect for hours.  Once it was 17 hours,
once 3, etc.

The problem showed up shortly after upgrading from Red Hat 7.1 and
converting the file systems to ext3.  I'm running Red Hat 7.3, kernel
2.4.18-3, mutt-1.2.5.1-1.

Strace didn't help at all, but thanks to a tip from Kevin Fenzi
I learned how to use sysrq to find out where the process was, viz:

 18:03:10 kernel: mutt          R current   1024  8893   7929  (NOTLB)
 18:03:10 kernel: Call Trace: [<c0127061>] truncate_list_pages [kernel] 0x79 
 18:03:10 kernel: [<c01271ff>] truncate_inode_pages [kernel] 0x3b 
 18:03:10 kernel: [<c0124f2e>] vmtruncate [kernel] 0x96 
 18:03:10 kernel: [<c01491f0>] inode_setattr [kernel] 0x24 
 18:03:10 kernel: [<d401f963>] ext3_setattr [ext3] 0x1c3 
 18:03:10 kernel: [<d401d810>] ext3_get_block [ext3] 0x0 
 18:03:10 kernel: [<c01281db>] do_generic_file_read [kernel] 0x2c3 
 18:03:10 kernel: [<c0149359>] notify_change [kernel] 0x5d 
 18:03:10 kernel: [<c012a2aa>] generic_file_write [kernel] 0x5c2 
 18:03:10 kernel: [<c01348ce>] do_truncate [kernel] 0x46 
 18:03:10 kernel: [<c0134bd1>] sys_ftruncate [kernel] 0x12d
 18:03:10 kernel: [<c01085f7>] system_call [kernel] 0x33 

I noticed that an fsck hadn't been done for months, so I did one
with this result, indicating some sort of problem with $MAIL:

 13:25:31 fsck: /var:  
 13:25:31 fsck: Truncating orphaned inode 44891 (uid=6265, gid=6265, mode=0100600, size=175526062) 
 13:25:36 fsck: /var has gone 69 days without being checked, check forced. 
 13:25:43 fsck: /var: 1057/104040 files (24.0% non-contiguous), 281356/415768 blocks 

The file in question is large:

  44891 -rw-------    1 neal     neal     175694250 Aug 13 13:50 /var/mail/neal

It has been working fine for a few months, but the problem has become
noticeable again recently.  In the last few days it hasn't taken as
long as 17 hours, but it has sometimes taken unusual and uncomfortable
amounts of time (many minutes at least).  Normally, with my 266 MB
$MAIL, it only takes a few seconds to update the file, since mutt is
clever enough to write only the tail end of the file, starting with the
first change.

It doesn't seem like a mutt bug, since the whole episode takes place
inside a single system call, and the problem only showed up after
upgrading to Red Hat 7.3 and ext3, leaving mutt unchanged.

Comment 4 Stephen Tweedie 2002-11-21 15:13:14 UTC

Were these SMP or single-processor systems?

Comment 5 Jeremy Sanders 2002-11-21 15:22:09 UTC

Single-processor

xpc3:~> cat /proc/cpuinfo 
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 4
model name	: AMD Athlon(tm) Processor
stepping	: 2
cpu MHz		: 1109.935
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips	: 2199.78

Comment 6 Neal McBurnett 2002-11-21 17:33:55 UTC

Mine is also a single-processor system: 

tmp:1845)cat /proc/cpuinfo 
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 5
model           : 8
model name      : AMD-K6(tm) 3D processor
stepping        : 12
cpu MHz         : 500.017
cache size      : 64 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr mce cx8 pge mmx syscall 3dnow k6_mtrr
bogomips        : 996.14

Comment 7 Stephen Tweedie 2002-11-27 17:53:16 UTC

The following patch should fix it. Basically, the low-latency patches were not
handling the case of a partial truncate of a huge file that is already in cache.
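
The access pattern described here (write a large file so that its pages are in cache, then truncate it to a slightly smaller size) can be exercised from user space. The following is a rough sketch in Python, not a confirmed reproducer: the function names, sizes, and chunk size are assumptions for illustration, and on a kernel without this bug the call should return almost immediately.

```python
import os
import tempfile
import time


def time_partial_truncate(path, size, new_size, chunk=1 << 20):
    """Write `size` bytes to `path`, then time an ftruncate down to `new_size`."""
    block = b"\0" * chunk
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(block[:n])  # real writes, so the pages land in the cache
            remaining -= n
        f.flush()
        start = time.monotonic()
        os.ftruncate(f.fileno(), new_size)  # partial truncate: new_size > 0
        elapsed = time.monotonic() - start
    return elapsed, os.path.getsize(path)


def demo(size=4 << 20, new_size=3 << 20):
    """Run the measurement on a scratch file in a temporary directory."""
    with tempfile.TemporaryDirectory() as scratch:
        return time_partial_truncate(os.path.join(scratch, "big.bin"), size, new_size)


if __name__ == "__main__":
    elapsed, final_size = demo()
    print(f"ftruncate took {elapsed:.6f}s, file is now {final_size} bytes")
```

The default sizes are deliberately small so that the sketch runs quickly. The files mentioned in this report were far larger (a mailbox of over 170 MB, for example), so the sizes would need to be raised to approximate those cases.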

Comment 8 Stephen Tweedie 2002-11-27 17:55:01 UTC

Created attachment 86746
Fix for truncate hang

Comment 9 Stephen Tweedie 2002-11-27 17:56:02 UTC

*** Bug 77669 has been marked as a duplicate of this bug. ***

Comment 10 Stephen Tweedie 2002-12-02 20:21:21 UTC

*** Bug 77669 has been marked as a duplicate of this bug. ***

Comment 11 Neal McBurnett 2003-01-10 23:18:37 UTC

I applied the patch to kernel-source-2.4.18-3, and for the last few days I
haven't seen the delays I used to see when I use the "$" command to write
changes out to my mailbox.

Thanks, Stephen!

Comment 12 Stephen Tweedie 2003-01-11 00:13:40 UTC

Fixed in the 2.4.18-19.7.x and 2.4.18-19.8.0 errata kernels.