From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1b) Gecko/20020722

Description of problem:
I can (almost) reproducibly get a gimp process into an unkillable state with kernel 2.4.18-5 on an Athlon. The gimp version is gimp-1.2.3-4. When I select an area for trimming on a large TIFF image, the process gets stuck in an "R" state:

000 914 20587 18770 25 0 45768 43228 - R pts/4 5:18 gimp mccarthy.tiff
000 914 20588 20587 16 0 6492 3008 schedu S pts/4 0:00 /usr/lib/gimp/1.2/plug-ins/script-fu -gimp 9 8 -run 0

Neither a normal kill nor a kill -9 gets rid of the process. I've also tried virtually every other signal, but none of them gets a response. The only way I've been able to remove the process is to go to /proc/pid and echo data from /dev/urandom into the mem space of the process.

Pressing ctrl+scroll lock gives the following trace on the process:

gimp R current 0 20587 18770 20588 (NOTLB)
Call Trace:
[<c0129886>] truncate_list_pages [kernel] 0x1e6
[<c01298db>] truncate_inode_pages [kernel] 0x3b
[<c012742b>] vmtruncate [kernel] 0x9b
[<c014dd06>] inode_setattr [kernel] 0x26
[<e0820f16>] ext3_setattr [ext3] 0x1c6
[<c0127ce5>] vm_enough_memory [kernel] 0x35
[<c014de8e>] notify_change [kernel] 0x5e
[<c0138926>] do_truncate [kernel] 0x46
[<c0116542>] __wake_up [kernel] 0x42
[<e331c2a6>] es1371_interrupt [es1371] 0x66
[<c0138c39>] sys_ftruncate [kernel] 0x129
[<c0108913>] system_call [kernel] 0x33

script-fu S D6E00000 0 20588 20587 (NOTLB)
Call Trace:
[<c0120b14>] schedule_timeout [kernel] 0x14
[<c0147dc6>] do_select [kernel] 0x206
[<c0148169>] sys_select [kernel] 0x339
[<c0108913>] system_call [kernel] 0x33

Version-Release number of selected component (if applicable):

How reproducible: Sometimes

Steps to Reproduce:
1. Open up this particular tiff image (can supply this, but copyrighted (but downloadable) material)
2. Select a figure with the trim tool
3. Click trim

Actual Results: Process hangs. No kill works.
Expected Results: Process shouldn't hang, but should be killable if it does.

Additional info:
Here are also the contents of /proc/pid/status:

Name: gimp
State: R (running)
Tgid: 20587
Pid: 20587
PPid: 18770
TracerPid: 0
Uid: 914 914 914 914
Gid: 15 15 15 15
FDSize: 32
Groups: 15 0
VmSize: 45768 kB
VmLck: 0 kB
VmRSS: 43228 kB
VmData: 38832 kB
VmStk: 88 kB
VmExe: 1648 kB
VmLib: 4220 kB
SigPnd: 0000000000004100
SigBlk: 0000000000000000
SigIgn: 8000000000001000
SigCgt: 00000000000144e7
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
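The signal fields in the status output above are hexadecimal bitmasks, where bit N-1 stands for signal N. A minimal Python sketch of decoding them follows; the helper name decode_sigmask is made up for illustration, and the mask strings are copied from the output above. Read this way, SigPnd holds signals 9 and 15 (SIGKILL and SIGTERM on Linux/i386), which would be consistent with the kills being queued but not acted on while the process stays inside the system call.

```python
def decode_sigmask(hex_mask):
    """Return the signal numbers whose bits are set in a /proc/<pid>/status mask."""
    value = int(hex_mask, 16)
    # Bit 0 corresponds to signal 1, bit 1 to signal 2, and so on.
    return [bit + 1 for bit in range(value.bit_length()) if (value >> bit) & 1]


# Mask values copied from the status output in this report.
masks = {
    "SigPnd": "0000000000004100",
    "SigBlk": "0000000000000000",
    "SigIgn": "8000000000001000",
    "SigCgt": "00000000000144e7",
}

for name, mask in masks.items():
    print(name, decode_sigmask(mask))
```

Nothing is blocked (SigBlk is empty), so the pending signals are not being held back by the process's own signal mask.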
For what it's worth, I have seen the same behavior after running transcode (from freshrpms.net) for some hours to convert a few GB of DV to Divx. It completes, but stays hung. Transcode sees and acknowledges the kill signal when I hit Ctrl-C, but does not terminate.
I've noted something very similar on the ext3 list: https://listman.redhat.com/pipermail/ext3-users/2002-August/003923.html

But if I wait long enough (once, 17 hours....) the kernel does return from the sys_ftruncate call and the process does resume working just fine. I'm running Red Hat 7.3, kernel 2.4.18-3.

Stephen Tweedie noted a similar complaint with kernel 2.4.18-10, reported as bugzilla bug 77669, in which high system load increased the reproducibility. I don't know if system load was an issue for my cases, and I haven't experimented with that. But mine have been harder to replicate.

Here are the details for my situation. Several times recently my "mutt" email program has looped for hours at a time in the middle of a sys_ftruncate call. This happens when I use the "$" command to write changes out to my mailbox. It does eventually return from the call and everything seems to have worked ok. But in the meantime the CPU is pegged, $MAIL is locked so I can't receive new mail, and signals to the program (like kill -9) don't take effect for hours. Once it was 17 hours, once 3, etc.

The problem showed up shortly after upgrading from Red Hat 7.1 and converting the file systems to ext3. I'm running Red Hat 7.3, kernel 2.4.18-3, mutt-1.2.5.1-1.
Strace didn't help at all, but thanks to a tip from Kevin Fenzi I learned how to use sysrq to find out where the process was, viz:

18:03:10 kernel: mutt R current 1024 8893 7929 (NOTLB)
18:03:10 kernel: Call Trace: [<c0127061>] truncate_list_pages [kernel] 0x79
18:03:10 kernel: [<c01271ff>] truncate_inode_pages [kernel] 0x3b
18:03:10 kernel: [<c0124f2e>] vmtruncate [kernel] 0x96
18:03:10 kernel: [<c01491f0>] inode_setattr [kernel] 0x24
18:03:10 kernel: [<d401f963>] ext3_setattr [ext3] 0x1c3
18:03:10 kernel: [<d401d810>] ext3_get_block [ext3] 0x0
18:03:10 kernel: [<c01281db>] do_generic_file_read [kernel] 0x2c3
18:03:10 kernel: [<c0149359>] notify_change [kernel] 0x5d
18:03:10 kernel: [<c012a2aa>] generic_file_write [kernel] 0x5c2
18:03:10 kernel: [<c01348ce>] do_truncate [kernel] 0x46
18:03:10 kernel: [<c0134bd1>] sys_ftruncate [kernel] 0x12d
18:03:10 kernel: [<c01085f7>] system_call [kernel] 0x33

I noticed that an fsck hadn't been done for months, so I did one with this result, indicating some sort of problem with $MAIL:

13:25:31 fsck: /var:
13:25:31 fsck: Truncating orphaned inode 44891 (uid=6265, gid=6265, mode=0100600, size=175526062)
13:25:36 fsck: /var has gone 69 days without being checked, check forced.
13:25:43 fsck: /var: 1057/104040 files (24.0% non-contiguous), 281356/415768 blocks

The file in question is large:

44891 -rw------- 1 neal neal 175694250 Aug 13 13:50 /var/mail/neal

Things had been working fine for a few months, but the problem has become noticeable again recently. In the last few days it hasn't taken as long as 17 hours, but it has sometimes taken unusual and uncomfortable amounts of time (many minutes at least). Normally, with my 266 MB $MAIL, it only takes a few seconds to update the file, since mutt is clever enough to only write the tail end of the file starting with the first change.
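To compare the sysrq traces in this report, it can help to pull just the symbol names out of the syslog lines. The following Python sketch does that with a regular expression; the pattern and the helper name parse_frames are assumptions based only on the line format shown in this report, and the sample text is a subset of the mutt trace above. Note that not every entry in such a trace is necessarily part of the live call chain.

```python
import re

# Matches entries such as: [<c0127061>] truncate_list_pages [kernel] 0x79
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+(\S+)\s+\[([^\]]+)\]\s+(0x[0-9a-f]+)")


def parse_frames(text):
    """Return (address, symbol, module, offset) tuples in the order they appear."""
    return FRAME_RE.findall(text)


# A few lines copied from the mutt trace in this report.
sample = """\
18:03:10 kernel: Call Trace: [<c0127061>] truncate_list_pages [kernel] 0x79
18:03:10 kernel: [<c01271ff>] truncate_inode_pages [kernel] 0x3b
18:03:10 kernel: [<c0124f2e>] vmtruncate [kernel] 0x96
18:03:10 kernel: [<d401f963>] ext3_setattr [ext3] 0x1c3
18:03:10 kernel: [<c0134bd1>] sys_ftruncate [kernel] 0x12d
18:03:10 kernel: [<c01085f7>] system_call [kernel] 0x33
"""

frames = parse_frames(sample)
for address, symbol, module, offset in frames:
    print(f"{symbol:<24}{module:<8}{offset}")
```

Run against both the gimp and the mutt traces, this makes it easy to see that each one has truncate_list_pages at the top and sys_ftruncate near the bottom.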
It doesn't seem like a mutt bug, since the whole episode takes place inside a single system call, and the problem only showed up after upgrading to Red Hat 7.3 and ext3, leaving mutt unchanged.
Were these SMP or single-processor systems?
Single-processor

xpc3:~> cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 4
model name : AMD Athlon(tm) Processor
stepping : 2
cpu MHz : 1109.935
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips : 2199.78
Mine is also a single-processor system:

tmp:1845)cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 5
model : 8
model name : AMD-K6(tm) 3D processor
stepping : 12
cpu MHz : 500.017
cache size : 64 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr mce cx8 pge mmx syscall 3dnow k6_mtrr
bogomips : 996.14
The following patch should fix it. Basically, the low-latency patches were not dealing with the case of a partial truncate of a huge file that is already in cache.
Created attachment 86746: Fix for truncate hang
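For anyone who wants to exercise the same system-call pattern on a test machine, here is a rough Python sketch of a "partial truncate of a file that is already in cache": write a file, read it back so its pages are likely cached, then ftruncate it to a slightly smaller size. This is not the attached patch, and it is not claimed to reproduce the hang; the function name partial_truncate, the file name, and the sizes are arbitrary choices for illustration, and the sizes here are far smaller than the files in this report.

```python
import os
import tempfile


def partial_truncate(path, size, keep):
    """Write `size` bytes to `path`, read them back, then truncate to `keep` bytes."""
    chunk = b"\0" * (1 << 20)
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(remaining, len(chunk))
            f.write(chunk[:n])
            remaining -= n
    # Read the file back so its pages are likely to be in the page cache.
    with open(path, "rb") as f:
        while f.read(1 << 20):
            pass
    # Drop only the tail of the file, as in the traces (sys_ftruncate).
    fd = os.open(path, os.O_RDWR)
    try:
        os.ftruncate(fd, keep)
    finally:
        os.close(fd)
    return os.path.getsize(path)


# Small demonstration: 8 MiB file, truncated to 6 MiB.
with tempfile.TemporaryDirectory() as tmpdir:
    final_size = partial_truncate(os.path.join(tmpdir, "mailbox.test"), 8 << 20, 6 << 20)
    print("size after truncate:", final_size)
```

Timing the ftruncate call for larger sizes, on a scratch ext3 filesystem, would be one way to compare kernels with and without the fix.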
*** Bug 77669 has been marked as a duplicate of this bug. ***
I applied the patch to kernel-source-2.4.18-3, and for the last few days I haven't seen the delays I used to see when I use the "$" command to write changes out to my mailbox. Thanks, Stephen!
Fixed in the 2.4.18-19.7.x and 2.4.18-19.8.0 errata kernels.