Bug 64935

Summary: kernel-2.4.18-4 causes rstatd to segfault; kernel-2.4.18-3 is OK
Product: [Retired] Red Hat Linux
Component: rusers
Version: 7.3
Hardware: i686
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: Joel Votaw <joel>
Assignee: Phil Knirsch <pknirsch>
QA Contact: Brian Brock <bbrock>
CC: jlamb, redhat, rvokal, tao
Doc Type: Bug Fix
Last Closed: 2004-01-12 12:59:36 UTC
Attachments: proposed fix (flags: none)

Description Joel Votaw 2002-05-14 21:25:08 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

Description of problem:
I upgraded from kernel-2.4.18-3 to kernel-2.4.18-4 on a Red Hat Linux 7.3 i686 
box.  rstatd now segfaults when it tries to service a rup request.  I ran 
strace and it dies right after reading /proc/stat.  I downgraded to the old 
kernel version and rstatd works fine.  

It looks like the format of the "intr" line in /proc/stat changed (maybe the 
high order bit is always set in the data now?).  I suspect rstatd or libproc 
has a statically sized buffer which is being overflowed.  

Ideally you would double check:

    libproc
    rstatd
    the kernel itself

and fix any errors in this area.  Also, you might check to see if the same 
error occurs on i386, alpha, and other platforms; I didn't get a chance to 
check that.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Upgrade to kernel-2.4.18-4 on Red Hat Linux 7.3 on an i686 (possibly other 
platforms too)
2. Enable rstatd service
3. Run rstatd manually with strace so you see what happens
4. Query the machine using "rup mymachine".
5. Note strace output.
	

Actual Results:  rstatd crashed.  rup client timed out.

Expected Results:  rstatd not crashing. rup client getting good data back. 
(This *does* work correctly with kernel-2.4.18-3.)

Additional info:

Here is the tail of my strace output

gettimeofday({1021408685, 316640}, NULL) = 0
lseek(3, 0, SEEK_SET)                   = 0
read(3, "74421.66 74360.31\n", 1023)    = 18
open("/proc/loadavg", O_RDONLY)         = 6
lseek(6, 0, SEEK_SET)                   = 0
read(6, "0.00 0.00 0.00 1/48 4143\n", 1023) = 25
open("/proc/stat", O_RDONLY)            = 7
read(7, "cpu  2862 41 12268 7426995\ncpu0 "..., 1023) = 1023
close(7)                                = 0
--- SIGSEGV (Segmentation fault) ---
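
Editor's note: the last read() in the trace above returned exactly 1023 bytes, the full amount requested, so /proc/stat on this kernel is at least that long and may not fit in a single fixed-size read. The following is a minimal sketch, not the rusers code, of reading a whole file into a buffer that grows as needed. The function name read_whole_file is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: read all of `path` into a malloc'd, NUL-terminated
 * buffer, growing it as needed. Returns NULL on error; the caller frees. */
char *read_whole_file(const char *path, size_t *len_out)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    size_t cap = 1024, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL) {
        close(fd);
        return NULL;
    }
    for (;;) {
        /* Always keep one byte free for the terminating NUL. */
        ssize_t n = read(fd, buf + len, cap - len - 1);
        if (n < 0) {
            free(buf);
            close(fd);
            return NULL;
        }
        if (n == 0)
            break;                  /* end of file */
        len += (size_t)n;
        if (len == cap - 1) {       /* buffer full: double it, keep reading */
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) {
                free(buf);
                close(fd);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }
    }
    close(fd);
    buf[len] = '\0';
    if (len_out != NULL)
        *len_out = len;
    return buf;
}
```

A caller would use it as `char *s = read_whole_file("/proc/stat", NULL);`, search the result with strstr(), and free(s) afterwards. The sketch does not retry on EINTR.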

Comment 1 Joel Votaw 2002-05-15 17:49:42 UTC
If you have problems reproducing this bug, email me and I will help you however 
I can.

Comment 2 Michael K. Johnson 2002-05-16 19:22:41 UTC
xosview had a similar bug reported against it in bug #65004, but this
is definitely a bug in rstatd, not in the kernel.

Comment 3 akopps 2002-05-25 04:24:49 UTC
I am observing the same problem with rstatd on my RH Linux 7.3 test box.

Comment 4 Phil Knirsch 2002-05-28 14:55:00 UTC
Yep, all related, /proc/stat's output changed :-(

I'll try to get a fix done sometime soon, just a little busy right now.

Read ya, Phil

PS: Unofficial ETA is sometime early next week.

Comment 5 Matt Wilson 2002-05-30 21:08:32 UTC
Created attachment 59005 [details]
proposed fix

Comment 6 Matt Wilson 2002-06-10 15:44:18 UTC
built this fix in rawhide.


Comment 7 Jennifer E. Lamb 2002-09-17 21:06:29 UTC
One of my clients has reported the following in AS 2.1:

When rpc.rstatd is queried, the daemon crashes, and the remote client times out.
This was due to a change in the /proc/stat format.

I have created a patch that makes it work. There were two changes: increasing the
buffer size, and removing the leading 1 from the interrupt field so that the
assert passes. This is against the rusers RPM.

diff -ruN netkit-rusers-0.17.orig/rpc.rstatd/rstat_proc.c netkit-rusers-0.17/rpc.rstatd/rstat_proc.c
--- netkit-rusers-0.17.orig/rpc.rstatd/rstat_proc.c Fri Sep  6 16:46:13 2002
+++ netkit-rusers-0.17/rpc.rstatd/rstat_proc.c Fri Sep  6 22:54:16 2002
@@ -401,7 +401,7 @@
    unsigned *itot, unsigned *i1, unsigned *ct, struct _ldisk *d)
{
  static int stat;
-#define BUFFSIZE 1024
+#define BUFFSIZE 10240
  char buff[BUFFSIZE];
  int ndisks;
  
@@ -433,7 +433,7 @@
    sscanf(b, "swap %u %u", sin, sout);
    b = strstr(buff, "intr ");
    if(b)
-    sscanf(b, "intr %u %u", itot, i1);
+    sscanf(b, "intr %u 1%u", itot, i1);
    b = strstr(buff, "ctxt ");
    if(b)
    sscanf(b, "ctxt %u", ct);

This problem is the same as bugzilla bug 64935. It's patched in Rawhide and Milan
Beta 5. Will this be available as an erratum against AS 2.1?
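
Editor's note: with the format string "intr %u 1%u" from the patch above, sscanf() converts the second field only when that field begins with the digit 1. Otherwise it stops after the first conversion and leaves *i1 untouched. The sketch below shows one way to notice that case by checking sscanf()'s return value. The name parse_intr_patched is hypothetical and is not part of netkit-rusers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: locate the "intr" line in a /proc/stat image and
 * parse it with the patched format. Returns the number of fields that were
 * converted (0, 1 or 2) so the caller can tell a partial match from a
 * complete one instead of using an unset *i1. */
int parse_intr_patched(const char *stat_text, unsigned *itot, unsigned *i1)
{
    assert(stat_text != NULL && itot != NULL && i1 != NULL);
    const char *b = strstr(stat_text, "intr ");
    if (b == NULL)
        return 0;
    /* The literal '1' in the format must match the next input character. */
    int n = sscanf(b, "intr %u 1%u", itot, i1);
    return n < 0 ? 0 : n;   /* map EOF to "nothing converted" */
}
```

A return value of 2 means both fields were read. A return value of 1 means the second field did not start with 1, and the caller should not rely on *i1.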



Comment 8 Joel Votaw 2003-03-05 01:05:08 UTC
The patch seemed to work great for a while, but I am having the same problem 
again; it just takes longer to happen.  It seems to happen when the system has 
been up for a long time under heavy use -- I believe one of the fields in 
/proc/stat gets too big, either in length or with a value that won't fit in a 
32-bit int when converted.

Here is the output from strace when I cause it to die (note that the line of 
code that may be barfing is listed there):

poll([{fd=5, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND,
revents=POLLIN|POLLRDNORM}], 1, -1) = 1
recvmsg(5, {msg_name(16)={sin_family=AF_INET, sin_port=htons(719),
sin_addr=inet_addr("172.16.12.18")}},
msg_iov(1)=[{"N\6\251\271\0\0\0\0\0\0\0\2\0\1\206\241\0\0\0\3\0\0\0\1"...,
8800}], msg_controllen=24, msg_control=0x804c9e8, , msg_flags=0}, 0) = 40
gettimeofday({1046825303, 191266}, NULL) = 0
lseek(3, 0, SEEK_SET)                   = 0
read(3, "7082163.25 2762983.98\n", 1023) = 22
open("/proc/loadavg", O_RDONLY)         = 6
lseek(6, 0, SEEK_SET)                   = 0
read(6, "1.00 1.00 1.00 2/107 4846\n", 1023) = 26
open("/proc/stat", O_RDONLY)            = 7
read(7, "cpu  377489858 4358 54955354 275"..., 1023) = 305
close(7)                                = 0
write(2, "rpc.rstatd: rstat_proc.c:440: ge"..., 69) = 69
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
getpid()                                = 4840
kill(4840, SIGABRT)                     = 0
--- SIGABRT (Aborted) ---





Here is the content of /proc/stat:

cpu  377518191 4358 55003153 275766756
cpu0 377518191 4358 55003153 275766754
page 15264978 57854701
swap 2501022 8441218
intr 221612009 3626457387 3 0 0 0 0 3 0 1 0 0 883122944 3 0 6998945 19
disk_io: (3,0):(7022554,3271708,30529450,3750846,115709380)
ctxt 1668073263
btime 1039743140
processes 3858880
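
Editor's note: in the intr line above, the first field (221612009) is smaller than the second (3626457387). Adding up the per-IRQ counts gives 4516579305, and 4516579305 - 2^32 = 221612009, so the reported total equals the sum wrapped to 32 bits. That is consistent with a 32-bit counter wrapping on a long-running system, though it does not by itself prove the cause. The sketch below repeats the check in C. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: sum the per-IRQ counters of an "intr" line in 32-bit
 * unsigned arithmetic (which wraps modulo 2^32) and report whether the
 * result equals the line's first field, the total.
 * Returns 1 if it does, 0 otherwise or if the line is malformed. */
int intr_total_is_wrapped_sum(const char *line)
{
    assert(line != NULL);
    if (strncmp(line, "intr ", 5) != 0)
        return 0;
    const char *p = line + 5;
    char *end;
    unsigned long long total = strtoull(p, &end, 10);
    if (end == p)
        return 0;                /* no total field */
    uint32_t sum = 0;
    for (p = end; ; p = end) {
        unsigned long long v = strtoull(p, &end, 10);
        if (end == p)
            break;               /* no more numbers on the line */
        sum += (uint32_t)v;      /* unsigned overflow wraps modulo 2^32 */
    }
    return total == sum;
}
```

Passing the intr line quoted above to this function returns 1.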


Comment 9 Bob Farmer 2003-04-30 03:08:14 UTC
Seeing the same problem here after a system has been up for a while.  Here's a
similar strace as above (just the end part), but with a larger -s value so the
whole string is printed:

...
open("/proc/stat", O_RDONLY)            = 7
read(7, "cpu  224476113 302404084 71892413 1211616544\ncpu0 116589464 150929667
35810452 601864994\ncpu1 107886649 151474417 36081961 609751550\npage 1229467537
1386040909\nswap 6986619 10169722\nintr 27842287 905194577 94 0 0 0 149581204 3
0 1 0 3260879792 0 20 0 7153886 2 2 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"..., 1023) = 789
close(7)                                = 0
write(2, "rpc.rstatd: rstat_proc.c:440: getstat: Assertion `*itot>*i1\'
failed.\n", 69) = 69
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
getpid()                                = 26773
kill(26773, SIGABRT)                    = 0
--- SIGABRT (Aborted) ---

This bug happens with both the RH7.1 and RH7.3 rpc.rstatd...
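
Editor's note: the assertion in the trace above requires the interrupt total to exceed the first per-IRQ count, which the quoted data (27842287 versus 905194577) does not satisfy. The following is a minimal sketch of parsing the two fields and returning an error code instead of aborting. It is not the change shipped in any rusers package, and the name get_intr_fields is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical replacement for the asserting parse. Returns 0 on success,
 * -1 if the "intr" line is missing or malformed. It deliberately does NOT
 * require total > first IRQ count; the caller decides how to treat that. */
int get_intr_fields(const char *stat_text,
                    unsigned long long *itot, unsigned long long *i1)
{
    assert(stat_text != NULL && itot != NULL && i1 != NULL);
    const char *b = strstr(stat_text, "intr ");
    if (b == NULL)
        return -1;
    b += 5;                          /* skip "intr " */

    char *end;
    errno = 0;
    unsigned long long total = strtoull(b, &end, 10);
    if (end == b || errno == ERANGE)
        return -1;                   /* no number, or out of range */

    b = end;
    unsigned long long first = strtoull(b, &end, 10);
    if (end == b || errno == ERANGE)
        return -1;

    *itot = total;
    *i1 = first;
    return 0;
}
```

With this shape, a daemon could log the odd values and keep serving requests when the total is smaller than the first count, rather than exiting on SIGABRT.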

Comment 10 Rhett Butler 2003-05-01 15:16:21 UTC
Same strace output as Comment #9 during a rup.

System is RH7.3 running 2.4.18-19.7.xsmp and rusers-server-0.17-12.

The fix was to upgrade to RH8.0's rusers-server-0.17-21.

Comment 11 Bob Farmer 2003-05-06 19:45:28 UTC
Nifty, upgrading to the RH8.0 rusers-server seems to fix the problem here too,
and I don't notice any adverse side-effects of doing so...  


Comment 12 Phil Knirsch 2004-01-12 12:59:36 UTC
Closing now as fixed in current release (which it is. ^^).

Read ya, Phil