Bug 64935
| Summary: | kernel-2.4.18-4 causes rstatd to segfault; kernel-2.4.18-3 is OK | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Linux | Reporter: | Joel Votaw <joel> |
| Component: | rusers | Assignee: | Phil Knirsch <pknirsch> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 7.3 | CC: | jlamb, redhat, rvokal, tao |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i686 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2004-01-12 12:59:36 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Description
Joel Votaw
2002-05-14 21:25:08 UTC
If you have problems reproducing this bug, email me and I will help you however I can.

xosview had a similar bug reported against it in bug #65004, but this is definitely a bug in rstatd, not in the kernel.

I am observing the same problem with rstatd on my RH Linux 7.3 test box.

Yep, all related, /proc/stat output changed :-( I'll try to get a fix done sometime soon, just a little busy right now. Read ya, Phil

PS: Unofficial ETA is sometime early next week.

Created attachment 59005: proposed fix

built this fix in rawhide.

One of my clients has reported the following in AS 2.1: When rpc.rstatd is queried, the daemon crashes, and the remote client times out. This was due to a change in the /proc/stat format. I have created a patch that makes it work. There were 2 changes: changing the buffer to a larger value, and removing the leading 1 on the interrupt field to let the assert pass correctly. This is against the rusers RPM.

    diff -ruN netkit-rusers-0.17.orig/rpc.rstatd/rstat_proc.c netkit-rusers-0.17/rpc.rstatd/rstat_proc.c
    --- netkit-rusers-0.17.orig/rpc.rstatd/rstat_proc.c Fri Sep 6 16:46:13 2002
    +++ netkit-rusers-0.17/rpc.rstatd/rstat_proc.c Fri Sep 6 22:54:16 2002
    @@ -401,7 +401,7 @@
     unsigned *itot, unsigned *i1, unsigned *ct, struct _ldisk *d)
     {
     static int stat;
    -#define BUFFSIZE 1024
    +#define BUFFSIZE 10240
     char buff[BUFFSIZE];
     int ndisks;
     
    @@ -433,7 +433,7 @@
     sscanf(b, "swap %u %u", sin, sout);
     b = strstr(buff, "intr ");
     if(b)
    - sscanf(b, "intr %u %u", itot, i1);
    + sscanf(b, "intr %u 1%u", itot, i1);
     b = strstr(buff, "ctxt ");
     if(b)
     sscanf(b, "ctxt %u", ct);

This problem is the same as bugzilla bug 64935. It's patched in Rawhide and Milan Beta 5.

Will this be available as an errata against AS2.1?

The patch seemed to work great for a while, but I am having the same problem again; it just takes longer for it to happen. It seems to happen when the system has been up for a long time in heavy use -- I believe one of the fields in /proc/stat gets too big, either in length, or has a value that won't fit in a 32-bit int when converted.

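To make the second hunk of the patch above concrete, here is a minimal sketch of how the two `sscanf` format strings behave on an `intr` line. It is not code from netkit-rusers: the function names `parse_intr_original` and `parse_intr_patched` are hypothetical wrappers, and only the two format strings are taken from the diff.

```c
#include <stdio.h>

/* Hypothetical wrapper around the ORIGINAL format string from the diff.
 * Returns the number of fields sscanf converted (0, 1 or 2). */
static int parse_intr_original(const char *line, unsigned *itot, unsigned *i1)
{
    return sscanf(line, "intr %u %u", itot, i1);
}

/* Hypothetical wrapper around the PATCHED format string from the diff.
 * The literal '1' in the format must match the first digit of the
 * second number; if it does not, sscanf stops and *i1 is left untouched. */
static int parse_intr_patched(const char *line, unsigned *itot, unsigned *i1)
{
    return sscanf(line, "intr %u 1%u", itot, i1);
}
```

With the `intr` line from the /proc/stat contents quoted in this report (`intr 221612009 3626457387 ...`), the original format converts both fields, while the patched format converts only the total, because the second number starts with 3 rather than 1. On an invented line such as `intr 1500 1200`, the patched format converts both fields and stores 200 in `*i1`. What the daemon then compares depends on the value `*i1` held before the call, which this sketch does not model.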
Here is the output from strace when I cause it to die (note that the line of code that may be barfing is listed there):

    poll([{fd=5, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND, revents=POLLIN|POLLRDNORM}], 1, -1) = 1
    recvmsg(5, {msg_name(16)={sin_family=AF_INET, sin_port=htons(719), sin_addr=inet_addr("172.16.12.18")}}, msg_iov(1)=[{"N\6\251\271\0\0\0\0\0\0\0\2\0\1\206\241\0\0\0\3\0\0\0\1"..., 8800}], msg_controllen=24, msg_control=0x804c9e8, , msg_flags=0}, 0) = 40
    gettimeofday({1046825303, 191266}, NULL) = 0
    lseek(3, 0, SEEK_SET) = 0
    read(3, "7082163.25 2762983.98\n", 1023) = 22
    open("/proc/loadavg", O_RDONLY) = 6
    lseek(6, 0, SEEK_SET) = 0
    read(6, "1.00 1.00 1.00 2/107 4846\n", 1023) = 26
    open("/proc/stat", O_RDONLY) = 7
    read(7, "cpu 377489858 4358 54955354 275"..., 1023) = 305
    close(7) = 0
    write(2, "rpc.rstatd: rstat_proc.c:440: ge"..., 69) = 69
    rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
    getpid() = 4840
    kill(4840, SIGABRT) = 0
    --- SIGABRT (Aborted) ---

Here is the content of /proc/stat:

    cpu 377518191 4358 55003153 275766756
    cpu0 377518191 4358 55003153 275766754
    page 15264978 57854701
    swap 2501022 8441218
    intr 221612009 3626457387 3 0 0 0 0 3 0 1 0 0 883122944 3 0 6998945 19
    disk_io: (3,0):(7022554,3271708,30529450,3750846,115709380)
    ctxt 1668073263
    btime 1039743140
    processes 3858880

Seeing the same problem here after a system has been up for a while. Here's a similar strace as above (just the end part), but with a larger -s value so the whole string is printed:

    ...
    open("/proc/stat", O_RDONLY) = 7
    read(7, "cpu 224476113 302404084 71892413 1211616544\ncpu0 116589464 150929667 35810452 601864994\ncpu1 107886649 151474417 36081961 609751550\npage 1229467537 1386040909\nswap 6986619 10169722\nintr 27842287 905194577 94 0 0 0 149581204 3 0 1 0 3260879792 0 20 0 7153886 2 2 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"..., 1023) = 789
    close(7) = 0
    write(2, "rpc.rstatd: rstat_proc.c:440: getstat: Assertion `*itot>*i1\' failed.\n", 69) = 69
    rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
    getpid() = 26773
    kill(26773, SIGABRT) = 0
    --- SIGABRT (Aborted) ---

This bug happens with both the RH7.1 and RH7.3 rpc.rstatd...

Same strace output as Comment #9 during a rup. System is RH7.3 running 2.4.18-19.7.xsmp and rusers-server-0.17-12. The fix was to upgrade to RH8.0's rusers-server-0.17-21.

Nifty, upgrading to the RH8.0 rusers-server seems to fix the problem here too, and I don't notice any adverse side-effects of doing so...

Closing now as fixed in current release (which it is. ^^). Read ya, Phil
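The `intr` numbers quoted in this report are consistent with the "value that won't fit in a 32-bit int" theory raised earlier. As a rough check, the sketch below sums the per-IRQ counters of an `intr` line in 64 bits and lets the caller compare the truncated sum with the reported total. The helper name `sum_intr_line` is hypothetical, and that the kernel keeps the total in a 32-bit counter is an assumption the numbers suggest, not something stated in this report.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse an "intr" line as quoted in this report.
 * On success stores the reported total in *total, the first per-IRQ
 * counter in *first, the 64-bit sum of all per-IRQ counters in *sum,
 * and returns 0. Returns -1 if the line does not start with "intr "
 * or has no total or no per-IRQ counters. */
static int sum_intr_line(const char *line, uint32_t *total,
                         uint32_t *first, uint64_t *sum)
{
    const char *p;
    char *end;
    int n = 0;

    if (strncmp(line, "intr ", 5) != 0)
        return -1;

    p = line + 5;
    *total = (uint32_t)strtoul(p, &end, 10);
    if (end == p)
        return -1;              /* no total field */

    *sum = 0;
    for (p = end; ; p = end) {
        unsigned long v = strtoul(p, &end, 10);
        if (end == p)
            break;              /* no more numbers on the line */
        if (n++ == 0)
            *first = (uint32_t)v;
        *sum += v;              /* accumulate without 32-bit wrap */
    }
    return n > 0 ? 0 : -1;
}
```

For the first sample (`intr 221612009 3626457387 3 0 0 0 0 3 0 1 0 0 883122944 3 0 6998945 19`), the per-IRQ counters add up to 4516579305, and 4516579305 - 2^32 = 221612009, which is exactly the reported total. The counters visible in the second, truncated sample add up to 4322809583, which reduces the same way to the reported 27842287. In both cases the wrapped total is smaller than the first counter, so the assertion `*itot>*i1` at rstat_proc.c:440 would fail whenever both fields are parsed.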