Description of problem:
A Dell PE1950 server converts an IPv6 TS stream to an IPv4 TS stream using VLC, with 450Mbps of multicast network traffic in/out. The net-snmp snmpd service crashes within several days.

gdb output at the crash:

Program received signal SIGFPE, Arithmetic exception.
0x00002aaaab37744a in var_hrproc (vp=0x7fffffffbf50, name=<value optimized out>, length=<value optimized out>, exact=<value optimized out>, var_len=0x7fffffffcb80, write_method=<value optimized out>) at host/hr_proc.c:183
183 long_return = 100 - long_return;
(gdb) disassemble
Dump of assembler code for function var_hrproc:
0x00002aaaab3773a0 <var_hrproc+0>: push %rbp
......
0x00002aaaab377446 <var_hrproc+166>: sub 0x20(%rcx),%rdi
0x00002aaaab37744a <var_hrproc+170>: div %rdi
0x00002aaaab37744d <var_hrproc+173>: mov $0x64,%edx
......
(gdb) info registers
rax 0x0 0
rbx 0x7fffffffcb80 140737488341888
rcx 0x2aaab4f0e180 46912668492160
rdx 0x0 0
rsi 0x2aaaab119170 46912502862192
rdi 0x0 0
rbp 0x7fffffffbf50 0x7fffffffbf50
rsp 0x7fffffffbee0 0x7fffffffbee0
......

The SIGFPE is certainly caused by a division by zero: the faulting instruction is "div %rdi" and rdi is 0.

Tracing the code in [net-snmp-5.4.2.1] agent/mibgroup/host/hr_proc.c, lines 181 - 185:

long_return = (cpu->idle_ticks - cpu->history[0].idle_hist)*100;
long_return /= (cpu->total_ticks - cpu->history[0].total_hist);
long_return = 100 - long_return;
if (long_return < 0)
long_return = 0;

Summary:
BUG1. I think the missing zero check before the division in hr_proc.c line 182 causes the SIGFPE crash.
BUG2. hr_proc.c lines 184-185 no longer work as before, because long_return is now an unsigned variable. Since version 5.4.2, the definition of long_return has changed from [signed long] to [fsblkcnt_t], which is always an unsigned integer, so the "< 0" check on lines 184-185 can never be true.
Definition of long_return: agent/snmp_vars.c:fsblkcnt_t long_return;

Version-Release number of selected component (if applicable):
Name : net-snmp Relocations: (not relocatable)
Version : 5.4.2.1 Vendor: Fedora Project
Release : 3.fc10 Build Date: Mon 16 Feb 2009 07:26:07 PM CST

How reproducible:
With a background network load of 450Mbps IPv6(in)/IPv4(out) multicast IPTV TS traffic (converted by vlc), start the /etc/init.d/snmpd service; it crashes within several days.

Steps to Reproduce:
1. Generate a background network load of 450Mbps IPv6(in)/IPv4(out) multicast IPTV TS traffic (converted by vlc).
2. Start the /etc/init.d/snmpd service and wait for it to crash within several days.

Actual results:
snmpd crashes within several days.

Expected results:
snmpd continues to work without a SIGFPE crash.

Additional info:
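For illustration, a minimal sketch of how the computation on lines 181 - 185 could be guarded is given here. It is not the upstream fix. The struct layouts and the function name cpu_load_percent are hypothetical stand-ins that contain only the fields named in this report, and returning 0 when no ticks have elapsed is an assumed fallback.

```c
/* Hypothetical stand-ins for the fields named in the report; the real
 * net-snmp structures contain more members. */
struct cpu_hist {
    long idle_hist;
    long total_hist;
};

struct cpu_info {
    long idle_ticks;
    long total_ticks;
    struct cpu_hist history[1];
};

/* Returns the CPU load in percent, clamped to 0..100.
 * Uses signed arithmetic so that the "< 0" check is meaningful. */
static long cpu_load_percent(const struct cpu_info *cpu)
{
    long idle_delta  = cpu->idle_ticks  - cpu->history[0].idle_hist;
    long total_delta = cpu->total_ticks - cpu->history[0].total_hist;
    long load;

    /* No ticks elapsed since the history sample: skip the division
     * that would otherwise raise SIGFPE (assumed fallback value). */
    if (total_delta <= 0)
        return 0;

    load = 100 - (idle_delta * 100) / total_delta;
    if (load < 0)
        load = 0;
    if (load > 100)
        load = 100;
    return load;
}
```

With identical current and history counters the function returns 0 instead of dividing by zero; with an idle delta of 100 out of a total delta of 1000 it returns 90.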
Strange, how did you get cpu->total_ticks == cpu->history[0].total_hist, which produces the division by zero? Is the affected CPU stuck somehow? Anyway, checking for zero is a good idea. And I am finally going to remove the patch that adds fsblkcnt_t long_return; it's obviously wrong.
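To show why the fsblkcnt_t definition breaks the clamp on lines 184-185, here is a small sketch. The function names are hypothetical, and unsigned long is used as an assumed stand-in for fsblkcnt_t; the input is the idle percentage already computed by the division.

```c
#include <limits.h>

/* Mirrors lines 183-185 with an unsigned variable (stand-in for fsblkcnt_t). */
static unsigned long clamp_unsigned(unsigned long idle_percent)
{
    unsigned long v = 100 - idle_percent; /* wraps around if idle_percent > 100 */
    if (v < 0)                            /* never true for an unsigned type */
        v = 0;
    return v;
}

/* The same lines with the original signed long definition. */
static long clamp_signed(long idle_percent)
{
    long v = 100 - idle_percent;
    if (v < 0)
        v = 0;
    return v;
}
```

If the idle percentage ever comes out as 101, the signed version returns 0, while the unsigned version wraps around and returns ULONG_MAX. Compilers may also warn that the unsigned comparison is always false.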
I fixed the bug upstream (SVN revision 17616), built a new package in Rawhide, and I am going to push updates to F10 and F11.
net-snmp-5.4.2.1-4.fc10 has been submitted as an update for Fedora 10. http://admin.fedoraproject.org/updates/net-snmp-5.4.2.1-4.fc10
net-snmp-5.4.2.1-12.fc11 has been submitted as an update for Fedora 11. http://admin.fedoraproject.org/updates/net-snmp-5.4.2.1-12.fc11
net-snmp-5.4.2.1-4.fc10 has been pushed to the Fedora 10 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update net-snmp'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F10/FEDORA-2009-5062
I previously ran kernel-2.6.24.7-92.fc8.x86_64, where snmpd crashed frequently and the system eventually went down. I have since updated to kernel-2.6.29.3-155.fc11.x86_64, and the snmpd crash has not occurred in recent days. Maybe the old kernel-2.6.24.7-92.fc8 made cpu->total_ticks == cpu->history[0].total_hist, which led to the division by zero.
net-snmp-5.4.2.1-4.fc10 has been pushed to the Fedora 10 stable repository. If problems still persist, please make note of it in this bug report.
net-snmp-5.4.2.1-12.fc11 has been pushed to the Fedora 11 stable repository. If problems still persist, please make note of it in this bug report.