Bug 729738 - net-snmp dumps core in netsnmp_oid_find_prefix
Summary: net-snmp dumps core in netsnmp_oid_find_prefix
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: net-snmp
Version: 6.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 6.2
Assignee: Jan Safranek
QA Contact: BaseOS QE Security Team
URL: http://sourceforge.net/tracker/index....
Whiteboard:
Depends On:
Blocks: 696653
Reported: 2011-08-10 17:37 UTC by Martin Wilck
Modified: 2011-12-06 17:12 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When an AgentX subagent was being disconnected from the snmpd daemon, the daemon did not properly detach all outstanding SNMP requests from the internal session object representing the AgentX subagent. Therefore, the snmpd daemon could crash when processing these requests. With this update, the snmpd daemon ensures that no outstanding SNMP request points to an AgentX session that is being closed.
Clone Of:
Environment:
Last Closed: 2011-12-06 17:12:16 UTC


Attachments
core dump + infos (1.85 MB, application/x-xz)
2011-08-10 17:38 UTC, Martin Wilck
SRVMAGT-BIOS MIB (8.72 KB, text/plain)
2011-08-12 07:44 UTC, Martin Wilck
serverview 5.10.22 (13.74 MB, application/x-tar)
2011-08-12 07:48 UTC, Martin Wilck


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:1524 normal SHIPPED_LIVE net-snmp bug fix update 2011-12-06 01:02:35 UTC

Description Martin Wilck 2011-08-10 17:37:26 UTC
Description of problem:
net-snmp dumps core in netsnmp_oid_find_prefix

Version-Release number of selected component (if applicable):
net-snmp-libs-5.5-27.el

How reproducible:
sporadically

Steps to Reproduce:
1. high-load test with running net-snmpd
  
Actual results:
core was generated by `/usr/sbin/snmpd -LS0-6d -Lf /dev/null -p /var/run/snmpd.pid'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f41acdf5001 in netsnmp_oid_find_prefix ()
   from /usr/lib64/libnetsnmp.so.20
Thread 1 (Thread 2898):
#0  0x00007f41acdf5001 in netsnmp_oid_find_prefix ()
   from /usr/lib64/libnetsnmp.so.20
#1  0x00007f41aecc61ae in netsnmp_add_varbind_to_cache ()
   from /usr/lib64/libnetsnmpagent.so.20
#2  0x00007f41aecc685c in netsnmp_reassign_requests ()
   from /usr/lib64/libnetsnmpagent.so.20
#3  0x00007f41aecc68e8 in handle_getnext_loop ()
   from /usr/lib64/libnetsnmpagent.so.20
#4  0x00007f41aecc9b82 in check_delayed_request ()
   from /usr/lib64/libnetsnmpagent.so.20
#5  0x00007f41aecc9d0d in netsnmp_check_outstanding_agent_requests ()
   from /usr/lib64/libnetsnmpagent.so.20
#6  0x00007f41aecca7d5 in netsnmp_remove_delegated_requests_for_session ()
   from /usr/lib64/libnetsnmpagent.so.20
#7  0x00007f41aeceafb7 in close_agentx_session ()
   from /usr/lib64/libnetsnmpagent.so.20
#8  0x00007f41aeceb57c in handle_master_agentx_packet ()
   from /usr/lib64/libnetsnmpagent.so.20
#9  0x00007f41ace0566f in _sess_read () from /usr/lib64/libnetsnmp.so.20
#10 0x00007f41ace06049 in snmp_sess_read2 () from /usr/lib64/libnetsnmp.so.20
#11 0x00007f41ace0610b in snmp_read2 () from /usr/lib64/libnetsnmp.so.20
#12 0x00007f41af126fde in main ()


Expected results:
no coredump

Additional info:
Looks like an old problem in net-snmp

http://fixunix.com/snmp/173336-net-snmp-5-3-0-1-coredump-linux.html
http://www.mail-archive.com/net-snmp-users@lists.sourceforge.net/msg12603.html
http://sourceforge.net/tracker/index.php?func=detail&aid=1633670&group_id=12694&atid=112694

Comment 1 Martin Wilck 2011-08-10 17:38:36 UTC
Created attachment 517664 [details]
core dump + infos

Comment 3 Jan Safranek 2011-08-11 15:57:38 UTC
From the coredump I can only see that the tree cache became corrupted, without any indication why. snmpd crashed when processing a GETNEXT request for OID 1.3.6.1.4.1.231.2.10.2.2.1, while some AgentX subagent was being disconnected.

I cannot tell whether the disconnected subagent was involved in handling the 1.3.6.1.4.1.231.2.10.2.2.1 OID or not...

I suppose you cannot share details about how do you use AgentX? How many subagents do you have, how often do they disconnect/reconnect?

With the information above (GETNEXT && a subagent being disconnected), can you reproduce the bug in a more reliable way?

I'll try to investigate the crash further in parallel, but without your subagent(s), I am mostly blind.

Comment 4 Martin Wilck 2011-08-12 07:44:38 UTC
Created attachment 517981 [details]
SRVMAGT-BIOS MIB

iso.org.dod.internet.private = 1.3.6.1.4
enterprises(1).sni(231).sniProductMibs(2).sniExtensions(10).sniServerMgmt(2).sniCommon(2).sniBios(1)
= 1.3.6.1.4.1.231.2.10.2.2.1
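The name-to-number mapping above can be checked mechanically. The following is a small sketch in C that joins the sub-identifiers into the dotted OID from comment 3; `oid_arc`, `sni_bios_path` and `oid_path_to_string` are hypothetical names made up for this illustration and are not part of net-snmp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One arc of the OID tree: symbolic name plus numeric sub-identifier. */
struct oid_arc {
    const char *name;
    unsigned long subid;
};

/* Path from the root down to sniBios, as listed in the comment above. */
static const struct oid_arc sni_bios_path[] = {
    {"iso", 1}, {"org", 3}, {"dod", 6}, {"internet", 1}, {"private", 4},
    {"enterprises", 1}, {"sni", 231}, {"sniProductMibs", 2},
    {"sniExtensions", 10}, {"sniServerMgmt", 2}, {"sniCommon", 2},
    {"sniBios", 1},
};

#define SNI_BIOS_PATH_LEN (sizeof sni_bios_path / sizeof sni_bios_path[0])

/* Write the dotted numeric form of `path` into `buf`.
 * Returns the string length, or -1 if `buf` is too small. */
static int oid_path_to_string(const struct oid_arc *path, size_t n,
                              char *buf, size_t buflen)
{
    size_t used = 0;

    assert(path != NULL && buf != NULL);
    if (buflen == 0)
        return -1;
    buf[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        int w = snprintf(buf + used, buflen - used, "%s%lu",
                         i ? "." : "", path[i].subid);
        if (w < 0 || (size_t)w >= buflen - used)
            return -1;              /* output would have been truncated */
        used += (size_t)w;
    }
    return (int)used;
}
```

Called with `sni_bios_path` and a sufficiently large buffer, the helper produces 1.3.6.1.4.1.231.2.10.2.2.1, the OID named in the GETNEXT request.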

 
sniBiosVersionMajor OBJECT-TYPE
        SYNTAX  INTEGER
        ACCESS  read-only
        STATUS  mandatory
        DESCRIPTION
"Major Version of the BIOS"
        ::= { sniBios 1 }

sniBiosVersionMinor OBJECT-TYPE
        SYNTAX  INTEGER
        ACCESS  read-only
        STATUS  mandatory
        DESCRIPTION
"Minor Version of the BIOS"
        ::= { sniBios 2 }

sniBiosDiagnosticStatus OBJECT-TYPE
        SYNTAX  INTEGER
        ACCESS  read-only
        STATUS  mandatory
        DESCRIPTION
"A bit field: 
BIT            MEANING
 0             Timeout reading an adapter ID (eisa)
 1             Adapter do not match configuration(eisa)
 2             CMOS RAM time found invalid
 3             Fixed disk/adapter fails initialization
 4             Memory size compare error at POST
 5             Invalid configuration information found at POST
 6             CMOS RAM checksum is bad
 7             Real-time clock lost power"
        ::= { sniBios 3 }
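
For anyone reading sniBiosDiagnosticStatus values by hand, the bit table above can be decoded with a plain mask test. This is a minimal sketch in C, assuming bit 0 is the least significant bit of the INTEGER value (the MIB excerpt does not state the bit order); `bios_diag_text` and `bios_diag_decode` are hypothetical names, not part of the ServerView agents or net-snmp.

```c
#include <assert.h>
#include <stddef.h>

/* Meanings copied from the DESCRIPTION clause; array index = bit number. */
static const char *const bios_diag_text[8] = {
    "Timeout reading an adapter ID (eisa)",
    "Adapter do not match configuration(eisa)",
    "CMOS RAM time found invalid",
    "Fixed disk/adapter fails initialization",
    "Memory size compare error at POST",
    "Invalid configuration information found at POST",
    "CMOS RAM checksum is bad",
    "Real-time clock lost power",
};

/* Store a pointer to the meaning of every set bit in `out` (lowest bit
 * first) and return how many bits were set.
 * Assumption: bit 0 is the least significant bit of `status`. */
static size_t bios_diag_decode(unsigned status, const char *out[8])
{
    size_t n = 0;

    assert(out != NULL);
    for (unsigned bit = 0; bit < 8; bit++) {
        if (status & (1u << bit))
            out[n++] = bios_diag_text[bit];
    }
    return n;
}
```

Under that assumption, a status of 0x44 has bits 2 and 6 set, so it would decode to the "CMOS RAM time found invalid" and "CMOS RAM checksum is bad" entries.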

Comment 5 Martin Wilck 2011-08-12 07:48:43 UTC
Created attachment 517983 [details]
serverview 5.10.22

Here are the binaries of our agents, including all MIBs, docs, SNMP configuration etc. They will probably only work on a PRIMERGY. Gary should be able to provide access to one (and install the agents) if needed.

Comment 6 Martin Wilck 2011-08-12 08:39:44 UTC
Here is a statement from our agent developer:

"These OIDs are served by our BIOS agent. I can't imagine why a problem should occur with these OIDs, this is more likely to be related to the internal processing of net-snmpd. Of all our agents, the BIOS agent is the one which has least to do."

"There are the following ServerView subagents: sc sc2 bus hd unix ether bios secur status inv thr vv hpsim vme. The process name is the agent name + "agt", e.g. scagt, busagt, etc."

"These subagents will register with snmpd when they start and unregister when they are stopped. However it happens sometimes that the AgentX communication is interrupted and must be reestablished. We see that once in a while in our traces."

"The question 'I suppose you cannot share details about how do you use AgentX?' can't be answered easily because this code is very ancient and it's not exactly clear what the question is targeted at."

Some more information from my side: most of our agents don't procure the information for net-snmp directly. Rather, they communicate with a separate daemon (eecd) which collects the data.

Comment 7 Jan Safranek 2011-08-15 12:10:45 UTC
With a dummy AgentX subagent that disconnects during the first GETNEXT query and then reconnects (plus a lot of GETNEXT requests), I was able to get a SIGSEGV once.

Valgrind tells me:
==4052==    at 0x4E40749: netsnmp_remove_delegated_requests_for_session (in /usr/lib64/libnetsnmpagent.so.20.0.0)
==4052==    by 0x4E60FB7: close_agentx_session (in /usr/lib64/libnetsnmpagent.so.20.0.0)
==4052==    by 0x4E6156B: handle_master_agentx_packet (in /usr/lib64/libnetsnmpagent.so.20.0.0)
==4052==    by 0x6CF466E: _sess_read (in /usr/lib64/libnetsnmp.so.20.0.0)
==4052==    by 0x6CF5048: snmp_sess_read2 (in /usr/lib64/libnetsnmp.so.20.0.0)
==4052==    by 0x6CF510A: snmp_read2 (in /usr/lib64/libnetsnmp.so.20.0.0)
==4052==    by 0x10CFDD: main (in /usr/sbin/snmpd)
==4052==  Address 0xbca8218 is 72 bytes inside a block of size 152 free'd
==4052==    at 0x4C2695D: free (vg_replace_malloc.c:366)
==4052==    by 0x4E43FCE: unregister_mibs_by_session (in /usr/lib64/libnetsnmpagent.so.20.0.0)
==4052==    by 0x4E60EA7: close_agentx_session (in /usr/lib64/libnetsnmpagent.so.20.0.0)
==4052==    by 0x4E617EB: handle_master_agentx_packet (in /usr/lib64/libnetsnmpagent.so.20.0.0)
==4052==    by 0x6CF3867: ??? (in /usr/lib64/libnetsnmp.so.20.0.0)
==4052==    by 0x6CF47A1: _sess_read (in /usr/lib64/libnetsnmp.so.20.0.0)
==4052==    by 0x6CF5048: snmp_sess_read2 (in /usr/lib64/libnetsnmp.so.20.0.0)
==4052==    by 0x6CF510A: snmp_read2 (in /usr/lib64/libnetsnmp.so.20.0.0)
==4052==    by 0x10CFDD: main (in /usr/sbin/snmpd)


But I still cannot reproduce it reliably.

Comment 8 Martin Wilck 2011-08-15 13:45:34 UTC
If you can give me instructions on how to run the instrumented SNMP daemon, we can try to reproduce the problem here again.

I have asked QA to reproduce it with 6.1 first, because this one was originally reported for 6.0. There is no indication in the change logs though that it's fixed in 6.1.

Comment 9 Jan Safranek 2011-08-15 15:21:02 UTC
I've uploaded a reproducer to the upstream bug tracker: https://sourceforge.net/tracker/index.php?func=detail&aid=1633670&group_id=12694&atid=112694

I have successfully crashed net-snmp-5.7, upstream trunk and also the RHEL 6.2 build I made today for the RHEL 6.2 errata, so the bug is reproducible everywhere, probably including RHEL 6.1.

Comment 10 Martin Wilck 2011-08-15 15:51:03 UTC
That looks promising, thanks a lot for digging into this problem.

Comment 11 Jan Safranek 2011-09-05 10:24:44 UTC
I sent a fix to the upstream bug tracker. It's not perfect, but at least snmpd does not crash.

Comment 13 Jan Safranek 2011-09-05 13:10:43 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
When an AgentX subagent was being disconnected from the snmpd daemon, the daemon did not properly detach all outstanding SNMP requests from the internal session object representing the AgentX subagent. Therefore, the snmpd daemon could crash when processing these requests. With this update, the snmpd daemon ensures that no outstanding SNMP request points to an AgentX session that is being closed.
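
The idea in the technical note comes down to ordering: clear every pending request's pointer to the session before the session memory is released. The following is only a toy model of that idea in C and not the net-snmp patch; `toy_session`, `toy_request`, `toy_detach_requests` and `toy_close_session` are hypothetical names, and real snmpd request handling is considerably more involved.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins, NOT net-snmp types. */
struct toy_session {
    int id;
};

struct toy_request {
    struct toy_session *owner;   /* session expected to answer; NULL once detached */
    struct toy_request *next;    /* next pending request, or NULL */
};

/* Clear the owner pointer of every pending request that refers to `s`.
 * Returns the number of requests detached. */
static int toy_detach_requests(struct toy_request *pending,
                               const struct toy_session *s)
{
    int detached = 0;

    for (struct toy_request *r = pending; r != NULL; r = r->next) {
        if (r->owner == s) {
            r->owner = NULL;
            detached++;
        }
    }
    return detached;
}

/* Close a session: detach first, free second, so that no pending
 * request is left holding a dangling pointer. */
static int toy_close_session(struct toy_request *pending,
                             struct toy_session *s)
{
    assert(s != NULL);

    int detached = toy_detach_requests(pending, s);
    free(s);
    return detached;
}
```

In this model, code that later walks the pending list can treat a NULL `owner` as "subagent gone" and fail or re-dispatch the request rather than touch freed memory.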

Comment 15 Jan Safranek 2011-09-07 13:09:20 UTC
QA found out that if the AgentX subagent disconnects while processing a request, the request then leaks a bit of memory in the master snmpd (approx. 44 bytes per such request).

Valgrind report:
==8326==    at 0x4A04A28: calloc (vg_replace_malloc.c:467)
==8326==    by 0x4C33E6A: netsnmp_create_delegated_cache (agent_handler.c:713)
==8326==    by 0x4C36BC9: agentx_master_handler (master.c:591)
==8326==    by 0x4C3642E: netsnmp_call_handlers (agent_handler.c:440)
==8326==    by 0x4C26710: handle_var_requests (snmp_agent.c:2611)
==8326==    by 0x4C28395: handle_pdu (snmp_agent.c:3407)
==8326==    by 0x4C2A7EF: netsnmp_handle_request (snmp_agent.c:3203)
==8326==    by 0x4C2B2A9: handle_snmp_packet (snmp_agent.c:1929)
==8326==    by 0x6AD6867: _sess_process_packet (snmp_api.c:5604)
==8326==    by 0x6AD71FF: _sess_read (snmp_api.c:6043)
==8326==    by 0x6AD8048: snmp_sess_read2 (snmp_api.c:6075)
==8326==    by 0x6AD810A: snmp_read2 (snmp_api.c:5667)

I assume the AgentX subagents disconnect very rarely and this memory leak happens only in very exceptional cases, so I have left the leak in place for now while working on it upstream. Please reopen the bug if your AgentX subagents disconnect often enough for the leak to matter.

Comment 17 Karel Srot 2011-09-08 07:25:31 UTC
I have filed a new bug 736580 for the memory leak.

Comment 18 errata-xmlrpc 2011-12-06 17:12:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1524.html

