Bug 629770 - NET-SNMP 5.3.2.2 showing netsnmp_subtree_load() message
Summary: NET-SNMP 5.3.2.2 showing netsnmp_subtree_load() message
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: net-snmp
Version: 5.5
Hardware: All
OS: Linux
medium
high
Target Milestone: rc
: ---
Assignee: Jan Safranek
QA Contact: Dalibor Pospíšil
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2010-09-02 22:23 UTC by Michael Worsham
Modified: 2018-11-26 19:00 UTC (History)
12 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-11-13 08:23:00 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
"daemon" log file with snmpd -Dregister_mib,agentx (152.03 KB, text/plain)
2011-08-17 17:30 UTC, Martin Wilck
no flags Details
another log file - good case (246.33 KB, text/plain)
2011-08-17 17:55 UTC, Martin Wilck
no flags Details
another log file - bad case (258.55 KB, text/plain)
2011-08-17 17:57 UTC, Martin Wilck
no flags Details
data gathered the 27 of october (8.23 MB, application/x-gzip)
2011-10-27 10:07 UTC, Pierre Amadio
no flags Details
core file of snmpd (519.46 KB, application/x-zip-compressed)
2012-08-23 13:56 UTC, Josef Möllers
no flags Details
snmpd strace output (935.53 KB, application/x-gzip)
2012-09-24 11:03 UTC, Josef Möllers
no flags Details
reproducer script (1.54 KB, text/plain)
2012-10-08 13:23 UTC, Jan Safranek
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Legacy) 38204 0 None None None Never

Description Michael Worsham 2010-09-02 22:23:05 UTC
Description of problem:
Upon startup of /etc/init.d/snmpd daemon, the following message is seen in the /var/log/messages file:

netsnmp_assert !"registration != duplicate" failed agent_registry.c:536 netsnmp_subtree_load()

Version-Release number of selected component (if applicable):
NET-SNMP version 5.3.2.2

How reproducible:
Keeps showing up upon every restart of the snmpd daemon

Steps to Reproduce:
1. /etc/init.d/snmpd restart
  
Additional info:
net-snmp.x86_64                                                  1:5.3.2.2-9.el5_5.1                                            installed
net-snmp-libs.i386                                               1:5.3.2.2-9.el5_5.1                                            installed
net-snmp-libs.x86_64                                             1:5.3.2.2-9.el5_5.1                                            installed
net-snmp-utils.x86_64                                            1:5.3.2.2-9.el5_5.1                                            installed

Comment 1 Issue Tracker 2010-09-09 09:45:58 UTC
Event posted on 09-09-2010 05:45pm CST by dwu

Breakpoint 1, dlmod_load_module (dlm=0xb707c168) at ucd-snmp/dlmod.c:139
139	{
(gdb) b netsnmp_subtree_load
Breakpoint 2 at 0x9f7fd9: file agent_registry.c, line 404.
(gdb) cont
Continuing.

Breakpoint 2, netsnmp_subtree_load (new_sub=0xb7077508,
context_name=0xb7068a80 "") at agent_registry.c:404
404	    if (new_sub == NULL) {
(gdb) bt
#0  netsnmp_subtree_load (new_sub=0xb7077508, context_name=0xb7068a80
"") at agent_registry.c:404
#1  0x009f86d6 in netsnmp_register_mib (moduleName=0xb7067cc8
"CompaqInsight", var=0x0, varsize=0, numvars=0, mibloc=0xb7066770,
mibloclen=10, priority=127, 
    range_subid=0, range_ubound=0, ss=0x0, context=0xb7068a80 "",
timeout=-1, flags=0, reginfo=0xb7066260, perform_callback=1) at
agent_registry.c:670
#2  0x009fe7a6 in netsnmp_register_handler (reginfo=0xb7066260) at
agent_handler.c:248
#3  0x0067ef74 in netsnmp_register_old_api (moduleName=0x9b4bca
"CompaqInsight", var=0xbffd4f28, varsize=40, numvars=3682,
mibloc=0x9b5f20, mibloclen=7, 
    priority=127, range_subid=0, range_ubound=0, ss=0x0, context=0xa25232
"", timeout=-1, flags=0) at old_api.c:97
#4  0x009f7ae1 in register_mib_context (moduleName=0x9b4bca
"CompaqInsight", var=0xbffd4f28, varsize=40, numvars=3682,
mibloc=0x9b5f20, mibloclen=7, priority=127, 
    range_subid=0, range_ubound=0, ss=0x0, context=0xa25232 "",
timeout=-1, flags=0) at agent_registry.c:839
#5  0x009f7b66 in register_mib_range (moduleName=0x9b4bca
"CompaqInsight", var=0xbffd4f28, varsize=40, numvars=3682,
mibloc=0x9b5f20, mibloclen=7, priority=127, 
    range_subid=0, range_ubound=0, ss=0x0) at agent_registry.c:855
#6  0x009f7bcf in register_mib_priority (moduleName=0x9b4bca
"CompaqInsight", var=0xbffd4f28, varsize=40, numvars=3682,
mibloc=0x9b5f20, mibloclen=7, priority=127)
    at agent_registry.c:867
#7  0x009f7c28 in register_mib (moduleName=0x9b4bca "CompaqInsight",
var=0xbffd4f28, varsize=40, numvars=3682, mibloc=0x9b5f20, mibloclen=7) at
agent_registry.c:877
#8  0x009b4387 in init_cmaX () from /usr/lib/libcmaX.so
#9  0x007d3e9f in dlmod_load_module (dlm=0xb707c168) at
ucd-snmp/dlmod.c:193
#10 0x007d46b1 in dlmod_parse_config (token=0xbfffdb88 "dlmod",
cptr=0xbfffdf8e "cmaX") at ucd-snmp/dlmod.c:272
#11 0x0037e978 in run_config_handler (lptr=0xb7be03b8, token=0xbfffdb88
"dlmod", cptr=0xbfffdf8e "cmaX", when=0) at read_config.c:506
#12 0x0037eca8 in read_config (filename=0xbfffe4cc
"/etc/snmp/snmpd.conf", line_handler=0xb7b59728, when=0) at
read_config.c:796
#13 0x0037f97e in read_config_files_in_path (path=<value optimized out>,
ctmp=0xb7b594b0, when=0, perspath=0xb7012198 "/var/net-snmp",
persfile=0x0)
    at read_config.c:1118
#14 0x0037fcbb in read_config_files (when=0) at read_config.c:1199
#15 0x0038179f in read_configs () at read_config.c:864
#16 0x0036561a in init_snmp (type=0x5e1af0 "snmpd") at snmp_api.c:839
#17 0x005df02f in main (argc=7, argv=0xbfffea54) at snmpd.c:902
(gdb) cont
Continuing.

Breakpoint 2, netsnmp_subtree_load (new_sub=0xb7077470,
context_name=0xb705dff8 "") at agent_registry.c:404
404	    if (new_sub == NULL) {
(gdb) bt
#0  netsnmp_subtree_load (new_sub=0xb7077470, context_name=0xb705dff8
"") at agent_registry.c:404
#1  0x009f86d6 in netsnmp_register_mib (moduleName=0xb7051140
"CompaqInsight", var=0x0, varsize=0, numvars=0, mibloc=0xb7058b08,
mibloclen=10, priority=127, 
    range_subid=0, range_ubound=0, ss=0x0, context=0xb705dff8 "",
timeout=-1, flags=0, reginfo=0xb7063e60, perform_callback=1) at
agent_registry.c:670
#2  0x009fe7a6 in netsnmp_register_handler (reginfo=0xb7063e60) at
agent_handler.c:248
#3  0x0067ef74 in netsnmp_register_old_api (moduleName=0x9b4bca
"CompaqInsight", var=0xbffd4f28, varsize=40, numvars=3682,
mibloc=0x9b5f20, mibloclen=7, 
    priority=127, range_subid=0, range_ubound=0, ss=0x0, context=0xa25232
"", timeout=-1, flags=0) at old_api.c:97
#4  0x009f7ae1 in register_mib_context (moduleName=0x9b4bca
"CompaqInsight", var=0xbffd4f28, varsize=40, numvars=3682,
mibloc=0x9b5f20, mibloclen=7, priority=127, 
    range_subid=0, range_ubound=0, ss=0x0, context=0xa25232 "",
timeout=-1, flags=0) at agent_registry.c:839
#5  0x009f7b66 in register_mib_range (moduleName=0x9b4bca
"CompaqInsight", var=0xbffd4f28, varsize=40, numvars=3682,
mibloc=0x9b5f20, mibloclen=7, priority=127, 
    range_subid=0, range_ubound=0, ss=0x0) at agent_registry.c:855
#6  0x009f7bcf in register_mib_priority (moduleName=0x9b4bca
"CompaqInsight", var=0xbffd4f28, varsize=40, numvars=3682,
mibloc=0x9b5f20, mibloclen=7, priority=127)
    at agent_registry.c:867
#7  0x009f7c28 in register_mib (moduleName=0x9b4bca "CompaqInsight",
var=0xbffd4f28, varsize=40, numvars=3682, mibloc=0x9b5f20, mibloclen=7) at
agent_registry.c:877
#8  0x009b4387 in init_cmaX () from /usr/lib/libcmaX.so
#9  0x007d3e9f in dlmod_load_module (dlm=0xb707c168) at
ucd-snmp/dlmod.c:193
#10 0x007d46b1 in dlmod_parse_config (token=0xbfffdb88 "dlmod",
cptr=0xbfffdf8e "cmaX") at ucd-snmp/dlmod.c:272
#11 0x0037e978 in run_config_handler (lptr=0xb7be03b8, token=0xbfffdb88
"dlmod", cptr=0xbfffdf8e "cmaX", when=0) at read_config.c:506
#12 0x0037eca8 in read_config (filename=0xbfffe4cc
"/etc/snmp/snmpd.conf", line_handler=0xb7b59728, when=0) at
read_config.c:796
#13 0x0037f97e in read_config_files_in_path (path=<value optimized out>,
ctmp=0xb7b594b0, when=0, perspath=0xb7012198 "/var/net-snmp",
persfile=0x0)
    at read_config.c:1118
#14 0x0037fcbb in read_config_files (when=0) at read_config.c:1199
#15 0x0038179f in read_configs () at read_config.c:864
#16 0x0036561a in init_snmp (type=0x5e1af0 "snmpd") at snmp_api.c:839
#17 0x005df02f in main (argc=7, argv=0xbfffea54) at snmpd.c:902
(gdb) p *new_sub
$1 = {name_a = 0xb7065e60, namelen = 10 '\n', start_a = 0xb7065068,
start_len = 10 '\n', end_a = 0xb70511d0, end_len = 10 '\n',
variables = 0x0, variables_len = 0, 
  variables_width = 0, label_a = 0xb7053488 "CompaqInsight", session =
0x0, flags = 2 '\002', priority = 127 '\177', timeout = -1, next =
0x0, prev = 0x0, 
  children = 0x0, range_subid = 0, range_ubound = 0, reginfo = 0xb7063e60,
cacheid = 0, global_cacheid = 0, oid_off = 0}
(gdb) x/10 0xb7065068
0xb7065068:	1	3	6	1
0xb7065078:	4	1	232	1
0xb7065088:	1	1
(gdb) x/10 0xb70511d0
0xb70511d0:	1	3	6	1
0xb70511e0:	4	1	232	1
0xb70511f0:	1	2
(gdb) cont
Continuing.

Breakpoint 2, netsnmp_subtree_load (new_sub=0xb7077470,
context_name=0xb70697c0 "") at agent_registry.c:404
404	    if (new_sub == NULL) {
(gdb) p *((netsnmp_subtree*) 0xb7077508)
$2 = {name_a = 0xb7067760, namelen = 10 '\n', start_a = 0xb7066320,
start_len = 10 '\n', end_a = 0xb70660e0, end_len = 10 '\n',
variables = 0x0, variables_len = 0, 
  variables_width = 0, label_a = 0xb7068088 "CompaqInsight", session =
0x0, flags = 2 '\002', priority = 127 '\177', timeout = -1, next =
0xb7077340, 
  prev = 0xb7c54220, children = 0xb70775a0, range_subid = 0, range_ubound
= 0, reginfo = 0xb7066260, cacheid = 0, global_cacheid = 0, oid_off = 5}
(gdb) x/10 0xb7066320
0xb7066320:	1	3	6	1
0xb7066330:	4	1	232	1
0xb7066340:	1	1
(gdb) x/10 0xb70660e0
0xb70660e0:	1	3	6	1
0xb70660f0:	4	1	232	1
0xb7066100:	


This event sent from IssueTracker by dwu 
 issue 1347943

Comment 2 Mark Wu 2010-09-09 09:53:16 UTC
   It should be the problem of hp-snmp-agent. It keep registering the same module in netsnmp_register_old_api. So you see the following log:
netsnmp_assert !"registration != duplicate" failed agent_registry.c:536 netsnmp_subtree_load()

   And if you comment the line "dlmod cmaX /usr/lib/libcmaX.so" in snmpd.conf,the message will disappear.

Comment 4 Jan Safranek 2010-12-16 15:03:22 UTC
based on comment #2, I'm closing this bug.

Comment 5 Martin Wilck 2011-08-16 11:28:23 UTC
We are seeing the same problem here with our subagents in RHEL5.7. This hasn't been observed on earlier RHEL releases, nor in other Linux distributions.

I'd like to know if HP accepted that their agents were causing the bug.

Comment 6 Martin Wilck 2011-08-17 17:22:40 UTC
I think we have sufficient material to show that this is a net-snmpd problem (in RHEL5.7).

We are seeing the same thing with our serverview agents (see also bug #729738). The serverview start scripts start several AgentX subagents simultaneously. However every agent has its own MIB tree. It appears that in this situation, net-snmpd receives the same package multiple times (although our agents are not re-sending them). This is causing the wrong "duplicate" messages.

Comment 7 Martin Wilck 2011-08-17 17:30:17 UTC
Created attachment 518723 [details]
"daemon" log file with snmpd -Dregister_mib,agentx

In this example the following happens:

 1 net-snmpd starts up.
 2 3 agents (etheragt, biosagt, securagt) start up
 3 the agents are stopped again
 4 start again
 5 stop again
 6 start again
 7 stop again

2/3 and 4/5 go fine, but the last time we get "duplicate" messages.

The log shows that when this happens, the pdu 138849c7 is received twice:

Aug 17 10:21:29 B622 snmpd[10165]: handle pdu (req=0x138849c7,trans=0x0,sess=0x14) 
Aug 17 10:21:30 B622 snmpd[10165]: registering "AgentX subagent 20, session 0x2b0d4e5496a0, subsession 0x2b0d4e54a7b0" at 
Aug 17 10:21:30 B622 snmpd[10165]: send response, stat 0 (req=0x138849c7,trans=0x0,sess=0x14) 
Aug 17 10:21:30 B622 snmpd[10165]: handle pdu (req=0x138849c7,trans=0x0,sess=0x14) 
Aug 17 10:21:30 B622 snmpd[10165]: registering "AgentX subagent 20, session 0x2b0d4e5496a0, subsession 0x2b0d4e54a7b0" at 
Aug 17 10:21:30 B622 snmpd[10165]: netsnmp_assert !"registration != duplicate" failed agent_registry.c:536 netsnmp_subtree_load() 
Aug 17 10:21:30 B622 snmpd[10165]: duplicate registration 

We are sure though that the subagents don't resend their packets.

Comment 8 Martin Wilck 2011-08-17 17:55:54 UTC
Created attachment 518724 [details]
another log file - good case

In this case I have added a "sleep 5" between the starts of the subagents, and no duplicates are seen any more. Grepping for "handle pdu" also shows no dups.

Comment 9 Martin Wilck 2011-08-17 17:57:51 UTC
Created attachment 518725 [details]
another log file - bad case

Similar as the example before, but wthout sleeping in the start script between starts of subagents. Note the many duplicate "handle pdu" messages (many more than duplicate registration warnings - other "duplicate" pdu's don't seem to generate warnings).

Comment 10 Jan Safranek 2011-08-18 10:10:40 UTC
Damned agentx...

I can reproduce it with simple agentx (e.g. from https://sourceforge.net/tracker/index.php?func=detail&aid=1633670&group_id=12694&atid=112694).

Just keep any AgentX subagent registering and deregistering in a simple loop and the assert shows up eventually in snmpd log:

while true; do
    ./example -f -Lo &
    PID=$!
    sleep 1
    kill $PID
done

Comment 11 Jan Safranek 2011-08-19 12:12:40 UTC
In strace logs I can see that when this error occures, select() in snmpd reports new connection on AgentX socket (=new agentx subagent starts) much earlier than it reports that socket of the old subagent was closed -> new registration is processed before the old is deleted -> error.

I am not sure if this is your case, but if it is, there is not much what can be done on snmpd side. Your subagents should reconnect after this failure. And/or when your subagent dies, there should be some minimal delay (hundreds of miliseconds) to let snmpd (and kernel) close all sockets of the old subagent. In addition, let the subagents disconnect gracefully, i.e. they should send agentx-Close-PDU before exiting.

Please let me know if this information is helpful or you can still see the error message even when your subagents sent agentx-Close-PDU and/or restart of subagents is delayed. Please also provide complete strace output (with -tt option) from agent startup till the error message + few seconds later.

Comment 12 Martin Wilck 2011-08-19 12:29:06 UTC
Jan,

first of all, thanks. Perhaps there's a small misundestanding here. In this case there are no disconnects from our agents (except at the very end of the test, but there are no errors there). The agents just all register quickly after each other (started from a FOR loop in a shell script without waiting). The AgentX sessions are all working and the agents and snmpd are happy. The only problem is the "duplicate" warning - net-snmpd actually registers the sessions several times in a row. You can see that in the attached logs. You can also see that other PDUs are apparently received multiple times.

We did observe that by putting sleep's between the agent starts, the problem could be avoided. But the sleeps apparently need to be quite long (~5s) which is barely acceptable (almost a minute startup time for all agents).

Martin

Comment 13 Pierre Amadio 2011-10-17 09:45:16 UTC
Martin, in comment #8 you mention that a "sleep 5" between the starts of the subagents may work around the problem...

Would it be possible to know wich file and wich line are involved there ?

Comment 14 Gary Smith 2011-10-19 15:29:41 UTC
(In reply to comment #13)
> Martin, in comment #8 you mention that a "sleep 5" between the starts of the
> subagents may work around the problem...
> 
> Would it be possible to know wich file and wich line are involved there ?

@martin

If you could also provide a snippet of code as an example of how this has been worked around, it would help (I'm guessing the startup code for your subagents is going to be different to those in the HP environment).

Comment 16 Pierre Amadio 2011-10-27 10:07:52 UTC
Created attachment 530454 [details]
data gathered the 27 of october

Comment 17 Martin Wilck 2011-11-15 15:57:17 UTC
> If you could also provide a snippet of code as an example of how this has been
> worked around, it would help (I'm guessing the startup code for your subagents
> is going to be different to those in the HP environment).

It's just a bash for loop 

for x in $AGENTS; do 
    [ start agent ]
    < insert "sleep 5" here>
done

Comment 18 Jan Safranek 2011-11-16 08:37:39 UTC
Out of curiosity, is it the same set of subagents and init scripts as attached to bug #749227 ? Couldn't it be another symptom of bad select()?

Comment 19 Martin Wilck 2011-11-16 10:13:42 UTC
I pondered that, too. The agents are the same (although possibly a different version). But the problem of bug #749227 affects only net-snmp 5.5 and higher (older versions don't have the large_fd_set code).

Comment 21 Martin Wilck 2011-12-14 10:49:48 UTC
I reproduced this problem with RHEL5.8 alpha and the latest ServerView agents. I wasn't able to reproduce it with the simple subagent as Jan did in comment #10.

Comment 22 RHEL Program Management 2012-01-09 14:08:23 UTC
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.8 and Red Hat does not plan to fix this issue the currently developed update.

Contact your manager or support representative in case you need to escalate this bug.

Comment 24 RHEL Program Management 2012-04-02 10:26:15 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 26 Martin Wilck 2012-04-23 13:49:37 UTC
Problem was still observed on 5.8 by FTS QA.

Comment 27 Jan Safranek 2012-06-11 11:45:45 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
In the previous net-snmp update, implementation of the "HOST-RESOURCES-MIB::hrStorageTable" table was rewritten and devices with ReiserFS and Oracle Cluster File System (OCFS2) were not reported. In this update, snmpd properly recognizes  ReiserFS and OCFS2 devices and reports them in "HOST-RESOURCES-MIB::hrStorageTable".

Comment 28 Jan Safranek 2012-06-11 11:46:35 UTC
Deleted Technical Notes Contents.

Old Contents:
In the previous net-snmp update, implementation of the "HOST-RESOURCES-MIB::hrStorageTable" table was rewritten and devices with ReiserFS and Oracle Cluster File System (OCFS2) were not reported. In this update, snmpd properly recognizes  ReiserFS and OCFS2 devices and reports them in "HOST-RESOURCES-MIB::hrStorageTable".

Comment 29 Jan Safranek 2012-06-13 11:33:21 UTC
Martin,

I am still not sure what's wrong with snmpd. Could you please provide strace -s 9999 -tt of snmpd from startup to point where you can see "registration != duplicate" assertion?

And it would be great if I could get AgentX communication capture using wireshark, but you need to configure AgentX over TCP (using -x option).

Thanks in advance,

Jan

Comment 30 Josef Möllers 2012-08-23 13:56:02 UTC
Created attachment 606600 [details]
core file of snmpd

Maybe this is better:

Being afraid of nothing I binary-patched /usr/lib64/libnetsnmpagent.so.10.0.3 (thanks, okteta) and replaced the first byte of the "callq snmp_log" instruction to a (privileged) "STI" (opcode FB) instruction.

I then ran the test stopping and starting our daemons and almost immediately snmpd crashed:
Program terminated with signal 11, Segmentation fault.
#0  0x00002afa5cfee221 in netsnmp_subtree_load () from /usr/lib64/libnetsnmpagent.so.10
(gdb) x/xb 0x00002afa5cfee221
0x2afa5cfee221 <netsnmp_subtree_load+849>:      0xfb
(gdb) quit

I'll try and start analyzing myself, but maybe you're a tad faster.

Comment 31 Josef Möllers 2012-08-23 14:04:32 UTC
Oh ... btw ... this was done on 5.7, rather than the original 5.5!

Comment 32 Josef Möllers 2012-08-24 14:15:11 UTC
I've started looking at the core dump.
What strikes me is that the list hanging off "context_subtrees" is HUGE! Together with the children there are tens of thousands of entries.
I thought that they would build an ordered list, like
node->start_a < node->next->start_a and
node->start_a <= node->children->start_a < node->children->next->start_a < ... < node->end_a
I find that the first rule is obeyed.
I have also seen that the second rule is not obeyed, but my disk was too small (a 4GB iSCSI target) to hold the resulting file ... but then, maybe this second rule is not true. I thought that whenever an agent wishes to register a new OID, the chain is opened at the point where the new OID fits in and a child node is inserted.

Anyway, I just wanted to share what I have found.

Comment 33 Josef Möllers 2012-08-27 11:11:39 UTC
I seem to be a bit confused about the nature of the list hanging off "context_subtrees": it seems that it's not a proper tree but some nodes have multiple references. It seems that there are "only" ~9000 unique struct netsnmp_subtree_s nodes in the list and this increases only very little when our daemons are started.
So please ignore my comment #32.

Comment 36 Josef Möllers 2012-09-18 13:39:58 UTC
Gary has asked me to add a short comment re the status.

I had been hoping that the core dump I provided would help in analyzing the cause of the duplicate registration. However, I have not had neither a positive nor a negative feedback after I posted it in comment #30!
If the core dump is of no use (which may very well be the case since this is very likely a race condition), then please say so and we should try and find other means.

I'm not sure if running snmpd under strace (or valgrind) will provide anything useful, because they slow down snmpd enough to prevent the error to happen (see bug#839235).

Thus said, I haven't tried strace in this case, but I could, if the core dump does not provide any useful insights.

Comment 37 Jan Safranek 2012-09-18 14:16:46 UTC
(In reply to comment #36)
> If the core dump is of no use (which may very well be the case since this is
> very likely a race condition), then please say so and we should try and find
> other means.

I haven't found anything useful in the core dumps, *any* reproducer or info what's going on inside snmpd would be much better.

On related point, as far as I know, net-snmp does process requests in multiple threads therefore the race condition is really strange. Something must be in wrong order on input, i.e. some messages comes earlier/later than usual. It's weird that slowing down snmpd using valgrind/strace actually changes its behavior, the packets should queue up in receive buffers...

Comment 38 Josef Möllers 2012-09-24 11:03:55 UTC
Created attachment 616477 [details]
snmpd strace output

Here's the desired snmpd strace output:
strace -s9999 -tt -fo /mnt/Temp/snmpd.strc /usr/sbin/snmpd -Lsd -Lf /dev/null -p /var/run/snmpd.pid -a

The error appeared at the 6th iteration of
#! /bin/bash

loops=0
rm -f StopTest
while [ ! -e 'StopTest' ]; do
    fgrep duplicate /var/log/messages > duplicates.pre
    srvmagt start
    srvmagt stop
    fgrep duplicate /var/log/messages > duplicates.post
    if cmp -s duplicates.pre duplicates.post; then
        let loops++
        continue
    else
        echo Differences after $loops successfull loops
        break
    fi
done
diff duplicates.pre duplicates.post

The diff is
> Sep 24 11:50:16 B425-SB18 snmpd[7546]: netsnmp_assert !"registration != duplicate" failed agent_registry.c:536 netsnmp_subtree_load() 
> Sep 24 11:50:16 B425-SB18 snmpd[7546]: netsnmp_assert !"registration != duplicate" failed agent_registry.c:536 netsnmp_subtree_load() 
> Sep 24 11:50:18 B425-SB18 snmpd[7546]: netsnmp_assert !"registration != duplicate" failed agent_registry.c:536 netsnmp_subtree_load()

Comment 39 Jan Safranek 2012-10-08 13:23:12 UTC
Created attachment 623451 [details]
reproducer script

I can finally reproduce the bug with agentx communication extracted from the strace, thanks a lot!

Comment 40 Jan Safranek 2012-10-09 08:57:10 UTC
Findings so far:

(In reply to comment #6)
> It appears that in this situation, net-snmpd receives the same
> package multiple times (although our agents are not re-sending them).
> This is causing the wrong "duplicate" messages.

Indeed, snmpd receives some packets multiple times. Therefore something, either the subagents or the kernel resends them.

At snmpd.strc:15870, snmpd receives *two* agentx packets in one recvfrom:
7546  11:50:16.326450 recvfrom(16, "\1\3\0\0\254\0\0\0\0\0\0\0\370\340\210+$\0\0\0\377\177 ...

The first one is recognized as:
dumpx_recv:24 00 00 00 
dumpv_recv:  Integer:   36 (0x24)
dumph_recv: AgentX Header
dumph_recv:   Version
dumpx_recv:    01 
dumpv_recv:      Version:       1
dumph_recv:   Command
dumpx_recv:    03 
dumpv_recv:      Command:       3 (Register)
dumph_recv:   Flags
dumpx_recv:    00 
dumpv_recv:      Flags: 0x0
dumph_recv:   Reserved Byte
dumpx_recv:    00 
dumpv_recv:      Reserved:      0x0
dumph_recv:   Session ID
dumpx_recv:    AC 00 00 00 
dumpv_recv:      Integer:       172 (0xAC)
dumph_recv:   Transaction ID
dumpx_recv:    00 00 00 00 
dumpv_recv:      Integer:       0 (0x00)
dumph_recv:   Packet ID
dumpx_recv:    F8 E0 88 2B 
dumpv_recv:      Integer:       730390776 (0x2B88E0F8)
dumph_recv:   Payload Length
dumpx_recv:    24 00 00 00 
dumpv_recv:      Integer:       36 (0x24)
...

and snmpd accepts the agentx registration.

However, the second packet in the recvfrom() is retransmission of the first one:
dumpx_recv:24 00 00 00 
dumpv_recv:  Integer:   36 (0x24)
dumph_recv: AgentX Header
dumph_recv:   Version
dumpx_recv:    01 
dumpv_recv:      Version:       1
dumph_recv:   Command
dumpx_recv:    03 
dumpv_recv:      Command:       3 (Register)
dumph_recv:   Flags
dumpx_recv:    00 
dumpv_recv:      Flags: 0x0
dumph_recv:   Reserved Byte
dumpx_recv:    00 
dumpv_recv:      Reserved:      0x0
dumph_recv:   Session ID
dumpx_recv:    AC 00 00 00 
dumpv_recv:      Integer:       172 (0xAC)
dumph_recv:   Transaction ID
dumpx_recv:    00 00 00 00 
dumpv_recv:      Integer:       0 (0x00)
dumph_recv:   Packet ID
dumpx_recv:    F8 E0 88 2B 
dumpv_recv:      Integer:       730390776 (0x2B88E0F8)
dumph_recv:   Payload Length
dumpx_recv:    24 00 00 00 
dumpv_recv:      Integer:       36 (0x24)

(notice the same session ID and packet ID). snmpd then sends the "registration != duplicate" message to the log.

*This message is completely harmless, the previous registration is valid.*

So, please check your agentx subagents and maybe lower the retransmission frequency, this could work around the problem for now.

Still, there is something weird in snmpd - the strace contains some other retransmissions, but no "registration != duplicate" is shown for them. I'm digging into it.

Comment 41 Jan Safranek 2012-10-09 11:48:36 UTC
(In reply to comment #40)
> Still, there is something weird in snmpd - the strace contains some other
> retransmissions, but no "registration != duplicate" is shown for them. I'm
> digging into it.

I misread the strace, there is no other retransmitted AgentX agentx-Register-PDU packet except those which result in "registration != duplicate" message.

The agentX subagent must not retransmit these agentx-Register-PDU requests, especially on reliable unix socket connection. There is no way, how snmpd can detect these retransmissions, especially because transactionID does not make any sense in these packets, see RFC2741, chapter 6.1:

6.1 AgentX PDU Header:
<snip>

transactionID

            The transaction ID uniquely identifies, for a given session,
            the single SNMP management request (and single SNMP PDU)
            with which an AgentX PDU is associated.
   <snip>
            The transaction ID has no significance and no defined value
            in AgentX administrative PDUs, i.e., AgentX PDUs that are
            not associated with an SNMP management request.

Agentx-Open-PDU and agentx-Register-PDU are examples of these "AgentX administrative PDUs".

Unless I miss something, I can say that the bug is on subagent side or in kernel, which duplicates these packets. If you agree, I would mark this bug as CLOSED/NOTABUG.

Comment 42 Josef Möllers 2012-10-09 12:05:33 UTC
Thanks, Jan.
I have forwarded your findings to the agent development team and asked for their comment if they can increase the retransmission timeout value.
Maybe if you could leave it open for just a few more days, I'd be grateful.
But otherwise I have no objection: a fine analysis.

Comment 44 Josef Möllers 2012-11-13 08:18:53 UTC
Hello Jan,

Thanks for your patience and keeping this bug open.

More analysis on this side has now revealed that, indeed, one of our agents re-sent the offending packet:

  14233 send(4, "\1\3\0\0w\3\0\0\0\0\0\0\315\266\363\37$\0\0\0\377\177\0\0\7\4\0\0\1\0\0\0"..., 56, MSG_NOSIGNAL) = 56
  14234 gettimeofday({1352722183, 737321}, NULL) = 0
  14235 gettimeofday({1352722183, 737353}, NULL) = 0
  14236 select(5, [4], NULL, NULL, {0, 999968}) = 0 (Timeout)
  14237 time(NULL)                        = [[time]]
  14238 gettimeofday({1352722184, 736580}, NULL) = 0
  14239 gettimeofday({1352722184, 736614}, NULL) = 0
  14240 select(5, [4], NULL, NULL, {0, 707}) = 0 (Timeout)
  14241 time(NULL)                        = [[time]]
  14242 gettimeofday({1352722184, 737537}, NULL) = 0
  14243 send(4, "\1\3\0\0w\3\0\0\0\0\0\0\315\266\363\37$\0\0\0\377\177\0\0\7\4\0\0\1\0\0\0"..., 56, MSG_NOSIGNAL) = 56
^^^^^ genau das selbe Paket ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  14244 gettimeofday({1352722184, 737626}, NULL) = 0
  14245 gettimeofday({1352722184, 737655}, NULL) = 0
  14246 select(5, [4], NULL, NULL, {0, 999971}) = 1 (in [4], left {0, 194000})
  14247 recv(4, "\1\22\0\0w\3\0\0\0\0\0\0\315\266\363\37,\0\0\0002E\16\0\0\0\0\0\5\0\0\0"..., 8192, 0) = 64

It's all within the context of snmp_synch_response_cb():
It does snmp_send() then waits for the response. If that does not arrive within 1s (the DEFAULT_TIMEOUT is 1000000us), snmp_timeout()->snmp_sess_timeout() is called and the packet is resent by snmp_resend_request().

Since the AGENTX_MSG_OPEN packet of that agent was received and processed, the snmpd must have been busy processing another packet of another agent in between, but since strace-ing (or any other kind of verbose-ing) snmpd causes the problem to disappear, I'm having a hard time looking for a means to find out what snmpd has been doing in this one second.

Anyway: Please close this bus as suggested in your comment #41. I'll pursue this matter on our side.


Note You need to log in before you can comment on or make changes to this bug.