Bug 442170

Summary: "DB_BUFFER_SMALL: User memory too small for return value" error when importing LDIF with replication active
Product: [Retired] 389
Reporter: Aleksander Adamowski <bugs-redhat>
Component: Replication - General
Assignee: Rich Megginson <rmeggins>
Status: CLOSED ERRATA
QA Contact: Chandrasekar Kannan <ckannan>
Severity: low
Priority: low
Version: 1.1.0
CC: andrey.ivanov, benl, david, jgalipea, nhosoi, nkinder, pep, tao
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2008-08-27 20:39:35 UTC
Bug Blocks: 452721    
Attachments (all flags: none):
  Error log from another DB_BUFFER_SMALL incident
  access log from server1
  access log from server2
  access log from server2
  access log from server1 (supplemented)
  Crash of server2 when initialized from server1
  server2 error log when it crashes
  diffs
  cvs commit log

Description Aleksander Adamowski 2008-04-12 11:53:37 UTC
Description of problem:
I've set up multi-master replication between two servers, directory1 and directory2.

When I import a 6.8 MB LDIF file on directory1, the replication agreement to
directory2 goes out of sync and I get the following error on directory1 in
/var/log/dirsrv/slapd-directory1/errors:

[12/Apr/2008:01:08:02 +0200] NSMMReplicationPlugin - changelog program - agmt="cn=with_directory2" (directory2:636): Failed to retrieve change with CSN 48000b7a007b00010000; db error - -30999 DB_BUFFER_SMALL: User memory too small for return value
[12/Apr/2008:01:08:02 +0200] NSMMReplicationPlugin - agmt="cn=with_directory2" (directory2:636): A changelog database error was encountered
[12/Apr/2008:01:08:02 +0200] NSMMReplicationPlugin - agmt="cn=with_directory2" (directory2:636): Incremental update failed and requires administrator action

After I re-initialize the consumer directory2 (on directory1: Configuration ->
Replication -> userRoot -> with_directory2 -> Initialize Consumer), replication
works properly again (in both directions), until I try the same LDIF import
again.

In the userRoot Database Settings, I've increased "memory available for cache"
to 40971520, but it didn't help. (I didn't count on it anyway, since the problem
seems to be not with the directory's Berkeley DB cache but with the changelog's
Berkeley DB, which appears to be separate. The changelog DB has "Max changelog
records" set to "Unlimited".)


Version-Release number of selected component (if applicable):

fedora-ds-1.1.0-3.fc8 for x86_64

Comment 1 Rich Megginson 2008-04-14 14:46:17 UTC
Wasn't this error reported recently on the mailing lists?  And it was due to
disk corruption?

If this is a different issue:  Do you have any really large entries in your
LDIF?  Any attributes with large numbers of values (like large static groups)? 
Any really large attribute values (like a CRL)?

Does this problem only happen during initialization, or does incremental
replication work for a while and then stop with this error?  Do you know what
the change associated with CSN 48000b7a007b00010000 is?  You can examine the
access log for the ADD, MOD, or DEL operation associated with that CSN.

Comment 2 Aleksander Adamowski 2008-04-18 16:14:30 UTC
In my case disk corruption is ruled out, since the problem occurred on two
distinct machines in the same situation: an import of our LDIF file containing
3740 entries.

Yes, there are large entries: many have jpegPhoto set (but none exceeds 100 KB);
there are some large mailing lists (a couple hundred entries each), and one
posixGroup has 595 memberUid values.

Unfortunately, I don't know what exact change caused this error.

Searching for this CSN in the db_printlog output of the changelog BDB database
finds nothing, and neither does a db_dump of the
63a0d802-080f11dd-9122b219-421e6ead_47ffda11000000010000.db4 file.

The CSN can be found in only one file in the changelogdb directory, __db.003,
but I don't think it tells us which change it relates to. Here's a fragment of
the strings dump of __db.003 around the place where that CSN occurs:

...
l=City1,ou=People,o=DIRECTORY_BASE_DN
City1
objectClass
locality
creatorsName
cn=manager
modifiersName
cn=manager
createTimestamp
20080411230759Z
modifyTimestamp
20080411230759Z
nsUniqueId
#1a0aba11-081c11dd-9122b219-421e6ead
48000b7c000700010000
48000b7c000600010000
48000b7c000500010000
48000b7c000400010000
48000b7c000300010000
48000b7c000200010000
48000b7c000100010000
48000b7c000000010000
48000b7a007b00010000
48000b7a007a00010000
48000b7a007800010000
M48000b7a000400010000
1a0aba06-081c11dd-9122b219-421e6ead
1a0aba03-081c11dd-9122b219-421e6ead
l=City2,ou=People,o=DIRECTORY_BASE_DN
City2
objectClass
locality
creatorsName
cn=manager
modifiersName
cn=manager
createTimestamp
20080411230757Z
modifyTimestamp
20080411230757Z
nsUniqueId
#1a0aba06-081c11dd-9122b219-421e6ead
...


There are several CSNs in a row with no data recorded in between... I'm not an
expert at analyzing the binary structure of BDB databases, so I may be wrong...

Comment 3 Rich Megginson 2008-04-18 16:27:14 UTC
Check the access log.  The access log has the CSN along with the operation and
the DN.

Comment 4 Aleksander Adamowski 2008-04-21 21:03:35 UTC
OK, I've found it in the access log:

[12/Apr/2008:01:07:57 +0200] conn=23 op=3966 ADD dn="uid=SOME_USER_UID,l=City1,ou=People,o=DIRECTORY_BASE_DN"
[12/Apr/2008:01:07:57 +0200] conn=23 op=3966 RESULT err=0 tag=105 nentries=0 etime=0 csn=48000b7a007b00010000

Doesn't seem suspicious.

The entry isn't very big - no extremely long value lists, and it contains a
jpegPhoto of moderate size; the whole entry in LDIF representation is 9.3K.

The next entry in the imported LDIF is 96K (most of that is its large
jpegPhoto). Its ADD operation had CSN 48000b7c000000010000 - no error was
logged for that CSN.

Comment 5 Aleksander Adamowski 2008-05-29 13:23:07 UTC
Created attachment 307048 [details]
Error log from another DB_BUFFER_SMALL incident

It happened again; I'm attaching all related log fragments this time.

Comment 6 Aleksander Adamowski 2008-05-29 13:34:37 UTC
Created attachment 307054 [details]
access log from server1

server1 is where the DB_BUFFER_SMALL was logged.

Comment 7 Aleksander Adamowski 2008-05-29 13:37:51 UTC
Created attachment 307059 [details]
access log from server2

server2 is a partner in MMR replication

Comment 8 Aleksander Adamowski 2008-05-29 13:38:24 UTC
Created attachment 307061 [details]
access log from server2

server3 is a partner in MMR replication

Comment 9 Aleksander Adamowski 2008-05-29 13:46:02 UTC
Created attachment 307064 [details]
access log from server1 (supplemented)

server1 is where the DB_BUFFER_SMALL was logged.

Comment 10 Aleksander Adamowski 2008-05-29 14:06:55 UTC
Here's how I think it occurred:

1) User "otheruid" tries to bind at 13:35:42 and gets error 49 - wrong credentials:

(server 1 access log):
[28/May/2008:13:35:42 +0200] conn=1211058 op=99 SRCH base="o=DIRECTORY_BASE_DN" scope=2 filter="(mail=otheruid)" attrs="homeDirectory cn clearPassword userPassword uidNumber gidNumber mail quota shared disableimap disablepop3 disableshared"
[28/May/2008:13:35:42 +0200] conn=1211058 op=99 ENTRY dn="uid=otheruid,o=Freelancerzy,ou=People,o=DIRECTORY_BASE_DN"
[28/May/2008:13:35:42 +0200] conn=1211058 op=99 RESULT err=0 tag=101 nentries=1 etime=0
[28/May/2008:13:35:42 +0200] conn=1211059 op=49 BIND dn="uid=otheruid,o=Freelancerzy,ou=People,o=DIRECTORY_BASE_DN" method=128 version=3
[28/May/2008:13:35:42 +0200] conn=1211059 op=49 RESULT err=49 tag=97 nentries=0 etime=0


Since password lockout is active and replication is active (to work around bug
442560), the change in passwordRetryCount got replicated to the other nodes. So
far so good.

2) User "otheruid" tries to bind at 13:35:56, this time with a correct password:

(server 1 access log):
[28/May/2008:13:35:56 +0200] conn=1211894 op=2 SRCH base="o=DIRECTORY_BASE_DN" scope=2 filter="(mail=otheruid)" attrs="homeDirectory cn clearPassword userPassword uidNumber gidNumber mail quota shared disableimap disablepop3 disableshared"
[28/May/2008:13:35:56 +0200] conn=1211894 op=2 ENTRY dn="uid=otheruid,o=Freelancerzy,ou=People,o=DIRECTORY_BASE_DN"
[28/May/2008:13:35:56 +0200] conn=1211894 op=2 RESULT err=0 tag=101 nentries=1 etime=0
[28/May/2008:13:35:56 +0200] conn=1211895 op=1 BIND dn="uid=otheruid,o=Freelancerzy,ou=People,o=DIRECTORY_BASE_DN" method=128 version=3
[28/May/2008:13:35:56 +0200] conn=1211895 op=1 RESULT err=0 tag=97 nentries=0 etime=0 dn="uid=otheruid,o=freelancerzy,ou=people,o=DIRECTORY_BASE_DN"

This results in a successful bind and a reset of the account lockout attributes
(CSN 483d60e0000000030000). The MOD operation is replicated and the replicas
accept it, as can be seen in their respective access logs:

(server 2 access log):

[28/May/2008:13:35:55 +0200] conn=125656 op=16 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"
[28/May/2008:13:35:55 +0200] conn=125656 op=16 RESULT err=0 tag=120 nentries=0 etime=0
[28/May/2008:13:35:55 +0200] conn=125655 op=21 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
[28/May/2008:13:35:55 +0200] conn=125655 op=21 RESULT err=0 tag=120 nentries=0 etime=0
[28/May/2008:13:35:57 +0200] conn=125655 op=22 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"
[28/May/2008:13:35:57 +0200] conn=125655 op=22 RESULT err=0 tag=120 nentries=0 etime=0
[28/May/2008:13:35:57 +0200] conn=125655 op=23 MOD dn="uid=otheruid,o=freelancerzy,ou=people,o=DIRECTORY_BASE_DN"
[28/May/2008:13:35:57 +0200] conn=125655 op=23 RESULT err=0 tag=103 nentries=0 etime=0 csn=483d60e0000000030000

However, 9 minutes later server1 (the originator of the change) logs an error:

(server1 error log):
[28/May/2008:13:44:26 +0200] NSMMReplicationPlugin - changelog program - agmt="cn=with_server2" (server2:636): Failed to retrieve change with CSN 483d60e0000000030000; db error - -30999 DB_BUFFER_SMALL: User memory too small for return value
[28/May/2008:13:44:26 +0200] NSMMReplicationPlugin - agmt="cn=with_server2" (server2:636): A changelog database error was encountered
[28/May/2008:13:44:26 +0200] NSMMReplicationPlugin - changelog program - agmt="cn=with_server3" (server3:636): Failed to retrieve change with CSN 483d60e0000000030000; db error - -30999 DB_BUFFER_SMALL: User memory too small for return value
[28/May/2008:13:44:26 +0200] NSMMReplicationPlugin - agmt="cn=with_server3" (server3:636): A changelog database error was encountered
[28/May/2008:13:44:26 +0200] NSMMReplicationPlugin - agmt="cn=with_server3" (server3:636): Incremental update failed and requires administrator action
[28/May/2008:13:44:26 +0200] NSMMReplicationPlugin - agmt="cn=with_server2" (server2:636): Incremental update failed and requires administrator action


Comment 11 Aleksander Adamowski 2008-05-29 14:13:17 UTC
Created attachment 307069 [details]
Crash of server2 when initialized from server1

Here's another error log, somewhat related. When I remedied this situation by
initializing the server2 replica from server1, the consumer (server2) crashed.

I don't have a core file because the core file ulimit was too small (I've raised
it for future crashes).

Comment 12 Aleksander Adamowski 2008-05-29 14:19:08 UTC
Created attachment 307072 [details]
server2 error log when it crashes

The previous log was from the provider (server1), where everything went fine.

Here's the log from the consumer (server2), which crashed.

The problem starts immediately after this line:

[29/May/2008:14:39:48 +0200] - WARNING: Import is running with
nsslapd-db-private-import-mem on; No other process is allowed to access the
database


and begins with these lines:

[29/May/2008:14:39:55 +0200] - slapd shutting down - signaling operation
threads
[29/May/2008:14:39:55 +0200] - slapd shutting down - waiting for 1 thread to
terminate


During normal initialization it should be like this instead:
[29/May/2008:14:49:38 +0200] - WARNING: Import is running with
nsslapd-db-private-import-mem on; No other process is allowed to access the
database
[29/May/2008:14:49:50 +0200] - import userRoot: Workers finished; cleaning
up...
[29/May/2008:14:49:50 +0200] - import userRoot: Workers cleaned up.
[29/May/2008:14:49:50 +0200] - import userRoot: Indexing complete. 
Post-processing...
[29/May/2008:14:49:50 +0200] - import userRoot: Flushing caches...
[29/May/2008:14:49:50 +0200] - import userRoot: Closing files...
[29/May/2008:14:49:53 +0200] - import userRoot: Import complete.  Processed
11330 entries in 15 seconds. (755.33 entries/sec)

From the timestamps it seems that slapd crashed while the initialization
process was finishing.

Comment 13 Rich Megginson 2008-05-29 15:14:38 UTC
Do you only see this problem when using global password policy to replicate the
password lockout attributes?

Comment 14 Rich Megginson 2008-05-29 15:19:59 UTC
It doesn't look like a crash; it looks like the process is being shut down or
is exiting for some reason.  I wonder if this has anything to do with the
process running out of memory (DB_BUFFER_SMALL), with the system OOM Killer
then sending a termination signal to the process.  If you can easily reproduce
the problem, try running top or some other memory monitor to watch the memory
growth of the server process - I guess server2 in this case, since it is the
one receiving the termination signal.

Comment 15 Aleksander Adamowski 2008-05-29 15:58:59 UTC
Regarding comment #13: hard to tell, since its occurrence is quite rare and only
on production systems with varied traffic. I'm not going to turn lockout off on
them for security reasons.

Regarding comment #14: I can't see OOM Killer activity in /var/log/messages
(the kernel usually logs a message when it kills a process).

However, I've discovered a strange log message in /var/log/messages (strange
because FDS isn't supposed to use syslog under normal circumstances):

May 29 14:39:48 server2 ns-slapd: Exceeded max number of logs allowed 



Comment 16 Rich Megginson 2008-05-29 16:23:36 UTC
There are a few cases where slapd will log to syslog - this is one of them.  In
this case, the error and/or access log rotation has exceeded the number of
rotated logs allowed by the log rotation policy, so the server has nowhere to
write the message and logs it to syslog instead.  What are your log rotation
policy settings?  Are you using the defaults?  How many access and error log
files do you have in your /var/log/dirsrv/slapd-instance directory?

Comment 17 Aleksander Adamowski 2008-05-29 19:25:43 UTC
Yes, I had probably hit the limit on the logfile count, so I've adjusted that now.

But could that cause an abnormal server exit? Is this by design?

The main server (server1) has hit that limit several times too, twice yesterday
and 3 more times this month, and it's still up.


Comment 18 Aleksander Adamowski 2008-05-29 19:29:02 UTC
For the record, server1 has worked continuously since April 15th (that's almost
a month and a half). It hit the log limit 5 times during that period.

Comment 19 Rich Megginson 2008-05-29 19:33:39 UTC
AFAICT, hitting the log file limit should not cause the server to exit, yet
that is what you have seen.  So I'm still not sure why the server is exiting,
or what could be killing it.

Comment 20 Aleksander Adamowski 2008-05-29 19:40:10 UTC
BTW, I've read the thread at
http://www.redhat.com/archives/fedora-directory-users/2008-April/msg00026.html .

It seems this is the same problem. We are on x86_64, too.

However, in our case disk data corruption is most probably out of the question.

Comment 22 Aleksander Adamowski 2008-05-29 19:48:31 UTC
Also possibly helpful:
http://forums.oracle.com/forums/thread.jspa?messageID=2268901&#2268901

Comment 23 Rich Megginson 2008-05-29 20:14:44 UTC
If you can easily reproduce the problem, can you try reproducing it with the
replication log level on? 
http://directory.fedoraproject.org/wiki/FAQ#Troubleshooting

Note that if this is a high-traffic production environment, you probably don't
want to do this, or only do it for a very short period of time, because debug
logging will slow the server down quite a bit.

We are using the C API -
http://www.oracle.com/technology/documentation/berkeley-db/db/api_c/dbc_get.html

I believe the problem is in clcache_cursor_get().  We should already know the
key and data sizes, and should already have pre-allocated the buffer.  So I'm
not sure why we are getting this error at this time.  Using the repl debug log
level, I would expect to see something like this:
"clcache: (%d | %d) %s reallocated and retry returns %d"

Comment 24 Aleksander Adamowski 2008-05-30 08:23:59 UTC
I wouldn't say I can _easily_ reproduce this, but it occurs every couple of weeks.

These are production systems, but update traffic is moderate so enabling
replication debugging is not a problem - I've enabled it and we'll see.



Comment 25 David Boreham 2008-06-19 22:00:12 UTC
Looking at the code:

clcache_cursor_get ( DBC *cursor, CLC_Buffer *buf, int flag )
{
	int rc;

	rc = cursor->c_get ( cursor,
						 & buf->buf_key,
						 & buf->buf_data,
						 buf->buf_load_flag | flag );
	if ( ENOMEM == rc ) {
		/*
		 * The record takes more space than the current size of the
		 * buffer. Fortunately, buf->buf_data.size has been set by
		 * c_get() to the actual data size needed. So we can
		 * reallocate the data buffer and try to read again.
		 */


(This is 7.1, pulled from Google code search, so apologies if the trunk is
different now)

I am wondering if what's happened is that BDB has changed slightly at some
point in the past. The code expects the call to dbc->c_get() to return ENOMEM.
However, the current BDB docs suggest that it does not return ENOMEM, but
instead the aforementioned DB_BUFFER_SMALL error code.

I seem to remember that there was a period when Sleepycat wanted to remove all
the libc error codes from their API (because there had been many instances where
it was impossible for an application to tell whether a given error had
originated inside BDB or in a libc call made by BDB on its behalf). To resolve
this, all the existing cases of libc (errno.h) error codes generated inside
libdb were changed to 'native' error codes.

It may be that if the test for ENOMEM were changed to DB_BUFFER_SMALL (or
perhaps test for either to be safe), then this bug would be fixed.
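
Concretely, the test in the excerpt above would become something like this (a
sketch of the idea only, not an actual patch):

	rc = cursor->c_get ( cursor,
						 & buf->buf_key,
						 & buf->buf_data,
						 buf->buf_load_flag | flag );
	if ( DB_BUFFER_SMALL == rc || ENOMEM == rc ) {
		/* buffer too small: buf->buf_data.size holds the size BDB
		 * actually needs, so reallocate the buffer and retry */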

Just an idea, I haven't spent much time thinking about this and I may be
completely wrong.



Comment 26 Rich Megginson 2008-06-19 22:03:58 UTC
That's a good bet to be the cause.  We're now using BDB 4.3 or 4.6, so it's
quite likely Sleepycat changed the error code.  I'll try some tests to see if
that's the case.

Comment 27 Rich Megginson 2008-06-23 16:39:18 UTC
Created attachment 310041 [details]
diffs

Looks like ENOMEM was the culprit.  I can easily reproduce the problem by using
an attribute value > 32 KB with replication.  With the fix attached, everything
works fine.

Comment 28 Nathan Kinder 2008-06-23 18:03:39 UTC
The fix looks good.  Thanks for the pointer on this issue David!

Comment 29 Rich Megginson 2008-06-23 18:33:44 UTC
Created attachment 310059 [details]
cvs commit log

Reviewed by: nkinder (Thanks!)
Fix Description: BDB 4.3 does not use ENOMEM if the given buffer is too small -
it uses DB_BUFFER_SMALL.  This fix allows us to use DB_BUFFER_SMALL in BDB 4.2
and earlier too.  I also cleaned up some of the cl5 API return codes to return
an appropriate error code to the higher levels rather than passing ENOMEM up.
Platforms tested: RHEL5
Flag Day: no
Doc impact: no
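
The compatibility part of the fix boils down to something like the following (a
sketch based on the description above; the committed change is in the attached
diffs):

#include <errno.h>   /* ENOMEM */
#include <db.h>

#ifndef DB_BUFFER_SMALL
/* BDB 4.2 and earlier report a too-small user buffer as ENOMEM;
 * 4.3 and later use the dedicated DB_BUFFER_SMALL code instead.
 * Defining it here lets the changelog code test for DB_BUFFER_SMALL
 * regardless of the BDB version in use. */
#define DB_BUFFER_SMALL ENOMEM
#endif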

Comment 30 Nathan Kinder 2008-07-10 23:04:34 UTC
Checked into Directory_Server_8_0_Branch.

Checking in ldap/servers/plugins/replication/cl5_clcache.c;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/cl5_clcache.c,v  <-- 
cl5_clcache.c
new revision: 1.7.2.1; previous revision: 1.7
done

Comment 32 Jenny Severance 2008-08-15 17:40:52 UTC
fix verified (RHEL4-32, RHEL4-64, RHEL5-32, RHEL5-64, HP UX, SOL9)

Comment 34 errata-xmlrpc 2008-08-27 20:39:35 UTC
An advisory has been issued which should help the problem described in this bug
report. This report is therefore being closed with a resolution of ERRATA. For
more information on the solution and/or where to find the updated files, please
follow the link below. You may reopen this bug report if the solution does not
work for you.

http://rhn.redhat.com/errata/RHSA-2008-0602.html