Bug 442170
Summary: | "DB_BUFFER_SMALL: User memory too small for return value" error when importing LDIF with replication active
---|---
Product: | [Retired] 389
Component: | Replication - General
Status: | CLOSED ERRATA
Severity: | low
Priority: | low
Version: | 1.1.0
Hardware: | x86_64
OS: | Linux
Reporter: | Aleksander Adamowski <bugs-redhat>
Assignee: | Rich Megginson <rmeggins>
QA Contact: | Chandrasekar Kannan <ckannan>
CC: | andrey.ivanov, benl, david, jgalipea, nhosoi, nkinder, pep, tao
Doc Type: | Bug Fix
Last Closed: | 2008-08-27 20:39:35 UTC
Bug Blocks: | 452721
Description
Aleksander Adamowski
2008-04-12 11:53:37 UTC
Wasn't this error reported recently on the mailing lists, where it was due to disk corruption? If this is a different issue: do you have any really large entries in your LDIF? Any attributes with large numbers of values (like large static groups)? Any really large attribute values (like a CRL)? Does this problem only happen during initialization, or does incremental replication work for a while and then stop with this error? Do you know what change is associated with CSN 48000b7a007b00010000? You can examine the access log for the ADD, MOD, or DEL operation associated with that CSN.

In my case disk corruption is ruled out, since the problem occurred on two distinct machines in the same situation: an import of our LDIF containing 3740 entries. Yes, there are large entries: many entries have jpegPhoto set (but they don't exceed 100 KB each); there are some large mailing lists (a couple hundred entries) and one posixGroup that has 595 memberUid values.

Unfortunately, I don't know what exact change caused this error. Searching for this CSN in the output of db_printlog on the changelog BDB database finds nothing, and neither does a db_dump of the 63a0d802-080f11dd-9122b219-421e6ead_47ffda11000000010000.db4 file. The CSN can be found in only one file in the changelogdb directory, __db.003, but I don't think it reveals which changes relate to it. (A programmatic version of this kind of scan, using the BDB C API, is sketched a little further below.) Here's the fragment of a strings dump from __db.003 around the place where that CSN occurs:

```
... l=City1,ou=People,o=DIRECTORY_BASE_DN City1 objectClass locality
creatorsName cn=manager modifiersName cn=manager
createTimestamp 20080411230759Z modifyTimestamp 20080411230759Z
nsUniqueId #1a0aba11-081c11dd-9122b219-421e6ead
48000b7c000700010000 48000b7c000600010000 48000b7c000500010000
48000b7c000400010000 48000b7c000300010000 48000b7c000200010000
48000b7c000100010000 48000b7c000000010000 48000b7a007b00010000
48000b7a007a00010000 48000b7a007800010000 M48000b7a000400010000
1a0aba06-081c11dd-9122b219-421e6ead 1a0aba03-081c11dd-9122b219-421e6ead
l=City2,ou=People,o=DIRECTORY_BASE_DN City2 objectClass locality
creatorsName cn=manager modifiersName cn=manager
createTimestamp 20080411230757Z modifyTimestamp 20080411230757Z
nsUniqueId #1a0aba06-081c11dd-9122b219-421e6ead ...
```

There are several CSN numbers in a row with no data recorded in between. I'm not an expert at analyzing the binary structure of BDB databases, so I may be wrong.

Check the access log. The access log has the CSN along with the operation and the DN.

OK, I've found it in the access log:

```
[12/Apr/2008:01:07:57 +0200] conn=23 op=3966 ADD dn="uid=SOME_USER_UID,l=City1,ou=People,o=DIRECTORY_BASE_DN"
[12/Apr/2008:01:07:57 +0200] conn=23 op=3966 RESULT err=0 tag=105 nentries=0 etime=0 csn=48000b7a007b00010000
```

It doesn't seem suspicious. The entry isn't very big: no extremely long value lists, and it contains a jpegPhoto of moderate size; the whole entry is 9.3 KB in its LDIF representation. The next entry in the imported LDIF is 96 KB (most of that is its large jpegPhoto). Its ADD operation had CSN 48000b7c000000010000, and no error was logged in the error log with respect to that CSN.

Created attachment 307048 [details]
Error log from another DB_BUFFER_SMALL incident
It happened again; I'm attaching all related log fragments this time.
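For reference, the kind of scan attempted above with db_dump can also be done directly against a .db4 file with the BDB C API. A rough sketch, assuming BDB 4.x, a hypothetical helper name, and the GNU memmem extension; this read-only path leaves the DBTs zeroed so BDB manages the buffers itself, which is why DB_BUFFER_SMALL cannot occur here:

```c
#define _GNU_SOURCE     /* for memmem() */
#include <stdio.h>
#include <string.h>
#include <db.h>

/* Walk every record in a changelog database file and report records
 * whose key or data contain the given CSN string -- roughly what
 * piping db_dump through grep does. Error handling is abbreviated. */
int scan_for_csn(const char *dbfile, const char *csn)
{
    DB *db = NULL;
    DBC *cursor = NULL;
    DBT key, data;
    size_t csnlen = strlen(csn);
    int rc, found = 0;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));

    if (db_create(&db, NULL, 0) != 0)
        return -1;
    /* DB_UNKNOWN lets BDB detect the access method; DB_RDONLY keeps
     * the scan from touching the live database. */
    if (db->open(db, NULL, dbfile, NULL, DB_UNKNOWN, DB_RDONLY, 0) != 0)
        return -1;
    if (db->cursor(db, NULL, &cursor, 0) != 0)
        return -1;

    while ((rc = cursor->c_get(cursor, &key, &data, DB_NEXT)) == 0) {
        if ((key.size >= csnlen &&
             memmem(key.data, key.size, csn, csnlen)) ||
            (data.size >= csnlen &&
             memmem(data.data, data.size, csn, csnlen))) {
            printf("CSN %s found in a record (%u-byte key, %u-byte data)\n",
                   csn, key.size, data.size);
            found = 1;
        }
    }
    cursor->c_close(cursor);
    db->close(db, 0);
    return found;
}
```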
Created attachment 307054 [details]
access log from server1
server1 is where the DB_BUFFER_SMALL was logged.
Created attachment 307059 [details]
access log from server2
server2 is a partner in MMR replication
Created attachment 307061 [details]
access log from server3
server3 is a partner in MMR replication
Created attachment 307064 [details]
access log from server1 (supplemented)
server1 is where the DB_BUFFER_SMALL was logged.
Here's how I think it occurred:

1) User "otheruid" tries to bind at 13:35:42 and gets error 49, wrong credentials (server1 access log):

```
[28/May/2008:13:35:42 +0200] conn=1211058 op=99 SRCH base="o=DIRECTORY_BASE_DN" scope=2 filter="(mail=otheruid)" attrs="homeDirectory cn clearPassword userPassword uidNumber gidNumber mail quota shared disableimap disablepop3 disableshared"
[28/May/2008:13:35:42 +0200] conn=1211058 op=99 ENTRY dn="uid=otheruid,o=Freelancerzy,ou=People,o=DIRECTORY_BASE_DN"
[28/May/2008:13:35:42 +0200] conn=1211058 op=99 RESULT err=0 tag=101 nentries=1 etime=0
[28/May/2008:13:35:42 +0200] conn=1211059 op=49 BIND dn="uid=otheruid,o=Freelancerzy,ou=People,o=DIRECTORY_BASE_DN" method=128 version=3
[28/May/2008:13:35:42 +0200] conn=1211059 op=49 RESULT err=49 tag=97 nentries=0 etime=0
```

Since password lockout is active and replication is active (in order to work around bug 442560), the change in passwordRetryCount got replicated to the other nodes. So far so good.

2) User "otheruid" tries to bind at 13:35:56, this time with a correct password (server1 access log):

```
[28/May/2008:13:35:56 +0200] conn=1211894 op=2 SRCH base="o=DIRECTORY_BASE_DN" scope=2 filter="(mail=otheruid)" attrs="homeDirectory cn clearPassword userPassword uidNumber gidNumber mail quota shared disableimap disablepop3 disableshared"
[28/May/2008:13:35:56 +0200] conn=1211894 op=2 ENTRY dn="uid=otheruid,o=Freelancerzy,ou=People,o=DIRECTORY_BASE_DN"
[28/May/2008:13:35:56 +0200] conn=1211894 op=2 RESULT err=0 tag=101 nentries=1 etime=0
[28/May/2008:13:35:56 +0200] conn=1211895 op=1 BIND dn="uid=otheruid,o=Freelancerzy,ou=People,o=DIRECTORY_BASE_DN" method=128 version=3
[28/May/2008:13:35:56 +0200] conn=1211895 op=1 RESULT err=0 tag=97 nentries=0 etime=0 dn="uid=otheruid,o=freelancerzy,ou=people,o=DIRECTORY_BASE_DN"
```

This results in a successful bind and in a reset of the account lockout attributes (CSN 483d60e0000000030000).

So the MOD operation is replicated and the replicas accept it, as can be seen in their respective access logs (server2 access log):

```
[28/May/2008:13:35:55 +0200] conn=125656 op=16 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"
[28/May/2008:13:35:55 +0200] conn=125656 op=16 RESULT err=0 tag=120 nentries=0 etime=0
[28/May/2008:13:35:55 +0200] conn=125655 op=21 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
[28/May/2008:13:35:55 +0200] conn=125655 op=21 RESULT err=0 tag=120 nentries=0 etime=0
[28/May/2008:13:35:57 +0200] conn=125655 op=22 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"
[28/May/2008:13:35:57 +0200] conn=125655 op=22 RESULT err=0 tag=120 nentries=0 etime=0
[28/May/2008:13:35:57 +0200] conn=125655 op=23 MOD dn="uid=otheruid,o=freelancerzy,ou=people,o=DIRECTORY_BASE_DN"
[28/May/2008:13:35:57 +0200] conn=125655 op=23 RESULT err=0 tag=103 nentries=0 etime=0 csn=483d60e0000000030000
```

However, 9 minutes later server1 (the originator of the change) logs an error (server1 error log):

```
[28/May/2008:13:44:26 +0200] NSMMReplicationPlugin - changelog program - agmt="cn=with_server2" (server2:636): Failed to retrieve change with CSN 483d60e0000000030000; db error - -30999 DB_BUFFER_SMALL: User memory too small for return value
[28/May/2008:13:44:26 +0200] NSMMReplicationPlugin - agmt="cn=with_server2" (server2:636): A changelog database error was encountered
[28/May/2008:13:44:26 +0200] NSMMReplicationPlugin - changelog program - agmt="cn=with_server3" (server3:636): Failed to retrieve change with CSN 483d60e0000000030000; db error - -30999 DB_BUFFER_SMALL: User memory too small for return value
[28/May/2008:13:44:26 +0200] NSMMReplicationPlugin - agmt="cn=with_server3" (server3:636): A changelog database error was encountered
[28/May/2008:13:44:26 +0200] NSMMReplicationPlugin - agmt="cn=with_server3" (server3:636): Incremental update failed and requires administrator action
[28/May/2008:13:44:26 +0200] NSMMReplicationPlugin - agmt="cn=with_server2" (server2:636): Incremental update failed and requires administrator action
```

Created attachment 307069 [details]
Crash of server2 when initialized from server1
Here's another, somewhat related error log. When remedying this situation by
starting an initialization of the server2 replica from server1, the consumer
(server2) crashed.
I don't have a core file because the core file ulimit was too small (I've
raised it for future crashes).
Created attachment 307072 [details]
server2 error log when it crashes
The previous log was from the provider (server1), where everything went fine.
Here's the log from the consumer (server2), which crashed.
The problem starts immediately after this line:

```
[29/May/2008:14:39:48 +0200] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
```

and manifests as these lines:

```
[29/May/2008:14:39:55 +0200] - slapd shutting down - signaling operation threads
[29/May/2008:14:39:55 +0200] - slapd shutting down - waiting for 1 thread to terminate
```

During a normal initialization it should look like this instead:

```
[29/May/2008:14:49:38 +0200] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[29/May/2008:14:49:50 +0200] - import userRoot: Workers finished; cleaning up...
[29/May/2008:14:49:50 +0200] - import userRoot: Workers cleaned up.
[29/May/2008:14:49:50 +0200] - import userRoot: Indexing complete. Post-processing...
[29/May/2008:14:49:50 +0200] - import userRoot: Flushing caches...
[29/May/2008:14:49:50 +0200] - import userRoot: Closing files...
[29/May/2008:14:49:53 +0200] - import userRoot: Import complete. Processed 11330 entries in 15 seconds. (755.33 entries/sec)
```

From the timestamps it seems that slapd crashed just as the initialization process was finishing.
Do you only see this problem when using a global password policy to replicate the password lockout attributes?

It doesn't look like a crash; it looks like the process is being shut down or is exiting for some reason. I wonder if this has anything to do with the process running out of memory (DB_BUFFER_SMALL), with the system OOM Killer then sending the termination signal to the process? If you can easily reproduce the problem, try running top or some other memory monitor to watch the memory growth of the server process - I guess server2 in this case, since it is the one receiving the termination signal.

Regarding comment #13: hard to tell, since its occurrence is quite rare and only on production systems with varied traffic. I'm not going to turn lockout off on them, for security reasons.

Regarding comment #14: I can't see OOM Killer activity in /var/log/messages (the kernel usually logs a message when it kills a process). However, I've discovered a strange log message in /var/log/messages (strange because FDS isn't supposed to use syslog under normal circumstances):

```
May 29 14:39:48 server2 ns-slapd: Exceeded max number of logs allowed
```

There are a few cases where slapd will log to syslog - this is one of them. In this case, the server error and/or access log rotation has exceeded the number of rotated logs allowed by its log rotation policy, so it has no place to log the message, so it logs to syslog. What are your log rotation policy settings? Are you using the defaults? How many access and error log files do you have in your /var/log/dirsrv/slapd-instance directory?

Yes, I had probably hit the limit on logfile count, so I've adjusted that now. But is it a possible cause of an abnormal server exit? Is this by design? The main server (server1) has hit that limit several times too, twice yesterday and 3 more times this month, and it's still up. For the record, server1 has worked continuously since April 15th (almost a month and a half) and hit the log limit 5 times during that period.

AFAICT, hitting the log file limit should not cause the server to exit, which is what you have seen. So I'm still not sure why the server is exiting, or what could be killing it.

BTW, I've read the thread at http://www.redhat.com/archives/fedora-directory-users/2008-April/msg00026.html . It seems to be the same problem. We are on x86_64, too. However, in our case disk data corruption is most probably out of the question.

Possible solution?: http://forums.oracle.com/forums/thread.jspa?messageID=2054136 Relevant documentation: http://www.oracle.com/technology/documentation/berkeley-db/db/api_cxx/dbt_class.html Also possibly helpful: http://forums.oracle.com/forums/thread.jspa?messageID=2268901

If you can easily reproduce the problem, can you try reproducing it with the replication log level on? http://directory.fedoraproject.org/wiki/FAQ#Troubleshooting Note that if this is a high-traffic production environment, you probably don't want to do this, or should only do it for a very short period, because debug logging will slow the server down quite a bit.

We are using the C API - http://www.oracle.com/technology/documentation/berkeley-db/db/api_c/dbc_get.html . I believe the problem is in clcache_cursor_get(). We should already know the key and data sizes, and should already have pre-allocated the buffer, so I'm not sure why we are getting this error at this time.
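For context, here is a minimal sketch of the failure mode being discussed, assuming BDB 4.x and made-up buffer sizes (the server's real buffers live in its CLC_Buffer structure, not shown here). With DB_DBT_USERMEM the caller owns the buffer, and an undersized buffer makes c_get() fail while setting dbt.size to the size actually required:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <db.h>

/* Read the first record through caller-supplied buffers. If the
 * record is larger than data.ulen, c_get() fails and sets data.size
 * to the required size: BDB 4.2 and earlier reported this as ENOMEM,
 * 4.3+ as DB_BUFFER_SMALL (-30999). Sizes here are illustrative. */
static int read_first_record(DBC *cursor)
{
    char keybuf[256], databuf[32 * 1024];
    DBT key, data;
    int rc;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data = keybuf;
    key.ulen = sizeof(keybuf);
    key.flags = DB_DBT_USERMEM;
    data.data = databuf;
    data.ulen = sizeof(databuf);
    data.flags = DB_DBT_USERMEM;

    rc = cursor->c_get(cursor, &key, &data, DB_FIRST);
#ifdef DB_BUFFER_SMALL
    if (rc == DB_BUFFER_SMALL)      /* BDB 4.3+ */
#else
    if (rc == ENOMEM)               /* BDB 4.2 and earlier */
#endif
        fprintf(stderr, "record needs %u bytes, buffer holds %u\n",
                data.size, data.ulen);
    return rc;
}
```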
Using the repl debug log level, I would expect to see something like this: "clcache: (%d | %d) %s reallocated and retry returns %d"

I wouldn't say I can _easily_ reproduce this, but it occurs every couple of weeks. These are production systems, but update traffic is moderate, so enabling replication debugging is not a problem - I've enabled it and we'll see.

Looking at the code:

```c
clcache_cursor_get ( DBC *cursor, CLC_Buffer *buf, int flag )
{
	int rc;

	rc = cursor->c_get ( cursor,
			     & buf->buf_key,
			     & buf->buf_data,
			     buf->buf_load_flag | flag );
	if ( ENOMEM == rc ) {
		/*
		 * The record takes more space than the current size of the
		 * buffer. Fortunately, buf->buf_data.size has been set by
		 * c_get() to the actual data size needed. So we can
		 * reallocate the data buffer and try to read again.
		 */
```

(This is 7.1, pulled from Google Code Search, so apologies if the trunk is different now.)

I am wondering if what's happened is that BDB has changed slightly at some point in the past. The code expects the call to dbc->c_get() to return ENOMEM. However, the current BDB docs suggest that it does not return ENOMEM but instead the aforementioned DB_BUFFER_SMALL error code. I seem to remember there was a period when Sleepycat wanted to remove all the libc error codes from their API (because there had been many instances where it was impossible for an application to tell whether a given error had originated inside BDB or in a libc call made by BDB on its behalf). To resolve this, all the existing cases of libc (errno.h) error codes generated inside libdb were changed to 'native' error codes. It may be that if the test for ENOMEM were changed to DB_BUFFER_SMALL (or perhaps a test for either, to be safe), this bug would be fixed. Just an idea - I haven't spent much time thinking about this and I may be completely wrong.

That's a good bet to be the cause. We're now using BDB 4.3 or 4.6, so it's quite likely Sleepycat changed the error code. I'll try some tests to see if that's the case.

Created attachment 310041 [details]
diffs
Looks like ENOMEM was the culprit. I can easily reproduce the problem by using
an attribute value > 32 KB with replication. With the fix attached, everything
works fine.
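The attached diff isn't reproduced here, but based on the fragment David quoted and the "reallocated and retry" debug message mentioned above, the corrected retry path would look roughly like this sketch. slapi_ch_realloc is the server's allocator; the rest follows the quoted fragment, so treat this as an assumption about the fix, not the verbatim patch:

```c
rc = cursor->c_get(cursor, &buf->buf_key, &buf->buf_data,
                   buf->buf_load_flag | flag);
if (DB_BUFFER_SMALL == rc || ENOMEM == rc) {
    /* The record takes more space than the current buffer. c_get()
     * has set buf_data.size to the size actually needed, so grow the
     * buffer and retry the same read. Testing both codes covers BDB
     * 4.2 and earlier (ENOMEM) and 4.3+ (DB_BUFFER_SMALL). */
    buf->buf_data.ulen = buf->buf_data.size;
    buf->buf_data.data = slapi_ch_realloc(buf->buf_data.data,
                                          buf->buf_data.size);
    rc = cursor->c_get(cursor, &buf->buf_key, &buf->buf_data,
                       buf->buf_load_flag | flag);
}
```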
The fix looks good. Thanks for the pointer on this issue, David!

Created attachment 310059 [details]
cvs commit log
Reviewed by: nkinder (Thanks!)
Fix Description: BDB 4.3 does not use ENOMEM if the given buffer is too small -
it uses DB_BUFFER_SMALL. This fix allows us to use DB_BUFFER_SMALL in BDB 4.2
and earlier too. I also cleaned up some of the cl5 API return codes so that an
appropriate error code is returned to the higher levels rather than passing
ENOMEM up.
Platforms tested: RHEL5
Flag Day: no
Doc impact: no
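The commit itself isn't quoted here, but the description above implies a compatibility shim of roughly this shape, so that the changelog code can test for DB_BUFFER_SMALL unconditionally (an assumption about the patch, not the verbatim change):

```c
#include <errno.h>
#include <db.h>

/* BDB 4.3+ defines DB_BUFFER_SMALL and returns it where older
 * releases returned ENOMEM. Mapping the symbol on older releases
 * lets callers check a single error code on every BDB version. */
#ifndef DB_BUFFER_SMALL
#define DB_BUFFER_SMALL ENOMEM  /* BDB 4.2 and earlier */
#endif
```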
Checked into Directory_Server_8_0_Branch:

```
Checking in ldap/servers/plugins/replication/cl5_clcache.c;
/cvs/dirsec/ldapserver/ldap/servers/plugins/replication/cl5_clcache.c,v  <--  cl5_clcache.c
new revision: 1.7.2.1; previous revision: 1.7
done
```

Fix verified (RHEL4-32, RHEL4-64, RHEL5-32, RHEL5-64, HP-UX, SOL9).

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0602.html