458679 – Fedora database is reported to be corrupt after many created users

Summary: Fedora database is reported to be corrupt after many created users

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	openldap
Sub Component:
Version:	9
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Jan Safranek
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-08-11 15:10 UTC by Marek Greško
Modified:	2008-11-06 04:04 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-11-06 04:04:28 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments:
Debug of shutting down the ldap server after which the database corruption occurs (65.55 KB, application/octet-stream) 2008-09-22 10:35 UTC, Marek Greško

Description Marek Greško 2008-08-11 15:10:09 UTC

Description of problem:
I ran a script that creates around 30 000 users with smbldap_useradd. Afterwards I stopped ldap, and when it was started again it reported the database to be corrupt.

Version-Release number of selected component (if applicable):
openldap-2.4.10-1.fc9.i386


How reproducible:


Steps to Reproduce:
1. Run the ldap server.
2. Create more than 30 000 users using openldap.
3. Restart the ldap server.
  
Actual results:
Database is reported to be corrupt.

Expected results:
Database remains consistent.

Additional info:

Comment 1 Jan Safranek 2008-09-15 13:42:14 UTC

I can't reproduce it (using simple ldapadd). I'll let smbldap_useradd run overnight in case it does something else.

Can you still reproduce the bug? It may be related to #458683. Does it work if you allow slapd more memory?

Comment 2 Marek Greško 2008-09-16 07:42:34 UTC

It is still there. Maybe I should rebuild the database now that the memory limit has been removed. When I restarted ldap (/etc/init.d/ldap restart), the shutdown failed and then the database was corrupt.

I should mention that I put this into /etc/sysconfig/ldap:

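# Run Berkeley DB recovery and a config check before slapd starts.
function recoverdatabase () {
        # Recover the BDB environment in /var/lib/ldap.
        /usr/sbin/slapd_db_recover -h /var/lib/ldap
        # Check that the slapd configuration can be loaded.
        /usr/sbin/slaptest
        # Hand the database files back to the ldap user.
        chown ldap:ldap /var/lib/ldap/*
        return 0
}

# Only recover when this file is sourced with "start" as the first argument.
if [ "x$1" = "xstart" ]; then
        recoverdatabase
fi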

Maybe if I remove it, ldap will start OK. But in that case ldap will not start after an unclean shutdown.

Comment 3 Jan Safranek 2008-09-18 13:03:48 UTC

Your code should do no harm. But I wonder what could have made ldap shut down incorrectly...

Is there anything interesting in the log with loglevel 2? Do you use an unusually slow storage device (like a network share) for the bdb database or the bdb logs?

And please post an strace of the shutdown: attach strace to the running slapd with "strace -o trace.log -r -p `pidof slapd`" and stop the ldap service in another terminal.

Comment 4 Marek Greško 2008-09-18 15:37:37 UTC

I rebuilt the database and have not seen the error since.

Something may have been wrong in the database, or, as I suspect, some write operation was still pending during shutdown and the database recovery was run on a database that had not yet been closed cleanly. Maybe I just cannot hit the right moment to reproduce it.

I tried /etc/init.d/ldap stop and /etc/init.d/ldap start and everything was working correctly. Then I tried /etc/init.d/ldap restart and also everything worked.

Probably it is not safe to do /etc/init.d/ldap restart.

I will try to identify the situations in which it occurs.

Comment 5 Jan Safranek 2008-09-19 08:01:08 UTC

(In reply to comment #4)
> Probably it is not safe to do /etc/init.d/ldap restart.

There is no difference between calling stop && start and a plain restart.

The problem could be slapd reacting slowly to the signal: "service ldap stop" sends SIGTERM to the slapd process and waits for 3 seconds. The slapd process starts to shut down, flushes all buffers, closes the database, etc., and if it can't finish within these 3 seconds, the init script kills slapd with SIGKILL in the middle of the operation. This can result in a broken database. But, according to long-term experience, 3 seconds is a lot more than slapd commonly needs to shut down... That's why I asked whether you have any unusually slow storage.

Anyway, please report back if you get the log and/or strace from a failed shutdown.
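
(Editorial sketch, not the actual Fedora init script.) The stop sequence described in this comment, SIGTERM followed by a bounded wait and then SIGKILL, might look roughly like the function below. The name stop_with_timeout and its arguments are hypothetical; the 3-second default only mirrors the value mentioned above.

```shell
#!/bin/bash
# Sketch only: ask a process to exit, then force-kill it after a timeout.
# stop_with_timeout is a hypothetical helper, not part of any openldap package.
stop_with_timeout() {
    local pid=$1
    local timeout=${2:-3}   # seconds to wait after SIGTERM (3 mirrors the thread)
    local waited=0

    # Request a clean shutdown; if the signal cannot be sent, the process is gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    # Poll once per second until the process exits or the timeout expires.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            # Still running: force termination. For a database server this is
            # the step that can interrupt a write in progress.
            kill -KILL "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

Under these assumptions, something like stop_with_timeout "$(pidof slapd)" 30 would give slapd up to 30 seconds to finish before it is killed, and a return status of 1 signals that the forced kill was needed.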

Comment 6 Marek Greško 2008-09-19 08:12:18 UTC

I am still not able to reproduce the problem after the database rebuild. I do not have any unusual storage. The database itself and the bdb logs are on a local disk. The database is now larger than 800 MB.

Comment 7 Marek Greško 2008-09-19 08:19:18 UTC

Maybe 3 seconds for shutdown is not enough when a write operation is in progress. Several concurrent smbldap-useradd accesses are possible in my setup. And maybe I just cannot hit the right moment to run /etc/init.d/ldap stop again.

Comment 8 Marek Greško 2008-09-22 10:35:04 UTC

Created attachment 317349
Debug of shutting down the ldap server after which the database corruption occurs

Comment 9 Marek Greško 2008-09-22 10:40:31 UTC

To get the database corrupted after shutdown, I ran an smbldap-usermod script that changed the home directory of each of the more than 32000 users.

When I ran /etc/init.d/ldap restart, the shutdown failed and the startup reported the database to be corrupted. After a subsequent /etc/init.d/ldap restart the shutdown worked properly, but the startup always reports a corrupted database.

After running /etc/init.d/ldap stop and /etc/init.d/ldap start everything works OK. Remember the code in comment 2, which recovers the database on /etc/init.d/ldap start.

Comment 10 Marek Greško 2008-10-08 14:50:00 UTC

I got rid of the problem by setting

checkpoint      1024 5

after the suffix line in slapd.conf. This setting makes slapd checkpoint the database every 5 minutes. Without a checkpoint setting, it does so only on slapd shutdown. When slapd has been running for a long time and many changes have been made to the bdb database, the checkpoint procedure takes longer than /etc/init.d/ldap stop is willing to wait for it to finish. With this setting, slapd can probably finish the checkpoint procedure (covering only the last 5 minutes of changes) within 3 seconds. But I would advise waiting longer than 3 seconds before killing slapd in /etc/init.d/ldap anyway. I would also advise adding the checkpoint setting to the default slapd.conf in the openldap-servers package.
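
(Editorial sketch.) The two arguments of the checkpoint directive are a size in kilobytes and an interval in minutes; a checkpoint is taken when either limit is reached, so "1024 5" means after 1024 KB written or 5 minutes, whichever comes first. A quick way to see whether a given configuration already has such a directive could be a small check like the one below. The helper name has_checkpoint is hypothetical, and the path in the usage note is assumed to be the Fedora default.

```shell
#!/bin/bash
# Sketch only: succeed if a slapd.conf-style file has an active
# "checkpoint <kbyte> <min>" directive. has_checkpoint is a hypothetical name.
has_checkpoint() {
    local conf=$1
    # Directives start in the first column, so commented-out lines
    # (starting with '#') do not match.
    grep -Eq '^checkpoint[[:space:]]+[0-9]+[[:space:]]+[0-9]+' "$conf"
}
```

For example, has_checkpoint /etc/openldap/slapd.conf || echo "no checkpoint directive set" would flag a configuration that only checkpoints at shutdown.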

Comment 11 Jan Safranek 2008-10-09 14:11:48 UTC

> But I would advise waiting longer than 3 seconds before
> killing slapd in /etc/init.d/ldap anyway.

3 seconds is enough for most users; you are the first one complaining... I'll add a new option to /etc/sysconfig/ldap where you can tune it.

> I would also advise adding the checkpoint setting to the default slapd.conf in
> the openldap-servers package.

Good idea, I'll add it there.

Comment 12 Fedora Update System 2008-10-13 12:12:23 UTC

openldap-2.4.10-2.fc9 has been submitted as an update for Fedora 9.
http://admin.fedoraproject.org/updates/openldap-2.4.10-2.fc9

Comment 13 Fedora Update System 2008-10-16 02:07:12 UTC

openldap-2.4.10-2.fc9 has been pushed to the Fedora 9 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
 su -c 'yum --enablerepo=updates-testing update openldap'.
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-8834

Comment 14 Fedora Update System 2008-11-06 04:04:25 UTC

openldap-2.4.10-2.fc9 has been pushed to the Fedora 9 stable repository.  If problems still persist, please make note of it in this bug report.
