Bug 1642838 - after certain failed import operation, impossible to replay an import operation.
Summary: after certain failed import operation, impossible to replay an import operation.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: 389-ds-base
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: rc
Target Release: 7.7
Assignee: thierry bordaz
QA Contact: RHDS QE
URL:
Whiteboard:
Depends On:
Blocks: 1672177
 
Reported: 2018-10-25 07:54 UTC by German Parente
Modified: 2020-09-13 22:17 UTC
CC List: 13 users

Fixed In Version: 389-ds-base-1.3.9.1-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Cause: Pending operations permanently mark a backend as busy.
Consequence: If an import is rejected because of pending operations, further import attempts fail.
Fix: If there are pending operations at the beginning of an import, reject the import without marking the backend as busy.
Result: A failed import does not prevent further imports.
Clone Of:
: 1672177 (view as bug list)
Environment:
Last Closed: 2019-08-06 12:58:51 UTC
Target Upstream Version:
Embargoed:


Attachments
Tentative patch to confirm the root cause (1.69 KB, application/mbox), 2018-12-20 13:56 UTC, thierry bordaz, no flags


Links
Github 389ds/389-ds-base issue 3176 (closed): after certain failed import operation, impossible to replay an import operation (last updated 2020-10-09 07:00:09 UTC)
Red Hat Product Errata RHBA-2019:2152 (last updated 2019-08-06 12:59:06 UTC)

Description German Parente 2018-10-25 07:54:14 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 German Parente 2018-10-25 08:11:43 UTC
Adding description here:

After an online import failure, if the customer retries the import online, it fails continuously.


/usr/lib64/dirsrv/slapd-instance/ldif2db.pl -n userRoot -D "cn=directory Manager" -w password  -i /var/opt/dr_bak.ldif -P LDAP
ldap_add: Operations error (1)
Failed to add task entry "cn=import_2018_10_24_8_6_11, cn=import, cn=tasks, cn=config" error (1)

and we can see this in the errors log:

[23/Oct/2018:04:20:04.954458711 +0300] - ERR - ldbm_back_ldif2ldbm - ldbm: 'userRoot' is already in the middle of another task and cannot be disturbed.

I cannot yet find the cause of the first failure, but the instance is set to busy:


int
instance_set_busy(ldbm_instance *inst)
{
    PR_Lock(inst->inst_config_mutex);
    if (is_instance_busy(inst)) {
        PR_Unlock(inst->inst_config_mutex);
        return -1;
    }

    inst->inst_flags |= INST_FLAG_BUSY;
    PR_Unlock(inst->inst_config_mutex);
    return 0;
}

and since, when we try a new import, we first check whether the instance is set to busy and bail out if it is, there is no way to exit from this situation other than restarting the server.

The "unbusy" is here:

/* mark a backend instance to be not "busy" anymore */
void
instance_set_not_busy(ldbm_instance *inst)
{
    int readonly;

    PR_Lock(inst->inst_config_mutex);
    inst->inst_flags &= ~INST_FLAG_BUSY;
    /* set backend readonly flag to match instance flags again
     * (sometimes the instance changes the readonly status when it's busy)
     */
    readonly = (inst->inst_flags & INST_FLAG_READONLY ? 1 : 0);
    slapi_mtn_be_set_readonly(inst->inst_be, readonly);
    PR_Unlock(inst->inst_config_mutex);
}

and it's called in

int
ldbm_back_ldif2ldbm(Slapi_PBlock *pb)

in "fail:" label

fail:
    /* DON'T enable the backend -- leave it offline */
    instance_set_not_busy(inst);
    return ret;

but could there be a point where we exit without "goto fail", and hence without setting the instance back to "not busy"? I see several such points. I will try to reproduce it.
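
For illustration only, here is a standalone C sketch (not the 389-ds-base sources; the struct, the function names, and the pthread mutex are stand-ins for ldbm_instance, INST_FLAG_BUSY, and inst_config_mutex) of exactly such an early-return path that leaves the busy flag set until restart, alongside the fix shape described in the Doc Text above, i.e. rejecting an import because of pending operations before the flag is ever set:

/* Standalone illustration only -- not the 389-ds-base code. A toy "backend"
 * with a busy flag: import_broken() models an error path that returns before
 * clearing the flag, import_fixed() rejects up front, before the flag is set.
 * Build: cc busy_demo.c -o busy_demo -lpthread
 */
#include <stdio.h>
#include <pthread.h>

struct backend {
    pthread_mutex_t lock;   /* stands in for inst_config_mutex  */
    int busy;               /* stands in for INST_FLAG_BUSY     */
    int pending_ops;        /* stands in for pending operations */
};

static int set_busy(struct backend *be)
{
    pthread_mutex_lock(&be->lock);
    if (be->busy) {
        pthread_mutex_unlock(&be->lock);
        return -1;          /* "already in the middle of another task" */
    }
    be->busy = 1;
    pthread_mutex_unlock(&be->lock);
    return 0;
}

static void set_not_busy(struct backend *be)
{
    pthread_mutex_lock(&be->lock);
    be->busy = 0;
    pthread_mutex_unlock(&be->lock);
}

/* Pre-fix shape: busy is taken first, then a later failure returns
 * without ever reaching the cleanup that clears it. */
static int import_broken(struct backend *be)
{
    if (set_busy(be) != 0)
        return -1;          /* rejected: backend already busy */
    if (be->pending_ops > 0)
        return -1;          /* BUG: early return, busy never cleared */
    /* ... do the import ... */
    set_not_busy(be);
    return 0;
}

/* Post-fix shape: reject because of pending operations *before*
 * marking the backend busy, so a failed attempt leaves no trace. */
static int import_fixed(struct backend *be)
{
    if (be->pending_ops > 0)
        return -1;          /* rejected, but the backend stays clean */
    if (set_busy(be) != 0)
        return -1;
    /* ... do the import ... */
    set_not_busy(be);
    return 0;
}

int main(void)
{
    struct backend broken = { PTHREAD_MUTEX_INITIALIZER, 0, 1 };
    struct backend fixed  = { PTHREAD_MUTEX_INITIALIZER, 0, 1 };

    printf("broken: 1st import -> %d\n", import_broken(&broken)); /* -1, pending ops */
    broken.pending_ops = 0;
    printf("broken: 2nd import -> %d\n", import_broken(&broken)); /* -1, still busy  */

    printf("fixed:  1st import -> %d\n", import_fixed(&fixed));   /* -1, pending ops */
    fixed.pending_ops = 0;
    printf("fixed:  2nd import -> %d\n", import_fixed(&fixed));   /* 0, succeeds     */
    return 0;
}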

Comment 3 thierry bordaz 2018-10-25 08:56:24 UTC
German, I agree with your analysis and also with using a restart as the workaround.
Did you manage to reproduce the problem? The hang being due to the backend being busy, can you reproduce it systematically?

Comment 7 mreynolds 2018-11-30 17:01:25 UTC
If the problem happens again with the customer, can we please get a full stack trace of the process?

http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-hangs

gdb -ex 'set confirm off' -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'quit' /usr/sbin/ns-slapd `pidof ns-slapd` > stacktrace.`date +%s`.txt 2>&1

Thanks!

Comment 11 Ludwig 2018-12-05 14:51:24 UTC
I did some investigation on the available data, one dataset from Oct and one from Nov; it looks like the application has changed in between.

The data from Oct show that this pattern occurred continuously, every hour, while the server was running:
- do an offline ldif export (this is done in a separate process)
- do an online ldif import task just after this.

I was suspecting that the offline process could fail and keep "something" locked/busy, but had a hard time producing any issue. What I could trigger was that if I killed the offline export process, the next import task was hanging, waiting for a cond var. But at that time the backend was set busy and the next added import tasks were getting the busy error - while this task was hanging forever. I could not get any confirmation from the data that something similar was happening, also because the access logs for the time the problem started had been rotated away.

So I turned to a data set from Nov and looked at a failure starting Nov 16th.

What I noticed was that the application pattern had changed. Now we have:
- do an online ldif export task
- do an online ldif import task
there are a few searches for the tasks, probably to check if the task has finished, but not systematically.
Also it looks like the log level had been increased, slowing everything down.

In the cases where I did see the failure, the behaviour was correct: the export task was not completed when the import task was added.

Here is a merged access and error log snippet for a pair of export/import tasks:


[16/Nov/2018:18:15:41.160628481 +0200] conn=11525 fd=70 slot=70 connection from 2a00:8a00:4000:20f::a:4 to 2a00:8a00:4000:20f::a:4
[16/Nov/2018:18:15:42.493124254 +0200] conn=11525 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
[16/Nov/2018:18:15:42.512744374 +0200] conn=11525 op=0 RESULT err=0 tag=120 nentries=0 etime=0.0105717998
[16/Nov/2018:18:15:42.860370397 +0200] conn=11525 TLS1.2 256-bit AES-GCM
[16/Nov/2018:18:15:42.960040847 +0200] conn=11525 op=1 BIND dn="cn=Manager" method=128 version=3
[16/Nov/2018:18:15:43.163769962 +0200] conn=11525 op=1 RESULT err=0 tag=97 nentries=0 etime=0.0326843079 dn="cn=manager"
[16/Nov/2018:18:15:43.846598401 +0200] conn=11525 op=2 ADD dn="cn=export_2018_11_16_18_15_26,cn=export,cn=tasks,cn=config"
[16/Nov/2018:18:15:55.647976124 +0200] conn=11525 op=2 RESULT err=0 tag=105 nentries=0 etime=12.0334717669
[16/Nov/2018:18:15:57.076186999 +0200] conn=11525 op=3 UNBIND
[16/Nov/2018:18:15:57.084377035 +0200] conn=11525 op=3 fd=70 closed - U1
[16/Nov/2018:18:15:57.476812346 +0200] - INFO - task_export_thread - Beginning export of 'userRoot'
[16/Nov/2018:18:15:59.679047981 +0200] conn=11555 fd=68 slot=68 connection from 2a00:8a00:4000:20f::a:4 to 2a00:8a00:4000:20f::a:4
[16/Nov/2018:18:15:59.944913467 +0200] conn=11555 op=0 BIND dn="cn=Manager" method=128 version=3
[16/Nov/2018:18:16:00.181729337 +0200] conn=11555 op=0 RESULT err=0 tag=97 nentries=0 etime=0.0372103288 dn="cn=manager"
[16/Nov/2018:18:16:01.186641671 +0200] conn=11555 op=1 ADD dn="cn=import_2018_11_16_18_15_59,cn=import,cn=tasks,cn=config"
[16/Nov/2018:18:16:01.350433164 +0200] - ERR - ldbm_back_ldif2ldbm - ldbm: 'userRoot' is already in the middle of another task and cannot be disturbed.
[16/Nov/2018:18:16:01.612748700 +0200] conn=11555 op=1 RESULT err=1 tag=105 nentries=0 etime=1.0040462601
[16/Nov/2018:18:16:01.796121815 +0200] conn=11555 op=2 UNBIND
[16/Nov/2018:18:16:01.797475881 +0200] conn=11555 op=2 fd=68 closed - U1
[16/Nov/2018:18:16:07.328750893 +0200] - INFO - export_one_entry - export userRoot: Processed 1000 entries (37%).
[16/Nov/2018:18:16:11.393718184 +0200] - INFO - export_one_entry - export userRoot: Processed 2000 entries (74%).
[16/Nov/2018:18:16:14.477029740 +0200] - INFO - ldbm_back_ldbm2ldif - export userRoot: Processed 2676 entries (100%).
[16/Nov/2018:18:16:14.875705542 +0200] - INFO - task_export_thread - Export finished.

so the "is already in the middle" message is logged while the export is running.

Comment 21 thierry bordaz 2018-12-20 13:56:03 UTC
Created attachment 1515859 [details]
Tentative patch to confirm the root cause

Comment 23 thierry bordaz 2019-01-04 09:36:31 UTC
upstream ticket https://pagure.io/389-ds-base/issue/50117

Comment 29 thierry bordaz 2019-01-07 08:49:37 UTC
Upstream ticket pushed -> POST

Comment 33 Akshay Adhikari 2019-05-15 05:52:19 UTC
Hi, the verification steps are not very clear; can you please provide some steps? Thanks!

Comment 34 thierry bordaz 2019-05-21 09:36:07 UTC
Akshay,

The testcase was not written but I think it is feasible.
1 - Create an instance and provision users
2 - export (online or offline) the backend into an ldif file (e.g. /var/opt/dr_bak.ldif)
3 - start the instance
4 - apply a high load of operations (search/add/mod); see the load sketch after these steps.
5 - run an online import  /usr/lib64/dirsrv/slapd-instance/ldif2db.pl -n userRoot -D "cn=directory Manager" -w password  -i /var/opt/dr_bak.ldif -P LDAP
    Import should fail with a message like ("Import can not proceed until they are completed")
6 - stop the load
7 - rerun the online import (step 5), it should succeed
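
For step 4, anything that keeps client operations in flight against userRoot during the import attempt should do. Here is a minimal standalone libldap sketch (illustration only; the entry DN, attribute, credentials and URL are assumptions) that loops over modify operations; run several copies in parallel for a heavier load:

/* Illustration only: a simple load generator that keeps modify operations
 * running so that an import task added meanwhile sees pending operations.
 * DN, attribute, credentials and URL are placeholders.
 * Build: cc load.c -o load -lldap -llber
 */
#include <stdio.h>
#include <string.h>
#include <ldap.h>

int main(void)
{
    LDAP *ld = NULL;
    int version = LDAP_VERSION3;
    struct berval cred = { strlen("password"), "password" };
    char *vals[] = { "load test", NULL };
    LDAPMod mod = { LDAP_MOD_REPLACE, "description", { vals } };
    LDAPMod *mods[] = { &mod, NULL };

    if (ldap_initialize(&ld, "ldap://localhost:389") != LDAP_SUCCESS)
        return 1;
    ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &version);
    if (ldap_sasl_bind_s(ld, "cn=directory manager", LDAP_SASL_SIMPLE,
                         &cred, NULL, NULL, NULL) != LDAP_SUCCESS)
        return 1;

    /* Hammer one entry with modifies until stopped or the loop ends. */
    for (int i = 0; i < 100000; i++) {
        int rc = ldap_modify_ext_s(ld, "uid=user0001,ou=people,dc=example,dc=com",
                                   mods, NULL, NULL);
        if (rc != LDAP_SUCCESS) {
            fprintf(stderr, "modify: %s\n", ldap_err2string(rc));
            break;
        }
    }

    ldap_unbind_ext_s(ld, NULL, NULL);
    return 0;
}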

Comment 35 Akshay Adhikari 2019-05-22 05:43:01 UTC
Build Tested: 389-ds-base-1.3.9.1-7.el7.x86_64

Steps:

1) Create an instance and provision users

2) Export (online or offline) the backend into an ldif file (e.g. /etc/dirsrv/slapd-standalone1/dr_bak.ldif)

db2ldif.pl -Z standalone1 -n userRoot -D "cn=directory Manager" -w password -a /etc/dirsrv/slapd-standalone1/dr_baknew.ldif -P LDAP

3) Start the instance if exported in offline mode

4) Apply a high load of operations (search/add/mod)

5) Run an online import 

ldif2db.pl -Z standalone1 -n userRoot -D "cn=directory Manager" -w password  -i /etc/dirsrv/slapd-standalone1/dr_bak.ldif -P LDAP
ldap_add: Operations error (1)
Failed to add task entry "cn=import_2019_5_21_6_29_37, cn=import, cn=tasks, cn=config" error (1)

Error log:
[21/May/2019:06:29:37.119685538 -0400] - ERR - ldbm_back_ldif2ldbm - ldbm: 'userRoot' there are 1 pending operation(s). Import can not proceed until they are completed.

6) Stop the load

7) Rerun the online import (step 5), it should succeed

ldif2db.pl -Z standalone1 -n userRoot -D "cn=directory Manager" -w password  -i /etc/dirsrv/slapd-standalone1/dr_bak.ldif -P LDAP
Successfully added task entry "cn=import_2019_5_21_6_29_41, cn=import, cn=tasks, cn=config"

Marking it as VERIFIED.

Comment 37 errata-xmlrpc 2019-08-06 12:58:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2152

