Bug 1642838
Summary: after certain failed import operation, impossible to replay an import operation.
Product: Red Hat Enterprise Linux 7
Reporter: German Parente <gparente>
Component: 389-ds-base
Assignee: thierry bordaz <tbordaz>
Status: CLOSED ERRATA
QA Contact: RHDS QE <ds-qe-bugs>
Severity: unspecified
Priority: high
Version: 7.4
CC: aadhikar, cpelland, joboyer, lkrispen, mreynolds, msauton, nkinder, pasik, rmeggins, spichugi, tbordaz, tmihinto, vashirov
Target Milestone: rc
Keywords: ZStream
Target Release: 7.7
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 389-ds-base-1.3.9.1-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Cause: Pending operations permanently mark a backend as busy.
Consequence: If an import is rejected because of pending operations, further import attempts will fail.
Fix: If there are pending operations at the beginning of an import, reject the import without marking the backend as busy.
Result: A failing import no longer prevents further imports.
Last Closed: 2019-08-06 12:58:51 UTC
Type: Bug
Bug Blocks: 1672177
Description
German Parente
2018-10-25 07:54:14 UTC
Adding description here:

After an on-line import failure, if the customer does an on-line import again, it fails continuously.

/usr/lib64/dirsrv/slapd-instance/ldif2db.pl -n userRoot -D "cn=directory Manager" -w password -i /var/opt/dr_bak.ldif -P LDAP
ldap_add: Operations error (1)
Failed to add task entry "cn=import_2018_10_24_8_6_11, cn=import, cn=tasks, cn=config" error (1)

and we can see this in the errors log:

[23/Oct/2018:04:20:04.954458711 +0300] - ERR - ldbm_back_ldif2ldbm - ldbm: 'userRoot' is already in the middle of another task and cannot be disturbed.

I cannot yet find the cause of the first failure, but the instance is set to busy:

int
instance_set_busy(ldbm_instance *inst)
{
    PR_Lock(inst->inst_config_mutex);
    if (is_instance_busy(inst)) {
        PR_Unlock(inst->inst_config_mutex);
        return -1;
    }

    inst->inst_flags |= INST_FLAG_BUSY;
    PR_Unlock(inst->inst_config_mutex);
    return 0;
}

Since a new import first checks whether the instance is busy and refuses to proceed if it is, there is no way to get out of this situation other than restarting the server. The "unbusy" is here:

/* mark a backend instance to be not "busy" anymore */
void
instance_set_not_busy(ldbm_instance *inst)
{
    int readonly;

    PR_Lock(inst->inst_config_mutex);
    inst->inst_flags &= ~INST_FLAG_BUSY;
    /* set backend readonly flag to match instance flags again
     * (sometimes the instance changes the readonly status when it's busy) */
    readonly = (inst->inst_flags & INST_FLAG_READONLY ? 1 : 0);
    slapi_mtn_be_set_readonly(inst->inst_be, readonly);
    PR_Unlock(inst->inst_config_mutex);
}

and it is called in int ldbm_back_ldif2ldbm(Slapi_PBlock *pb) at the "fail:" label:

fail:
    /* DON'T enable the backend -- leave it offline */
    instance_set_not_busy(inst);
    return ret;

But could there be a point where we exit without "goto fail", never setting the instance back to "not busy"? I see several such points. I will try to reproduce it.

German, I agree with your analysis and also with using a restart as a workaround. Did you manage to reproduce the problem? The hang being due to the backend being busy, can you reproduce it systematically? If the problem happens again with the customer, can we please get a full stack trace of the process?
http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-hangs

gdb -ex 'set confirm off' -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'quit' /usr/sbin/ns-slapd `pidof ns-slapd` > stacktrace.`date +%s`.txt 2>&1

Thanks!

I did some investigation on the available data: the dataset from Oct and one from Nov. It looks like the application changed in between.

The data from Oct show that this pattern ran continuously, every hour, while the server was running:
- do an offline ldif export (this is done in a separate process)
- do an online ldif import task just after this

I suspected that the offline process could fail and keep "something" locked/busy, but had a hard time producing any issue. What I could trigger was that if I killed the offline export process, the next import task was hanging, waiting for a cond var. At that point the backend was set busy and the next added import tasks were getting the busy error, while that task was hanging forever.

I could not get any confirmation from the data that something similar was happening, also because the access logs for the time the problem started had been rotated away. So I turned to a data set from Nov and looked at a failure starting Nov 16th. What I noticed was that the application pattern had changed.
Now we have:
- do an online ldif export task
- do an online ldif import task

There are a few searches for the tasks, probably to check whether the task has finished, but not systematically. It also looks like the log level had been increased, slowing everything down. In the cases where we did see the failure, the behaviour was correct: the export task had not completed when the import task was added. Here is a merged access and error log snippet for a pair of export/import tasks:

[16/Nov/2018:18:15:41.160628481 +0200] conn=11525 fd=70 slot=70 connection from 2a00:8a00:4000:20f::a:4 to 2a00:8a00:4000:20f::a:4
[16/Nov/2018:18:15:42.493124254 +0200] conn=11525 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
[16/Nov/2018:18:15:42.512744374 +0200] conn=11525 op=0 RESULT err=0 tag=120 nentries=0 etime=0.0105717998
[16/Nov/2018:18:15:42.860370397 +0200] conn=11525 TLS1.2 256-bit AES-GCM
[16/Nov/2018:18:15:42.960040847 +0200] conn=11525 op=1 BIND dn="cn=Manager" method=128 version=3
[16/Nov/2018:18:15:43.163769962 +0200] conn=11525 op=1 RESULT err=0 tag=97 nentries=0 etime=0.0326843079 dn="cn=manager"
[16/Nov/2018:18:15:43.846598401 +0200] conn=11525 op=2 ADD dn="cn=export_2018_11_16_18_15_26,cn=export,cn=tasks,cn=config"
[16/Nov/2018:18:15:55.647976124 +0200] conn=11525 op=2 RESULT err=0 tag=105 nentries=0 etime=12.0334717669
[16/Nov/2018:18:15:57.076186999 +0200] conn=11525 op=3 UNBIND
[16/Nov/2018:18:15:57.084377035 +0200] conn=11525 op=3 fd=70 closed - U1
[16/Nov/2018:18:15:57.476812346 +0200] - INFO - task_export_thread - Beginning export of 'userRoot'
[16/Nov/2018:18:15:59.679047981 +0200] conn=11555 fd=68 slot=68 connection from 2a00:8a00:4000:20f::a:4 to 2a00:8a00:4000:20f::a:4
[16/Nov/2018:18:15:59.944913467 +0200] conn=11555 op=0 BIND dn="cn=Manager" method=128 version=3
[16/Nov/2018:18:16:00.181729337 +0200] conn=11555 op=0 RESULT err=0 tag=97 nentries=0 etime=0.0372103288 dn="cn=manager"
[16/Nov/2018:18:16:01.186641671 +0200] conn=11555 op=1 ADD dn="cn=import_2018_11_16_18_15_59,cn=import,cn=tasks,cn=config"
[16/Nov/2018:18:16:01.350433164 +0200] - ERR - ldbm_back_ldif2ldbm - ldbm: 'userRoot' is already in the middle of another task and cannot be disturbed.
[16/Nov/2018:18:16:01.612748700 +0200] conn=11555 op=1 RESULT err=1 tag=105 nentries=0 etime=1.0040462601
[16/Nov/2018:18:16:01.796121815 +0200] conn=11555 op=2 UNBIND
[16/Nov/2018:18:16:01.797475881 +0200] conn=11555 op=2 fd=68 closed - U1
[16/Nov/2018:18:16:07.328750893 +0200] - INFO - export_one_entry - export userRoot: Processed 1000 entries (37%).
[16/Nov/2018:18:16:11.393718184 +0200] - INFO - export_one_entry - export userRoot: Processed 2000 entries (74%).
[16/Nov/2018:18:16:14.477029740 +0200] - INFO - ldbm_back_ldbm2ldif - export userRoot: Processed 2676 entries (100%).
[16/Nov/2018:18:16:14.875705542 +0200] - INFO - task_export_thread - Export finished.

So the "is already in the middle" message is logged while the export is running.

Created attachment 1515859 [details]
Tentative patch to confirm the root cause
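
As a rough illustration of the approach (this is only a sketch, not the attached patch or the final upstream change): count the operations still pending on the backend at the very start of the online import and reject the task before instance_set_busy() is ever called, so a rejected attempt leaves no busy flag behind. instance_pending_ops() below is a hypothetical helper standing in for however the server actually counts in-flight operations.

static int
import_check_pending(ldbm_instance *inst)
{
    /* Hypothetical helper: returns the number of operations still
     * in flight against this backend instance. */
    int pending = instance_pending_ops(inst);

    if (pending > 0) {
        /* Reject early, BEFORE instance_set_busy() is called, so the
         * backend is not left flagged busy and a later import attempt
         * can still succeed once the load has drained. */
        slapi_log_err(SLAPI_LOG_ERR, "ldbm_back_ldif2ldbm",
                      "ldbm: '%s' there are %d pending operation(s). "
                      "Import can not proceed until they are completed.\n",
                      inst->inst_name, pending);
        return -1;
    }

    /* Only now take the busy flag; every later failure path must still
     * go through the fail: label so instance_set_not_busy() runs. */
    return instance_set_busy(inst);
}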
upstream ticket https://pagure.io/389-ds-base/issue/50117

Upstream ticket pushed -> POST

Hi, the verification steps are not very clear, can you please provide some steps? Thanks!

Akshay, the test case was not written, but I think it is feasible:

1 - Create an instance and provision users
2 - Export (online or offline) the backend into an ldif file (e.g. /var/opt/dr_bak.ldif)
3 - Start the instance
4 - Apply a high load of operations (search/add/mod) (a sketch of one possible load loop appears at the end of this report)
5 - Run an online import:
    /usr/lib64/dirsrv/slapd-instance/ldif2db.pl -n userRoot -D "cn=directory Manager" -w password -i /var/opt/dr_bak.ldif -P LDAP
    The import should fail with a message like "Import can not proceed until they are completed"
6 - Stop the load
7 - Rerun the online import (step 5); it should succeed

Build Tested: 389-ds-base-1.3.9.1-7.el7.x86_64

Steps:

1) Create an instance and provision users

2) Export (online or offline) the backend into an ldif file (e.g. /etc/dirsrv/slapd-standalone1/dr_bak.ldif)
   db2ldif.pl -Z standalone1 -n userRoot -D "cn=directory Manager" -w password -a /etc/dirsrv/slapd-standalone1/dr_baknew.ldif -P LDAP

3) Start the instance if exported in offline mode

4) Apply a high load of operations (search/add/mod)

5) Run an online import
   ldif2db.pl -Z standalone1 -n userRoot -D "cn=directory Manager" -w password -i /etc/dirsrv/slapd-standalone1/dr_bak.ldif -P LDAP
   ldap_add: Operations error (1)
   Failed to add task entry "cn=import_2019_5_21_6_29_37, cn=import, cn=tasks, cn=config" error (1)

   Error log:
   [21/May/2019:06:29:37.119685538 -0400] - ERR - ldbm_back_ldif2ldbm - ldbm: 'userRoot' there are 1 pending operation(s). Import can not proceed until they are completed.

6) Stop the load

7) Rerun the online import (step 5); it should succeed
   ldif2db.pl -Z standalone1 -n userRoot -D "cn=directory Manager" -w password -i /etc/dirsrv/slapd-standalone1/dr_bak.ldif -P LDAP
   Successfully added task entry "cn=import_2019_5_21_6_29_41, cn=import, cn=tasks, cn=config"

Marking it as VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2152
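
For step 4 above ("apply a high load of operations"), the report does not show how the load was generated. One possible sketch, not taken from the original report, is a simple client loop; the suffix dc=example,dc=com and the target entry are assumptions that must be adjusted to match the provisioned data:

#!/bin/sh
# Hedged sketch: keep a steady search + modify load on the backend while the
# online import task is added from another shell.  Adjust host, port, bind DN,
# password, suffix, and target entry for the instance under test.
while true; do
    ldapsearch -x -H ldap://localhost:389 \
        -D "cn=directory manager" -w password \
        -b "dc=example,dc=com" "(objectClass=person)" dn >/dev/null 2>&1
    ldapmodify -x -H ldap://localhost:389 \
        -D "cn=directory manager" -w password >/dev/null 2>&1 <<EOF
dn: uid=user1,ou=people,dc=example,dc=com
changetype: modify
replace: description
description: load $(date +%s)
EOF
done

Running several copies of this loop in parallel while the import task is added should trigger the "pending operation(s)" rejection; stopping the loops corresponds to step 6.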