Bug 1642838
Summary: after certain failed import operation, impossible to replay an import operation.
Product: Red Hat Enterprise Linux 7
Reporter: German Parente <gparente>
Component: 389-ds-base
Assignee: thierry bordaz <tbordaz>
Status: CLOSED ERRATA
QA Contact: RHDS QE <ds-qe-bugs>
Severity: unspecified
Priority: high
Version: 7.4
CC: aadhikar, cpelland, joboyer, lkrispen, mreynolds, msauton, nkinder, pasik, rmeggins, spichugi, tbordaz, tmihinto, vashirov
Target Milestone: rc
Keywords: ZStream
Target Release: 7.7
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 389-ds-base-1.3.9.1-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Cause: Pending operations permanently mark a backend as busy.
Consequence: If an import is rejected because of pending operations, further import attempts will fail.
Fix: If there are pending operations at the beginning of an import, reject the import without marking the backend as busy.
Result: A failing import no longer prevents further imports.
Last Closed: 2019-08-06 12:58:51 UTC
Type: Bug
Bug Blocks: 1672177
Description
German Parente
2018-10-25 07:54:14 UTC
Adding description here:

After an on-line import failure, if the customer does an on-line import again, it fails continuously.

/usr/lib64/dirsrv/slapd-instance/ldif2db.pl -n userRoot -D "cn=directory Manager" -w password -i /var/opt/dr_bak.ldif -P LDAP
ldap_add: Operations error (1)
Failed to add task entry "cn=import_2018_10_24_8_6_11, cn=import, cn=tasks, cn=config" error (1)

and we can see this in the errors log:

[23/Oct/2018:04:20:04.954458711 +0300] - ERR - ldbm_back_ldif2ldbm - ldbm: 'userRoot' is already in the middle of another task and cannot be disturbed.

I cannot yet find the cause of the first failure, but the instance is set to busy:

int
instance_set_busy(ldbm_instance *inst)
{
    PR_Lock(inst->inst_config_mutex);
    if (is_instance_busy(inst)) {
        PR_Unlock(inst->inst_config_mutex);
        return -1;
    }

    inst->inst_flags |= INST_FLAG_BUSY;
    PR_Unlock(inst->inst_config_mutex);
    return 0;
}

Since a new import first checks whether the instance is busy and refuses to proceed if it is, there is no way to get out of this situation other than restarting the server. The "unbusy" is here:

/* mark a backend instance to be not "busy" anymore */
void
instance_set_not_busy(ldbm_instance *inst)
{
    int readonly;

    PR_Lock(inst->inst_config_mutex);
    inst->inst_flags &= ~INST_FLAG_BUSY;
    /* set backend readonly flag to match instance flags again
     * (sometimes the instance changes the readonly status when it's busy) */
    readonly = (inst->inst_flags & INST_FLAG_READONLY ? 1 : 0);
    slapi_mtn_be_set_readonly(inst->inst_be, readonly);
    PR_Unlock(inst->inst_config_mutex);
}

and it is called in int ldbm_back_ldif2ldbm(Slapi_PBlock *pb) at the "fail:" label:

fail:
    /* DON'T enable the backend -- leave it offline */
    instance_set_not_busy(inst);
    return ret;

But could there be a point where we exit without "goto fail", never setting the instance back to "not busy"? I see several such points. I will try to reproduce it.

German, I agree with your analysis and also with using a restart as a workaround. Did you manage to reproduce the problem? The hang being due to the backend being busy, can you reproduce it systematically? If the problem happens again with the customer, can we please get a full stack trace of the process?
http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-hangs

gdb -ex 'set confirm off' -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'quit' /usr/sbin/ns-slapd `pidof ns-slapd` > stacktrace.`date +%s`.txt 2>&1

Thanks!

I did some investigation on the available data: the dataset from Oct and one from Nov. It looks like the application changed in between.

The data from Oct show that this pattern ran continuously, every hour, while the server was running:
- do an offline ldif export (this is done in a separate process)
- do an online ldif import task just after this

I suspected that the offline process could fail and keep "something" locked/busy, but had a hard time producing any issue. What I could trigger was that if I killed the offline export process, the next import task was hanging, waiting for a cond var. At that point the backend was set busy and the next added import tasks were getting the busy error, while that task was hanging forever.

I could not get any confirmation from the data that something similar was happening, also because the access logs for the time the problem started had been rotated away. So I turned to a data set from Nov and looked at a failure starting Nov 16th. What I noticed was that the application pattern had changed.
Now we have:
- do an online ldif export task
- do an online ldif import task

There are a few searches for the tasks, probably to check whether the task has finished, but not systematically. It also looks like the log level had been increased, slowing everything down. In the cases where we did see the failure, the behaviour was correct: the export task had not completed when the import task was added. Here is a merged access and error log snippet for a pair of export/import tasks:

[16/Nov/2018:18:15:41.160628481 +0200] conn=11525 fd=70 slot=70 connection from 2a00:8a00:4000:20f::a:4 to 2a00:8a00:4000:20f::a:4
[16/Nov/2018:18:15:42.493124254 +0200] conn=11525 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
[16/Nov/2018:18:15:42.512744374 +0200] conn=11525 op=0 RESULT err=0 tag=120 nentries=0 etime=0.0105717998
[16/Nov/2018:18:15:42.860370397 +0200] conn=11525 TLS1.2 256-bit AES-GCM
[16/Nov/2018:18:15:42.960040847 +0200] conn=11525 op=1 BIND dn="cn=Manager" method=128 version=3
[16/Nov/2018:18:15:43.163769962 +0200] conn=11525 op=1 RESULT err=0 tag=97 nentries=0 etime=0.0326843079 dn="cn=manager"
[16/Nov/2018:18:15:43.846598401 +0200] conn=11525 op=2 ADD dn="cn=export_2018_11_16_18_15_26,cn=export,cn=tasks,cn=config"
[16/Nov/2018:18:15:55.647976124 +0200] conn=11525 op=2 RESULT err=0 tag=105 nentries=0 etime=12.0334717669
[16/Nov/2018:18:15:57.076186999 +0200] conn=11525 op=3 UNBIND
[16/Nov/2018:18:15:57.084377035 +0200] conn=11525 op=3 fd=70 closed - U1
[16/Nov/2018:18:15:57.476812346 +0200] - INFO - task_export_thread - Beginning export of 'userRoot'
[16/Nov/2018:18:15:59.679047981 +0200] conn=11555 fd=68 slot=68 connection from 2a00:8a00:4000:20f::a:4 to 2a00:8a00:4000:20f::a:4
[16/Nov/2018:18:15:59.944913467 +0200] conn=11555 op=0 BIND dn="cn=Manager" method=128 version=3
[16/Nov/2018:18:16:00.181729337 +0200] conn=11555 op=0 RESULT err=0 tag=97 nentries=0 etime=0.0372103288 dn="cn=manager"
[16/Nov/2018:18:16:01.186641671 +0200] conn=11555 op=1 ADD dn="cn=import_2018_11_16_18_15_59,cn=import,cn=tasks,cn=config"
[16/Nov/2018:18:16:01.350433164 +0200] - ERR - ldbm_back_ldif2ldbm - ldbm: 'userRoot' is already in the middle of another task and cannot be disturbed.
[16/Nov/2018:18:16:01.612748700 +0200] conn=11555 op=1 RESULT err=1 tag=105 nentries=0 etime=1.0040462601
[16/Nov/2018:18:16:01.796121815 +0200] conn=11555 op=2 UNBIND
[16/Nov/2018:18:16:01.797475881 +0200] conn=11555 op=2 fd=68 closed - U1
[16/Nov/2018:18:16:07.328750893 +0200] - INFO - export_one_entry - export userRoot: Processed 1000 entries (37%).
[16/Nov/2018:18:16:11.393718184 +0200] - INFO - export_one_entry - export userRoot: Processed 2000 entries (74%).
[16/Nov/2018:18:16:14.477029740 +0200] - INFO - ldbm_back_ldbm2ldif - export userRoot: Processed 2676 entries (100%).
[16/Nov/2018:18:16:14.875705542 +0200] - INFO - task_export_thread - Export finished.

So the "is already in the middle" message is logged while the export is running.

Created attachment 1515859 [details]
Tentative patch to confirm the root cause
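
As a rough illustration of the approach (this is only a sketch, not the attached patch or the final upstream change): count the operations still pending on the backend at the very start of the online import and reject the task before instance_set_busy() is ever called, so a rejected attempt leaves no busy flag behind. instance_pending_ops() below is a hypothetical helper standing in for however the server actually counts in-flight operations.

static int
import_check_pending(ldbm_instance *inst)
{
    /* Hypothetical helper: returns the number of operations still
     * in flight against this backend instance. */
    int pending = instance_pending_ops(inst);

    if (pending > 0) {
        /* Reject early, BEFORE instance_set_busy() is called, so the
         * backend is not left flagged busy and a later import attempt
         * can still succeed once the load has drained. */
        slapi_log_err(SLAPI_LOG_ERR, "ldbm_back_ldif2ldbm",
                      "ldbm: '%s' there are %d pending operation(s). "
                      "Import can not proceed until they are completed.\n",
                      inst->inst_name, pending);
        return -1;
    }

    /* Only now take the busy flag; every later failure path must still
     * go through the fail: label so instance_set_not_busy() runs. */
    return instance_set_busy(inst);
}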
upstream ticket https://pagure.io/389-ds-base/issue/50117

Upstream ticket pushed -> POST

Hi, the verification steps are not very clear, can you please provide some steps? Thanks!

Akshay, the test case was not written, but I think it is feasible:

1 - Create an instance and provision users
2 - Export (online or offline) the backend into an ldif file (e.g. /var/opt/dr_bak.ldif)
3 - Start the instance
4 - Apply a high load of operations (search/add/mod) (a sketch of one possible load loop appears at the end of this report)
5 - Run an online import:
    /usr/lib64/dirsrv/slapd-instance/ldif2db.pl -n userRoot -D "cn=directory Manager" -w password -i /var/opt/dr_bak.ldif -P LDAP
    The import should fail with a message like "Import can not proceed until they are completed"
6 - Stop the load
7 - Rerun the online import (step 5); it should succeed

Build Tested: 389-ds-base-1.3.9.1-7.el7.x86_64

Steps:

1) Create an instance and provision users

2) Export (online or offline) the backend into an ldif file (e.g. /etc/dirsrv/slapd-standalone1/dr_bak.ldif)
   db2ldif.pl -Z standalone1 -n userRoot -D "cn=directory Manager" -w password -a /etc/dirsrv/slapd-standalone1/dr_baknew.ldif -P LDAP

3) Start the instance if exported in offline mode

4) Apply a high load of operations (search/add/mod)

5) Run an online import
   ldif2db.pl -Z standalone1 -n userRoot -D "cn=directory Manager" -w password -i /etc/dirsrv/slapd-standalone1/dr_bak.ldif -P LDAP
   ldap_add: Operations error (1)
   Failed to add task entry "cn=import_2019_5_21_6_29_37, cn=import, cn=tasks, cn=config" error (1)

   Error log:
   [21/May/2019:06:29:37.119685538 -0400] - ERR - ldbm_back_ldif2ldbm - ldbm: 'userRoot' there are 1 pending operation(s). Import can not proceed until they are completed.

6) Stop the load

7) Rerun the online import (step 5); it should succeed
   ldif2db.pl -Z standalone1 -n userRoot -D "cn=directory Manager" -w password -i /etc/dirsrv/slapd-standalone1/dr_bak.ldif -P LDAP
   Successfully added task entry "cn=import_2019_5_21_6_29_41, cn=import, cn=tasks, cn=config"

Marking it as VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2152
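
For step 4 above ("apply a high load of operations"), the report does not show how the load was generated. One possible sketch, not taken from the original report, is a simple client loop; the suffix dc=example,dc=com and the target entry are assumptions that must be adjusted to match the provisioned data:

#!/bin/sh
# Hedged sketch: keep a steady search + modify load on the backend while the
# online import task is added from another shell.  Adjust host, port, bind DN,
# password, suffix, and target entry for the instance under test.
while true; do
    ldapsearch -x -H ldap://localhost:389 \
        -D "cn=directory manager" -w password \
        -b "dc=example,dc=com" "(objectClass=person)" dn >/dev/null 2>&1
    ldapmodify -x -H ldap://localhost:389 \
        -D "cn=directory manager" -w password >/dev/null 2>&1 <<EOF
dn: uid=user1,ou=people,dc=example,dc=com
changetype: modify
replace: description
description: load $(date +%s)
EOF
done

Running several copies of this loop in parallel while the import task is added should trigger the "pending operation(s)" rejection; stopping the loops corresponds to step 6.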