Bug 1769734
| Summary: | Heavy StartTLS connection load can randomly fail with err=1 | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Amith <apeetham> |
| Component: | 389-ds-base | Assignee: | thierry bordaz <tbordaz> |
| Status: | CLOSED ERRATA | QA Contact: | RHDS QE <ds-qe-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 8.1 | CC: | apeetham, atikhono, grajaiya, jhrozek, lslebodn, mreynolds, mzidek, nkinder, pasik, pbrezina, sbose, spichugi, tbordaz, thalman, tscherf, vashirov |
| Target Milestone: | rc | Keywords: | Regression |
| Target Release: | 8.0 | Flags: | pm-rhel:
mirror+
|
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | sync-to-jira | ||
| Fixed In Version: | 389-ds-base-1.4.3.8-2.module+el8.3.0+6591+ebfc9766 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-11-04 03:07:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Attachments: | |||
|
Description
Amith
2019-11-07 10:30:20 UTC
Hi, can you attach SSSD logs with debug_level=9from a test run as well? bye, Sumit Created attachment 1633614 [details]
SSSD domain log in gzip format.
Attached the domain log, since it is huge in size.. almost 284 MB, i have compressed it.
[check_encryption_used] (0x1000): ldap_get_option failed to get sasl ssf, assuming SASL is not used. [check_encryption_used] (0x4000): Encryption used: SASL SSF [0] tls_inplace [TLS inplace]. [simple_bind_send] (0x0100): Executing simple bind as: uid=kau1,ou=People,dc=example,dc=com [simple_bind_send] (0x2000): ldap simple bind sent, msgid = 3 [sdap_op_add] (0x2000): New operation 3 timeout 8 [sdap_process_result] (0x2000): Trace: sh[0x556af608f740], connected[1], ops[0x556af6091bd0], ldap[0x5 56af613db30] [sdap_process_result] (0x2000): Trace: end of ldap_result list [sdap_process_result] (0x2000): Trace: sh[0x556af60d6ee0], connected[1], ops[0x556af6000470], ldap[0x5 56af5ff2d90] [sdap_process_message] (0x4000): Message type: [LDAP_RES_EXTENDED] [sdap_connect_done] (0x0080): START TLS result: Operations error(1), Other operations are still pending on the connection. [sdap_connect_done] (0x0080): ldap_install_tls failed: [Connect error] [Other operations are still pending on the connection.] Search for "Other operations are still pending on the connection" && "heavy load" yields https://lists.fedoraproject.org/pipermail/389-users/2010-April/011408.html But this is very old. Not sure if this is relevant. Hi, thanks for checking Alexey. So maybe it is the DS which cannot handle the load. Amith, can you attach the Directory Server access logs from the time of the test as well? bye, Sumit Created attachment 1636020 [details]
The Dirsrv access log.
Hi, there is an error in the access log during the TLS setup: [13/Nov/2019:10:06:21.614100711 -0500] conn=7388 fd=65 slot=65 connection from 10.16.56.57 to 10.16.56.59 [13/Nov/2019:10:06:21.624743256 -0500] conn=7388 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin" [13/Nov/2019:10:06:21.624810824 -0500] conn=7388 op=0 RESULT err=1 tag=120 nentries=0 etime=0.0000125518 [13/Nov/2019:10:06:21.691564430 -0500] conn=7388 op=-1 fd=65 closed error 34 (Numerical result out of range) - B2 I'm do not know how to get more details here, but to me it looks that the load issue is not on the SSSD but on the directory server side. bye, Sumit (In reply to Amith from comment #5) > Created attachment 1636020 [details] > The Dirsrv access log. If I understand correctly, there should be also accompanying "errors" log (besides "access"). Could you please attach it as well? For example in the bz 918689, there is the same issue seen in "access" log: [02/Jan/2014:14:52:15 +051800] conn=11 op=-1 fd=64 closed error 34 (Numerical result out of range) - B2 And "errors" provides additional information (as an example only, in that case it was "Incoming BER Element was too long"; it could be useful to see what is the case here) Created attachment 1636429 [details]
The Dirsrv erros log file.
(In reply to Amith from comment #8) > Created attachment 1636429 [details] > The Dirsrv erros log file. 'errors' and 'access' logs do not match: 'access' ends at 14/Nov/2019:02:37:12 and 'errors' starts at 14/Nov/2019:09:15:44. Please, provide both logs capturing the moment issue happens. Created attachment 1636692 [details]
Acess file from latest test failure.
Created attachment 1636693 [details]
Error file from latest test failure.
Access and Error log file from latest beaker run are attached. Here is the beaker link : https://beaker.engineering.redhat.com/jobs/3898531 I have been executing this test with earlier RHEL versions to figure out whether its an issue with ldap server. We have scheduled tests with RHEL-7.8, & RHEL-7.7 as SERVER and RHEL-8.1 as client. We saw similar failures. I have been working on different combinations of test and will update here the beaker jobs, once finished. DS logs combined are: ``` [15/Nov/2019:17:54:43.979067998 -0500] conn=4499 fd=65 slot=65 connection from 10.16.56.57 to 10.16.56.56 [15/Nov/2019:17:54:44.012170796 -0500] conn=4499 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin" [15/Nov/2019:17:54:44.012212970 -0500] conn=4499 op=0 RESULT err=1 tag=120 nentries=0 etime=0.0000076513 [15/Nov/2019:17:54:44.021844989 -0500] - ERR - log_ber_too_big_error - conn=4499 fd=65 Incoming BER Element may be misformed. This may indicate an attempt to use TLS on a plaintext port, IE ldaps://localhost:389. Check your client LDAP_URI settings. [15/Nov/2019:17:54:44.043159954 -0500] conn=4499 op=-1 fd=65 closed error 34 (Numerical result out of range) - B2 ``` But I am not sure if this makes any sense, as an error observed in `sssd_LDAP.log` is "Other operations are still pending on the connection": (Thu Nov 7 07:40:13 2019) [sssd[be[LDAP]]] [sdap_connect_done] (0x0080): ldap_install_tls failed: [Connect error] [Other operations are still pending on the connection.] Btw, timestamp of this error - "Nov 7 07:40:13" - doesn't match timestamp of an error in DS log from comment 5 - "13/Nov/2019:10:06:21" Amith, could you please provide `sssd_LDAP.log` from the same run as DS access&errors logs? I was unable to find it in the beaker link. In the given sssd_LDAP.log failing operation looks like: ``` [sdap_process_result] (0x2000): Trace: sh[0x556af60d6ee0], connected[1], ops[0x556af6000470], ldap[0x556af5ff2d90] [sdap_process_message] (0x4000): Message type: [LDAP_RES_EXTENDED] [sdap_connect_done] (0x0080): START TLS result: Operations error(1), Other operations are still pending on the connection. [sdap_connect_done] (0x0080): ldap_install_tls failed: [Connect error] [Other operations are still pending on the connection.] [sdap_op_destructor] (0x2000): Operation 1 finished ``` There are a lot of concurrent connections running at different stages in the log. Unfortunately there is no way to correlate this failed "Operation 1" / "sdap_handle=0x556af60d6ee0" with other parts of the log (where this operation was started: only fd was logged there). "Using file descriptor [..] for the connection." can't be matched with corresponding `socket()`/`close()` as well. May be debug messages could be improved here. So far awaiting complete set of logs. Created attachment 1638052 [details] SSSD domain log. Attached the log files from latest beaker run: sssd domain log, DS access and error log. JOB: https://beaker.engineering.redhat.com/jobs/3903646 Created attachment 1638053 [details]
The Dirsrv access log.
Created attachment 1638054 [details]
The Dirsrv erros log file.
Hm, but that's it: ``` [19/Nov/2019:05:16:17.435711775 -0500] conn=11277 fd=71 slot=71 connection from 10.16.56.51 to 10.16.56.54 [19/Nov/2019:05:16:17.465202052 -0500] conn=11277 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin" [19/Nov/2019:05:16:17.465249810 -0500] conn=11277 op=0 RESULT err=1 tag=120 nentries=0 etime=0.0000073579 [19/Nov/2019:05:16:17.535207558 -0500] - ERR - log_ber_too_big_error - conn=11277 fd=71 Incoming BER Element may be misformed. This may indicate an attempt to use TLS on a plaintext port, IE ldaps://localhost:389. Check your client LDAP_URI settings. [19/Nov/2019:05:16:17.540751056 -0500] conn=11277 op=-1 fd=71 closed error 34 (Numerical result out of range) - B2 ``` and matching SSSD log entry is: ``` (Tue Nov 19 05:16:17 2019) [sssd[be[LDAP]]] [sdap_connect_done] (0x0080): START TLS result: Operations error(1), Other operations are still pending on the connection. ``` where "Other operations are still pending on the connection" was obtained via `ldap_parse_result()` Mark, could you please give a hint how to debug further - what could be the reason DS refuses to "start_tls_plugin"? (In reply to Alexey Tikhonov from comment #18) > Hm, but that's it: > ``` > [19/Nov/2019:05:16:17.435711775 -0500] conn=11277 fd=71 slot=71 connection > from 10.16.56.51 to 10.16.56.54 > [19/Nov/2019:05:16:17.465202052 -0500] conn=11277 op=0 EXT > oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin" > [19/Nov/2019:05:16:17.465249810 -0500] conn=11277 op=0 RESULT err=1 tag=120 > nentries=0 etime=0.0000073579 > [19/Nov/2019:05:16:17.535207558 -0500] - ERR - log_ber_too_big_error - > conn=11277 fd=71 Incoming BER Element may be misformed. This may indicate an > attempt to use TLS on a plaintext port, IE ldaps://localhost:389. Check your > client LDAP_URI settings. > [19/Nov/2019:05:16:17.540751056 -0500] conn=11277 op=-1 fd=71 closed error > 34 (Numerical result out of range) - B2 > ``` > and matching SSSD log entry is: > ``` > (Tue Nov 19 05:16:17 2019) [sssd[be[LDAP]]] [sdap_connect_done] (0x0080): > START TLS result: Operations error(1), Other operations are still pending on > the connection. > ``` > where "Other operations are still pending on the connection" was obtained > via `ldap_parse_result()` > > > Mark, could you please give a hint how to debug further - what could be the > reason DS refuses to "start_tls_plugin"? No idea, I've never seen this before. Maybe turn on connection logging in hopes of getting more information: https://access.redhat.com/documentation/en-us/red_hat_directory_server/9.0/html/configuration_command_and_file_reference/error-logs#error-logs-levels Error log level: 8 I would also try ldapsearch using startTLS to see if that works. Maybe it's something messed up on the SSSD Client? (In reply to mreynolds from comment #19) > > ``` > > [19/Nov/2019:05:16:17.435711775 -0500] conn=11277 fd=71 slot=71 connection > > from 10.16.56.51 to 10.16.56.54 > > [19/Nov/2019:05:16:17.465202052 -0500] conn=11277 op=0 EXT > > oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin" > > [19/Nov/2019:05:16:17.465249810 -0500] conn=11277 op=0 RESULT err=1 tag=120 > > nentries=0 etime=0.0000073579 > > [19/Nov/2019:05:16:17.535207558 -0500] - ERR - log_ber_too_big_error - > > conn=11277 fd=71 Incoming BER Element may be misformed. This may indicate an > > attempt to use TLS on a plaintext port, IE ldaps://localhost:389. Check your > > client LDAP_URI settings. > > [19/Nov/2019:05:16:17.540751056 -0500] conn=11277 op=-1 fd=71 closed error > > 34 (Numerical result out of range) - B2 > > ``` > No idea, I've never seen this before. Maybe turn on connection logging in > hopes of getting more information: > > https://access.redhat.com/documentation/en-us/red_hat_directory_server/9.0/ > html/configuration_command_and_file_reference/error-logs#error-logs-levels > > Error log level: 8 Amith, could you please provide new set of logs with this DS debug level enabled? > I would also try ldapsearch using startTLS to see if that works. Well, in the access log there are 15700 successful "start_tls_plugin" operations and only one failed (this is load testing). > Maybe it's something messed up on the SSSD Client? Might be, and I am looking for a clue. Mark, would observed DS behaviour be like in the log above if SSSD for example tried to StartTLS on a connection having another operation in progress? (In reply to Alexey Tikhonov from comment #20) > Mark, would observed DS behaviour be like in the log above if SSSD for > example tried to StartTLS on a connection having another operation in > progress? But this is a new connection, and it's the first operation (op=0) on that connection, so there are no pending operations on the same connection at that time. Maybe I am not understanding your question. [19/Nov/2019:05:16:17.435711775 -0500] conn=11277 fd=71 slot=71 connection from 10.16.56.51 to 10.16.56.54 [19/Nov/2019:05:16:17.465202052 -0500] conn=11277 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin" [19/Nov/2019:05:16:17.465249810 -0500] conn=11277 op=0 RESULT err=1 tag=120 nentries=0 etime=0.0000073579 Are there other incoming connections using start-tls that are occurring at the same time (05:16:17) ? (In reply to mreynolds from comment #21) > (In reply to Alexey Tikhonov from comment #20) > > > Mark, would observed DS behaviour be like in the log above if SSSD for > > example tried to StartTLS on a connection having another operation in > > progress? > > But this is a new connection, and it's the first operation (op=0) on that > connection, so there are no pending operations on the same connection at > that time. Indeed. Thank you. > [19/Nov/2019:05:16:17.435711775 -0500] conn=11277 fd=71 slot=71 connection > from 10.16.56.51 to 10.16.56.54 > [19/Nov/2019:05:16:17.465202052 -0500] conn=11277 op=0 EXT > oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin" > [19/Nov/2019:05:16:17.465249810 -0500] conn=11277 op=0 RESULT err=1 tag=120 > nentries=0 etime=0.0000073579 > > Are there other incoming connections using start-tls that are occurring at > the same time (05:16:17) ? A lot of: $ grep "\"start_tls_plugin\"" access | grep "05:16:17" | wc -l 52 -- ranging from conn=11257 to conn=11307. And they are intermixed: 3 other connections complete TSL setup ("conn=11272 TLS1.3 128-bit AES-GCM") and 7 new connections are accepted after [19/Nov/2019:05:16:17.435711775 -0500] conn=11277 fd=71 slot=71 connection from 10.16.56.51 to 10.16.56.54 but before [19/Nov/2019:05:16:17.465202052 -0500] conn=11277 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin" [19/Nov/2019:05:16:17.465249810 -0500] conn=11277 op=0 RESULT err=1 tag=120 nentries=0 etime=0.0000073579 Btw, timestamp diff between last two is negligible, but quite a lot happens in the log during next 0.07 sec before: [19/Nov/2019:05:16:17.535207558 -0500] - ERR - log_ber_too_big_error - conn=11277 fd=71 Incoming BER Element may be misformed. But "access" log entry already had "err=1" (in case of successful StartTLS here is "err=0"). Is it the same err as seen in "errors" log ("- ERR - log_ber_too_big_error - conn=11277")? Or is one consequence of another? Why there is this delay? Looking at the code, we run into this issue when the ber encoded packet is misformed. There is something wrong with the ber length (too big or too small). This is not something the server is doing. The server just receives a ber element (via openldap client library), but the length specified in the ber element is not matching the ber content. So it looks like the network traffic got corrupted. It would be good to do some network capture, and also to check the network for packet loss, or errors. (In reply to Alexey Tikhonov from comment #20) > (In reply to mreynolds from comment #19) > > > ``` > > > [19/Nov/2019:05:16:17.435711775 -0500] conn=11277 fd=71 slot=71 connection > > > from 10.16.56.51 to 10.16.56.54 > > > [19/Nov/2019:05:16:17.465202052 -0500] conn=11277 op=0 EXT > > > oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin" > > > [19/Nov/2019:05:16:17.465249810 -0500] conn=11277 op=0 RESULT err=1 tag=120 > > > nentries=0 etime=0.0000073579 > > > [19/Nov/2019:05:16:17.535207558 -0500] - ERR - log_ber_too_big_error - > > > conn=11277 fd=71 Incoming BER Element may be misformed. This may indicate an > > > attempt to use TLS on a plaintext port, IE ldaps://localhost:389. Check your > > > client LDAP_URI settings. > > > [19/Nov/2019:05:16:17.540751056 -0500] conn=11277 op=-1 fd=71 closed error > > > 34 (Numerical result out of range) - B2 > > > ``` > > > No idea, I've never seen this before. Maybe turn on connection logging in > > hopes of getting more information: > > > > https://access.redhat.com/documentation/en-us/red_hat_directory_server/9.0/ > > html/configuration_command_and_file_reference/error-logs#error-logs-levels > > > > Error log level: 8 > > Amith, could you please provide new set of logs with this DS debug level > enabled? > Sure, i have scheduled a job. I am waiting for the results. > > > I would also try ldapsearch using startTLS to see if that works. > > Well, in the access log there are 15700 successful "start_tls_plugin" > operations and only one failed (this is load testing). > > > > Maybe it's something messed up on the SSSD Client? > > Might be, and I am looking for a clue. > > Mark, would observed DS behaviour be like in the log above if SSSD for > example tried to StartTLS on a connection having another operation in > progress? Created attachment 1638893 [details]
SSSD domain log with DS error log level : 8
Created attachment 1638894 [details]
Dirsrv errors log when Err log level = 8.
Created attachment 1638895 [details]
Dirsrv access log when DS err log level = 8.
Generated the set of logs as requested, by setting "nsslapd-errorlog-level = 8" in LDAP server. They are attached. The logs did not tell us much more than we already knew, there is a ber encoding: Access Log: [22/Nov/2019:11:40:52.541164984 -0500] conn=66 fd=72 slot=72 connection from 10.16.56.60 to 10.19.34.24 [22/Nov/2019:11:40:56.152543488 -0500] conn=66 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin" [22/Nov/2019:11:40:56.152622812 -0500] conn=66 op=0 RESULT err=1 tag=120 nentries=0 etime=1.1225190639 [22/Nov/2019:11:40:57.846114105 -0500] conn=66 op=-1 fd=72 closed - B1 Errors log: [22/Nov/2019:11:40:52.498208907 -0500] - DEBUG - connection_reset - new connection on 72 [22/Nov/2019:11:40:54.294060576 -0500] - DEBUG - connection_table_dump_activity_to_errors_log - activity on 72r [22/Nov/2019:11:40:54.335773788 -0500] - DEBUG - handle_pr_read_ready - read activity on 72 [22/Nov/2019:11:40:55.580630529 -0500] - DEBUG - connection_threadmain - conn 66 read operation successfully - thread_turbo_flag 0 more_data 1 ops_initiated 1 refcnt 2 flags 0 [22/Nov/2019:11:40:55.622819720 -0500] - DEBUG - connection_check_activity_level - conn 66 activity level = 0 [22/Nov/2019:11:40:56.110904594 -0500] - DEBUG - connection_threadmain - conn 66 queued because more_data [22/Nov/2019:11:40:56.194243021 -0500] - DEBUG - connection_threadmain - conn 66 check more_data 1 thread_turbo_flag 0repl_conn_bef 0, repl_conn_now 0 [22/Nov/2019:11:40:57.052718336 -0500] - DEBUG - connection_threadmain - conn 66 read operation successfully - thread_turbo_flag 0 more_data 1 ops_initiated 2 refcnt 3 flags 0 [22/Nov/2019:11:40:57.102718810 -0500] - DEBUG - connection_threadmain - conn 66 queued because more_data [22/Nov/2019:11:40:57.195068947 -0500] - DEBUG - connection_threadmain - conn 66 read operation successfully - thread_turbo_flag 0 more_data 0 ops_initiated 3 refcnt 3 flags 0 [22/Nov/2019:11:40:57.769431703 -0500] - DEBUG - disconnect_server_nomutex_ext - Setting conn 66 fd=72 to be disconnected: reason -4999 Reason -4999 == Bad Ber Tag (B1) A successful StartTLS connection looks like: Access Log [22/Nov/2019:11:40:49.645376731 -0500] conn=63 fd=64 slot=64 connection from 10.16.56.60 to 10.19.34.24 [22/Nov/2019:11:40:50.667725628 -0500] conn=63 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin" [22/Nov/2019:11:40:50.667826919 -0500] conn=63 op=0 RESULT err=0 tag=120 nentries=0 etime=0.1637595329 [22/Nov/2019:11:40:51.303114876 -0500] conn=63 TLS1.3 128-bit AES-GCM Errors Log [22/Nov/2019:11:40:49.602578676 -0500] - DEBUG - connection_reset - new connection on 64 [22/Nov/2019:11:40:50.099367623 -0500] - DEBUG - connection_table_dump_activity_to_errors_log - activity on 64r [22/Nov/2019:11:40:50.262672660 -0500] - DEBUG - handle_pr_read_ready - read activity on 64 [22/Nov/2019:11:40:50.512994654 -0500] - DEBUG - connection_threadmain - conn 63 read operation successfully - thread_turbo_flag 0 more_data 0 ops_initiated 1 refcnt 2 flags 0 [22/Nov/2019:11:40:50.598504739 -0500] - DEBUG - connection_check_activity_level - conn 63 activity level = 0 [22/Nov/2019:11:40:50.633011609 -0500] - DEBUG - connection_make_readable_nolock - making readable conn 63 fd=64 [22/Nov/2019:11:40:50.839345301 -0500] - DEBUG - connection_threadmain - conn 63 check more_data 0 thread_turbo_flag 0repl_conn_bef 0, repl_conn_now 0 [22/Nov/2019:11:40:51.431811658 -0500] - DEBUG - connection_threadmain - conn 63 read operation successfully - thread_turbo_flag 0 more_data 0 ops_initiated 2 refcnt 2 flags 17 In the set of logs the failures happened in a small cluster. I would be curious to see if disabling turbo mode would help: # ldapmodify -D "cn=directory manager" -W dn: cn=config changetype: modify replace: nsslapd-enable-turbo-mode nsslapd-enable-turbo-mode: off Restart the server, and try and reproduce the problem again. (In reply to mreynolds from comment #29) > In the set of logs the failures happened in a small cluster. I would be > curious to see if disabling turbo mode would help: > > # ldapmodify -D "cn=directory manager" -W > dn: cn=config > changetype: modify > replace: nsslapd-enable-turbo-mode > nsslapd-enable-turbo-mode: off > > > Restart the server, and try and reproduce the problem again. Setting needinfo per this request. Amith, would this be possible? (In reply to Alexey Tikhonov from comment #30) > (In reply to mreynolds from comment #29) > > In the set of logs the failures happened in a small cluster. I would be > > curious to see if disabling turbo mode would help: > > > > # ldapmodify -D "cn=directory manager" -W > > dn: cn=config > > changetype: modify > > replace: nsslapd-enable-turbo-mode > > nsslapd-enable-turbo-mode: off > > > > > > Restart the server, and try and reproduce the problem again. > > Setting needinfo per this request. > > Amith, would this be possible? Sure, i will try and attach the log files. Generated the set of logs by setting "nsslapd-enable-turbo-mode: off" and attached them. # ldapsearch -xv -h kvm-03-guest25.hv2.lab.eng.bos.redhat.com -D "cn=Manager,dc=example,dc=com" -w Secret123 -b "cn=config" | grep nsslapd-enable-turbo-mode ldap_initialize( ldap://kvm-03-guest25.hv2.lab.eng.bos.redhat.com ) filter: (objectclass=*) requesting: All userApplication attributes nsslapd-enable-turbo-mode: off Created attachment 1640460 [details]
SSSD domain log when DS turbo-mode : off
Created attachment 1640461 [details]
Dirsrv errors log when DS turbo_mode = off
Created attachment 1640462 [details]
Dirsrv access log when DS turbo_mode = off.
(In reply to mreynolds from comment #29) > The logs did not tell us much more than we already knew, there is a ber encoding: > Reason -4999 == Bad Ber Tag (B1) But it looks different than usual "log_ber_too_big_error"/"reason -4998" > In the set of logs the failures happened in a small cluster. I would be > curious to see if disabling turbo mode would help: It looks like disabling turbo mode made things much worse. Usually we saw 1 error during execution of the test, now there are 9 errors: [28/Nov/2019:11:24:00.676935617 -0500] - ERR - log_ber_too_big_error - conn=2727 fd=65 Incoming BER Element may be misformed. This may indicate an attempt to use TLS on a plaintext port, IE ldaps://localhost:389. Check your client LDAP_URI settings. [28/Nov/2019:11:24:00.677312541 -0500] - DEBUG - get_next_from_buffer - ber_get_next failed for connection 2727 [28/Nov/2019:11:24:00.677682307 -0500] - DEBUG - disconnect_server_nomutex_ext - Setting conn 2727 fd=65 to be disconnected: reason -4998 Mark, what would you say? Is there anyway to change the test to use ldaps instead of startTLS? I'd be curious to see if the same issues occur. Next we need to see the network traffic. Can you run wireshark while reproducing the issue, and send us the access log and the capture file? Thanks! I was able to reproduce this using ldapsearch via this simple script:
===========================================================
#!/bin/sh
while [ 1 ]
do
LDAPTLS_CACERTDIR=/etc/dirsrv/slapd-localhost ldapsearch -xLLL -ZZ -D cn=dm -w redhat123 -b "" -s base dn &
LDAPTLS_CACERTDIR=/etc/dirsrv/slapd-localhost ldapsearch -xLLL -ZZ -D cn=dm -w redhat123 -b "" -s base dn &
LDAPTLS_CACERTDIR=/etc/dirsrv/slapd-localhost ldapsearch -xLLL -ZZ -D cn=dm -w redhat123 -b "" -s base dn
done
============================================================
Within 30 seconds I see:
ldap_start_tls: Operations error (1)
additional info: Other operations are still pending on the connection.
I did not see this error using LDAP or LDAPS. So it appears to be StartTLS specific.
We have observed the similar testcase failure during RHEL-7.8 regression runs. Looks like this bug exists in 7.8 as well so please backport the fix. See the failed beaker job: https://beaker.engineering.redhat.com/recipes/7969275#task107069194 The scenario of test is - open coneection - start tls EXTOP - bind - search - unbind DS assumes that when processing start_tls extop there is no active operation on the connection. To do this it checks all operations attached to the connection. If the bind is received before the start tls completes, the bind is added to the connection. It can be pending on the work queue or even taken by an other worker (my case). In such case start tls consider that it is invalid to proceed. I think it is valid to return operation error if there is a pending active operation. In the current case the bind operation is just detected but no yet active/read (because start tls holds the connection c_mutex). The test could be smarter to check if the pending operations are already decoded/read (active) or just waiting for the start tls to complete. Need to think more about this. Upstream ticket pushed => POST Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (389-ds:1.4 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:4695 |