Bug 1769734 - Heavy StartTLS connection load can randomly fail with err=1
Summary: Heavy StartTLS connection load can randomly fail with err=1
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: 389-ds-base
Version: 8.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 8.0
Assignee: thierry bordaz
QA Contact: RHDS QE
URL:
Whiteboard: sync-to-jira
Depends On:
Blocks:
 
Reported: 2019-11-07 10:30 UTC by Amith
Modified: 2020-11-04 03:08 UTC
CC List: 16 users

Fixed In Version: 389-ds-base-1.4.3.8-2.module+el8.3.0+6591+ebfc9766
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-04 03:07:24 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments
- The reproducer check_user.c file. (8.15 KB, text/x-csrc), 2019-11-07 10:30 UTC, Amith
- SSSD domain log in gzip format. (10.79 MB, application/gzip), 2019-11-07 12:56 UTC, Amith
- The Dirsrv access log. (2.78 MB, application/gzip), 2019-11-14 07:50 UTC, Amith
- The Dirsrv errors log file. (18.48 KB, text/plain), 2019-11-15 10:39 UTC, Amith
- Access file from latest test failure. (2.75 MB, application/gzip), 2019-11-16 07:42 UTC, Amith
- Error file from latest test failure. (18.61 KB, text/plain), 2019-11-16 07:43 UTC, Amith
- SSSD domain log. (10.75 MB, application/gzip), 2019-11-20 09:49 UTC, Amith
- The Dirsrv access log. (2.77 MB, application/gzip), 2019-11-20 09:51 UTC, Amith
- The Dirsrv errors log file. (18.61 KB, text/plain), 2019-11-20 09:51 UTC, Amith
- SSSD domain log with DS error log level: 8 (2.47 MB, text/plain), 2019-11-22 19:39 UTC, Amith
- Dirsrv errors log when err log level = 8. (332.53 KB, text/plain), 2019-11-22 19:41 UTC, Amith
- Dirsrv access log when DS err log level = 8. (1.06 MB, application/gzip), 2019-11-22 19:42 UTC, Amith
- SSSD domain log when DS turbo mode = off (7.18 MB, application/gzip), 2019-11-28 17:46 UTC, Amith
- Dirsrv errors log when DS turbo_mode = off (3.58 MB, application/gzip), 2019-11-28 17:48 UTC, Amith
- Dirsrv access log when DS turbo_mode = off. (2.41 MB, application/gzip), 2019-11-28 17:50 UTC, Amith


Links
- GitHub 389ds/389-ds-base issue 4088 (closed): Heavy StartTLS connection load can randomly fail with err=1 (last updated 2020-09-23 12:36:21 UTC)
- Red Hat Product Errata RHEA-2020:4695 (last updated 2020-11-04 03:07:38 UTC)

Description Amith 2019-11-07 10:30:20 UTC
Created attachment 1633603 [details]
The reproducer check_user.c file.

Description of problem:
This issue is already automated in our Performance test suite (bug 783138). I am creating a new bug for RHEL-8.1. We have seen consistent failures during RHEL-8.1 regression rounds, even though in earlier RHEL versions these failures were intermittent.

Version-Release number of selected component (if applicable):
sssd-2.2.0-19.el8.x86_64.rpm

How reproducible:
We see this all the time in RHEL-8.1 Performance Beaker runs.

Steps to Reproduce:
1. Setup 389ds and add an ldapuser.
2. Configure sssd.conf on a client machine as follows:

    [sssd]
    config_file_version = 2
    services = nss, pam
    domains = LDAP

    [domain/LDAP]
    id_provider = ldap
    ldap_uri = ldap://$SERVER
    ldap_search_base = dc=example,dc=com
    ldap_tls_cacert = /etc/openldap/certs/cacert.asc
    ldap_connection_expire_timeout=10

3. We have a reproducer check_user.c file; see the file contents below in Additional info.

4. Compile the C file as given below:

# gcc data/check_user.c -o check_user -lpam

5. Create /etc/pam.d/check_user file as given below:

# cat /etc/pam.d/check_user
auth    required        pam_sss.so
account required        pam_sss.so

6. When executed, the reproducer check_user binary prompts for the ldapuser and password, and then it runs user authentication 1000 times. In order to simulate high load, we created a function which executes this command in the background. Then we used a for loop to call the function 15 times. See the function and loop below:

## FUNCTION
check_user_func()
{
  expect -f - <<<'
    set timeout 300
    spawn ./check_user -s check_user -n 1000 kau1
    expect "Password: "
    send -- "Secret123\r"
    expect "^User * authenticated successfully 1000 times\r\n$"
    exit [expr [lindex [wait] 3] != 0]
    expect_after default {
        wait
        exit 1
    }
'
}

## FOR LOOP
for i in {1..15}
do
rlRun 'check_user_func' &
sleep 5
done


Actual results:
We have seen user authentication failing a number of times.

Expected results:
A green run of the test.

Additional info:
The reproducer file is also attached; here are the contents of check_user.c:

/*
 * Check user against unix authentication using PAM framework
 *
 * check_user [-n <nloop>] <username>
 *
 * We need to add the following to /etc/pam.d/check_user PAM configuration
 * file:
 * auth		required	pam_unix.so
 * account	required	pam_unix.so
 * (or else use passwd PAM service)
 *
 * - derived from PAM in IDS.ppt by Yoram Benchetrit
 * - compile with -lpam
 * If Linux libpam_misc rpm is loaded define PAM_MISC
 */
#include <stdlib.h>
#include <security/pam_appl.h>
#ifdef PAM_MISC
#include <security/pam_misc.h>
#else /* PAM_MISC */
#include <unistd.h>
#include <sys/select.h>
#include <termios.h>
#include <signal.h>
#include <string.h>
#endif /* PAM_MISC */
#include <stdio.h>

#define	BUFSIZE		128

#ifdef PAM_MISC
static struct pam_conv conv = { misc_conv, NULL };
#else /* PAM_MISC*/
int		echoOff = 0;
struct sigaction	oact;
struct termios		tSet;			/* terminal settings */

void
int_hdlr(int nSig, siginfo_t *pSigInfo, void *pCtxt)
{
	if (echoOff)
	{
		/* restore the terminal settings */
		if (tcsetattr(fileno(stdin), TCSANOW, &tSet) < 0)
		{
			perror("Failed to reset terminal attributes");
		}
		echoOff = 0;
	}
}

int
doNoEcho()
{
	struct sigaction	act;
	struct termios	ts;
	int				ifd;

	if (!echoOff)
	{
		/* set up signal handler */
		sigemptyset(&act.sa_mask);
		act.sa_flags = SA_SIGINFO;
		act.sa_sigaction = int_hdlr;
		if (sigaction(SIGINT, &act, &oact) < 0)
		{
			perror("doNoEcho: failed to install signal handler");
			return 1;
		}
		ifd = fileno(stdin);
		if (tcgetattr(ifd, &ts) == 0)
		{
			memcpy((void *) &tSet, (void *) &ts, sizeof ts);
			ts.c_lflag &= ~ECHO;
			if (tcsetattr(ifd, TCSAFLUSH, &ts) == 0)
				echoOff = 1;
		}
		if (!echoOff)
			perror("doNoEcho: warning: failed to disable terminal echo");
	}

	return 0;
}

int
clearNoEcho()
{
	int				ifd;

	if (echoOff)
	{
		ifd = fileno(stdin);
		if (tcsetattr(ifd, TCSANOW, &tSet) == 0)
		{
			echoOff = 0;
		}
		/* restore previous signal handler */
		if (sigaction(SIGINT, &oact, (struct sigaction *) NULL) < 0)
		{
			perror("clearNoEcho: failed to reset signal handler");
			return 1;
		}
		if (echoOff)
			perror("clearNoEcho: warning: failed to reset terminal attributes");
	}

	return 0;
}

/* get an input line of text, stripping off any trailing newline */
char *
getResponse(char *msg)
{
	char		sBuf[BUFSIZE+1], sJunk[BUFSIZE+1];
	struct timeval	tv;
	fd_set		rfds;
	int			len, ifd, rc;

	if (msg)
		printf("%s", msg);

	fgets(sBuf, sizeof sBuf, stdin);
	len = strlen(sBuf) - 1;
	if (sBuf[len] == '\n')
		sBuf[len] = '\0';
	else
	{
		/* some implementations of fgets() seem to leave the newline in the
		 * input stream. If there is some ready to read then retrieve it.
		 */
		FD_ZERO(&rfds);
		ifd = fileno(stdin);
		FD_SET(ifd, &rfds);

		memset((void *) &tv, 0, sizeof tv);
		if ((rc = select(ifd+1, &rfds, (fd_set *) NULL, (fd_set *) NULL,
			&tv)) < 0)
		{
			perror("getResponse: error in select");
			return (char *) NULL;
		}
		else if (rc > 0)
		{
			fgets(sJunk, sizeof sJunk, stdin);
			if (strlen(sJunk) > 1)
				printf("getResponse: warning: additional input discarded\n");
		}
	}

	if (echoOff)
		printf("\n");
	return strdup(sBuf);
}


/* Derived from http://developers.sun.com/solaris/articles/user_auth_solaris3.html
 */
/* This conversion function will cache the user responses.
 * - A full implementation should create a list of user name with each
 *   containing an array of prompts and responses.
 */

struct conv_data
{
	char	*user;			/* supplied by caller */
	int		nPrompt;		/* number of prompts */
	struct
	{
		char	*prompt;	/* prompt message */
		char	*response;	/* saved response */
	}		msg[PAM_MAX_NUM_MSG];
};

int conv_func2(int num_msg, const struct pam_message **msg,
	struct pam_response **resp, void *app_data);

struct pam_conv conv = { conv_func2, NULL };

int
conv_func2(int num_msg, const struct pam_message **msg,
	struct pam_response **resp, void *app_data)
{
	const struct pam_message	*m;
	struct pam_response	*r;
	struct conv_data	*cvd;
	char				*ustr;
	int					i, n, rc, bPrompt;

	if (num_msg < 0 || num_msg >= PAM_MAX_NUM_MSG)
	{
		fprintf(stderr, "conv_func: invalid number of messages: %d\n",
			num_msg);
		*resp = NULL;
		return PAM_CONV_ERR;
	}

	if ((r = *resp = calloc(num_msg, sizeof (struct pam_response))) == NULL)
		return PAM_BUF_ERR;

	rc = PAM_CONV_ERR;
	m = *msg;
	cvd = (struct conv_data *) app_data;
	bPrompt = 0;
	for (i = 0; i < num_msg; i++, m++, r++)
	{
		if (m->msg == NULL)
		{
			fprintf(stderr, "conv_func: message %d is NULL, style %d\n",
				i, m->msg_style);
			goto err;
		}

		if ((n = strlen(m->msg)) > 0 && m->msg[n - 1] == '\n')
		{
			if ((ustr = strdup(m->msg)) == (char *) NULL)
				goto err;
			ustr[n - 1] = '\0';	/* strip the trailing newline */
		}
		else
			ustr = (char *) m->msg;

		r->resp = NULL;
		r->resp_retcode = 0;

		switch (m->msg_style)
		{
			case PAM_PROMPT_ECHO_OFF:
				bPrompt++;
				/* FALLTHROUGH */
			case PAM_PROMPT_ECHO_ON:
				/* look for previously supplied response */
				if (cvd)
				{
					for (n = 0; n < cvd->nPrompt; n++)
					{
						if (!strcmp(ustr, cvd->msg[n].prompt))
						{
							r->resp = strdup(cvd->msg[n].response);
							if (r->resp == (char *) NULL)
							{
								rc = PAM_BUF_ERR;
								goto err;
							}
							break;
						}
					}
				}
				if (!r->resp)
				{
					if ((bPrompt && doNoEcho()) ||
						(r->resp = getResponse(ustr)) == (char *) NULL ||
						(bPrompt && clearNoEcho()))
						goto err;
					if (cvd && cvd->nPrompt < PAM_MAX_NUM_MSG)
					{
						if ((cvd->msg[cvd->nPrompt].prompt =
							strdup(ustr)) == (char *) NULL ||
							(cvd->msg[cvd->nPrompt++].response =
							strdup(r->resp)) == (char *) NULL)
						{
							rc = PAM_BUF_ERR;
							goto err;
						}
					}
				}
				break;
			case PAM_ERROR_MSG:
				fprintf(stderr, "%s\n", ustr);
				break;
			case PAM_TEXT_INFO:
				printf("%s\n", ustr);
				break;
		}

		if (ustr != m->msg)
			free((void *) ustr);
		ustr = (char *) NULL;
	}

	return PAM_SUCCESS;
err:
	free((void *) *resp);	/* free the array base, not the loop cursor */
	if (ustr)
		free((void *) ustr);
	*resp = NULL;
	return rc;
}
#endif /* PAM_MISC */

int
pam_test(char *service, struct conv_data *cvd)
{
	pam_handle_t	*pamh = NULL;
	char			*s;
	int				rc, rc2;

	rc = PAM_SERVICE_ERR;		/* anything other than PAM_SUCCESS */

	if (service == (char *) NULL)
		service = "passwd";		/* Use the passwd PAM service by default */
	rc = pam_start(service, (const char *) cvd->user, &conv, &pamh);
	if (rc != PAM_SUCCESS)
	{
		s = "pam_start";
		goto err;
	}

	rc = pam_authenticate(pamh, 0);
	if (rc != PAM_SUCCESS)
	{
		s = "pam_authenticate";
		if (rc == PAM_CRED_INSUFFICIENT || rc == PAM_AUTH_ERR)
			goto err;
	}

	if (rc == PAM_SUCCESS)
	{
		rc = pam_acct_mgmt(pamh, 0);
		if (rc != PAM_SUCCESS)
		{
			s = "pam_acct_mgmt";
			goto err;
		}
	}

err:
	if ((rc2 = pam_end(pamh, rc)) != PAM_SUCCESS)
	{
		fprintf(stderr, "Warning: error in \"pam_end\": %s\n",
			 pam_strerror(pamh, rc2));
		rc = PAM_SUCCESS;
	}
	if (rc != PAM_SUCCESS)
		fprintf(stderr, "Error in \"%s\": %s\n", s, pam_strerror(pamh, rc));

	return (rc == PAM_SUCCESS ? 0 : 1);
}

void
usage(char *s)
{
	fprintf(stderr, "usage: %s [-n <nloop>] [-s <service>] <username>\n", s);
	exit(1);
}


int
main(int argc, char *argv[])
{
	struct conv_data	cvd;
	char		**pArg;
	char		*user;
	char		*service;
	int			nLoop = 1;
	int			n, rc;

	service = (char *) NULL;
	pArg = &argv[1];
	while ((n = getopt(argc, argv, "n:s:")) != EOF)
	{
		switch (n)
		{
			case 'n':	nLoop = strtol(optarg, (char **) NULL, 0);
						pArg++;
						break;
			case 's':	service = optarg;
						pArg++;
						break;
			default:	usage(argv[0]);
		}
		pArg++;
	}

	if (nLoop < 1)
		nLoop = 1;

	if (*pArg == (char *) NULL)
		usage(argv[0]);

	user = *pArg;
	memset((void *) &cvd, 0, sizeof cvd);
	cvd.user = (char *) user;
	conv.appdata_ptr = (void *) &cvd;

	for (n = 0; n < nLoop; n++)
	{
		if ((rc = pam_test(service, &cvd)) != 0)
		{
			printf("Authentication failure for user \"%s\" in loop %d\n",
				user, n + 1);
			break;
		}
	}
	if (rc == 0)
	{
		if (n == 1)
			printf("User \"%s\" authenticated successfully\n", user);
		else
			printf("User \"%s\" authenticated successfully %d times\n",
				user, n);
	}
	for (n = 0; n < cvd.nPrompt; n++)
	{
		if (cvd.msg[n].prompt)
			free((void *) cvd.msg[n].prompt);
		if (cvd.msg[n].response)
			free((void *) cvd.msg[n].response);
	}

	exit(rc == 0 ? 0 : 1);
}

Comment 1 Sumit Bose 2019-11-07 11:00:34 UTC
Hi,

can you attach SSSD logs with debug_level=9 from a test run as well?

bye,
Sumit

Comment 2 Amith 2019-11-07 12:56:21 UTC
Created attachment 1633614 [details]
SSSD domain log in gzip format.

Attached the domain log. Since it is huge (almost 284 MB uncompressed), I have compressed it.

Comment 3 Alexey Tikhonov 2019-11-07 13:58:18 UTC
[check_encryption_used] (0x1000): ldap_get_option failed to get sasl ssf, assuming SASL is not used.
[check_encryption_used] (0x4000): Encryption used: SASL SSF [0] tls_inplace [TLS inplace].
[simple_bind_send] (0x0100): Executing simple bind as: uid=kau1,ou=People,dc=example,dc=com
[simple_bind_send] (0x2000): ldap simple bind sent, msgid = 3
[sdap_op_add] (0x2000): New operation 3 timeout 8
[sdap_process_result] (0x2000): Trace: sh[0x556af608f740], connected[1], ops[0x556af6091bd0], ldap[0x556af613db30]
[sdap_process_result] (0x2000): Trace: end of ldap_result list
[sdap_process_result] (0x2000): Trace: sh[0x556af60d6ee0], connected[1], ops[0x556af6000470], ldap[0x556af5ff2d90]
[sdap_process_message] (0x4000): Message type: [LDAP_RES_EXTENDED]
[sdap_connect_done] (0x0080): START TLS result: Operations error(1), Other operations are still pending on the connection.
[sdap_connect_done] (0x0080): ldap_install_tls failed: [Connect error] [Other operations are still pending on the connection.]


Searching for "Other operations are still pending on the connection" && "heavy load" yields https://lists.fedoraproject.org/pipermail/389-users/2010-April/011408.html
But that thread is very old; not sure whether it is relevant.
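
For illustration, here is a minimal sketch of the asynchronous StartTLS sequence a client such as SSSD drives through the OpenLDAP client API. This is not SSSD's actual sdap code, and the server URI below is hypothetical; the diagnostic string seen above is the server-supplied message retrieved at the ldap_parse_result() step.
```
/* Minimal sketch of the async StartTLS handshake via libldap
 * (OpenLDAP client API). Build with: gcc starttls.c -o starttls -lldap */
#include <stdio.h>
#include <ldap.h>

static int start_tls_async(LDAP *ld)
{
    LDAPMessage *res = NULL;
    char *diag = NULL;
    int rc, err, msgid;

    /* Send the StartTLS extended operation (OID 1.3.6.1.4.1.1466.20037)
     * without blocking. */
    rc = ldap_start_tls(ld, NULL, NULL, &msgid);
    if (rc != LDAP_SUCCESS)
        return rc;

    /* Wait for the LDAP_RES_EXTENDED reply. */
    if (ldap_result(ld, msgid, 0, NULL, &res) == -1)
        return LDAP_OPERATIONS_ERROR;

    /* err=1 (operationsError) with the diagnostic "Other operations are
     * still pending on the connection" is what the failing runs hit. */
    rc = ldap_parse_result(ld, res, &err, NULL, &diag, NULL, NULL, 1);
    if (rc != LDAP_SUCCESS || err != LDAP_SUCCESS) {
        fprintf(stderr, "START TLS result: %s(%d), %s\n",
                ldap_err2string(err), err, diag ? diag : "");
        ldap_memfree(diag);
        return err != LDAP_SUCCESS ? err : rc;
    }

    /* Only after a successful extop result is TLS layered on the socket. */
    return ldap_install_tls(ld);
}

int main(void)
{
    LDAP *ld = NULL;
    int ver = LDAP_VERSION3;

    /* Hypothetical server URI, for illustration only. */
    if (ldap_initialize(&ld, "ldap://ldap.example.com:389") != LDAP_SUCCESS)
        return 1;
    ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &ver); /* StartTLS needs v3 */

    int rc = start_tls_async(ld);
    printf("StartTLS: %s\n", ldap_err2string(rc));
    ldap_unbind_ext_s(ld, NULL, NULL);
    return rc == LDAP_SUCCESS ? 0 : 1;
}
```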

Comment 4 Sumit Bose 2019-11-07 15:32:35 UTC
Hi,

thanks for checking, Alexey. So maybe it is the DS which cannot handle the load. Amith, can you attach the Directory Server access logs from the time of the test as well?

bye,
Sumit

Comment 5 Amith 2019-11-14 07:50:49 UTC
Created attachment 1636020 [details]
The Dirsrv access log.

Comment 6 Sumit Bose 2019-11-14 08:12:24 UTC
Hi,

there is an error in the access log during the TLS setup:

[13/Nov/2019:10:06:21.614100711 -0500] conn=7388 fd=65 slot=65 connection from 10.16.56.57 to 10.16.56.59
[13/Nov/2019:10:06:21.624743256 -0500] conn=7388 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
[13/Nov/2019:10:06:21.624810824 -0500] conn=7388 op=0 RESULT err=1 tag=120 nentries=0 etime=0.0000125518
[13/Nov/2019:10:06:21.691564430 -0500] conn=7388 op=-1 fd=65 closed error 34 (Numerical result out of range) - B2


I do not know how to get more details here, but to me it looks like the load issue is not on the SSSD side but on the directory server side.

bye,
Sumit

Comment 7 Alexey Tikhonov 2019-11-14 10:27:07 UTC
(In reply to Amith from comment #5)
> Created attachment 1636020 [details]
> The Dirsrv access log.

If I understand correctly, there should also be an accompanying "errors" log (besides "access"). Could you please attach it as well?

For example in the bz 918689, there is the same issue seen in "access" log:
[02/Jan/2014:14:52:15 +051800] conn=11 op=-1 fd=64 closed error 34 (Numerical result out of range) - B2

And "errors" provides additional information (as an example only: in that case it was "Incoming BER Element was too long"; it would be useful to see what the case is here).

Comment 8 Amith 2019-11-15 10:39:14 UTC
Created attachment 1636429 [details]
The Dirsrv errors log file.

Comment 9 Alexey Tikhonov 2019-11-15 10:53:07 UTC
(In reply to Amith from comment #8)
> Created attachment 1636429 [details]
> The Dirsrv errors log file.

'errors' and 'access' logs do not match: 'access' ends at 14/Nov/2019:02:37:12 and 'errors' starts at 14/Nov/2019:09:15:44.

Please provide both logs capturing the moment the issue happens.

Comment 10 Amith 2019-11-16 07:42:06 UTC
Created attachment 1636692 [details]
Access file from latest test failure.

Comment 11 Amith 2019-11-16 07:43:24 UTC
Created attachment 1636693 [details]
Error file from latest test failure.

Comment 12 Amith 2019-11-16 07:48:20 UTC
Access and Error log file from latest beaker run are attached. Here is the beaker link : https://beaker.engineering.redhat.com/jobs/3898531

I have been executing this test with earlier RHEL versions to figure out whether it's an issue with the LDAP server. We have scheduled tests with RHEL-7.8 and RHEL-7.7 as the server and RHEL-8.1 as the client. We saw similar failures. I have been working on different combinations of the test and will update the Beaker jobs here once finished.

Comment 13 Alexey Tikhonov 2019-11-18 18:26:33 UTC
DS logs combined are:
```
[15/Nov/2019:17:54:43.979067998 -0500] conn=4499 fd=65 slot=65 connection from 10.16.56.57 to 10.16.56.56
[15/Nov/2019:17:54:44.012170796 -0500] conn=4499 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
[15/Nov/2019:17:54:44.012212970 -0500] conn=4499 op=0 RESULT err=1 tag=120 nentries=0 etime=0.0000076513
[15/Nov/2019:17:54:44.021844989 -0500] - ERR - log_ber_too_big_error - conn=4499 fd=65 Incoming BER Element may be misformed. This may indicate an attempt to use TLS on a plaintext port, IE ldaps://localhost:389. Check your client LDAP_URI settings.
[15/Nov/2019:17:54:44.043159954 -0500] conn=4499 op=-1 fd=65 closed error 34 (Numerical result out of range) - B2
```

But I am not sure if this makes any sense, as an error observed in `sssd_LDAP.log` is "Other operations are still pending on the connection":
(Thu Nov  7 07:40:13 2019) [sssd[be[LDAP]]] [sdap_connect_done] (0x0080): ldap_install_tls failed: [Connect error] [Other operations are still pending on the connection.]

Btw, the timestamp of this error - "Nov  7 07:40:13" - doesn't match the timestamp of the error in the DS log from comment 5 - "13/Nov/2019:10:06:21".


Amith, could you please provide `sssd_LDAP.log` from the same run as the DS access and errors logs? I was unable to find it in the Beaker link.

Comment 14 Alexey Tikhonov 2019-11-18 19:14:31 UTC
In the given sssd_LDAP.log failing operation looks like:
```
[sdap_process_result] (0x2000): Trace: sh[0x556af60d6ee0], connected[1], ops[0x556af6000470], ldap[0x556af5ff2d90]
[sdap_process_message] (0x4000): Message type: [LDAP_RES_EXTENDED]
[sdap_connect_done] (0x0080): START TLS result: Operations error(1), Other operations are still pending on the connection.
[sdap_connect_done] (0x0080): ldap_install_tls failed: [Connect error] [Other operations are still pending on the connection.]
[sdap_op_destructor] (0x2000): Operation 1 finished
```

There are a lot of concurrent connections running at different stages in the log.

Unfortunately there is no way to correlate this failed "Operation 1" / "sdap_handle=0x556af60d6ee0" with other parts of the log (where this operation was started, only the fd was logged).
"Using file descriptor [..] for the connection." can't be matched with the corresponding `socket()`/`close()` either.
Maybe the debug messages could be improved here.

So far, awaiting a complete set of logs.

Comment 15 Amith 2019-11-20 09:49:57 UTC
Created attachment 1638052 [details]
SSSD domain log.


Attached the log files from latest beaker run: sssd domain log, DS access and error log.

JOB: https://beaker.engineering.redhat.com/jobs/3903646

Comment 16 Amith 2019-11-20 09:51:00 UTC
Created attachment 1638053 [details]
The Dirsrv access log.

Comment 17 Amith 2019-11-20 09:51:41 UTC
Created attachment 1638054 [details]
The Dirsrv errors log file.

Comment 18 Alexey Tikhonov 2019-11-20 14:26:40 UTC
Hm, but that's it:
```
[19/Nov/2019:05:16:17.435711775 -0500] conn=11277 fd=71 slot=71 connection from 10.16.56.51 to 10.16.56.54
[19/Nov/2019:05:16:17.465202052 -0500] conn=11277 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
[19/Nov/2019:05:16:17.465249810 -0500] conn=11277 op=0 RESULT err=1 tag=120 nentries=0 etime=0.0000073579
[19/Nov/2019:05:16:17.535207558 -0500] - ERR - log_ber_too_big_error - conn=11277 fd=71 Incoming BER Element may be misformed. This may indicate an attempt to use TLS on a plaintext port, IE ldaps://localhost:389. Check your client LDAP_URI settings.
[19/Nov/2019:05:16:17.540751056 -0500] conn=11277 op=-1 fd=71 closed error 34 (Numerical result out of range) - B2
```
and matching SSSD log entry is:
```
(Tue Nov 19 05:16:17 2019) [sssd[be[LDAP]]] [sdap_connect_done] (0x0080): START TLS result: Operations error(1), Other operations are still pending on the connection.
```
where "Other operations are still pending on the connection" was obtained via `ldap_parse_result()`


Mark, could you please give a hint on how to debug this further - what could be the reason DS refuses the "start_tls_plugin" operation?

Comment 19 mreynolds 2019-11-20 14:42:34 UTC
(In reply to Alexey Tikhonov from comment #18)
> Hm, but that's it:
> ```
> [19/Nov/2019:05:16:17.435711775 -0500] conn=11277 fd=71 slot=71 connection
> from 10.16.56.51 to 10.16.56.54
> [19/Nov/2019:05:16:17.465202052 -0500] conn=11277 op=0 EXT
> oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
> [19/Nov/2019:05:16:17.465249810 -0500] conn=11277 op=0 RESULT err=1 tag=120
> nentries=0 etime=0.0000073579
> [19/Nov/2019:05:16:17.535207558 -0500] - ERR - log_ber_too_big_error -
> conn=11277 fd=71 Incoming BER Element may be misformed. This may indicate an
> attempt to use TLS on a plaintext port, IE ldaps://localhost:389. Check your
> client LDAP_URI settings.
> [19/Nov/2019:05:16:17.540751056 -0500] conn=11277 op=-1 fd=71 closed error
> 34 (Numerical result out of range) - B2
> ```
> and matching SSSD log entry is:
> ```
> (Tue Nov 19 05:16:17 2019) [sssd[be[LDAP]]] [sdap_connect_done] (0x0080):
> START TLS result: Operations error(1), Other operations are still pending on
> the connection.
> ```
> where "Other operations are still pending on the connection" was obtained
> via `ldap_parse_result()`
> 
> 
> Mark, could you please give a hint how to debug further - what could be the
> reason DS refuses to "start_tls_plugin"?

No idea, I've never seen this before.  Maybe turn on connection logging in hopes of getting more information:

https://access.redhat.com/documentation/en-us/red_hat_directory_server/9.0/html/configuration_command_and_file_reference/error-logs#error-logs-levels

Error log level: 8

I would also try ldapsearch using startTLS to see if that works.  Maybe it's something messed up on the SSSD Client?

Comment 20 Alexey Tikhonov 2019-11-20 15:12:38 UTC
(In reply to mreynolds from comment #19)
> > ```
> > [19/Nov/2019:05:16:17.435711775 -0500] conn=11277 fd=71 slot=71 connection
> > from 10.16.56.51 to 10.16.56.54
> > [19/Nov/2019:05:16:17.465202052 -0500] conn=11277 op=0 EXT
> > oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
> > [19/Nov/2019:05:16:17.465249810 -0500] conn=11277 op=0 RESULT err=1 tag=120
> > nentries=0 etime=0.0000073579
> > [19/Nov/2019:05:16:17.535207558 -0500] - ERR - log_ber_too_big_error -
> > conn=11277 fd=71 Incoming BER Element may be misformed. This may indicate an
> > attempt to use TLS on a plaintext port, IE ldaps://localhost:389. Check your
> > client LDAP_URI settings.
> > [19/Nov/2019:05:16:17.540751056 -0500] conn=11277 op=-1 fd=71 closed error
> > 34 (Numerical result out of range) - B2
> > ```

> No idea, I've never seen this before.  Maybe turn on connection logging in
> hopes of getting more information:
> 
> https://access.redhat.com/documentation/en-us/red_hat_directory_server/9.0/
> html/configuration_command_and_file_reference/error-logs#error-logs-levels
> 
> Error log level: 8

Amith, could you please provide a new set of logs with this DS debug level enabled?


> I would also try ldapsearch using startTLS to see if that works.

Well, in the access log there are 15700 successful "start_tls_plugin" operations and only one failed (this is load testing).

 
>  Maybe it's something messed up on the SSSD Client?

Might be, and I am looking for a clue.

Mark, would the observed DS behaviour look like the log above if SSSD, for example, tried StartTLS on a connection that has another operation in progress?

Comment 21 mreynolds 2019-11-20 15:21:54 UTC
(In reply to Alexey Tikhonov from comment #20)

> Mark, would observed DS behaviour be like in the log above if SSSD for
> example tried to StartTLS on a connection having another operation in
> progress?

But this is a new connection, and it's the first operation (op=0) on that connection, so there are no pending operations on the same connection at that time.  Maybe I am not understanding your question.

[19/Nov/2019:05:16:17.435711775 -0500] conn=11277 fd=71 slot=71 connection from 10.16.56.51 to 10.16.56.54
[19/Nov/2019:05:16:17.465202052 -0500] conn=11277 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
[19/Nov/2019:05:16:17.465249810 -0500] conn=11277 op=0 RESULT err=1 tag=120 nentries=0 etime=0.0000073579

Are there other incoming connections using start-tls that are occurring at the same time (05:16:17) ?

Comment 22 Alexey Tikhonov 2019-11-20 16:28:42 UTC
(In reply to mreynolds from comment #21)
> (In reply to Alexey Tikhonov from comment #20)
> 
> > Mark, would observed DS behaviour be like in the log above if SSSD for
> > example tried to StartTLS on a connection having another operation in
> > progress?
> 
> But this is a new connection, and it's the first operation (op=0) on that
> connection, so there are no pending operations on the same connection at
> that time. 

Indeed. Thank you.


> [19/Nov/2019:05:16:17.435711775 -0500] conn=11277 fd=71 slot=71 connection
> from 10.16.56.51 to 10.16.56.54
> [19/Nov/2019:05:16:17.465202052 -0500] conn=11277 op=0 EXT
> oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
> [19/Nov/2019:05:16:17.465249810 -0500] conn=11277 op=0 RESULT err=1 tag=120
> nentries=0 etime=0.0000073579
> 
> Are there other incoming connections using start-tls that are occurring at
> the same time (05:16:17) ?

A lot of:
$ grep "\"start_tls_plugin\"" access | grep "05:16:17" | wc -l
52
  --  ranging from conn=11257 to conn=11307.

And they are intermixed: 3 other connections complete TLS setup ("conn=11272 TLS1.3 128-bit AES-GCM") and 7 new connections are accepted after
[19/Nov/2019:05:16:17.435711775 -0500] conn=11277 fd=71 slot=71 connection from 10.16.56.51 to 10.16.56.54
but before 
[19/Nov/2019:05:16:17.465202052 -0500] conn=11277 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
[19/Nov/2019:05:16:17.465249810 -0500] conn=11277 op=0 RESULT err=1 tag=120 nentries=0 etime=0.0000073579


Btw, the timestamp diff between the last two entries is negligible, but quite a lot happens in the log during the next 0.07 sec before:
[19/Nov/2019:05:16:17.535207558 -0500] - ERR - log_ber_too_big_error - conn=11277 fd=71 Incoming BER Element may be misformed.

But the "access" log entry already had "err=1" (a successful StartTLS has "err=0"). Is it the same error as seen in the "errors" log ("- ERR - log_ber_too_big_error - conn=11277")? Or is one a consequence of the other? Why is there this delay?

Comment 23 mreynolds 2019-11-20 17:11:03 UTC
Looking at the code, we run into this issue when the BER-encoded packet is malformed. There is something wrong with the BER length (too big or too small). This is not something the server is doing. The server just receives a BER element (via the OpenLDAP client library), but the length specified in the BER element does not match the BER content. So it looks like the network traffic got corrupted. It would be good to do a network capture, and also to check the network for packet loss or errors.
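
To make the BER failure mode concrete, here is a hedged sketch (not the actual 389-ds code) of definite-form BER length decoding per X.690. A guard like the one behind log_ber_too_big_error compares the announced length against the configured maximum (nsslapd-maxbersize) and against what actually arrives, closing the connection (error 34, B2) when the announced length is out of range:
```
#include <stdint.h>
#include <stdio.h>

/* Decode the length octets of a definite-form BER element (X.690).
 * Returns the announced value length, or -1 on a bad encoding. */
static long ber_decode_length(const uint8_t *buf, size_t n, size_t *used)
{
    if (n < 1)
        return -1;
    if ((buf[0] & 0x80) == 0) {      /* short form: one octet */
        *used = 1;
        return buf[0];
    }
    size_t nlen = buf[0] & 0x7f;     /* long form: length in next nlen octets */
    if (nlen == 0 || nlen > 4 || n < 1 + nlen)
        return -1;                   /* indefinite form or oversized */
    long len = 0;
    for (size_t i = 0; i < nlen; i++)
        len = (len << 8) | buf[1 + i];
    *used = 1 + nlen;
    return len;
}

int main(void)
{
    /* 0x84 announces a 4-octet length; ~2 GB dwarfs any sane
     * nsslapd-maxbersize, so a server-side guard would reject it. */
    const uint8_t octets[] = { 0x84, 0x7f, 0xff, 0xff, 0xff };
    size_t used = 0;
    long len = ber_decode_length(octets, sizeof octets, &used);
    printf("announced length: %ld (used %zu octets)\n", len, used);
    return 0;
}
```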

Comment 24 Amith 2019-11-22 15:44:47 UTC
(In reply to Alexey Tikhonov from comment #20)
> (In reply to mreynolds from comment #19)
> > > ```
> > > [19/Nov/2019:05:16:17.435711775 -0500] conn=11277 fd=71 slot=71 connection
> > > from 10.16.56.51 to 10.16.56.54
> > > [19/Nov/2019:05:16:17.465202052 -0500] conn=11277 op=0 EXT
> > > oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
> > > [19/Nov/2019:05:16:17.465249810 -0500] conn=11277 op=0 RESULT err=1 tag=120
> > > nentries=0 etime=0.0000073579
> > > [19/Nov/2019:05:16:17.535207558 -0500] - ERR - log_ber_too_big_error -
> > > conn=11277 fd=71 Incoming BER Element may be misformed. This may indicate an
> > > attempt to use TLS on a plaintext port, IE ldaps://localhost:389. Check your
> > > client LDAP_URI settings.
> > > [19/Nov/2019:05:16:17.540751056 -0500] conn=11277 op=-1 fd=71 closed error
> > > 34 (Numerical result out of range) - B2
> > > ```
> 
> > No idea, I've never seen this before.  Maybe turn on connection logging in
> > hopes of getting more information:
> > 
> > https://access.redhat.com/documentation/en-us/red_hat_directory_server/9.0/
> > html/configuration_command_and_file_reference/error-logs#error-logs-levels
> > 
> > Error log level: 8
> 
> Amith, could you please provide new set of logs with this DS debug level
> enabled?
> 

Sure, I have scheduled a job. I am waiting for the results.


Comment 25 Amith 2019-11-22 19:39:43 UTC
Created attachment 1638893 [details]
SSSD domain log with DS error log level : 8

Comment 26 Amith 2019-11-22 19:41:01 UTC
Created attachment 1638894 [details]
Dirsrv errors log when Err log level = 8.

Comment 27 Amith 2019-11-22 19:42:19 UTC
Created attachment 1638895 [details]
Dirsrv access log when DS err log level = 8.

Comment 28 Amith 2019-11-22 19:44:13 UTC
Generated the set of logs as requested by setting "nsslapd-errorlog-level = 8" on the LDAP server. They are attached.

Comment 29 mreynolds 2019-11-22 21:03:25 UTC
The logs did not tell us much more than we already knew; there is a BER encoding problem:

Access Log:

[22/Nov/2019:11:40:52.541164984 -0500] conn=66 fd=72 slot=72 connection from 10.16.56.60 to 10.19.34.24
[22/Nov/2019:11:40:56.152543488 -0500] conn=66 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
[22/Nov/2019:11:40:56.152622812 -0500] conn=66 op=0 RESULT err=1 tag=120 nentries=0 etime=1.1225190639
[22/Nov/2019:11:40:57.846114105 -0500] conn=66 op=-1 fd=72 closed - B1

Errors log:

[22/Nov/2019:11:40:52.498208907 -0500] - DEBUG - connection_reset - new connection on 72
[22/Nov/2019:11:40:54.294060576 -0500] - DEBUG - connection_table_dump_activity_to_errors_log - activity on 72r
[22/Nov/2019:11:40:54.335773788 -0500] - DEBUG - handle_pr_read_ready - read activity on 72
[22/Nov/2019:11:40:55.580630529 -0500] - DEBUG - connection_threadmain - conn 66 read operation successfully - thread_turbo_flag 0 more_data 1 ops_initiated 1 refcnt 2 flags 0
[22/Nov/2019:11:40:55.622819720 -0500] - DEBUG - connection_check_activity_level - conn 66 activity level = 0
[22/Nov/2019:11:40:56.110904594 -0500] - DEBUG - connection_threadmain - conn 66 queued because more_data
[22/Nov/2019:11:40:56.194243021 -0500] - DEBUG - connection_threadmain - conn 66 check more_data 1 thread_turbo_flag 0repl_conn_bef 0, repl_conn_now 0
[22/Nov/2019:11:40:57.052718336 -0500] - DEBUG - connection_threadmain - conn 66 read operation successfully - thread_turbo_flag 0 more_data 1 ops_initiated 2 refcnt 3 flags 0
[22/Nov/2019:11:40:57.102718810 -0500] - DEBUG - connection_threadmain - conn 66 queued because more_data
[22/Nov/2019:11:40:57.195068947 -0500] - DEBUG - connection_threadmain - conn 66 read operation successfully - thread_turbo_flag 0 more_data 0 ops_initiated 3 refcnt 3 flags 0
[22/Nov/2019:11:40:57.769431703 -0500] - DEBUG - disconnect_server_nomutex_ext - Setting conn 66 fd=72 to be disconnected: reason -4999

Reason -4999 == Bad Ber Tag (B1)



A successful StartTLS connection looks like:

Access Log

[22/Nov/2019:11:40:49.645376731 -0500] conn=63 fd=64 slot=64 connection from 10.16.56.60 to 10.19.34.24
[22/Nov/2019:11:40:50.667725628 -0500] conn=63 op=0 EXT oid="1.3.6.1.4.1.1466.20037" name="start_tls_plugin"
[22/Nov/2019:11:40:50.667826919 -0500] conn=63 op=0 RESULT err=0 tag=120 nentries=0 etime=0.1637595329
[22/Nov/2019:11:40:51.303114876 -0500] conn=63 TLS1.3 128-bit AES-GCM

Errors Log

[22/Nov/2019:11:40:49.602578676 -0500] - DEBUG - connection_reset - new connection on 64
[22/Nov/2019:11:40:50.099367623 -0500] - DEBUG - connection_table_dump_activity_to_errors_log - activity on 64r
[22/Nov/2019:11:40:50.262672660 -0500] - DEBUG - handle_pr_read_ready - read activity on 64
[22/Nov/2019:11:40:50.512994654 -0500] - DEBUG - connection_threadmain - conn 63 read operation successfully - thread_turbo_flag 0 more_data 0 ops_initiated 1 refcnt 2 flags 0
[22/Nov/2019:11:40:50.598504739 -0500] - DEBUG - connection_check_activity_level - conn 63 activity level = 0
[22/Nov/2019:11:40:50.633011609 -0500] - DEBUG - connection_make_readable_nolock - making readable conn 63 fd=64
[22/Nov/2019:11:40:50.839345301 -0500] - DEBUG - connection_threadmain - conn 63 check more_data 0 thread_turbo_flag 0repl_conn_bef 0, repl_conn_now 0
[22/Nov/2019:11:40:51.431811658 -0500] - DEBUG - connection_threadmain - conn 63 read operation successfully - thread_turbo_flag 0 more_data 0 ops_initiated 2 refcnt 2 flags 17




In the set of logs the failures happened in a small cluster.  I would be curious to see if disabling turbo mode would help:

# ldapmodify -D "cn=directory manager" -W
dn: cn=config
changetype: modify
replace: nsslapd-enable-turbo-mode
nsslapd-enable-turbo-mode: off


Restart the server, and try and reproduce the problem again.

Comment 30 Alexey Tikhonov 2019-11-25 11:04:07 UTC
(In reply to mreynolds from comment #29)
> In the set of logs the failures happened in a small cluster.  I would be
> curious to see if disabling turbo mode would help:
> 
> # ldapmodify -D "cn=directory manager" -W
> dn: cn=config
> changetype: modify
> replace: nsslapd-enable-turbo-mode
> nsslapd-enable-turbo-mode: off
> 
> 
> Restart the server, and try and reproduce the problem again.

Setting needinfo per this request.

Amith, would this be possible?

Comment 31 Amith 2019-11-25 13:18:22 UTC
(In reply to Alexey Tikhonov from comment #30)
> (In reply to mreynolds from comment #29)
> > In the set of logs the failures happened in a small cluster.  I would be
> > curious to see if disabling turbo mode would help:
> > 
> > # ldapmodify -D "cn=directory manager" -W
> > dn: cn=config
> > changetype: modify
> > replace: nsslapd-enable-turbo-mode
> > nsslapd-enable-turbo-mode: off
> > 
> > 
> > Restart the server, and try and reproduce the problem again.
> 
> Setting needinfo per this request.
> 
> Amith, would this be possible?

Sure, I will try and attach the log files.

Comment 32 Amith 2019-11-28 17:42:20 UTC
Generated the set of logs by setting "nsslapd-enable-turbo-mode: off" and attached them.

# ldapsearch -xv -h kvm-03-guest25.hv2.lab.eng.bos.redhat.com -D "cn=Manager,dc=example,dc=com" -w Secret123 -b "cn=config" | grep nsslapd-enable-turbo-mode
ldap_initialize( ldap://kvm-03-guest25.hv2.lab.eng.bos.redhat.com )
filter: (objectclass=*)
requesting: All userApplication attributes
nsslapd-enable-turbo-mode: off

Comment 33 Amith 2019-11-28 17:46:20 UTC
Created attachment 1640460 [details]
SSSD domain log when DS turbo-mode : off

Comment 34 Amith 2019-11-28 17:48:53 UTC
Created attachment 1640461 [details]
Dirsrv errors log when DS turbo_mode = off

Comment 35 Amith 2019-11-28 17:50:02 UTC
Created attachment 1640462 [details]
Dirsrv access log when DS turbo_mode = off.

Comment 36 Alexey Tikhonov 2019-11-29 18:14:49 UTC
(In reply to mreynolds from comment #29)
> The logs did not tell us much more than we already knew; there is a BER encoding problem:
> Reason -4999 == Bad Ber Tag (B1)

But it looks different from the usual "log_ber_too_big_error"/"reason -4998".


> In the set of logs the failures happened in a small cluster.  I would be
> curious to see if disabling turbo mode would help:


It looks like disabling turbo mode made things much worse. Usually we saw one error during execution of the test; now there are 9 errors:

[28/Nov/2019:11:24:00.676935617 -0500] - ERR - log_ber_too_big_error - conn=2727 fd=65 Incoming BER Element may be misformed. This may indicate an attempt to use TLS on a plaintext port, IE ldaps://localhost:389. Check your client LDAP_URI settings.
[28/Nov/2019:11:24:00.677312541 -0500] - DEBUG - get_next_from_buffer - ber_get_next failed for connection 2727
[28/Nov/2019:11:24:00.677682307 -0500] - DEBUG - disconnect_server_nomutex_ext - Setting conn 2727 fd=65 to be disconnected: reason -4998


Mark, what would you say?

Comment 37 mreynolds 2019-11-29 20:00:08 UTC
Is there any way to change the test to use ldaps instead of StartTLS? I'd be curious to see if the same issues occur.

Next we need to see the network traffic. Can you run Wireshark while reproducing the issue, and send us the access log and the capture file? Thanks!

Comment 38 mreynolds 2019-12-02 21:15:23 UTC
I was able to reproduce this using ldapsearch via this simple script:

===========================================================
#!/bin/sh

while [ 1 ]
do
    LDAPTLS_CACERTDIR=/etc/dirsrv/slapd-localhost ldapsearch -xLLL -ZZ -D cn=dm -w redhat123 -b "" -s base dn &
    LDAPTLS_CACERTDIR=/etc/dirsrv/slapd-localhost ldapsearch -xLLL -ZZ -D cn=dm -w redhat123 -b "" -s base dn &
    LDAPTLS_CACERTDIR=/etc/dirsrv/slapd-localhost ldapsearch -xLLL -ZZ -D cn=dm -w redhat123 -b "" -s base dn
done
============================================================

Within 30 seconds I see:

ldap_start_tls: Operations error (1)
	additional info: Other operations are still pending on the connection.


I did not see this error using LDAP or LDAPS.  So it appears to be StartTLS specific.

Comment 39 Amith 2020-03-02 11:47:14 UTC
We have observed a similar testcase failure during RHEL-7.8 regression runs. It looks like this bug exists in 7.8 as well, so please backport the fix.
See the failed Beaker job: https://beaker.engineering.redhat.com/recipes/7969275#task107069194

Comment 40 thierry bordaz 2020-04-17 17:32:53 UTC
The scenario of the test is:
 - open connection
 - start TLS extop
 - bind
 - search
 - unbind

DS assumes that when processing the start_tls extop there is no active operation on the connection. To verify this it checks all operations attached to the connection.
If the bind is received before the start TLS completes, the bind is added to the connection. It can be pending on the work queue or even already taken by another worker (my case).
In such a case start TLS considers that it is invalid to proceed.

I think it is valid to return an operations error if there is a pending active operation.

In the current case the bind operation is merely detected but not yet active/read (because start TLS holds the connection c_mutex). The check could be smarter: verify whether the pending operations are already decoded/read (active) or just waiting for the start TLS to complete. Need to think more about this.
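
For illustration, a hedged sketch of the two checks described above, using hypothetical types rather than the actual 389-ds source: the strict variant rejects StartTLS whenever any other operation is attached to the connection, while the relaxed variant only counts operations that have already been read and decoded.
```
#include <stdbool.h>
#include <stdio.h>

struct operation {
    int o_msgid;
    bool o_decoded;               /* request fully read off the wire */
    struct operation *o_next;
};

struct connection {
    struct operation *c_ops;      /* operations attached to this connection */
};

/* Strict check (current behavior as described): any other attached
 * operation blocks StartTLS, even one merely queued behind the extop. */
static bool other_ops_pending_strict(const struct connection *c, int self)
{
    for (const struct operation *op = c->c_ops; op; op = op->o_next)
        if (op->o_msgid != self)
            return true;
    return false;
}

/* Relaxed check (the "smarter" variant suggested above): a bind that
 * arrived while StartTLS holds c_mutex is attached but not yet
 * decoded/active, so it does not have to abort the extop. */
static bool other_ops_pending_relaxed(const struct connection *c, int self)
{
    for (const struct operation *op = c->c_ops; op; op = op->o_next)
        if (op->o_msgid != self && op->o_decoded)
            return true;
    return false;
}

int main(void)
{
    /* StartTLS is msgid 0; a bind (msgid 1) is queued but not yet decoded. */
    struct operation bind = { 1, false, NULL };
    struct operation tls  = { 0, true, &bind };
    struct connection conn = { &tls };

    printf("strict : pending=%d\n", other_ops_pending_strict(&conn, 0));
    printf("relaxed: pending=%d\n", other_ops_pending_relaxed(&conn, 0));
    return 0;
}
```
With a bind queued but not yet decoded, the strict check makes the extop fail with err=1, while the relaxed check would let it proceed.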

Comment 41 thierry bordaz 2020-04-21 15:48:51 UTC
Upstream ticket pushed => POST

Comment 47 errata-xmlrpc 2020-11-04 03:07:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (389-ds:1.4 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4695

