Bug 918696 - Multiple threads simultaneously working on connection's private buffer causes ns-slapd to abort
Summary: Multiple threads simultaneously working on connection's private buffer causes...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: 389-ds-base
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Rich Megginson
QA Contact: Sankar Ramalingam
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-03-06 18:14 UTC by Nathan Kinder
Modified: 2020-09-13 20:05 UTC
CC List: 3 users

Fixed In Version: 389-ds-base-1.3.1.2-1.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-06-13 10:50:58 UTC
Target Upstream Version:
Embargoed:




Links:
Github 389ds/389-ds-base issue 276 (last updated 2020-09-13 20:05:55 UTC)

Description Nathan Kinder 2013-03-06 18:14:16 UTC
This bug is created as a clone of upstream ticket:
https://fedorahosted.org/389/ticket/276

On a multi-CPU system, the ns-slapd process aborts. The stack trace shows that it aborted in a PR_ASSERT() call at the following location in connection.c (the release version of 389 dirsrv is 1.2.0). The problem occurs on the latest code as well.

{{{
connection.c
1690 int connection_read_operation(Connection *conn, Operation *op,
ber_tag_t *tag, int *remaining_data)
1691 {

1734         }
1735         /* If we still haven't seen a complete PDU, read from the network */
1736         while (*tag == LBER_DEFAULT) {
1737                 int ioblocktimeout_waits = config_get_ioblocktimeout() / CONN_TURBO_TIMEOUT_INTERVAL;
1738                 /* We should never get here with data remaining in the buffer */
1739                 PR_ASSERT( !new_operation || 0 == (conn->c_private->c_buffer_bytes - conn->c_private->c_buffer_offset) );<------------------ aborted calling PR_ASSERT()
1740                 /* We make a non-blocking read call */
-----
}}}

Using an instrumented image, I collected data and found the following:

While one thread is working on a connection's private buffer, another thread, belonging to an older connection, makes the connection readable again by setting c_gettingber to zero.
This wakes up one more thread. Eventually more than one thread is working simultaneously on the connection's private buffer, which causes ns-slapd to abort.


It happens as below:

1. Thread X is working on Connection-1's last operation.

2. After performing the LDAP operation (connection_dispatch_operation()), the connection_threadmain() function decrements the connection's c_refcnt by calling connection_release_nolock().

3. Since this is the last operation, c_refcnt becomes zero.

4. Thread X has not yet executed the code that updates c_threadnumber and c_gettingber. However, as soon as the connection's c_refcnt becomes zero, the main thread allocates the same connection object/structure to a new incoming connection, Connection-2. Thread Y starts executing an operation that belongs to this newer connection (Connection-2).

{{{
connection.c

2008 static void
2009 connection_threadmain()
2010 {


2187                     if (!thread_turbo_flag && !more_data) {
2188                                 connection_release_nolock (conn);<------------------  Thread-X decremented c_refcnt and now c_refcnt = 0
2189                         }
2190                     PR_Unlock( conn->c_mutex );
2191                 }
2192                 /* Since we didn't do so earlier, we need to make a replication connection readable again here */
2193                 if ( ((1 == is_timedout) || (replication_connection && !thread_turbo_flag)) && !more_data)
2194                         connection_make_readable(conn);
2195                 pb = NULL;
2196
2197                 if (!thread_turbo_flag && !more_data) { /* Don't do this in turbo mode */
2198                         PR_Lock( conn->c_mutex );
2199                         /* if the threadnumber of now below the maximum, wakeup
2200                          * the listener thread so that we start polling on this
2201                          * connection again
2202                          */
2203                         /* DBDB I think this code is bogus -- we already signaled the listener above here */
2204                         if (conn->c_threadnumber == config_get_maxthreadsperconn())
2205                                 need_wakeup = 1;
2206                         else
2207                                 need_wakeup = 0;
2208                         conn->c_threadnumber--;
2209                         PR_Unlock( conn->c_mutex );
2210
2211                         if (need_wakeup)
2212                                 signal_listner();
2213                 }
2214
2215
2216         } /* while (1) */
2217 }
}}}

5. Thread Y starts processing operation data that belongs to the newer connection, Connection-2 (connection_read_operation()).

6. Now Thread X of Connection-1 calls connection_make_readable() and makes the connection readable again by setting c_gettingber to 0.

7. This wakes up another thread for Connection-2, and it too starts processing operation data (connection_read_operation()).

8. So at this point more than one thread is working on the connection's private data. (A simplified model of this interleaving follows the code excerpt below.)

9. Because two threads are working on the connection's c_buffer, c_buffer_bytes, and c_buffer_offset, at some point the conditions are met to abort() at the following location:

connection.c: connection_read_operation()
{{{
1734         }
1735         /* If we still haven't seen a complete PDU, read from the network */
1736         while (*tag == LBER_DEFAULT) {
1737                 int ioblocktimeout_waits = config_get_ioblocktimeout() / CONN_TURBO_TIMEOUT_INTERVAL;
1738                 /* We should never get here with data remaining in the buffer */
1739                 PR_ASSERT( !new_operation || 0 == (conn->c_private->c_buffer_bytes - conn->c_private->c_buffer_offset) );<------------------ aborted calling PR_ASSERT()
1740                 /* We make a non-blocking read call */
}}}
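
For illustration only, here is a minimal, self-contained pthreads sketch (not ns-slapd code) that models the interleaving in steps 1-9: the old worker drops the last reference before it clears c_gettingber, the listener reuses the same Connection slot for a new session in that window, and the late c_gettingber store then invites a second reader onto the reused connection. The struct fields and the sleep() are stand-ins chosen to force the ordering for this example:

{{{
/* Sketch only -- models the ordering hazard, not actual ns-slapd code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

typedef struct {
    pthread_mutex_t c_mutex;
    int c_refcnt;      /* like conn->c_refcnt: 0 means the slot may be reused   */
    int c_gettingber;  /* like conn->c_gettingber: 1 while a reader owns the fd */
    int c_generation;  /* which logical connection currently owns this slot     */
} Connection;

static Connection conn = { PTHREAD_MUTEX_INITIALIZER, 1, 1, 1 };

/* Thread X: finishes Connection-1's last operation, but clears c_gettingber late. */
static void *old_worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&conn.c_mutex);
    conn.c_refcnt--;                 /* steps 2-3: refcnt hits 0, slot is reusable */
    pthread_mutex_unlock(&conn.c_mutex);

    sleep(1);                        /* step 4: window in which the slot is reused */

    pthread_mutex_lock(&conn.c_mutex);
    conn.c_gettingber = 0;           /* step 6: "make readable" now hits Connection-2 */
    pthread_mutex_unlock(&conn.c_mutex);
    return NULL;
}

/* Listener: hands the same struct to a new connection as soon as refcnt reaches 0. */
static void *listener(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&conn.c_mutex);
        if (conn.c_refcnt == 0) {    /* steps 4-5: reuse the slot for Connection-2 */
            conn.c_refcnt = 1;
            conn.c_generation = 2;
            conn.c_gettingber = 1;   /* Connection-2's reader believes it is exclusive */
            pthread_mutex_unlock(&conn.c_mutex);
            return NULL;
        }
        pthread_mutex_unlock(&conn.c_mutex);
    }
}

int main(void)
{
    pthread_t tx, tl;
    pthread_create(&tx, NULL, old_worker, NULL);
    pthread_create(&tl, NULL, listener, NULL);
    pthread_join(tx, NULL);
    pthread_join(tl, NULL);

    /* Steps 7-9: the stale store cleared c_gettingber on the reused connection, so a
     * second reader can be woken for the same buffer -- the state the PR_ASSERT() in
     * connection_read_operation() catches before ns-slapd aborts. */
    if (conn.c_generation == 2 && conn.c_gettingber == 0)
        printf("stale thread touched the reused connection slot\n");
    return 0;
}
}}}

(Compile with cc -pthread; the final printf corresponds to the reused connection having lost its c_gettingber guard.)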



A probable fix could be:

In the connection_threadmain() function, if the return value is CONN_DONE, do not make the connection readable again after decrementing c_refcnt.

So should we handle the CONN_DONE and CONN_TIMEDOUT situations differently, as below?



Present code:
{{{
2097                 switch (ret) {
2098                         case CONN_DONE:
2099                                 /* This means that the connection was closed, so clear turbo mode */
2100                         /*FALLTHROUGH*/
2101                         case CONN_TIMEDOUT:
2102                                 thread_turbo_flag = 0;
2103                                 is_timedout = 1;
}}}

new code:

{{{
2097                 switch (ret) {
2098                         case CONN_DONE:
2099                                 /* This means that the connection was closed, so clear turbo mode */
                                         is_conn_done = 0;<-----------------
                                         thread_turbo_flag = 0;<-------------
2101                         case CONN_TIMEDOUT:
2102                                 thread_turbo_flag = 0;
2103                                 is_timedout = 1;
}}}

Note that you don't have to set thread_turbo_flag = 0 in the CONN_DONE case because it will be set when it falls through to the CONN_TIMEDOUT case.



Present:
{{{
2193                 if ( ((1 == is_timedout) || (replication_connection && !thread_turbo_flag)) && !more_data )
2194                         connection_make_readable(conn);
2195                 pb = NULL;

}}}

new:
{{{
2193    --------->      if ( !is_conn_done && ((1 == is_timedout) || (replication_connection && !thread_turbo_flag)) && !more_data )
2194                         connection_make_readable(conn);
2195                 pb = NULL;
}}}


If ret is CONN_DONE, but you set is_conn_done = 0, how does the code know the connection is really done?
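
Presumably the proposal means for the flag to record that the connection is done, so that the later check can suppress connection_make_readable(). As a sketch of that intent only (not the change that was ultimately committed for this bug), it would look like:

{{{
                switch (ret) {
                        case CONN_DONE:
                                /* The connection was closed: remember that, and clear turbo mode */
                                is_conn_done = 1;   /* 1 (not 0), so the check below can act on it */
                                /* FALLTHROUGH */
                        case CONN_TIMEDOUT:
                                thread_turbo_flag = 0;
                                is_timedout = 1;
                ...
                /* later: skip making a finished connection readable again */
                if ( !is_conn_done && ((1 == is_timedout) || (replication_connection && !thread_turbo_flag)) && !more_data )
                        connection_make_readable(conn);
}}}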

Comment 1 Rich Megginson 2013-10-01 23:26:56 UTC
Moving all ON_QA bugs to MODIFIED in order to add them to the errata (bugs in the ON_QA state can't be added to an errata). When the errata is created, the bugs should be automatically moved back to ON_QA.

Comment 3 Sankar Ramalingam 2014-02-04 06:14:39 UTC
Is this sanity-only verification?

Comment 4 Noriko Hosoi 2014-02-04 06:41:12 UTC
Rich provided a reproducer on Ticket 276 (https://fedorahosted.org/389/ticket/276).

Attachment nounbind.c added
reproducer - C code

Attachment examples.h added
reproducer - header

Attachment Makefile added
reproducer - Makefile

Compile nounbind.c and simultaneously run 100 instances of it in an endless loop for one hour.
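
(The attachments themselves are not reproduced in this bug. Purely as an illustration of the kind of client the name nounbind.c suggests -- bind and then drop the connection without an LDAP unbind -- a hypothetical sketch using OpenLDAP's libldap might look like the following; the URI, bind DN, and password are placeholders:)

{{{
/* Hypothetical sketch only -- the real reproducer is the nounbind.c attachment,
 * which is not shown in this bug.  This just illustrates the idea the name
 * suggests: connect, bind, and exit WITHOUT sending an LDAP unbind, so the
 * server sees the connection drop.  URI, DN, and password are placeholders. */
#include <stdio.h>
#include <string.h>
#include <ldap.h>

int main(void)
{
    LDAP *ld = NULL;
    int version = LDAP_VERSION3;
    char pw[] = "secret";                 /* placeholder password */
    struct berval cred;
    int rc;

    if (ldap_initialize(&ld, "ldap://localhost:389") != LDAP_SUCCESS) {
        fprintf(stderr, "ldap_initialize failed\n");
        return 1;
    }
    ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &version);

    cred.bv_val = pw;
    cred.bv_len = strlen(pw);
    rc = ldap_sasl_bind_s(ld, "cn=Directory Manager", LDAP_SASL_SIMPLE,
                          &cred, NULL, NULL, NULL);
    if (rc != LDAP_SUCCESS)
        fprintf(stderr, "bind failed: %s\n", ldap_err2string(rc));

    /* Deliberately no ldap_unbind_ext_s(): exiting closes the TCP socket
     * without a clean unbind, exercising the connection-teardown path. */
    return 0;
}
}}}

Something along those lines would be built with cc nounbind_sketch.c -lldap -llber and run in many parallel loops against the server under test.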

Comment 5 Milan Kubík 2014-02-14 15:20:40 UTC
After running the program in a loop for one hour, the server was still running and no errors were found. However, I didn't manage to run 100 of them at the same time; there were around 10 processes running concurrently.

Is it ok to verify this under these circumstances?

Comment 6 Noriko Hosoi 2014-02-17 02:35:21 UTC
(In reply to Milan Kubík from comment #5)
> After running the program in a loop for one hour, the server was still running
> and no errors were found. However, I didn't manage to run 100 of them at the
> same time; there were around 10 processes running concurrently.
> 
> Is it ok to verify this under these circumstances?

I think that's good enough.  Thank you, Milan.  Please mark VERIFIED.

Comment 7 Ludek Smid 2014-06-13 10:50:58 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.

