Bug 186657 - PassSync crashed after running four hours under load
PassSync crashed after running four hours under load
Status: CLOSED NEXTRELEASE
Product: Red Hat Directory Server
Classification: Red Hat
Component: Sync Service (Show other bugs)
7.1
All Linux
medium Severity medium
: DS8.0
: ---
Assigned To: Nathan Kinder
Orla Hegarty
:
Depends On:
Blocks: 152373 182367 240316
  Show dependency treegraph
 
Reported: 2006-03-24 21:25 EST by To Ngan
Modified: 2008-08-11 19:45 EDT (History)
2 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2006-05-26 16:34:58 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
PassSync crash screen shot (47.99 KB, image/png)
2006-03-24 21:27 EST, To Ngan
no flags Details
CVS DIffs (13.32 KB, patch)
2006-03-29 10:52 EST, Nathan Kinder
no flags Details | Diff
Revised Diffs (18.26 KB, patch)
2006-03-30 14:32 EST, Nathan Kinder
no flags Details | Diff
Revised Diffs (19.65 KB, patch)
2006-03-30 17:48 EST, Nathan Kinder
no flags Details | Diff

  None (edit)
Description To Ngan 2006-03-24 21:25:37 EST
Description of problem:
When trying to verify recent memory leak fix, PassSync.exe crashed after running
for about 4 hours under load.  I'll attach the crash dialog later.

Version-Release number of selected component (if applicable):
DS 7.1 SP2 20060322.1 candidate

How reproducible:
Not sure yet.  Seen it once so far.

Steps to Reproduce:
1. Install and set up passsync using TET acceptance
2. Run gomod_a1.pl (ldapmodify which keeps changing users' passwd against AD)
3. After about 4 hours
  
Actual results:
Service crashed.

When looking at the Red Hat PassSync.log on the AD machine, I saw a bunch of
weired messages regarding 1 particular user entry passwd change (started minutes
before the crash).  I don't know exactly how this happened in the first place,
and if it is directly related to the crash:

Th03/24/06 15:50:59: Backoff time expired.  Attempting sync
03/24/06 15:50:59: 1 new entries loaded from data file
03/24/06 15:50:59: Cleared contents of data file
03/24/06 15:50:59: Password list has 3 entries
03/24/06 15:50:59: Attempting to sync password for testuser0073
03/24/06 15:50:59: Searching for (ntuserdomainid=testuser0073)
03/24/06 15:50:59: Ldap error in ModifyPassword
	19: Constraint violation
03/24/06 15:50:59: Modify password failed for remote entry:
uid=tuser0073,ou=people,dc=sfbay,dc=redhat,dc=com,dc=local
03/24/06 15:50:59: Deferring password change for testuser0073
03/24/06 15:50:59: Attempting to sync password for ÍÍÍÍÍÍÍÍÍÍÍÍÍýýýýÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ	
03/24/06 15:50:59: Searching for (ntuserdomainid=ÍÍÍÍÍÍÍÍÍÍÍÍÍýýýýÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ	)
03/24/06 15:50:59: There are no entries that match:
ÍÍÍÍÍÍÍÍÍÍÍÍÍýýýýÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ	
03/24/06 15:50:59: Deferring password change for ÍÍÍÍÍÍÍÍÍÍÍÍÍýýýýÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ	
03/24/06 15:50:59: Attempting to sync password for testuser0115
03/24/06 15:50:59: Searching for (ntuserdomainid=testuser0115)
03/24/06 15:50:59: Password modified for remote entry:
uid=tuser0115,ou=people,dc=sfbay,dc=redhat,dc=com,dc=local
03/24/06 15:50:59: Removing password change from list
03/24/06 15:50:59: Backing off for 64000ms
03/24/06 15:50:59: Backoff time expired.  Attempting sync
03/24/06 15:50:59: 1 new entries loaded from data file
03/24/06 15:50:59: Cleared contents of data file
03/24/06 15:50:59: Password list has 3 entries
03/24/06 15:50:59: Attempting to sync password for testuser0073
03/24/06 15:50:59: Searching for (ntuserdomainid=testuser0073)
03/24/06 15:50:59: Ldap error in ModifyPassword
	19: Constraint violation
03/24/06 15:50:59: Modify password failed for remote entry:
uid=tuser0073,ou=people,dc=sfbay,dc=redhat,dc=com,dc=local
03/24/06 15:50:59: Deferring password change for testuser0073
03/24/06 15:50:59: Attempting to sync password for ÍÍÍÍÍÍÍÍÍÍÍÍÍýýýýÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ	
03/24/06 15:50:59: Searching for (ntuserdomainid=ÍÍÍÍÍÍÍÍÍÍÍÍÍýýýýÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ	)
03/24/06 15:50:59: There are no entries that match:
ÍÍÍÍÍÍÍÍÍÍÍÍÍýýýýÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ	
03/24/06 15:50:59: Deferring password change for ÍÍÍÍÍÍÍÍÍÍÍÍÍýýýýÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ	
03/24/06 15:50:59: Attempting to sync password for testuser0115
03/24/06 15:50:59: Searching for (ntuserdomainid=testuser0115)
03/24/06 15:50:59: Password match, no modify performed: testuser0115
03/24/06 15:50:59: Removing password change from list
03/24/06 15:50:59: Backing off for 64000msis keeps repeating up until the
passsync.exe crash.  After I restart the pass sync service, no more of this
message.  I'll keep the test running over weekend, and see if it happens again.



Expected results:
No Crash

Additional info:
Nathan said the locking design of dat file (accessed by both lsass and passsync)
are questionable.  Under password mod load, problem may occur.  Nathan believes
this is most likely not a regression from DS 7.1 SP1, but to improving the
locking design is more appropriate for DS 7.2 though.
Comment 1 To Ngan 2006-03-24 21:27:53 EST
Created attachment 126687 [details]
PassSync crash screen shot
Comment 2 To Ngan 2006-03-27 10:31:22 EST
After last crash, I restarted passsync and continued the test.  PassSync.exe
crashed again, this time happened after 12 hours.

I clicked Abort on the crash dialog.  When I tried to restart passsync.exe
again, it failed to start.  lsass.exe is taking up lots of CPU cycle, memory
footprint is only 88520k, but VM size is 1.32G.  The whole system became pretty
much non-responsive.
Comment 3 To Ngan 2006-03-27 11:00:17 EST
Again, about 2 minutes before this 2nd crash, passsync.log shows repeated
attempts to process what appears to be a bad entry (Memory corruption? Bad data
due to file contention?)

03/25/06 02:55:02: Attempting to sync password for testuser0061
03/25/06 02:55:02: Searching for (ntuserdomainid=testuser0061)
03/25/06 02:55:02: Ldap error in ModifyPassword
        19: Constraint violation
03/25/06 02:55:02: Modify password failed for remote entry:
uid=tuser0061,ou=people,dc=sfbay,dc=redhat,dc=com,dc=local
03/25/06 02:55:02: Deferring password change for testuser0061
03/25/06 02:55:02: Attempting to sync password for ÍÍÍÍÍÍÍÍÍÍÍÍÍýýýýÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ
03/25/06 02:55:02: Searching for
(ntuserdomainid=ÍÍÍÍÍÍÍÍÍÍÍÍÍýýýýÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ       )
03/25/06 02:55:02: There are no entries that match: ÍÍÍÍÍÍÍÍÍÍÍÍÍýýýýÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ
03/25/06 02:55:02: Deferring password change for ÍÍÍÍÍÍÍÍÍÍÍÍÍýýýýÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ
03/25/06 02:55:02: Backing off for 2000ms
03/25/06 02:55:02: Backoff time expired.  Attempting sync
03/25/06 02:55:02: 1 new entries loaded from data file
03/25/06 02:55:02: Cleared contents of data file
03/25/06 02:55:02: Password list has 3 entries
03/25/06 02:55:02: Attempting to sync password for testuser0061
03/25/06 02:55:02: Searching for (ntuserdomainid=testuser0061)
03/25/06 02:55:02: Ldap error in ModifyPassword
        19: Constraint violation
03/25/06 02:55:02: Modify password failed for remote entry:
uid=tuser0061,ou=people,dc=sfbay,dc=redhat,dc=com,dc=local
03/25/06 02:55:02: Deferring password change for testuser0061
03/25/06 02:55:02: Attempting to sync password for ÍÍÍÍÍÍÍÍÍÍÍÍÍýýýýÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ
03/25/06 02:55:02: Searching for
(ntuserdomainid=ÍÍÍÍÍÍÍÍÍÍÍÍÍýýýýÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ       )
03/25/06 02:55:02: There are no entries that match: ÍÍÍÍÍÍÍÍÍÍÍÍÍýýýýÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ
03/25/06 02:55:02: Deferring password change for ÍÍÍÍÍÍÍÍÍÍÍÍÍýýýýÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝ
Comment 4 Nathan Kinder 2006-03-29 10:52:54 EST
Created attachment 126998 [details]
CVS DIffs

These changes add a mutex around accessing the passhook.dat file.  The logic is
that you must aquire the mutex before reading the changes out of the file
(loadSet), and you release the mutex after saving the file back out (saveSet. 
Both passhook.dll and passsync.exe always do a loadSet then a saveSet.	The
work done between loadSet and saveSet is quite small, so the mutex is not being
held longer than necessary.

A few changes were required by the addition of the mutex.  Since passsync.exe
actually creates the mutex object, the mutex will not exist when the passsync
service is not running.  Passhook.dll need to assume that it's safe to continue
accessing the file if the mutex does not exist.  The passsync service also
needed to be modified because of this.	Prior to these changes, passsync would
try to read the contents of passhook.dat immediately after startup.  Since
passhook.dll may be accessing the file without a mutex at that time, we need to
be more cautious.  I changed the behavior of passsync.exe to either wait for an
event from passhook.dll, or to wakeup every 5 minutes to check the passhook.dat
file for changes.  It will no longer check the data file at startup.  This
allows any running passhook.dll function to finish it's job (which is very
quick since there's only ever one password change to deal with per function
call).	Every call to passhook.dll after passsync.exe is running will see that
the mutex object exists and wait to aquire it before accessing the data file. 
If passhook.dll can't aquire the mutex, it will continue after 30 seconds.  We
don't want passhook.dll to wait forever since it will block the lsass.exe
process.

There will also a few small cleaup items in this fix.  One was a memory leak
where we needed to un-freeze a strstream object to allow the destructor to free
the memory used.
Comment 5 Nathan Kinder 2006-03-30 14:32:51 EST
Created attachment 127069 [details]
Revised Diffs

In response to Rich's review, the fix has been changed.  There was a concern
that having a timeout on passhook and passsync could result in either lost
password changes, blocking lsass, or ignoring the mutex and corrupting the data
file.

This new approach deals with the all of the above issues.  Basically, we are
still using a mutex around access to the passhook data file, but we will have
no timeouts while waiting to acquire the mutex on either the passhook or
passsync side.	In order to not block the lsass.exe process, we now have
passhook.dll create a thread for each password change.	This thread will aquire
the mutex, add the change to the data file, and release the mutex.  If the
thread has to wait a long time to acquire the mutex (in the case that passsync
is reading in a huge data file), it will not block lsass.exe from going about
it's normal business.

In order to avoid the overhead of constantly creating a handle to the mutex
object in passhook.dll for every change, the handle is created when the dll is
initialized at startup now.  We also read the logLevel settign from the
registry at this time instead of doing it for every password change as we did
before.  Since the threads will be writing to the passhook log file now, the
mutex must protect that file too.  All access to the passhook log file is done
with the mutex acquired.
Comment 6 Nathan Kinder 2006-03-30 17:48:22 EST
Created attachment 127081 [details]
Revised Diffs

In response to reviews, there are a few minor modifications.  These changes
mainly concern logging error conditions and error checking.
Comment 7 Nathan Kinder 2006-03-30 17:54:09 EST
Checked into winsync (HEAD).  Reviewed by Rich, Noriko, and Pete.

Checking in passwordsync/passhand.cpp;
/cvs/dirsec/winsync/passwordsync/passhand.cpp,v  <--  passhand.cpp
new revision: 1.10; previous revision: 1.9
done
Checking in passwordsync/passhand.h;
/cvs/dirsec/winsync/passwordsync/passhand.h,v  <--  passhand.h
new revision: 1.8; previous revision: 1.7
done
Checking in passwordsync/passhook/passhook.cpp;
/cvs/dirsec/winsync/passwordsync/passhook/passhook.cpp,v  <--  passhook.cpp
new revision: 1.9; previous revision: 1.8
done
Checking in passwordsync/passsync/syncserv.cpp;
/cvs/dirsec/winsync/passwordsync/passsync/syncserv.cpp,v  <--  syncserv.cpp
new revision: 1.14; previous revision: 1.13
done
Checking in passwordsync/passsync/syncserv.h;
/cvs/dirsec/winsync/passwordsync/passsync/syncserv.h,v  <--  syncserv.h
new revision: 1.8; previous revision: 1.7
done
Comment 8 To Ngan 2006-04-04 13:08:40 EDT
Verified against build 20060330.1
Same mod passwd script ran for days without passsync.exe crash.

Also bad passhook.dat file won't cause lsass.exe to hog memory and lock up the
whole system.  It'll simply print out errors in the passhook.log and passsync.log.
Comment 9 Orla Hegarty 2006-04-06 19:16:14 EDT
*DOCS*
Password sync tool under sustained heavy password modification load on the
Active Directory side could potentially crash passsync.exe. This has been
fixed. 
Comment 11 Chandrasekar Kannan 2007-07-25 15:17:09 EDT
DS7.2 is not a valid milestone anymore. Anything thats set to DS7.2 should be
set to DS8.0. Will make further changes per bug council on 07/24/2007, after this.
Comment 12 Chandrasekar Kannan 2008-08-11 19:45:50 EDT
Bug already CLOSED. setting screened+ flag

Note You need to log in before you can comment on or make changes to this bug.