Bug 1508498 - Authn/TOTP defined users periodically prompt for just password credentials to access resources
Summary: Authn/TOTP defined users periodically prompt for just password credentials to...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: ipa
Version: 7.4
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: IPA Maintainers
QA Contact: ipa-qe
URL:
Whiteboard:
Depends On: 1622168
Blocks: 1509470 1615964
 
Reported: 2017-11-01 14:48 UTC by baiesi
Modified: 2018-10-30 10:58 UTC
CC List: 24 users

Fixed In Version: ipa-4.6.4-8.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned To: 1615964
Environment:
Last Closed: 2018-10-30 10:57:10 UTC
Target Upstream Version:
Embargoed:


Attachments
core_backtrace (2.47 KB, text/plain), 2017-11-06 19:53 UTC, baiesi
core_backtrace (3.37 KB, text/plain), 2017-11-06 19:55 UTC, baiesi
3 coredumps (545.98 KB, application/x-gzip), 2017-11-07 12:08 UTC, Michal Reznik
str_user95 (one of the users that was running 2FA auth) configuration from IDM admin (114.94 KB, image/png), 2017-11-16 19:22 UTC, milei
sssd.log on IPA client 206 (279.07 KB, text/plain), 2018-05-25 19:37 UTC, milei
valgrind of ipa-otpd during crashes (73.24 KB, text/plain), 2018-05-25 20:37 UTC, Nathaniel McCallum


Links
Red Hat Product Errata RHBA-2018:3187, last updated 2018-10-30 10:58:17 UTC

Description baiesi 2017-11-01 14:48:56 UTC
Description of problem:
Users are defined to use only two-factor authentication (password + OTP). Nevertheless, while running automation, IdM periodically prompts the user with "Password:" instead of the First Factor / Second Factor prompts. The real concern is that if the user is prompted for just the "Password:" and enters it, he or she is then granted access to resources that originally required two-factor authentication.

Note: After repeated attempts we were also able to reproduce this manually in a terminal while the system was under load.

The automation load calculated a 14% failure rate.

How reproducible:
Steps to Reproduce:
1. added 100 users with associated totp/hotp tokens
2. configured a group of 10 users for two factor authentication only
3. configured a different group of 80 users for sudo tests with defined sudo rules
4. started the STR system test with concurrent authn/totp and sudo test load scenarios (a CLI sketch of the user/token setup follows below)
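
For reference, a setup along these lines can be scripted with the standard ipa CLI. This is only an illustrative sketch: the user and token names are made up, and these are not necessarily the exact commands used in this run:

    # add a user and a TOTP token owned by that user (illustrative names)
    ipa user-add str_user94 --first=Str --last=User94
    ipa otptoken-add totp-94 --type=totp --owner=str_user94 --algo=sha1 --digits=6 --interval=30
    # restrict the user to two-factor authentication (password + OTP)
    ipa user-mod str_user94 --user-auth-type=otp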

Actual results:
1. The auth TOTP automation test is periodically prompted with "Password:" instead of the two-factor authn prompts. In this state, if the user enters the LDAP password, authn succeeds and access to protected resources is allowed.

Expected results:
1. Expected all authn challenges to use two-factor authn with the First Factor / Second Factor prompts.


Additional info:

STR RunID: R-1761

rpm -qi ipa-server
Name        : ipa-server
Version     : 4.5.0
Release     : 21.el7_4.1.2
Architecture: x86_64
Install Date: Mon 16 Oct 2017 10:58:54 AM EDT
Group       : System Environment/Base
Size        : 1050087
License     : GPLv3+
Signature   : RSA/SHA256, Thu 17 Aug 2017 03:51:44 AM EDT, Key ID 199e2f91fd431d51
Source RPM  : ipa-4.5.0-21.el7_4.1.2.src.rpm
Build Date  : Wed 16 Aug 2017 02:31:08 PM EDT
Build Host  : x86-041.build.eng.bos.redhat.com

rpm -qi 389-ds-base
Name        : 389-ds-base
Version     : 1.3.6.1
Release     : 19.el7_4
Architecture: x86_64
Install Date: Mon 16 Oct 2017 10:58:39 AM EDT
Group       : System Environment/Daemons
Size        : 5480103
License     : GPLv3+
Signature   : RSA/SHA256, Fri 25 Aug 2017 03:59:01 AM EDT, Key ID 199e2f91fd431d51
Source RPM  : 389-ds-base-1.3.6.1-19.el7_4.src.rpm
Build Date  : Tue 22 Aug 2017 09:33:10 AM EDT
Build Host  : x86-019.build.eng.bos.redhat.com


**Test Environment:
IdM ver 4.5

1. master 1
Red Hat Enterprise Linux Server release 7.4 (Maipo) - metal
master.qe.sti.com 3.10.0-514.el7.x86_64 #1 SMP Wed Oct 19 11:24:13 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
ram:       16251636 kB
cpu 4core/ 16 proc

2. replica 1 - metal
Red Hat Enterprise Linux Server release 7.4 (Maipo)
replica2.qe.sti.com 3.10.0-514.el7.x86_64 #1 SMP Wed Oct 19 11:24:13 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
ram:       65689944 kB
cpu 10 core / 40 proc

3. client 1 -metal
Red Hat Enterprise Linux Server release 7.4 (Maipo)
replica1.qe.sti.com 3.10.0-514.el7.x86_64 #1 SMP Wed Oct 19 11:24:13 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
ram:       65689944 kB
cpu 10 core / 40 proc

**users:
str_user90-99
user authentication types:  [X ] Two factor authentication (password + OTP)

**tokens:
Unique ID: totp-90-99
Type: TOTP
Description: STI
Owner: str_user94
Algorithm sha1
Digits 6
Clock offset (seconds) 0
Clock interval (seconds) 30

**Reproduced manually in a terminal after repeated tries:
...
...
...
[root@replica2 ~]# ssh str_user94.sti.com
First Factor:
[root@replica2 ~]# ssh str_user94.sti.com
First Factor:
[root@replica2 ~]# ssh str_user94.sti.com
Password:
Last login: Mon Oct 30 10:55:56 2017 from replica2.qe.sti.com
**  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **
                 This System is reserved by baiesi.

**Authn/TOTP Test scenario:
Authn TOTP test simulating 10 user threads, all with valid tokens and set up for Two factor authentication (password + OTP) only

Comment 3 Jan Pazdziora 2017-11-01 15:05:31 UTC
May I assume that the use-case is about ssh access, not about authentication via Apache, and thus the component of this bugzilla is not correct?

Comment 4 baiesi 2017-11-01 20:42:20 UTC
Correct, Jan: it's about ssh access, not authentication via Apache. Please update as needed.

more info...
I ran abrt-cli on the master IPA server and the output below gives a clue. Notice ipa-otpd as the reason, and the count. The time shows the timestamp of the first occurrence of this problem. The system test ran for 3 days, from 2017-10-27 13:25:20 to 2017-10-30 13:25:20.

[root@master ~]# abrt-cli list
id 5e6b56bcf450564ab847556d0b2184fa3aa7c96b
reason:         ipa-otpd killed by SIGABRT
time:           Thu 26 Oct 2017 08:27:13 PM EDT
cmdline:        /usr/libexec/ipa/ipa-otpd ldapi://%2fvar%2frun%2fslapd-QE-STI-COM.socket
package:        ipa-server-4.5.0-21.el7_4.1.2
uid:            0 (root)
count:          61
Directory:      /var/spool/abrt/ccpp-2017-10-26-20:27:13-7369
Run 'abrt-cli report /var/spool/abrt/ccpp-2017-10-26-20:27:13-7369' for creating a case in Red Hat Customer Portal

id 27a843e1980c4988dc08017ea094a59a66c4e3d9
reason:         ipa-otpd killed by SIGSEGV
time:           Thu 26 Oct 2017 07:34:43 PM EDT
cmdline:        /usr/libexec/ipa/ipa-otpd ldapi://%2fvar%2frun%2fslapd-QE-STI-COM.socket
package:        ipa-server-4.5.0-21.el7_4.1.2
uid:            0 (root)
count:          168
Directory:      /var/spool/abrt/ccpp-2017-10-26-19:34:43-18537
Run 'abrt-cli report /var/spool/abrt/ccpp-2017-10-26-19:34:43-18537' for creating a case in Red Hat Customer Portal

id 3cefbc4aca035133c01d648358bdc3fda1ae6be1
reason:         ipa-otpd killed by SIGSEGV
time:           Thu 26 Oct 2017 09:12:26 PM EDT
cmdline:        /usr/libexec/ipa/ipa-otpd ldapi://%2fvar%2frun%2fslapd-QE-STI-COM.socket
package:        ipa-server-4.5.0-21.el7_4.1.2
uid:            0 (root)
count:          11
Directory:      /var/spool/abrt/ccpp-2017-10-26-21:12:26-3400
Run 'abrt-cli report /var/spool/abrt/ccpp-2017-10-26-21:12:26-3400' for creating a case in Red Hat Customer Portal

Comment 5 Jan Pazdziora 2017-11-02 08:37:59 UTC
Moving to ipa component for further triage.

Comment 6 baiesi 2017-11-02 16:17:58 UTC
note: The beaker job only provisions the operating system for the master/replica and client.  It does not install IdM or perform any test execution.

more-clues...
I also ran abrt-cli on Replica1 in our test environment, which shows abrt issues similar to those listed above.

Comment 7 Jan Pazdziora 2017-11-03 08:07:18 UTC
(In reply to baiesi from comment #6)
> note: The beaker job only provisions the operating system for the
> master/replica and client.  It does not install IdM or perform any test
> execution.

Which job? There are tons of beaker jobs that install IdM topologies and run tests.

Comment 8 baiesi 2017-11-03 13:08:56 UTC
Hi Jan,
Ok then: the job just provisions the operating systems on 3 beaker machines in our test env. No IdM install, no test tasks run.
https://beaker.engineering.redhat.com/jobs/2097428
Thanks

Comment 9 Alexander Bokovoy 2017-11-06 14:31:18 UTC
can we get actual backtraces? I don't see anything attached to this bug.

Comment 10 baiesi 2017-11-06 19:53:35 UTC
Created attachment 1348732 [details]
core_backtrace

id 27a843e1980c4988dc08017ea094a59a66c4e3d9
reason:         ipa-otpd killed by SIGSEGV
time:           Thu 26 Oct 2017 07:34:43 PM EDT
cmdline:        /usr/libexec/ipa/ipa-otpd ldapi://%2fvar%2frun%2fslapd-QE-STI-COM.socket
package:        ipa-server-4.5.0-21.el7_4.1.2
uid:            0 (root)
count:          173
Directory:      /var/spool/abrt/ccpp-2017-10-26-19:34:43-18537
Run 'abrt-cli report /var/spool/abrt/ccpp-2017-10-26-19:34:43-18537' for creating a case in Red Hat Customer Portal
filename: core_backtrace1

Comment 11 baiesi 2017-11-06 19:55:02 UTC
Created attachment 1348733 [details]
core_backtrace

id 5e6b56bcf450564ab847556d0b2184fa3aa7c96b
reason:         ipa-otpd killed by SIGABRT
time:           Thu 26 Oct 2017 08:27:13 PM EDT
cmdline:        /usr/libexec/ipa/ipa-otpd ldapi://%2fvar%2frun%2fslapd-QE-STI-COM.socket
package:        ipa-server-4.5.0-21.el7_4.1.2
uid:            0 (root)
count:          61
Directory:      /var/spool/abrt/ccpp-2017-10-26-20:27:13-7369
Run 'abrt-cli report /var/spool/abrt/ccpp-2017-10-26-20:27:13-7369' for creating a case in Red Hat Customer Portal
filename: core_backtrace2

Comment 12 Alexander Bokovoy 2017-11-06 20:28:59 UTC
Comment on attachment 1348733 [details]
core_backtrace

Thanks. As I suspected, a coredump is needed to see what the state of the queue item was before freeing, since otpd_queue_item_free() frees a bunch of objects.

Comment 13 Michal Reznik 2017-11-07 11:57:09 UTC
As per comment #4, I found 3 coredumps on the master. The 1st and 3rd look similar, related to the libkrad.so RADIUS library; the 2nd is a memory corruption:

Oct 26 19:34:43 master kernel: ipa-otpd[18537]: segfault at 1021 ip 00007fad959e2114 sp 00007ffe04585d70 error 4 in libkrad.so.0.0[7fad959de000+7000]
Oct 26 19:34:43 master abrt-hook-ccpp: Process 18537 (ipa-otpd) of user 0 killed by SIGSEGV - dumping core
Oct 26 19:34:43 master ipa-otpd: str_user98.COM: request received
Oct 26 19:34:43 master ipa-otpd: str_user98.COM: user query start
Oct 26 19:34:43 master ipa-otpd: str_user98.COM: user query end: uid=str_user98,cn=users,cn=accounts,dc=qe,dc=sti,dc=com
Oct 26 19:34:43 master ipa-otpd: str_user98.COM: bind start: uid=str_user98,cn=users,cn=accounts,dc=qe,dc=sti,dc=com
Oct 26 19:34:43 master ipa-otpd: str_user98.COM: bind end: success
Oct 26 19:34:43 master ipa-otpd: str_user98.COM: response sent: Access-Accept
Oct 26 19:34:43 master systemd: ipa-otpd: main process exited, code=dumped, status=11/SEGV
Oct 26 19:34:43 master systemd: Unit ipa-otpd entered failed state.
Oct 26 19:34:43 master systemd: ipa-otpd failed.


Oct 26 20:27:13 master ipa-otpd: *** Error in `/usr/libexec/ipa/ipa-otpd': double free or corruption (!prev): 0x00007f03cf94dc00 ***
Oct 26 20:27:13 master ipa-otpd: ======= Backtrace: =========
Oct 26 20:27:13 master ipa-otpd: /lib64/libc.so.6(+0x7c619)[0x7f03cd862619]
Oct 26 20:27:13 master ipa-otpd: /usr/libexec/ipa/ipa-otpd(+0x3ca3)[0x7f03ced6bca3]
Oct 26 20:27:13 master ipa-otpd: /lib64/libverto.so.1(verto_fire+0x18)[0x7f03ce4d5cd8]
Oct 26 20:27:13 master ipa-otpd: /lib64/libtevent.so.0(+0xaedb)[0x7f03caa82edb]
Oct 26 20:27:13 master ipa-otpd: /lib64/libtevent.so.0(+0x92a7)[0x7f03caa812a7]
Oct 26 20:27:13 master ipa-otpd: /lib64/libtevent.so.0(_tevent_loop_once+0x9d)[0x7f03caa7d0cd]
Oct 26 20:27:13 master ipa-otpd: /lib64/libverto.so.1(verto_run+0x4f)[0x7f03ce4d54af]
Oct 26 20:27:13 master ipa-otpd: /usr/libexec/ipa/ipa-otpd(+0x232f)[0x7f03ced6a32f]
Oct 26 20:27:13 master ipa-otpd: /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f03cd807c05]
Oct 26 20:27:13 master ipa-otpd: /usr/libexec/ipa/ipa-otpd(+0x23c8)[0x7f03ced6a3c8]


Oct 26 21:12:26 master kernel: ipa-otpd[3400]: segfault at 8 ip 00007f0f90ddcba4 sp 00007ffc79052320 error 4 in libkrad.so.0.0[7f0f90dda000+7000]
Oct 26 21:12:26 master ipa-otpd: str_user91.COM: request received
Oct 26 21:12:26 master ipa-otpd: str_user91.COM: user query start
Oct 26 21:12:26 master ipa-otpd: str_user91.COM: user query end: uid=str_user91,cn=users,cn=accounts,dc=qe,dc=sti,dc=com
Oct 26 21:12:26 master ipa-otpd: str_user91.COM: bind start: uid=str_user91,cn=users,cn=accounts,dc=qe,dc=sti,dc=com
Oct 26 21:12:26 master ipa-otpd: str_user91.COM: bind end: Invalid credentials
Oct 26 21:12:26 master ipa-otpd: str_user91.COM: response sent: Access-Reject
Oct 26 21:12:26 master ipa-otpd: str_user90.COM: request received
Oct 26 21:12:26 master ipa-otpd: str_user90.COM: user query start
Oct 26 21:12:26 master ipa-otpd: str_user90.COM: user query end: uid=str_user90,cn=users,cn=accounts,dc=qe,dc=sti,dc=com
Oct 26 21:12:26 master ipa-otpd: str_user90.COM: bind start: uid=str_user90,cn=users,cn=accounts,dc=qe,dc=sti,dc=com
Oct 26 21:12:26 master ipa-otpd: str_user90.COM: bind end: Invalid credentials
Oct 26 21:12:26 master ipa-otpd: str_user90.COM: response sent: Access-Reject
Oct 26 21:12:26 master ipa-otpd: str_user94.COM: request received
Oct 26 21:12:26 master ipa-otpd: str_user94.COM: user query start
Oct 26 21:12:26 master abrt-hook-ccpp: Process 3400 (ipa-otpd) of user 0 killed by SIGSEGV - dumping core

Other coredumps were not created as they were deemed to be duplicates. Attaching the dumps...

Comment 14 Michal Reznik 2017-11-07 12:08:07 UTC
Created attachment 1348922 [details]
3 coredumps

Comment 15 Alexander Bokovoy 2017-11-07 12:21:51 UTC
A stacktrace for the krad issue:

(gdb) bt full
#0  0x00007fad959e2114 in krad_packet_new_response (ctx=0x7fad98080120, secret=secret@entry=0x7fad96482488 "", code=<optimized out>, set=set@entry=0x0, 
    request=request@entry=0x7fad98099c00, response=response@entry=0x7fad98086740) at packet.c:324
        retval = 0
        pkt = 0x7fad980869a0
        attrset_len = 0
#1  0x00007fad9648083c in on_bind_readable (vctx=<optimized out>, ev=<optimized out>) at bind.c:115
        errstr = 0x7fad96482384 "error"
        i = 0
        results = 0x7fad980867e0
        item = 0x7fad98086730
        rslt = <optimized out>
#2  otpd_on_bind_io (vctx=<optimized out>, ev=<optimized out>) at bind.c:144
        flags = <optimized out>
#3  0x00007fad95bebcd8 in verto_fire (ev=0x7fad98085ea0) at verto.c:947
        priv = <optimized out>
        __PRETTY_FUNCTION__ = "verto_fire"
#4  0x00007fad92198edb in epoll_event_loop (tvalp=0x7ffe04585ea0, epoll_ev=0x7fad98082950) at ../tevent_epoll.c:728
        fde = <optimized out>
        flags = <optimized out>
        mpx_fde = <optimized out>
        ret = 1
        i = 0
        timeout = <optimized out>
        wait_errno = <optimized out>
        events = {{events = 1, data = {ptr = 0x7fad98085fc0, fd = -1744281664, u32 = 2550685632, u64 = 140383556755392}}}
#5  epoll_event_loop_once (ev=<optimized out>, location=<optimized out>) at ../tevent_epoll.c:930
        epoll_ev = 0x7fad98082950
        tval = {tv_sec = 30, tv_usec = 0}
        panic_triggered = false
#6  0x00007fad921972a7 in std_event_loop_once (ev=0x7fad980826d0, location=0x7fad9239f132 "verto-tevent.c:55") at ../tevent_standard.c:114
        glue_ptr = <optimized out>
        glue = 0x7fad98082800
        ret = <optimized out>
#7  0x00007fad921930cd in _tevent_loop_once (ev=0x7fad980826d0, location=0x7fad9239f132 "verto-tevent.c:55") at ../tevent.c:721
        ret = <optimized out>
        nesting_stack_ptr = 0x0
#8  0x00007fad95beb4af in verto_run (ctx=0x7fad98080530) at verto.c:578
No locals.
#9  0x00007fad9648032f in main (argc=<optimized out>, argv=0x7ffe04586098) at main.c:327
        hostname = "master.qe.sti.com", '\000' <repeats 15 times>, "\001\000\000\000\000\000\000\000M#H\226\255\177", '\000' <repeats 18 times>
        retval = <optimized out>
        hndata = {magic = 0, length = 17, data = 0x7ffe04585f50 "master.qe.sti.com"}
        sig = <optimized out>

Comment 16 Alexander Bokovoy 2017-11-07 12:23:29 UTC
And a backtrace for the crash in otpd_queue_item_free():

#0  0x00007f03cd81b1f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
        resultvar = 0
        pid = 7369
        selftid = 7369
#1  0x00007f03cd81c8e8 in __GI_abort () at abort.c:90
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x7f03cd964e94, sa_sigaction = 0x7f03cd964e94}, sa_mask = {__val = {3, 140733581701164, 4, 139654310805607, 1, 139654310814345, 3, 
              140733581701140, 12, 139654310814349, 2, 139654310814349, 2, 140733581701952, 140733581701952, 140733581703712}}, sa_flags = 11, sa_restorer = 0x7fff17253540}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007f03cd85af47 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f03cd967608 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
        ap = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7fff17253a30, reg_save_area = 0x7fff17253940}}
        ap_copy = {{gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7fff17253a30, reg_save_area = 0x7fff17253940}}
        fd = 2
        on_2 = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
        written = <optimized out>
#3  0x00007f03cd862619 in malloc_printerr (ar_ptr=0x7f03cdba2760 <main_arena>, ptr=<optimized out>, str=0x7f03cd967730 "double free or corruption (!prev)", action=3) at malloc.c:5023
        buf = "00007f03cf94dc00"
        cp = <optimized out>
        ar_ptr = 0x7f03cdba2760 <main_arena>
        ptr = <optimized out>
        str = 0x7f03cd967730 "double free or corruption (!prev)"
        action = 3
#4  _int_free (av=0x7f03cdba2760 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:3845
        size = <optimized out>
        fb = <optimized out>
        nextchunk = <optimized out>
        nextsize = <optimized out>
        nextinuse = <optimized out>
        prevsize = <optimized out>
        bck = <optimized out>
        fwd = <optimized out>
        errstr = 0x7f03cd967730 "double free or corruption (!prev)"
        locked = <optimized out>
#5  0x00007f03ced6bca3 in otpd_queue_item_free (item=0x7f03cf93cc10) at queue.c:61
No locals.
#6  0x00007f03ce4d5cd8 in verto_fire (ev=0x7f03cf937f40) at verto.c:947
        priv = <optimized out>
        __PRETTY_FUNCTION__ = "verto_fire"
#7  0x00007f03caa82edb in epoll_event_loop (tvalp=0x7fff17253af0, epoll_ev=0x7f03cf936950) at ../tevent_epoll.c:728
        fde = <optimized out>
        flags = <optimized out>
        mpx_fde = <optimized out>
        ret = 1
        i = 0
        timeout = <optimized out>
        wait_errno = <optimized out>
        events = {{events = 4, data = {ptr = 0x7f03cf938000, fd = -812417024, u32 = 3482550272, u64 = 139654344179712}}}
#8  epoll_event_loop_once (ev=<optimized out>, location=<optimized out>) at ../tevent_epoll.c:930
        epoll_ev = 0x7f03cf936950
        tval = {tv_sec = 30, tv_usec = 0}
        panic_triggered = false
#9  0x00007f03caa812a7 in std_event_loop_once (ev=0x7f03cf9366d0, location=0x7f03cac89132 "verto-tevent.c:55") at ../tevent_standard.c:114
        glue_ptr = <optimized out>
        glue = 0x7f03cf936800
        ret = <optimized out>
#10 0x00007f03caa7d0cd in _tevent_loop_once (ev=0x7f03cf9366d0, location=0x7f03cac89132 "verto-tevent.c:55") at ../tevent.c:721
        ret = <optimized out>
        nesting_stack_ptr = 0x0
#11 0x00007f03ce4d54af in verto_run (ctx=0x7f03cf934530) at verto.c:578
No locals.
#12 0x00007f03ced6a32f in main (argc=<optimized out>, argv=0x7fff17253ce8) at main.c:327
        hostname = "master.qe.sti.com", '\000' <repeats 15 times>, "\001\000\000\000\000\000\000\000M\303\326\316\003\177", '\000' <repeats 18 times>
        retval = <optimized out>
        hndata = {magic = 0, length = 17, data = 0x7fff17253ba0 "master.qe.sti.com"}
        sig = <optimized out>

Comment 17 Alexander Bokovoy 2017-11-07 13:13:50 UTC
The latter crash happens in queue.c:61 which is

    krad_packet_free(item->req);

According to glibc, it is a double free. We have

(gdb) print *(struct verto_ev*)0x7f03cf937f40
$7 = {next = 0x7f03cf937e20, ctx = 0x7f03cf934530, type = VERTO_EV_TYPE_IO, callback = 0x7f03ced6c120 <otpd_on_stdout_writable>, onfree = 0x0, priv = 0x0, ev = 0x7f03cf938000, flags = 177, 
  actual = 177, depth = 1, deleted = 0, option = {io = {fd = 1, state = VERTO_EV_FLAG_IO_WRITE}, signal = 1, interval = 137438953473, child = {proc = 1, status = 32}}}
(gdb) print *(struct otpd_queue_item *)0x7f03cf93cc10
$2 = {next = 0x7f03cf93a3f0, req = 0x7f03cf94dc00, rsp = 0x7f03cf950c30, sent = 20, error = 0x0, user = {dn = 0x7f03cf9382b0 "\320Γ\317\003\177", 
    uid = 0x7f03cf93a530 "Х\223\317\003\177", ipatokenRadiusUserName = 0x0, ipatokenRadiusConfigLink = 0x0, other = 0x0}, radius = {ipatokenUserMapAttribute = 0x0, 
    ipatokenRadiusSecret = 0x0, ipatokenRadiusServer = 0x0, ipatokenRadiusTimeout = 0, ipatokenRadiusRetries = 0}, msgid = 95}
(gdb) print *((struct otpd_queue_item *)0x7f03cf93cc10)->req
$3 = {
  buffer = "\360\245\223\317\003\177\000\000\020\235\223\317\003\177", '\000' <repeats 18 times>, "sti.com\006\006\000\000\000\b\002\022q\233\341\315\373\177\025\330@\212\372\300\020\244\030\324\001\027str_user95.COM", '\000' <repeats 4009 times>, attrset = 0x7f03cf93a150, pkt = {magic = -1760647422, length = 86, data = 0x1020 <Address 0x1020 out of bounds>}}
(gdb) print *((struct otpd_queue_item *)0x7f03cf93cc10)->req->attrset
$4 = {ctx = 0x7f03cf93a520, list = {tqh_first = 0x0, tqh_last = 0x7f03cf93a158}}
(gdb) print *((struct otpd_queue_item *)0x7f03cf93cc10)->req->attrset->list.tqh_last
$5 = (struct attr_st *) 0x0
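
For illustration, here is a minimal, self-contained sketch of this failure pattern. It is simplified and is not the actual ipa-otpd source; pop_msgid and the msgid value merely mirror the gdb session above:

    /* sketch.c: a queue "pop" that fails to unlink the returned element
     * leaves the item reachable from the head, so it is freed twice,
     * matching the "double free or corruption" abort seen in
     * otpd_queue_item_free(). */
    #include <stdlib.h>

    struct item {
        struct item *next;
        void *req;                      /* stands in for the krad_packet */
        int msgid;
    };

    struct queue { struct item *head; };

    /* BUGGY pop: finds the matching item but never unlinks it. */
    static struct item *pop_msgid(struct queue *q, int msgid) {
        for (struct item *i = q->head; i != NULL; i = i->next)
            if (i->msgid == msgid)
                return i;               /* still reachable via q->head */
        return NULL;
    }

    static void item_free(struct item *i) {
        if (i == NULL)
            return;
        free(i->req);                   /* cf. krad_packet_free(item->req) */
        free(i);
    }

    int main(void) {
        struct item *i = calloc(1, sizeof(*i));
        i->req = malloc(16);
        i->msgid = 95;                  /* the msgid value seen in gdb above */
        struct queue q = { .head = i };

        item_free(pop_msgid(&q, 95));   /* first free, by the response path */
        item_free(q.head);              /* later queue teardown: double free */
        return 0;
    }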

Comment 18 Alexander Bokovoy 2017-11-07 13:22:40 UTC
The first crash is in krad_packet_new_response():

    pkt_id_set(pkt, pkt_id_get(request));

Here we have: request
(gdb) print *(krad_packet*)0x7fad98099c00
$1 = {
  buffer = "؍+\225\255\177\000\000؍+\225\255\177\000\000\360\233\t\230\255\177\000\000\360\233\t\230\255\177\000\000sti.com\006\006\000\000\000\b\002\022{\366LPTP\242MTY\356\265b\273Z\266\001\027str_user93.COM", '\000' <repeats 4009 times>, attrset = 0x7fad98086850, pkt = {magic = -1760647422, length = 86, data = 0x1020 <Address 0x1020 out of bounds>}}

and response:
(gdb) print *(krad_packet*)0x7fad98086740
$2 = {
  buffer = "\240i\b\230\255\177", '\000' <repeats 18 times>, "\340g\b\230\255\177\000\000\300g\b\230\255\177", '\000' <repeats 66 times>, "\377\377\377\377\000\000\000\000\240\a\000\000\000\000\000\000 ", '\000' <repeats 15 times>, "\071\063\000\000\000\000\000\000\300\a\000\000\000\000\000\000@\000\000\000\000\000\000\000\024\000\000\000\000\000\000\000a\000\000\000\000\000\000\000 ]\b\230\255\177\000\000\000\000\000\000\000\000\000\000\340g\b\230\255\177", '\000' <repeats 18 times>..., attrset = 0x0, pkt = {magic = 0, length = 0, data = 0x0}}

pkt_id_get() is equivalent to accessing the request->pkt->data array. We can see that data is broken (0x1020 is not a valid pointer).

Comment 19 Alexander Bokovoy 2017-11-07 13:33:21 UTC
So in both crashes we are dealing with a malformed packet structure, since logically data in pkt should be pointing into pkt->buffer.
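
In other words, the layout implied by the gdb output is roughly the following (a simplified mirror for illustration; the real libkrad structure is private and the type names here are made up):

    /* Field names match what gdb printed above. */
    typedef struct { int magic; unsigned int length; char *data; } data_t;

    typedef struct {
        char buffer[4096];   /* the raw RADIUS bytes live here */
        void *attrset;
        data_t pkt;          /* for a live packet, pkt.data must point into buffer */
    } packet_t;

    /* The invariant both cores violate: pkt.data (0x1020) does not point
     * into the packet's own buffer, so pkt_id_get(request) reads garbage. */
    static int packet_is_sane(const packet_t *p) {
        return p->pkt.data >= p->buffer &&
               p->pkt.data <  p->buffer + sizeof(p->buffer);
    }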

Comment 20 Nathaniel McCallum 2017-11-09 12:33:32 UTC
(In reply to Alexander Bokovoy from comment #19)
> So in both crashes we deal with a malformed packet structure since logically
> data in pkt should be pointing to pkt->buffer.

Can you give me the output of valgrind for this? That should more directly locate the root cause.

Comment 21 Florence Blanc-Renaud 2017-11-14 17:22:11 UTC
Upstream ticket:
https://pagure.io/freeipa/issue/7262

Comment 22 Nathaniel McCallum 2017-11-14 18:26:24 UTC
Any luck on that valgrind output?

Comment 23 Alexander Bokovoy 2017-11-14 18:36:46 UTC
Bruce is going to reproduce it with valgrind since he has some stress-test infra where it is reproducible.

Comment 24 baiesi 2017-11-15 14:50:06 UTC
I installed valgrind on the master/replica, then restarted IdM and began running tests.

Currently I am not seeing otpd crashes on the master or replica yet, but I notice I am still periodically prompted with "Password:" instead of the expected two-factor prompts for the specified users.

- Manually, if I'm fast enough and type the password, I can get access to the host resource.
- If I'm not fast enough, it prompts me again for First Factor.

The system load tests are running and will continue to run for 3 days. I added some of the original tests to the mix, the same as when we first saw these concerns, to reproduce again.

note: on an IdM client I do see a crash related to "sssd_be", but I did not install valgrind there... see below...
id 2c877560b712ecda705ce95f4dfd5af81a8b7874
reason:         sssd_be killed by SIGSEGV
time:           Wed 15 Nov 2017 06:35:44 AM EST
cmdline:        /usr/libexec/sssd/sssd_be --domain qe.sti.com --uid 0 --gid 0 --debug-to-files
package:        sssd-common-1.15.2-50.el7_4.2
uid:            0 (root)
count:          1
Directory:      /var/spool/abrt/ccpp-2017-11-15-06:35:44-17467

Let us know if additional info is needed or if you want access to the environment. I sent out the creds earlier in an email to the group requiring access.

Comment 25 Nathaniel McCallum 2017-11-15 15:21:05 UTC
First, I was alerted to the fact that you have configured "password + otp". This configuration allows 1FA (password) *or* 2FA (password + OTP). What you probably want is 'otp', which allows 2FA (password + OTP) only.
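
For reference, the per-user setting can also be inspected and changed from the CLI (the user name below is just an example from this report):

    ipa user-show str_user94 --all | grep -i 'user authentication'
    ipa user-mod str_user94 --user-auth-type=otp

Per the discussion below, --user-auth-type=otp corresponds to the UI's 'Two factor authentication (password + OTP)' option.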

Second, the discrepancy between "First Factor:" and "Password:" could be explained by a replication delay. That is, if you configure the user for a token and then attempt to log in before the token configuration has reached the replica, that would explain why you aren't getting consistent behavior. This would be a race condition between replication and kinit. Alternatively, we could have a problem with replication that is causing malformed configurations (at least temporarily).

Third, the crashes could be related to replication delays and/or malformed configuration. That is, ipa-otpd might not carefully handle this error case.

Note that making the change from 'password + otp' to 'otp' will likely not make any change in behavior if we have a replication issue. This is because when the user is configured for 'otp' but does not have a valid token provisioned yet, we still allow 'password' logins. This is by design. So if you have a replication issue where the kinit is happening before the replica receives the token configuration, you will still be allowed to login with "Password:".

Comment 26 baiesi 2017-11-16 19:12:59 UTC
>>Response to Comment 25  

first:
The 10 users defined in IdM, 'str_user90-99', performing two-factor authentication are set up to use only 'Two factor authentication (password + OTP)', not any other combination with 'Password' or 'Radius'.

Second:
The users, tokens, and the Two factor authentication (password + OTP) user configuration were done 2 weeks ago. I do not add users/tokens/config just before authentication in my run-time authn OTP tests.

Third:
ok thanks

Note:
-As you mentioned above, how do you change 'password + otp' to 'otp'? In my IdM UI, I have only seen settings for 'Password', 'Radius' and 'Two factor authentication (password + OTP)'. Are there others?

-Again, I set up the users/tokens and configured the users for only Two factor authentication (password + OTP) 2 weeks ago, so replication should be complete, I hope :)... I've logged into the UI on both the master and replica servers and compared the user/token configuration, which looks fine.

*If you still think it's a replication issue, it may be a good test to shut down the replica server and run the tests, then the opposite, always leaving one of them up? Let me know...

Let us know if additional info is needed or want access to the environment.

Thanks

Comment 27 milei 2017-11-16 19:22:51 UTC
Created attachment 1353691 [details]
str_user95 (one of the users that was running 2FA auth) configuration from IDM admin

The screenshot str_user95Configure.png shows that user str_user95 is configured for 2 factor authentication (password + otp) only. str_user95 is one of the users used in the system load test.

Comment 28 Xiyang Dong 2017-11-17 15:01:27 UTC
Hello Bruce and Min,
I am just guessing: would it be possible that you set the global auth to multiple methods, like two-factor and password, and disabled the per-user override, so the individual setting of two-factor only didn't work?
Please refer to: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/linux_domain_identity_authentication_and_policy_guide/otp#user-auth-methods-cli

Comment 29 baiesi 2017-11-22 22:50:10 UTC
Hi Xiyang:

Thank you very much for your quick reply. I saw a couple of commands:
ipa config-mod --user-auth-type=otp
ipa config-mod --user-auth-type=otp --user-auth-type=disabled
ipa user-mod user --user-auth-type=otp

The user auth types can be set globally or on a per-user basis and indicate which authentication methods are allowed. We did not run any of these commands to configure OTP, so config-mod is using whatever the default is.

This is what ipa config-show --all reveals:

[cloud-user@host-172-16-125-14 ~]$ ipa config-show --all
  dn: cn=ipaConfig,cn=etc,dc=qe,dc=sti,dc=com
  Maximum username length: 32
  Home directory base: /home
  Default shell: /bin/sh
  Default users group: ipausers
  Default e-mail domain: qe.sti.com
  Search time limit: 2
  Search size limit: 100
  User search fields: uid,givenname,sn,telephonenumber,ou,title
  Group search fields: cn,description
  Enable migration mode: FALSE
  Certificate Subject base: O=QE.STI.COM
  Default group objectclasses: top, groupofnames, nestedgroup, ipausergroup, ipaobject
  Default user objectclasses: top, person, organizationalperson, inetorgperson, inetuser, posixaccount, krbprincipalaux, krbticketpolicyaux, ipaobject, ipasshuser
  Password Expiration Notification (days): 4
  Password plugin features: AllowNThash, KDC:Disable Last Success
  SELinux user map order: guest_u:s0$xguest_u:s0$user_u:s0$staff_u:s0-s0:c0.c1023$unconfined_u:s0-s0:c0.c1023
  Default SELinux user: unconfined_u:s0-s0:c0.c1023
  Default PAC types: MS-PAC, nfs:NONE
  IPA masters: master.qe.sti.com, replica1.qe.sti.com
  IPA CA servers: master.qe.sti.com
  IPA NTP servers: master.qe.sti.com, replica1.qe.sti.com
  IPA CA renewal master: master.qe.sti.com
  IPA master capable of PKINIT: master.qe.sti.com, replica1.qe.sti.com
  aci: (targetattr = "cn || createtimestamp || entryusn || ipacertificatesubjectbase || ipaconfigstring || ipacustomfields || ipadefaultemaildomain || ipadefaultloginshell || ipadefaultprimarygroup ||
       ipadomainresolutionorder || ipagroupobjectclasses || ipagroupsearchfields || ipahomesrootdir || ipakrbauthzdata || ipamaxusernamelength || ipamigrationenabled || ipapwdexpadvnotify ||
       ipasearchrecordslimit || ipasearchtimelimit || ipaselinuxusermapdefault || ipaselinuxusermaporder || ipauserauthtype || ipauserobjectclasses || ipausersearchfields || modifytimestamp ||
       objectclass")(targetfilter = "(objectclass=ipaguiconfig)")(version 3.0;acl "permission:System: Read Global Configuration";allow (compare,read,search) userdn = "ldap:///all";)
  cn: ipaConfig
  objectclass: nsContainer, top, ipaGuiConfig, ipaConfigObject, ipaUserAuthTypeClass, ipaNameResolutionData

Comment 30 Nathaniel McCallum 2017-11-27 16:09:37 UTC
(In reply to baiesi from comment #26)
> >>Response to Comment 25  
> 
> first:
> The 10 users defined in IdM 'str_user90-99' performing 2 factor
> authentication are setup to use only 'Two factor authentication (password +
> OTP)'. Not in any other combination with 'Password' or 'Radius'.

Good!

> Second:
> Users, tokens and user Two factor authentication (password + OTP) user
> configuration were done 2 weeks ago.  I do not add user/tokens/config just
> before authentication in my run-time authn otp tests.

That probably rules out replication issues.

> Note:
> -As you mentioned above, how do you configure password + otp' to 'otp'?  On
> my IdM UI, I only have seen settings for 'Password', 'Radius' and 'Two
> factor authentication (password + OTP)'? Are there others?

Nope. Just pick the third one.

> -Again I pre setup users/tokens and config user for only Two factor
> authentication (password + OTP) which was done 2 weeks ago. So replication
> should be complete, I hope:)...  I've logged into the UI for both Master and
> Replica servers comparing user/token configuration which look fine.
> 
> *If you still think its a replication issue, it may be a good test to shut
> down the Replica server and run the tests, then the opposite, always leaving
> one of them up?  Let me know...
> 
> Let us know if additional info is needed or want access to the environment.

We really need to see a crash under valgrind.

Comment 31 baiesi 2017-11-28 16:52:31 UTC
Hi  Nathaniel McCallum,

We recently modified our test environment setup and increased the user, group, and token topology to expand our system testing effort. We decided to turn off valgrind and restarted IdM services for an overnight 16 hr run, to see if valgrind was suppressing the crashes we initially saw.

We found additional crashes occurring during this latest run on the master server.

Same reason as before:
[root@master ccpp-2017-10-26-19:34:43-18537]# cat /var/spool/abrt/ccpp-2017-10-26-19:34:43-18537/reason
ipa-otpd killed by SIGSEGV

Found this in the valgrind FAQ:
http://valgrind.org/docs/manual/faq.html#faq.crashes
"4.4. 	My program crashes normally, but doesn't under Valgrind, or vice versa. What's happening?
When a program runs under Valgrind, its environment is slightly different to when it runs natively. For example, the memory layout is different, and the way that threads are scheduled is different.
Most of the time this doesn't make any difference, but it can, particularly if your program is buggy. For example, if your program crashes because it erroneously accesses memory that is unaddressable, it's possible that this memory will not be unaddressable when run under Valgrind. Alternatively, if your program has data races, these may not manifest under Valgrind.
There isn't anything you can do to change this, it's just the nature of the way Valgrind works that it cannot exactly replicate a native execution environment. In the case where your program crashes due to a memory error when run natively but not when run under Valgrind, in most cases Memcheck should identify the bad memory operation."

Thanks again for your expertise. Please advise...

Comment 32 Florence Blanc-Renaud 2018-05-04 08:01:01 UTC
Hi,
as Nathaniel pointed out, it would really help us to have a crash under valgrind. I have 2 additional questions:

1. Does your latest comment mean that you were not able to reproduce a crash when running under Valgrind?
2. According to comment #24, the prompt for "Password" instead of "First factor" occurred even though there was no crash, meaning we may have 2 different issues here. Do you have 389-ds access logs from when this happens? Just a hypothesis here, but if the LDAP server is overloaded we may see timeouts on some of the searches, and in that case we would need to check how ipa-otpd reacts (does it consider that the user has no OTP and should be authenticated with password only?)

Comment 33 baiesi 2018-05-04 17:40:42 UTC
Hi Florence,
Yes, when valgrind was installed we were not able to reproduce the crash. :( I do not have the 389 logs currently, but we can regenerate them. I like your hypothesis, good thinking...

**Well, seems like we should give it another go, using RHEL 7.5-latest, to confirm...**

First:
we will install the latest IdM, configure and run the system test for a few hours, and observe and document the results.
-verify config
-verify system admin/auth load results
-verify whether the process still crashes
-verify whether I can access a protected resource with just a password instead of a TOTP
-allow dev access to the test environment as needed to observe the issue
-collect and view the 389-ds access logs you mentioned

Second:
we can add valgrind, with your support configuring it, and then run the same system test/load and hope for the process to crash as it has in the past.

We just need to free up some resources to run these system tests for you.

Let us know if this works for you?
Thanks
Bruce Aiesi

Comment 34 Florence Blanc-Renaud 2018-05-14 08:13:35 UTC
Hi Bruce,
yes, your plan looks good to me. Thank you,

Comment 35 baiesi 2018-05-21 15:39:29 UTC
Hi Florence,

We recently installed RHEL 7.5-latest, which is an ASAN build used to detect memory leaks for another defect, but that should be fine here... Feel free to log in to the test environment with the creds below to access the 389-ds access logs. Try to authenticate manually, as we did, to replicate. Feel free to contact us directly at any time, or schedule a meeting to discuss in more detail...

Env:
users: 1000
tokens: 200 totp and 200 hotp
-1 master
-1 replica
-19 clients

To simplify the environment we are only running 1 totp authn test against 1 idm client. All admin tests are disabled here.

via automation...
We ran a TOTP test under light load of 5 threads performing concurrent authn with different users against 1 IdM client (10.8.245.206):
-attempts=684, passed=29, failed=655
-we verified our 5 users each have 2 tokens assigned to them (one totp and one hotp)
-we verified the users were configured for Two-Factor Authentication (password+otp) only
-again we were periodically prompted with "Password" and not the expected "First Factor"/"Second Factor"
-again we were periodically able to gain access to the protected host with just an LDAP password when prompted with "Password"
-again we received periodic crash reports: ipa-otpd killed by SIGABRT - /usr/libexec/ipa/ipa-otpd ldapi://%2fvar%2frun%2fslapd-QE-STI-COM.socket


via manual testing...
Of course, after seeing the results via automation, we decided to verify manually. We initially ssh'd into the IdM client (10.8.245.206). You can see the stdout below: first it asks for "First Factor:" as expected... great... When we try again it prompts for "Password:", eventually allowing us to access the protected host resource, which is not expected.

stdout....
[cloud-user@host-172-16-125-23 ~]$ ssh idm_user5.245.206
First Factor:
[1]+  Stopped                 ssh idm_user5.245.206

[cloud-user@host-172-16-125-23 ~]$ ssh idm_user5.245.206
Password:
Last login: Mon May 21 10:16:53 2018 from ovpn-122-141.rdu2.redhat.com
Could not chdir to home directory /home/idm_user5: No such file or directory

Comment 36 Florence Blanc-Renaud 2018-05-22 09:47:17 UTC
Hi Nathaniel,

would you have time to check the machines? Thanks

Comment 37 Nathaniel McCallum 2018-05-22 15:53:29 UTC
This morning I spent some time working on this bug. I am treating the "Password:" prompt issue and the ipa-otpd crash issue as separate problems since they are unrelated (ipa-otpd doesn't control what prompt the client sees at all).

=== "Password:" prompt issue ===

1. I was unable to reproduce the issue using a raw kinit. When using kinit I *always* received two prompts ("First factor:" and "Second factor:").

2. I was able to briefly reproduce the issue by creating a test user and doing "su - idm_user5". Occasionally, I would receive the "Password:" prompt and I was able to log in using only a single factor. I should have checked, at the time, if I had received a kerberos ticket (via klist) after the successful login; but I did not. I am now unable to reproduce it again. Future tests which are able to trigger this condition should immediately run klist and provide the output. I suspect there will not be a ticket because...

3. During the testing I rebooted the servers and, accidentally, ran "su - idm_user5" while the servers were down. I immediately received the "Password:" prompt and was able to log in. This caught me by surprise until I realized that SSSD caching was enabled:

    [domain/qe.sti.com]
    cache_credentials = True
    krb5_store_password_if_offline = True

My suspicion is that SSSD is periodically performing authentication in cached mode even when we are online. We can confirm this by running klist (see #2 above) after a successful 1FA login. If there is no ticket, then SSSD allowed a local login only due to authenticating in cached mode.

If this suspicion is correct, then the system is running correctly as configured and the only bug might be in how eager SSSD is to perform a cached, local authentication.
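
A quick way to check is sketched below; the user and the cache file name are illustrative, not from an actual session in this report:

    $ su - idm_user5
    Password:
    $ klist
    klist: No credentials cache found (filename: /tmp/krb5cc_1005)

No TGT after a "Password:"-only login would mean SSSD satisfied the login locally from its cache rather than online via the KDC.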

=== ipa-otpd crash issue ===

1. I was able to observe that ipa-otpd had entered a failed state many times in systemctl.

2. I was unable to convince ipa-otpd to crash for me.

3. I have enabled ipa-otpd to run under valgrind on both the master and the replica. We should hopefully be able to get decent information after a crash now.

Comment 38 baiesi 2018-05-22 19:29:23 UTC
Hi Nathaniel,
In online mode, which we are in, are you saying it is designed and expected behaviour for users protected with two-factor authentication only to periodically get prompted with "Password:" instead of the "First Factor:" and "Second Factor:" prompts, and to be granted access to the protected resource with just an LDAP password? I never expected this behaviour in my automated tests and may have to adjust if this is the case. For online/offline, it seems that if I want two-factor authn to protect my resource, it should never check the cache or grant access with just my LDAP password. Although I can see it being OK for users set up with just password authentication...
Thanks

Comment 39 Nathaniel McCallum 2018-05-22 20:34:42 UTC
(In reply to baiesi from comment #38)
> Hi Nathaniel,
> In on-line mode, which we are, are you saying it is designed and expected
> behaviour for users protected with just two factor authentication,
> periodically to get prompt for "Password:" instead of First Factor: and 
> "Second Factor" prompts and granted access to my protected resource with
> just a ldap password?

No. It is designed for the following:

1. If the system is online, two factors will be verified (remotely, by the krb5 KDC) during login and a ticket will be procured. This allows the user access to the local system and to network resources.

2. If the system is offline, only the first factor will be verified (locally, by SSSD) during login and a ticket will NOT be procured. This allows the user access to the local system but NOT to network resources.

The primary driver for this is the laptop use case.

If you disable caching and clear the cache, you should *never* see "Password:". I believe that, for some erroneous reason, sssd is periodically doing an offline authentication even when we are online. If I'm correct, this is a bug that should be fixed.

IMHO, if you are using this to protect a network resource, you should disable SSSD caching.

> I never expected this behaviour in my automated tests
> and may have to adjust if this is the case.  For on-line / off-line, it
> seems that if I want two factor authn to protect my resource, it should
> never check cache, or grant access with just my ldap password. Although I
> can see it ok with users set-up with just a password  authentication use
> case...

Yes, for network resources I would disable first factor caching altogether.

Comment 40 baiesi 2018-05-23 20:33:56 UTC
Hi Nathaniel,
I tried your theory and disabled caching in the sssd.conf file, then restarted the service and cleared the cache:

1.  sudo vim /etc/sssd/sssd.conf
[domain/qe.sti.com]
cache_credentials = False
krb5_store_password_if_offline = False
...

2. then restart
sudo service sssd restart
sudo service sssd status

3. then cleared cache...
sudo sss_cache -E

I tried authn again and I am still periodically getting prompted with "Password:". It initially asked for "First Factor:" the first time; I Ctrl-C'd out. The second time it prompted me with "Password:" again. The good news here is that it's not giving me access to the host resource anymore as it did yesterday. The bad news is it's prompting me with the wrong prompt. Is there an additional step I need to perform with sssd?

Thanks
Bruce


*Terminal manual std-out:
[baiesi@dhcp-41-174 keys]$ ssh idm_user170.245.206
First Factor:

[baiesi@dhcp-41-174 keys]$ ssh idm_user170.245.206
Password:
Password:
Password:
idm_user170.245.206's password:
Permission denied, please try again.
idm_user170.245.206's password:

*Automated Tests Stdout:
ssh attempt: 10.8.245.206 idm_user100, password1$, http.COM, QE.STI.COM, qe.sti.com
Running Command: ssh idm_user100.245.206
Warning: 1. Should not be prompting for LDAP password for TOTP test, will attempt to penetrate: password1$
Warning: 2. Should not be prompting for LDAP password for TOTP test again
failed out...

Comment 41 Nathaniel McCallum 2018-05-23 20:52:20 UTC
So that appears to have stopped the caching (hence logins fail). But we don't know why SSSD is giving us the wrong prompt. I think this is now an SSSD bug.

Comment 42 Sumit Bose 2018-05-23 21:58:03 UTC
If SSSD is in the offline state it will still prompt for 'Password:' even if no password is cached. The check whether the password is cached and matches the input is done after the user is prompted.

You can verify whether SSSD is offline when you see the 'Password:' prompt by calling

    sssctl domain-status qe.sti.com

The next step is to determine why SSSD is offline. For this I need the SSSD logs with debug_level=9 at least in the [domain/...] section. Having debug_level=9 in the [nss] and [pam] sections might be useful as well.
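
That is, roughly the following in /etc/sssd/sssd.conf on the client (a sketch only; restart sssd afterwards):

    [domain/qe.sti.com]
    debug_level = 9
    ...

    [nss]
    debug_level = 9

    [pam]
    debug_level = 9

The resulting logs land under /var/log/sssd/, e.g. sssd_qe.sti.com.log.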

HTH

bye,
Sumit

Comment 43 baiesi 2018-05-24 14:39:05 UTC
Hi Sumit,
I just installed "sudo yum install sssd sssd-tools" and ran "sssctl domain-status qe.sti.com" It initially indicated online mode, then a few minutes later indicated off-line. So for some reason this idm client is periodically oscillating between online/offline. Question now is why?
Thanks

[cloud-user@host-172-16-125-23 ~]$ sudo sssctl domain-status qe.sti.com
Online status: Online
Active servers:IPA: master.qe.sti.com
Discovered IPA servers:- master.qe.sti.com

[cloud-user@host-172-16-125-23 ~]$ sudo sssctl domain-status qe.sti.com
Online status: Offline
Active servers:IPA: master.qe.sti.com
Discovered IPA servers:- master.qe.sti.com

Comment 44 Sumit Bose 2018-05-24 15:05:54 UTC
(In reply to baiesi from comment #43)
> Hi Sumit,
> I just installed "sudo yum install sssd sssd-tools" and ran "sssctl
> domain-status qe.sti.com" It initially indicated online mode, then a few
> minutes later indicated off-line. So for some reason this idm client is
> periodically oscillating between online/offline. Question now is why?
> Thanks
> 
> [cloud-user@host-172-16-125-23 ~]$ sudo sssctl domain-status qe.sti.com
> Online status: Online
> Active servers:IPA: master.qe.sti.com
> Discovered IPA servers:- master.qe.sti.com
> 
> [cloud-user@host-172-16-125-23 ~]$ sudo sssctl domain-status qe.sti.com
> Online status: Offline
> Active servers:IPA: master.qe.sti.com
> Discovered IPA servers:- master.qe.sti.com

Ok, this would explain the prompting behavior. Can you attach the logs I mentioned above to the ticket?

bye,
Sumit

Comment 45 baiesi 2018-05-24 21:10:32 UTC
Hi Sumit,

Additional clues:
So I stopped all authentication against the IdM client and began monitoring the status via "sudo sssctl domain-status qe.sti.com", which indicated "Online" over a period of 1 hr. OK, so far so good...

Then I started a light load test cycling users 1-50 as 1 thread sequentially. After a few minutes of performing 2FA successfully, the domain status began cycling between online and offline periodically again.

note: Is there a possible correlation with the switch to offline mode occurring when ipa-otpd crashes, as discussed in this defect?

I still need to configure and attach the logs.

Thanks
Bruce

Comment 46 Sumit Bose 2018-05-25 10:41:44 UTC
Hi,

you are right, there are a number of Kerberos errors which will cause SSSD to assume that the KDC has issues; these affect not only the current request but other requests as well, and mark this KDC as 'not working'. If all discovered or configured KDCs have been tried, SSSD will switch to offline mode. The Kerberos error codes for this are currently:

- KRB5_KDCREP_SKEW
- KRB5KRB_AP_ERR_SKEW
- KRB5KRB_AP_ERR_TKT_EXPIRED
- KRB5KRB_AP_ERR_TKT_NYV
- KRB5_KDC_UNREACH
- KRB5_REALM_CANT_RESOLVE
- KRB5_REALM_UNKNOWN

So yes, the authentication failures due to crashing ipa-otpd might cause SSSD to switch to offline mode.

HTH

bye,
Sumit

Comment 47 baiesi 2018-05-25 14:13:31 UTC
Hi Sumit,
Looking at the krb5kdc.log on the IdM master, the log snippet below shows the message "verify failure: Connection timed out" when authenticating under light load, which seemingly puts SSSD into offline mode.

*krb5kdc.log
May 25 09:28:10 master.qe.sti.com krb5kdc[2493](info): AS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 10.8.245.206: NEEDED_PREAUTH: idm_user16.COM for krbtgt/QE.STI.COM.COM, Additional pre-authentication required
May 25 09:28:10 master.qe.sti.com krb5kdc[2493](info): closing down fd 12

>>>>>>>May 25 09:28:15 master.qe.sti.com krb5kdc[2494](info): preauth (otp) verify failure: Connection timed out<<<<<<<<<<<<

May 25 09:28:15 master.qe.sti.com krb5kdc[2494](info): AS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 10.8.245.206: PREAUTH_FAILED: idm_user16.COM for krbtgt/QE.STI.COM.COM, Preauthentication failed
May 25 09:28:17 master.qe.sti.com krb5kdc[2494](info): closing down fd 12
May 25 09:28:18 master.qe.sti.com krb5kdc[2494](info): AS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 10.8.245.206: NEEDED_PREAUTH: idm_user16.COM for krbtgt/QE.STI.COM.COM, Additional pre-authentication required
May 25 09:28:18 master.qe.sti.com krb5kdc[2494](info): closing down fd 12

It seems we are now back to where we started: ipa-otpd crashes, which is the initial cause of the symptoms we uncovered.

Comment 48 Sumit Bose 2018-05-25 15:30:50 UTC
Yes, I agree.

Since the reply to the AS_REQ is delayed for 5s and the PREAUTH_FAILED error is returned, you might be able to avoid SSSD going offline by increasing krb5_auth_timeout in the [domain/...] section of sssd.conf. The default is 6s, i.e. SSSD waits 6s for the AS_REQ to succeed. If the timeout is hit, SSSD marks the KDC as 'not working' and will switch to offline mode if there are no KDCs left.

If you set 'krb5_auth_timeout = 10' it should cover the 5s delay on the KDC side when ipa-otpd crashes, and SSSD will just return an authentication error to the client based on PREAUTH_FAILED but will not switch to offline mode.

I think 'sssd' is currently not the right component for this ticket. Would you mind switching it back to 'ipa', or even to 'krb5' if it looks like the issue might be in libkrad?

bye,
Sumit

Comment 49 Nathaniel McCallum 2018-05-25 15:52:49 UTC
Since we have good reason to suspect that the ipa-otpd crash is causing sssd to go into offline mode, I have moved the component back to IPA.

Sumit, thank you for your help!

Bruce, can you confirm that raising the timeout makes the prompt disappear? If so, let's focus on the ipa-otpd crash.

Comment 50 baiesi 2018-05-25 18:06:27 UTC
Well, I do not see a direct correlation between ipa-otpd crashes and sssd going offline; it was a theory I was suggesting. The sssd oscillation between online and offline is more frequent than the crashes seen in the current environment.



I've changed sssd.conf to:

[domain/qe.sti.com]
debug_level = 9
cache_credentials = False
krb5_store_password_if_offline = False
krb5_auth_timeout = 10
.
.
.

I am still getting prompted with "Password:" when offline, and still seeing sssd oscillate, but only when applying authn load to it. I also now have a simple script monitoring the online/offline status.

thanks
Bruce

Comment 51 Nathaniel McCallum 2018-05-25 18:45:13 UTC
Let's work on the ipa-otpd crashes and return to the online/offline problem afterwards (to ensure they aren't correlated). I have reviewed the valgrind output and do see the crashes. However, I forgot to install the debug info, so they are of limited value. I have since installed the debug info and restarted the service on the master.

Bruce, could you please put some load on the system so that we can reproduce the crash?

Comment 52 Sumit Bose 2018-05-25 18:51:11 UTC
(In reply to baiesi from comment #50)
> Well I do not see a direct correlations to ipa-otp crashes and sssd going
> off line, it was a theory I was suggesting.  The sssd oscillating from
> online/offline is more frequent than the crashes seen in the current
> environment.

If you have any SSSD logs please attach them to the ticket, then I can check if there are other reasons for SSSD to go offline.

Comment 53 milei 2018-05-25 19:37:42 UTC
Created attachment 1441691 [details]
sssd.log on IPA client 206

sssd.log from ipa client( 206) is attached.

Comment 54 milei 2018-05-25 19:44:21 UTC
Nathaniel, the 5-thread TOTP authentication test is running. It is scheduled to run for 2 hours.

****************************************************************

Run ID 	R-4371
Profile Name 	idm-75-auth-bug
Run Name 	idm system test_8_lce-6
Tester 	milei
State 	Running Running
Group 	idm
Start Date 	2018-05-25 15:29:48 EDT
Stop Date 	2018-05-25 17:29:48 EDT
*****************************************************************

Comment 55 Sumit Bose 2018-05-25 20:04:17 UTC
(In reply to milei from comment #53)
> Created attachment 1441691 [details]
> sssd.log on IPA client 206
> 
> sssd.log from ipa client( 206) is attached.

I need the sssd_your.domain.name.log file.

Comment 56 Nathaniel McCallum 2018-05-25 20:37:03 UTC
Created attachment 1441708 [details]
valgrind of ipa-otpd during crashes

Comment 57 milei 2018-05-25 20:41:17 UTC
(In reply to Nathaniel McCallum from comment #56)
> Created attachment 1441708 [details]
> valgrind of ipa-otpd during crashes

(In reply to Sumit Bose from comment #55)
> (In reply to milei from comment #53)
> > Created attachment 1441691 [details]
> > sssd.log on IPA client 206
> > 
> > sssd.log from ipa client( 206) is attached.
> 
> I need the sssd_your.domain.name.log file.

sssd_your.domain.name.log.tar.gz is 22MB, bigger than bugzilla's limit (20MB).
Please see my email for access to the log.

Comment 58 Nathaniel McCallum 2018-05-25 20:46:48 UTC
Robbie, can you look at this crasher? It might be a libverto bug. See the above backtrace and valgrind logs.

Comment 59 Robbie Harwood 2018-05-30 18:59:38 UTC
I think it's an otpd queue bug: https://github.com/freeipa/freeipa/pull/1980

Comment 64 Florence Blanc-Renaud 2018-06-19 12:14:34 UTC
The fix has already been committed:
master:
    a2e8d98 Fix elements not being removed in otpd_queue_pop_msgid()
https://pagure.io/freeipa/c/a2e8d989a359690f493930461adf7f52a0cc745d?branch=master

ipa-4-6:
    a417b81 Fix elements not being removed in otpd_queue_pop_msgid()
https://pagure.io/freeipa/c/a417b81ebfe60e4d2dfc688dfb2d1865486468e0?branch=ipa-4-6

ipa-4-5:
    3eb1a07 Fix elements not being removed in otpd_queue_pop_msgid()
https://pagure.io/freeipa/c/3eb1a07d0ee86e2aaf25843a1eaff01289d434cd?branch=ipa-4-5

As the commit in ipa-4-6 was done before the rebase, the fix is already included in ipa-4.6.4-1.el7
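
Conceptually, the fix makes the pop actually unlink the matching element. Continuing the simplified sketch from the comment 17 discussion (an illustration, not the literal upstream diff):

    /* FIXED pop: walk with a pointer-to-pointer so the matching element can
     * be unlinked before it is returned; the caller then owns the only
     * reference and the item is freed exactly once. */
    static struct item *pop_msgid(struct queue *q, int msgid) {
        for (struct item **pp = &q->head; *pp != NULL; pp = &(*pp)->next) {
            if ((*pp)->msgid == msgid) {
                struct item *found = *pp;
                *pp = found->next;      /* unlink from the queue */
                found->next = NULL;
                return found;
            }
        }
        return NULL;
    }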

Comment 69 Scott Poore 2018-08-18 04:37:03 UTC
I believe this is still failing:

[root@qe-blade-07 ~]# rpm -q ipa-server
ipa-server-4.6.4-6.el7.x86_64

From the STR tests I am seeing logins prompt for a password when they should be prompting for the OTP First and Second Factors.


##############################

ipa_authn_totp_negative.py::Test_totp::test_totp[{'userstartindex': 50, 'userprefix': 'idm_user', 'adminpassword': 'Secret123', 'host': 'CLIENT_IP', 'threadtotal': 5, 'threadid': 0}] 
...
Running Command: ssh idm_user50@CLIENT_IP
Password: Warning: Should not be prompting for LDAP password for TOTP test
ERROR: assert False : Warning: Should not be prompting for LDAP password for TOTP test
...
FAILED

##############################

ipa_authn_totp_negative.py::Test_totp::test_totp[{'userstartindex': 50, 'uid': '518ff8', 'userprefix': 'idm_user', 'tdlogpath': '/opt/sti_ext/test_driver/harness/logs/R4729/CLIENT_IP/authn_totp_2_factor_neg', 'adminpassword': 'Secret123', 'host': 'CLIENT_IP', 'threadtotal': 5, 'threadid': 1}] 
...
Running Command: ssh idm_user51@CLIENT_IP
First Factor: Set1 First Factor LDAP: password1$
password1$

Second Factor: 2018-08-17 21:46:53 Set1 Second Factor TOTP: 123456 <<<<<<<<< BAD TOKENCODDE
123456

First Factor: Warning - Authn TOTP failed, is requesting to re enter first factor again for idm_user51
...
PASSED

Comment 70 Scott Poore 2018-08-18 04:45:34 UTC
Also, running a manual test, I can see the eventual prompt for password:

[root@cloud-qe-05 ~]# while true; do expect -f test; if [ $? -eq 1 ]; then break; fi; done
spawn ssh idm_user50@CLIENT_IP
First Factor: 
Second Factor: 
spawn ssh idm_user50@CLIENT_IP EXPECTED
First Factor: 
Second Factor: 
spawn ssh idm_user50@CLIENT_IP EXPECTED
First Factor: 
Second Factor: 
spawn ssh idm_user50@CLIENT_IP EXPECTED
First Factor: 
Second Factor: 
spawn ssh idm_user50@CLIENT_IP EXPECTED
First Factor: 
Second Factor: 
spawn ssh idm_user50@CLIENT_IP EXPECTED
First Factor: 
Second Factor: 
Password: ^Cspawn ssh idm_user50@CLIENT_IP
Password: BADBADBAD[root@cloud-qe-05 ~]# vim test
[root@cloud-qe-05 ~]# !ssh
ssh idm_user50@CLIENT_IP
Password: 

[root@cloud-qe-05 ~]# 

However, I have so far been unable to actually log in with the user's password at this point.

Comment 71 Scott Poore 2018-08-18 04:50:33 UTC
Nathaniel,

I have set sssd.conf entry:

krb5_store_password_if_offline = False

And considering the results above, is this still an issue?  Or am I missing something?

FYI, the test expect script looks like this:

set timeout 30
set send_slow {1 .1}
spawn ssh idm_user50@CLIENT_IP
expect {
	"First Factor:" {
		send -- "password1$\r"
	}
	"Password:" {
		send_user "BADBADBAD"
		exit 1
	}
	timeout {
		exit 175
	}
}
expect {
	"Second Factor:" {
		send -- "123456\r"
	}
	"Password:" {
		send_user "BADBADBAD"
		exit 1
	}
	"Last login:" {
		send_user "LOGIN SUCCEEDED UNEXPECTEDLY\r"
		exit 1
	}
	timeout {
		exit 175
	}
}
expect {
	"First Factor:" {
		send_user "\r\r"
		send_user "                 LOGIN FAILED AS EXPECTED\r"
		exit 0
	}
	"Last Login:" {
		send_user "\r\r"
		send_user "                 LOGIN SUCCEEDED UNEXPECTEDLY\r"
		exit 1
	}
	timeout {
		exit 175
	}
}

Comment 72 Scott Poore 2018-08-18 04:58:37 UTC
Also, we are still seeing the ipa-otpd crash:

From an abrt report:

time:           Fri 17 Aug 2018 10:18:15 PM EDT
package:        ipa-server-4.6.4-6.el7
reason:         ipa-otpd killed by SIGABRT
crash_function: otpd_queue_item_free
cmdline:        /usr/libexec/ipa/ipa-otpd ldapi://%2fvar%2frun%2fslapd-QE-STI-COM.socket
executable:     /usr/libexec/ipa/ipa-otpd
component:      ipa
uid:            0
username:       root
hostname:       qe-blade-07.qe.sti.com
os_release:     Red Hat Enterprise Linux Server release 7.6 Beta (Maipo)
architecture:   x86_64
pwd:            /
kernel:         3.10.0-933.el7.x86_64
abrt_version:   2.1.11

Comment 82 Robbie Harwood 2018-08-24 15:02:51 UTC
The crash Scott reported in comment #72 has a separate bug report (with a fix): https://bugzilla.redhat.com/show_bug.cgi?id=1622168

Comment 85 Tibor Dudlák 2018-09-04 12:39:33 UTC
The BZ was marked as MODIFIED by mistake; changing back to POST.

Comment 87 Scott Poore 2018-09-12 13:47:08 UTC
Verified.

Version ::

ipa-server-4.6.4-9.el7.x86_64


Results ::

STR job run on simple IPA Master-Replica-Client environment.

Run ID 	R-4849
Profile Name 	idm-76-otp-pos1
Run Name 	idm_bug_check_totp_7_lce-1
Tester 	spoore
State 	Complete Complete
Group 	idm
Start Date 	2018-09-11 18:12:13
Stop Date 	2018-09-12 06:12:13
Test Duration 	Days:0, Hrs:12, Mins:0, Secs:0
Lab Controller Engine 	LCE-1-10.8.48.121:8084

I did not see any crash reports during this run. Previously, we were seeing crashes very frequently when we ran this test.

[root@qe-blade-04 log]# abrt-cli list
The Autoreporting feature is disabled. Please consider enabling it by issuing
'abrt-auto-reporting enabled' as a user with root privileges


[root@qe-blade-09 slapd-QE-STI-COM]# abrt-cli list
The Autoreporting feature is disabled. Please consider enabling it by issuing
'abrt-auto-reporting enabled' as a user with root privileges

Also, I adjusted the tests slightly to no longer throw failures when prompted for the LDAP password if the password was rejected. As long as the prompt does not then allow login, I left the case marked passed, to simplify troubleshooting for this bug. All cases of this that I investigated were due to locked accounts, which could be the result of the negative tests running too frequently.

Comment 89 errata-xmlrpc 2018-10-30 10:57:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3187

