Bug 1494843
Summary:           KCM does not work
Product:           Fedora
Component:         sssd
Status:            CLOSED ERRATA
Severity:          unspecified
Priority:          unspecified
Version:           27
Hardware:          Unspecified
OS:                Unspecified
Reporter:          Dennis Gilmore <dennis>
Assignee:          Jakub Hrozek <jhrozek>
QA Contact:        Fedora Extras Quality Assurance <extras-qa>
CC:                abokovoy, andrew, asn, bberg, bojan, chuck.lever, debarshir, dennis, fidencio, jhrozek, j, kparal, lars, lslebodn, nalin, pbrezina, pschindl, rharwood, sbose, sgallagh, ssorce, tomek, tpopela, wattersm, zing
Keywords:          CommonBugs
Whiteboard:        https://fedoraproject.org/wiki/Common_F27_bugs#kerberos-broken
Fixed In Version:  sssd-1.16.1-2.fc27 sssd-1.16.1-2.fc28 sssd-1.16.1-2.fc26
Doc Type:          If docs needed, set a value
Last Closed:       2018-04-02 13:01:48 UTC
Type:              Bug
Description (Dennis Gilmore, 2017-09-23 13:13:46 UTC)
(In reply to Dennis Gilmore from comment #0)
> Description of problem:
> On a new Fedora 27 install kerberos does not work. I get the following
> message
> kinit: Internal credentials cache error while getting default ccache
> I was able to get kerberos working by commenting out
> I use IPA for authentication

If you have the file /etc/krb5.conf.d/kcm_default_ccache then the package sssd-kcm is installed. See http://fedoraproject.org/wiki/Releases/27/ChangeSet#Kerberos_KCM_credential_cache_by_default

We will need to see log files from the daemons sssd-secrets and sssd-kcm. Add the following lines to sssd.conf (or /etc/sssd/conf.d/11_kcm.conf):

[kcm]
debug_level = 10
debug_microseconds = true

[secrets]
debug_level = 10
debug_microseconds = true

* restart the services sssd, sssd-secrets and sssd-kcm
* reproduce the issue with "KRB5_TRACE=/dev/stderr kinit .."
* provide the output of the previous command plus the log files /var/log/sssd/sssd_secrets.log and /var/log/sssd/sssd_kcm.log

sssd_kcm.log gave me:

(Mon Sep 25 16:34:52 2017) [sssd[kcm]] [kcm_recv] (0x0010): Failed to receive data (104, Connection reset by peer), aborting client

I did not get any extra output from anything else; kinit gave the exact same output. I had to reboot the system for the changes to take effect. I also had issues logging in as my user when KCM was configured and sssd debugging was enabled.

(In reply to Dennis Gilmore from comment #2)
> sssd_kcm.log gave me
> (Mon Sep 25 16:34:52 2017) [sssd[kcm]] [kcm_recv] (0x0010): Failed to
> receive data (104, Connection reset by peer), aborting client

Do you have /var/log/sssd/sssd_secrets.log with the increased debug level? Was sssd-secrets running at that time? (It is a socket-activated service, the same as sssd-kcm.)

> I had to reboot the system. I also had issues logging in as my user when
> KCM was configured and the sssd debugging enabled

Which changes do you have in mind? Do you mean the default krb5 ccache set in /etc/krb5.conf.d/kcm_default_ccache?
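The debug-logging setup described above can be sketched as one script. This is an illustration, not part of the bug report: the SSSD_CONF_D variable and its scratch-directory default are assumptions made so the sketch is safe to run anywhere; on a real host you would point it at /etc/sssd/conf.d and run the commented systemctl/kinit steps as root.

```shell
# Sketch of the debug-logging steps from comment 1. SSSD_CONF_D defaults to a
# scratch directory; set it to /etc/sssd/conf.d on a real system.
SSSD_CONF_D="${SSSD_CONF_D:-$(mktemp -d)}"
cat > "$SSSD_CONF_D/11_kcm.conf" <<'EOF'
[kcm]
debug_level = 10
debug_microseconds = true

[secrets]
debug_level = 10
debug_microseconds = true
EOF
# sssd refuses world-readable config files, so tighten permissions.
chmod 600 "$SSSD_CONF_D/11_kcm.conf"
echo "wrote $SSSD_CONF_D/11_kcm.conf"
# On the real host (as root):
#   systemctl restart sssd sssd-secrets sssd-kcm
#   KRB5_TRACE=/dev/stderr kinit <principal>
#   # then collect /var/log/sssd/sssd_secrets.log and /var/log/sssd/sssd_kcm.log
```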
Because it was already written in the description that it was enough to modify that file.

I have exactly the same problem. Kerberos was working great for redhat.com and fedoraproject.org on Fedora 26. After upgrading to Fedora 27, it's completely broken.

$ klist -a
klist: Internal credentials cache error while getting default ccache
$ KRB5_TRACE=/dev/stderr kinit kparal
kinit: Internal credentials cache error while searching for ccache for kparal

In sssd_kcm.log I see:

(Thu Sep 28 14:16:40 2017) [sssd[kcm]] [kcm_recv] (0x0010): Failed to receive data (104, Connection reset by peer), aborting client
(Thu Sep 28 14:16:40 2017) [sssd[kcm]] [kcm_recv] (0x0010): Failed to receive data (104, Connection reset by peer), aborting client
(Thu Sep 28 14:16:40 2017) [sssd[kcm]] [kcm_recv] (0x0010): Failed to receive data (104, Connection reset by peer), aborting client
(Thu Sep 28 14:16:40 2017) [sssd[kcm]] [kcm_recv] (0x0010): Failed to receive data (104, Connection reset by peer), aborting client
(Thu Sep 28 14:16:42 2017) [sssd[kcm]] [kcm_recv] (0x0010): Failed to receive data (104, Connection reset by peer), aborting client
(Thu Sep 28 14:16:42 2017) [sssd[kcm]] [kcm_recv] (0x0010): Failed to receive data (104, Connection reset by peer), aborting client

over and over again, every few seconds. In sssd_secrets.log I see:

(Thu Sep 28 14:03:48 2017) [sssd[secrets]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Thu Sep 28 14:05:29 2017) [sssd[secrets]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Thu Sep 28 14:13:03 2017) [sssd[secrets]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Thu Sep 28 14:16:03 2017) [sssd[secrets]] [orderly_shutdown] (0x0010): SIGTERM: killing children

over and over again, every few minutes. While testing this, I performed the changes stated in comment 1 (created /etc/sssd/conf.d/11_kcm.conf). Both sssd-secrets.service and sssd-kcm.service were running during that time.
I can confirm that after commenting out the contents of the /etc/krb5.conf.d/kcm_default_ccache file kerberos starts working perfectly.

(In reply to Kamil Páral from comment #5)
> I can confirm that after commenting out contents of
> /etc/krb5.conf.d/kcm_default_ccache file kerberos starts working perfectly.

Please provide debug logs as per comment #1.

By the way, if you have issues generating the log files even after adding the debug_level directives to the kcm and secrets sections of sssd.conf and restarting sssd, chances are the old sssd_kcm and/or sssd_secrets processes are still around. You can kill them so that the next socket-activated instances read the verbose debug level (pkill sssd_kcm; pkill sssd_secrets). Alternatively, you can always run the daemons in the foreground:

/usr/libexec/sssd/sssd_kcm --uid 0 --gid 0 --debug-level=10 --debug-to-files
/usr/libexec/sssd/sssd_secrets --uid 0 --gid 0 --debug-level=10 --debug-to-files

That should generate /var/log/sssd/sssd_kcm.log and /var/log/sssd/sssd_secrets.log.

Created attachment 1331984 [details]
sssd_secrets.log
Created attachment 1331985 [details]
sssd_kcm.log.xz
I rebooted before generating the log files.

These logs are not verbose enough, because sssd logs only critical errors by default. Please use a higher debug level; see comment #7 for how to run the daemons from the command line if you have problems generating the logs. And please truncate the existing logs so the logs you attach ideally only include a single test.

(In reply to Jakub Hrozek from comment #11)
> These logs are not verbose enough, because sssd logs only critical errors by
> default. Please use a higher debug level, see comment #7 to run the daemons
> from the command line if you have problems generating the logs.
>
> And please truncate the existing logs so the logs you attach ideally only
> include a single test

Running the daemons from the command line is not ideal because they then run with a different SELinux context. The instructions in comment 1 were written for Dennis because he uses sssd with IPA, but in the case of a clean/minimal F27 install they won't work due to upstream bug https://pagure.io/SSSD/sssd/issue/3439. It will work even with a missing sssd.conf after the following change:

cp /etc/sssd/conf.d/11_kcm.conf /etc/sssd/sssd.conf

The main reason I wrote the instructions in comment #1 is to enable debug logs with microseconds, which might help a lot (and will not work with the instructions in comment #7). BTW, I suspect a bug related to responder_idle_timeout (which I disabled on my laptop due to running the daemons under valgrind).

I placed the content from comment 1 into /etc/sssd/sssd.conf and still the logs are not verbose. I'm available in the Red Hat office; perhaps you can stop by and try to debug it locally if you can't reproduce it?

So, to summarize. For some reason the logs started being verbose only some time after the reboot. The issue might be affected by other issues, some of them related to gnome-online-accounts. For testing purposes I disabled them. I tried to reproduce the issue on fresh F27 installs, and even on an F26->F27 upgrade, but it worked.
On my local machine, though, it only started working after deleting everything in /var/lib/sss/secrets/ (including hidden files) and rebooting. Since then, KCM seems to work. Dennis, can you confirm the same workaround works for you as well? Thanks.

So, I just ran into my issue again, and after a short chat with Patrick Uiterwijk it looks like the race condition is simply that systemd has not opened the socket by the time the sssd service becomes active. I.e. what happens is:

* sssd-*.service is started
* sssd-*.socket is also triggered (but the socket is not bound yet)
* the daemon comes up
* the daemon binds to the socket, as systemd has not done so yet
* systemd fails to bind the socket and the sssd-*.socket units fail to start up

There are different possible fixes for this:

* add proper Before=/After= lines
* prevent the service from ever trying to bind to the socket if running under systemd

Clearing the needinfo flag as the above analysis might explain the seen issue (at least the issue that I was seeing).

Benjamin, your issue was very likely different.

(In reply to Benjamin Berg from comment #15)
> So, I just ran into my issue again, and after a short chat with Patrick
> Uiterwijk it looks the race condition is simply that systemd has not opened
> the socket when the sssd service is being active. [...]

(In reply to Lukas Slebodnik from comment #17)
> Benjamin, your issue was very likely different.

Isn't that bug 1475999 ?
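The first suggested fix (explicit Before=/After= ordering) could take the shape of a unit drop-in like the one below. This is a hypothetical sketch of the idea only, not the fix that was actually shipped: the unit names come from the thread, but the specific directives and the DROPIN_DIR variable are assumptions. It writes to a scratch directory so it is safe to try; on a real host you would use `systemctl edit sssd-kcm.service` instead.

```shell
# Hypothetical drop-in illustrating "add proper Before=/After= lines": order
# sssd-kcm.service after its socket unit so the daemon never races systemd for
# the bind. The real location would be /etc/systemd/system/sssd-kcm.service.d/.
DROPIN_DIR="${DROPIN_DIR:-$(mktemp -d)}"
cat > "$DROPIN_DIR/order.conf" <<'EOF'
[Unit]
# Assumption: make sure the socket unit is fully set up before the service.
Requires=sssd-kcm.socket
After=sssd-kcm.socket
EOF
echo "wrote $DROPIN_DIR/order.conf"
# On a real host: systemctl daemon-reload && systemctl restart sssd-kcm.socket
```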
(In reply to Debarshi Ray from comment #18)
> (In reply to Benjamin Berg from comment #15)
> > So, I just ran into my issue again, and after a short chat with Patrick
> > Uiterwijk it looks the race condition is simply that systemd has not
> > opened the socket when the sssd service is being active. [...]
>
> (In reply to Lukas Slebodnik from comment #17)
> > Benjamin, your issue was very likely different.
>
> Isn't that bug 1475999 ?

No, BZ1475999 is already fixed even though the state of the BZ does not say so.

sh4 grep -E "kcm|secrets" /usr/lib/systemd/system-preset/*
/usr/lib/systemd/system-preset/90-default.preset:enable sssd-kcm.socket
/usr/lib/systemd/system-preset/90-default.preset:enable sssd-secrets.socket

Still present in sssd-kcm-1.16.0-5.fc27.x86_64

(In reply to Andrew Meredith from comment #20)
> Still present in sssd-kcm-1.16.0-5.fc27.x86_64

May I ask what kind of error you see after calling klist/kinit? The same as in comment 4?

For me, adding a debug level did not create more logs. Only the message "[sssd[kcm]] [kcm_recv] (0x0010): Failed to receive data (104, Connection reset by peer), aborting client" is ever generated.
But stracing KCM, I see that directly before generating this message, KCM does:

155677 writev(12, [{iov_base="\0\0\0\0", iov_len=4}], 1) = 4
155677 writev(12, [{iov_base="\226\307:\215", iov_len=4}], 1) = 4
155677 writev(12, [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0… ", iov_len=2048}], 1) = -1 EPIPE (Broken pipe)

And FD 12 seems to be the socket passed from socket activation:

155677 accept(3, {sa_family=AF_UNIX}, [110->2]) = 12

So I guess whatever connects to /var/run/.heim_org.h5l.kcm-socket just closes the connection in the middle of the conversation.

(In reply to Tomasz Torcz from comment #22)
> For me, adding debug level to logs did not create more logs. Only the
> message "[sssd[kcm]] [kcm_recv] (0x0010): Failed to receive data (104,
> Connection reset by peer), aborting client" is ever generated.

I hope the following steps will help to provide more verbose log files. Add the following lines to /etc/sssd/sssd.conf (if the file does not exist, create a new one):

"""
[kcm]
debug_level = 10
debug_microseconds = true

[secrets]
debug_level = 10
debug_microseconds = true
"""

And execute the following commands:

* chmod 600 /etc/sssd/sssd.conf
* /usr/sbin/sssd --genconf
* systemctl stop sssd-secrets and sssd-kcm
* reproduce the issue with "KRB5_TRACE=/dev/stderr kinit .."
* provide the output of the previous command plus the log files /var/log/sssd/sssd_secrets.log and /var/log/sssd/sssd_kcm.log

# /usr/sbin/sssd --genconf
(Sun Feb 4 18:33:34:570150 2018) [sssd] [main] (0x0010): pidfile exists at /var/run/sssd.pid
SSSD is already running
#

% KRB5_TRACE=/dev/stderr kinit zdzichu
kinit: Matching credential not found while getting default ccache
%

Files attached.

Created attachment 1391035 [details]
sssd_kcm.log.xz
Created attachment 1391036 [details]
sssd_secrets.log.xz
(In reply to Tomasz Torcz from comment #24)
> # /usr/sbin/sssd --genconf
> (Sun Feb 4 18:33:34:570150 2018) [sssd] [main] (0x0010): pidfile exists at
> /var/run/sssd.pid
> SSSD is already running

Ahh, I forgot about this case. "/usr/sbin/sssd --genconf" will not regenerate the internal conf.db if sssd is already running. Here are updated steps (I'm sorry for the inconvenience). Add the following lines to /etc/sssd/sssd.conf (if the file does not exist, create a new one):

"""
[kcm]
debug_level = 10
debug_microseconds = true

[secrets]
debug_level = 10
debug_microseconds = true
"""

And execute the following commands:

* chmod 600 /etc/sssd/sssd.conf
* /usr/sbin/sssd --genconf || systemctl restart sssd
* rm -f /var/log/sssd/*
* systemctl stop sssd-secrets and sssd-kcm
* reproduce the issue with "KRB5_TRACE=/dev/stderr kinit .."
* provide the output of the previous command plus the log files /var/log/sssd/sssd_secrets.log and /var/log/sssd/sssd_kcm.log

# chmod 600 /etc/sssd/sssd.conf
# /usr/sbin/sssd --genconf || systemctl restart sssd
(Fri Feb 9 07:57:33:884277 2018) [sssd] [main] (0x0010): pidfile exists at /var/run/sssd.pid
SSSD is already running
# rm -f /var/log/sssd/*
# systemctl stop sssd-secrets and sssd-kcm
Failed to stop and.service: Unit and.service not loaded.
Warning: Stopping sssd-secrets.service, but it can still be activated by: sssd-secrets.socket
Warning: Stopping sssd-kcm.service, but it can still be activated by: sssd-kcm.socket
% KRB5_TRACE=/dev/stderr kinit zdzichu
kinit: Matching credential not found while getting default ccache
%

At this point I tried to log in remotely via SSH (it, too, doesn't work). KCM logs for the SSH connection start at Fri Feb 9 07:58:57:260232 2018. The UID/GID 1001 you see in the logs belongs to my user:

# id zdzichu
uid=1001(zdzichu) gid=1001(zdzichu) groups=1001(zdzichu),190(systemd-journal),10(wheel)
# getent passwd zdzichu
zdzichu:*:1001:1001:Tomek Torcz:/home/zdzichu:/bin/zsh

Created attachment 1393551 [details]
sssd_kcm.log
Created attachment 1393552 [details]
sssd_secrets.log
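The updated steps above can be consolidated into one script. This is a sketch with two caveats: the CONF variable and its scratch-directory default are mine so the file-writing part is safe to run anywhere, and the stop command lists the unit names explicitly (in the transcript, "systemctl stop sssd-secrets and sssd-kcm" made systemctl treat the word "and" as a unit name).

```shell
# Consolidated sketch of the updated debug-log procedure. CONF defaults to a
# scratch path; on a real system it would be /etc/sssd/sssd.conf.
CONF="${CONF:-$(mktemp -d)/sssd.conf}"
cat > "$CONF" <<'EOF'
[kcm]
debug_level = 10
debug_microseconds = true

[secrets]
debug_level = 10
debug_microseconds = true
EOF
chmod 600 "$CONF"
echo "wrote $CONF"
# As root on the real host, with the file at /etc/sssd/sssd.conf:
#   /usr/sbin/sssd --genconf || systemctl restart sssd
#   rm -f /var/log/sssd/*
#   systemctl stop sssd-secrets.service sssd-kcm.service   # units listed explicitly
#   KRB5_TRACE=/dev/stderr kinit <principal>
#   # collect /var/log/sssd/sssd_secrets.log and /var/log/sssd/sssd_kcm.log
```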
Having the same issue on my workstation which is running CentOS 7. The kcm log actually filled up the / partition with over 20 GB of data. The same message is logged repeatedly in the log file.

[sssd[secrets]] [orderly_shutdown] (0x0010): SIGTERM: killing children

(In reply to Michael Watters from comment #31)
> Having the same issue on my workstation which is running CentOS 7. The kcm
> log actually filled up the / partition with over 20 GB of data. The same
> message is logged repeatedly in the log file.
>
> [sssd[secrets]] [orderly_shutdown] (0x0010): SIGTERM: killing children

Michael, your problem is not the same as described by Tomasz, but it does sound like an issue. Logging the responder activity when it's socket-activated (it's your case, right?) may be done in a more careful way in order to avoid:
- flooding the logs;
- filling up the disk of the users;

I'm opening an upstream bug about this one.

Here's a link for the upstream issue: https://pagure.io/SSSD/sssd/issue/3651

(In reply to Fabiano Fidêncio from comment #33)
> (In reply to Fabiano Fidêncio from comment #32)
> > (In reply to Michael Watters from comment #31)
> > > Having the same issue on my workstation which is running CentOS 7.
The kcm
> > > log actually filled up the / partition with over 20 GB of data. The same
> > > message is logged repeatedly in the log file.
> > >
> > > [sssd[secrets]] [orderly_shutdown] (0x0010): SIGTERM: killing children
> >
> > Michael, your problem is not the same as described by Tomasz, but it does
> > sound like an issue. [...]
> >
> > I'm opening an upstream bug about this one.
>
> Here's a link for the upstream issue: https://pagure.io/SSSD/sssd/issue/3651

And the bug report was considered not useful, so I've closed it.

I have updated the bug report with additional details. The actual issue is that KCM lookups appear to be failing. I am seeing this in CentOS 7 and Fedora 27, which both have the sssd-kcm.socket service enabled. Here are the log messages which are being repeated every few seconds.
(Mon Feb 26 07:36:50 2018) [sssd[kcm]] [kcm_recv] (0x0010): Failed to parse data (74, Bad message), aborting client
(Mon Feb 26 07:36:50 2018) [sssd[kcm]] [kcm_recv] (0x0010): Failed to parse data (74, Bad message), aborting client
(Mon Feb 26 07:36:50 2018) [sssd[kcm]] [kcm_recv] (0x0010): Failed to parse data (74, Bad message), aborting client
(repeated many times)

Since it's EBADMSG, there are only two places in the code where this can happen:

152     if (msglen == 0) {
153         DEBUG(SSSDBG_CRIT_FAILURE, "Illegal zero-length message\n");
154         return EBADMSG;
155     }
156
157     if (msglen != reqbuf->v_msg.nprocessed) {
158         DEBUG(SSSDBG_CRIT_FAILURE,
159               "Sender claims the message is %"PRIu32" bytes, "
160               "but received %zu\n",
161               msglen, reqbuf->v_msg.nprocessed);
162         return EBADMSG;
163     }

So it's either a zero-length message, or the sender sent a message that is not as long as it claims to be, or, of course, we have a bug in the sssd processing logic. I think the first step should be to bump the debug logs to a more verbose level to see which of the two it is. The logs should also include the PID of the caller, which might reveal who's sending the (potentially bogus?) data. I wonder if we should also enhance upstream to dump the whole raw message (or at least the headers etc, perhaps not the payload..)
with a high debug level. When I was debugging KCM, IIRC I was using strace's "-e read=$num" to see the dump of the KCM socket data.

Jakub, we should implement some log throttling, probably by using a private timestamped cache or something. It makes no sense to spit out the same exact message multiple times a second. Perhaps we should also think of throttling clients, to prevent a badly behaved client from monopolizing the KCM.

(In reply to Simo Sorce from comment #37)
> Jakub,
> we should implement some log throttling, probably by using a private
> timestamped cache or something. It makes no sense to spit out the same exact
> message multiple times a second... perahps we should also think of
> throttling clients, to prevent a bad behaved client from monopolizing the
> KCM.

I would say we should first focus on the bug and then on the unnecessary log messages. It is very likely that fixing the bug will reduce the size of the log file as well.

I'm hitting this problem with sssd-kcm-1.16.0-6.fc27.x86_64. Stopping sssd-kcm{.service,.socket}, removing /var/lib/sss/secrets/*, and restarting sssd-kcm.socket seems to have cleared things up. I am not able to reproduce the problem on demand. If it matters, I am running with multiple credential caches (one internal, one for fedoraproject.org):

$ klist -l
Principal name    Cache name
--------------    ----------
username          KCM:1000:7064
username          KCM:1000

Even restarting the sssd-kcm service only temporarily resolves the problem.
Here you can see that I had a functional credentials cache for approximately 15 seconds:

[lars@madhatter ~]$ date; klist
Thu Mar 15 08:24:27 EDT 2018
klist: Internal credentials cache error while getting default ccache
[lars@madhatter ~]$ sudo systemctl stop sssd-kcm.{service,socket}; sudo rm -f /var/lib/sss/secrets/.secrets.mkey /var/lib/sss/secrets/secrets.ldb; sudo systemctl start sssd-kcm.socket
[lars@madhatter ~]$ date; klist
Thu Mar 15 08:24:29 EDT 2018
Ticket cache: KCM:1000:66640
Default principal: username
[lars@madhatter ~]$ date; klist
Thu Mar 15 08:24:34 EDT 2018
klist: Internal credentials cache error while getting default ccache

Okay, this bugzilla has several different issues being treated as the same, although they're not. So, let's try to organize it and make it more useful to everyone involved.

Michael Watters, are you running SSSD against AD? I have the feeling I've fixed your issue, and I may even have a testing build that you can install, test and give me some feedback on. Just do a `dnf copr enable fidencio/sssd-warpil` and update SSSD.

Lars Kellogg-Stedman, please, I'd really like to ask you for logs. As this bug is a little bit messier than usual, I'd like to be sure what kind of issue we're dealing with here, and then I'll open more tickets, each one specific to each issue we're facing.

Tomasz, I'll get to your logs. You're still facing the issue, right?

Andrew, Kamil, Dennis: what is the status of this issue for you? Would you be able to provide us some debug logs as well, and maybe some steps to reproduce the issue?

Sorry for the really long time it's taking us to track this issue down.

I think that the problems reported by Dennis, Kamil and Lars are the same issue - the one which you debugged with me yesterday. I had the same symptoms as the ones described in comment 2, comment 4 (I checked the logs I had saved in the past), comment 39 and comment 40.
Especially note how restarting the systemd sockets (comment 39 and comment 40) seems to fix the problem. That's also how we accidentally (?) fixed my problem yesterday.

I don't know the exact problem that Benjamin (comment 15) is referring to. However, given that he is talking about systemd sockets, maybe that's also the same problem?

(In reply to Debarshi Ray from comment #42)
> I think that the problems reported by Dennis, Kamil and Lars are the same
> issue - the one which you debugged with me yesterday.

Yep, but we didn't find out what was causing the issue you've faced. And in the logs provided I can't find the very same JSON error we could find in yours. So, I would take the safest path here and not treat the issues as the same (at least till the logs show us the opposite).

> Especially note how restarting the systemd sockets (comment 39 and comment
> 40) seems to fix the problem. That's also how we accidentally (?) fixed my
> problem yesterday.

Yep, but the problem came back for Lars. So, I'm talking to him privately in order to get more data and try to find out what may actually cause the issue and then come up with a proper solution.

> I don't know the exact problem that Benjamin (comment 15) is referring to.
> However, given that he is talking about systemd sockets, maybe that's also
> the same problem?

I don't think so, Rishi. Benjamin's problem was https://github.com/SSSD/sssd/commit/53d1459e9b87196b4f6e327f0f5db4d9229bf541 ... which makes me want to re-review the Units parameters used for sssd-kcm and sssd-secrets, as those still seem to be problematic.

(In reply to Fabiano Fidêncio from comment #43)
> (In reply to Debarshi Ray from comment #42)
> > I think that the problems reported by Dennis, Kamil and Lars are the same
> > issue - the one which you debugged with me yesterday.
I had the same
> > symptoms as the ones described in comment 2, comment 4 (I checked the logs
> > I had saved in the past), comment 39 and comment 40.
>
> Yep, but we didn't find out what was causing the issue you've faced. And in
> the logs provided I can't find the very same JSON error we could find in
> yours.

The thing is that I had dozens of these in my logs from last November before the JSON error was logged:

[sssd[kcm]] [kcm_recv] (0x0010): Failed to receive data (104, Connection reset by peer), aborting client
[sssd[secrets]] [orderly_shutdown] (0x0010): SIGTERM: killing children

Dennis, Kamil, do you happen to have this in your sssd_kcm.log:

[sssd[kcm]] [sec_list_parse] (0x0020): Failed to parse JSON payload on line 21: unable to decode byte 0x88

> > Especially note how restarting the systemd sockets (comment 39 and comment
> > 40) seems to fix the problem. That's also how we accidentally (?) fixed my
> > problem yesterday.
>
> Yep, but the problem came back for Lars. So, I'm talking to him privately in
> order to get more data and try to find out what may actually cause the issue
> and then come up with a proper solution.

Ok, great.

> > I don't know the exact problem that Benjamin (comment 15) is referring to.
> > However, given that he is talking about systemd sockets, maybe that's also
> > the same problem?
>
> I don't think so, Rishi.
> Benjamin's problem was
> https://github.com/SSSD/sssd/commit/53d1459e9b87196b4f6e327f0f5db4d9229bf541
> ... which makes me want to re-review the Units parameters used for sssd-kcm
> and sssd-secrets, as those still seem to be problematic.

I see, ok.

People, I'd like to provide you guys a test build that may fix the issues you've been facing. To install it, please just do `dnf copr enable fidencio/sssd-kcm-issues && dnf update sssd`. Please let me know if you face more issues after this build (logs are always needed to help us debug the issue for you).
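Several reporters earlier in this thread (comments 39 and 40, and Lars's transcript) converged on the same workaround: stop the KCM units, wipe the secrets database, and let socket activation restart the responder. A sketch of that sequence follows; the SECRETS_DIR variable and its scratch-directory default are mine so the destructive part is safe to try here, and on a real host you would point it at /var/lib/sss/secrets and run the commented systemctl lines as root.

```shell
# Cache-reset workaround reported in this thread. SECRETS_DIR defaults to a
# scratch directory seeded with dummy files, so the rm below is harmless here.
SECRETS_DIR="${SECRETS_DIR:-$(mktemp -d)}"
touch "$SECRETS_DIR/secrets.ldb" "$SECRETS_DIR/.secrets.mkey"
# systemctl stop sssd-kcm.service sssd-kcm.socket
rm -f "$SECRETS_DIR/secrets.ldb" "$SECRETS_DIR/.secrets.mkey"
# systemctl start sssd-kcm.socket
# klist   # after a fresh kinit this should show a KCM: ticket cache again
echo "remaining entries: $(ls -A "$SECRETS_DIR" | wc -l)"
```

Note that per the thread this only cleared the problem temporarily for some reporters, so it is a workaround, not a fix.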
Thanks, Fabiano: I've been running with the test build since last night, and I haven't had any problems with Kerberos since then.

Fabiano: I am using sssd with AD authentication enabled. Unfortunately I'm still having issues with the sssd_kcm.log file filling up my root partition. There are thousands of lines repeating the same error as follows.

(Fri Mar 23 08:57:11 2018) [sssd[kcm]] [kcm_recv] (0x0010): Failed to parse data (74, Bad message), aborting client

I am running CentOS 7 so I will not be able to install the copr packages. Do you have a yum repo with packages for CentOS 7 available?

(In reply to Michael Watters from comment #47)
> I am using sssd with AD authentication enabled. Unfortunately I'm still
> having issues with the sssd_kcm.log file filling up my root partition. [...]
>
> I am running CentOS 7 so I will not be able to install the copr packages.
> Do you have a yum repo with packages for CentOS 7 available?

Gimme a few minutes and I'll provide you an EL7 based build.

(In reply to Fabiano Fidêncio from comment #48)
> Gimme a few minutes and I'll provide you an EL7 based build.

If you're patching the 7.4 package, please note that there were several patches for KCM that never made it to RHEL..
(In reply to Jakub Hrozek from comment #49)
> If you're patching the 7.4 package, please note that there were several
> patches for KCM that never made it to RHEL..

And it seems that the f27 src.rpm doesn't build properly for EL7, thus it'll take more time than a few minutes. Till the end of the day I'll provide the EL7 copr.

(In reply to Fabiano Fidêncio from comment #50)
> Till the end of the day I'll provide the EL7 copr.

Not possible till the end of the day. COPR is just kicking me quite hard, as the build is breaking due to the following error:

Traceback (most recent call last):
  File "/usr/libexec/mock/mock", line 481, in rebuild_generic
    ret = cmd(item)
  File "/usr/libexec/mock/mock", line 552, in cmd
    follow_links=options.symlink_dereference)
  File "/usr/lib/python3.6/site-packages/mockbuild/trace_decorator.py", line 96, in trace
    result = func(*args, **kw)
  File "/usr/lib/python3.6/site-packages/mockbuild/backend.py", line 391, in buildsrpm
    rebuilt_srpm = self.rebuild_installed_srpm(chrootspec, timeout)
  File "/usr/lib/python3.6/site-packages/mockbuild/trace_decorator.py", line 96, in trace
    result = func(*args, **kw)
  File "/usr/lib/python3.6/site-packages/mockbuild/backend.py", line 456, in rebuild_installed_srpm
    printOutput=self.config['print_main_output']
  File "/usr/lib/python3.6/site-packages/mockbuild/buildroot.py", line 204, in doChroot
    env=env, shell=shell, *args, **kargs)
  File "/usr/lib/python3.6/site-packages/mockbuild/trace_decorator.py", line 96, in trace
    result = func(*args, **kw)
  File "/usr/lib/python3.6/site-packages/mockbuild/util.py", line 626, in do
    raise exception.Error("Command failed: \n # %s\n%s" % (command, output), child.returncode)
mockbuild.exception.Error: Command failed:
 # /usr/bin/systemd-nspawn -q -M f90cb55e9fae49b0bc683c1c2281821b -D /var/lib/mock/731963-epel-7-x86_64-1521841038.897379/root -a --setenv=TERM=vt100 --setenv=SHELL=/bin/bash --setenv=HOME=/builddir --setenv=HOSTNAME=mock --setenv=PATH=/usr/bin:/bin:/usr/sbin:/sbin --setenv=PROMPT_COMMAND=printf "\033]0;<mock-chroot>\007" --setenv=PS1=<mock-chroot> \s-\v\$ --setenv=LANG=en_US.UTF-8 -u mockbuild bash --login -c /usr/bin/rpmbuild -bs --target x86_64 --nodeps /builddir/build/SPECS/sssd.spec

I will get back to it on Monday and then provide Michael's build.

I'm running sssd-1.16.1-1 on F27, with the KDC being AD (Windows). If I attempt kinit using certain principals (to my knowledge, these are not "Unix" IDs in AD), the sssd_kcm.log file grows by GBs in a very short time (in fact, it filled my /var). The messages are all along the lines of the "Failed to parse data" messages mentioned in comments to this bug. If nothing else, that is a DoS, so this excessive logging should be changed. I don't have any debugging enabled. Just FYI.

(In reply to Bojan Smojver from comment #52)
> I'm running sssd-1.16.1-1 on F27, with KDC being AD (Windows). If I attempt
> kinit using certain principals (to my knowledge, these are not "Unix" IDs in
> AD), the sssd_kcm.log file grows by GBs in a very short time (in fact, it
> filled my /var). The messages are all along the lines of "Failed to parse
> data" mentioned in comments to this bug.

Bojan, please, would you mind doing: `dnf copr enable fidencio/sssd-kcm-issues && dnf update sssd`.

By doing that you'll be testing a build that includes two fixes that I've provided for sssd KCM but that are still under review. It should solve your problem.

Thanks, I may try this when I log in next.

Re #41: I'm still facing the problems.
I tried with today's copr build, no improvement.

(In reply to Tomasz Torcz from comment #55)
> Re #41:
>
> I'm still facing the problems. I tried with today's copr build, no
> improvement.

I'm sorry you are still facing issues. I'll leave the final say to Fabiano because he spends the most hours maintaining the Fedora SSSD packages, but since we had several reporters confirm that the fixes under review did help them, I would suggest using this BZ to track that issue, and I would prefer that people who are still facing issues with KCM file a separate ticket. Otherwise I'm afraid we might lose track of several problems tracked in a single bugzilla.

(In reply to Fabiano Fidêncio from comment #53)
> Please, would you mind to: `dnf copr enable fidencio/sssd-kcm-issues && dnf
> update sssd`.
>
> By doing that you'll be testing a build that includes two fixes that I've
> provided for sssd KCM but that are still under review. It should solve your
> problem.

That worked for me. I can now get the TGT from AD for the principal that I could not before. The spamming of the log stopped, and I can now actually log in over ssh with GSSAPI to some machines that I could not log in to before. So, in my case, your packages work. Thanks.

Michael Watters,

Sorry for the long delay it took me to provide you a build, but here it is:
https://copr.fedorainfracloud.org/coprs/fidencio/sssd-kcm-issues/build/733586/

In order to easily install it, please ensure that you have `yum-plugin-copr` installed and then do:

yum copr enable fidencio/sssd-kcm-issue; yum update sssd;

Please mind that in case you downgrade from this version you'd have to remove the cache, as in 1.16.1 we bumped our cache version.

Let us know about the results.

Fabiano,

Thanks for building those packages. I'm getting a 404 error using the yum copr enable fidencio/sssd-kcm-issue command; however, I was able to install the packages from the repo using the following procedure.
cd /etc/yum.repos.d
curl -O https://copr.fedorainfracloud.org/coprs/fidencio/sssd-kcm-issues/repo/epel-7/fidencio-sssd-kcm-issues-epel-7.repo
yum update sssd

After updating the packages I restarted sssd-kcm.service and sssd-kcm.socket, which appears to have quieted the log files down. I'm still seeing errors, but not nearly as often. For example, these messages appear about every 8 minutes:

(Wed Mar 28 15:56:12 2018) [sssd[kcm]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Wed Mar 28 16:03:47 2018) [sssd[kcm]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Wed Mar 28 16:11:22 2018) [sssd[kcm]] [orderly_shutdown] (0x0010): SIGTERM: killing children

(In reply to Michael Watters from comment #59)
> (Wed Mar 28 15:56:12 2018) [sssd[kcm]] [orderly_shutdown] (0x0010): SIGTERM:
> killing children
> (Wed Mar 28 16:03:47 2018) [sssd[kcm]] [orderly_shutdown] (0x0010): SIGTERM:
> killing children
> (Wed Mar 28 16:11:22 2018) [sssd[kcm]] [orderly_shutdown] (0x0010): SIGTERM:
> killing children

Yep, I'm seeing these too with the copr packages. Just FYI.

Michael, Bojan,

Those messages are not an error. KCM shuts itself down after 300 seconds of inactivity (well, currently even when activity is happening ... I've already submitted a PR fixing this issue). So, IOW, those messages are "normal" and *will* be toned down soon (I'll open a PR for this soon).

The important question ... does KCM work for you, Michael? Did the build solve the issue you had before?

(In reply to Fabiano Fidêncio from comment #61)
> So, IOW, those messages are "normal" and *will* be toned down soon (I'll
> open a PR for this soon).

Cool, thanks.

(In reply to Fabiano Fidêncio from comment #61)
> The important question ... KCM works for you Michael? Did the build solve
> the issue you had before?

Yes, the new packages are working great. My hard drive hasn't filled up since I installed them.
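As an aside on the 300-second shutdown mentioned above: the interval before a socket-activated SSSD responder such as KCM stops itself is configurable. The following sssd.conf fragment is only a sketch; it assumes the `responder_idle_timeout` option documented in the sssd.conf man page for SSSD 1.16.1 and later, and the value of 600 is an arbitrary example, not a recommendation:

```
# /etc/sssd/sssd.conf fragment (sketch; check `man sssd.conf` to confirm
# responder_idle_timeout is supported by your sssd version)
[kcm]
# Seconds the socket-activated KCM responder stays up with no client
# activity before it shuts down again (default: 300).
responder_idle_timeout = 600
```

After editing, the change takes effect the next time the responder is (re)started, e.g. on the next socket activation.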
:D

Tomasz,

Please, I'd like to ask you to cc yourself to https://bugzilla.redhat.com/show_bug.cgi?id=1553056

That bug has logs very similar to yours, and I'd like to treat it as out of the scope of this one. I'm going to close this one soon, as most of the fixes were included in a Fedora build that I've done today. Deal? :-)

Sure, thanks for looking into the logs.

sssd-1.16.1-2.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-94e3753451

sssd-1.16.1-2.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2018-5de5bfcbe2

sssd-1.16.1-2.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2018-495d60f679

sssd-1.16.1-2.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-5de5bfcbe2

sssd-1.16.1-2.fc27 has been pushed to the Fedora 27 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-495d60f679

sssd-1.16.1-2.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-94e3753451

sssd-1.16.1-2.fc27 has been pushed to the Fedora 27 stable repository. If problems still persist, please make note of it in this bug report.
Since this issue also affects RHEL (CentOS) 7, will there be updates released for EL7 as well?

(In reply to Michael Watters from comment #73)
> Since this issue also affects RHEL (CentOS) 7 will there be updates released
> for EL7 as well?

At some point, yes. Unfortunately I can't tell you exactly when it'll officially land there. I'll add the patches to our sssd-1-16 copr repo (I still have to discuss this with our team) and propose backporting the issues fixed so far (and hopefully also the ones to come).

sssd-1.16.1-2.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.

sssd-1.16.1-2.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.

Just as an aside here, this version of sssd_kcm from the latest RPM often appears near the top of top output on an otherwise idle machine. Both my VM (AD back end) and my laptop (krb5/ldap back end) often list this process with just under 2% of CPU utilisation. Seems unusual...

I just discovered that these packages cause sssd to segfault on CentOS 7.5.

journalctl shows the following info:

mdct-dev27b sssd[7722]: Starting up
mdct-dev27b kernel: sssd[7722]: segfault at 7fbd4d3ead58 ip 00007fbd4d3ead58 sp 00007ffd3eb93a38 error 15 in memberof.so[7fbd4d3ea00
mdct-dev27b abrt-hook-ccpp[7723]: Process 7722 (sssd) of user 0 killed by SIGSEGV - dumping core
mdct-dev27b abrt-hook-ccpp[7723]: Failed to create core_backtrace: dwfl_getthread_frames failed: No DWARF information found
mdct-dev27b systemd[1]: sssd.service: main process exited, code=dumped, status=11/SEGV
mdct-dev27b systemd[1]: Failed to start System Security Services Daemon.

Would it be possible to update these packages to work on 7.5?

(In reply to Michael Watters from comment #78)
> I just discovered that these packages cause sssd to segfault on CentOS 7.5.
>
> journalctl shows the following info.
>
> mdct-dev27b sssd[7722]: Starting up
> mdct-dev27b kernel: sssd[7722]: segfault at 7fbd4d3ead58 ip 00007fbd4d3ead58
> sp 00007ffd3eb93a38 error 15 in memberof.so[7fbd4d3ea00
> mdct-dev27b abrt-hook-ccpp[7723]: Process 7722 (sssd) of user 0 killed by
> SIGSEGV - dumping core
> mdct-dev27b abrt-hook-ccpp[7723]: Failed to create core_backtrace:
> dwfl_getthread_frames failed: No DWARF information found
> mdct-dev27b systemd[1]: sssd.service: main process exited, code=dumped,
> status=11/SEGV
> mdct-dev27b systemd[1]: Failed to start System Security Services Daemon.
>
> Would it be possible to update these packages to work on 7.5?

Michael,

The patches are now part of https://copr.fedorainfracloud.org/coprs/g/sssd/sssd-1-16/ (which is already built against 7.5).

Please, just use that repo, and mine can be retired :-)

Also, let us know if that repo works for you.

Fabiano,

Thanks. The packages provided at https://copr.fedorainfracloud.org/coprs/g/sssd/sssd-1-16 do appear to work with CentOS 7.5. SSH logins using password authentication with Kerberos are now working again.

Still seeing sssd_kcm in an infinite loop, filling up /var/log with the following:

(2020-08-10 15:03:04:680265): [kcm] [kcm_input_parse] (0x1000): Received message with length 0
(2020-08-10 15:03:04:680284): [kcm] [kcm_input_parse] (0x0020): Illegal zero-length message
(2020-08-10 15:03:04:680302): [kcm] [kcm_recv] (0x0010): Failed to parse data (74, Bad message), aborting client
(2020-08-10 15:03:04:680319): [kcm] [kcm_reply_error] (0x0040): KCM operation returs failure [74]: Bad message
(2020-08-10 15:03:04:680353): [kcm] [kcm_failbuf_construct] (0x1000): Sent reply with error -1765328188

This is Fedora 32, sssd-kcm-2.3.1-2.fc32.x86_64.
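While the underlying parse-loop bug is being chased, one stopgap for the "log fills /var" symptom reported in this thread is to rotate the KCM log aggressively. The drop-in below is only a sketch, not part of sssd itself; the file name, size, and rotation counts are arbitrary examples, and it assumes the standard logrotate package and the /var/log/sssd/sssd_kcm.log path mentioned earlier in this bug:

```
# /etc/logrotate.d/sssd-kcm-stopgap  (hypothetical drop-in; values are examples)
/var/log/sssd/sssd_kcm.log {
    # Rotate as soon as the file exceeds 50 MB, keeping two old copies.
    size 50M
    rotate 2
    compress
    # Don't error if the log doesn't exist yet, and skip empty logs.
    missingok
    notifempty
    # Truncate in place so sssd_kcm keeps writing to its open descriptor.
    copytruncate
}
```

This caps disk usage but does not address the excessive logging itself; a separate bug for the 2.3.1 regression is the right place for that, as noted below in comment #82.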
(In reply to Chuck Lever from comment #81)
> Still seeing sssd_kcm in an infinite loop, filling up /var/log with the
> following:
>
> (2020-08-10 15:03:04:680265): [kcm] [kcm_input_parse] (0x1000): Received
> message with length 0
> (2020-08-10 15:03:04:680284): [kcm] [kcm_input_parse] (0x0020): Illegal
> zero-length message
> (2020-08-10 15:03:04:680302): [kcm] [kcm_recv] (0x0010): Failed to parse
> data (74, Bad message), aborting client
> (2020-08-10 15:03:04:680319): [kcm] [kcm_reply_error] (0x0040): KCM
> operation returs failure [74]: Bad message
> (2020-08-10 15:03:04:680353): [kcm] [kcm_failbuf_construct] (0x1000): Sent
> reply with error -1765328188
>
> This is Fedora 32, sssd-kcm-2.3.1-2.fc32.x86_64.

I haven't been working on SSSD in some time, so I can't comment on the bug itself. At any rate, I would recommend opening a new bug; a comment on a closed bug is just going to get lost in the bugzilla mail storm.

(In reply to Jakub Hrozek from comment #82)
> I haven't been working on SSSD in some time so I can't comment on the bug
> itself. At any rate, I would recommend to open a new bug, a comment on a
> closed bug is just going to get lost in the bugzilla mail storm.

Thank you for responding, Jakub. I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1868696