Description of problem:
After a period of successful screen lock/unlock cycles on my work PC joined to AD, screen unlocking stops working and reports an error. After every unlock/login attempt the following line appears in the sssd logs:

[sssd[krb5_child[6273]]][6273]: Disk quota exceeded

Version-Release number of selected component (if applicable):
sssd-1.12.4-6.fc22.x86_64

How reproducible:

Steps to Reproduce:
1. Install Fedora 22 Workstation
2. Join an AD domain with realm as a Kerberos member
3. Log in with an AD account; after some time and some successful screen unlocks, unlocking no longer works

Actual results:
Sometimes screen unlocking does not work; a machine reboot is the workaround.

Expected results:
Screen unlock works.

Additional info:
I am available for debugging/testing.
Is the filesystem with the directory /var/lib/sss full?
We might be depleting the kernel keyring quota for the user by creating new keys but not removing the old ones.
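A quick way to check that theory (a sketch; ADUSER is a placeholder for the affected account) is to watch the user's quota line in /proc/key-users while reproducing:

# qnkeys/maxkeys and qnbytes/maxbytes are the quota columns
grep "^ *$(id -u ADUSER):" /proc/key-users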
Could you also provide a sanitized sssd.conf and krb5.conf?
(In reply to Lukas Slebodnik from comment #1)
> Is the filesystem with the directory /var/lib/sss full?

The filesystems are OK:

/dev/mapper/fedora-root   51475068   7368136  41469108  16% /
/dev/sda1                   487652    163664    294292  36% /boot
/dev/mapper/fedora-home  184117324  16230452 158511128  10% /home
Created attachment 1037779 [details]
krb5.conf
Created attachment 1037780 [details]
sssd.conf

sssd.conf, created automatically by realm.
If you can reproduce the bug at will, can you get us an strace of the krb5_child process? See: https://fedorahosted.org/sssd/wiki/DevelTips#UsingstracetotracktheSSSDprocesses
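Since krb5_child is a short-lived helper, one common approach is to temporarily replace the binary with a wrapper that runs the real one under strace (a sketch; the path assumes the default Fedora layout, and SELinux may block the wrapper unless set permissive):

mv /usr/libexec/sssd/krb5_child /usr/libexec/sssd/krb5_child.orig
cat > /usr/libexec/sssd/krb5_child <<'EOF'
#!/bin/sh
# -ff writes one trace file per forked process, suffixed with the pid
exec strace -ff -o /tmp/krb5_child.trace /usr/libexec/sssd/krb5_child.orig "$@"
EOF
chmod 755 /usr/libexec/sssd/krb5_child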
And log files with a high debug level (9) might be useful as well.
I am not able to reproduce at will; I don't know what triggers the problem. Here are some lines extracted from the sssd logs that show when the problem occurred:

May 27 09:59:58 i2000005928 [sssd[krb5_child[32334]]][32334]: Disk quota exceeded
May 27 10:56:53 i2000005928 [sssd[krb5_child[6114]]][6114]: Disk quota exceeded
May 28 13:27:11 i2000005928 [sssd[krb5_child[29606]]][29606]: Disk quota exceeded
May 28 14:33:47 i2000005928 [sssd[krb5_child[1660]]][1660]: Disk quota exceeded
May 28 14:33:47 i2000005928 [sssd[krb5_child[1660]]][1660]: Disk quota exceeded
May 28 14:33:49 i2000005928 [sssd[krb5_child[1666]]][1666]: Disk quota exceeded
May 28 14:33:49 i2000005928 [sssd[krb5_child[1666]]][1666]: Disk quota exceeded
May 28 14:33:55 i2000005928 [sssd[krb5_child[1674]]][1674]: Disk quota exceeded
May 28 14:34:08 i2000005928 [sssd[krb5_child[1696]]][1696]: Disk quota exceeded
May 28 14:36:43 i2000005928 [sssd[krb5_child[3385]]][3385]: Disk quota exceeded
May 28 14:37:03 i2000005928 [sssd[krb5_child[3400]]][3400]: Disk quota exceeded
May 28 14:38:17 i2000005928 [sssd[krb5_child[5191]]][5191]: Disk quota exceeded
May 28 14:38:17 i2000005928 [sssd[krb5_child[5191]]][5191]: Disk quota exceeded
May 28 14:47:22 i2000005928 [sssd[krb5_child[6537]]][6537]: Disk quota exceeded
May 28 14:48:07 i2000005928 [sssd[krb5_child[7086]]][7086]: Disk quota exceeded
May 28 14:48:07 i2000005928 [sssd[krb5_child[7086]]][7086]: Disk quota exceeded
May 28 14:54:48 i2000005928 [sssd[krb5_child[9080]]][9080]: Disk quota exceeded
Jun 04 16:29:43 i2000005928 [sssd[krb5_child[23969]]][23969]: Disk quota exceeded
Jun 05 08:56:57 i2000005928 [sssd[krb5_child[22320]]][22320]: Disk quota exceeded
Jun 05 08:57:00 i2000005928 [sssd[krb5_child[22325]]][22325]: Disk quota exceeded
Jun 05 09:06:48 i2000005928 [sssd[krb5_child[23753]]][23753]: Disk quota exceeded
Jun 05 09:06:50 i2000005928 [sssd[krb5_child[23758]]][23758]: Disk quota exceeded
Jun 05 09:06:50 i2000005928 [sssd[krb5_child[23758]]][23758]: Disk quota exceeded
Jun 05 09:21:07 i2000005928 [sssd[krb5_child[25756]]][25756]: Disk quota exceeded
Jun 05 09:21:07 i2000005928 [sssd[krb5_child[25756]]][25756]: Disk quota exceeded
Jun 09 16:10:44 i2000005928 [sssd[krb5_child[5814]]][5814]: Disk quota exceeded
Jun 09 16:31:41 i2000005928 [sssd[krb5_child[7015]]][7015]: Disk quota exceeded
Jun 09 16:31:43 i2000005928 [sssd[krb5_child[7034]]][7034]: Disk quota exceeded
Jun 09 16:31:48 i2000005928 [sssd[krb5_child[7041]]][7041]: Disk quota exceeded
Jun 09 16:32:10 i2000005928 [sssd[krb5_child[7364]]][7364]: Disk quota exceeded
Jun 11 11:18:05 i2000005928 [sssd[krb5_child[10292]]][10292]: Disk quota exceeded
Jun 11 14:23:23 i2000005928 [sssd[krb5_child[6273]]][6273]: Disk quota exceeded
Jun 11 14:23:25 i2000005928 [sssd[krb5_child[6278]]][6278]: Disk quota exceeded
Jun 11 14:23:28 i2000005928 [sssd[krb5_child[6285]]][6285]: Disk quota exceeded
Jun 11 14:23:28 i2000005928 [sssd[krb5_child[6285]]][6285]: Disk quota exceeded
Jun 11 14:23:32 i2000005928 [sssd[krb5_child[6322]]][6322]: Disk quota exceeded
Jun 11 14:23:32 i2000005928 [sssd[krb5_child[6322]]][6322]: Disk quota exceeded

I am setting the debug level to 9, and next time I will try to strace sssd when the issue arises.
The problem just happened again; it also happened yesterday. Here is the link where you can download the archive with the sssd and strace logs. [1] Thank you.

[1] https://thor.labbase.it/owncloud/index.php/s/OTHunepRRM4HxqH
I'm sorry for the late response.

I checked the log files (especially krb5_child.log-20150615) and it looks like the known upstream bug https://fedorahosted.org/sssd/ticket/2592. The log file starts on "Fri Jun 12 14:45:25 2015" and the disk was full 4 days later (on "Tue Jun 16 13:25:07 2015"). The old ccache files were not removed from the directory /var/lib/sss/db/.

It is already fixed in sssd-1.12.5, so I'm going to close this bug. Please re-open it if you are able to reproduce with sssd >= 1.12.5, which is already in the Fedora 22 stable repository.
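In the meantime, if stale ccache_* files do pile up under /var/lib/sss/db/, a manual cleanup is possible (a sketch; do this only while nobody is authenticating):

systemctl stop sssd
rm -f /var/lib/sss/db/ccache_*
systemctl start sssd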
Thank you Lukas; unfortunately I have to re-open this: yesterday it happened again. I am attaching the relevant log files; let me know if you need more info/debugging.
Could you also provide the output of the command ls -l /var/lib/sss/db/?
Here it is:

[root@i2000005928 sss]# ls -l /var/lib/sss/db/
total 9932
-rw-------. 1 root root 7593984 Jul 16 11:20 cache_ente.regione.emr.it.ldb
-rw-------. 1 root root    3926 Jul 16 11:13 ccache_ENTE.REGIONE.EMR.IT
-rw-------. 1 root root       0 Jun 15 17:24 ccache_ENTE.REGIONE.EMR.IT_JfCnxH
-rw-------. 1 root root 1286144 Jul 16 02:20 config.ldb
-rw-------. 1 root root 1286144 May 13 10:00 sssd.ldb

You can find a new archive of relevant log files in the previously used location. [1]

[1] https://thor.labbase.it/owncloud/index.php/s/OTHunepRRM4HxqH
Created attachment 1052662 [details]
krb5_child.log
(In reply to Lukas Slebodnik from comment #15)
> Created attachment 1052662 [details]
> krb5_child.log

The file contains:

(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [create_ccache] (0x4000): Initializing ccache of type [KEYRING]
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [get_and_save_tgt] (0x0020): 1233: [122][Disk quota exceeded]
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [map_krb5_error] (0x0020): 1273: [122][Disk quota exceeded]

and the old ccache was KEYRING:persistent:112866179. So it looks like there is not enough space in the kernel keyring. I'm not very familiar with the kernel keyring, so Sumit or Simo might help us.
I should have pasted wider context. There is something suspicious about the missing client principal:

(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [sss_get_ccache_name_for_principal] (0x4000): Location: [KEYRING:persistent:112866179]
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [sss_get_ccache_name_for_principal] (0x2000): krb5_cc_cache_match failed: [-1765328243][Can't find client principal Dalrio_L.EMR.IT in cache collection]
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [create_ccache] (0x4000): Initializing ccache of type [KEYRING]
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [get_and_save_tgt] (0x0020): 1233: [122][Disk quota exceeded]
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [map_krb5_error] (0x0020): 1273: [122][Disk quota exceeded]
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [k5c_send_data] (0x0200): Received error code 1432158209
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [pack_response_packet] (0x2000): response packet size: [20]
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [k5c_send_data] (0x4000): Response sent.
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [main] (0x0400): krb5_child completed successfully
Can you add the output of 'cat /proc/keys' and 'cat /proc/key-users'?
Created attachment 1052675 [details]
content of /proc/keys

This is the content of /proc/keys.
Created attachment 1052678 [details]
content of /proc/key-users

This is the content of /proc/key-users.
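For reference when reading that attachment: each line of /proc/key-users has the form

<uid>: <usage> <nkeys>/<nikeys> <qnkeys>/<maxkeys> <qnbytes>/<maxbytes>

so the columns to watch are qnkeys/maxkeys (keys held against the user's quota) and qnbytes/maxbytes (payload bytes); "Disk quota exceeded" (EDQUOT) is returned once one of those ratios reaches its ceiling.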
Lorenzo, I would like to apologize for the stalled ticket. I thought Sumit would continue with the analysis. Are you still able to reproduce this bug?
Hi Lukas, don't worry. I disabled the keyring in favour of the file cache some time ago; this way I worked around the problem. Now I have switched back to the keyring cache and the issue is back. Let me know if you need any action from me. Thank you, Lorenzo
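For anyone else hitting this, the file-cache workaround Lorenzo mentions looks roughly like this in sssd.conf (a sketch; the template below is illustrative and the section name must match your domain):

[domain/ente.regione.emr.it]
# store ccaches as files instead of in the kernel keyring
krb5_ccname_template = FILE:%d/krb5cc_%U_XXXXXX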
Is it a desktop machine which can be restarted, or is it a server machine which has not been restarted for a long time?
It's a desktop machine; it is a workstation at my workplace. I log in in the morning and log out from the desktop session when I leave in the evening. I shut down the PC on Friday and power it back on on Monday.
Stephen, could you help us with this bug? Do you have an idea why the ccache was not removed from the kernel keyring? See comment 19 and comment 20.
Looking at the number of empty ccaches for the user in that keyring, it seems like SSSD is using libkrb5 in a way that does not reuse the same ccache for a user but generates a new one at each authentication. That is clearly suboptimal. In the AD case, where tickets are valid for a longer time (even a week in some setups), it will make keys pile up until the user eventually hits their quota. We need to look at why new ccaches are being created; that is not expected.
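If key pile-up is indeed the cause, a stopgap (not a fix) would be raising the per-user keyring quotas via sysctl (a sketch; the values are illustrative, the kernel defaults being roughly maxkeys=200 and maxbytes=20000):

sysctl -w kernel.keys.maxkeys=500
sysctl -w kernel.keys.maxbytes=50000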
Simo, which data/log files do we need from the reporter?
The krb5_child log with all the debugging should be sufficient to at least figure something out.
Just a note that I've just seen this as well, and it appears that I can reproduce it. I'm running a script that fires off a load of mock builds on 20 large hosts. It starts 12 builds per host, and so runs a very large number of ssh processes at pretty much exactly the same time. Something goes awry when that happens; I get

[sssd[krb5_child[6033]]][6033]: Disk quota exceeded

in the logs. klist says:

klist: Credentials cache keyring 'persistent:7225:krb_ccache_t53ARZW' not found

kinit doesn't work:

kinit: Failed to store credentials: Disk quota exceeded while getting initial credentials

and other things like sudo which want to do auth simply fail:

sudo[8958]: pam_sss(sudo:auth): authentication failure; logname=tibbs uid=7225 euid=0 tty=/dev/pts/10 ruser=tibbs
[sssd[krb5_child[8970]]][8970]: Disk quota exceeded
sudo[8958]: pam_sss(sudo:auth): received for user tibbs: 4 (System error)
sudo[8958]: pam_unix(sudo:auth): auth could not identify password for [tibbs]
Jason, if you can reproduce, can you attach the logs? As simo said in comment #28, the krb5_child.log would be the most important.
It would be good to also attach the domain log file, not only krb5_child.log.
Yes, domain logs would be good here as well. It sounds a bit like a race condition or another issue in the queuing of the requests.
Since this kind of screws things up to the point that I have to reboot if I don't want to wait until the kernel expires the keyring, can you let me know exactly what I should add to which file to get the necessary debug info? Unfortunately sssd's policy of not logging anything at all by default is really annoying.
Adding debug_level=10 to the [pam] and [domain] sections of sssd.conf should do the trick (if you're not limited by disk space, you can just add debug_level=10 to all sections).
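That is, something like this (a minimal sketch; the domain section name must match your configured domain):

[pam]
debug_level = 10

# section name must match your domain, e.g. [domain/default]
[domain/default]
debug_level = 10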
I'll try, but now I'm having trouble making it happen again. It happened to me twice originally, but now I can spam ssh calls across a huge number of hosts without problems. What is the actual limit on the kernel keyring size? Is there any trick I might use to try and fill it? What mechanism actually keeps it from filling in general use?
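One way to fill it deliberately for testing (a sketch using keyctl; the key names and count are arbitrary) is to add dummy keys to the user keyring until the quota is hit:

# with the default maxkeys of 200, this should fail partway through
for i in $(seq 1 300); do
    keyctl add user dummy$i payload @u || break
done
# caution: this clears the whole user keyring, removing the test keys
keyctl clear @u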
OK, I have managed to reproduce something like this, but it doesn't have exactly the same symptoms. Basically, I'm running 200+ ssh processes in parallel, some of them to the same machine I'm currently using. Eventually I get this:

Jan 14 01:27:55 epithumia.math.uh.edu sshd[27164]: krb5_cc_initialize(): Disk quota exceeded

klist says I no longer have a keyring, and I can't acquire a new key:

epithumia:~/work/mass-build/> klist
klist: Credentials cache keyring 'persistent:7225:krb_ccache_gDUNson' not found
epithumia:~/work/mass-build/> kdestroy
epithumia:~/work/mass-build/> kinit
Password for tibbs.EDU:
kinit: Failed to store credentials: Disk quota exceeded while getting initial credentials

I have sssd debugging turned up as requested, but krb5_child.log is empty for the time around the issue. I made it happen again, and there was still no output for the time it happened. I have sssd_pam.log and sssd_default.log as well as krb5_child.log preserved, though I'm not sure any of this is relevant, as it's sshd which is actually complaining and none of the sssd logs seem to contain any complaints about this at all.

If you let me know how much of the logs you would like (i.e. how many minutes before and after the issue), I'll be happy to send them along or, if they don't actually have any private data, just attach them to this ticket.
OK, some more information. Simo and I worked on this a bit on IRC and figured out that, in one case, doing ssh into localhost with GSSAPI credential delegation turned on could result in the duplication of a large keyring and thus quota exhaustion. I can reproduce that one pretty easily, but I guess it's more of an ssh issue and not the same thing as the one in this ticket.

Luckily I managed to duplicate the issue with logging enabled and have something more useful to add. I have sssd set to renew credentials every three hours. If I check my logs now, I see it ran at 19:05:42 and then again at 22:05:42, and the latter happened to be right in the middle of a mass rebuild I was doing, where I had ssh sessions running repeatedly to 23 different hosts. The ssh calls started failing, and in the journal I see:

Jan 16 22:05:42 epithumia.math.uh.edu [sssd[krb5_child[4248]]][4248]: Disk quota exceeded
Jan 16 22:05:42 epithumia.math.uh.edu [sssd[krb5_child[4248]]][4248]: Disk quota exceeded

At this point klist just tells me I no longer have a keyring, and I can't acquire a key. Unfortunately I was not saving keyctl output during this run, but I still have sssd logging turned up. sssd_default.log shows the renewal timer expiring on a couple of principals but no errors or complaints that I can see. krb5_child.log does show the "Disk quota exceeded" error but not much else explaining it. Then it exits successfully. (?)

I will attach snippets of each available log over that period.
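For the next run it may be worth capturing keyring state around the renewal window, e.g. with something like this running in the background (a sketch; the interval is arbitrary and get_persistent needs a reasonably recent keyutils):

while sleep 10; do
    date
    # quota line for the current user
    grep "^ *$(id -u):" /proc/key-users
    # dump the persistent keyring where the ccache lives
    keyctl show $(keyctl get_persistent @u)
done >> keyring-usage.log 2>&1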
Created attachment 1115639 [details]
Snippets of various sssd log files
One more point: on a hunch I asked David Howells, and he confirms that deleted keys still count against the quota until the kernel gets around to garbage collecting them. So some operations that replace key contents (but do not add to them), while they technically should not cause quota issues, may end up throwing this error. I do not know if this is the only issue, but it can certainly exacerbate the problem.
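If delayed garbage collection is a factor, one experiment worth trying (a sketch; the key ID is a placeholder) is to invalidate a stale ccache key explicitly, since keyctl invalidate schedules immediate garbage collection rather than waiting for the next pass:

# list the keys held in the user's persistent keyring
keyctl rlist $(keyctl get_persistent @u)
# invalidate one of the stale ones; its quota should be released right away
keyctl invalidate <key-id>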
Fedora 22 changed to end-of-life (EOL) status on 2016-07-19. Fedora 22 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release.

If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.