Bug 1230750 - SSSD krb5 disk quota exceeded
Summary: SSSD krb5 disk quota exceeded
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: sssd
Version: 22
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Jakub Hrozek
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-06-11 13:23 UTC by Lorenzo Dalrio
Modified: 2016-07-19 19:50 UTC
11 users

Fixed In Version: sssd-1.12.5-2.fc22
Clone Of:
Environment:
Last Closed: 2016-07-19 19:50:15 UTC
Type: Bug
Embargoed:


Attachments
krb5.conf (495 bytes, text/plain)
2015-06-11 16:51 UTC, Lorenzo Dalrio
sssd.conf (444 bytes, text/plain)
2015-06-11 16:52 UTC, Lorenzo Dalrio
krb5_child.log (16.08 KB, text/plain)
2015-07-16 12:00 UTC, Lukas Slebodnik
content of /proc/keys (6.90 KB, text/plain)
2015-07-16 12:26 UTC, Lorenzo Dalrio
content of /proc/key-users (112 bytes, text/plain)
2015-07-16 12:27 UTC, Lorenzo Dalrio
Snippets of various sssd log files (30.79 KB, text/plain)
2016-01-17 17:46 UTC, Jason Tibbitts

Description Lorenzo Dalrio 2015-06-11 13:23:26 UTC
Description of problem: After successful screen lock/unlock cycles on my work PC, which is joined to AD, screen unlocking stops working and reports an error.
In sssd logs after every unlock/login attempt the following line appears:

[sssd[krb5_child[6273]]][6273]: Disk quota exceeded


Version-Release number of selected component (if applicable):
sssd-1.12.4-6.fc22.x86_64


How reproducible:


Steps to Reproduce:
1. Install Fedora 22 Workstation
2. Join AD Domain with realm as kerberos member
3. Log in with an AD account; after some time and some successful screen unlocks, unlocking no longer works.

Actual results: Sometimes screen unlocking does not work; rebooting the machine is the workaround.


Expected results: Screen unlock works


Additional info: I am available for debug/testing.

Comment 1 Lukas Slebodnik 2015-06-11 14:00:10 UTC
Is the filesystem with the directory /var/lib/sss full?

Comment 2 Simo Sorce 2015-06-11 14:09:19 UTC
We might be depleting the keyring quota for the user by creating new keys but not removing old keys?
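If that hypothesis holds, it would show up as the per-UID key counts in /proc/key-users approaching the quota columns. A minimal sketch of reading those fields, using a hypothetical sample line in the keyrings(7) format (the uid and numbers are made up, not taken from this report):

```shell
# /proc/key-users format (see keyrings(7)):
#   <uid>: <usage> <nkeys>/<nikeys> <qnkeys>/<maxkeys> <qnbytes>/<maxbytes>
# On a live system, replace the sample with:  cat /proc/key-users
line="7225:    47 46/46 46/200 4850/20000"   # hypothetical sample for uid 7225
summary=$(echo "$line" | awk '{
    sub(":", "", $1)            # strip the colon from the uid field
    split($4, k, "/")           # keys counted against quota / max keys
    split($5, b, "/")           # bytes counted against quota / max bytes
    printf "uid %s: %s/%s keys, %s/%s bytes", $1, k[1], k[2], b[1], b[2]
}')
echo "$summary"
```

If the keys or bytes column is at its maximum, new key creation fails with EDQUOT, which surfaces as "Disk quota exceeded".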

Comment 3 Lukas Slebodnik 2015-06-11 16:30:21 UTC
Could you also provide sanitized sssd.conf and krb5.conf?

Comment 4 Lorenzo Dalrio 2015-06-11 16:49:06 UTC
(In reply to Lukas Slebodnik from comment #1)
> Is the filesystem with the directory /var/lib/sss full?

Filesystems are OK:

/dev/mapper/fedora-root  51475068  7368136  41469108  16% /
/dev/sda1                  487652   163664    294292  36% /boot
/dev/mapper/fedora-home 184117324 16230452 158511128  10% /home

Comment 5 Lorenzo Dalrio 2015-06-11 16:51:19 UTC
Created attachment 1037779 [details]
krb5.conf

krb5.conf

Comment 6 Lorenzo Dalrio 2015-06-11 16:52:26 UTC
Created attachment 1037780 [details]
sssd.conf

sssd.conf created automatically by realm.

Comment 7 Jakub Hrozek 2015-06-11 16:58:42 UTC
If you can reproduce the bug at will, can you get us an strace of the krb5_child process?

See:
https://fedorahosted.org/sssd/wiki/DevelTips#UsingstracetotracktheSSSDprocesses

Comment 8 Lukas Slebodnik 2015-06-11 17:06:34 UTC
And log files with a high debug level (9) might be useful as well.

Comment 9 Lorenzo Dalrio 2015-06-12 11:39:58 UTC
I am not able to reproduce at will; I don't know what triggers the problem. Here are some lines extracted from the sssd logs that show when the problem occurred:

May 27 09:59:58 i2000005928 [sssd[krb5_child[32334]]][32334]: Disk quota exceeded
May 27 10:56:53 i2000005928 [sssd[krb5_child[6114]]][6114]: Disk quota exceeded
May 28 13:27:11 i2000005928 [sssd[krb5_child[29606]]][29606]: Disk quota exceeded
May 28 14:33:47 i2000005928 [sssd[krb5_child[1660]]][1660]: Disk quota exceeded
May 28 14:33:47 i2000005928 [sssd[krb5_child[1660]]][1660]: Disk quota exceeded
May 28 14:33:49 i2000005928 [sssd[krb5_child[1666]]][1666]: Disk quota exceeded
May 28 14:33:49 i2000005928 [sssd[krb5_child[1666]]][1666]: Disk quota exceeded
May 28 14:33:55 i2000005928 [sssd[krb5_child[1674]]][1674]: Disk quota exceeded
May 28 14:34:08 i2000005928 [sssd[krb5_child[1696]]][1696]: Disk quota exceeded
May 28 14:36:43 i2000005928 [sssd[krb5_child[3385]]][3385]: Disk quota exceeded
May 28 14:37:03 i2000005928 [sssd[krb5_child[3400]]][3400]: Disk quota exceeded
May 28 14:38:17 i2000005928 [sssd[krb5_child[5191]]][5191]: Disk quota exceeded
May 28 14:38:17 i2000005928 [sssd[krb5_child[5191]]][5191]: Disk quota exceeded
May 28 14:47:22 i2000005928 [sssd[krb5_child[6537]]][6537]: Disk quota exceeded
May 28 14:48:07 i2000005928 [sssd[krb5_child[7086]]][7086]: Disk quota exceeded
May 28 14:48:07 i2000005928 [sssd[krb5_child[7086]]][7086]: Disk quota exceeded
May 28 14:54:48 i2000005928 [sssd[krb5_child[9080]]][9080]: Disk quota exceeded
Jun 04 16:29:43 i2000005928 [sssd[krb5_child[23969]]][23969]: Disk quota exceeded
Jun 05 08:56:57 i2000005928 [sssd[krb5_child[22320]]][22320]: Disk quota exceeded
Jun 05 08:57:00 i2000005928 [sssd[krb5_child[22325]]][22325]: Disk quota exceeded
Jun 05 09:06:48 i2000005928 [sssd[krb5_child[23753]]][23753]: Disk quota exceeded
Jun 05 09:06:50 i2000005928 [sssd[krb5_child[23758]]][23758]: Disk quota exceeded
Jun 05 09:06:50 i2000005928 [sssd[krb5_child[23758]]][23758]: Disk quota exceeded
Jun 05 09:21:07 i2000005928 [sssd[krb5_child[25756]]][25756]: Disk quota exceeded
Jun 05 09:21:07 i2000005928 [sssd[krb5_child[25756]]][25756]: Disk quota exceeded
Jun 09 16:10:44 i2000005928 [sssd[krb5_child[5814]]][5814]: Disk quota exceeded
Jun 09 16:31:41 i2000005928 [sssd[krb5_child[7015]]][7015]: Disk quota exceeded
Jun 09 16:31:43 i2000005928 [sssd[krb5_child[7034]]][7034]: Disk quota exceeded
Jun 09 16:31:48 i2000005928 [sssd[krb5_child[7041]]][7041]: Disk quota exceeded
Jun 09 16:32:10 i2000005928 [sssd[krb5_child[7364]]][7364]: Disk quota exceeded
Jun 11 11:18:05 i2000005928 [sssd[krb5_child[10292]]][10292]: Disk quota exceeded
Jun 11 14:23:23 i2000005928 [sssd[krb5_child[6273]]][6273]: Disk quota exceeded
Jun 11 14:23:25 i2000005928 [sssd[krb5_child[6278]]][6278]: Disk quota exceeded
Jun 11 14:23:28 i2000005928 [sssd[krb5_child[6285]]][6285]: Disk quota exceeded
Jun 11 14:23:28 i2000005928 [sssd[krb5_child[6285]]][6285]: Disk quota exceeded
Jun 11 14:23:32 i2000005928 [sssd[krb5_child[6322]]][6322]: Disk quota exceeded
Jun 11 14:23:32 i2000005928 [sssd[krb5_child[6322]]][6322]: Disk quota exceeded

I am setting the debug level to 9, and next time I will try to strace sssd when the issue arises.

Comment 10 Lorenzo Dalrio 2015-06-18 08:47:42 UTC
The problem just happened again; it also happened yesterday.
Here is the link where you can download the archive with the sssd and strace logs. [1]
Thank you.

[1] https://thor.labbase.it/owncloud/index.php/s/OTHunepRRM4HxqH

Comment 11 Lukas Slebodnik 2015-07-16 08:36:31 UTC
I'm sorry for the late response.

I checked the log files (especially krb5_child.log-20150615) and it looks like a known upstream bug: https://fedorahosted.org/sssd/ticket/2592.
The log file starts on "Fri Jun 12 14:45:25 2015" and the disk was full four days later (on "Tue Jun 16 13:25:07 2015").

The old ccache files were not removed from the directory /var/lib/sss/db/.
This is already fixed in sssd-1.12.5.

I'm going to close this bug. Please re-open it if you are able to reproduce with
sssd >= 1.12.5, which is already in the Fedora 22 stable repository.

Comment 12 Lorenzo Dalrio 2015-07-16 09:25:43 UTC
Thank you Lukas; unfortunately I have to re-open this: yesterday it happened again. I am attaching the relevant log files; let me know if you need more info/debugging.

Comment 13 Lukas Slebodnik 2015-07-16 09:32:14 UTC
Could you also provide the output of ls -l /var/lib/sss/db/?

Comment 14 Lorenzo Dalrio 2015-07-16 09:36:53 UTC
Here it is:

[root@i2000005928 sss]# ls -l /var/lib/sss/db/
total 9932
-rw-------. 1 root root 7593984 Jul 16 11:20 cache_ente.regione.emr.it.ldb
-rw-------. 1 root root    3926 Jul 16 11:13 ccache_ENTE.REGIONE.EMR.IT
-rw-------. 1 root root       0 Jun 15 17:24 ccache_ENTE.REGIONE.EMR.IT_JfCnxH
-rw-------. 1 root root 1286144 Jul 16 02:20 config.ldb
-rw-------. 1 root root 1286144 May 13 10:00 sssd.ldb

You can find a new archive of relevant log files in the previously used location [1]

[1] https://thor.labbase.it/owncloud/index.php/s/OTHunepRRM4HxqH

Comment 15 Lukas Slebodnik 2015-07-16 12:00:23 UTC
Created attachment 1052662 [details]
krb5_child.log

Comment 16 Lukas Slebodnik 2015-07-16 12:04:24 UTC
(In reply to Lukas Slebodnik from comment #15)
> Created attachment 1052662 [details]
> krb5_child.log
The file contains
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [create_ccache] (0x4000): Initializing ccache of type [KEYRING]
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [get_and_save_tgt] (0x0020): 1233: [122][Disk quota exceeded]
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [map_krb5_error] (0x0020): 1273: [122][Disk quota exceeded]

and the old ccache was KEYRING:persistent:112866179, so it looks like there is not enough space in the kernel keyring. I'm not very familiar with the kernel keyring,
so Sumit or Simo might help us.

Comment 17 Lukas Slebodnik 2015-07-16 12:06:48 UTC
I should have pasted wider context. There is something suspicious about the missing client principal.

(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [sss_get_ccache_name_for_principal] (0x4000): Location: [KEYRING:persistent:112866179]
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [sss_get_ccache_name_for_principal] (0x2000): krb5_cc_cache_match failed: [-1765328243][Can't find client principal Dalrio_L.EMR.IT in cache collection]
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [create_ccache] (0x4000): Initializing ccache of type [KEYRING]
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [get_and_save_tgt] (0x0020): 1233: [122][Disk quota exceeded]
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [map_krb5_error] (0x0020): 1273: [122][Disk quota exceeded]
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [k5c_send_data] (0x0200): Received error code 1432158209
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [pack_response_packet] (0x2000): response packet size: [20]
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [k5c_send_data] (0x4000): Response sent.
(Tue Jul 14 15:21:29 2015) [[sssd[krb5_child[19827]]]] [main] (0x0400): krb5_child completed successfully

Comment 18 Sumit Bose 2015-07-16 12:12:16 UTC
Can you add the output of 'cat /proc/keys' and 'cat /proc/key-users'?

Comment 19 Lorenzo Dalrio 2015-07-16 12:26:19 UTC
Created attachment 1052675 [details]
content of /proc/keys

This is the content of /proc/keys

Comment 20 Lorenzo Dalrio 2015-07-16 12:27:04 UTC
Created attachment 1052678 [details]
content of /proc/key-users

This is the content of /proc/key-users

Comment 21 Lukas Slebodnik 2015-10-07 12:20:09 UTC
Lorenzo,
I would like to apologize for the stalled ticket. I thought Sumit would continue with the analysis.

Are you still able to reproduce this bug?

Comment 22 Lorenzo Dalrio 2015-10-23 09:38:12 UTC
Hi Lukas,
don't worry, I disabled the keyring in favour of the file cache some time ago; that is how I worked around the problem. Now I have switched back to the keyring cache and the issue is back. Let me know if you need any action from me.

Thank you,
Lorenzo
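For reference, the file-cache workaround Lorenzo mentions can be expressed as an sssd.conf fragment. This is a sketch only: the domain section name is inferred from the cache file names in comment 14, and krb5_ccachedir / krb5_ccname_template are the standard sssd-krb5(5) options for relocating the credential cache:

```
# /etc/sssd/sssd.conf -- hypothetical fragment switching from KEYRING
# to FILE-based credential caches (restart sssd after editing)
[domain/ente.regione.emr.it]
krb5_ccachedir = /tmp
krb5_ccname_template = FILE:%d/krb5cc_%U_XXXXXX
```

Here %d expands to krb5_ccachedir, %U to the user's UID, and the trailing XXXXXX is randomized per cache file.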

Comment 23 Lukas Slebodnik 2015-10-23 13:15:11 UTC
Is it a desktop machine which can be restarted, or is it a server machine which has not been restarted for a long time?

Comment 24 Lorenzo Dalrio 2015-10-23 13:50:37 UTC
It's a desktop machine; it is a workstation at my workplace. I log in in the morning and log out of the desktop session when I leave in the evening. I shut down the PC on Friday and power it back on on Monday.

Comment 25 Lukas Slebodnik 2015-10-23 14:14:40 UTC
Stephen,
Could you help us with this bug?
Do you have an idea why ccache was not removed from kernel keyring?
@see comment 19 and comment 20

Comment 26 Simo Sorce 2015-10-23 19:56:32 UTC
Looking at the number of empty ccaches for the user in that keyring, it seems like SSSD is using libkrb5 in a way that does not reuse the same ccache for a user but generates a new one at each authentication. That is clearly suboptimal.
In the AD case, where tickets are valid for a longer time (even a week in some setups), this will make keys pile up until eventually the user hits their quota.

We need to look at why new ccaches are being created; that is not expected.

Comment 27 Lukas Slebodnik 2015-11-05 12:08:46 UTC
Simo,
Which data/log files do we need from reporter?

Comment 28 Simo Sorce 2015-11-05 13:56:27 UTC
The krb5 child log with all the debug should be sufficient to at least figure something out.

Comment 29 Jason Tibbitts 2016-01-06 01:54:14 UTC
Just a note that I've just seen this as well, and it appears that I can reproduce it.

I'm running a script that fires off a load of mock builds on 20 large hosts.  It starts 12 builds per host, and so runs a very large number of ssh processes at pretty much exactly the same time.  Something goes awry when that happens; I get [sssd[krb5_child[6033]]][6033]: Disk quota exceeded in the logs, and klist says:

klist: Credentials cache keyring 'persistent:7225:krb_ccache_t53ARZW' not found

kinit doesn't work:

kinit: Failed to store credentials: Disk quota exceeded while getting initial credentials

and other things like sudo which want to do auth simply fail:

sudo[8958]: pam_sss(sudo:auth): authentication failure; logname=tibbs uid=7225 euid=0 tty=/dev/pts/10 ruser=tibbs
[sssd[krb5_child[8970]]][8970]: Disk quota exceeded
sudo[8958]: pam_sss(sudo:auth): received for user tibbs: 4 (System error)
sudo[8958]: pam_unix(sudo:auth): auth could not identify password for [tibbs]

Comment 30 Jakub Hrozek 2016-01-06 06:22:49 UTC
Jason, if you can reproduce, can you attach the logs? As Simo said in comment #28, the krb5_child.log would be the most important.

Comment 31 Lukas Slebodnik 2016-01-06 08:52:53 UTC
It would be good to also attach the domain log file, and not only krb5_child.log.

Comment 32 Sumit Bose 2016-01-06 09:28:18 UTC
Yes, domain logs would be good here as well. It sounds a bit like a race-condition or other issues in the queuing of the requests.

Comment 33 Jason Tibbitts 2016-01-06 20:26:11 UTC
Since this kind of screws things up to the point that I have to reboot if I don't want to wait until the kernel expires the keyring, can you let me know exactly what I should add to which file to get the necessary debug info?  Unfortunately sssd's policy of not logging anything at all by default is really annoying.

Comment 34 Jakub Hrozek 2016-01-07 07:12:01 UTC
Adding debug_level=10 to the [pam] and [domain] sections of sssd.conf should do the trick (if you're not limited by disk space, you can just add debug_level=10 to all sections).
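As a concrete sketch of that suggestion (the domain section name is an example; use the one from your own sssd.conf):

```
[pam]
debug_level = 10

[domain/example.com]
debug_level = 10
```

sssd writes its logs under /var/log/sssd/; the setting takes effect after a restart.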

Comment 35 Jason Tibbitts 2016-01-07 07:22:31 UTC
I'll try, but now I'm having trouble making it happen again.  It happened to me twice originally, but now I can spam ssh calls across a huge number of hosts without problems.

What is the actual limit on the kernel keyring size?  Is there any trick I might use to try and fill it?  What mechanism actually keeps it from filling in general use?
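The limits asked about here are per-user kernel tunables. The defaults below are the documented keyrings(7) values for non-root users, not figures taken from this report, and the per-ccache cost is a rough assumption; the arithmetic only illustrates how quickly stale ccaches could exhaust the quota:

```shell
# Kernel defaults for non-root users (see keyrings(7); tunable via
# sysctl kernel.keys.maxkeys / kernel.keys.maxbytes)
maxkeys=200
maxbytes=20000
# Assume each leftover KEYRING ccache pins roughly 3 keys totalling
# ~400 bytes (illustrative figures only)
keys_limit=$(( maxkeys / 3 ))
bytes_limit=$(( maxbytes / 400 ))
echo "quota trips after roughly $keys_limit (key count) or $bytes_limit (byte) stale ccaches"
```

Raising the sysctls only buys headroom; the ccache pile-up itself is what needs fixing.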

Comment 36 Jason Tibbitts 2016-01-14 07:51:54 UTC
OK, I have managed to reproduce something like this, but it doesn't have exactly the same symptoms.

Basically, I'm running 200+ ssh processes in parallel, some of them to the same machine I'm currently using.  Eventually I get this:

Jan 14 01:27:55 epithumia.math.uh.edu sshd[27164]: krb5_cc_initialize(): Disk quota exceeded

klist says I no longer have a keyring, and I can't acquire a new key:

epithumia:~/work/mass-build/> klist
klist: Credentials cache keyring 'persistent:7225:krb_ccache_gDUNson' not found

epithumia:~/work/mass-build/> kdestroy

epithumia:~/work/mass-build/> kinit
Password for tibbs.EDU:
kinit: Failed to store credentials: Disk quota exceeded while getting initial credentials

I have sssd debugging turned up as requested, but krb5_child.log is empty for the time around the issue.  I made it happen again, and there was still no output for the time it happened.

I have sssd_pam.log and sssd_default.log as well as krb5_child.log preserved, though I'm not sure any of this is relevant as it's sshd which is actually complaining and none of the sssd logs seem to contain any complaints about this at all.  If you let me know how much of the logs you would like (i.e. how many minutes before and after the issue) I'll be happy to send them along or, if they don't actually have any private data, just attach them to this ticket.

Comment 37 Jason Tibbitts 2016-01-17 17:27:44 UTC
OK, some more information.

Simo and I worked on this a bit on IRC and figured out that, in one case, doing ssh into localhost with GSSAPI credential delegation turned on could result in the duplication of a large keyring and thus quota exhaustion.  I can reproduce that one pretty easily, but I guess it's more of an ssh issue and not the same thing as the one in this ticket.
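As a client-side mitigation for that localhost-delegation variant, credential delegation can be switched off for local hosts in ssh_config; a sketch (the host patterns are examples; GSSAPIDelegateCredentials is the stock OpenSSH option):

```
# ~/.ssh/config -- don't copy the (possibly large) credential keyring
# into sessions to the local machine
Host localhost 127.0.0.1
    GSSAPIDelegateCredentials no
```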

Luckily I managed to duplicate the issue with logging enabled and have something more useful to add.

I have sssd set to renew credentials every three hours.  If I check my logs now, I see it ran at 19:05:42, and then again at 22:05:42, and the latter happened to be right in the middle of a mass rebuild I was doing where I had ssh sessions running repeatedly to 23 different hosts.  The ssh calls started failing, and in the journal I see:

Jan 16 22:05:42 epithumia.math.uh.edu [sssd[krb5_child[4248]]][4248]: Disk quota exceeded
Jan 16 22:05:42 epithumia.math.uh.edu [sssd[krb5_child[4248]]][4248]: Disk quota exceeded

at this point klist just tells me I no longer have a keyring, and I can't acquire a key.  Unfortunately I was not saving keyctl output during this run, but I still have sssd logging turned up.

sssd_default.log shows the renewal timer expiring on a couple of principals but no errors or complaints that I see.

krb5_child.log does show the disk quota exceeded error, but not much else explaining it.  Then it exits successfully. (?)

I will attach snippets of each available log over that period.

Comment 38 Jason Tibbitts 2016-01-17 17:46:32 UTC
Created attachment 1115639 [details]
Snippets of various sssd log files

Comment 39 Simo Sorce 2016-01-18 16:24:20 UTC
One more point:
on a hunch I asked David Howells, and he confirms that deleted keys still count against the quota until the kernel gets around to garbage-collecting them.
So some operations that replace key contents (but do not add to them), while technically they should not cause a quota issue, may end up throwing this error.

I do not know if this is the only issue, but it can certainly exacerbate the problem.

Comment 40 Fedora End Of Life 2016-07-19 19:50:15 UTC
Fedora 22 changed to end-of-life (EOL) status on 2016-07-19. Fedora 22 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

