Bug 963620 - journald: we need a way to get audit, cgroup, ... information attached to log messages instead of asynchronously reading them in
journald: we need a way to get audit, cgroup, ... information attached to log...
Status: NEW
Product: Fedora
Classification: Fedora
Component: systemd (Show other bugs)
rawhide
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: systemd-maint
Fedora Extras Quality Assurance
: FutureFeature
: 1026806 (view as bug list)
Depends On:
Blocks: systemd-RFE 1026830 1168956
  Show dependency treegraph
 
Reported: 2013-05-16 05:13 EDT by Jan Kaluža
Modified: 2017-04-24 16:13 EDT (History)
23 users (show)

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
mod_journald.c (5.05 KB, text/plain)
2013-05-16 05:14 EDT, Jan Kaluža
no flags Details
half-working patch (11.76 KB, patch)
2013-05-17 02:44 EDT, Jan Kaluža
no flags Details | Diff
half-working patch (11.61 KB, patch)
2013-05-17 02:50 EDT, Jan Kaluža
no flags Details | Diff
cgroup_task_path patch (3.28 KB, patch)
2013-07-11 19:33 EDT, Tejun Heo
no flags Details | Diff

  None (edit)
Description Jan Kaluža 2013-05-16 05:13:30 EDT
Hi,

this is not real bugreport, but probably RFE and sum up of my journald tests.

My initial goal was to create mod_journald module for httpd. This module just logs data to journal using sd-journald.h API. I'm attaching current testing version of this module so you can see my usage of this API.

Using "ab" benchmark tool, I have found out that httpd is able to handle 5000-5600 requests per second without logging to journald (just logging to error_log). httpd's LogLevel was set to Debug and therefore it produced 2 log messages per request.

Then I loaded mod_journald and disabled error_log logging by setting logging to /dev/null (this is not ideal, but for the results of this test it's not relevant). The requests per second value dropped from 5000-5600 to 1300. I know it's lot of logging, but still... 

Next, I have checked what's the journald doing while logging and found out it does 14 reads/syscalls of various /proc/* /sys/* files per log line. You can see that in code here: http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journald-server.c#n530.

In this file, if I commented out lines between 542-631 (all the *_get_* functions), the performance improved to +-4800 requests per second.

It's clear that we can't just remove them, but some kind of caching would definitely improve journald performance, since some of those files/values don't change during the process life.

I don't know right now, if for example selinux context can change during the process life, but for values which can't change, you could store them in some kind of hash table with data from "ucred" as key (or maybe even just PID?).

I can try to do that myself, but I'm not familiar with systemd code yet.
Comment 1 Jan Kaluža 2013-05-16 05:14:47 EDT
Created attachment 748695 [details]
mod_journald.c
Comment 2 Jan Kaluža 2013-05-16 05:24:25 EDT
Ok, selinux context was bad example, because its sent in SCM_SECURITY with the log message, but I think you got what I think.
Comment 3 Jan Kaluža 2013-05-17 02:44:34 EDT
Created attachment 749196 [details]
half-working patch

This patch illustrates my idea. It caches the results of reads/syscalls in hashmap and for next log message from the same pid, it loads them from the cache instead of getting them again.

The problem with this is obvious. If the process exits, we still keep its data in cache and once the PID is rotated and used again by another process, the messages in log will be bad.

Right now I'm not sure how to fix that, but maybe someone more familiar with systemd/journald will get some idea.

So, this patch is NOT production ready and does not work until someone finds out a way how to detect that process exited and we should therefore remove its data from the cache.
Comment 4 Jan Kaluža 2013-05-17 02:50:49 EDT
Created attachment 749198 [details]
half-working patch

I've somehow messed up order of free() and hashmap_free() in previous patch. patch description remains the same as in previous patch...
Comment 5 Jan Kaluža 2013-05-17 04:44:00 EDT
I think what is needed to acomplish this is something like session_key transfered with each log message. If the incoming message does not have session_key, journald would clear the cache and send the session key to process. For next log messages, the process would send session key, so journald would know this is still the same process running and could do some optimizations.

It's kind of analogy to http world...
Comment 6 Michal Schmidt 2013-05-17 05:21:53 EDT
A problem with the session key idea is that these fields are meant to be "trusted". It should be not be possible for the sending process to fake them or cause obsolete information to be logged.

Even the current approach of reading the data from /proc is known to be insufficient in this regard. There is a race between the sending of the log data and reading the information from /proc.

The definitive solution is to enhance the kernel to allow sending all this information with the message itself, similarly to the creds.
Comment 7 Jan Kaluža 2013-05-30 02:04:00 EDT
After discussion with Eric Paris, I will try to patch the kernel myself and hopefully improve the current situation.
Comment 8 Jan Kaluža 2013-06-25 09:07:57 EDT
I have already sent some kernel patches to Eric some time ago. I'm just adding link to them also to this bug for a tracking purpose:
http://people.apache.org/~jkaluza/patches/linux/
Comment 9 Kay Sievers 2013-06-25 10:10:12 EDT
I guess that needs to be some new SCM_AUDIT or so, instead of adding fields to
"struct ucred". I doubt we can touch/extend this public structure.
Comment 10 Jan Kaluža 2013-06-25 15:47:43 EDT
Thanks for the comment. That's actually thing I wasn't sure about (I'm also not sure if I can simply move audit fields from task_struct to task_cred, but Eric asked me to do it).

I would like to add more fields for the rest of information journald needs (just check http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journald-server.c#n548 to find out what it fetches from /proc according to pid). This patch was just something to begin with, since I don't have any kernel hacking experience and this was my first try to open kernel code.

I think it could be the best to create special structs like SCM_AUDIT for audit data, SCM_PROCESS for "_EXE" or "_CMDLINE", SCM_CGROUP for cgroups and so on. It's better than having everything in single struct.

Do you agree?
Comment 11 Jan Kaluža 2013-06-26 09:58:38 EDT
I have updated the second patch to send audit related data in msg with SCM_AUDIT type. The patches are still located at <http://people.apache.org/~jkaluza/patches/linux/>.

If you think this is the right way to go, please tell me and I will do the same for other data needed by journald.
Comment 12 Lennart Poettering 2013-06-26 11:13:22 EDT
Having SCM_AUDIT with sessionid+loginuid in a struct would be fantastic.

Having SCM_PROCINFO or so with the exec path, the cmdline and the comm field would be really nice, too! This one should maybe just carry a chain of NUL terminated strings: the comm name, the exe path followed by argv[0], argv[1] and so on.

And a final SCM_CGROUP would be absolutely great as well. For this we'd only need the cgroup path of the hierarchies. Tejun is currently working on marking one of the cgroup hierarchies as "primary", and that should be the one we would want to get the path for.

Adding these three would allow us to drop all code to check /proc entirely. That would fix a major race, and make things a lot faster!
Comment 13 Kay Sievers 2013-06-26 11:21:14 EDT
 #define SCM_CREDENTIALS 0x02		/* rw: struct ucred		*/
 #define SCM_SECURITY	0x03		/* rw: security label		*/
+#define SCM_AUDIT	0x03		/* rw: struct uaudit		*/

There are 03 two times now?
Comment 14 Jan Kaluža 2013-06-26 11:32:47 EDT
(In reply to Kay Sievers from comment #13)
>  #define SCM_CREDENTIALS 0x02		/* rw: struct ucred		*/
>  #define SCM_SECURITY	0x03		/* rw: security label		*/
> +#define SCM_AUDIT	0x03		/* rw: struct uaudit		*/
> 
> There are 03 two times now?

Thanks, you are right. I have fixed it. I have copy-pasted line with 0x02 and then bumped it to 0x03 without checking the previous line, stupid mistake...
Comment 15 Jan Kaluža 2013-06-26 11:39:16 EDT
(In reply to Lennart Poettering from comment #12)
> Having SCM_AUDIT with sessionid+loginuid in a struct would be fantastic.

That's what my last patch provides.

> Having SCM_PROCINFO or so with the exec path, the cmdline and the comm field
> would be really nice, too! This one should maybe just carry a chain of NUL
> terminated strings: the comm name, the exe path followed by argv[0], argv[1]
> and so on.
> 
> And a final SCM_CGROUP would be absolutely great as well. For this we'd only
> need the cgroup path of the hierarchies. Tejun is currently working on
> marking one of the cgroup hierarchies as "primary", and that should be the
> one we would want to get the path for.
> 
> Adding these three would allow us to drop all code to check /proc entirely.
> That would fix a major race, and make things a lot faster!

Thanks for this sum up. That's my goal. I'm only not sure with the "get the patches upstream" part. I still hope someone who also cares will take my patches and try to submit them (once they'll be ready for it of course). But in the end I could just RTFM and try to do it myself :).
Comment 16 Jan Kaluža 2013-07-03 02:38:56 EDT
I have another patch [1] implementing SCM_PROCINFO. This one is little bit more tricky thanks to kmalloc/kmemdup/kfree, but it works for me.

It sends current->comm and cmdline in single string where substrings are separated by \0 (as Lennart suggested).

[1] http://people.apache.org/~jkaluza/patches/linux/0003-Send-comm-and-cmdline-in-SCM_PROCINFO.patch
Comment 17 Jan Kaluža 2013-07-03 04:44:18 EDT
(In reply to Lennart Poettering from comment #12)
> And a final SCM_CGROUP would be absolutely great as well. For this we'd only
> need the cgroup path of the hierarchies. Tejun is currently working on
> marking one of the cgroup hierarchies as "primary", and that should be the
> one we would want to get the path for.

Don't you know if there's some patch implementing the "primary cgroup" or is that already submitted upstream? I was not able to find out anything like that in cgroup.h/cgroup.c in linux.

I could probably send the same data as in /proc/$pid/cgroup, maybe it could be more useful, but I have no deep experience with cgroups, so if you think just primary cgroup should be enough, I will do that.
Comment 18 Kay Sievers 2013-07-03 05:35:17 EDT
(In reply to Jan Kaluža from comment #17)
> (In reply to Lennart Poettering from comment #12)
> > And a final SCM_CGROUP would be absolutely great as well. For this we'd only
> > need the cgroup path of the hierarchies. Tejun is currently working on
> > marking one of the cgroup hierarchies as "primary", and that should be the
> > one we would want to get the path for.
> 
> Don't you know if there's some patch implementing the "primary cgroup" or is
> that already submitted upstream? I was not able to find out anything like
> that in cgroup.h/cgroup.c in linux.
> 
> I could probably send the same data as in /proc/$pid/cgroup, maybe it could
> be more useful, but I have no deep experience with cgroups, so if you think
> just primary cgroup should be enough, I will do that.

It's not available at the moment. We need and want only one cgroup to be
attached, which currently is hierarchy 1, the named hierarchy from systemd,
not all the individual controllers.

Here is the similar problem, which needs to be solved too:
  https://code.google.com/p/d-bus/source/browse/message.c#623

We should probably just make this kernel function to accept 0 and it will
return us the named hierarchy. I'll ask Tejun ...
Comment 19 Jan Kaluža 2013-07-03 06:38:11 EDT
Thanks. I think I should probably wait for the new cgroup API to do SCM_CGROUP. I'm not sure it's useful right now to implement another patch around <http://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/?h=review-task_cgroup_path_from_hierarchy>.
Comment 20 Kay Sievers 2013-07-03 07:04:30 EDT
This is merged and should be available in the rawhide kernel the next days:
  http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=857a2beb09ab83e9a8185821ae16db7dfbe8b837

Just pass "1" to it for now in your tests, it should work just fine.

I'll ask Tejun to make sure it will always do the right thing without knowing
a specific hierarchy number ...
Comment 21 Jan Kaluža 2013-07-03 07:13:14 EDT
Great, thanks for the help and info. I will update my VM and try to implement SCM_CGROUP using this function. I think it could be relatively easy, since it will be similar patch to SCM_PROCINFO.
Comment 22 Kay Sievers 2013-07-03 12:59:55 EDT
Talked to Tejun, just stick with the "1" for the cgroup id, that will be the right number now and in the future.
Comment 23 Jan Kaluža 2013-07-09 07:39:46 EDT
I have tried to add SCM_CGROUP support using following patch: <http://people.apache.org/~jkaluza/patches/linux/0004-Send-cgroup_path-in-SCM_CGROUP.patch>. I'm using linux from git ddcf6600b133697adbafd96e080818bdc0dfd028.

To get the cgroup path, I do:

ret = task_cgroup_path_from_hierarchy(current, 1, *cgroup_path, PAGE_SIZE);

but *cgroup_path is 0 after that.

If I replace this function by simple 'strcpy(*cgroup, "test");', I see "test" in my testing user-space app later, so I presume it's really problem with task_cgroup_path_from_hierarchy method.

I will try to debug it more.
Comment 24 Jan Kaluža 2013-07-10 03:41:15 EDT
task_cgroup_path_from_hierarchy returns -2 (-ENOENT), according to cgroup.c, it's caused by root = idr_find(&cgroup_hierarchy_idr, hierarchy_id); returning NULL.
Comment 25 Tejun Heo 2013-07-11 18:36:22 EDT
Right, I reserved hierarchy_id of 1 for the unified hierarchy, so the first named hierarchy got punted to 2 and -ENOENT. Hmmm... nasty. I think I'll just add task_cgroup_path() which returns path from the first hierarchy.

Thanks.
Comment 26 Tejun Heo 2013-07-11 19:33:31 EDT
Created attachment 772483 [details]
cgroup_task_path patch

I'm gonna post and commit this patch to cgroup/for-3.12 so that people don't have to worry about multiple hierarchies or hierarchy IDs, which are silly anyway.

Thanks.
Comment 27 Jan Kaluža 2013-07-12 00:23:53 EDT
Thanks, I will give it a try.
Comment 28 Jan Kaluža 2013-07-15 09:20:14 EDT
Tejun,

with the patch you have attached in Comment 26 it works perfectly. Thank you very much for your work on this.
Comment 29 Jan Kaluža 2013-07-15 09:24:25 EDT
(In reply to Lennart Poettering from comment #12)
> Having SCM_AUDIT with sessionid+loginuid in a struct would be fantastic.
> 
> Having SCM_PROCINFO or so with the exec path, the cmdline and the comm field
> would be really nice, too! This one should maybe just carry a chain of NUL
> terminated strings: the comm name, the exe path followed by argv[0], argv[1]
> and so on.
> 
> And a final SCM_CGROUP would be absolutely great as well. For this we'd only
> need the cgroup path of the hierarchies. Tejun is currently working on
> marking one of the cgroup hierarchies as "primary", and that should be the
> one we would want to get the path for.
> 
> Adding these three would allow us to drop all code to check /proc entirely.
> That would fix a major race, and make things a lot faster!

All this should be done in my 4 patches at <http://people.apache.org/~jkaluza/patches/linux/>. I will review them and fix possible formatting issues. Then I will have to RTM to find out what to do next :).
Comment 30 Jan Kaluža 2013-08-07 04:31:25 EDT
I have updated the patches to fix issues found by checkpatch.pl script. There are still some warnings I haven't fixed, because it would make the code inconsistent with the rest of file for which the warnings have been printed. In case you think I should really fix the remaining warnings, I can of course do that.

URL remains the same: http://people.apache.org/~jkaluza/patches/linux/
Comment 31 Richard Guy Briggs 2013-08-22 18:03:58 EDT
In patch 0001-Move-sessionid-loginuid-from-task_struct-to-cred.patch:

I suspect both instances of __task_cred() should be wrapped in rcu_read_lock()/rcu_read_unlock().
Comment 32 Jan Kaluža 2013-08-23 01:31:40 EDT
Thanks for the review. I have fixed the first patch and rebased others to latest git revision of kernel. URL with patches remains the same.
Comment 33 Jan Kaluža 2013-08-28 06:30:54 EDT
Just for the info, I've sent the patches upstream: <http://marc.info/?l=linux-netdev&m=137761450411591&w=2>.
Comment 34 Fedora End Of Life 2013-09-16 09:56:34 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 20 development cycle.
Changing version to '20'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora20
Comment 35 Michal Schmidt 2013-11-05 08:37:27 EST
*** Bug 1026806 has been marked as a duplicate of this bug. ***
Comment 36 Zbigniew Jędrzejewski-Szmek 2013-11-05 14:50:42 EST
(In reply to Jan Kaluža from comment #33)
> Just for the info, I've sent the patches upstream:
> <http://marc.info/?l=linux-netdev&m=137761450411591&w=2>.
So, we have a nack from Eric Biederman: "The process can just as easily pass the
information in userspace directly.", which shows that he doesn't understand what this is for. What now?
Comment 37 Jan Kaluža 2014-01-13 03:22:31 EST
Tejun asked me to re-submit my patchset, so I've rebased it to latest net-next and re-submitted my patches to net-dev mailing list.

net-dev thread: http://marc.info/?l=linux-netdev&m=138960028628028&w=2
URL remains the same: http://people.apache.org/~jkaluza/patches/linux/
Comment 38 Lennart Poettering 2014-02-23 12:00:04 EST
*** Bug 1026806 has been marked as a duplicate of this bug. ***
Comment 39 Fedora End Of Life 2015-05-29 05:03:56 EDT
This message is a reminder that Fedora 20 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 20. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '20'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 20 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.
Comment 40 Fedora End Of Life 2015-11-04 06:44:02 EST
This message is a reminder that Fedora 21 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 21. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '21'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 21 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.
Comment 41 Jan Kurik 2016-02-24 08:12:24 EST
This bug appears to have been reported against 'rawhide' during the Fedora 24 development cycle.
Changing version to '24'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora24#Rawhide_Rebase
Comment 42 Florian Weimer 2016-04-23 11:59:50 EDT
I think this is still present in rawhide.

Note You need to log in before you can comment on or make changes to this bug.