Bug 1379187

Summary: process accounting rarely works anymore
Product: [Fedora] Fedora Reporter: Ian Donaldson <iand>
Component: psacctAssignee: Jan Rybar <jrybar>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 26CC: jaromir.capik, jrybar, sixerjman
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-03-13 13:25:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Ian Donaldson 2016-09-26 00:02:11 UTC
Description of problem:


Process accounting, as displayed by lastcomm(1) rarely works anymore.

In recent FC releases more often than not, process accounting on 
all my systems disables (possibly during log rotate) and fails to reenable
mostly.

On FC24 it is even worse.   On some systems it fails to start even
when manually turned on.   No error messages.


Version-Release number of selected component (if applicable):

psacct-6.6.2-4.fc24.i686
kernel-PAE-4.7.3-200.fc24.i686


How reproducible:

About 95%

Steps to Reproduce:
1./usr/sbin/accton /var/account/pacct
2.ls -l /var/account/pacct
3. observe size is zero and stays zero after multiple commands are run

Actual results:

lastcomm shows no output

Expected results:

lastcomm should show recent processes exiting


Additional info:

On a system just now...

$ ll /var/account
total 4924K
drwxr-xr-x.  2 root root    4096 Sep 17 03:25 .
drwxr-xr-x. 21 root root    4096 Sep 15 01:29 ..
-rw-------.  1 root root       0 Sep 17 03:25 pacct
-rw-------.  1 root root  108547 Oct 17  2014 pacct-20141017.gz
-rw-------.  1 root root  110086 Oct 18  2014 pacct-20141018.gz
-rw-------.  1 root root  313866 Dec 22  2014 pacct-20141222.gz
-rw-------.  1 root root   64624 Dec 24  2014 pacct-20141224.gz
-rw-------.  1 root root  134019 Jul 11  2015 pacct-20150903.gz
-rw-------.  1 root root  359989 Sep 14  2015 pacct-20150914.gz
-rw-------.  1 root root  185507 Sep 15  2015 pacct-20150915.gz
-rw-------.  1 root root   55618 Sep 16  2015 pacct-20150916.gz
-rw-------.  1 root root   53617 Sep 17  2015 pacct-20150917.gz
-rw-------.  1 root root   61472 Sep 18  2015 pacct-20150918.gz
-rw-------.  1 root root   50192 Sep 19  2015 pacct-20150919.gz
-rw-------.  1 root root   50792 Sep 20  2015 pacct-20150920.gz
-rw-------.  1 root root   11841 Dec 29  2015 pacct-20151229.gz
-rw-------.  1 root root  457441 May 17 00:20 pacct-20160517.gz
-rw-------.  1 root root  277307 May 17 06:46 pacct-20160519.gz
-rw-------.  1 root root   92516 May 20 03:29 pacct-20160520.gz
-rw-------.  1 root root   18359 Jun 14 03:09 pacct-20160614.gz
-rw-------.  1 root root   21677 Jul  9 03:39 pacct-20160709.gz
-rw-------.  1 root root  114841 Jul 10 03:37 pacct-20160710.gz
-rw-------.  1 root root  112930 Jul 11 03:25 pacct-20160711.gz
-rw-------.  1 root root  111408 Jul 12 03:10 pacct-20160712.gz
-rw-------.  1 root root  376406 Jul 13 03:41 pacct-20160713.gz
-rw-------.  1 root root  113951 Jul 14 03:51 pacct-20160714.gz
-rw-------.  1 root root  110690 Jul 15 03:12 pacct-20160715.gz
-rw-------.  1 root root  114861 Jul 16 03:26 pacct-20160716.gz
-rw-------.  1 root root  116726 Jul 17 03:51 pacct-20160717.gz
-rw-------.  1 root root   19536 Aug 11 03:41 pacct-20160811.gz
-rw-------.  1 root root   27966 Sep  9 03:21 pacct-20160909.gz
-rw-------.  1 root root   39039 Sep 15 03:14 pacct-20160915.gz
-rw-------.  1 root root  132461 Sep 16 03:14 pacct-20160916.gz
-rw-------.  1 root root 1150336 Sep 17 03:25 pacct-20160917

$ ls -lc /usr/lib/os.release.d/issue-server
-rw-r--r--. 1 root root 83 Sep 15 00:52 /usr/lib/os.release.d/issue-server
(shows date FC24 upgrade was done, via dnf distro-sync)


$ ll /var/account/pacct
-rw-------. 1 root root 0 Sep 17 03:25 /var/account/pacct

# systemctl is-active psacct.service
active

# /usr/sbin/accton /var/account/pacct
Turning on process accounting, file set to '/var/account/pacct'.
# ls -l /var/account/pacct
-rw-------. 1 root root 0 Sep 17 03:25 /var/account/pacct
# ls -l /var/account/pacct
-rw-------. 1 root root 0 Sep 17 03:25 /var/account/pacct
# ls -l /var/account/pacct
-rw-------. 1 root root 0 Sep 17 03:25 /var/account/pacct
# ls -l /var/account/pacct
-rw-------. 1 root root 0 Sep 17 03:25 /var/account/pacct

# strace -e acct /usr/sbin/accton /var/account/pacct
acct("/var/account/pacct")              = 0
Turning on process accounting, file set to '/var/account/pacct'.
+++ exited with 0 +++

# strace -e acct /usr/sbin/accton /var/account/pacct
acct("/var/account/pacct")              = 0
Turning on process accounting, file set to '/var/account/pacct'.
+++ exited with 0 +++

# ls -l /var/account/pacct
-rw-------. 1 root root 0 Sep 17 03:25 /var/account/pacct


As you can see the acct(2) system call is not failing. Its just not doing
anything in the kernel.


There are no selinux or kernel messages related to these attempts.

Its dead Jim.

Comment 1 Dominique Brazziel 2017-07-24 13:40:37 UTC
This bug persists in Fedora 25.  I had implemented a hack in '/etc/logrotate.d/psacct' to repeated turn process accounting on until the size of pacct started increasing, but even that doesn't work now.

Oddly enough, if I touch '/tmp/pacct', then turn accounting on manually via 'accton /tmp/pacct', accounting records are written to the new file.  The only difference I can see are the permissions in the two directories.  I will see if account will work in '/var/account' after the permissions are set like they are in /tmp.

Comment 2 Dominique Brazziel 2017-07-24 15:45:47 UTC
Root cause here:  After remembering seeing something about high and low water marks for filesystem free space in the kernel accounting code (acct.c), I reviewed it again.  The following constants are defined towards the top of acct.c:

/*
 * These constants control the amount of freespace that suspend and
 * resume the process accounting system, and the time delay between
 * each check.
 * Turned into sysctl-controllable parameters. AV, 12/11/98
 */

int acct_parm[3] = {4, 2, 30};
#define RESUME          (acct_parm[0])  /* >foo% free space - resume */
#define SUSPEND         (acct_parm[1])  /* <foo% free space - suspend */
#define ACCT_TIMEOUT    (acct_parm[2])  /* foo second timeout between checks */

The first time I looked at the code years ago was after my root partition filled up, but since I have been running consistently with a little less than 4% free space.  The reason the intermittent stoppages were fixed with the repetitive 'accton /var/account/pacct' calls in logrotate is because just _over_ 4% free space would be reached after compression.

After I set the 'kernel.acct' sysctl parm to '3, 2, 30' accounting resumed normally.  I think the root cause for those experiencing this problem is the default high water mark for free space on the root/var partition accounting resume is rarely if ever hit.  The /tmp partition worked because on my system with 4G of RAM the free space is 99% free.

I am adding a the kernel.acct='3 2 30' fragment to '/etc/sysctl.d/' which should kill this bug's effect on me for good.  Perhaps a NEWS item in the next release of the psacct package could be added about this bug.

Comment 3 Ian Donaldson 2017-07-25 00:28:22 UTC
Updating to fc25; haven't seen it in fc26 yet though

Comment 4 Dominique Brazziel 2017-07-25 13:30:36 UTC
Ian, do you have a separate partition for /var?  What is your average free space available for the 'var' partition (root (/) or (/var))?

The Debian atop package has a README which discusses possible causes for accounting malfunction, including links to the kernel bugzilla:

KERNEL ISSUES WITH PROCESS ACCOUNTING
-------------------------------------

Newer upstream kernels (e.g. 4.8 and 4.9)have two issuess
with process accounting:

1) Sometimes process accounting does not work at all¹. Atopacctd tries to
work around this issue, by retrying to initialize process accounting
several times.

2) When using the NETLINK inface, the command TASKSTATS_CMD_GET
consequently returns -EINVAL². Atopacctd needs NETLINK to be able to
be triggered that some process in the system has finished. In this way,
atopacctd can be in a blocking state as long as no processes terminate.
When atopacctd detects that NETLINK fails, it switches into a polling
mode to periodically try if it can read process accounting records as
a workaround. This issue has to do with cpumasks and you can work-around
it by building a kernel that has CONFIG_NR_CPUS configured to exactly
the amount of CPUs (logical CPUs) in the system the kernel runs on. You
can find this kernel option under "Processor type and features" -->
"Maximum number of CPUs".


[1] Bug 190271 - process accounting sometimes does not work
https://bugzilla.kernel.org/show_bug.cgi?id=190271

[2] Bug 190711 - Process accounting: Using the NETLINK interface,
the command TASKSTATS_CMD_GET returns -EINVAL
https://bugzilla.kernel.org/show_bug.cgi?id=190711

Linux kernel mailing list thread:

[REGRESSION] Two issues that prevent process accounting (taskstats) from
working correctly
https://lkml.org/lkml/2016/12/19/182


Gerlof Langeveld
gerlof.langeveld

Comment 5 Ian Donaldson 2017-07-25 13:44:59 UTC
Nearly all my systems have /var on the root fs, with gigabytes of free space,
so its not a free space threshold thing.

I've checked a few of my fc25 and f26 systems today; some had accounting working,
some didn't.  Those that didn't, did when I ran the accton, although
all of them have psacct enabled at startup.

The original report I posted was on fc24 and on many of the systems,
running the accton didn't work at all.  Did nothing.

Below is a fc26 system that didn't have it working until I ran accton manually:

--
tomato$ $ ll /var/account
total 9200K
drwxr-xr-x.  2 root root    4096 Jul 25 03:12 .
drwxr-xr-x. 25 root root    4096 Jul 21 05:38 ..
-rw-------.  1 root root       0 Jul 25 03:12 pacct
-rw-------.  1 root root  230819 Aug 18  2016 pacct-20160818.gz
-rw-------.  1 root root  233647 Aug 19  2016 pacct-20160819.gz
-rw-------.  1 root root  236415 Aug 20  2016 pacct-20160820.gz
-rw-------.  1 root root  230192 Aug 21  2016 pacct-20160821.gz
-rw-------.  1 root root  238269 Aug 22  2016 pacct-20160822.gz
-rw-------.  1 root root  234326 Aug 23  2016 pacct-20160823.gz
-rw-------.  1 root root  230161 Aug 24  2016 pacct-20160824.gz
-rw-------.  1 root root  234752 Aug 25  2016 pacct-20160825.gz
-rw-------.  1 root root  231877 Aug 26  2016 pacct-20160826.gz
-rw-------.  1 root root  230625 Aug 27  2016 pacct-20160827.gz
-rw-------.  1 root root  379468 Sep 15  2016 pacct-20160915.gz
-rw-------.  1 root root   22545 Oct 14  2016 pacct-20161014.gz
-rw-------.  1 root root  244610 Nov  9  2016 pacct-20161109.gz
-rw-------.  1 root root   46077 Feb  1 03:39 pacct-20170201.gz
-rw-------.  1 root root  311738 Feb  2 03:40 pacct-20170202.gz
-rw-------.  1 root root  272952 Feb  3 03:36 pacct-20170203.gz
-rw-------.  1 root root  269130 Feb  4 03:48 pacct-20170204.gz
-rw-------.  1 root root  270308 Feb 25 03:06 pacct-20170225.gz
-rw-------.  1 root root  288740 Feb 26 03:32 pacct-20170226.gz
-rw-------.  1 root root  269191 Feb 27 03:32 pacct-20170227.gz
-rw-------.  1 root root  279100 Feb 28 03:11 pacct-20170228.gz
-rw-------.  1 root root  264245 Mar  1 03:25 pacct-20170301.gz
-rw-------.  1 root root  270397 Mar  2 03:48 pacct-20170302.gz
-rw-------.  1 root root  220411 Jun  9 03:10 pacct-20170609.gz
-rw-------.  1 root root  229314 Jun 10 03:21 pacct-20170610.gz
-rw-------.  1 root root  229254 Jun 11 03:31 pacct-20170611.gz
-rw-------.  1 root root  224538 Jun 12 03:08 pacct-20170612.gz
-rw-------.  1 root root  234575 Jun 13 03:50 pacct-20170613.gz
-rw-------.  1 root root  318619 Jul 21 06:21 pacct-20170722.gz
-rw-------.  1 root root  233761 Jul 23 03:14 pacct-20170723.gz
-rw-------.  1 root root 2133056 Jul 25 03:12 pacct-20170725

tomato$ df -h /var/account
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol01   60G  6.9G   50G  13% /

tomato$ systemctl |grep psacct
  psacct.service                                                                                                                 loaded active exited    Kernel process accounting

tomato$ ls -l /var/account/pacct
-rw-------. 1 root root 0 Jul 25 03:12 /var/account/pacct

tomato# accton /var/account/pacct
Turning on process accounting, file set to '/var/account/pacct'.

tomato$ ls -l /var/account/pacct
-rw-------. 1 root root 64 Jul 25 13:38 /var/account/pacct
tomato$ ls -l /var/account/pacct
-rw-------. 1 root root 128 Jul 25 13:39 /var/account/pacct
tomato$ ls -l /var/account/pacct
-rw-------. 1 root root 192 Jul 25 13:39 /var/account/pacct


tomato$ rpm -q psacct kernel
psacct-6.6.2-5.fc26.x86_64
kernel-4.11.3-200.fc25.x86_64
kernel-4.11.10-300.fc26.x86_64
kernel-4.11.11-300.fc26.x86_64

tomato$ ls -l /etc/systemd/system/multi-user.target.wants/psacct.service
lrwxrwxrwx. 1 root root 34 Apr  2  2012 /etc/systemd/system/multi-user.target.wants/psacct.service -> /lib/systemd/system/psacct.service


Most of my systems are distro-sync upgraded from earlier releases.
--

Comment 6 Dominique Brazziel 2017-07-26 15:05:20 UTC
I dropped my thresholds down to 2 (resume) and 1 (suspend) respectively after seeing no accounting records being written after logrotate.  The first couple of times after I tried 'accton /var/account/pacct' manually (no joy) I did a loop ('while true;do accton /var/account/pacct;done') and accounting started up again.

Comment 7 Dominique Brazziel 2017-07-26 15:06:06 UTC
Ian, what kernel version are you running?  I am tracking fedora-updates and my kernel is 4.11.11.

Comment 8 Ian Donaldson 2017-07-28 06:15:38 UTC
kernel-4.11.11-300.fc26.x86_64 (see above)

I note we get a lot of these in the logs also... but might just be one per reboot. (did a few reboots on 22nd)

tomato$ grep -i 'Process accounting' /var/log/kernel_messages
Jul 22 08:57:11 tomato kernel: Process accounting resumed
Jul 22 09:00:42 tomato kernel: Process accounting resumed
Jul 22 09:03:53 tomato kernel: Process accounting resumed
Jul 22 09:24:06 tomato kernel: Process accounting resumed
Jul 22 09:28:40 tomato kernel: Process accounting resumed
Jul 25 00:25:46 tomato kernel: Process accounting resumed
Jul 25 00:52:28 tomato kernel: Process accounting resumed
Jul 25 13:38:58 tomato kernel: Process accounting resumed
Jul 26 03:41:14 tomato kernel: Process accounting resumed
Jul 27 03:08:14 tomato kernel: Process accounting resumed
Jul 28 03:26:13 tomato kernel: Process accounting resumed

tomato$ df -h /var
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol01   60G  7.0G   50G  13% /


Another fc26 (distro-sync'd) system I have has nothing going into pacct,
but its all enabled.

onion$  df -h /var
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol01   60G  6.5G   50G  12% /


onion$ ll /var/account/pacct
-rw-------. 1 root root 0 Jul 28 03:19 /var/account/pacct

onion$ grep -i 'Process accounting' /var/log/kernel_messages
Jul 22 08:57:35 onion kernel: Process accounting resumed
Jul 22 09:00:31 onion kernel: Process accounting resumed
Jul 22 09:04:06 onion kernel: Process accounting resumed
Jul 22 09:15:16 onion kernel: Process accounting resumed
Jul 22 09:25:49 onion kernel: Process accounting resumed
Jul 23 03:12:29 onion kernel: Process accounting resumed
Jul 25 00:52:30 onion kernel: Process accounting resumed
Jul 25 03:41:14 onion kernel: Process accounting resumed
Jul 26 03:36:13 onion kernel: Process accounting resumed
Jul 27 03:06:12 onion kernel: Process accounting resumed

onion$ grep psacct /var/log/messages
Jul 25 00:50:10 onion audit: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=psacct comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 25 00:50:10 onion audit: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=psacct comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 25 00:52:30 onion audit: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=psacct comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

onion$ systemctl status psacct
* psacct.service - Kernel process accounting
   Loaded: loaded (/usr/lib/systemd/system/psacct.service; enabled; vendor preset: disabled)
   Active: active (exited) since Tue 2017-07-25 00:52:30 UTC; 3 days ago
  Process: 869 ExecStart=/usr/sbin/accton /var/account/pacct (code=exited, status=0/SUCCESS)
  Process: 849 ExecStartPre=/usr/libexec/psacct/accton-create (code=exited, status=0/SUCCESS)
 Main PID: 869 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/psacct.service

Bumping ver to include fc26

Comment 9 Dominique Brazziel 2017-07-28 17:17:49 UTC
It seems like this effect happens right after reset.  I had a theory early on that because the previous day's account file was copied to pacct.1 before the new pacct had been created and logging resumed that the kernel continued writing to the old log file (pacct.1) for a period.  That prompted me to write this hack to logrotate.d/psacct, which had been working pretty reliably for a few years until my recent free space issues:

/var/account/pacct {
    compress
    delaycompress
    notifempty
    daily
    rotate 31
    #create 0600 root root
    create 0640 root adm
    postrotate
	#
	# See BZ#1379187
	#
	ACCTSIZE=$(stat --printf="%s" /var/account/pacct) 
        while [ $ACCTSIZE -eq 0  ];do
            sleep 1
            /usr/sbin/accton /var/account/pacct
            ACCTSIZE=$(stat --printf="%s" /var/account/pacct) 
        done
    endscript
}

Comment 10 Dominique Brazziel 2017-08-06 13:35:17 UTC
I am now on Release 26 after doing system upgrade.  This bug is pesky, I am going to grok the kernel code and the two links from the atop package which referenced possible causes of accounting showstoppers.  

Ian, do you run atop (https://apps.fedoraproject.org/packages/atop)?  If so, do you use the atopacct daemon?  I disabled it as I wanted only one package managing the accounting file.

Comment 11 Ian Donaldson 2017-08-14 13:05:36 UTC
Sorry, no, don't have atop installed.

Comment 12 Jan Rybar 2018-02-13 15:57:46 UTC
We tracked the issue to kernel.
The bug seems to be in kernel/acct.c:check_free_space(), which has been fixed in upstream since January 2018.
More info at upstream
http://post-office.corp.redhat.com/archives/rhkernel-list/2018-January/msg02912.html

'accton' itself works nominally. Calling of system call acct() returns 0. The bug is within acct() itself.

You should do fine with newest kernel (4.15 covers it for sure - seen in sources).

Comment 13 Dominique Brazziel 2018-02-13 16:21:43 UTC
Thanks for the heads up, Jan! The upstream link you posted seems broken, I can't reach it, get the following from chrome:

This site can’t be reached
post-office.corp.redhat.com’s server IP address could not be found.

I am really interested to see the discussion, could you re-post with an updated link? Thanks again! :D

Comment 14 Jan Rybar 2018-02-14 09:27:00 UTC
Sorry, my bad.

Here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4d9570158b6260f449e317a5f9ed030c2504a615

Present in kernel's Pagure git (somewhere... Looking for the commit is just pain.)

Comment 15 Jan Rybar 2018-03-02 17:06:12 UTC
Hi Dominique,

I couldn't reproduce the issue after kernel update.
Can you confirm this so I can close this bugzilla?

Jan

Comment 16 Dominique Brazziel 2018-03-13 13:17:50 UTC
It seems fine on latest kernel (4.15.6). Thanks for the work everybody!