1047719 – journalctl extremely sluggish when journal is big

Bug 1047719 - journalctl extremely sluggish when journal is big

Summary: journalctl extremely sluggish when journal is big

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	systemd
Sub Component:
Version:	20
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	systemd-maint
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2014-01-02 00:25 UTC by Lars E. Pettersson
Modified:	2015-05-29 10:38 UTC (History)
CC List:	14 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-05-29 10:38:24 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Lars E. Pettersson 2014-01-02 00:25:32 UTC

Description of problem:

Running journalctl takes very long time, over 25 minutes.

Version-Release number of selected component (if applicable):

systemd-208-9.fc20.x86_64

How reproducible:

Always

Steps to Reproduce:
1. time journalctl | grep xyz 
2.
3.

Actual results:

real    25m31.478s
user    11m2.966s
sys    2m36.218s 

Expected results:

Way faster execution.

Additional info:

# journalctl --disk-usage 
Journals take up 3.6G on disk.

# journalctl --until=-1m | head -1 
-- Logs begin at Sun 2013-07-28 18:25:57 CEST, end at Thu 2014-01-02 01:20:02 CE

The same question on the text files in /var/log takes:

# time grep -r --exclude-dir=journal xyz /var/log
...
real    1m6.362s
user    0m2.253s
sys    0m1.201s 
# du -sh --exclude=journal /var/log
620M    /var/log

System info:

# cat /proc/cpuinfo | grep "model name"
model name	: AMD Turion(tm) II Neo N54L Dual-Core Processor
model name	: AMD Turion(tm) II Neo N54L Dual-Core Processor
# free -h
             total       used       free     shared    buffers     cached
Mem:          1.8G       1.7G       127M       1.1M       398M       826M
-/+ buffers/cache:       523M       1.3G
Swap:         4.3G       133M       4.1G

The system is a home web and mail server running at runlevel 3.

Comment 1 Lars E. Pettersson 2014-01-26 08:37:37 UTC

It should be added that this also effects systemctl:

# time systemctl status postfix.service 
postfix.service - Postfix Mail Transport Agent
   Loaded: loaded (/usr/lib/systemd/system/postfix.service; enabled)
   Active: active (running) since Sun 2014-01-05 18:16:48 CET; 2 weeks 6 days ago
 Main PID: 6548 (master)
   CGroup: /system.slice/postfix.service
           ├─ 6548 /usr/libexec/postfix/master -w
           ├─12919 qmgr -l -t fifo -u
           ├─12922 tlsmgr -l -t unix -u
           ├─12937 pickup -l -t fifo -u
           └─14566 anvil -l -t unix -u

[ten log lines showed]
Hint: Some lines were ellipsized, use -l to show in full.

real	0m42.251s
user	0m0.026s
sys	0m0.361s

Comment 2 Lars E. Pettersson 2014-02-27 19:58:50 UTC

58 days later, and the problem still persist:

# time journalctl | grep xyz 
...
real	20m6.684s
user	12m18.541s
sys	2m44.568s
# 
# time grep -r --exclude-dir=journal xyz /var/log
...
real	1m47.507s
user	0m2.836s
sys	0m1.631s
#

Things was a tad faster at my two workstation computers at home and at work (both Dell T1650 with E3-1270 Xeon and 16GB of RAM)

Home:
# time journalctl | grep xyz 
...
real	0m46.057s
user	0m18.918s
sys	0m3.308s
# time grep -r --exclude-dir=journal xyz /var/log
...
real	0m4.424s
user	0m0.101s
sys	0m0.038s
# 

Work:
# time journalctl | grep xyz 
...
real	0m40.357s
user	0m11.891s
sys	0m1.627s
# time grep -r --exclude-dir=journal xyz /var/log

real	0m6.646s
user	0m0.125s
sys	0m0.040s
#

But they have smaller journals, which might be the answer, 772.6M at home, and 553.4M at work, compared to 4.0G for my server that is the really sluggish one.

Note that 'systemctl status <whatever>' is equally sluggish, as it reads the journal, and prints some log-lines at the end of the status message.

Comment 3 Lars E. Pettersson 2014-06-07 21:15:03 UTC

158 days later, and still no action on this issue...

This time it was a tad faster, but still waaaay too long time:

# time journalctl | grep xyz
...
real	12m8.549s
user	8m55.324s
sys	2m16.010s

# journalctl --disk-usage 
Journals take up 3.9G on disk.

# journalctl --until=-1m | head -1 
-- Logs begin at Fri 2014-03-14 23:07:16 CET, end at Sat 2014-06-07 23:05:48 CEST. --

# time grep -r --exclude-dir=journal xyz /var/log
...
real	1m52.778s
user	0m4.206s
sys	0m2.491s

# du -sh --exclude=journal /var/log
1.1G	/var/log

# rpm -qa systemd* | sort
systemd-208-16.fc20.x86_64
systemd-libs-208-16.fc20.x86_64
systemd-python-208-16.fc20.x86_64
systemd-ui-3-1.fc20.x86_64

Comment 4 Lars E. Pettersson 2014-07-19 11:38:44 UTC

200 days later, and still no action on this issue...

The problem persist.

# time journalctl | grep xyz
...
real	31m5.530s
user	14m11.046s
sys	18m21.174s

# journalctl --disk-usage 
Journals take up 4.0G on disk.

# journalctl --until=-1m | head -1 
-- Logs begin at Thu 2014-05-01 16:29:17 CEST, end at Sat 2014-07-19 13:31:01 CEST. --

# time grep -r --exclude-dir=journal xyz /var/log
...
real	2m8.209s
user	0m4.898s
sys	0m2.757s

# du -sh --exclude=journal /var/log
1.3G	/var/log

# rpm -qa systemd* | sort
systemd-208-19.fc20.x86_64
systemd-libs-208-19.fc20.x86_64
systemd-python-208-19.fc20.x86_64
systemd-ui-3-1.fc20.x86_64

Comment 5 Lars E. Pettersson 2014-09-09 19:44:25 UTC

252 days later, and still no response whatsoever from the maintainers...

The journal is still sluggish. Se below:

# time journalctl | grep xyz
...
real	29m36.162s
user	14m14.193s
sys	17m38.224s

# journalctl --disk-usage 
Journals take up 4.0G on disk.

# journalctl --until=-1m | head -1 
-- Logs begin at Thu 2014-06-05 20:34:43 CEST, end at Tue 2014-09-09 21:33:17 CEST. --

Compare this with the time it takes to do the same with syslog text files covering the same period.

# time grep -r --exclude-dir=journal xyz /var/log
...
real	2m9.329s
user	0m5.054s
sys	0m2.879s

# du -sh --exclude=journal /var/log
1.4G	/var/log

# rpm -qa systemd* | sort
systemd-208-21.fc20.x86_64
systemd-libs-208-21.fc20.x86_64
systemd-python-208-21.fc20.x86_64
systemd-ui-3-1.fc20.x86_64

A simple 'systemctl status' also takes quite a long time to complete:

# time systemctl status postfix.service 
postfix.service - Postfix Mail Transport Agent
   Loaded: loaded (/usr/lib/systemd/system/postfix.service; enabled)
   Active: active (running) since Sun 2014-05-04 20:29:11 CEST; 4 months 6 days ago
 Main PID: 1350 (master)
   CGroup: /system.slice/postfix.service
           ├─ 1350 /usr/libexec/postfix/master -w
           ├─ 3306 qmgr -l -t fifo -u
           ├─ 3346 tlsmgr -l -t unix -u
           ├─27379 pickup -l -t fifo -u
           └─28486 anvil -l -t unix -u

... some log lines ...

real	0m14.598s
user	0m0.039s
sys	0m0.244s

(it is the log lines that take time, making the 'systemctl status' command really annoying to use)

Comment 6 Zbigniew Jędrzejewski-Szmek 2014-09-25 02:32:06 UTC

What filesystem is used for /var/log/journal?

Comment 7 Lars E. Pettersson 2014-09-25 08:56:46 UTC

/var/log/journal resides on a / raid1 partition (consisting of three drives, initially two) using ext4. The same partition as the text based logs refereed to earlier.

Comment 8 Lars E. Pettersson 2014-09-25 19:56:15 UTC

(In reply to Lars E. Pettersson from comment #7)
> /var/log/journal resides on a / raid1 partition (consisting of three drives,
> initially two) using ext4. 

I have to correct myself. That should be ext3, not ext4.

Comment 9 Lars E. Pettersson 2014-11-30 16:20:03 UTC

334 days later, and still no action on this bug...

The same tests as before:

# time journalctl | grep xyz
...
real	51m57.719s
user	18m13.949s
sys	36m12.876s

# journalctl --disk-usage 
Journals take up 3.9G on disk.

# journalctl --until=-1m | head -1 
-- Logs begin at Sat 2014-06-07 15:52:06 CEST, end at Sun 2014-11-30 17:16:02 CET. --

# time grep -r --exclude-dir=journal xyz /var/log
...
real	1m50.120s
user	0m4.717s
sys	0m2.714s

# du -sh --exclude=journal /var/log
1.3G	/var/log

# rpm -qa systemd* | sort
systemd-208-28.fc20.x86_64
systemd-libs-208-28.fc20.x86_64
systemd-python-208-28.fc20.x86_64
systemd-ui-3-1.fc20.x86_64

# time systemctl status postfix.service 
postfix.service - Postfix Mail Transport Agent
   Loaded: loaded (/usr/lib/systemd/system/postfix.service; disabled)
   Active: inactive (dead)


real	0m8.445s
user	0m0.016s
sys	0m0.139s

Comment 10 Kevin Fenzi 2014-11-30 16:58:10 UTC

(In reply to Lars E. Pettersson from comment #9)
> 334 days later, and still no action on this bug...

Well, not entirely true, as one of the systemd maintainers asked what filesystem you were using. Apparently there was some big issues with btrfs and journal logs, so he was trying to rule that out (which we have since you are using ext3). 

Whats the output of 'journalctl --verify' ? Perhaps you have a lot of corrupt logs and it's taking a long time to construct the index. 

Does doing a 'journalctl --update-catalog' help any?

Also, if you have space, can you try copying all the journal files into another directory and operating on them with journalctl --root? This would perhaps show if there was some kind of fragmentation or fs issue that was fixed or helped by copying them.

Comment 11 Lars E. Pettersson 2014-11-30 20:11:46 UTC

(In reply to Kevin Fenzi from comment #10)
> (In reply to Lars E. Pettersson from comment #9)
> > 334 days later, and still no action on this bug...
> 
> Well, not entirely true, as one of the systemd maintainers asked what
> filesystem you were using.

Well, asking what file system I use and then silently ignoring my answer is not much of action, at least not in my book. I did not even know why the question was asked, seemed a bit off-topic to me...

So, thanks for providing information on why the question of file system was asked.

OK, ran 'journalctl --verify' and it found 103 OK files, and 3 that failed with the faults "Invalid tail monotonic timestamp", "Invalid tail monotonic timestamp", and "Data number mismatch". Compared to other systems, nothing particular, i.e. the journal seem to be OK and healthy.

Tested making a copy, as you asked me to do, and ran the time test on that.

# time journalctl --root /root/journal/journal | grep xyz
...
real	52m22.032s
user	18m2.488s
sys	36m55.334s

I.e. no improvement at all...

Tried the switch '--update-catalog' with the following result.

# journalctl --update-catalog
# time journalctl | grep xyz
...
real	52m36.395s
user	17m55.600s
sys	37m13.684s

Well, that did not improve things either.

Any other thing I could test?

Comment 12 Kevin Fenzi 2014-12-03 15:07:08 UTC

Nothing else is leaping to my mind. 

Currently journal compresses with xz which saves a lot of space, but is very slow... 

From irc: 

Dec 01 13:25:00 <zbyszek>       nirik: not specifically on this bug, but yes, we're working on improving journal speed.
Dec 01 13:25:18 <zbyszek>       I'm currently working on making lz4 compression the default for journal.
Dec 01 13:25:56 <zbyszek>       It has been optional for a few months, but the on-disk format was incompatible with the lz4 binary, so I didn't want to make it the default until that was fixed.
Dec 01 13:26:32 <zbyszek>       Lz4 upstream has added a "frame" api which provides compatibility, and we'll be switching over to it and making lz4 the default instead of xz.
Dec 01 13:27:06 <zbyszek>       This should provide a noticable speedup, since compression/decompression of large blocks was very costly.
Dec 01 13:27:29 <zbyszek>       XZ speed was also improved a lot through better setings in the meanwhile.
Dec 01 13:28:00 <zbyszek>       There have been other fixes too, but nothing which would allow us to say that "this bug is fixed".

Comment 13 Michal Schmidt 2014-12-09 21:19:08 UTC

I pushed a couple of commits upstream to optimize journal iteration:

http://cgit.freedesktop.org/systemd/systemd/commit/?id=b7c88ab8cc7d55a43450bf3dea750f95f2e910d6

http://cgit.freedesktop.org/systemd/systemd/commit/?id=f8b5a3b75fb55f0acb85c21424b3893c822742e9

I have a couple of more ideas for improvement, but the implementation's not working yet.

Comment 14 Michal Schmidt 2014-12-12 23:51:39 UTC

Here's another one:

http://cgit.freedesktop.org/systemd/systemd/commit/?id=69adae5168da231c6cf319f708860954701b25ed

Comment 15 Michal Schmidt 2014-12-18 14:22:10 UTC

(In reply to Michal Schmidt from comment #13)
> I pushed a couple of commits upstream to optimize journal iteration:
> 
> http://cgit.freedesktop.org/systemd/systemd/commit/
> ?id=b7c88ab8cc7d55a43450bf3dea750f95f2e910d6
> 
> http://cgit.freedesktop.org/systemd/systemd/commit/
> ?id=f8b5a3b75fb55f0acb85c21424b3893c822742e9

These are now superseded by the patchset around:
http://cgit.freedesktop.org/systemd/systemd/commit/?id=7943f42275025e1b6642b580b19b24dfab8dee61

Comment 16 Fedora End Of Life 2015-05-29 10:17:13 UTC

This message is a reminder that Fedora 20 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 20. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '20'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 20 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 17 Michal Schmidt 2015-05-29 10:38:24 UTC

journalctl performs better in F22.

Note You need to log in before you can comment on or make changes to this bug.