+++ This bug was initially created as a clone of Bug #1598502 +++
Description of problem:
A server encountered an OOM-related service outage because many csh (tcsh) processes were running and consuming huge amounts of memory.
Version-Release number of selected component (if applicable):
RHEL7.5 and tcsh-6.18.01-15.el7.x86_64
How reproducible:
Easily
Steps to Reproduce:
1. Place the linked corrupt history file in the user's home directory as .history.
2. Start a tcsh shell.
3. Watch the tcsh process's memory usage grow rapidly.
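The corrupt ~4 GB history file from the report is not attached here, so a small stand-in can be generated from random bytes (an assumption on my part: a large binary .history appears to exercise the same parsing loop; the original file was genuinely corrupt data, not random):

```shell
# Generate a 1 MiB binary stand-in for the corrupt history file.
# WARNING: pointing tcsh at such a file may consume large amounts of
# memory; only test under a throwaway account or in a container.
head -c 1048576 /dev/urandom > /tmp/corrupt.history
ls -l /tmp/corrupt.history
# cp /tmp/corrupt.history ~/.history   # then start tcsh and watch top(1)
```

The final copy step is left commented out so the file can be inspected first.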
Actual results:
Memory usage of the tcsh process grows rapidly until the system eventually runs out of memory.
Expected results:
tcsh should not exhaust memory if the history file is corrupt.
Additional info:
Further analysis revealed this to be due to csh tasks consuming CPU and memory while processing a huge corrupted history file:
# ll .history
-rw-------. 1 testuser testuser 4237962198 Jun 4 12:37 .history
Using this history file I was easily able to replicate the problem. As soon as csh starts, it reads the history file and never appears to complete; at least, I never let it run to completion or to the point of a server memory outage. In top(1) we can see it growing:
$ top
.......
%Cpu15 : 64.0 us, 36.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
......
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
49287 testuser 20 0 10.5g 10.0g 1304 R 99.7 31.8 2:20.88 csh
1547 polkitd 20 0 547368 19864 5324 S 7.6 0.1 520:21.97 polkitd
1623 dbus 20 0 70148 4328 1944 R 2.0 0.0 154:26.93 dbus-daemon
.......
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
49287 testuser 20 0 24.3g 22.6g 996 R 99.3 72.2 8:58.82 csh
188 root 25 5 0 0 0 S 22.2 0.0 0:32.54 ksmd
.........
An strace shows csh reading the history file and growing the heap:
# more /tmp/csh.str
49287 read(6, "\321\313\177$\235\232\321\313\177 ( \235\232\321\313", 16) = 16
49287 read(6, "\177", 1) = 1
49287 read(6, "/", 1) = 1
49287 read(6, "E", 1) = 1
49287 read(6, "P", 1) = 1
49287 read(6, "_", 1) = 1
49287 read(6, "O", 1) = 1
49287 read(6, "U", 1) = 1
49287 read(6, "\232", 1) = 1
49287 read(6, "\321", 1) = 1
49287 read(6, "\313", 1) = 1
49287 read(6, "\177", 1) = 1
49287 read(6, "!", 1) = 1
49287 read(6, "\236", 1) = 1
49287 read(6, "\232", 1) = 1
49287 read(6, "\321", 1) = 1
49287 lseek(6, -15, SEEK_CUR) = 238628506
49287 brk(NULL) = 0x76b8f000
49287 brk(0x76c00000) = 0x76c00000
49287 read(6, "\177/EP_OU\232\321\313\177!\236\232\321\313\177!\236\232\321\313\177!\236\232\321\313\177!\236\232"..., 4096) = 4096
49287 read(6, "\232\321\313INVAL_NUMVALS}!\340\22\375\1", 22) = 22
49287 read(6, "@", 1) = 1
49287 read(6, "\22", 1) = 1
49287 read(6, "\375", 1) = 1
49287 read(6, "\1", 1) = 1
.........
Building the latest version from upstream (based on 6.20.00) also demonstrated the problem:
https://github.com/tcsh-org/tcsh
Interrupting via gdb shows:
(gdb) bt
#0 0x00007f073e82f44c in __brk (addr=0x38ac8a000) at ../sysdeps/unix/sysv/linux/x86_64/brk.c:31
#1 0x00007f073e82f4e6 in __GI___sbrk (increment=-7122944) at sbrk.c:53
#2 0x00007f073e7c2ae9 in __GI___default_morecore (increment=<optimized out>) at morecore.c:48
#3 0x00007f073e7b962a in systrim (pad=<optimized out>, av=0x7f073eb00760 <main_arena>, av=0x7f073eb00760 <main_arena>)
at malloc.c:2786
#4 0x00007f073e7bb2ea in _int_free (av=0x7f073eb00760 <main_arena>, p=0x38ac69230, have_lock=0) at malloc.c:4072
#5 0x0000000000418e0b in balloc (buf=buf@entry=445373) at sh.lex.c:1537
#6 0x000000000041993c in bgetc () at sh.lex.c:1678
#7 readc (wanteof=wanteof@entry=1) at sh.lex.c:1423
#8 0x0000000000419f4a in readc (wanteof=1) at sh.lex.c:498
#9 getC1 (flag=flag@entry=2) at sh.lex.c:466
#10 0x000000000041bd47 in word (parsehtime=0) at sh.lex.c:416
#11 lex (hp=hp@entry=0x672dd0 <paraml>) at sh.lex.c:180
#12 0x0000000000405887 in process (catch=catch@entry=0) at sh.c:2059
#13 0x000000000040633c in srcunit (unit=6, onlyown=onlyown@entry=0, hflg=hflg@entry=1, av=av@entry=0x69a460) at sh.c:1761
#14 0x000000000040661f in srcfile (f=f@entry=0x68da80 "/home/darrencl/.history", onlyown=onlyown@entry=0, flag=flag@entry=1,
av=av@entry=0x69a460) at sh.c:1553
#15 0x0000000000406763 in dosource (t=0x69a460, t@entry=0x65bbc0 <loadhist_cmd.9185>, c=c@entry=0x0) at sh.c:2201
#16 0x0000000000417942 in loadhist (fname=<optimized out>, fname@entry=0x0, mflg=mflg@entry=0) at sh.hist.c:1341
#17 0x0000000000404bf2 in main (argc=<optimized out>, argv=0x7ffc81286a48) at sh.c:1362
(gdb)
Looks like we spend a lot of time looping inside lex(). I don't see any obvious sanity checks in this code.
The customer has worked around the problem by removing the corrupt history file and setting the following .cshrc variables (previously at their defaults):
set history = 0
set savehist = 0
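A related defensive measure (my suggestion, not from the report) is a pre-flight size check on ~/.history before tcsh sources it, e.g. from a wrapper or login script; the 10 MB threshold is arbitrary, chosen only because the corrupt file here was ~4 GB:

```shell
# Warn if ~/.history is implausibly large before tcsh reads it.
histfile="$HOME/.history"
limit=$((10 * 1024 * 1024))
if [ -f "$histfile" ] && [ "$(wc -c < "$histfile")" -gt "$limit" ]; then
    echo "warning: $histfile is suspiciously large; not loading it" >&2
fi
```

This only detects the condition; actually preventing the load would require moving the file aside or setting the variables above.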
The problem looks very similar to:
https://access.redhat.com/solutions/41146
Comment 1, Fedora Update System, 2019-10-25 16:47:26 UTC