+++ This bug was initially created as a clone of Bug #1598502 +++ Description of problem: Server encountered an OOM related service outage because of many csh (tcsh) processes running which were consuming huge amounts of memory. Version-Release number of selected component (if applicable): RHEL7.5 and tcsh-6.18.01-15.el7.x86_64 How reproducible: Easily Steps to Reproduce: 1. Put linked corrupt history file into user home directory as .history 2. Start tcsh shell. 3. Watch tcsh memory usage rapidly grow. Actual results: Rapid memory usage by tcsh process until system eventually runs out of memory. Expected results: tcsh should not exhaust memory if history file is corrupt. Additional info: Further analysis revealed this to be due to csh tasks consuming CPU and memory processing a huge corrupted history file: # ll .history -rw-------. 1 testuser testuser 4237962198 Jun 4 12:37 .history Using this history file I was easily able to replicate the problem. As soon as the csh tries to start it reads the history file and never appears to complete, at least I never let it run to completion or server memory outage. In top(1) we can see it growing: $top ....... %Cpu15 : 64.0 us, 36.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st ...... PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 49287 testuser 20 0 10.5g 10.0g 1304 R 99.7 31.8 2:20.88 csh 1547 polkitd 20 0 547368 19864 5324 S 7.6 0.1 520:21.97 polkitd 1623 dbus 20 0 70148 4328 1944 R 2.0 0.0 154:26.93 dbus-daemon ....... PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 49287 testuser 20 0 24.3g 22.6g 996 R 99.3 72.2 8:58.82 csh 188 root 25 5 0 0 0 S 22.2 0.0 0:32.54 ksmd ......... An strace shows the csh reading the history and growing the heap: # more /tmp/csh.str 49287 read(6, "\321\313\177$\235\232\321\313\177 ( \235\232\321\313", 16) = 16 49287 read(6, "\177", 1) = 1 49287 read(6, "/", 1) = 1 49287 read(6, "E", 1) = 1 49287 read(6, "P", 1) = 1 49287 read(6, "_", 1) = 1 49287 read(6, "O", 1) = 1 49287 read(6, "U", 1) = 1 49287 read(6, "\232", 1) = 1 49287 read(6, "\321", 1) = 1 49287 read(6, "\313", 1) = 1 49287 read(6, "\177", 1) = 1 49287 read(6, "!", 1) = 1 49287 read(6, "\236", 1) = 1 49287 read(6, "\232", 1) = 1 49287 read(6, "\321", 1) = 1 49287 lseek(6, -15, SEEK_CUR) = 238628506 49287 brk(NULL) = 0x76b8f000 49287 brk(0x76c00000) = 0x76c00000 49287 read(6, "\177/EP_OU\232\321\313\177!\236\232\321\313\177!\236\232\321\313\177!\236\232\321\313\177!\236\232"..., 4096) = 409 6 49287 read(6, "\232\321\313INVAL_NUMVALS}!\340\22\375\1", 22) = 22 49287 read(6, "@", 1) = 1 49287 read(6, "\22", 1) = 1 49287 read(6, "\375", 1) = 1 49287 read(6, "\1", 1) = 1 ......... Building the latest version from upstream also demonstrated the problem (based upon 6.20.00): https://github.com/tcsh-org/tcsh Interrupting via gdb shows: (gdb) bt #0 0x00007f073e82f44c in __brk (addr=0x38ac8a000) at ../sysdeps/unix/sysv/linux/x86_64/brk.c:31 #1 0x00007f073e82f4e6 in __GI___sbrk (increment=-7122944) at sbrk.c:53 #2 0x00007f073e7c2ae9 in __GI___default_morecore (increment=<optimized out>) at morecore.c:48 #3 0x00007f073e7b962a in systrim (pad=<optimized out>, av=0x7f073eb00760 <main_arena>, av=0x7f073eb00760 <main_arena>) at malloc.c:2786 #4 0x00007f073e7bb2ea in _int_free (av=0x7f073eb00760 <main_arena>, p=0x38ac69230, have_lock=0) at malloc.c:4072 #5 0x0000000000418e0b in balloc (buf=buf@entry=445373) at sh.lex.c:1537 #6 0x000000000041993c in bgetc () at sh.lex.c:1678 #7 readc (wanteof=wanteof@entry=1) at sh.lex.c:1423 #8 0x0000000000419f4a in readc (wanteof=1) at sh.lex.c:498 #9 getC1 (flag=flag@entry=2) at sh.lex.c:466 #10 0x000000000041bd47 in word (parsehtime=0) at sh.lex.c:416 #11 lex (hp=hp@entry=0x672dd0 <paraml>) at sh.lex.c:180 #12 0x0000000000405887 in process (catch=catch@entry=0) at sh.c:2059 #13 0x000000000040633c in srcunit (unit=6, onlyown=onlyown@entry=0, hflg=hflg@entry=1, av=av@entry=0x69a460) at sh.c:1761 #14 0x000000000040661f in srcfile (f=f@entry=0x68da80 "/home/darrencl/.history", onlyown=onlyown@entry=0, flag=flag@entry=1, av=av@entry=0x69a460) at sh.c:1553 #15 0x0000000000406763 in dosource (t=0x69a460, t@entry=0x65bbc0 <loadhist_cmd.9185>, c=c@entry=0x0) at sh.c:2201 #16 0x0000000000417942 in loadhist (fname=<optimized out>, fname@entry=0x0, mflg=mflg@entry=0) at sh.hist.c:1341 #17 0x0000000000404bf2 in main (argc=<optimized out>, argv=0x7ffc81286a48) at sh.c:1362 (gdb) Looks like we spend a lot of time looping inside lex(). I don't see any obvious sanity checks in this code. The customer has worked around the problem by removing the corrupt history file and setting .cshrc variables (previously at their default): set history = 0 set histsave = 0 The problem looks very similar to: https://access.redhat.com/solutions/41146
FEDORA-2019-c9f4cc6025 has been submitted as an update to Fedora 31. https://bodhi.fedoraproject.org/updates/FEDORA-2019-c9f4cc6025
tcsh-6.21.00-3.fc31 has been pushed to the Fedora 31 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-c9f4cc6025
tcsh-6.21.00-3.fc31 has been pushed to the Fedora 31 stable repository. If problems still persist, please make note of it in this bug report.