1765659 – tcsh corrupt history file results in rapid memory consumption until memory exhausted

Bug 1765659 - tcsh corrupt history file results in rapid memory consumption until memory exhausted

Summary: tcsh corrupt history file results in rapid memory consumption until memory ex...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	tcsh
Sub Component:
Version:	31
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Jan Macku
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2019-10-25 16:43 UTC by Jan Macku
Modified:	2019-11-04 00:54 UTC (History)
CC List:	18 users (show)
Fixed In Version:	tcsh-6.21.00-3.fc31
Clone Of:	1598502
Environment:
Last Closed:	2019-11-04 00:54:18 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Jan Macku 2019-10-25 16:43:52 UTC

+++ This bug was initially created as a clone of Bug #1598502 +++

Description of problem:
Server encountered an OOM related service outage because of many csh (tcsh) processes running which were consuming huge amounts of memory.

Version-Release number of selected component (if applicable):
RHEL7.5 and tcsh-6.18.01-15.el7.x86_64

How reproducible:
Easily

Steps to Reproduce:
1. Put linked corrupt history file into user home directory as .history
2. Start tcsh shell. 
3. Watch tcsh memory usage rapidly grow.

Actual results:
Rapid memory usage by tcsh process until system eventually runs out of memory. 

Expected results:
tcsh should not exhaust memory if history file is corrupt. 

Additional info:
Further analysis revealed this to be due to csh tasks consuming CPU and memory processing a huge corrupted history file:

# ll .history
-rw-------. 1 testuser testuser 4237962198 Jun  4 12:37 .history


Using this history file I was easily able to replicate the problem. As soon as the csh tries to start it reads the history file and never appears to complete, at least I never let it run to completion or server memory outage.  In top(1) we can see it growing:


$top
.......
 %Cpu15 : 64.0 us, 36.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
 
   
......
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                       
49287 testuser  20   0   10.5g  10.0g   1304 R  99.7 31.8   2:20.88 csh                                                           
 1547 polkitd   20   0  547368  19864   5324 S   7.6  0.1 520:21.97 polkitd                                                       
 1623 dbus      20   0   70148   4328   1944 R   2.0  0.0 154:26.93 dbus-daemon      
....... 
    
 
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                       
49287 testuser  20   0   24.3g  22.6g    996 R  99.3 72.2   8:58.82 csh                                                           
  188 root      25   5       0      0      0 S  22.2  0.0   0:32.54 ksmd         
......... 
 

An strace shows the csh reading the history and growing the heap:

# more /tmp/csh.str 
49287 read(6, "\321\313\177$\235\232\321\313\177 ( \235\232\321\313", 16) = 16
49287 read(6, "\177", 1)                = 1
49287 read(6, "/", 1)                   = 1
49287 read(6, "E", 1)                   = 1
49287 read(6, "P", 1)                   = 1
49287 read(6, "_", 1)                   = 1
49287 read(6, "O", 1)                   = 1
49287 read(6, "U", 1)                   = 1
49287 read(6, "\232", 1)                = 1
49287 read(6, "\321", 1)                = 1
49287 read(6, "\313", 1)                = 1
49287 read(6, "\177", 1)                = 1
49287 read(6, "!", 1)                   = 1
49287 read(6, "\236", 1)                = 1
49287 read(6, "\232", 1)                = 1
49287 read(6, "\321", 1)                = 1
49287 lseek(6, -15, SEEK_CUR)           = 238628506
49287 brk(NULL)                         = 0x76b8f000
49287 brk(0x76c00000)                   = 0x76c00000
49287 read(6, "\177/EP_OU\232\321\313\177!\236\232\321\313\177!\236\232\321\313\177!\236\232\321\313\177!\236\232"..., 4096) = 409
6
49287 read(6, "\232\321\313INVAL_NUMVALS}!\340\22\375\1", 22) = 22
49287 read(6, "@", 1)                   = 1
49287 read(6, "\22", 1)                 = 1
49287 read(6, "\375", 1)                = 1
49287 read(6, "\1", 1)                  = 1
.........


Building the latest version from upstream also demonstrated the problem (based upon 6.20.00):

https://github.com/tcsh-org/tcsh

Interrupting via gdb shows:


(gdb) bt
#0  0x00007f073e82f44c in __brk (addr=0x38ac8a000) at ../sysdeps/unix/sysv/linux/x86_64/brk.c:31
#1  0x00007f073e82f4e6 in __GI___sbrk (increment=-7122944) at sbrk.c:53
#2  0x00007f073e7c2ae9 in __GI___default_morecore (increment=<optimized out>) at morecore.c:48
#3  0x00007f073e7b962a in systrim (pad=<optimized out>, av=0x7f073eb00760 <main_arena>, av=0x7f073eb00760 <main_arena>)
    at malloc.c:2786
#4  0x00007f073e7bb2ea in _int_free (av=0x7f073eb00760 <main_arena>, p=0x38ac69230, have_lock=0) at malloc.c:4072
#5  0x0000000000418e0b in balloc (buf=buf@entry=445373) at sh.lex.c:1537
#6  0x000000000041993c in bgetc () at sh.lex.c:1678
#7  readc (wanteof=wanteof@entry=1) at sh.lex.c:1423
#8  0x0000000000419f4a in readc (wanteof=1) at sh.lex.c:498
#9  getC1 (flag=flag@entry=2) at sh.lex.c:466
#10 0x000000000041bd47 in word (parsehtime=0) at sh.lex.c:416
#11 lex (hp=hp@entry=0x672dd0 <paraml>) at sh.lex.c:180
#12 0x0000000000405887 in process (catch=catch@entry=0) at sh.c:2059
#13 0x000000000040633c in srcunit (unit=6, onlyown=onlyown@entry=0, hflg=hflg@entry=1, av=av@entry=0x69a460) at sh.c:1761
#14 0x000000000040661f in srcfile (f=f@entry=0x68da80 "/home/darrencl/.history", onlyown=onlyown@entry=0, flag=flag@entry=1, 
    av=av@entry=0x69a460) at sh.c:1553
#15 0x0000000000406763 in dosource (t=0x69a460, t@entry=0x65bbc0 <loadhist_cmd.9185>, c=c@entry=0x0) at sh.c:2201
#16 0x0000000000417942 in loadhist (fname=<optimized out>, fname@entry=0x0, mflg=mflg@entry=0) at sh.hist.c:1341
#17 0x0000000000404bf2 in main (argc=<optimized out>, argv=0x7ffc81286a48) at sh.c:1362
(gdb) 

Looks like we spend a lot of time looping inside lex().  I don't see any obvious sanity checks in this code.

The customer has worked around the problem by removing the corrupt history file and setting .cshrc variables (previously at their default):

set history = 0
set histsave = 0

The problem looks very similar to:
https://access.redhat.com/solutions/41146

Comment 1 Fedora Update System 2019-10-25 16:47:26 UTC

FEDORA-2019-c9f4cc6025 has been submitted as an update to Fedora 31. https://bodhi.fedoraproject.org/updates/FEDORA-2019-c9f4cc6025

Comment 2 Fedora Update System 2019-10-27 04:02:21 UTC

tcsh-6.21.00-3.fc31 has been pushed to the Fedora 31 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-c9f4cc6025

Comment 3 Fedora Update System 2019-11-04 00:54:18 UTC

tcsh-6.21.00-3.fc31 has been pushed to the Fedora 31 stable repository. If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.