+++ This bug was initially created as a clone of Bug #1598502 +++
Description of problem:
A server encountered an OOM-related service outage because many csh (tcsh) processes were running and consuming huge amounts of memory.
Version-Release number of selected component (if applicable):
RHEL7.5 and tcsh-6.18.01-15.el7.x86_64
How reproducible:
Easily
Steps to Reproduce:
1. Place the linked corrupt history file in the user's home directory as .history.
2. Start a tcsh shell.
3. Watch the tcsh process's memory usage grow rapidly.
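The corrupt ~4 GB history file from the report is not attached here, so a small stand-in can be generated from random bytes (an assumption on my part: a large binary .history appears to exercise the same parsing loop; the original file was genuinely corrupt data, not random):

```shell
# Generate a 1 MiB binary stand-in for the corrupt history file.
# WARNING: pointing tcsh at such a file may consume large amounts of
# memory; only test under a throwaway account or in a container.
head -c 1048576 /dev/urandom > /tmp/corrupt.history
ls -l /tmp/corrupt.history
# cp /tmp/corrupt.history ~/.history   # then start tcsh and watch top(1)
```

The final copy step is left commented out so the file can be inspected first.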
Actual results:
Memory usage of the tcsh process grows rapidly until the system eventually runs out of memory.
Expected results:
tcsh should not exhaust memory if the history file is corrupt.
Additional info:
Further analysis revealed this to be due to csh tasks consuming CPU and memory while processing a huge corrupted history file:
# ll .history
-rw-------. 1 testuser testuser 4237962198 Jun 4 12:37 .history
Using this history file I was easily able to replicate the problem. As soon as csh starts, it reads the history file and never appears to complete; at least, I never let it run to completion or to the point of a server memory outage. In top(1) we can see it growing:
$ top
.......
%Cpu15 : 64.0 us, 36.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
......
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
49287 testuser 20 0 10.5g 10.0g 1304 R 99.7 31.8 2:20.88 csh
1547 polkitd 20 0 547368 19864 5324 S 7.6 0.1 520:21.97 polkitd
1623 dbus 20 0 70148 4328 1944 R 2.0 0.0 154:26.93 dbus-daemon
.......
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
49287 testuser 20 0 24.3g 22.6g 996 R 99.3 72.2 8:58.82 csh
188 root 25 5 0 0 0 S 22.2 0.0 0:32.54 ksmd
.........
An strace shows csh reading the history file and growing the heap:
# more /tmp/csh.str
49287 read(6, "\321\313\177$\235\232\321\313\177 ( \235\232\321\313", 16) = 16
49287 read(6, "\177", 1) = 1
49287 read(6, "/", 1) = 1
49287 read(6, "E", 1) = 1
49287 read(6, "P", 1) = 1
49287 read(6, "_", 1) = 1
49287 read(6, "O", 1) = 1
49287 read(6, "U", 1) = 1
49287 read(6, "\232", 1) = 1
49287 read(6, "\321", 1) = 1
49287 read(6, "\313", 1) = 1
49287 read(6, "\177", 1) = 1
49287 read(6, "!", 1) = 1
49287 read(6, "\236", 1) = 1
49287 read(6, "\232", 1) = 1
49287 read(6, "\321", 1) = 1
49287 lseek(6, -15, SEEK_CUR) = 238628506
49287 brk(NULL) = 0x76b8f000
49287 brk(0x76c00000) = 0x76c00000
49287 read(6, "\177/EP_OU\232\321\313\177!\236\232\321\313\177!\236\232\321\313\177!\236\232\321\313\177!\236\232"..., 4096) = 4096
49287 read(6, "\232\321\313INVAL_NUMVALS}!\340\22\375\1", 22) = 22
49287 read(6, "@", 1) = 1
49287 read(6, "\22", 1) = 1
49287 read(6, "\375", 1) = 1
49287 read(6, "\1", 1) = 1
.........
Building the latest version from upstream (based on 6.20.00) also demonstrated the problem:
https://github.com/tcsh-org/tcsh
Interrupting via gdb shows:
(gdb) bt
#0 0x00007f073e82f44c in __brk (addr=0x38ac8a000) at ../sysdeps/unix/sysv/linux/x86_64/brk.c:31
#1 0x00007f073e82f4e6 in __GI___sbrk (increment=-7122944) at sbrk.c:53
#2 0x00007f073e7c2ae9 in __GI___default_morecore (increment=<optimized out>) at morecore.c:48
#3 0x00007f073e7b962a in systrim (pad=<optimized out>, av=0x7f073eb00760 <main_arena>, av=0x7f073eb00760 <main_arena>)
at malloc.c:2786
#4 0x00007f073e7bb2ea in _int_free (av=0x7f073eb00760 <main_arena>, p=0x38ac69230, have_lock=0) at malloc.c:4072
#5 0x0000000000418e0b in balloc (buf=buf@entry=445373) at sh.lex.c:1537
#6 0x000000000041993c in bgetc () at sh.lex.c:1678
#7 readc (wanteof=wanteof@entry=1) at sh.lex.c:1423
#8 0x0000000000419f4a in readc (wanteof=1) at sh.lex.c:498
#9 getC1 (flag=flag@entry=2) at sh.lex.c:466
#10 0x000000000041bd47 in word (parsehtime=0) at sh.lex.c:416
#11 lex (hp=hp@entry=0x672dd0 <paraml>) at sh.lex.c:180
#12 0x0000000000405887 in process (catch=catch@entry=0) at sh.c:2059
#13 0x000000000040633c in srcunit (unit=6, onlyown=onlyown@entry=0, hflg=hflg@entry=1, av=av@entry=0x69a460) at sh.c:1761
#14 0x000000000040661f in srcfile (f=f@entry=0x68da80 "/home/darrencl/.history", onlyown=onlyown@entry=0, flag=flag@entry=1,
av=av@entry=0x69a460) at sh.c:1553
#15 0x0000000000406763 in dosource (t=0x69a460, t@entry=0x65bbc0 <loadhist_cmd.9185>, c=c@entry=0x0) at sh.c:2201
#16 0x0000000000417942 in loadhist (fname=<optimized out>, fname@entry=0x0, mflg=mflg@entry=0) at sh.hist.c:1341
#17 0x0000000000404bf2 in main (argc=<optimized out>, argv=0x7ffc81286a48) at sh.c:1362
(gdb)
Looks like we spend a lot of time looping inside lex(). I don't see any obvious sanity checks in this code.
The customer has worked around the problem by removing the corrupt history file and setting the following .cshrc variables (previously at their defaults):
set history = 0
set savehist = 0
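A related defensive measure (my suggestion, not from the report) is a pre-flight size check on ~/.history before tcsh sources it, e.g. from a wrapper or login script; the 10 MB threshold is arbitrary, chosen only because the corrupt file here was ~4 GB:

```shell
# Warn if ~/.history is implausibly large before tcsh reads it.
histfile="$HOME/.history"
limit=$((10 * 1024 * 1024))
if [ -f "$histfile" ] && [ "$(wc -c < "$histfile")" -gt "$limit" ]; then
    echo "warning: $histfile is suspiciously large; not loading it" >&2
fi
```

This only detects the condition; actually preventing the load would require moving the file aside or setting the variables above.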
The problem looks very similar to:
https://access.redhat.com/solutions/41146
Comment 1, Fedora Update System, 2019-10-25 16:47:26 UTC