1598502 – tcsh corrupt history file results in rapid memory consumption until memory exhausted

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1598502 - tcsh corrupt history file results in rapid memory consumption until memory exhausted

Summary: tcsh corrupt history file results in rapid memory consumption until memory ex...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	tcsh
Sub Component:
Version:	7.5
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Jan Macku
QA Contact:	Karel Volný
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1643104 1765649 1818766 1846271
TreeView+	depends on / blocked

Reported:	2018-07-05 15:58 UTC by afox@redhat.com
Modified:	2023-10-06 17:50 UTC (History)
CC List:	14 users (show)
Fixed In Version:	tcsh-6.18.01-16.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1765649 1765659 1846271 (view as bug list)
Environment:
Last Closed:	2020-03-31 20:09:46 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:1177	0	None	None	None	2020-03-31 20:09:53 UTC

Description afox@redhat.com 2018-07-05 15:58:18 UTC

Description of problem:
Server encountered an OOM related service outage because of many csh (tcsh) processes running which were consuming huge amounts of memory.

Version-Release number of selected component (if applicable):
RHEL7.5 and tcsh-6.18.01-15.el7.x86_64

How reproducible:
Easily

Steps to Reproduce:
1. Put linked corrupt history file into user home directory as .history
2. Start tcsh shell. 
3. Watch tcsh memory usage rapidly grow.

Actual results:
Rapid memory usage by tcsh process until system eventually runs out of memory. 

Expected results:
tcsh should not exhaust memory if history file is corrupt. 

Additional info:
Further analysis revealed this to be due to csh tasks consuming CPU and memory processing a huge corrupted history file:

# ll .history
-rw-------. 1 testuser testuser 4237962198 Jun  4 12:37 .history


Using this history file I was easily able to replicate the problem. As soon as the csh tries to start it reads the history file and never appears to complete, at least I never let it run to completion or server memory outage.  In top(1) we can see it growing:


$top
.......
 %Cpu15 : 64.0 us, 36.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
 
   
......
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                       
49287 testuser  20   0   10.5g  10.0g   1304 R  99.7 31.8   2:20.88 csh                                                           
 1547 polkitd   20   0  547368  19864   5324 S   7.6  0.1 520:21.97 polkitd                                                       
 1623 dbus      20   0   70148   4328   1944 R   2.0  0.0 154:26.93 dbus-daemon      
....... 
    
 
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                       
49287 testuser  20   0   24.3g  22.6g    996 R  99.3 72.2   8:58.82 csh                                                           
  188 root      25   5       0      0      0 S  22.2  0.0   0:32.54 ksmd         
......... 
 

An strace shows the csh reading the history and growing the heap:

# more /tmp/csh.str 
49287 read(6, "\321\313\177$\235\232\321\313\177 ( \235\232\321\313", 16) = 16
49287 read(6, "\177", 1)                = 1
49287 read(6, "/", 1)                   = 1
49287 read(6, "E", 1)                   = 1
49287 read(6, "P", 1)                   = 1
49287 read(6, "_", 1)                   = 1
49287 read(6, "O", 1)                   = 1
49287 read(6, "U", 1)                   = 1
49287 read(6, "\232", 1)                = 1
49287 read(6, "\321", 1)                = 1
49287 read(6, "\313", 1)                = 1
49287 read(6, "\177", 1)                = 1
49287 read(6, "!", 1)                   = 1
49287 read(6, "\236", 1)                = 1
49287 read(6, "\232", 1)                = 1
49287 read(6, "\321", 1)                = 1
49287 lseek(6, -15, SEEK_CUR)           = 238628506
49287 brk(NULL)                         = 0x76b8f000
49287 brk(0x76c00000)                   = 0x76c00000
49287 read(6, "\177/EP_OU\232\321\313\177!\236\232\321\313\177!\236\232\321\313\177!\236\232\321\313\177!\236\232"..., 4096) = 409
6
49287 read(6, "\232\321\313INVAL_NUMVALS}!\340\22\375\1", 22) = 22
49287 read(6, "@", 1)                   = 1
49287 read(6, "\22", 1)                 = 1
49287 read(6, "\375", 1)                = 1
49287 read(6, "\1", 1)                  = 1
.........


Building the latest version from upstream also demonstrated the problem (based upon 6.20.00):

https://github.com/tcsh-org/tcsh

Interrupting via gdb shows:


(gdb) bt
#0  0x00007f073e82f44c in __brk (addr=0x38ac8a000) at ../sysdeps/unix/sysv/linux/x86_64/brk.c:31
#1  0x00007f073e82f4e6 in __GI___sbrk (increment=-7122944) at sbrk.c:53
#2  0x00007f073e7c2ae9 in __GI___default_morecore (increment=<optimized out>) at morecore.c:48
#3  0x00007f073e7b962a in systrim (pad=<optimized out>, av=0x7f073eb00760 <main_arena>, av=0x7f073eb00760 <main_arena>)
    at malloc.c:2786
#4  0x00007f073e7bb2ea in _int_free (av=0x7f073eb00760 <main_arena>, p=0x38ac69230, have_lock=0) at malloc.c:4072
#5  0x0000000000418e0b in balloc (buf=buf@entry=445373) at sh.lex.c:1537
#6  0x000000000041993c in bgetc () at sh.lex.c:1678
#7  readc (wanteof=wanteof@entry=1) at sh.lex.c:1423
#8  0x0000000000419f4a in readc (wanteof=1) at sh.lex.c:498
#9  getC1 (flag=flag@entry=2) at sh.lex.c:466
#10 0x000000000041bd47 in word (parsehtime=0) at sh.lex.c:416
#11 lex (hp=hp@entry=0x672dd0 <paraml>) at sh.lex.c:180
#12 0x0000000000405887 in process (catch=catch@entry=0) at sh.c:2059
#13 0x000000000040633c in srcunit (unit=6, onlyown=onlyown@entry=0, hflg=hflg@entry=1, av=av@entry=0x69a460) at sh.c:1761
#14 0x000000000040661f in srcfile (f=f@entry=0x68da80 "/home/darrencl/.history", onlyown=onlyown@entry=0, flag=flag@entry=1, 
    av=av@entry=0x69a460) at sh.c:1553
#15 0x0000000000406763 in dosource (t=0x69a460, t@entry=0x65bbc0 <loadhist_cmd.9185>, c=c@entry=0x0) at sh.c:2201
#16 0x0000000000417942 in loadhist (fname=<optimized out>, fname@entry=0x0, mflg=mflg@entry=0) at sh.hist.c:1341
#17 0x0000000000404bf2 in main (argc=<optimized out>, argv=0x7ffc81286a48) at sh.c:1362
(gdb) 

Looks like we spend a lot of time looping inside lex().  I don't see any obvious sanity checks in this code.

The customer has worked around the problem by removing the corrupt history file and setting .cshrc variables (previously at their default):

set history = 0
set histsave = 0

The problem looks very similar to:
https://access.redhat.com/solutions/41146

Comment 4 Siteshwar Vashisht 2018-07-11 13:10:27 UTC

David,

dkaspar is currently on PTO, so I had a look at this bug. I will try to summarize it here:

This is the backtrace when tcsh exits when it can not allocate memory:

#0  out_of_memory () at tc.alloc.c:69
#1  0x000000000044c2ad in smalloc (n=578604348) at tc.alloc.c:530
#2  0x00000000004553bf in s_strsave (
    s=0x7fff46e4c010 L"VA\xf000009d\xf000009a\xf000009d\xf000009a\xf00000d1\xf00000cb\021\xf00000fd\001`\023\xf00000fd\001 ; 1\004\001\001\001\001\001\001\001\001\xf00000c0\021\xf00000fd\001!\xf00000ff\a < \xf000009e\xf000009a\xf00000d1\xf00000cb\177!\xf000009e\xf000009a\xf00000d1\xf00000cb\177!\xf000009e\xf000009a\xf00000d1\xf00000cb\177!\xf000009e\xf000009a\xf00000d1\xf00000cb\177\022\xf00000fd\001\xf00000e0\021\xf00000fdist$HOME/stop_start/lis\xf00000c1\xf000009e\xf000009a\xf00000d1\xf00000cb\177կ\xf000009a\xf00000d1\xf00000cb\020U\xf000009e\xf000009a\xf00000d1\xf00000cbop_start/listener \177\xf00000ff\xf00000ff\xf00000ff\xf00000ff\002 \xf000009e\xf000009a\xf00000d1\003!\"${VALUE_GREP_OUT}\"!\020\021\xf00000fd\xf000009e\xf000009a\xf00000d1\xf00000cb\177կ\xf000009a\xf00000d1\xf00000cb\177\xf00000ff\xf00000ff\xf00000ff\xf00000ff\006\xf0000083\xf00000b3\xf000009a\xf00000d1\xf00000cb\177!\xf000009e\xf000009a\xf00000d1\xf00000cb\177!\xf000009e\xf000009a\xf00000d1E/stop_sta"...) at tc.str.c:458
#3  0x000000000041c231 in enthist (event=7, lp=0x682ed0 <paraml>, docopy=1, mflg=0, hlen=100) at sh.hist.c:985
#4  0x000000000041b1ce in savehist (sp=0x682ed0 <paraml>, mflg=0) at sh.hist.c:150
#5  0x0000000000406f00 in process (catch=0) at sh.c:2091
#6  0x0000000000406544 in srcunit (unit=6, onlyown=0, hflg=1, av=0x6b6eb0) at sh.c:1761
#7  0x0000000000405c98 in srcfile (f=0x6bb7f0 "/home/situ/.history", onlyown=0, flag=1, av=0x6b6eb0) at sh.c:1553
#8  0x0000000000407330 in dosource (t=0x6b6eb0, c=0x0) at sh.c:2201
#9  0x000000000041cdbf in loadhist (fname=0x670b50 <STRNULL> L"", mflg=0) at sh.hist.c:1341
#10 0x0000000000405683 in main (argc=0, argv=0x7fffffffdee8) at sh.c:1362

This is the block of code in tc.str.c which is trying to allocate memory to copy a string from history:

(gdb) l
453         size_t size;
454
455         if (s == NULL)
456             s = STRNULL;
457         size = (Strlen(s) + 1) * sizeof(*n);
458         n = xmalloc(size);
459         memcpy(n, s, size);
460         return (n);
461     }
462

String that is being saved is arbitrarily long:

(gdb) p wcslen(s)
$1 = 144651086

So this line:
457         size = (Strlen(s) + 1) * sizeof(*n);

evaluates to:

(gdb) p size
$2 = 578604348

On my system, malloc fails while trying to allocate a chunk of this size. I think tcsh should put some limiations on length of string that is being saved. I would let dkaspar follow up on it when he is back from holidays.

Comment 25 Lukáš Nykrýn 2019-05-23 11:34:02 UTC

Right now we don't a proper fix there, postponing to 7.8.

Comment 36 errata-xmlrpc 2020-03-31 20:09:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1177

Comment 37 Klaas Demter 2020-07-01 12:11:15 UTC

Hi,
this update lead to problems within one of our important production systems. [For the Red Hat people: Customer Case #02690226 with the corrupt .history file]

We had a corrupt .history file (it included a line with thousands of characters). It is unclear what caused that corrupt file, but in any case -- tcsh should not write that corrupt file.

Now with the corrupt file starting a shell will show a very unspecific error message: "$< line too long." and under certain conditions (which are not clear to us yet) will even lead to the shell not exiting properly. This in turn lead to the cron scripts that use tcsh to spawn process and not exit them anymore which lead to a monitoring script causing a process exhaustion for the application user.


So there are three issues that need addressing:
1) do not write a corrupt history file
2) deal with a corrupt history file properly and not create hung processes
3) make the error message more clear, my first instinct was to search all files like .cshrc for a command that is too long :)


Greetings
Klaas

Comment 38 Jan Macku 2020-07-03 10:21:32 UTC

Hi Klaas,
I'm sorry for inconvenience. I'm working on fix. Please follow clone of this bug at rhbz #1846271 if possible.

Thank you
Greetings
Jan

Note You need to log in before you can comment on or make changes to this bug.