Bug 1598502
| Summary: | tcsh corrupt history file results in rapid memory consumption until memory exhausted | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | afox <afox> | |
| Component: | tcsh | Assignee: | Jan Macku <jamacku> | |
| Status: | CLOSED ERRATA | QA Contact: | Karel Volný <kvolny> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | medium | |||
| Version: | 7.5 | CC: | afox, darren.lavender, dkaylor, fkrska, fsumsal, jamacku, jhuo, kdudka, klaas, kvolny, lnykryn, msekleta, pdwyer, svashisht | |
| Target Milestone: | rc | |||
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | tcsh-6.18.01-16.el7 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1765649 1765659 1846271 (view as bug list) | Environment: | ||
| Last Closed: | 2020-03-31 20:09:46 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1643104, 1765649, 1818766, 1846271 | |||
David,
dkaspar is currently on PTO, so I had a look at this bug. I will try to summarize it here:
This is the backtrace when tcsh exits when it can not allocate memory:
#0 out_of_memory () at tc.alloc.c:69
#1 0x000000000044c2ad in smalloc (n=578604348) at tc.alloc.c:530
#2 0x00000000004553bf in s_strsave (
s=0x7fff46e4c010 L"VA\xf000009d\xf000009a\xf000009d\xf000009a\xf00000d1\xf00000cb\021\xf00000fd\001`\023\xf00000fd\001 ; 1\004\001\001\001\001\001\001\001\001\xf00000c0\021\xf00000fd\001!\xf00000ff\a < \xf000009e\xf000009a\xf00000d1\xf00000cb\177!\xf000009e\xf000009a\xf00000d1\xf00000cb\177!\xf000009e\xf000009a\xf00000d1\xf00000cb\177!\xf000009e\xf000009a\xf00000d1\xf00000cb\177\022\xf00000fd\001\xf00000e0\021\xf00000fdist$HOME/stop_start/lis\xf00000c1\xf000009e\xf000009a\xf00000d1\xf00000cb\177կ\xf000009a\xf00000d1\xf00000cb\020U\xf000009e\xf000009a\xf00000d1\xf00000cbop_start/listener \177\xf00000ff\xf00000ff\xf00000ff\xf00000ff\002 \xf000009e\xf000009a\xf00000d1\003!\"${VALUE_GREP_OUT}\"!\020\021\xf00000fd\xf000009e\xf000009a\xf00000d1\xf00000cb\177կ\xf000009a\xf00000d1\xf00000cb\177\xf00000ff\xf00000ff\xf00000ff\xf00000ff\006\xf0000083\xf00000b3\xf000009a\xf00000d1\xf00000cb\177!\xf000009e\xf000009a\xf00000d1\xf00000cb\177!\xf000009e\xf000009a\xf00000d1E/stop_sta"...) at tc.str.c:458
#3 0x000000000041c231 in enthist (event=7, lp=0x682ed0 <paraml>, docopy=1, mflg=0, hlen=100) at sh.hist.c:985
#4 0x000000000041b1ce in savehist (sp=0x682ed0 <paraml>, mflg=0) at sh.hist.c:150
#5 0x0000000000406f00 in process (catch=0) at sh.c:2091
#6 0x0000000000406544 in srcunit (unit=6, onlyown=0, hflg=1, av=0x6b6eb0) at sh.c:1761
#7 0x0000000000405c98 in srcfile (f=0x6bb7f0 "/home/situ/.history", onlyown=0, flag=1, av=0x6b6eb0) at sh.c:1553
#8 0x0000000000407330 in dosource (t=0x6b6eb0, c=0x0) at sh.c:2201
#9 0x000000000041cdbf in loadhist (fname=0x670b50 <STRNULL> L"", mflg=0) at sh.hist.c:1341
#10 0x0000000000405683 in main (argc=0, argv=0x7fffffffdee8) at sh.c:1362
This is the block of code in tc.str.c which is trying to allocate memory to copy a string from history:
(gdb) l
453 size_t size;
454
455 if (s == NULL)
456 s = STRNULL;
457 size = (Strlen(s) + 1) * sizeof(*n);
458 n = xmalloc(size);
459 memcpy(n, s, size);
460 return (n);
461 }
462
String that is being saved is arbitrarily long:
(gdb) p wcslen(s)
$1 = 144651086
So this line:
457 size = (Strlen(s) + 1) * sizeof(*n);
evaluates to:
(gdb) p size
$2 = 578604348
On my system, malloc fails while trying to allocate a chunk of this size. I think tcsh should put some limiations on length of string that is being saved. I would let dkaspar follow up on it when he is back from holidays.
Right now we don't a proper fix there, postponing to 7.8. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1177 Hi, this update lead to problems within one of our important production systems. [For the Red Hat people: Customer Case #02690226 with the corrupt .history file] We had a corrupt .history file (it included a line with thousands of characters). It is unclear what caused that corrupt file, but in any case -- tcsh should not write that corrupt file. Now with the corrupt file starting a shell will show a very unspecific error message: "$< line too long." and under certain conditions (which are not clear to us yet) will even lead to the shell not exiting properly. This in turn lead to the cron scripts that use tcsh to spawn process and not exit them anymore which lead to a monitoring script causing a process exhaustion for the application user. So there are three issues that need addressing: 1) do not write a corrupt history file 2) deal with a corrupt history file properly and not create hung processes 3) make the error message more clear, my first instinct was to search all files like .cshrc for a command that is too long :) Greetings Klaas Hi Klaas, I'm sorry for inconvenience. I'm working on fix. Please follow clone of this bug at rhbz #1846271 if possible. Thank you Greetings Jan |
Description of problem: Server encountered an OOM related service outage because of many csh (tcsh) processes running which were consuming huge amounts of memory. Version-Release number of selected component (if applicable): RHEL7.5 and tcsh-6.18.01-15.el7.x86_64 How reproducible: Easily Steps to Reproduce: 1. Put linked corrupt history file into user home directory as .history 2. Start tcsh shell. 3. Watch tcsh memory usage rapidly grow. Actual results: Rapid memory usage by tcsh process until system eventually runs out of memory. Expected results: tcsh should not exhaust memory if history file is corrupt. Additional info: Further analysis revealed this to be due to csh tasks consuming CPU and memory processing a huge corrupted history file: # ll .history -rw-------. 1 testuser testuser 4237962198 Jun 4 12:37 .history Using this history file I was easily able to replicate the problem. As soon as the csh tries to start it reads the history file and never appears to complete, at least I never let it run to completion or server memory outage. In top(1) we can see it growing: $top ....... %Cpu15 : 64.0 us, 36.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st ...... PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 49287 testuser 20 0 10.5g 10.0g 1304 R 99.7 31.8 2:20.88 csh 1547 polkitd 20 0 547368 19864 5324 S 7.6 0.1 520:21.97 polkitd 1623 dbus 20 0 70148 4328 1944 R 2.0 0.0 154:26.93 dbus-daemon ....... PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 49287 testuser 20 0 24.3g 22.6g 996 R 99.3 72.2 8:58.82 csh 188 root 25 5 0 0 0 S 22.2 0.0 0:32.54 ksmd ......... An strace shows the csh reading the history and growing the heap: # more /tmp/csh.str 49287 read(6, "\321\313\177$\235\232\321\313\177 ( \235\232\321\313", 16) = 16 49287 read(6, "\177", 1) = 1 49287 read(6, "/", 1) = 1 49287 read(6, "E", 1) = 1 49287 read(6, "P", 1) = 1 49287 read(6, "_", 1) = 1 49287 read(6, "O", 1) = 1 49287 read(6, "U", 1) = 1 49287 read(6, "\232", 1) = 1 49287 read(6, "\321", 1) = 1 49287 read(6, "\313", 1) = 1 49287 read(6, "\177", 1) = 1 49287 read(6, "!", 1) = 1 49287 read(6, "\236", 1) = 1 49287 read(6, "\232", 1) = 1 49287 read(6, "\321", 1) = 1 49287 lseek(6, -15, SEEK_CUR) = 238628506 49287 brk(NULL) = 0x76b8f000 49287 brk(0x76c00000) = 0x76c00000 49287 read(6, "\177/EP_OU\232\321\313\177!\236\232\321\313\177!\236\232\321\313\177!\236\232\321\313\177!\236\232"..., 4096) = 409 6 49287 read(6, "\232\321\313INVAL_NUMVALS}!\340\22\375\1", 22) = 22 49287 read(6, "@", 1) = 1 49287 read(6, "\22", 1) = 1 49287 read(6, "\375", 1) = 1 49287 read(6, "\1", 1) = 1 ......... Building the latest version from upstream also demonstrated the problem (based upon 6.20.00): https://github.com/tcsh-org/tcsh Interrupting via gdb shows: (gdb) bt #0 0x00007f073e82f44c in __brk (addr=0x38ac8a000) at ../sysdeps/unix/sysv/linux/x86_64/brk.c:31 #1 0x00007f073e82f4e6 in __GI___sbrk (increment=-7122944) at sbrk.c:53 #2 0x00007f073e7c2ae9 in __GI___default_morecore (increment=<optimized out>) at morecore.c:48 #3 0x00007f073e7b962a in systrim (pad=<optimized out>, av=0x7f073eb00760 <main_arena>, av=0x7f073eb00760 <main_arena>) at malloc.c:2786 #4 0x00007f073e7bb2ea in _int_free (av=0x7f073eb00760 <main_arena>, p=0x38ac69230, have_lock=0) at malloc.c:4072 #5 0x0000000000418e0b in balloc (buf=buf@entry=445373) at sh.lex.c:1537 #6 0x000000000041993c in bgetc () at sh.lex.c:1678 #7 readc (wanteof=wanteof@entry=1) at sh.lex.c:1423 #8 0x0000000000419f4a in readc (wanteof=1) at sh.lex.c:498 #9 getC1 (flag=flag@entry=2) at sh.lex.c:466 #10 0x000000000041bd47 in word (parsehtime=0) at sh.lex.c:416 #11 lex (hp=hp@entry=0x672dd0 <paraml>) at sh.lex.c:180 #12 0x0000000000405887 in process (catch=catch@entry=0) at sh.c:2059 #13 0x000000000040633c in srcunit (unit=6, onlyown=onlyown@entry=0, hflg=hflg@entry=1, av=av@entry=0x69a460) at sh.c:1761 #14 0x000000000040661f in srcfile (f=f@entry=0x68da80 "/home/darrencl/.history", onlyown=onlyown@entry=0, flag=flag@entry=1, av=av@entry=0x69a460) at sh.c:1553 #15 0x0000000000406763 in dosource (t=0x69a460, t@entry=0x65bbc0 <loadhist_cmd.9185>, c=c@entry=0x0) at sh.c:2201 #16 0x0000000000417942 in loadhist (fname=<optimized out>, fname@entry=0x0, mflg=mflg@entry=0) at sh.hist.c:1341 #17 0x0000000000404bf2 in main (argc=<optimized out>, argv=0x7ffc81286a48) at sh.c:1362 (gdb) Looks like we spend a lot of time looping inside lex(). I don't see any obvious sanity checks in this code. The customer has worked around the problem by removing the corrupt history file and setting .cshrc variables (previously at their default): set history = 0 set histsave = 0 The problem looks very similar to: https://access.redhat.com/solutions/41146