Bug 1846271
Summary: | [RHEL-7.9] tcsh corrupt history file results in rapid memory consumption until memory exhausted [rhel-7.9.z] | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Jan Macku <jamacku> | ||||
Component: | tcsh | Assignee: | Jan Macku <jamacku> | ||||
Status: | CLOSED ERRATA | QA Contact: | Karel Volný <kvolny> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | urgent | ||||||
Version: | 7.5 | CC: | darren.lavender, dkaylor, fkrska, fsumsal, jamacku, jhuo, jreznik, kdudka, klaas, kvolny, lnykryn, msekleta, pandrade, pdwyer, pvlasin, svashisht | ||||
Target Milestone: | rc | Keywords: | Regression, ZStream | ||||
Target Release: | --- | ||||||
Hardware: | x86_64 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | tcsh-6.18.01-17.el7_9.1 | Doc Type: | If docs needed, set a value | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | 1598502 | Environment: | |||||
Last Closed: | 2020-11-10 12:55:15 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | 1598502 | ||||||
Bug Blocks: | 1643104, 1765649, 1818766 | ||||||
Attachments: |
|
Comment 10
Lukáš Nykrýn
2020-07-02 13:24:29 UTC
Thank you to Klass (#1598502) who pointed out three issues that we have to address: > So there are three issues that need addressing: > 1) do not write a corrupt history file > 2) deal with a corrupt history file properly and not create hung processes > 3) make the error message more clear, my first instinct was to search all files like .cshrc for a command that is too long :) We need to backport this patch from upstream to make warning more clear. https://github.com/tcsh-org/tcsh/commit/4723e1754cd28657183358334fe77fe328f01bd0 I'm currently debugging the issue, once I'll have some news I'll update this bz. (In reply to Jan Macku from comment #12) > Thank you to Klass (#1598502) who pointed out three issues that we have to > address: > > > So there are three issues that need addressing: > > 1) do not write a corrupt history file > > 2) deal with a corrupt history file properly and not create hung processes > > 3) make the error message more clear, my first instinct was to search all files like .cshrc for a command that is too long :) > > We need to backport this patch from upstream to make warning more clear. > > https://github.com/tcsh-org/tcsh/commit/ > 4723e1754cd28657183358334fe77fe328f01bd0 > > I'm currently debugging the issue, once I'll have some news I'll update this > bz. Support pointed out that file locking seems to be disabled by default on RHEL via /etc/csh.cshrc from setup-2.8.71-11.el7.noarch Default is: set savehist = (1024 merge) support said it should be: set savehist = (1024 merge lock) This could be one cause for 1) Greetings Klaas That is also not the default in current fedora https://pagure.io/setup/blob/master/f/csh.cshrc#_34 or even upstream (https://github.com/tcsh-org/tcsh/blob/1b652a48872d1d29b506445336599222569fb80a/dot.tcshrc#L35) are there any downsides to setting this as a default? For now our own workaround is to use: "set histfile=" Hi Klaas, Please accept my apology for misspelling your name. > are there any downsides to setting this as a default? > Default is: > set savehist = (1024 merge) > support said it should be: > set savehist = (1024 merge lock) What I've been able to find out with help of my colleague is that "locking" of history should be default on systems without home directory via nfs. https://bugzilla.redhat.com/show_bug.cgi?id=885901#c95 So it probably should not make any difference in behaviour. I'll check if lock works but probably on tuesday since monday is public holiday in CZ. Greetings Jan (In reply to Jan Macku from comment #15) > Hi Klaas, > Please accept my apology for misspelling your name. No worries > > > are there any downsides to setting this as a default? > > > Default is: > > set savehist = (1024 merge) > > support said it should be: > > set savehist = (1024 merge lock) > > What I've been able to find out with help of my colleague is that "locking" > of history should be default on systems without home directory via nfs. > > https://bugzilla.redhat.com/show_bug.cgi?id=885901#c95 I can't see that comment, it's marked private -- but I do have the home on nfs (via autofs), maybe that's the problem. > > So it probably should not make any difference in behaviour. I'll check if > lock works but probably on tuesday since monday is public holiday in CZ. > > Greetings > Jan Bug #85901 is related to another lock patch, written by Red Hat, that used the flock syscall. From my understanding, it would work on Linux to Linux nfs mounts. Newer tcsh has an upstream patch that does not use flock, but a trick with hard links. Regardless of using the lock, at least for recent tcsh (rhel 7 or newer), it might block on a syscall if reading/writing the history file, if the mount point becomes unavailable. Hi Klaas, I used strace to check whether default lock works. %< ----------------- open("/root/.history", O_RDWR|O_CREAT, 0600) = 3 dup(3) = 4 dup(4) = 5 dup(5) = 6 close(5) = 0 close(4) = 0 close(3) = 0 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 fcntl(6, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0 %< ----------------- Function "fcntl(..., F_SETLKW, ...)" suggest that it should work on classic fs. I'll try to setup nfs via autofs and check how it works there. I'll come back to you. Greetings Jan Hi Klaas, I set up nfs and by default lock mechanism isn't enabled. After I set "savehist = (1024 merge lock)" then strace showed that history has been lock. So I would suggest you to set it, because this would prevent future corruption of history file. I checked how tcsh behave in fedora where is latest upstream version and it looks like the issue is fixed in upstream. Now I'll try to backport corresponding patches from upstream and further test if it fixes this issue. I'll let you know about updates. Greetings Jan Hi Jan, 1) Corrupt History file: so the corrupt history part of the bug is due to file locking not being enabled by default on nfs. Current upstream has default locking on nfs enabled, the nfs differentiation is just in the RHEL flock patch? 2) Ignore broken history (or at least not fail catastrophically): Are you able to replicate the shells hanging on exit when working with the the corrupt history? I could replicate this outside of the production system. In any case I think it would be safe to say the history should not break a shell that badly in any situation. 3) Better error message: I guess that's the easiest part, from #12 I see you already located a backportable change from upstream to make this better. Greetings Klaas (In reply to Klaas Demter from comment #22) > 1) Corrupt History file: > so the corrupt history part of the bug is due to file locking not being > enabled by default on nfs. Current upstream has default locking on nfs > enabled, the nfs differentiation is just in the RHEL flock patch? I just wanted to check if lock works. And about patch I think so, but I'm not 100% sure, I'm quite new to tcsh maintenance. > 2) Ignore broken history (or at least not fail catastrophically): > Are you able to replicate the shells hanging on exit when working with the > the corrupt history? I could replicate this outside of the production > system. In any case I think it would be safe to say the history should not > break a shell that badly in any situation. I strongly agree. I already located possible patches to backport. I have two corrupted history files and with both of them I'm able to reproduce this issue. But if you are able to provide yours I'll be glad use it for testing as well. Possible patches: https://github.com/tcsh-org/tcsh/commit/1ef6e9d0eef9278829f863fff55755b7a5415b81 https://github.com/tcsh-org/tcsh/commit/83c5be028419b3f27c3cc707b88fb21bfa4e1b11 https://github.com/tcsh-org/tcsh/commit/67db04d8d8ada0aa2fb7dfdf83f3a408ad5a01dc https://github.com/tcsh-org/tcsh/commit/6bc746caad28d0f1ffaf07a4ea1468e8efcc6d52 > 3) Better error message: > I guess that's the easiest part, from #12 I see you already located a > backportable change from upstream to make this better. Yes, There is upstream patch which make warning more clear: https://github.com/tcsh-org/tcsh/commit/4723e1754cd28657183358334fe77fe328f01bd0 Greetings Jan Hi Klaas, Just a quick update. Currently I was able to backport 4 out of 5 patches (last one has same problem as patch below) from upstream and from initial testing it looks like rapid memory consumption nor high cpu consumption is no longer present. In following days I'll test it properly and let you know. Also I found one more patch witch might by a part of the fix: https://github.com/tcsh-org/tcsh/commit/c9757dd83f6c641dacddd58095cefc5ed21a3b42 The issue with this one is that there huge amount of code missing so there is no chance to backport this patch. Only chance is to rewrite it somehow. Greetings Jan Hi Klaas, I tested backported patches from upstream with three corrupted history files which are available to me. There is no more cpu nor ram exhaustion with history from your customer ticket and with history from original report Bug #1598502 only small esthetic issue (at the end): %<---------------- [root@localhost ~]# tcsh Can't load history: $< line too long. [root@localhost ~]# ls anaconda-ks.cfg history_1 history_2 history_3 [root@localhost ~]# ps PID TTY TIME CMD 1772 pts/0 00:00:00 bash 1789 pts/0 00:00:00 tcsh 1806 pts/0 00:00:00 ps [root@localhost ~]# exit exit Can't load history: $< line too long. exit [root@localhost ~]# %<---------------- The tcsh not even start with last history file from private comment on https://bugzilla.redhat.com/show_bug.cgi?id=1598502#34 . The issue is the same with upstream version of tcsh in Fedora. I'll contact upstream about it and also I'll try to debug it at the end of this week. Greetings Jan Hi Klaas, It's me with another update. With help of my colleague Kamil I was able find cause of infinite loop. Upstream was able to come up with slightly different patch then we suggested, which break everything (every corrupted history cause infinite loop). Patch: https://github.com/tcsh-org/tcsh/commit/064853aa25f498aa7bc4554393a53240696813fa So I create patch by myself (see attachments). Basically upstream by mistake used two different functions in original patch stderror(...) that sets error msg and abort loading and seterror(...) that sets msg but continue to load corrupted history. By changing seterror(...) to stderror(...) tcsh is able to skip corrupted history and continue. I tested this patch together with all that I mentioned in past and It seems that everything works. Only thing that doesn't work is that command history is available only in interactive shell mode. Once you exit it isn't saved. So I would suggest to start with clear .history file. Also I want to let you know that tomorrow I'm Out of office and next week I'm on PTO. I'll be back on 10th of August if there will be any problem during my absence please feel free to contact my manager lnykryn. I'll to create errata today, but it might get delayed till 10th. Thank you for your understanding and patience. Regards Jan Created attachment 1702907 [details]
Call stderror consistently
This patch replaces seterror(...) with stderror(...)
Suggested by Kamil Dudka
Thanks for the investigative work and the fix Jan/Kamil. Will this receive a 7.8 backport or is 7.9 right around the corner? Hi Klaas, RHEL 7.9.z will be released soon. We can't give you any exact dates via bugzilla, please contact the support. Regards Jan Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (tcsh bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5001 If anyone else implemented the workaround I suggested "set histfile="; we migrated a couple of systems to rhel8 and that is no longer working on el8 :) It creates files like .10386548 in the current working directory. We no longer have any systems older than RHEL 7.9 so I just removed the workaround completely, and the histfile is now ~/.history again. |