Bug 1846271

Summary:

[RHEL-7.9] tcsh corrupt history file results in rapid memory consumption until memory exhausted [rhel-7.9.z]

Product:

Red Hat Enterprise Linux 7

Reporter:

Jan Macku <jamacku>

Component:

tcsh

Assignee:

Jan Macku <jamacku>

Status:

CLOSED ERRATA

QA Contact:

Karel Volný <kvolny>

Severity:

high

Docs Contact:

Priority:

urgent

Version:

7.5

CC:

darren.lavender, dkaylor, fkrska, fsumsal, jamacku, jhuo, jreznik, kdudka, klaas, kvolny, lnykryn, msekleta, pandrade, pdwyer, pvlasin, svashisht

Target Milestone:

Keywords:

Regression, ZStream

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

tcsh-6.18.01-17.el7_9.1

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

1598502

Environment:

Last Closed:

2020-11-10 12:55:15 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1598502

Bug Blocks:

1643104, 1765649, 1818766

Attachments:

Description	Flags
Call stderror consistently	none

Comment 10 Lukáš Nykrýn 2020-07-02 13:24:29 UTC

After talking to CEE, this is not needed in 7.9 GA, but it should be fixed in the first z-stream batch.

Comment 12 Jan Macku 2020-07-03 10:30:29 UTC

Thank you to Klass (#1598502) who pointed out three issues that we have to address:

> So there are three issues that need addressing:
> 1) do not write a corrupt history file
> 2) deal with a corrupt history file properly and not create hung processes
> 3) make the error message more clear, my first instinct was to search all files like .cshrc for a command that is too long :)

We need to backport this patch from upstream to make warning more clear.

https://github.com/tcsh-org/tcsh/commit/4723e1754cd28657183358334fe77fe328f01bd0

I'm currently debugging the issue, once I'll have some news I'll update this bz.

Comment 13 Klaas Demter 2020-07-03 10:42:06 UTC

(In reply to Jan Macku from comment #12)
> Thank you to Klass (#1598502) who pointed out three issues that we have to
> address:
> 
> > So there are three issues that need addressing:
> > 1) do not write a corrupt history file
> > 2) deal with a corrupt history file properly and not create hung processes
> > 3) make the error message more clear, my first instinct was to search all files like .cshrc for a command that is too long :)
> 
> We need to backport this patch from upstream to make warning more clear.
> 
> https://github.com/tcsh-org/tcsh/commit/
> 4723e1754cd28657183358334fe77fe328f01bd0
> 
> I'm currently debugging the issue, once I'll have some news I'll update this
> bz.

Support pointed out that file locking seems to be disabled by default on RHEL via /etc/csh.cshrc from setup-2.8.71-11.el7.noarch 
Default is:
set savehist = (1024 merge)
support said it should be:
set savehist = (1024 merge lock)

This could be one cause for 1)

Greetings
Klaas

Comment 14 Klaas Demter 2020-07-03 11:02:07 UTC

That is also not the default in current fedora https://pagure.io/setup/blob/master/f/csh.cshrc#_34 or even upstream (https://github.com/tcsh-org/tcsh/blob/1b652a48872d1d29b506445336599222569fb80a/dot.tcshrc#L35)

are there any downsides to setting this as a default?


For now our own workaround is to use:
"set histfile="

Comment 15 Jan Macku 2020-07-03 13:48:51 UTC

Hi Klaas,
Please accept my apology for misspelling your name.

> are there any downsides to setting this as a default?

> Default is:
> set savehist = (1024 merge)
> support said it should be:
> set savehist = (1024 merge lock)

What I've been able to find out with help of my colleague is that "locking" of history should be default on systems without home directory via nfs.

https://bugzilla.redhat.com/show_bug.cgi?id=885901#c95

So it probably should not make any difference in behaviour. I'll check if lock works but probably on tuesday since monday is public holiday in CZ. 

Greetings
Jan

Comment 16 Klaas Demter 2020-07-03 14:23:24 UTC

(In reply to Jan Macku from comment #15)
> Hi Klaas,
> Please accept my apology for misspelling your name.
No worries
> 
> > are there any downsides to setting this as a default?
> 
> > Default is:
> > set savehist = (1024 merge)
> > support said it should be:
> > set savehist = (1024 merge lock)
> 
> What I've been able to find out with help of my colleague is that "locking"
> of history should be default on systems without home directory via nfs.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=885901#c95

I can't see that comment, it's marked private -- but I do have the home on nfs (via autofs), maybe that's the problem.

> 
> So it probably should not make any difference in behaviour. I'll check if
> lock works but probably on tuesday since monday is public holiday in CZ. 
> 
> Greetings
> Jan

Comment 17 Paulo Andrade 2020-07-03 17:50:58 UTC

Bug #85901 is related to another lock patch, written by Red Hat,
that used the flock syscall. From my understanding, it would work
on Linux to Linux nfs mounts.

Newer tcsh has an upstream patch that does not use flock, but a
trick with hard links.

Regardless of using the lock, at least for recent tcsh (rhel 7 or
newer), it might block on a syscall if reading/writing the history
file, if the mount point becomes unavailable.

Comment 18 Jan Macku 2020-07-07 13:05:29 UTC

Hi Klaas,
I used strace to check whether default lock works. 

%< -----------------
open("/root/.history", O_RDWR|O_CREAT, 0600) = 3
dup(3)                                  = 4
dup(4)                                  = 5
dup(5)                                  = 6
close(5)                                = 0
close(4)                                = 0
close(3)                                = 0
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
fcntl(6, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
%< -----------------

Function "fcntl(..., F_SETLKW, ...)" suggest that it should work on classic fs.
I'll try to setup nfs via autofs and check how it works there. I'll come back to you.

Greetings
Jan

Comment 21 Jan Macku 2020-07-09 12:43:37 UTC

Hi Klaas,
I set up nfs and by default lock mechanism isn't enabled. After I set "savehist = (1024 merge lock)" then strace showed that history has been lock.
So I would suggest you to set it, because this would prevent future corruption of history file.

I checked how tcsh behave in fedora where is latest upstream version and it looks like the issue is fixed in upstream. Now I'll try to backport corresponding patches from upstream and further test if it fixes this issue. 

I'll let you know about updates.
Greetings
Jan

Comment 22 Klaas Demter 2020-07-09 13:10:48 UTC

Hi Jan,
1) Corrupt History file:
so the corrupt history part of the bug is due to file locking not being enabled by default on nfs. Current upstream has default locking on nfs enabled, the nfs differentiation is just in the RHEL flock patch?


2) Ignore broken history (or at least not fail catastrophically):
Are you able to replicate the shells hanging on exit when working with the the corrupt history? I could replicate this outside of the production system. In any case I think it would be safe to say the history should not break a shell that badly in any situation.

3) Better error message:
I guess that's the easiest part, from #12 I see you already located a backportable change from upstream to make this better.

Greetings
Klaas

Comment 23 Jan Macku 2020-07-09 14:47:23 UTC

(In reply to Klaas Demter from comment #22)
> 1) Corrupt History file:
> so the corrupt history part of the bug is due to file locking not being
> enabled by default on nfs. Current upstream has default locking on nfs
> enabled, the nfs differentiation is just in the RHEL flock patch?

I just wanted to check if lock works. And about patch I think so, but I'm not 100% sure, I'm quite new to tcsh maintenance.

> 2) Ignore broken history (or at least not fail catastrophically):
> Are you able to replicate the shells hanging on exit when working with the
> the corrupt history? I could replicate this outside of the production
> system. In any case I think it would be safe to say the history should not
> break a shell that badly in any situation.

I strongly agree. I already located possible patches to backport. I have two corrupted history files and with both of them I'm able to reproduce this issue.
But if you are able to provide yours I'll be glad use it for testing as well.

Possible patches:
https://github.com/tcsh-org/tcsh/commit/1ef6e9d0eef9278829f863fff55755b7a5415b81
https://github.com/tcsh-org/tcsh/commit/83c5be028419b3f27c3cc707b88fb21bfa4e1b11
https://github.com/tcsh-org/tcsh/commit/67db04d8d8ada0aa2fb7dfdf83f3a408ad5a01dc
https://github.com/tcsh-org/tcsh/commit/6bc746caad28d0f1ffaf07a4ea1468e8efcc6d52

> 3) Better error message:
> I guess that's the easiest part, from #12 I see you already located a
> backportable change from upstream to make this better.

Yes, There is upstream patch which make warning more clear:
https://github.com/tcsh-org/tcsh/commit/4723e1754cd28657183358334fe77fe328f01bd0

Greetings
Jan

Comment 24 Jan Macku 2020-07-14 12:53:38 UTC

Hi Klaas,
Just a quick update. Currently I was able to backport 4 out of 5 patches (last one has same problem as patch below) from upstream and from initial testing it looks like rapid memory consumption nor high cpu consumption is no longer present. In following days I'll test it properly and let you know.

Also I found one more patch witch might by a part of the fix:
https://github.com/tcsh-org/tcsh/commit/c9757dd83f6c641dacddd58095cefc5ed21a3b42

The issue with this one is that there huge amount of code missing so there is no chance to backport this patch. Only chance is to rewrite it somehow.

Greetings
Jan

Comment 25 Jan Macku 2020-07-20 15:10:20 UTC

Hi Klaas,
I tested backported patches from upstream with three corrupted history files which are available to me. 

There is no more cpu nor ram exhaustion with history from your customer ticket and with history from original report Bug #1598502 only small esthetic issue (at the end):

%<----------------

[root@localhost ~]# tcsh 
Can't load history: $< line too long.
[root@localhost ~]# ls
anaconda-ks.cfg  history_1  history_2  history_3
[root@localhost ~]# ps
  PID TTY          TIME CMD
 1772 pts/0    00:00:00 bash
 1789 pts/0    00:00:00 tcsh
 1806 pts/0    00:00:00 ps
[root@localhost ~]# exit
exit
Can't load history: $< line too long.
exit
[root@localhost ~]# 

%<----------------

The tcsh not even start with last history file from private comment on https://bugzilla.redhat.com/show_bug.cgi?id=1598502#34 . The issue is the same with upstream version of tcsh in Fedora. I'll contact upstream about it and also I'll try to debug it at the end of this week.

Greetings
Jan

Comment 26 Jan Macku 2020-07-30 08:50:08 UTC

Hi Klaas,
It's me with another update. With help of my colleague Kamil I was able find cause of infinite loop.
Upstream was able to come up with slightly different patch then we suggested, which break everything (every corrupted history cause infinite loop).
Patch: https://github.com/tcsh-org/tcsh/commit/064853aa25f498aa7bc4554393a53240696813fa

So I create patch by myself (see attachments). Basically upstream by mistake used two different functions in original patch stderror(...) that sets error msg and abort loading and seterror(...) that sets msg but continue to load corrupted history. By changing seterror(...) to stderror(...) tcsh is able to skip corrupted history and continue.

I tested this patch together with all that I mentioned in past and It seems that everything works. Only thing that doesn't work is that command history is available only in interactive shell mode. Once you exit it isn't saved. So I would suggest to start with clear .history file.

Also I want to let you know that tomorrow I'm Out of office and next week I'm on PTO. I'll be back on 10th of August if there will be any problem during my absence please feel free to contact my manager lnykryn.
 
I'll to create errata today, but it might get delayed till 10th.

Thank you for your understanding and patience.
Regards
Jan

Comment 27 Jan Macku 2020-07-30 08:54:55 UTC

Created attachment 1702907 [details]
Call stderror consistently

This patch replaces seterror(...) with stderror(...)

Suggested by Kamil Dudka

Comment 30 Klaas Demter 2020-07-30 13:44:46 UTC

Thanks for the investigative work and the fix Jan/Kamil. Will this receive a 7.8 backport or is 7.9 right around the corner?

Comment 32 Jan Macku 2020-08-10 09:15:29 UTC

Hi Klaas,
RHEL 7.9.z will be released soon. 

We can't give you any exact dates via bugzilla, please contact the support.

Regards
Jan

Comment 35 errata-xmlrpc 2020-11-10 12:55:15 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (tcsh bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5001

Comment 36 Klaas Demter 2022-06-27 13:55:06 UTC

If anyone else implemented the workaround I suggested "set histfile="; we migrated a couple of systems to rhel8 and that is no longer working on el8 :) It creates files like .10386548 in the current working directory. We no longer have any systems older than RHEL 7.9 so I just removed the workaround completely, and the histfile is now ~/.history again.