Bug 1792908

Summary: nstat core dumps continuously if its state file /tmp/.nstat.u<userid> is corrupted
Product: Red Hat Enterprise Linux 7 Reporter: Renaud Métrich <rmetrich>
Component: iproute Assignee: Andrea Claudi <aclaudi>
Status: CLOSED ERRATA QA Contact: BaseOS QE Security Team <qe-baseos-security>
Severity: high Docs Contact:
Priority: high    
Version: 7.7 CC: atragler, jmaxwell, ptalbert
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: iproute-4.11.0-27.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1824896 (view as bug list) Environment:
Last Closed: 2020-09-29 20:28:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1824896    

Description Renaud Métrich 2020-01-20 11:23:10 UTC
Description of problem:

If the state file /tmp/.nstat.u<userid> is corrupted and does not contain data in the format nstat expects, "nstat" aborts on every run until the state file is manually deleted.


Version-Release number of selected component (if applicable):

iproute-4.11.0-25.el7_7.2.x86_64


How reproducible:

Always


Steps to Reproduce:
1. Execute nstat once

  # nstat

2. Corrupt the state file

  # echo FOO > /tmp/.nstat.u0

3. Execute nstat again

  # nstat

Actual results:

Aborted (core dumped)


Expected results:

"fresh" nstat data (not the differences since state file is corrupted)


Additional info:

The following backtrace is seen:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) bt
#0  0x00007fb3a26c0377 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007fb3a26c1a68 in __GI_abort () at abort.c:90
#2  0x0000000000402602 in load_good_table (fp=fp@entry=0x169c0a0) at nstat.c:147
#3  0x00000000004021f7 in main (argc=<optimized out>, argv=<optimized out>) at nstat.c:696

(gdb) up 2
#2  0x0000000000402602 in load_good_table (fp=fp@entry=0x169c0a0) at nstat.c:147
147				abort();
(gdb) list
142				continue;
143			}
144			/* idbuf is as big as buf, so this is safe */
145			nr = sscanf(buf, "%s%llu%lg", idbuf, &val, &rate);
146			if (nr < 2)
147				abort();
148			if (nr < 3)
149				rate = 0;
150			if (useless_number(idbuf))
151				continue;

(gdb) p buf
$1 = "FOO\n"
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Because of the "nr < 2" check, a core dump is created: for "FOO\n", sscanf() converts only the identifier ("nr == 1" here), since no "unsigned long long" value and no "double" value follow it.
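
To make the field count concrete, here is a minimal sketch that applies the same "%s%llu%lg" format to a single line and returns how many fields sscanf() converted. The helper name count_fields and the 255-character width limit are assumptions made for this illustration; they are not taken from nstat.c.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of nstat.c): return the number of fields
 * that the "%s%llu%lg" format converts from one state-file line.
 * The %255s width is an addition for this sketch so that the identifier
 * buffer cannot overflow.
 */
static int count_fields(const char *line)
{
	char id[256];
	unsigned long long val;
	double rate;

	assert(line != NULL);
	return sscanf(line, "%255s%llu%lg", id, &val, &rate);
}
```

With this helper, a well-formed line such as "IpInReceives 42 0.5\n" yields three fields and "IpInReceives 42\n" yields two, while "FOO\n" yields fewer than two, which is the condition that reaches abort() in the listing above.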

Comment 2 Andrea Claudi 2020-02-04 13:17:22 UTC
Hi Renaud,
The data stored in the nstat temporary file is provided by the kernel via /proc files; if something breaks in their syntax (e.g. because of a bug), we end up with an nstat temp file that is constantly broken, so there is no chance of providing "fresh" nstat data as you expect. In this situation, the only thing we can do is avoid the crash and make nstat fail gracefully (i.e. print a meaningful message and exit with an error). Do you agree with this?
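
As an illustration of what "failing gracefully" could look like, the following is only a sketch and not the patch that was later merged: the function names parse_state_line and load_state, the message text, and the simplified handling of '#' header lines are all assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse one "<id> <value> [<rate>]" line.
 * Returns 0 on success, -1 if the line is malformed.
 */
static int parse_state_line(const char *line, char id[256],
			    unsigned long long *val, double *rate)
{
	int nr;

	assert(line && id && val && rate);
	nr = sscanf(line, "%255s%llu%lg", id, val, rate);
	if (nr < 2)
		return -1;	/* identifier or counter value missing */
	if (nr < 3)
		*rate = 0;	/* the rate field is optional */
	return 0;
}

/*
 * Hypothetical loader: returns the number of entries read, or -1 after
 * printing a diagnostic, instead of calling abort().
 * Lines starting with '#' are simply skipped in this sketch.
 */
static int load_state(FILE *fp, const char *path)
{
	char buf[4096], id[256];
	unsigned long long val;
	double rate;
	int lineno = 0, count = 0;

	while (fgets(buf, sizeof(buf), fp)) {
		lineno++;
		if (buf[0] == '#')
			continue;
		if (parse_state_line(buf, id, &val, &rate) < 0) {
			fprintf(stderr, "nstat: %s:%d: malformed history line\n",
				path, lineno);
			return -1;
		}
		count++;
	}
	return count;
}
```

A caller would check for a negative return value and exit with a non-zero status, so that the user sees a message pointing at the state file rather than a core dump.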

Comment 3 Renaud Métrich 2020-02-04 13:32:33 UTC
Hi Andrea,

I understand; bailing out with an error is sufficient. Still, I'm not sure whether nstat should delete the corrupted state file or keep it and tell the admin to delete it (I would prefer nstat to delete the file, since it's corrupted anyway).

Renaud.

Comment 4 Andrea Claudi 2020-02-04 16:08:51 UTC
I don't see any compelling reason to hide the cause of the error by deleting the file. On the contrary, keeping the file readily available can expedite a fix in case of issues.
Please take into account that if the error comes from the kernel, every new file will contain it.

Comment 5 Renaud Métrich 2020-02-04 20:03:09 UTC
Well, the issue happened on a customer system; of course, the customer didn't push "FOO" into the state file.
This means the kernel may report bad data from time to time. When that happens, the file should simply be discarded and an error printed stating that the next execution will show the full stats rather than the differences.
We may indeed choose not to delete the file, but then the nstat return code should be very specific so that scripts can detect this case and do the cleanup themselves (which seems complicated to me: the caller would then need to know how the state file is named).

Comment 6 Andrea Claudi 2020-02-06 17:56:14 UTC
Yes, that was clear to me.

The problem is that if the kernel is printing garbage in the temp file, there is no chance to have "clean" data: hence it is useless to delete the file, we will end up with another corrupted file anyway.

Comment 7 Andrea Claudi 2020-02-06 18:12:22 UTC
Patch sent upstream: https://patchwork.ozlabs.org/patch/1234524/

Comment 9 Andrea Claudi 2020-02-24 17:01:57 UTC
Patch merged upstream.

Comment 11 Andrea Claudi 2020-04-16 15:55:38 UTC
Solved upstream with:

commit 2c7056ac26412fe99443a283f0c1261cb81ccea2
Author: Andrea Claudi <aclaudi>
Date:   Mon Feb 17 14:46:18 2020 +0100

    nstat: print useful error messages in abort() cases

Comment 18 errata-xmlrpc 2020-09-29 20:28:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (iproute bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3999