Bug 1792908

Summary:	nstat core dumps continuously if its state file /tmp/.nstat.u<userid> is corrupted
Product:	Red Hat Enterprise Linux 7
Component:	iproute
Version:	7.7
Hardware:	All
OS:	Linux
Status:	CLOSED ERRATA
Severity:	high
Priority:	high
Type:	Bug
Reporter:	Renaud Métrich <rmetrich>
Assignee:	Andrea Claudi <aclaudi>
QA Contact:	BaseOS QE Security Team <qe-baseos-security>
CC:	atragler, jmaxwell, ptalbert
Target Milestone:	rc
Fixed In Version:	iproute-4.11.0-27.el7
Clones:	1824896
Bug Blocks:	1824896
Last Closed:	2020-09-29 20:28:24 UTC

Description Renaud Métrich 2020-01-20 11:23:10 UTC

Description of problem:

If the state file /tmp/.nstat.u<userid> is corrupted and does not contain data in the format nstat expects, "nstat" aborts on every run until the state file is manually deleted.


Version-Release number of selected component (if applicable):

iproute-4.11.0-25.el7_7.2.x86_64


How reproducible:

Always


Steps to Reproduce:
1. Execute nstat once

  # nstat

2. Corrupt the state file

  # echo FOO > /tmp/.nstat.u0

3. Execute nstat again

  # nstat

Actual results:

Aborted (core dumped)


Expected results:

"fresh" nstat data (not the differences since state file is corrupted)


Additional info:

The following backtrace is seen:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) bt
#0  0x00007fb3a26c0377 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007fb3a26c1a68 in __GI_abort () at abort.c:90
#2  0x0000000000402602 in load_good_table (fp=fp@entry=0x169c0a0) at nstat.c:147
#3  0x00000000004021f7 in main (argc=<optimized out>, argv=<optimized out>) at nstat.c:696

(gdb) up 2
#2  0x0000000000402602 in load_good_table (fp=fp@entry=0x169c0a0) at nstat.c:147
147				abort();
(gdb) list
142				continue;
143			}
144			/* idbuf is as big as buf, so this is safe */
145			nr = sscanf(buf, "%s%llu%lg", idbuf, &val, &rate);
146			if (nr < 2)
147				abort();
148			if (nr < 3)
149				rate = 0;
150			if (useless_number(idbuf))
151				continue;

(gdb) p buf
$1 = "FOO\n"
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Because "nr < 2" holds ("nr == 1" here: only the "%s" conversion matches "FOO", and no "unsigned long long" or "double" value follows it), nstat calls abort() and a coredump is created.
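
The sscanf() behaviour behind this check can be reproduced outside nstat. The sketch below is not taken from nstat.c: count_fields() and ID_MAX are hypothetical names, and only the format string mirrors the one in the listing above (with a field width added to bound the "%s" conversion). For the input "FOO\n", a standard sscanf() converts the "%s" field and then runs out of input, so the expected return value is 1, which satisfies "nr < 2".

```c
#include <stdio.h>

#define ID_MAX 64	/* hypothetical bound, not taken from nstat.c */

/*
 * Hypothetical helper: report how many fields sscanf() converted
 * from one state-file line of the form "<id> <value> [rate]".
 */
int count_fields(const char *line)
{
	char id[ID_MAX];
	unsigned long long val;
	double rate;

	/* "%63s" leaves room for the terminating NUL in id[] */
	return sscanf(line, "%63s%llu%lg", id, &val, &rate);
}
```

A line with an identifier and a value yields 2, a line that also carries a rate yields 3, and anything below 2 is what triggers the abort() shown in the backtrace.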

Comment 2 Andrea Claudi 2020-02-04 13:17:22 UTC

Hi Renaud,
The data stored in the nstat temporary file is provided by the kernel via /proc files; if something breaks in its syntax (e.g. because of a bug), the nstat temp file will be constantly broken, so there is no chance of providing "fresh" nstat data as you expect. In this situation, the only thing we can do is avoid the crash and make nstat fail gracefully (i.e. print a meaningful message and exit with an error). Do you agree with this?

Comment 3 Renaud Métrich 2020-02-04 13:32:33 UTC

Hi Andrea,

I understand; bailing out with an error is sufficient. Still, I'm not sure whether nstat should delete the corrupted state file or keep it and tell the admin to delete it (I would prefer that nstat delete the file, since it's corrupted anyway).

Renaud.

Comment 4 Andrea Claudi 2020-02-04 16:08:51 UTC

I don't see any compelling reason to hide the cause of the error by deleting the file. On the contrary, having the file readily available can expedite a fix in case of issues.
Please take into account that if the error comes from the kernel, every new file will contain it.

Comment 5 Renaud Métrich 2020-02-04 20:03:09 UTC

Well, the issue happened on a customer system; of course the customer didn't push "FOO" into the state file.
This means the kernel may report bad data from time to time. When that happens, the file should simply be discarded and an error printed stating that the next execution will show the whole stats rather than the differences.
We may indeed choose not to delete the file, but then the nstat return code should be very specific so that scripts can detect this and do the cleanup themselves (which seems complicated to me: the caller would then need to know how the state file is named).

Comment 6 Andrea Claudi 2020-02-06 17:56:14 UTC

Yes, that was clear to me.

The problem is that if the kernel is printing garbage in the temp file, there is no chance of getting "clean" data, so deleting the file is useless: we would end up with another corrupted file anyway.

Comment 7 Andrea Claudi 2020-02-06 18:12:22 UTC

Patch sent upstream: https://patchwork.ozlabs.org/patch/1234524/

Comment 9 Andrea Claudi 2020-02-24 17:01:57 UTC

Patch merged upstream.

Comment 11 Andrea Claudi 2020-04-16 15:55:38 UTC

Solved upstream with:

commit 2c7056ac26412fe99443a283f0c1261cb81ccea2
Author: Andrea Claudi <aclaudi>
Date:   Mon Feb 17 14:46:18 2020 +0100

    nstat: print useful error messages in abort() cases
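
    For illustration only, and not the text of the upstream commit: a minimal sketch of what "fail gracefully" could look like for the parsing loop quoted in the description. The function name load_table_checked(), its parameters, the buffer sizes and the message wording are all assumptions. Instead of calling abort(), the loader reports the offending file and line on stderr and returns -1, so the caller can exit with an error status.

```c
#include <stdio.h>

/*
 * Hypothetical loader (not the nstat.c implementation): returns the
 * number of counters read, or -1 after printing a message if a line
 * does not match "<id> <value> [rate]".
 */
int load_table_checked(FILE *fp, const char *name)
{
	char buf[1024], id[1024];
	unsigned long long val;
	double rate;
	int lineno = 0, count = 0;

	while (fgets(buf, sizeof(buf), fp)) {
		int nr;

		lineno++;
		/* assumption: '#' lines are headers, empty lines are skipped */
		if (buf[0] == '#' || buf[0] == '\n')
			continue;

		nr = sscanf(buf, "%1023s%llu%lg", id, &val, &rate);
		if (nr < 2) {
			fprintf(stderr, "%s:%d: malformed line in state file\n",
				name, lineno);
			return -1;
		}
		if (nr < 3)
			rate = 0;	/* rate is optional, as in the original loop */
		count++;
	}
	return count;
}
```

    With a file containing only "FOO", this sketch returns -1 and leaves the file in place, which matches the position taken in comment 4 that the corrupted file should stay available for inspection.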

Comment 18 errata-xmlrpc 2020-09-29 20:28:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (iproute bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3999