Bug 1546179

Summary: nettop.stp fails with read fault on rhel-alt-7.5/s390x
Product: Red Hat Enterprise Linux 7 Reporter: Martin Cermak <mcermak>
Component: systemtap Assignee: Frank Ch. Eigler <fche>
Status: CLOSED ERRATA QA Contact: Martin Cermak <mcermak>
Severity: medium Docs Contact: Vladimír Slávik <vslavik>
Priority: medium    
Version: 7.5-Alt CC: bgollahe, chorn, cww, dsmith, fche, fj-lsoft-kernel-it, fj-lsoft-rh-dump, jistone, lberk, lherbolt, mbenitez, mcermak, mjw, pasik, vslavik
Target Milestone: rc   
Target Release: ---   
Hardware: s390x   
OS: Linux   
Whiteboard:
Fixed In Version: systemtap-3.3-1.el7 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1506230 Environment:
Last Closed: 2018-10-30 10:46:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1477664, 1505884, 1609081    

Description Martin Cermak 2018-02-16 14:26:04 UTC
+++ This bug was initially created as a clone of Bug #1506230 +++

=======
# stap nettop.stp  -c 'wget -q https://ftp.spline.de/pub/OpenBSD/ftplist'
ERROR: read fault [man error::fault] at 0x76af7f20 near identifier '$skb' at /usr/share/systemtap/tapset/linux/networking.stp:84:27
  PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND        
19715     0 enccw0.0.8000       2       0       0       0 wget           

WARNING: Number of errors: 1, skipped probes: 0
WARNING: /usr/bin/staprun exited with status: 1
Pass 5: run failed.  [man error::pass5]
#
=======

Note that -P doesn't help. SSH or ICMP traffic isn't sufficient to reproduce the problem, but HTTPS or FTP traffic seems to trigger it.

Comment 2 David Smith 2018-02-27 21:20:01 UTC
OK, I think I know what is going on here. The following kernel commit changed the sk_buff structure:

====
commit bffa72cf7f9df842f0016ba03586039296b4caaf
Author: Eric Dumazet <edumazet>
Date:   Tue Sep 19 05:14:24 2017 -0700

    net: sk_buff rbnode reorg
    
    skb->rbnode shares space with skb->next, skb->prev and skb->tstamp
====

The sk_buff structure now looks like this:

====
struct sk_buff {
	union {
		struct {
			/* These two members must be first. */
			struct sk_buff		*next;
			struct sk_buff		*prev;

			union {
				struct net_device	*dev;
				/* Some protocols might use this space to store information,
				 * while device pointer would be NULL.
				 * UDP receive path is one user.
				 */
				unsigned long		dev_scratch;
			};
		};
		struct rb_node	rbnode; /* used in netem & tcp stack */
	};
        ... stuff deleted ...
};
====

The systemtap read fault error is coming from the following tapset line:

	dev_name = kernel_string($skb->dev->name)

(Of course, I'll remind readers that a "read fault" error is really a good thing - that's systemtap realizing that this address isn't valid and giving an error instead of trying to read a bad address and potentially crashing the kernel.)

In this case, the problem I see in finding a solution is that I don't see a way of knowing which member of each of the two unions holds a valid value:

1) In this particular skb, is skb->rbnode valid or is the unnamed structure containing next/prev valid?

2) Assuming this particular skb's unnamed structure containing next/prev is valid, is dev or dev_scratch valid?
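
To make the two questions above concrete, here is a minimal userspace sketch of the overlap. All mock_* names are hypothetical stand-ins, not the kernel definitions, and mock_rb_node assumes the kernel's rb_node has three word-sized members (parent/color word, right child, left child). Under that assumption, the dev pointer sits on the same bytes as the rbnode's left-child pointer and as dev_scratch, and nothing in the structure records which one was written last.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace mock-ups for illustration; NOT the kernel types. */
struct mock_rb_node {
    unsigned long parent_color;       /* assumed: parent pointer plus color bits */
    struct mock_rb_node *right;
    struct mock_rb_node *left;
};

struct mock_net_device {
    char name[16];
};

struct mock_sk_buff {
    union {
        struct {
            struct mock_sk_buff *next;
            struct mock_sk_buff *prev;
            union {
                struct mock_net_device *dev;
                unsigned long dev_scratch;
            };
        };
        struct mock_rb_node rbnode;
    };
};

/* Returns 1 when dev and rbnode.left start at the same byte offset. */
int dev_aliases_rbnode(void)
{
    return offsetof(struct mock_sk_buff, dev) ==
           offsetof(struct mock_sk_buff, rbnode) +
           offsetof(struct mock_rb_node, left);
}

/* Stores a scratch value, then reads the same storage back through dev.
 * The result is whatever integer was stored, not a usable pointer. */
uintptr_t dev_after_scratch_write(unsigned long scratch)
{
    struct mock_sk_buff skb;

    memset(&skb, 0, sizeof skb);
    skb.dev_scratch = scratch;
    return (uintptr_t)skb.dev;
}
```

In this mock, dev_aliases_rbnode() reports the overlap, and dev_after_scratch_write() shows that a reader following skb->dev would dereference the scratch value as if it were an address, which is the kind of access that ends in a read fault.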

s390x probably sees the read fault error more often than other platforms because it has always been more "sensitive" to bad addresses.

The quickest solution would be to surround that tapset code with 'try { ... } catch { ... }' and return something like "UNKNOWN" as the dev_name field, but that doesn't really follow the spirit of the following bug, where we tried to eradicate strings like that:

<https://sourceware.org/bugzilla/show_bug.cgi?id=15044>

Comment 3 David Smith 2018-03-02 20:06:41 UTC
Fixed upstream in commit 2f6fcfc68. Accesses to sk_buff structures are now surrounded by try/catch in probes.

Comment 11 errata-xmlrpc 2018-10-30 10:46:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:3168