Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1852859

Summary: epoll_ctl triggers valgrind warnings on non-x86_64 64bit arches
Product: Red Hat Developer Toolset Reporter: Mark Wielaard <mjw>
Component: valgrindAssignee: Mark Wielaard <mjw>
Status: CLOSED ERRATA QA Contact: Alexandra Petlanová Hájková <ahajkova>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: DTS 10.0 RHEL 7CC: ahajkova, aoliva, arjun, codonell, dj, dodji, dsmith, extras-qa, fweimer, jakub, law, mcermak, mfabian, mjw, mnewsome, ohudlick, pfrankli, rth, siddhesh, tgl
Target Milestone: beta1   
Target Release: 10.0   
Hardware: aarch64   
OS: Unspecified   
Whiteboard:
Fixed In Version: devtoolset-10-valgrind-3.16.1-4 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1844778
: 1875329 (view as bug list) Environment:
Last Closed: 2020-12-01 12:10:32 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1844778    
Bug Blocks: 1875329, 1879760    

Description Mark Wielaard 2020-07-01 13:11:47 UTC
+++ This bug was initially created as a clone of Bug #1844778 +++

Description of problem:
An application (postgresql, to be exact) using epoll_ctl fails when run under valgrind, with a stack trace like
{
   epoll-weirdness
   Memcheck:Param
   epoll_ctl(event)
   fun:epoll_ctl
   fun:WaitEventAdjustEpoll
}

I see this on aarch64 but not x86_64 (have not tried other architectures).

The calling function is definitely initializing both the events and data.ptr fields of the struct epoll_event passed to epoll_ctl().  I speculate that epoll_ctl() is coded in such a way that the uninitialized inter-field padding in that struct triggers valgrind's wrath.

Version-Release number of selected component (if applicable):
glibc-2.31-2.fc32.aarch64
valgrind-3.16.0-1.fc32.aarch64

How reproducible:
100%

Steps to Reproduce:
1.Run postgresql regression tests under valgrind on Fedora 32/aarch64.

If you want to be picky about this, I can probably prepare a more self-contained test case.

Actual results:
valgrind exceptions in epoll_ctl.

Expected results:
No such exceptions.
Additional info:

--- Additional comment from Tom Lane on 2020-06-07 05:27:24 UTC ---

Forgot to mention that the error vanishes if I insert

       memset(&epoll_event, 0, sizeof(epoll_event));

into the calling code before it fills the events and data.ptr fields, which is why my suspicion is focused on that struct's padding.

--- Additional comment from Florian Weimer on 2020-06-07 17:26:34 UTC ---

I'm sorry, glibc's epoll_ctl implementation is just a tiny wrapper which does not process the pointed-to struct  all. It looks like the system call model in valgrind needs some tweaking. It's possible that the default GCC code generation on x86-64 writes to the padding by accident, but does not do so on aarch64.

Given that epoll_ctl is a multiplexing system call, it would be good to know which operation triggered this, and which bytes are reported as undefined. Thanks.

--- Additional comment from Mark Wielaard on 2020-06-07 17:37:15 UTC ---

Could you provide the actual warning valgrind produces?
The suppression provided doesn't show which param is being warned about and why.
Showing the actual source code that triggers the warning would be helpful too.

--- Additional comment from Mark Wielaard on 2020-06-07 18:00:35 UTC ---

(In reply to Mark Wielaard from comment #3)
> Could you provide the actual warning valgrind produces?
> The suppression provided doesn't show which param is being warned about and
> why.
> Showing the actual source code that triggers the warning would be helpful
> too.

Actually I was wrong, the supression does say "epoll(event)", so it is the event argument.
According to valgrind's own definition of vki_epoll_event the struct is packaged on x86_64, but not on other arches.
Will have to check the kernel definition to see if this is the issue on aarch64.

--- Additional comment from Tom Lane on 2020-06-07 19:42:47 UTC ---



--- Additional comment from Tom Lane on 2020-06-07 19:46:52 UTC ---

OK, this seems like it might be more complex than I expected, so I stripped down the Postgres code to a self-contained test case.

Run with
gcc -Wall -o epolltest epolltest.c
valgrind ./epolltest

On x86_64 valgrind reports no problems (although that test is with RHEL8's 3.15.0; I don't have a working Intel F32 installation just now).  On aarch64 I get

==67615== Syscall param epoll_ctl(event) points to uninitialised byte(s)
==67615==    at 0x495A7B8: epoll_ctl (in /usr/lib64/libc-2.31.so)
==67615==    by 0x400A67: WaitEventAdjustEpoll (in /home/tgl/epolltest)
==67615==    by 0x4009CB: AddWaitEventToSet (in /home/tgl/epolltest)
==67615==    by 0x400ACF: main (in /home/tgl/epolltest)
==67615==  Address 0x1fff0006bc is on thread 1's stack
==67615==  in frame #1, created by WaitEventAdjustEpoll (???:)

--- Additional comment from Mark Wielaard on 2020-06-07 22:09:34 UTC ---

Thanks. Replicated on an ARM64 setup.

And found the following in the kernel headers:

/* 
 * On x86-64 make the 64bit structure have the same alignment as the
 * 32bit structure. This makes 32bit emulation easier.
 *
 * UML/x86_64 needs the same packing as x86_64
 */
#ifdef __x86_64__
#define EPOLL_PACKED __attribute__((packed))
#else
#define EPOLL_PACKED
#endif

struct epoll_event {
        __poll_t events;
        __u64 data;
} EPOLL_PACKED;

Sigh. So that explains why there is no padding on x86_64, but there might be on any other arch.
valgrind indeed checks that the whole struct is defined.

Testing the following valgrind patch:

diff --git a/coregrind/m_syswrap/syswrap-linux.c b/coregrind/m_syswrap/syswrap-l
index f1ecbbf..e94c448 100644
--- a/coregrind/m_syswrap/syswrap-linux.c
+++ b/coregrind/m_syswrap/syswrap-linux.c
@@ -1923,8 +1923,13 @@ PRE(sys_epoll_ctl)
          SARG1, ( ARG2<3 ? epoll_ctl_s[ARG2] : "?" ), SARG3, ARG4);
    PRE_REG_READ4(long, "epoll_ctl",
                  int, epfd, int, op, int, fd, struct vki_epoll_event *, event);
-   if (ARG2 != VKI_EPOLL_CTL_DEL)
-      PRE_MEM_READ( "epoll_ctl(event)", ARG4, sizeof(struct vki_epoll_event) );
+   if (ARG2 != VKI_EPOLL_CTL_DEL) {
+      struct vki_epoll_event *event = (struct vki_epoll_event *) ARG4;
+      PRE_MEM_READ( "epoll_ctl(event.events)", (Addr) &event->events,
+                    sizeof(__vki_u32) );
+      PRE_MEM_READ( "epoll_ctl(event.data)", (Addr) &event->data,
+                    sizeof(__vki_u64) );
+   }
 }
 
 PRE(sys_epoll_wait)

--- Additional comment from Tom Lane on 2020-06-07 22:59:23 UTC ---

The other thing that might bite us here (independently of architecture) is that epoll_data_t is a union of both 32- and 64-bit fields.  I haven't really studied the relevant man pages, but I suppose there are cases where the data field is legitimately only partially initialized.

--- Additional comment from Mark Wielaard on 2020-06-08 10:49:35 UTC ---

(In reply to Tom Lane from comment #8)
> The other thing that might bite us here (independently of architecture) is
> that epoll_data_t is a union of both 32- and 64-bit fields.  I haven't
> really studied the relevant man pages, but I suppose there are cases where
> the data field is legitimately only partially initialized.

Urgh, you are right. I filed an upstream bug https://bugs.kde.org/show_bug.cgi?id=422623 since this is not really Fedora specific.

Comment 5 Mark Wielaard 2020-09-14 14:18:35 UTC
Update summary to make clear this is not aarch64 specific, the same issue shows on ppc64le and s390x.

Comment 8 errata-xmlrpc 2020-12-01 12:10:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: devtoolset-10-valgrind), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5287