Bug 1844778
Summary: | epoll_ctl triggers valgrind warnings on aarch64 | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Tom Lane <tgl> | ||||||
Component: | valgrind | Assignee: | Mark Wielaard <mjw> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||
Severity: | unspecified | Docs Contact: | |||||||
Priority: | unspecified | ||||||||
Version: | 32 | CC: | aoliva, arjun.is, codonell, dj, dodji, fweimer, jakub, law, mfabian, mjw, pfrankli, rth, siddhesh, thaller | ||||||
Target Milestone: | --- | ||||||||
Target Release: | --- | ||||||||
Hardware: | aarch64 | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | valgrind-3.16.1-5.fc33 valgrind-3.16.1-5.fc32 | Doc Type: | If docs needed, set a value | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | |||||||||
: | 1852859 (view as bug list) | Environment: | |||||||
Last Closed: | 2020-08-18 22:54:50 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 1852859, 1875329, 1879760 | ||||||||
Attachments: |
|
Description
Tom Lane
2020-06-07 05:09:11 UTC
Forgot to mention that the error vanishes if I insert memset(&epoll_event, 0, sizeof(epoll_event)); into the calling code before it fills the events and data.ptr fields, which is why my suspicion is focused on that struct's padding. I'm sorry, glibc's epoll_ctl implementation is just a tiny wrapper which does not process the pointed-to struct all. It looks like the system call model in valgrind needs some tweaking. It's possible that the default GCC code generation on x86-64 writes to the padding by accident, but does not do so on aarch64. Given that epoll_ctl is a multiplexing system call, it would be good to know which operation triggered this, and which bytes are reported as undefined. Thanks. Could you provide the actual warning valgrind produces? The suppression provided doesn't show which param is being warned about and why. Showing the actual source code that triggers the warning would be helpful too. (In reply to Mark Wielaard from comment #3) > Could you provide the actual warning valgrind produces? > The suppression provided doesn't show which param is being warned about and > why. > Showing the actual source code that triggers the warning would be helpful > too. Actually I was wrong, the supression does say "epoll(event)", so it is the event argument. According to valgrind's own definition of vki_epoll_event the struct is packaged on x86_64, but not on other arches. Will have to check the kernel definition to see if this is the issue on aarch64. Created attachment 1695904 [details]
self-contained test case
OK, this seems like it might be more complex than I expected, so I stripped down the Postgres code to a self-contained test case. Run with gcc -Wall -o epolltest epolltest.c valgrind ./epolltest On x86_64 valgrind reports no problems (although that test is with RHEL8's 3.15.0; I don't have a working Intel F32 installation just now). On aarch64 I get ==67615== Syscall param epoll_ctl(event) points to uninitialised byte(s) ==67615== at 0x495A7B8: epoll_ctl (in /usr/lib64/libc-2.31.so) ==67615== by 0x400A67: WaitEventAdjustEpoll (in /home/tgl/epolltest) ==67615== by 0x4009CB: AddWaitEventToSet (in /home/tgl/epolltest) ==67615== by 0x400ACF: main (in /home/tgl/epolltest) ==67615== Address 0x1fff0006bc is on thread 1's stack ==67615== in frame #1, created by WaitEventAdjustEpoll (???:) Thanks. Replicated on an ARM64 setup. And found the following in the kernel headers: /* * On x86-64 make the 64bit structure have the same alignment as the * 32bit structure. This makes 32bit emulation easier. * * UML/x86_64 needs the same packing as x86_64 */ #ifdef __x86_64__ #define EPOLL_PACKED __attribute__((packed)) #else #define EPOLL_PACKED #endif struct epoll_event { __poll_t events; __u64 data; } EPOLL_PACKED; Sigh. So that explains why there is no padding on x86_64, but there might be on any other arch. valgrind indeed checks that the whole struct is defined. Testing the following valgrind patch: diff --git a/coregrind/m_syswrap/syswrap-linux.c b/coregrind/m_syswrap/syswrap-l index f1ecbbf..e94c448 100644 --- a/coregrind/m_syswrap/syswrap-linux.c +++ b/coregrind/m_syswrap/syswrap-linux.c @@ -1923,8 +1923,13 @@ PRE(sys_epoll_ctl) SARG1, ( ARG2<3 ? epoll_ctl_s[ARG2] : "?" ), SARG3, ARG4); PRE_REG_READ4(long, "epoll_ctl", int, epfd, int, op, int, fd, struct vki_epoll_event *, event); - if (ARG2 != VKI_EPOLL_CTL_DEL) - PRE_MEM_READ( "epoll_ctl(event)", ARG4, sizeof(struct vki_epoll_event) ); + if (ARG2 != VKI_EPOLL_CTL_DEL) { + struct vki_epoll_event *event = (struct vki_epoll_event *) ARG4; + PRE_MEM_READ( "epoll_ctl(event.events)", (Addr) &event->events, + sizeof(__vki_u32) ); + PRE_MEM_READ( "epoll_ctl(event.data)", (Addr) &event->data, + sizeof(__vki_u64) ); + } } PRE(sys_epoll_wait) The other thing that might bite us here (independently of architecture) is that epoll_data_t is a union of both 32- and 64-bit fields. I haven't really studied the relevant man pages, but I suppose there are cases where the data field is legitimately only partially initialized. (In reply to Tom Lane from comment #8) > The other thing that might bite us here (independently of architecture) is > that epoll_data_t is a union of both 32- and 64-bit fields. I haven't > really studied the relevant man pages, but I suppose there are cases where > the data field is legitimately only partially initialized. Urgh, you are right. I filed an upstream bug https://bugs.kde.org/show_bug.cgi?id=422623 since this is not really Fedora specific. Created attachment 1711315 [details] a simple reproducer for epoll_wait() warning There seems to be a similar (same?) issue that breaks gitlab-ci tests for NetworkManager. But this happens on x86_64, which (according to this bug and https://bugs.kde.org/show_bug.cgi?id=422623 should not be affected). I attach a simple reproducer. I run it on current Fedora rawhide, with valgrind 1:3.16.1-4.fc33. $ podman run -ti fedora:rawhide # dnf upgrade -y # dnf install gcc valgrind -y # curl "$SOMEWHERE/test.c" > test.c # gcc test.c -o test # valgrind --quiet --error-exitcode=77 --leak-check=full --num-callers=100 ./test Output: ==54== Conditional jump or move depends on uninitialised value(s) ==54== at 0x4012F6: main (in /test) ==54== (sorry, I didn't try whether the patch from kde:422623 fixes this) Hi, Thanks for the reproducer. I'll take a look. But don't have access to a machine to replicate at the moment. (In reply to Thomas Haller from comment #10) > $ podman run -ti fedora:rawhide > # dnf upgrade -y > # dnf install gcc valgrind -y > # curl "$SOMEWHERE/test.c" > test.c > # gcc test.c -o test > # valgrind --quiet --error-exitcode=77 --leak-check=full --num-callers=100 > ./test > > Output: > > ==54== Conditional jump or move depends on uninitialised value(s) > ==54== at 0x4012F6: main (in /test) > ==54== Note that if you compile with -g then valgrind will be able to show you on which source line that value is used (and where it was created). > (sorry, I didn't try whether the patch from kde:422623 fixes this) valgrind 3.16.1-4.fc33 includes a backport of that fix, so it might be possible the fix actually introduced the issue if it doesn't show with an earlier valgrind version on x86_64. > Thanks for the reproducer. I'll take a look. But don't have access to a > machine to replicate at the moment. Replicated now. > (In reply to Thomas Haller from comment #10) > > $ podman run -ti fedora:rawhide > > # dnf upgrade -y > > # dnf install gcc valgrind -y > > # curl "$SOMEWHERE/test.c" > test.c > > # gcc test.c -o test > > # valgrind --quiet --error-exitcode=77 --leak-check=full --num-callers=100 > > ./test On a slightly older system I had to replace CLOCK_BOOTTIME with CLOCK_REALTIME for some reason. > > Output: > > > > ==54== Conditional jump or move depends on uninitialised value(s) > > ==54== at 0x4012F6: main (in /test) > > ==54== > > Note that if you compile with -g then valgrind will be able to show you on > which source line that value is used (and where it was created). With gcc -g you get: ==11495== Conditional jump or move depends on uninitialised value(s) ==11495== at 0x4008CD: main (test.c:46) Which is: assert (eevent2[0].data.u32 == EEVENT_DATA_U32_1); > > (sorry, I didn't try whether the patch from kde:422623 fixes this) > > valgrind 3.16.1-4.fc33 includes a backport of that fix, so it might be > possible the fix actually introduced the issue if it doesn't show with an > earlier valgrind version on x86_64. This does indeed seem to have been introduced by the patch that should have fixed the issue on aarch64. Digging into why that is now. (In reply to Mark Wielaard from comment #12) > This does indeed seem to have been introduced by the patch that should have > fixed the issue on aarch64. > > Digging into why that is now. It is because I cannot program in C... A pointer to an array is not a pointer to a pointer to an array... diff --git a/coregrind/m_syswrap/syswrap-linux.c b/coregrind/m_syswrap/syswrap-linux.c index 0850487e9..3f488795a 100644 --- a/coregrind/m_syswrap/syswrap-linux.c +++ b/coregrind/m_syswrap/syswrap-linux.c @@ -2115,11 +2115,11 @@ static void epoll_post_helper ( ThreadId tid, SyscallArgs* arrghs, vg_assert(SUCCESS); if (RES > 0) { Int i; - struct vki_epoll_event **events = (struct vki_epoll_event**)(Addr)ARG2; + struct vki_epoll_event *events = (struct vki_epoll_event*)(Addr)ARG2; for (i = 0; i < RES; i++) { /* Assume both events and data are set (data is user space only). */ - POST_FIELD_WRITE(events[i]->events); - POST_FIELD_WRITE(events[i]->data); + POST_FIELD_WRITE(events[i].events); + POST_FIELD_WRITE(events[i].data); } } } I'll get this upstream and update the backport in the fedora package. Should be fully fixed in valgrind-3.16.1-5.fc33 FEDORA-2020-b7e414bdd4 has been submitted as an update to Fedora 32. https://bodhi.fedoraproject.org/updates/FEDORA-2020-b7e414bdd4 FEDORA-2020-b7e414bdd4 has been pushed to the Fedora 32 testing repository. In short time you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2020-b7e414bdd4` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2020-b7e414bdd4 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates. Confirmed that this update fixes the problem for me on aarch64. FEDORA-2020-b7e414bdd4 has been pushed to the Fedora 32 stable repository. If problem still persists, please make note of it in this bug report. |