Bug 2122957 - Upgrade to kernel-4.18.0-372.x breaks user-mode linux (UML) kernel operations
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel
Version: 8.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: core-kernel-bot
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-08-31 12:34 UTC by antal.nemes
Modified: 2023-08-08 03:37 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-21 18:51:56 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
reproduction script (608 bytes, application/x-shellscript)
2022-08-31 12:34 UTC, antal.nemes
log showing the failure (22.84 KB, text/plain)
2022-08-31 12:35 UTC, antal.nemes
log showing successful run (38.57 KB, text/plain)
2022-08-31 12:35 UTC, antal.nemes
repro env - package list (NEVRA) (11.27 KB, text/plain)
2022-08-31 12:36 UTC, antal.nemes
repro env - package list (name only) (3.98 KB, text/plain)
2022-08-31 12:37 UTC, antal.nemes


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-132813 0 None None None 2022-08-31 12:37:12 UTC

Description antal.nemes 2022-08-31 12:34:41 UTC
Created attachment 1908674 [details]
reproduction script

Description of problem:

We are running a custom build of libguestfs with the user-mode Linux (UML) backend on Rocky Linux 8. The UML kernel is built from the 4.14 sources from kernel.org.

Since upgrading to kernel 4.18.0-372.x, some processes executed within this guestfs UML kernel fail.
A log of the libguestfs operation with debug/tracing enabled is attached (log-failure.txt; contrast with log-success.txt).

There are two types of failures that occur:
1. Inconsistency detected by ld.so: dl-version.c: 205: _dl_check_map_versions: Assertion `needed != NULL' failed!
2. Crash of ls /dev
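
A minimal sketch for spotting failure type 1 in a libguestfs debug log, assuming the log is a plain-text file such as the attached log-failure.txt. The function name has_ldso_assertion is hypothetical. It matches only the fixed prefix of the message, since the source line number (205 here) may differ between glibc builds.

```shell
# Hypothetical helper: exit 0 if LOGFILE contains the ld.so assertion
# quoted in failure type 1, non-zero otherwise.
# -F treats the pattern as a fixed string, -q suppresses output.
has_ldso_assertion() {
    grep -qF "Inconsistency detected by ld.so: dl-version.c" "$1"
}
```

It could be called as `has_ldso_assertion log-failure.txt && echo "ld.so assertion present"`.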

Regarding 1, it seems ld.so silently fails to load certain libraries.
By instrumenting glibc's loader with additional debug statements, I determined that this occurs in _dl_map_object, when it checks whether a library with the same soname is already loaded. When the issue occurs, strcmp(name, soname) returns 0 (strings match) even though the strings do not actually match, so the loader concludes that a dependency is already loaded when in reality it is not. Adding a debug printf statement before the strcmp causes the subsequent strcmp on (presumably) the same values to return the correct result. This does not happen for every library loaded, but it does happen deterministically for each failing binary, and I am unable to identify what exactly triggers the mismatch. I am completely at a loss as to how a kernel update can result in such behavior.

I did not do any analysis for failure 2 (the crash of the ls binary).

Issue is 100% reproducible. Issue does not occur after reverting to kernel 4.18.0-348.20 or .23.

Given the same custom UML kernel and libguestfs binaries, upgrading only the kernel from 4.18.0-348.x to 4.18.0-372.x results in the failure.
Issue does not occur with the 5.18 kernel (kernel-ml from elrepo). Issue apparently also occurs on Rocky Linux 9.
Issue is reproducible if OS is running on KVM or vSphere, but is not reproducible when OS is running in VirtualBox (tested with Vagrant bento/rockylinux86).
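
To check whether a host falls in the range described above, a minimal sketch follows. It assumes the kernel release string has the usual 4.18.0-<build>... form printed by `uname -r`. The function name classify_kernel and its output labels are hypothetical, and the classification only mirrors the versions named in this report (348.x and 5.18 working, 372 and newer failing). Anything else is reported as unknown.

```shell
# Hypothetical helper: classify a kernel release string against the
# versions named in this report. Prints one of:
#   reported-failing, reported-working, unknown
classify_kernel() {
    release=$1
    case $release in
        4.18.0-*)
            build=${release#4.18.0-}   # e.g. "372.9.1.el8.x86_64"
            build=${build%%.*}         # keep the leading build number, e.g. "372"
            case $build in
                ''|*[!0-9]*) echo unknown; return 0 ;;   # not a plain number
            esac
            if [ "$build" -ge 372 ]; then
                echo reported-failing
            elif [ "$build" -eq 348 ]; then
                echo reported-working
            else
                echo unknown
            fi
            ;;
        5.18.*)
            echo reported-working
            ;;
        *)
            echo unknown
            ;;
    esac
}
```

It could be called as `classify_kernel "$(uname -r)"`.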

This issue is critical for us because it is blocking kernel upgrades.
UML kernel is critical for us to achieve performant isolation in environments where nested virtualization is not available.

Version-Release number of selected component (if applicable):
kernel-4.18.0-372.9.1.el8 and newer

How reproducible:
Issue is 100% reproducible.

Steps to Reproduce:

Relevant files are in https://files.hycu.com/d/56180e9f77e242f0975d/ (expires in 60 days)
- guestfs-uml.tgz - our custom build of libguestfs and UML backend
- packages*.txt - list of installed packages from our local reproducer
- repro.sh - test script (also attached)

Procedure:
1. dnf install augeas-libs libconfig
2. rm -rf /usr/local/lib64
3. cd /
4. tar xvfz guestfs-uml.tgz
5. . repro.sh && prepare-test && run-test

Actual results:

Guestfs operation with UML backend fails (see log-failure.txt).

Expected results:

Guestfs operation with UML backend succeeds (see log-success.txt).


Comment 1 antal.nemes 2022-08-31 12:35:24 UTC
Created attachment 1908675 [details]
log showing the failure

Comment 2 antal.nemes 2022-08-31 12:35:49 UTC
Created attachment 1908676 [details]
log showing successful run

Comment 3 antal.nemes 2022-08-31 12:36:42 UTC
Created attachment 1908677 [details]
repro env - package list (NEVRA)

Comment 4 antal.nemes 2022-08-31 12:37:09 UTC
Created attachment 1908678 [details]
repro env - package list (name only)

Comment 5 Rafael Aquini 2022-09-21 18:51:56 UTC
User Mode Linux is a technology that is not supported by RHEL, 
so there's no guarantees that backports will not break it, 
at any point in the product lifecycle.

Besides that, you state that you are running a custom built environment
on top of a downstream clone of RHEL, which makes your case unsupportable.

Unfortunately, I have to tell you that you are on your own with the deployment
choices you made. However, if you manage to debug the problem by yourself and 
find out the set of changes (from upstream) that makes UML work for you, and
at that point you are still willing to have RHEL carry over the changes,
please post them here and reopen this ticket.

For now, I'm going to close this ticket as NOTABUG, given UML is not part
of RHEL offerings.

Comment 6 antal.nemes 2022-09-21 21:37:38 UTC
> User Mode Linux is a technology that is not supported by RHEL, 
> so there's no guarantees that backports will not break it, 
> at any point in the product lifecycle.

While the effect is currently visible in UML, the implication is that a change 
in the kernel clearly resulted in incorrect behavior of a user-space application. 
Without further analysis, there is no telling what else would be affected (including
things that are supported by RHEL). Since this is reproducible, this is an opportunity 
to identify the root cause (and by extension, its blast radius).

>  However, if you manage to debug the problem by yourself and 
> find out the set of changes (from upstream) that makes UML work for you ..

The issue does not occur upstream, otherwise I would not be opening a bug with Red Hat.
I would be happy to bisect the issue myself if I had granular backport changesets, but
to my knowledge, these are not available to me.
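
Without per-commit changesets, a coarser option is to bisect over kernel package builds between the last working and the first failing one. A minimal sketch, assuming an ordered list of builds in which the first works, the last fails, and every build after the first failing one also fails. The names first_bad and CHECK_CMD are hypothetical; the check command would have to boot the given build and run the reproducer, which is not shown here.

```shell
# Hypothetical sketch: print the first failing build in an ordered list.
# Usage: first_bad CHECK_CMD BUILD...
# CHECK_CMD is called with one build and must exit 0 if that build works.
# Note: plain sh has no local variables, so CHECK_CMD must not modify
# check, lo, hi, mid or candidate.
first_bad() {
    check=$1
    shift
    lo=1      # index of a build assumed to work
    hi=$#     # index of a build assumed to fail
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        eval "candidate=\${$mid}"      # fetch the mid-th positional argument
        if "$check" "$candidate"; then
            lo=$mid                    # still working: look later
        else
            hi=$mid                    # failing: look earlier
        fi
    done
    eval "printf '%s\n' \"\${$hi}\""
}
```

Each step halves the remaining range, so N builds need about log2(N) boot-and-test cycles.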

