Bug 2122957

Summary: Upgrade to kernel-4.18.0-372.x breaks user-mode linux (UML) kernel operations
Product: Red Hat Enterprise Linux 8
Reporter: antal.nemes
Component: kernel
Assignee: core-kernel-bot <core-kernel-mgr>
kernel sub component: Kernel-Core
QA Contact: Red Hat Kernel QE team <kernel-qe>
Status: CLOSED NOTABUG
Docs Contact:
Severity: urgent
Priority: unspecified
CC: antal.nemes, aquini
Version: 8.6
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-09-21 18:51:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments (description, flags):
- reproduction script, flags: none
- log showing the failure, flags: none
- log showing successful run, flags: none
- repro env - package list (NEVRA), flags: none
- repro env - package list (name only), flags: none

Description antal.nemes 2022-08-31 12:34:41 UTC
Created attachment 1908674 [details]
reproduction script

Description of problem:

We run a custom build of libguestfs with the user-mode Linux (UML) backend on Rocky Linux 8. The UML kernel is built from the 4.14 sources from kernel.org.

Since the upgrade to kernel 4.18.0-372.x, some processes executed inside this guestfs UML kernel fail.
Logs of the libguestfs operation with debug/tracing enabled are attached (log-failure.txt, contrast with log-success.txt).

There are two types of failures that occur:
1. Inconsistency detected by ld.so: dl-version.c: 205: _dl_check_map_versions: Assertion `needed != NULL' failed!
2. Crash of ls /dev

Regarding 1, it seems ld.so silently fails to load certain libraries.
By instrumenting glibc's loader with additional debug statements, I determined that this occurs in _dl_map_object, when it checks whether a library with the same soname is already loaded. When the issue occurs, strcmp(name, soname) returns 0 (strings match) even though the strings do not actually match, so the loader concludes that a dependency is already loaded when in reality it is not. Adding a debug printf statement before the strcmp causes a subsequent strcmp on (presumably) the same values to return the correct result. The mismatch does not occur for every library loaded, but it does occur deterministically for each failing binary, and I am unable to identify what exactly triggers it. I am completely at a loss as to how a kernel update can result in such behavior.
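
A less invasive way to observe the same loader decisions, without patching glibc, might be glibc's LD_DEBUG tracing. The following is only a sketch: the helper name trace_loader and the output prefix are hypothetical, and it assumes the failing binary can be started with an environment you control inside the UML guest. Since adding a printf already changed the observed behavior, the tracing itself may also hide the problem.

```shell
# trace_loader PREFIX COMMAND [ARGS...]
# Runs COMMAND with glibc loader tracing enabled. glibc writes the trace to
# PREFIX.<pid> instead of stderr when LD_DEBUG_OUTPUT is set.
# (trace_loader is a hypothetical helper name, not part of any package.)
trace_loader() {
    prefix=$1
    shift
    # "libs" traces the library search, "versions" traces version dependency checks.
    env LD_DEBUG=libs,versions LD_DEBUG_OUTPUT="$prefix" "$@"
}
```

For example, `trace_loader /tmp/ld-trace ls /dev` would leave its trace in /tmp/ld-trace.<pid>; comparing the traces of a failing and a working run may show which library lookup is skipped.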

I did not do any analysis for failure 2 (the crash of ls binary).

Issue is 100% reproducible. Issue does not occur after reverting to kernel 4.18.0-348.20 or .23.

Given the same custom UML kernel and libguestfs binaries, upgrading only the kernel from 4.18.0-348.x to 4.18.0-372.x results in the failure.
Issue does not occur with the 5.18 kernel (kernel-ml from elrepo). Issue apparently also occurs on Rocky Linux 9.
Issue is reproducible if OS is running on KVM or vSphere, but is not reproducible when OS is running in VirtualBox (tested with Vagrant bento/rockylinux86).
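
To record which of these environments a given test machine falls into, the output of systemd-detect-virt can be mapped onto the observations above. This is a sketch: the function name reported_status is hypothetical, and the mapping assumes systemd-detect-virt reports "kvm" for KVM, "vmware" for vSphere and "oracle" for VirtualBox.

```shell
# reported_status VIRT
# Maps a systemd-detect-virt identifier to what this report observed.
# (reported_status is a hypothetical helper name.)
reported_status() {
    case "$1" in
        kvm|vmware) echo "reproducible per the report" ;;
        oracle)     echo "not reproducible per the report" ;;
        *)          echo "not covered by the report" ;;
    esac
}

# Print the status for the current machine if the tool is available.
if command -v systemd-detect-virt >/dev/null 2>&1; then
    # systemd-detect-virt exits non-zero on bare metal, hence "|| true".
    virt=$(systemd-detect-virt || true)
    echo "$virt: $(reported_status "$virt")"
fi
```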

This issue is critical for us because it is blocking kernel upgrades.
The UML kernel is critical for us to achieve performant isolation in environments where nested virtualization is not available.

Version-Release number of selected component (if applicable):
kernel-4.18.0-372.9.1.el8 and newer
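
A quick check of whether a host runs a kernel in the affected range could look like the sketch below. The function names are hypothetical, and the check only understands 4.18.0-based release strings; it treats any build number of 372 or higher as affected, which follows the "372.9.1 and newer" statement above rather than any confirmed list of builds.

```shell
# kernel_build RELEASE
# Extracts the build number from a 4.18.0 release string,
# e.g. "4.18.0-372.9.1.el8.x86_64" -> "372". Prints nothing for other kernels.
kernel_build() {
    printf '%s\n' "$1" | sed -n 's/^4\.18\.0-\([0-9][0-9]*\)\..*$/\1/p'
}

# is_affected_kernel RELEASE
# Succeeds if RELEASE is a 4.18.0 kernel with build number >= 372.
is_affected_kernel() {
    build=$(kernel_build "$1")
    [ -n "$build" ] && [ "$build" -ge 372 ]
}

release=$(uname -r)
if is_affected_kernel "$release"; then
    echo "$release: in the range reported as failing"
else
    echo "$release: not in the reported 4.18.0-372.x range"
fi
```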

How reproducible:
Issue is 100% reproducible.

Steps to Reproduce:

Relevant files are in https://files.hycu.com/d/56180e9f77e242f0975d/ (expires in 60 days)
- guestfs-uml.tgz - our custom build of libguestfs and the UML backend
- packages*.txt - list of installed packages from our local reproducer
- repro.sh - test script (also attached)

Procedure:
1. dnf install augeas-libs libconfig
2. rm -rf /usr/local/lib64
3. cd /
4. tar xvfz guestfs-uml.tgz
5. . repro.sh && prepare-test && run-test
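
When repeating the procedure on several kernels, it can help to keep one log per run. The wrapper below is a sketch and not part of repro.sh; run_logged is a hypothetical helper that captures the output of any command together with the running kernel release and the exit status.

```shell
# run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND, stores stdout and stderr in LOGFILE, then appends the
# kernel release and the exit status. Returns COMMAND's exit status.
# (run_logged is a hypothetical helper name.)
run_logged() {
    log=$1
    shift
    status=0
    "$@" >"$log" 2>&1 || status=$?
    printf 'kernel: %s\nexit status: %s\n' "$(uname -r)" "$status" >>"$log"
    return "$status"
}
```

After sourcing repro.sh as in step 5, a call such as `run_logged "run-$(uname -r).log" run-test` would produce one log file per kernel.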

Actual results:

Guestfs operation with UML backend fails (see log-failure.txt).

Expected results:

Guestfs operation with UML backend succeeds (see log-success.txt).
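
To tell the two outcomes apart mechanically, a log can be searched for the loader assertion quoted in the description. This sketch only detects failure type 1; the log text produced by the `ls /dev` crash is not quoted in this report, so it is not matched here. The function name is hypothetical.

```shell
# has_loader_assertion LOGFILE
# Succeeds if LOGFILE contains the ld.so assertion from failure type 1.
# (has_loader_assertion is a hypothetical helper name.)
has_loader_assertion() {
    grep -qF '_dl_check_map_versions: Assertion' "$1"
}

# Check the two attached logs if they are present in the current directory.
for log in log-failure.txt log-success.txt; do
    [ -f "$log" ] || continue
    if has_loader_assertion "$log"; then
        echo "$log: loader assertion present"
    else
        echo "$log: loader assertion absent"
    fi
done
```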

Additional info:

This issue is critical for us because it is blocking kernel upgrades.
The UML kernel is critical for us to achieve performant isolation in environments where nested virtualization is not available.

Comment 1 antal.nemes 2022-08-31 12:35:24 UTC
Created attachment 1908675 [details]
log showing the failure

Comment 2 antal.nemes 2022-08-31 12:35:49 UTC
Created attachment 1908676 [details]
log showing successful run

Comment 3 antal.nemes 2022-08-31 12:36:42 UTC
Created attachment 1908677 [details]
repro env - package list (NEVRA)

Comment 4 antal.nemes 2022-08-31 12:37:09 UTC
Created attachment 1908678 [details]
repro env - package list (name only)

Comment 5 Rafael Aquini 2022-09-21 18:51:56 UTC
User Mode Linux is a technology that is not supported by RHEL,
so there are no guarantees that backports will not break it
at any point in the product lifecycle.

Besides that, you state that you are running a custom-built environment
on top of a downstream clone of RHEL, which makes your case unsupportable.

Unfortunately, I have to tell you that you are on your own with the deployment
choices you made. However, if you manage to debug the problem yourself and
identify the set of upstream changes that makes UML work for you, and
at that point you still want RHEL to carry those changes,
please post them here and reopen this ticket.

For now, I'm going to close this ticket as NOTABUG, given that UML is not part
of the RHEL offerings.

Comment 6 antal.nemes 2022-09-21 21:37:38 UTC
> User Mode Linux is a technology that is not supported by RHEL, 
> so there's no guarantees that backports will not break it, 
> at any point in the product lifecycle.

While the effect is currently visible in UML, the implication is that a change
in the kernel clearly resulted in incorrect behavior of a user-space application.
Without further analysis, there is no telling what else would be affected (including
things that are supported by RHEL). Since this is reproducible, it is an opportunity
to identify the root cause (and by extension, its blast radius).

>  However, if you manage to debug the problem by yourself and 
> find out the set of changes (from upstream) that makes UML work for you ..

The issue does not occur upstream; otherwise I would not be opening a bug with Red Hat.
I would be happy to bisect the issue myself if I had granular backport changesets, but
to my knowledge, these are not available to me.
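
For illustration, this is the kind of bisection meant here, sketched as a binary search over an ordered list of candidate builds. Everything in it is hypothetical: first_bad and demo_is_good are made-up names, the integers are placeholder labels rather than real build numbers, and a real predicate would have to boot each candidate kernel and run the reproducer. The sketch assumes the first candidate is good, the last is bad, and there is a single good-to-bad transition.

```shell
# first_bad TEST_CMD V1 V2 ... Vn
# Binary search for the first candidate for which TEST_CMD fails.
# Assumes V1 is known good and Vn is known bad.
first_bad() {
    test_cmd=$1
    shift
    lo=1        # index of a candidate known to be good
    hi=$#       # index of a candidate known to be bad
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        eval "candidate=\${$mid}"
        if "$test_cmd" "$candidate"; then
            lo=$mid
        else
            hi=$mid
        fi
    done
    eval "printf '%s\n' \"\${$hi}\""
}

# Placeholder predicate: pretends that candidates below 6 are good.
demo_is_good() {
    [ "$1" -lt 6 ]
}

first_bad demo_is_good 1 2 3 4 5 6 7 8
```

With eight candidates the search needs three test runs instead of up to six, which matters when each run involves a reboot.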