Bug 1846343
| Summary: | vGPU: VM failed to run with mdev_type instance. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Nisim Simsolo <nsimsolo> |
| Component: | dracut | Assignee: | Lukáš Nykrýn <lnykryn> |
| Status: | CLOSED ERRATA | QA Contact: | Frantisek Sumsal <fsumsal> |
| Severity: | urgent | Priority: | unspecified |
| Version: | 8.2 | Target Release: | 8.0 |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | OS: | Unspecified |
| Fixed In Version: | dracut-049-92.git20200702.el8 | Doc Type: | If docs needed, set a value |
| Last Closed: | 2020-11-04 01:42:48 UTC | Type: | Bug |
| oVirt Team: | Virt | Bug Blocks: | 1852718 |
| CC: | abpatil, ahadas, alex.williamson, bugs, coli, dmarchan, dracut-maint-list, fsumsal, hbarcomb, juzhang, knoel, lnykryn, michal.skrivanek, mkalinin, mzamazal, nsimsolo, ovasik, zhguo | | |
| Attachments: | vdsm.log, engine.log, VM QEMU log | | |
Created attachment 1696752 [details]
vdsm.log
Created attachment 1696754 [details]
engine.log
Created attachment 1696755 [details]
VM QEMU log
> 2020-06-11 15:04:18,533+0300 ERROR (jsonrpc/1) [root] Couldn't parse NVDIMM
> device data (hostdev:755)
> Traceback (most recent call last):
> File "/usr/lib/python3.6/site-packages/vdsm/common/hostdev.py", line 753,
> in list_nvdimms
> data = json.loads(output)
> File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
> return _default_decoder.decode(s)
> File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
> obj, end = self.raw_decode(s, idx=_w(s, 0).end())
> File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
> raise JSONDecodeError("Expecting value", s, err.value) from None
> json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
> --------------------------
Milan, please keep me honest here - I think the error above wouldn't prevent the VM from starting, but this one does:
2020-06-11 15:16:28,067+0300 ERROR (vm/0cc01cbb) [virt.vm] (vmId='0cc01cbb-8cf0-499d-b7b9-afb822cde4f7') The vm start process failed (vm:871)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 801, in _startUnderlyingVm
self._run()
File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 2608, in _run
dom.createWithFlags(flags)
File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
ret = f(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
return func(inst, *args, **kwargs)
File "/usr/lib64/python3.6/site-packages/libvirt.py", line 1265, in createWithFlags
if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirt.libvirtError: internal error: Process exited prior to exec: libvirt: error : failed to access '/sys/bus/mdev/devices/71cba851-7aad-44e3-be0d-6046e4aa0c34/iommu_group': No such file or directory
2020-06-11 15:16:28,067+0300 INFO (vm/0cc01cbb) [virt.vm] (vmId='0cc01cbb-8cf0-499d-b7b9-afb822cde4f7') Changed state to Down: internal error: Process exited prior to exec: libvirt: error : failed to access '/sys/bus/mdev/devices/71cba851-7aad-44e3-be0d-6046e4aa0c34/iommu_group': No such file or directory (code=1) (vm:1629)
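A minimal diagnostic sketch (not part of the original thread), using the device UUID from the log above: the libvirt failure says it cannot find an iommu_group link under the mdev device node, so a quick first check is whether that link exists and which vfio/mdev modules are loaded.

# ls -l /sys/bus/mdev/devices/71cba851-7aad-44e3-be0d-6046e4aa0c34/
# lsmod | egrep '(vfio|mdev)'

If the device directory exists but iommu_group is missing, and vfio_mdev is absent from lsmod, the module chain rather than the engine configuration is the likely suspect, which is where the thread below ends up.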
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Arik, you're right, the JSON traceback is just a harmless annoyance in the log (already fixed in Vdsm master), the real error is the libvirt error. I think I've seen such an error in the past when the host wasn't properly configured for mdev. But it can also be a platform error or a change in el8.

Nisim, could you please check the host was booted with proper kernel command line options? I think `intel_iommu=on iommu=pt' should be present.

(In reply to Milan Zamazal from comment #6)
> Nisim, could you please check the host was booted with proper kernel command
> line options? I think `intel_iommu=on iommu=pt' should be present.

It's running with proper kernel cmdline:

# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-193.7.1.el8_2.x86_64 root=/dev/mapper/rhel_lion01-root ro crashkernel=auto resume=/dev/mapper/rhel_lion01-swap rd.lvm.lv=rhel_lion01/root rd.lvm.lv=rhel_lion01/swap rhgb quiet rdblacklist=nouveau intel_iommu=on

And both Nvidia vGPUs in this host are installed and running:

# nvidia-smi
Thu Jun 11 16:50:29 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.01    Driver Version: 450.36.01    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:84:00.0 Off |                  Off |
| N/A   38C    P8    24W / 150W |     14MiB /  8191MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 00000000:85:00.0 Off |                  Off |
| N/A   33C    P8    24W / 150W |     14MiB /  8191MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           On   | 00000000:8B:00.0 Off |                  Off |
| N/A   35C    P8    24W / 150W |     14MiB /  8191MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           On   | 00000000:8C:00.0 Off |                  Off |
| N/A   46C    P8    24W / 150W |     14MiB /  8191MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

OK, Nisim, let's check a bit more:

- Which vfio kernel modules are loaded? Specifically, is vfio_mdev loaded?
- Is there anything in /sys/kernel/iommu_groups/?
- Is there anything in /sys/class/iommu/?

(In reply to Milan Zamazal from comment #8)
> OK, Nisim, let's check a bit more:
>
> - Which vfio kernel modules are loaded? Specifically, is vfio_mdev loaded?
> - Is there anything in /sys/kernel/iommu_groups/?
> - Is there anything in /sys/class/iommu/?

# lsmod | grep nvidia_vgpu_vfio
nvidia_vgpu_vfio       53248  0
nvidia              19501056  10 nvidia_vgpu_vfio
mdev                   20480  1 nvidia_vgpu_vfio
vfio                   36864  1 nvidia_vgpu_vfio

# ls /sys/kernel/iommu_groups/
0  10  12  14  16  18  2   21  23  25  27  29  30  32  34  36  38  4   41  43  45  47  49  50  52  54  56  58  6   61  63  7   9
1  11  13  15  17  19  20  22  24  26  28  3   31  33  35  37  39  40  42  44  46  48  5   51  53  55  57  59  60  62  64  8

# cat /sys/kernel/iommu_groups/55/devices/0000\:84\:00.0/mdev_supported_types/nvidia-22/description
num_heads=4, frl_config=60, framebuffer=8192M, max_resolution=5120x2880, max_instance=1

# ls /sys/class/iommu/
dmar0  dmar1

# ls /sys/class/iommu/dmar0/devices/
0000:80:01.0  0000:80:03.0  0000:80:04.1  0000:80:04.3  0000:80:04.5  0000:80:04.7  0000:81:00.1  0000:83:08.0  0000:84:00.0  0000:86:00.0  0000:87:10.0  0000:8a:08.0  0000:8b:00.0
0000:80:02.0  0000:80:04.0  0000:80:04.2  0000:80:04.4  0000:80:04.6  0000:81:00.0  0000:82:00.0  0000:83:10.0  0000:85:00.0  0000:87:08.0  0000:89:00.0  0000:8a:10.0  0000:8c:00.0

How about the vfio_mdev module?

(In reply to Milan Zamazal from comment #10)
> How about the vfio_mdev module?

I can't see it in lsmod, but:

# modinfo vfio_mdev
filename:       /lib/modules/4.18.0-193.7.1.el8_2.x86_64/kernel/drivers/vfio/mdev/vfio_mdev.ko.xz
description:    VFIO based driver for Mediated device
author:         NVIDIA Corporation
license:        GPL v2
version:        0.1
rhelversion:    8.2
srcversion:     20FFF915712EA2E529A6752
depends:        mdev,vfio
intree:         Y
name:           vfio_mdev
vermagic:       4.18.0-193.7.1.el8_2.x86_64 SMP mod_unload modversions
sig_id:         PKCS#7
signer:         Red Hat Enterprise Linux kernel signing key
sig_key:        70:5C:5F:89:3D:91:85:84:58:94:B6:EC:AE:44:FF:B7:8A:27:82:5C
sig_hashalgo:   sha256
signature:      0D:98:63:4E:B0:22:B7:FD:D1:D2:1F:2B:17:57:B0:CB:7B:E4:C2:65:

After loading the vfio_mdev kernel module and rebooting the host,
it is now possible to run VM with vGPU:
# lsmod | grep nvidia_vgpu_vfio
nvidia_vgpu_vfio 53248 19
nvidia 19501056 145 nvidia_vgpu_vfio
mdev 20480 2 vfio_mdev,nvidia_vgpu_vfio
vfio 36864 6 vfio_mdev,nvidia_vgpu_vfio,vfio_iommu_type1
# ls -l /sys/bus/mdev/devices/
total 0
lrwxrwxrwx. 1 root root 0 Jun 11 18:14 06b068d1-92ae-469e-bdce-9243050092ef -> ../../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/0000:83:08.0/0000:84:00.0/06b068d1-92ae-469e-bdce-9243050092ef
# nvidia-smi
Thu Jun 11 18:31:59 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.01 Driver Version: 450.36.01 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla M60 On | 00000000:84:00.0 Off | Off |
| N/A 39C P8 24W / 150W | 2050MiB / 8191MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M60 On | 00000000:85:00.0 Off | Off |
| N/A 33C P8 24W / 150W | 14MiB / 8191MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla M60 On | 00000000:8B:00.0 Off | Off |
| N/A 36C P8 24W / 150W | 14MiB / 8191MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla M60 On | 00000000:8C:00.0 Off | Off |
| N/A 47C P8 24W / 150W | 14MiB / 8191MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 9640 C+G vgpu 2031MiB |
+-----------------------------------------------------------------------------+
#
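A stop-gap that is sometimes used to force the module at every boot, relevant to the discussion that follows, is a systemd modules-load.d drop-in; the file name below is hypothetical and this is not what the bug ultimately settles on (the eventual fix is in dracut, further down):

# cat /etc/modules-load.d/vfio_mdev.conf
vfio_mdev

systemd-modules-load.service reads such files at boot and loads the listed modules unconditionally, which corresponds to the "load it unconditionally" option weighed below.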
So the problem is that the vfio_mdev kernel module is not loaded by default.

It apparently used to be loaded automatically in 4.3; maybe there used to be a module dependency that caused it to be loaded, which is not present anymore.

What options do we have to make the module available? We can load it unconditionally. While it can probably be always loaded and may be harmless (is it?), it's of no use on hosts unless related hardware is used. Loading it on a vGPU VM start looks weird. But maybe we could check its presence on VM start failure and provide a hint in the log if the module is missing?

The question is whether oVirt should be responsible for loading the module at all. If the user is responsible for installing the vGPU drivers, perhaps the user should be responsible for making the module loaded too, and we should just mention the possible problem in the documentation? And which component or entity is responsible for adding the intel_iommu=on kernel command line option -- perhaps the same entity should add the module? Opinions?

Starts with documenting it and a clearer error message. We have kernel cmdline configuration in the UI so maybe we can do that there? Though... it rather looks like a bug. Maybe libvirt, kernel.

(In reply to Michal Skrivanek from comment #15)
> starts with documenting it and clearer error message

+1

> we have kernel cmdline configuration in UI so maybe we can do that there?
> though...it rather looks like a bug. maybe libvirt, kernel.

So I think the question is whether it is really a bug or an intentional change - in case of the latter we should probably report this on the host capabilities and schedule VMs accordingly, no?

No, modules are supposed to be loaded automatically by any decent OS. If there's a good reason why this one cannot be, then please find out and document that.

The `mdev' module has a soft post module dependency on `vfio_mdev'. Indeed, if I run `modprobe mdev' on my el8 machine then `vfio_mdev' gets (and apparently remains) loaded as well. I'll need to look into Nisim's environment to see why `vfio_mdev' is not loaded when `mdev' is.

[Alex, we have trouble with the vfio_mdev module not being loaded automatically on a vGPU host.]

Looking at Nisim's environment, which should be basically a freshly installed RHEL 8 machine with Nvidia drivers installed from rpm:

- After reboot, the vfio_mdev module is not loaded although nvidia_vgpu_vfio is:

# lsmod | egrep '(vfio|mdev)'
nvidia_vgpu_vfio       53248  0
nvidia              19501056  10 nvidia_vgpu_vfio
mdev                   20480  1 nvidia_vgpu_vfio
vfio                   36864  1 nvidia_vgpu_vfio

- Let's remove nvidia_vgpu_vfio, looks all right:

# modprobe -r nvidia_vgpu_vfio
# lsmod | egrep '(vfio|mdev)'

- Let's insert nvidia_vgpu_vfio again:

# modprobe nvidia_vgpu_vfio
# lsmod | egrep '(vfio|mdev)'
nvidia_vgpu_vfio       53248  0
vfio_mdev              16384  0
mdev                   20480  2 vfio_mdev,nvidia_vgpu_vfio
vfio_iommu_type1       32768  0
vfio                   36864  3 vfio_mdev,nvidia_vgpu_vfio,vfio_iommu_type1
nvidia              19501056  10 nvidia_vgpu_vfio

Both vfio_mdev and vfio_iommu_type1 modules are loaded now!

Alex, I was told you may be able to help. I can see in the kernel module dependencies that mdev soft depends on vfio_mdev. How is it possible that vfio_mdev gets loaded when nvidia_vgpu_vfio is inserted manually, but not after boot although nvidia_vgpu_vfio is present? Any idea, can there be something wrong with the initramfs or anything else? BTW, it used to work in RHEL 7.
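A hedged verification sketch, not from the original report, for the two things Milan asks about above: the declared soft-dependency chain and whether vfio_mdev actually made it into the generated initramfs. The commands are standard kmod/dracut tools; the path of modules.dep inside the image may be lib/... or usr/lib/... depending on the layout.

Show the post: soft dependencies declared by the hard dependencies of nvidia_vgpu_vfio (the same softdep lines are quoted in Alex's analysis further down):

# modinfo -F softdep mdev
post: vfio_mdev
# modinfo -F softdep vfio
post: vfio_iommu_type1 vfio_iommu_spapr_tce

Check whether the nvidia/vfio modules were pulled into the current initramfs, and whether vfio_mdev is resolvable inside it:

# lsinitrd /boot/initramfs-$(uname -r).img | egrep 'nvidia|vfio'
# lsinitrd -f usr/lib/modules/$(uname -r)/modules.dep /boot/initramfs-$(uname -r).img | grep vfio_mdev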
(In reply to Milan Zamazal from comment #19)
> [Alex, we have trouble with vfio_mdev module not being loaded automatically
> on a vGPU host.]
> [...]

In the second case, you're showing that the kernel module dependencies are all correct. My guess would be that some sort of local configuration on this system is causing the nvidia modules to get loaded from the initramfs but the soft dependencies aren't present. Is there anything under /etc/dracut.conf.d that might explain this? Or perhaps inspect the initramfs with lsinitrd? Cc Zhiyi who likely also has experience here.

(In reply to Alex Williamson from comment #20)
> [...]
> Is there anything under /etc/dracut.conf.d that might explain this? Or
> perhaps inspect the initramfs with lsinitrd? Cc Zhiyi who likely also has
> experience here.

Yes, I also hit this issue

(In reply to Guo, Zhiyi from comment #21)
> [...]
> Yes, I also hit this issue

Oops, I submitted the unfinished comment by mistake.

Yes, I also hit this issue in my environment. I think it cannot be reproduced with the 8.2.1 tree RHEL-8.2.1-20200508.n.0 (tested with a Tesla V100) and happens with a recent eng release (RHEL-8.2.1-20200608.n.0, tested with a Tesla T4 + SR-IOV mode). But after downgrading the kernel to the one included with RHEL-8.2.1-20200508.n.0 (4.18.0-193.2.1.el8_2), the issue still happens.

/etc/dracut.conf.d is empty (as well as /etc/dracut.conf). But after unpacking the initramfs, I can see a difference in module dependencies.

On the main system:

4.18.0-193.7.1.el8_2.x86_64/modules.order:kernel/drivers/vfio/mdev/vfio_mdev.ko
4.18.0-193.7.1.el8_2.x86_64/modules.dep:kernel/drivers/vfio/mdev/vfio_mdev.ko.xz: kernel/drivers/vfio/mdev/mdev.ko.xz kernel/drivers/vfio/vfio.ko.xz
4.18.0-193.7.1.el8_2.x86_64/modules.softdep:softdep mdev post: vfio_mdev

While in the initramfs:

4.18.0-193.7.1.el8_2.x86_64/modules.order:kernel/drivers/vfio/mdev/vfio_mdev.ko
4.18.0-193.7.1.el8_2.x86_64/modules.softdep:softdep mdev post: vfio_mdev

So while the soft dependency is defined there, it's missing in modules.dep. When I generate an initrd image manually, the dependency is present in modules.dep. Alex, what else can I check?

dracut appears to be failing us here and I don't see that this is a regression. The regression might simply be that something was installed that caused the initramfs for the kernel to be regenerated that wasn't previously. For a workaround you can create the following:
# cat /etc/dracut.conf.d/nvidia.conf
omit_drivers+="nvidia"
Then rebuild initramfs with 'dracut -f --regenerate-all'. On to the problem...
dracut seems to be identifying nvidia_vgpu_vfio as a module that it needs to add to the initramfs, presumably because of the alias:
# modinfo nvidia_vgpu_vfio
filename: /lib/modules/4.18.0-193.el8.x86_64/weak-updates/nvidia/nvidia-vgpu-vfio.ko
version: 440.87
supported: external
license: MIT
rhelversion: 8.2
srcversion: 5D4064D3E109D020922911D
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: nvidia,mdev,vfio
name: nvidia_vgpu_vfio
vermagic: 4.18.0-167.el8.x86_64 SMP mod_unload modversions
The nvidia module is similar (with a bunch of parm entries omitted here):
# modinfo nvidia
filename: /lib/modules/4.18.0-193.el8.x86_64/weak-updates/nvidia/nvidia.ko
alias: char-major-195-*
version: 440.87
supported: external
license: NVIDIA
rhelversion: 8.2
srcversion: EDD83534DD78C3B1B5A0F6E
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: ipmi_msghandler
name: nvidia
vermagic: 4.18.0-167.el8.x86_64 SMP mod_unload modversions
Our starting point is here:
depends: nvidia,mdev,vfio
The mdev and vfio modules are direct dependencies; those get added to the initramfs. The trouble starts here:
# modinfo mdev
filename: /lib/modules/4.18.0-193.el8.x86_64/kernel/drivers/vfio/mdev/mdev.ko.xz
softdep: post: vfio_mdev
# modinfo vfio
filename: /lib/modules/4.18.0-193.el8.x86_64/kernel/drivers/vfio/vfio.ko.xz
softdep: post: vfio_iommu_type1 vfio_iommu_spapr_tce
So our dependent modules have soft dependencies. This seems to be partially fixed by:
commit c38f9e980c1ee03151dd1c6602907c6228b78d30
Author: Harald Hoyer <harald>
Date: Tue Dec 4 10:02:45 2018 +0100
install/dracut-install.c: install module dependencies of dependencies
The trouble is that this carries forward a poor (imo) assumption made here when softdep support was first introduced:
commit 4cdee66c8ed5f82bbd0638e30d867318343b0e6c
Author: Jeremy Linton <lintonrjeremy>
Date: Mon Jul 2 23:25:05 2018 -0500
dracut-install: Support modules.softdep
Dracut uses the module deps to determine module dependencies
but that only works for modules with hard symbolic dependencies.
Some modules have dependencies created via callback API's or other
methods which aren't reflected in the modules.dep but rather in
modules.softdep through the use of "pre:" and "post:" commands
created in the kernel with MODULE_SOFTDEP().
Since in dracut we are only concerned about early boot, this patch
only looks at the pre: section of modules which are already being
inserted in the initrd under the assumption that the pre: section
lists dependencies required for the functionality of the module being
installed in the initrd.
That latter paragraph tries to make the argument that only pre: softdeps are required for functionality of the module, but we can see here that's not the case. In the case of these post: softdeps, the kernel module is making a request module call to provide the remainder of the functionality. In the case of mdev, the vfio-mdev driver is what bridges mdev devices into the vfio ecosystem. In the case of vfio, the softdep IOMMU backend drivers provides the functionality that actually makes vfio devices useful. If the module is not available when the kernel module makes a request for it, who would be responsible for manually loading that module later?
So in addition to pulling the functionality of c38f9e980c1e into RHEL, I think we also need something like:
diff --git a/install/dracut-install.c b/install/dracut-install.c
index 3d64ed7a..57f4c557 100644
--- a/install/dracut-install.c
+++ b/install/dracut-install.c
@@ -1484,6 +1484,8 @@ static int install_dependent_modules(struct kmod_list *modlist)
ret = kmod_module_get_softdeps(mod, &modpre, &modpost);
if (ret == 0)
ret = install_dependent_modules(modpre);
+ if (ret == 0)
+ ret = install_dependent_modules(modpost);
}
} else {
log_error("dracut_install '%s' '%s' ERROR", path, &path[kerneldirlen]);
@@ -1547,6 +1549,8 @@ static int install_module(struct kmod_module *mod)
ret = kmod_module_get_softdeps(mod, &modpre, &modpost);
if (ret == 0)
ret = install_dependent_modules(modpre);
+ if (ret == 0)
+ ret = install_dependent_modules(modpost);
}
return ret;
This completes what 4cdee66c8ed5 should have done originally so that we have pre: and post: softdep modules installed, both for the directly included module, but also for the dependent modules thanks to c38f9e980c1e.
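A quick post-fix check (a sketch, assuming a dracut build with the change above is installed): regenerate the initramfs and confirm the post: softdep modules now land next to the nvidia modules.

# dracut -f --regenerate-all
# lsinitrd /boot/initramfs-$(uname -r).img | egrep 'vfio_mdev|vfio_iommu_type1'

Without the change, the vfio_mdev module file is missing from the image, which matches the failure analyzed above.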
Moving to RHEL8/dracut to accept or refute this solution.
Another piece of the mystery here is why we don't see this issue more regularly. In my testing with the GRID 10.2 GA driver I'm ONLY able to reproduce when I use the rpm install. When I use the run file install, dracut never includes the nvidia modules in the initramfs, therefore the modules are only loaded from the filesystem where all dependencies are available.

It appears this is due to where the modules are installed. When using the rpm file, the modules reside here:

/lib/modules/`uname -r`/weak-updates/nvidia/nvidia-vgpu-vfio.ko
/lib/modules/`uname -r`/weak-updates/nvidia/nvidia.ko

When using the run file, the modules are instead installed here:

/lib/modules/`uname -r`/kernel/drivers/video/nvidia-vgpu-vfio.ko
/lib/modules/`uname -r`/kernel/drivers/video/nvidia.ko

If I use an rpm install and move the modules from the former location to the latter location (and update dependencies with 'depmod -a'), then dracut generates an initramfs WITHOUT the nvidia modules. If I use a run file install and move the modules to the "extra" directory for the kernel and update depmod, dracut generates an initramfs WITH the nvidia modules.

I'm tempted to suspect this is due to /etc/depmod.d/dist.conf:

#
# depmod.conf
#
# override default search ordering for kmod packaging
search updates extra built-in weak-updates

The comment and man page suggest this should only control ordering, for example as I read it, to allow modules in these directories to override modules that might be provided elsewhere. However, if I comment out this directive, the nvidia modules in the extra directory disappear from my initramfs.

To test the opposite, my system includes an audio device making use of the snd-hda-intel driver. This driver does not exist in my initramfs by default, but if I copy the module to the extra directory, update depmod, and regenerate, it does now appear in the initramfs. If I do the same with a sound driver for which I don't have hardware, snd-emu10k1, the module does not appear in the initramfs.

My hypothesis is therefore that any modules in the search path with an alias matching installed hardware will make it to the initramfs. Perhaps dracut folks can confirm this.

Lukáši, what do you think about Alex's analysis? Do you have any idea how to proceed? It would be helpful to know about your plans and/or estimates so that we can handle the issue on the RHV side accordingly.

I talked to Harald, dracut upstream, and he is fine with the patch Alex mentioned. I've made some slight modification, since the original version did not install the post modules when there was an error with the pre ones.

https://github.com/dracutdevs/dracut/pull/848

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (dracut bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4473

*** Bug 1898664 has been marked as a duplicate of this bug. ***
Description of problem:

After adding an Nvidia vGPU instance using WebAdmin -> VM -> host devices -> manage vGPU button, or using edit VM -> custom properties -> mdev_type, the VM fails to run with the following vdsm.log errors:

2020-06-11 15:04:14,007+0300 ERROR (vm/6099c96f) [virt.vm] (vmId='6099c96f-d79d-47ae-b39f-9489bc552cf0') The vm start process failed (vm:871)
Traceback (most recent call last):
.
.
libvirt.libvirtError: internal error: Process exited prior to exec: libvirt: error : failed to access '/sys/bus/mdev/devices/e1f27070-b062-4ea3-a689-89e37a56f677/iommu_group': No such file or directory

2020-06-11 15:04:18,533+0300 ERROR (jsonrpc/1) [root] Couldn't parse NVDIMM device data (hostdev:755)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/hostdev.py", line 753, in list_nvdimms
    data = json.loads(output)
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
--------------------------

vGPU Nvidia drivers are installed and the Nvidia service is running. Also, it is possible to see vGPU instances in the host, for example:

# /home/nsimsolo/vgpu_instances1.sh
mdev_type: nvidia-11
---
description: num_heads=2, frl_config=45, framebuffer=512M, max_resolution=2560x1600, max_instance=16
---
name: GRID M60-0B

mdev_type: nvidia-12
---
description: num_heads=2, frl_config=60, framebuffer=512M, max_resolution=2560x1600, max_instance=16
---
name: GRID M60-0Q

mdev_type: nvidia-13
---
description: num_heads=1, frl_config=60, framebuffer=1024M, max_resolution=1280x1024, max_instance=8
---
name: GRID M60-1A

mdev_type: nvidia-14
---
description: num_heads=4, frl_config=45, framebuffer=1024M, max_resolution=5120x2880, max_instance=8
---
name: GRID M60-1B
----------------

This issue is not related to the emulated machine type (it occurred on both pc-i440fx and Q35).

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.1.2-0.10.el8ev
vdsm-4.40.19-1.el8ev.x86_64
libvirt-daemon-6.0.0-22.module+el8.2.1+6815+1c792dc8.x86_64
qemu-kvm-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64
Nvidia host drivers (Tesla M60): NVIDIA-vGPU-rhel-8.2-450.36.01.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Browse Webadmin -> click on the VM name -> host devices tab -> manage vGPU, select an Nvidia instance and click the "save" button.
2. Run the VM.

Actual results:
VM failed to run

Expected results:
VM should run with the attached vGPU device.

Additional info:
vdsm.log and engine.log attached
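The /home/nsimsolo/vgpu_instances1.sh script quoted above is not attached to the bug; the following is a minimal sketch of what such a listing script might look like, assuming it only walks the standard mdev_supported_types sysfs layout that also appears earlier in this report:

#!/bin/bash
# Print every mdev type exposed by PCI devices on this host, with its
# description and marketing name, in a format similar to the output above.
for type_dir in /sys/bus/pci/devices/*/mdev_supported_types/*; do
    [ -d "$type_dir" ] || continue
    echo "mdev_type: $(basename "$type_dir")"
    echo "---"
    echo "description: $(cat "$type_dir/description")"
    echo "---"
    echo "name: $(cat "$type_dir/name")"
    echo
done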