Bug 1668380

Summary: Under Microsoft Windows Subsystem for Linux (WSL) - Error: rpmdb open failed [rhel8]
Product: Red Hat Enterprise Linux 8 Reporter: James Hartsock <hartsjc>
Component: rpm Assignee: Packaging Maintenance Team <packaging-team-maint>
Status: CLOSED NOTABUG QA Contact: BaseOS QE Security Team <qe-baseos-security>
Severity: medium Docs Contact:
Priority: medium    
Version: 8.0 CC: fweimer, ngompa13, pmatilai
Target Milestone: rc   
Target Release: 8.0   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1668379 Environment:
Last Closed: 2019-02-01 08:10:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1668379    
Bug Blocks: 1623566    
Attachments:
Description Flags
rpm output pre and post rebuilddb
none
tar of rpm.works & rpm.fails
none
opensuse strace of rpm and rebuilddb
none
RHEL 7 strace of rpm -qa before & after rpm --rebuilddb in WSL build 18890
none

Description James Hartsock 2019-01-22 15:41:14 UTC
+++ This bug was initially created as a clone of Bug #1668379 +++

Description of problem:
Fedora (and RHEL 8 Beta) systems running under WSL fail immediately with the errors below when dnf is run.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux release 8.0 Beta (Ootpa)

How reproducible:
Very

Steps to Reproduce:
1. Take a Fedora (or RHEL 8 Beta) container image, export it as a tar archive, and gzip it
   # podman pull registry.access.redhat.com/rhel8-beta
   # podman run -it rhel8-beta sleep 999999
   # podman ps
   CONTAINER ID
   <ContainID>
   # podman export <ContainID> -o rhel8.tar
   # podman kill   <ContainID>
   # gzip rhel8.tar

2. Use https://github.com/DDoSolitary/LxRunOffline on Windows

   C:\Temp>LxRunOffline-v3.3.2\LxRunOffline.exe install -n RHEL8 -d RHEL8 -f rhel8.tar.gz
   C:\Temp>LxRunOffline-v3.3.2\LxRunOffline.exe set-default -n RHEL8
   C:\Temp>LxRunOffline-v3.3.2\LxRunOffline.exe get-default
   RHEL*

3. Run an rpm or dnf command


Actual results:
[root@win10 Temp]# dnf list
Failed to set locale, defaulting to C
error: db5 error(12) from dbenv->open: Cannot allocate memory
error: db5 error(22) from dbenv->close: Invalid argument
error: cannot open Packages index using db5 - Cannot allocate memory (12)
error: cannot open Packages database in /var/lib/rpm
Error: Error: rpmdb open failed

Expected results:
rpm and dnf commands should run without rpmdb errors.


Additional info:

WSL info @ https://docs.microsoft.com/en-us/windows/wsl/install-win10

Comment 1 James Hartsock 2019-01-22 15:49:36 UTC
Some public discussion going on at:

https://github.com/Microsoft/WSL/issues/90
https://github.com/Microsoft/WSL/issues/3742

Comment 2 Panu Matilainen 2019-01-31 09:09:04 UTC
"Works in native Linux, doesn't in WSL" sounds simply like a bug in WSL to me.

Berkeley DB's shared environment has a history of being sensitive to odd kernel bugs (virtual memory and the like) on native Linux too, and this seems no different. Using a private environment would probably work around it (I presume 'rpm -qa' as a normal user does work?)
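
For background: a Berkeley DB shared environment keeps its lock and mutex regions in the `__db.00N` files next to the database, which every process maps with mmap(), while a private environment keeps them in process-local memory. The following is only a rough sketch, in Python rather than rpm's C, of the cross-process visibility that such a shared mapping depends on. The function name and the scratch file are hypothetical, and this is not the code path rpm or Berkeley DB actually runs.

```python
import mmap
import os
import tempfile


def shared_mapping_visible_across_processes(directory=None):
    """Return True if a child's store into a MAP_SHARED file mapping is seen by the parent.

    Hypothetical probe for illustration only; `directory` selects where the
    scratch file is created (default: the system temporary directory).
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.ftruncate(fd, mmap.PAGESIZE)
        region = mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED,
                           prot=mmap.PROT_READ | mmap.PROT_WRITE)
        pid = os.fork()
        if pid == 0:
            # Child: write a marker through the inherited mapping, then exit
            # without running any Python-level cleanup.
            region[0:4] = b"SHRD"
            os._exit(0)
        os.waitpid(pid, 0)
        # Parent: the marker should be visible through its own view of the mapping.
        visible = region[0:4] == b"SHRD"
        region.close()
        return visible
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print("visible" if shared_mapping_visible_across_processes() else "NOT visible")
```

Pointing `directory` at the filesystem that holds /var/lib/rpm would exercise the same storage the rpm database lives on.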

Comment 3 Panu Matilainen 2019-01-31 09:34:08 UTC
I have no means of testing anything on Windows, but if you can provide a strace of the failure (e.g. 'strace rpm -qa' output), I can at least take a look at it.

Comment 4 James Hartsock 2019-02-01 02:19:20 UTC
Created attachment 1525672 [details]
rpm output pre and post rebuilddb

Yes, rpm seems to work when run as a normal user.

Here is how I captured the strace data:
  [root@win10 temp]# strace -o rpm_-q_rpm.works -s 2048 -tvf rpm -q rpm
  rpm-4.11.3-35.el7.x86_64

  [root@win10 temp]# strace -o rpm_--rebuilddb.strace -s 2048 -tvf rpm --rebuilddb

  [root@win10 temp]# strace -o rpm_-q_rpm.fail1 -s 2048 -tvf rpm -q rpm
  Segmentation fault (core dumped)

  [root@win10 temp]# strace -o rpm_-q_rpm.fail2 -s 2048 -tvf rpm -q rpm
  <hangs ... kill -9 rpm in another window>
  Killed

Comment 5 James Hartsock 2019-02-01 02:21:59 UTC
Created attachment 1525673 [details]
tar of rpm.works & rpm.fails

Here is a tar of the /var/lib/rpm directory both before (rpm.works) and after the rebuild (rpm.fails).

It seems you can use this to reproduce the behavior on a normal RHEL 7 system. That may be enough to gather additional information on your own if needed.

Here I reproduce it on my RHEL 7 (csb) laptop:
  # uname -r
  3.10.0-891.el7.x86_64 <---- normal RHEL, not WSL

  # tar zxf ~jhartsoc/var-lib-rpm.tar.gz 
  # cd /var/lib
  # cp -arp rpm rpm.BACKUP
  # rm -rf rpm
  # cp -arp rpm.fails rpm
  # rpm -q rpm
  <hangs>

Comment 6 Panu Matilainen 2019-02-01 08:10:39 UTC
If mmap() failed with EINVAL or similar, we could deal with it, but as long as WSL pretends all is well, we can't help.

There are several tickets on WSL reporting how Berkeley DB and LMDB are broken because of mmap() issues, e.g.
https://github.com/Microsoft/WSL/issues/3451 and https://github.com/Microsoft/WSL/issues/658

A bug in WSL can only be fixed in WSL.
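
To illustrate the kind of silent misbehavior meant here, below is a minimal sketch (Python, with a hypothetical function name and scratch file) that checks whether a MAP_SHARED mapping and ordinary read/write calls on the same file agree with each other. It is a generic coherence probe, written on the assumption that the mmap() problems in the linked tickets are the relevant ones, and it is not a reproduction of this particular rpm failure.

```python
import mmap
import os
import tempfile


def mmap_is_coherent_with_file_io(directory=None):
    """Return True if a MAP_SHARED mapping and pread()/pwrite() on the same file agree.

    Hypothetical probe for illustration only; `directory` selects where the
    scratch file is created (default: the system temporary directory).
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.ftruncate(fd, mmap.PAGESIZE)
        view = mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            # Direction 1: store through the mapping, read back via the descriptor.
            view[0:4] = b"MAP1"
            view.flush()
            via_fd = os.pread(fd, 4, 0)
            # Direction 2: write via the descriptor, read back through the mapping.
            os.pwrite(fd, b"FILE", 8)
            via_map = view[8:12]
        finally:
            view.close()
        return via_fd == b"MAP1" and via_map == b"FILE"
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print("coherent" if mmap_is_coherent_with_file_io() else "INCOHERENT")
```

The point of the check is that neither mmap() nor the I/O calls need to return an error for the two views to disagree, which is the situation an application cannot detect from return codes alone.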

Comment 7 Panu Matilainen 2019-02-01 08:11:36 UTC
*** Bug 1668378 has been marked as a duplicate of this bug. ***

Comment 8 James Hartsock 2019-02-01 14:53:15 UTC
Created attachment 1525858 [details]
opensuse strace of rpm and rebuilddb

openSUSE does appear to work...


win10:/var/lib # cat /etc/SuSE-release
openSUSE 42.3 (x86_64)
VERSION = 42.3
CODENAME = Malachite
# /etc/SuSE-release is deprecated and will be removed in the future, use /etc/os-release instead

win10:/var/lib # strace -o suse-rpm_-q_rpm.before -s 2048 -tvf rpm -q rpm
rpm-4.11.2-13.7.x86_64

win10:/var/lib # strace -o suse-rpm_--rebuilddb.strace -s 2048 -tvf rpm --rebuilddb

win10:/var/lib # strace -o suse-rpm_-q_rpm.after -s 2048 -tvf rpm -q rpm
rpm-4.11.2-13.7.x86_64

Comment 9 Panu Matilainen 2019-02-04 07:28:48 UTC
Yeah, it "works" because they carry a patch to the shared environment of Berkeley DB (essentially disabling BDB level locking on concurrent access, the same as we do for unprivileged users) and then a bunch of other patches to try and deal with the consequences.

Comment 14 James Hartsock 2019-05-03 22:13:38 UTC
Created attachment 1562799 [details]
RHEL 7 strace of rpm -qa before & after rpm --rebuilddb in WSL build 18890

Microsoft claims this is fixed in build 18890:
  https://github.com/Microsoft/WSL/issues/3939#issuecomment-488429593
    Fixed in Windows Insider Build 18890 - https://github.com/MicrosoftDocs/WSL/blob/live/WSL/release-notes.md#build-18890


So with build 18890 it does work...
  [root@win10_build18890 ~]# rpm --rebuilddb
  [root@win10_build18890 ~]# echo $?
  0

  [root@win10_build18890 ~]# rpm -q rpm
  rpm-4.11.3-32.el7.x86_64

  [root@win10_build18890 ~]# rpm -qa 2>/dev/null | grep rpm-4
  rpm-4.11.3-32.el7.x86_64


But I still get mutex errors on stderr:
  [root@win10_build18890 ~]# rpm -qa 2>error.out | wc -l
  248

  [root@win10_build18890 ~]# sort error.out | uniq -c
        1 error: cannot open Name index using db5 - Cannot allocate memory (12)
    39313 error: rpmdb: BDB2034 unable to allocate memory for mutex; resize mutex region