Bug 1394862 - libdb: Assumes that internal condition variable layout never changes
Status: ASSIGNED
Product: Fedora
Classification: Fedora
Component: libdb
Version: 26
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assigned To: Petr Kubat
QA Contact: Fedora Extras Quality Assurance
Depends On:
Blocks: 1346768 1397087
Reported: 2016-11-14 10:35 EST by Jan Pazdziora
Modified: 2017-05-26 16:06 EDT (History)
24 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---


Attachments
upstream patch (30.94 KB, patch)
2017-05-16 04:07 EDT, Petr Kubat


External Trackers
Tracker ID Priority Status Summary Last Updated
Sourceware 21119 None None None 2017-02-13 07:22 EST

Description Jan Pazdziora 2016-11-14 10:35:21 EST
Description of problem:

Building Fedora rawhide container image from Dockerfile

FROM fedora:rawhide
RUN dnf install -y httpd

works fine.

However, adding RUN dnf upgrade -y glibc before the dnf install line causes the build process to hang.

Version-Release number of selected component (if applicable):

glibc-2.24.90-13.fc26.x86_64

How reproducible:

Deterministic for this particular Dockerfile on my machines, although it is not always apr-util or the second package on which the install gets stuck.

Steps to Reproduce:
1. Have Dockerfile

FROM fedora:rawhide
RUN dnf upgrade -y glibc
RUN dnf install -y httpd

2. Attempt to build the image: docker build -t rawhide .

Actual results:

Install  7 Packages

Total download size: 1.7 M
Installed size: 4.7 M
Downloading Packages:
--------------------------------------------------------------------------------
Total                                           1.1 MB/s | 1.7 MB     00:01     
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Installing  : apr-1.5.2-4.fc25.x86_64                                     1/7 
  Installing  : apr-util-1.5.4-3.fc24.x86_64                                2/7

and here the process hangs.

Expected results:

All packages installed, image created.

Additional info:

The same hang happens with plain docker run:

docker run --rm -ti fedora:rawhide bash -c 'dnf upgrade -y glibc && dnf install -y httpd'

With glibc-2.24.90-2.fc26.x86_64 (which is on the three-month-old fedora:rawhide image) things work, which is why I am filing this against glibc.
Comment 1 Jan Pazdziora 2016-11-14 10:38:05 EST
The OS and environment on the host do not matter -- I see the same hang on Fedora 24 with docker-1.10.3-54.gite03ddb8.fc24.x86_64 and on RHEL 7.3 with docker-1.12.3-4.el7.x86_64.
Comment 2 Florian Weimer 2016-11-14 10:38:50 EST
Please provide a process tree (ps axuf) when the hang occurs.  Attaching GDB to the hanging subprocess would be helpful, too, but it could prove difficult to get GDB to pick up the proper separate debuginfo.  Alternatively, please create a coredump of the hanging process with gcore and upload it somewhere.  Then we can analyze it offline.
Comment 3 Jan Pazdziora 2016-11-15 05:48:29 EST
The bt against the hung process is

#0  0x00007fd8520f82c1 in futex_wait (private=<optimized out>, expected=4294967295, futex_word=0x7fd842759c04) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=4294967295, futex_word=0x7fd842759c04) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_cond_destroy (cond=cond@entry=0x7fd842759be0) at pthread_cond_destroy.c:54
#3  0x00007fd846d969cf in __db_pthread_mutex_destroy (env=env@entry=0x55ac4daae720, mutex=mutex@entry=340) at ../../src/mutex/mut_pthread.c:757
#4  0x00007fd846d9610f in __db_tas_mutex_destroy (env=env@entry=0x55ac4daae720, mutex=mutex@entry=340) at ../../src/mutex/mut_tas.c:602
#5  0x00007fd846e4dda8 in __mutex_free_int (env=0x55ac4daae720, locksys=locksys@entry=1, indxp=indxp@entry=0x7fd84036f550) at ../../src/mutex/mut_alloc.c:248
#6  0x00007fd846e4e3c5 in __mutex_free (env=env@entry=0x55ac4daae720, indxp=indxp@entry=0x7fd84036f550) at ../../src/mutex/mut_alloc.c:217
#7  0x00007fd846eb329b in __memp_bhfree (dbmp=dbmp@entry=0x55ac4cc383f0, infop=0x55ac4daf5b30, mfp=mfp@entry=0x7fd8403052d0, hp=<optimized out>, 
    bhp=bhp@entry=0x7fd84036f550, flags=flags@entry=1) at ../../src/mp/mp_bh.c:663
#8  0x00007fd846eb5aca in __memp_fget (dbmfp=dbmfp@entry=0x55ac4d9e50f0, pgnoaddr=pgnoaddr@entry=0x7ffc5622bb5c, ip=ip@entry=0x7fd84045f770, txn=txn@entry=0x0, 
    flags=flags@entry=8, addrp=addrp@entry=0x7ffc5622bb60) at ../../src/mp/mp_fget.c:479
#9  0x00007fd846ebb4fc in __memp_ftruncate (dbmfp=dbmfp@entry=0x55ac4d9e50f0, txn=0x0, ip=0x7fd84045f770, pgno=pgno@entry=19, flags=flags@entry=0)
    at ../../src/mp/mp_method.c:856
#10 0x00007fd846e7199e in __db_free (dbc=dbc@entry=0x55ac4d9e8280, h=<optimized out>, flags=flags@entry=0) at ../../src/db/db_meta.c:525
#11 0x00007fd846e75ddb in __db_doff (dbc=dbc@entry=0x55ac4d9e8280, pgno=<optimized out>) at ../../src/db/db_overflow.c:479
#12 0x00007fd846da3a8f in __bam_ditem (dbc=dbc@entry=0x55ac4d9e8280, h=0x7fd84036b388, indx=indx@entry=7) at ../../src/btree/bt_delete.c:138
#13 0x00007fd846da8160 in __bam_iitem (dbc=dbc@entry=0x55ac4d9e8280, key=key@entry=0x7ffc5622c1d0, data=data@entry=0x7ffc5622c1a0, op=6, flags=flags@entry=0)
    at ../../src/btree/bt_put.c:421
#14 0x00007fd846da2e5d in __bamc_put (dbc=0x55ac4d9e8280, key=0x7ffc5622c1d0, data=0x7ffc5622c1a0, flags=14, pgnop=0x7ffc5622c0b4)
    at ../../src/btree/bt_cursor.c:2240
#15 0x00007fd846e5d1dc in __dbc_iput (dbc=0x55ac4daad330, key=0x7ffc5622c1d0, data=0x7ffc5622c1a0, flags=14) at ../../src/db/db_cam.c:2136
#16 0x00007fd846e5f70d in __dbc_put (dbc=dbc@entry=0x55ac4daad330, key=key@entry=0x7ffc5622c1d0, data=data@entry=0x7ffc5622c1a0, flags=<optimized out>, 
    flags@entry=14) at ../../src/db/db_cam.c:2049
#17 0x00007fd846e6c9c1 in __dbc_put_pp (dbc=0x55ac4daad330, key=0x7ffc5622c1d0, data=0x7ffc5622c1a0, flags=14) at ../../src/db/db_iface.c:2751
#18 0x00007fd84754b235 in dbiCursorPut.isra.5 () from /lib64/librpm.so.7
#19 0x00007fd84754d211 in updateIndex.part.8 () from /lib64/librpm.so.7
#20 0x00007fd84754d5e1 in db3_idxdbPut () from /lib64/librpm.so.7
#21 0x00007fd847553168 in tag2index () from /lib64/librpm.so.7
#22 0x00007fd847556c6b in rpmdbAdd () from /lib64/librpm.so.7
#23 0x00007fd84756a86d in rpmpsmRun () from /lib64/librpm.so.7
#24 0x00007fd84757e125 in rpmteProcess () from /lib64/librpm.so.7
#25 0x00007fd847584a3e in rpmtsRun () from /lib64/librpm.so.7
#26 0x00007fd84559d204 in rpmts_Run () from /usr/lib64/python3.5/site-packages/rpm/_rpm.cpython-35m-x86_64-linux-gnu.so
#27 0x00007fd8523b67e9 in PyCFunction_Call (func=<built-in method run of TransactionSet object at remote 0x7fd843940b28>, 
    args=(<method at remote 0x7fd84458a1c8>, '', 64), kwds=<optimized out>) at /usr/src/debug/Python-3.5.2/Objects/methodobject.c:98
#28 0x00007fd85236eeb7 in PyObject_Call (func=<built-in method run of TransactionSet object at remote 0x7fd843940b28>, arg=<optimized out>, kw=<optimized out>)
    at /usr/src/debug/Python-3.5.2/Objects/abstract.c:2165
#29 0x00007fd852425d17 in PyEval_CallObjectWithKeywords (func=func@entry=<built-in method run of TransactionSet object at remote 0x7fd843940b28>, 
    arg=arg@entry=(<method at remote 0x7fd84458a1c8>, '', 64), kw=kw@entry=0x0) at /usr/src/debug/Python-3.5.2/Python/ceval.c:4609
#30 0x00007fd852389b38 in methoddescr_call (descr=<optimized out>, args=(<method at remote 0x7fd84458a1c8>, '', 64), kwds=0x0)
    at /usr/src/debug/Python-3.5.2/Objects/descrobject.c:250
#31 0x00007fd85236eeb7 in PyObject_Call (func=<method_descriptor at remote 0x7fd845bfd708>, arg=<optimized out>, kw=<optimized out>)
    at /usr/src/debug/Python-3.5.2/Objects/abstract.c:2165
#32 0x00007fd85242a491 in do_call (nk=<optimized out>, na=4, pp_stack=0x7ffc5622ca60, func=<optimized out>) at /usr/src/debug/Python-3.5.2/Python/ceval.c:4965
#33 call_function (oparg=<optimized out>, pp_stack=0x7ffc5622ca60) at /usr/src/debug/Python-3.5.2/Python/ceval.c:4761
#34 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x7fd842bae048, for file /usr/lib64/python3.5/site-packages/rpm/transaction.py, line 103, in run (self=<TransactionSet(_probFilter=64) at remote 0x7fd843940b28>, callback=<method at remote 0x7fd84458a1c8>, data=''), throwflag=throwflag@entry=0) at /usr/src/debug/Python-3.5.2/Python/ceval.c:3260
#35 0x00007fd85242dcfb in fast_function (nk=<optimized out>, na=<optimized out>, n=3, pp_stack=0x7ffc5622cba0, func=<optimized out>)
    at /usr/src/debug/Python-3.5.2/Python/ceval.c:4832
#36 call_function (oparg=<optimized out>, pp_stack=0x7ffc5622cba0) at /usr/src/debug/Python-3.5.2/Python/ceval.c:4759
#37 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x55ac4da0bba8, for file /usr/lib/python3.5/site-packages/dnf/base.py, line 735, in _run_transaction (self=<BaseCli(_tempfile_persistor=None, _ds_callback=<DepSolveProgressCallBack(loops=1) at remote 0x7fd843950710>, _comps=None, _group_persistor=None, _plugins=<Plugins(plugin_cls=[<type at remote 0x55ac4bdef758>, <type at remote 0x55ac4bde9c38>, <type at remote 0x55ac4bdef3a8>, <type at remote 0x55ac4bdf24b8>, <type at remote 0x55ac4bdfcdd8>, <type at remote 0x55ac4bdf1f58>, <type at remote 0x55ac4bde8a38>, <type at remote 0x55ac4bdedaa8>, <type at remote 0x55ac4bdf3f08>, <type at remote 0x55ac4bdf3b58>], plugins=[<DebuginfoInstall(cli=<Cli(cmdstring='dnf install -y httpd ', command=<InstallCommand(cli=<...>, opts=<Namespace(ip_resolve=None, debuglevel=None, command=['install'], help=False, color=None, setopts=[], allowerasing=None, rpmverbosity=None, quiet=None, repos_ed=[], showdupesfromrepos=None, releasever=None, verbose=None, plugins=None, assumeno=None, version=None, excludepkgs=[], debugsolver=N...(truncated), throwflag=throwflag@entry=0) at /usr/src/debug/Python-3.5.2/Python/ceval.c:3260
#38 0x00007fd85242f5c3 in _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x55ac4c8bac90, 
    argcount=1, kws=0x55ac4c8bac98, kwcount=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name='_run_transaction', qualname='Base._run_transaction')
    at /usr/src/debug/Python-3.5.2/Python/ceval.c:4047
#39 0x00007fd85242be39 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffc5622cdb0, func=<optimized out>)
    at /usr/src/debug/Python-3.5.2/Python/ceval.c:4842
#40 call_function (oparg=<optimized out>, pp_stack=0x7ffc5622cdb0) at /usr/src/debug/Python-3.5.2/Python/ceval.c:4759
#41 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x55ac4c8baaa8, for file /usr/lib/python3.5/site-packages/dnf/base.py, line 661, in do_transaction (self=<BaseCli(_tempfile_persistor=None, _ds_callback=<DepSolveProgressCallBack(loops=1) at remote 0x7fd843950710>, _comps=None, _group_persistor=None, _plugins=<Plugins(plugin_cls=[<type at remote 0x55ac4bdef758>, <type at remote 0x55ac4bde9c38>, <type at remote 0x55ac4bdef3a8>, <type at remote 0x55ac4bdf24b8>, <type at remote 0x55ac4bdfcdd8>, <type at remote 0x55ac4bdf1f58>, <type at remote 0x55ac4bde8a38>, <type at remote 0x55ac4bdedaa8>, <type at remote 0x55ac4bdf3f08>, <type at remote 0x55ac4bdf3b58>], plugins=[<Debugi---Type <return> to continue, or q <return> to quit---

The ps output in the container is

# ps axuwwf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        11  0.0  0.0  12588  3716 ?        Ss   10:46   0:00 bash
root        34  0.0  0.0  39828  3256 ?        R+   10:48   0:00  \_ ps axuwwf
root         1  1.3  2.2 541256 91700 ?        Ss+  10:46   0:01 /usr/libexec/system-python /usr/bin/dnf install -y httpd
Comment 4 Florian Weimer 2016-11-15 05:56:46 EST
(In reply to Jan Pazdziora from comment #3)
> The bt against the hung process is
> 
> #0  0x00007fd8520f82c1 in futex_wait (private=<optimized out>,
> expected=4294967295, futex_word=0x7fd842759c04) at
> ../sysdeps/unix/sysv/linux/futex-internal.h:61
> #1  futex_wait_simple (private=<optimized out>, expected=4294967295,
> futex_word=0x7fd842759c04) at ../sysdeps/nptl/futex-internal.h:135
> #2  __pthread_cond_destroy (cond=cond@entry=0x7fd842759be0) at
> pthread_cond_destroy.c:54

This could be due to glibc-swbz13165.patch (new condition variable implementation).
Comment 5 Florian Weimer 2016-11-15 08:41:29 EST
I was able to reproduce it, but only in a Docker image.  Here's the condvar state:

(gdb) print *cond
$2 = {__data = {{__wseq = 0, __wseq32 = {__low = 0, __high = 0}}, {
      __g1_start = 0, __g1_start32 = {__low = 0, __high = 0}}, 
    __g_refs = {0, 0}, __g_size = {0, 0}, 
    __g1_orig_size = 4294967295, __wrefs = 4294967295, 
    __g_signals = {0, 0}}, 
  __size = '\000' <repeats 32 times>, "\377\377\377\377\377\377\377\377\000\000\000\000\000\000\000", __align = 0}
Comment 6 Torvald Riegel 2016-11-15 10:19:53 EST
It seems that the root cause is that the program tries to use a condvar instance that has been initialized with a prior version of glibc.  The condvar appears to be in a database file (/var/lib/rpm/__db.002 in our reproducer).  With the (glibc-internal) condvar definition in the old glibc version, the bits represent a valid, unused condvar:
$4 = {__data = {__lock = 0, __futex = 0, __total_seq = 0, 
                __wakeup_seq = 0,  __woken_seq = 0, __mutex = 
                0xffffffffffffffff, __nwaiters = 0,  __broadcast_seq = 0},

The same bits do not represent a valid condvar if given the condvar definition in the new version of glibc.

Storing the condvar in the database file means that effectively, the program is expecting different versions of glibc to use the same bit representations of internal data structures.  This is not guaranteed to work.

It seems that either BDB should be changed to not store the condvar in database files, or dnf needs to reinitialize the databases it relies on when glibc is updated.
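For illustration only, here is a minimal C sketch (not libdb's actual code) of the pattern being described: a pthread_cond_t embedded in a file-backed MAP_SHARED mapping, initialized once and then reused by whichever process maps the file later. The file name and region layout are hypothetical; the point is that the bytes that end up on disk are the glibc-internal representation, which a later glibc is free to reinterpret.

~~~
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical region header, standing in for libdb's __db.00* region files. */
struct region {
    pthread_cond_t cv;          /* mutexes are handled the same way */
};

int main(void)
{
    int fd = open("region.db", O_RDWR | O_CREAT, 0600);   /* hypothetical file */
    if (fd < 0 || ftruncate(fd, sizeof(struct region)) != 0)
        return 1;

    struct region *r = mmap(NULL, sizeof *r, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    if (r == MAP_FAILED)
        return 1;

    /* The creator initializes the condvar with the process-shared attribute
     * and relies on the bytes in the file remaining meaningful afterwards.
     * POSIX only guarantees them while some process keeps the region mapped;
     * after a glibc update the same bytes may no longer describe a valid
     * condvar, which is what the futex_wait hang above shows. */
    pthread_condattr_t ca;
    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&r->cv, &ca);
    pthread_condattr_destroy(&ca);

    /* ...later processes mmap the same file and call pthread_cond_wait(),
     * pthread_cond_broadcast() or pthread_cond_destroy() on r->cv... */

    munmap(r, sizeof *r);
    close(fd);
    return 0;
}
~~~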
Comment 7 Florian Weimer 2016-11-15 10:23:57 EST
The hang goes away if I add a “rpm --rebuilddb” step to the container build file:

FROM fedora:rawhide
RUN dnf upgrade -y glibc
RUN rpm --rebuilddb
RUN dnf install -y httpd
Comment 8 Jan Pazdziora 2016-11-15 11:15:42 EST
(In reply to Torvald Riegel from comment #6)
> It seems that the root cause is that the program tries to use a condvar
> instance that has been initialized with a prior version of glibc.

Thank you for the analysis.

> The
> condvar appears to be in a database file (/var/lib/rpm/__db.002 in our
> reproducer).  With the (glibc-internal) condvar definition in the old glibc
> version, the bits represent a valid, unused condvar:
> $4 = {__data = {__lock = 0, __futex = 0, __total_seq = 0, 
>                 __wakeup_seq = 0,  __woken_seq = 0, __mutex = 
>                 0xffffffffffffffff, __nwaiters = 0,  __broadcast_seq = 0},
> 
> The same bits do not represent a valid condvar if given the condvar
> definition in the new version of glibc.

Can the same issue happen during upgrade from (say) Fedora 24 to rawhide?

How hard would it be to make this old unused bit value also understood as unused value in new glibc?

> Storing the condvar in the database file means that effectively, the program
> is expecting different versions of glibc to use the same bit representations
> of internal data structures.  This is not guaranteed to work.
> 
> It seems that either BDB should be changed to not store the condvar in
> database files, or dnf needs to reinitialize the databases it relies on when
> glibc is updated.

It would likely need to be rpm, not dnf.
Comment 9 Carlos O'Donell 2016-11-15 11:46:50 EST
(In reply to Torvald Riegel from comment #6)
> It seems that the root cause is that the program tries to use a condvar
> instance that has been initialized with a prior version of glibc.  The
> condvar appears to be in a database file (/var/lib/rpm/__db.002 in our
> reproducer).  With the (glibc-internal) condvar definition in the old glibc
> version, the bits represent a valid, unused condvar:
> $4 = {__data = {__lock = 0, __futex = 0, __total_seq = 0, 
>                 __wakeup_seq = 0,  __woken_seq = 0, __mutex = 
>                 0xffffffffffffffff, __nwaiters = 0,  __broadcast_seq = 0},
> 
> The same bits do not represent a valid condvar if given the condvar
> definition in the new version of glibc.
> 
> Storing the condvar in the database file means that effectively, the program
> is expecting different versions of glibc to use the same bit representations
> of internal data structures.  This is not guaranteed to work.
> 
> It seems that either BDB should be changed to not store the condvar in
> database files, or dnf needs to reinitialize the databases it relies on when
> glibc is updated.

This is unsupported by POSIX; you must _always_ reinitialize the POSIX thread objects, per:
~~~
2.9.9 Synchronization Object Copies and Alternative Mappings

For barriers, condition variables, mutexes, and read-write locks, [TSH] [Option Start]  if the process-shared attribute is set to PTHREAD_PROCESS_PRIVATE, [Option End]  only the synchronization object at the address used to initialize it can be used for performing synchronization. The effect of referring to another mapping of the same object when locking, unlocking, or destroying the object is undefined. [TSH] [Option Start]  If the process-shared attribute is set to PTHREAD_PROCESS_SHARED, only the synchronization object itself can be used for performing synchronization; however, it need not be referenced at the address used to initialize it (that is, another mapping of the same object can be used). [Option End]  The effect of referring to a copy of the object when locking, unlocking, or destroying it is undefined.

For spin locks, the above requirements shall apply as if spin locks have a process-shared attribute that is set from the pshared argument to pthread_spin_init(). For semaphores, the above requirements shall apply as if semaphores have a process-shared attribute that is set to PTHREAD_PROCESS_PRIVATE if the pshared argument to sem_init() is zero and set to PTHREAD_PROCESS_SHARED if pshared is non-zero.
~~~

The serialization of the POSIX thread object into the database is undefined behaviour as noted under "The effect of referring to a copy of the object when locking, unlocking, or destroying it is undefined."

You may only use initialized objects.

I'm moving this to libdb to get fixed.
Comment 10 Florian Weimer 2016-11-15 11:57:49 EST
(In reply to Carlos O'Donell from comment #9)

Torvald and I looked at the POSIX wording as well, and it is not sufficiently clear if the libdb usage is actually undefined.  POSIX likely intends this to be undefined: some implementations do not have a unified page cache, and the effect of the mutex/condition variable initialization may happen completely outside the context of the file mapping.

The key question is whether you may continue to use the resource after the process which has initialized it unmapped the file and exited. The historic Berkeley DB use case requires the initialization to persist, although I do think it is problematic, as explained above. Even the existing database environment liveness checks are likely insufficient because they cannot detect the case where all applications detached in an orderly fashion and then reattach after a glibc update.
Comment 11 Florian Weimer 2016-11-15 15:30:04 EST
I found where POSIX says that the libdb usage is undefined. It's in the specification for mmap and munmap:

“The state of synchronization objects such as mutexes, semaphores, barriers, and conditional variables placed in shared memory mapped with MAP_SHARED becomes undefined when the last region in any process containing the synchronization object is unmapped.”

If we posit that glibc updates need system reboots, this means that the format of these data structures on disk does not have to be retained across glibc updates.
Comment 12 Carlos O'Donell 2016-11-15 15:58:10 EST
(In reply to Florian Weimer from comment #11)
> I found where POSIX says that the libdb usage is undefined.  It's in the
> specfication for mmap and unmmap:
> 
> “The state of synchronization objects such as mutexes, semaphores, barriers,
> and conditional variables placed in shared memory mapped with MAP_SHARED
> becomes undefined when the last region in any process containing the
> synchronization object is unmapped.”
> 
> If we posit that glibc updates need system reboots, this means that the
> format of these data structures on disk does not have to be retained across
> glibc updates.

Correct, with the alignment and size being constrained by the ABI, so that needs to remain constant, but yes, the internal details of the type and the bit patterns and their meanings can and may change from release to release. You must reinitialize the objects before using them.

(In reply to Florian Weimer from comment #10)
> (In reply to Carlos O'Donell from comment #9)
> 
> Torvald and I looked at the POSIX wording as well, and it is not
> sufficiently clear if the libdb usage is actually undefined.  POSIX likely
> intends this to be undefined: some implementations do not have a unified
> page cache, and the effect of the mutex/condition variable initialization
> may happen completely outside the context of the file mapping.

The "why" is covered in the non-normative text under "Alternate Implementations Possible" here:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutex_init.html

> The key question is whether you may continue to use the resource after the
> process which has initialized it unmapped the file and exited.  The historic
> Berkeley DB use case requires the initialization to persist, although I do
> think it is problematic, as explained above.  Even the existing database
> environment liveness checks are likely insufficient because they cannot
> detect the case where all applications detached orderly, and reattach after
> a glibc update.

You cannot continue to use the synchronization resource after the last mapping is removed, as you noted in comment 11.

(In reply to Jan Pazdziora from comment #8)
> (In reply to Torvald Riegel from comment #6)
> > The
> > condvar appears to be in a database file (/var/lib/rpm/__db.002 in our
> > reproducer).  With the (glibc-internal) condvar definition in the old glibc
> > version, the bits represent a valid, unused condvar:
> > $4 = {__data = {__lock = 0, __futex = 0, __total_seq = 0, 
> >                 __wakeup_seq = 0,  __woken_seq = 0, __mutex = 
> >                 0xffffffffffffffff, __nwaiters = 0,  __broadcast_seq = 0},
> > 
> > The same bits do not represent a valid condvar if given the condvar
> > definition in the new version of glibc.
> 
> Can the same issue happen during upgrade from (say) Fedora 24 to rawhide?

Yes, the same issue may happen when upgrading from Fedora 24 to Rawhide. It is conceivable that this issue becomes a blocker for F26 if it reproduces consistently during installs or upgrades.

This needs to be fixed in Berkeley DB; this is an error in the use of the POSIX threads primitives which prevents the implementation from moving forward with beneficial changes, e.g. correctness and performance improvements for condition variables (the reason the internal type representation changed).

> How hard would it be to make this old unused bit value also understood as
> unused value in new glibc?

TLDR; we would never support detecting of the old unused bit value. The cost is too high for other applications.

You would have to detect all possible marshalled bit patterns and reinitialize the condition variable in all API interfaces that might be passed the unmarshalled state. The cost is non-zero, the cost is on the hot path for all other applications that use the type (even with versioning apps pay the price until they are recompiled), and it would be only to support an undefined use.

> > Storing the condvar in the database file means that effectively, the program
> > is expecting different versions of glibc to use the same bit representations
> > of internal data structures.  This is not guaranteed to work.
> > 
> > It seems that either BDB should be changed to not store the condvar in
> > database files, or dnf needs to reinitialize the databases it relies on when
> > glibc is updated.
> 
> It would likely need to be rpm, not dnf.

It would actually be in Berkeley DB, which is where we have assigned this bug to e.g. libdb.
Comment 13 Carlos O'Donell 2016-11-15 16:08:36 EST
(In reply to Carlos O'Donell from comment #12)
> > How hard would it be to make this old unused bit value also understood as
> > unused value in new glibc?
> 
> TLDR; we would never support detecting of the old unused bit value. The cost
> is too high for other applications.
> 
> You would have to detect all possible marshalled bit patterns and
> reinitialize the condition variable in all API interfaces that might be
> passed the unmarshalled state. The cost is non-zero, the cost is on the hot
> path for all other applications that use the type (even with versioning apps
> pay the price until they are recompiled), and it would be only to support an
> undefined use.

FYI.

For the sake of openness and transparency, there are two cases which glibc doesn't support but could; they are the following:

(1) Mixed implementations with PTHREAD_PROCESS_SHARED.

(a) Process A running glibc X initializes a condition variable in shared memory with PTHREAD_PROCESS_SHARED.
(b) Process B running glibc X+1 (with a different condition variable definition) attaches to that shared memory and tries to use it.

It is clear that process A and process B have two different understandings of the condition variable. The use of PTHREAD_PROCESS_SHARED would have to trigger some kind of versioned data interface, e.g. a condvar with a version field which detects the version in use and switches to the appropriate algorithm (a sketch of the general idea follows at the end of this comment). This would incur significant costs when using PTHREAD_PROCESS_SHARED objects, and the use case has never become important enough for us to support it. Either way the glibc upgrade for process B required a reboot, so as Florian argues you need to reboot, in which case you get a consistent view of the condvar and everything works optimally. The text as written in POSIX argues the above should be supported, but glibc doesn't support it.

(2) Mixed 32-bit and 64-bit implementations with PTHREAD_PROCESS_SHARED (similar to (1)).

While this is similar to (1) and can be considered a case of "another implementation of narrower type" I want to call it out separately.

The other case is a mixed 32-bit and 64-bit process case where the processes share types in shared memory, and this doesn't work correctly for the obvious reason that the types are different ABIs. Making it work could again be part of making PTHREAD_PROCESS_SHARED truly "process shared" across an ABI boundary for 32-bit and 64-bit code (reserve a bit in the version field for an ABI flag). Again this use case has never been important enough from a user perspective that we would attempt to fix it (though we face a similar problem in nscd with shared maps of the database caches).
Comment 14 Torvald Riegel 2016-11-16 05:43:58 EST
I had a quick look at src/mutex/mut_pthread.c, and I'd like to stress that this applies to all pthreads synchronization primitives, not just condvars.  libdb should not try to persist instances of these types; if it does, the database becomes tied to a particular glibc version.  A new rwlock implementation is in the queue for testing in Rawhide, and we may alter plain mutexes too in the future.

libdb will either have to implement custom synchronization, or it has to require that all clients use the same glibc version and architecture (including 32-bit vs. 64-bit on x86_64, for example).
Comment 15 Fedora Admin XMLRPC Client 2016-11-16 08:31:44 EST
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.
Comment 16 Torvald Riegel 2017-01-03 08:29:49 EST
glibc upstream now contains the new condition variable.  Has any work been started on the libdb (users) side to fix this libdb bug?

libdb has to be fixed very soon or we will get into issues when updating glibc.  I can help with how to fix this in libdb, but I do need help from the libdb maintainers, in particular in terms of (1) testing, (2) ensuring that all users of libdb enable whatever solution we come up with, and (3) notifying other distributions about this.

Who are the people that should be involved in this?  I suppose upstream-first isn't possible in this case?
Comment 17 Petr Kubat 2017-01-03 08:54:28 EST
Thanks for bringing this up.
I have not yet managed to take a deeper look at this but I can see how this might prove problematic in the near future. I will definitely appreciate any help you can offer with fixing libdb.
As for testing this change, libdb has some kind of test suite that might be useful but has not been used yet in any way. I will take a look at it and see if I can get it working.

I am not sure whether we are going to get any support from upstream with this issue given that we are running on an older (possibly already unsupported?) release, but I can try asking anyway.
Comment 18 Torvald Riegel 2017-01-03 12:31:42 EST
Note that I haven't ever worked with or on libdb before, so it would be good to get people involved that have some experience.  I don't have the spare time to see this through in detail.  I can help regarding how to use pthreads properly and with general concurrency considerations.

To reiterate, the problem is that libdb uses the glibc-internal binary representation in databases, and either (1) libdb or (2) how dnf uses libdb causes new glibc code to try to work on the binary representation written by old glibc code.

I'm splitting this up into (1) and (2) because it is not yet clear to me who exactly is at fault here.  libdb's documentation on upgrading libdb (see docs/upgrading/upgrade_process.html in libdb's source) is rather vague and fails to include glibc (or whatever provides pthreads) in its consideration, but does mention the compiler; this is odd at least from our perspective, because things like data structure layout are part of the ABI guarantees but glibc-internal representations are not part of the ABI.  libdb could state that any pthreads-implementation update is to be considered a libdb major/minor update -- but it does not do that AFAIK, and it's unclear whether that would help.

libdb could require that only the same build of libdb, including the same build of the pthreads implementation, is active on a given database concurrently.  If it shut down all mutexes when there are no references left and reinitialized them when a new reference comes in, this should be sufficient to not trigger the glibc update problem under that requirement.  In this case, it may simply be dnf's fault if dnf does not enforce that no process holds a reference on dnf's databases while glibc is upgraded.
Second, I have looked a bit at the code for libdb mutex, mutex regions, and environments, but I haven't figured out yet whether libdb actually destroys all mutexes when the last reference to a database is removed.
Nonetheless, making this scenario work could be a solution.

Another solution could be to say that glibc upgrades are like libdb major/minor upgrades, and then follow (some of) the steps outlined in docs/upgrading/upgrade_process.html for every user of libdb.  This requires touching more packages, but may be easier to enforce than the first solution, which requires checking whether libdb clients cannot hold a reference across updates of glibc.

A third solution would be to version databases and include glibc's version in there.  But that doesn't provide automatic capability, and would probably require quite a few changes to libdb.

I guess it would be best to look at the first solution first, starting with trying to find out what libdb actually guarantees or intends to guarantee.  Any suggestions for where to get this information?
Comment 19 Panu Matilainen 2017-01-04 01:53:47 EST
Just FWIW: dnf does not use libdb directly, rpm does. Dnf is only involved as a librpm API consumer. The rpmdb is never closed during a transaction so there's always a "reference" to the database environment when an upgrade is in process.

From the rpm POV the simplest option would be to just flick on DB_PRIVATE mode. This sort of equals disabling db locking, but non-root queries (which I'd assume to be the vast majority) already run in DB_PRIVATE mode for the simple reason of not having permission to grab locks, and the world hasn't ended. A single writer is enforced via a separate transaction lock anyway.
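For readers unfamiliar with the flag, a rough sketch of what opening the environment with DB_PRIVATE means at the DB_ENV level follows; the home directory, subsystem flags and mode are assumptions for illustration, not rpm's actual code.

~~~
#include <db.h>

/* Open an environment whose region memory is private to this process, so no
 * pthread objects are shared through file-backed __db.00* regions at all. */
static int open_private_env(DB_ENV **out, const char *home)
{
    DB_ENV *dbenv;
    int ret;

    if ((ret = db_env_create(&dbenv, 0)) != 0)
        return ret;

    /* DB_PRIVATE keeps the shared regions in per-process memory instead of
     * file-backed shared memory; DB_INIT_MPOOL is the minimum subsystem this
     * sketch assumes a read-only rpmdb query needs. */
    ret = dbenv->open(dbenv, home,
                      DB_CREATE | DB_INIT_MPOOL | DB_PRIVATE, 0644);
    if (ret != 0) {
        dbenv->close(dbenv, 0);
        return ret;
    }
    *out = dbenv;
    return 0;
}
~~~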
Comment 20 Petr Kubat 2017-01-04 06:39:06 EST
(In reply to Torvald Riegel from comment #18)
> Second, I have looked a bit at the code for libdb mutex, mutex regions, and
> environments, but I haven't figured out yet whether libdb actually destroys
> all mutexes when the last reference to a database is removed.
> Nonetheless, making this scenario work could be a solution.

To my knowledge, mutexes located in libdb's regions are not destroyed or re-initialized when the last process that references them exits.
 
> Another solution could be to say that glibc upgrades are like libdb
> major/minor upgrades, and then follow (some of) the steps outlined in
> docs/upgrading/upgrade_process.html for every user of libdb.  This requires
> touching more packages, but may be easier to enforce than the first
> solution, which requires checking whether libdb clients cannot hold a
> reference across updates of glibc.

This would be possible for the packages we know of but not helpful for user applications built upon libdb that would find themselves unable to use it and would have to make some manual changes on their own.
This could however be used as a workaround for rpm which would otherwise block the upgrade to F26.

> A third solution would be to version databases and include glibc's version
> in there.  But that doesn't provide automatic capability, and would probably
> require quite a few changes to libdb.
> 
> I guess it would be best to look at the first solution first, starting with
> trying to find out what libdb actually guarantees or intends to guarantee. 
> Any suggestions for where to get this information?

Unfortunately my knowledge about libdb is also very limited, given that I have not been taking care of it for very long.
I have contacted upstream regarding this issue. The thread is publicly available at:
https://community.oracle.com/message/14170069
My hope is that Oracle might help us answer the questions we have about their implementation at least.
Comment 21 Petr Kubat 2017-01-06 02:49:59 EST
Discussion with Oracle has moved to email. They promised to take a look at this issue, but it will take a while before they can work on it actively. Do we know when the new implementation is going to get into an upstream release, so that Oracle knows what timeframe to work with?

I have also verified with Oracle that libdb does not do any reinitialization checks of synchronization primitives on its own, so the only way to currently work around this is by opening the environment with the DB_RECOVER (or DB_PRIVATE as Panu pointed out) flag set.
Comment 22 Petr Kubat 2017-02-13 04:53:40 EST
Forwarding new information from upstream:

----

We have been looking at this. This is a bit tricky. The problem, as we see it, is that the glibc library changes the pthread_cond_t structure in a way that breaks backward compatibility, so we cannot look at the structure size and draw any conclusions. The idea that we came up with is this:

 - add a check in the configure code to see if gnu_get_libc_version() exists on the system -- this should be true for nearly all Linuxes
 - if true, we compile in some extra code that is executed during the environment open
 - this code will clean the mutex region if the glibc version is different.

Consequences of this patch:

 - we will store the glibc version in the environment and on open we will grab the current version; if we detect a mismatch we will recreate the region files. This should prevent the hang that you are seeing.

----

This could work as a workaround for the issue; however, I have voiced some concerns regarding already existing environments that do not have the glibc version variable yet (as we had problems with modifying environment structures before).
Also, upstream asks whether it is possible to get into a situation in which two rpm processes are accessing the libdb environment, each using a different glibc version (after performing the glibc upgrade).
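A minimal sketch of the stamp-and-compare idea described above, assuming only gnu_get_libc_version() from <gnu/libc-version.h>; how and where libdb actually stores the stamp inside the environment is not shown and is exactly the open concern mentioned.

~~~
#include <gnu/libc-version.h>
#include <string.h>

#define GLIBC_STAMP_LEN 32

/* 'stored' is the glibc version string recorded when the region files were
 * created (empty string on first use).  Returns 1 if the regions were built
 * by a different glibc and should be discarded and recreated. */
static int glibc_version_changed(char stored[GLIBC_STAMP_LEN])
{
    const char *now = gnu_get_libc_version();   /* e.g. "2.24.90" */

    if (stored[0] != '\0' && strcmp(stored, now) != 0)
        return 1;                               /* mismatch: recreate regions */

    /* Record the running version for the next opener. */
    strncpy(stored, now, GLIBC_STAMP_LEN - 1);
    stored[GLIBC_STAMP_LEN - 1] = '\0';
    return 0;
}
~~~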
Comment 23 Panu Matilainen 2017-02-13 05:15:26 EST
(In reply to Petr Kubat from comment #22)
> Also upstream asks whether it is possible to get into a situation in which
> two rpm processes are accessing libdb environment each using a different
> glibc version (after performing the glibc upgrade).

This is one of the reasons why calling rpm from rpm scriptlets is such a bad idea: while it appears to work most of the time, every once in a blue moon it breaks in that single transaction where glibc innards change and there's no way to predict when this might happen and reproducing can be a bit tricky too.

It's also a fine example of the wacko complications involved with the rpm database...
Comment 24 Petr Kubat 2017-02-13 06:09:02 EST
(In reply to Panu Matilainen from comment #23)
> This is one of the reasons why calling rpm from rpm scriptlets is such a bad
> idea: while it appears to work most of the time, every once in a blue moon
> it breaks in that single transaction where glibc innards change and there's
> no way to predict when this might happen and reproducing can be a bit tricky
> too.
> 
> It's also a fine example of the wacko complications involved with the rpm
> database...

Ah, right. I forgot running rpm inside a rpm scriptlet is a thing. Thanks for reminding me.
Comment 25 Panu Matilainen 2017-02-13 06:15:03 EST
(In reply to Petr Kubat from comment #24)
> 
> Ah, right. I forgot running rpm inside a rpm scriptlet is a thing. Thanks
> for reminding me.

It's very much a discouraged thing. Not that discouraging ever *stopped* people from doing it though...
Comment 26 Florian Weimer 2017-02-13 06:19:34 EST
(In reply to Petr Kubat from comment #22)
> Forwarding new information from upstream:
> 
> ----
> 
> We have been looking at this.  this is a bit tricky.   The problem as we see
> it, is the glibc library changes the pthread_cond_t structure in a way that
> breaks backward compatibility.   So we cannot look at structure size and
> make any conclusions.  The idea that we came up with is this ....
> 
>    -  add a check in the configure code to see if gnu_get_libc_version()
> exists on the system -- This should be true for nearly all Linuxes
>  - if true, we compile in some extra code that is executed during the
> environment open
>  - This code will clean the mutex region if the glibc version is different.
> 
>   consequences of this patch ....
> 
>   - we will store the glibc version in the environment and on open we will
> grab the current version and if we detect a mismatch we will recreate the
> region files.  This should prevent the hang that you are seeing.
> 
> ----

I don't think this is quite sufficient because the data structure layout could change without the glibc version changing.

I suppose we could provide a pthread_layout_version_np function in glibc which returns an int that identifies the version of the current pthread data structure layout. It is a bit tricky to come up with a single version number which is consistent across all distributions, but we can probably provide something.
Comment 27 Torvald Riegel 2017-02-13 06:57:58 EST
(In reply to Florian Weimer from comment #26)
> (In reply to Petr Kubat from comment #22)
> > Forwarding new information from upstream:
> > 
> > ----
> > 
> > We have been looking at this.  this is a bit tricky.   The problem as we see
> > it, is the glibc library changes the pthread_cond_t structure in a way that
> > breaks backward compatibility.   So we cannot look at structure size and
> > make any conclusions.  The idea that we came up with is this ....
> > 
> >    -  add a check in the configure code to see if gnu_get_libc_version()
> > exists on the system -- This should be true for nearly all Linuxes
> >  - if true, we compile in some extra code that is executed during the
> > environment open
> >  - This code will clean the mutex region if the glibc version is different.
> > 
> >   consequences of this patch ....
> > 
> >   - we will store the glibc version in the environment and on open we will
> > grab the current version and if we detect a mismatch we will recreate the
> > region files.  This should prevent the hang that you are seeing.
> > 
> > ----
> 
> I don't think this is quite sufficient because the data structure layout
> could change without the glibc version changing.
> 
> I suppose we could provide a pthread_layout_version_np function in glibc
> which returns an int which refers to version number of the current pthread
> data structure layout.  it is a bit tricky to come up with a single version
> number which is consistent across all distributions, but we can probably
> provide something.

I agree that we could build such a version number, but I don't think we should.  First, libdb should show that it can't just do reference counting, for example, and re-initialize any condvars it uses whenever a database transitions from not being used to used.  I'll talk with libdb upstream about this.
Comment 28 Howard Chu 2017-02-13 07:46:51 EST
If you can guarantee that a reboot will occur between version changes, a simple fix is to always configure BDB to use a shared memory region instead of a mmap'd file for its environment. Then the shared memory region is automatically destroyed/recreated by rebooting the system.
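In Berkeley DB terms, what Howard describes corresponds to opening the environment with DB_SYSTEM_MEM and a base shared-memory key; a hedged sketch follows, with the key and subsystem flags chosen purely for illustration.

~~~
#include <db.h>

/* Keep the environment's regions in System V shared memory rather than in
 * file-backed mappings, so that a reboot discards them together with the
 * old glibc instead of persisting stale pthread objects on disk. */
static int open_sysmem_env(DB_ENV **out, const char *home)
{
    DB_ENV *dbenv;
    int ret;

    if ((ret = db_env_create(&dbenv, 0)) != 0)
        return ret;

    /* The base segment ID (0x4442) is arbitrary here; a real deployment has
     * to pick one that does not collide with other SysV shm users. */
    if ((ret = dbenv->set_shm_key(dbenv, 0x4442)) != 0 ||
        (ret = dbenv->open(dbenv, home,
                           DB_CREATE | DB_INIT_MPOOL | DB_INIT_LOCK |
                           DB_SYSTEM_MEM, 0644)) != 0) {
        dbenv->close(dbenv, 0);
        return ret;
    }
    *out = dbenv;
    return 0;
}
~~~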
Comment 29 Petr Kubat 2017-02-21 08:46:51 EST
Some more information from the discussion with libdb upstream.

We have 2 ideas we are looking at right now.

1. have BDB store away the glibc version and on an open do some checking to see if the glibc version changed and force a recovery under the covers
2. DBENV recovery on glibc version change via a %posttrans cleanup action

The first idea I have already mentioned in comment 22.
The second idea is something I have thrown into the discussion as a workaround specifically for rpm. I have tried looking at this idea a bit more to see if it would actually work and found out that there are additional issues connected with it.
AFAIK there are 3 ways to go about forcing the environment to recover:
The first is "rpm --rebuilddb", but this does not work as it needs a transaction lock that is held by the rpm process doing the install/update action.
The second is "db_recover -h /var/lib/rpm", which is the libdb way of doing the recovery. This does work in recovering the environment but results in the rpm process throwing DB_RUNRECOVERY errors.
The third way is just removing the problematic region files by force via "rm -f /var/lib/__db.00*", but it seems rpm recreates them with the old glibc structures in place when the db/dbenv is closed.

Generally libdb upstream is against modifying their code to rebuild the environment every time it is accessed as that (together with the implementation of an "in-use" counter) would be a non-negligible hit to performance.
Comment 30 Petr Kubat 2017-02-22 03:48:49 EST
A correction to the third way of environment recovery ("rm -f /var/lib/__db.00*") - the region files are not recreated on closing the db/dbenv, but during dnf's verification process, which is still a problem since the hang is not removed.
Comment 32 Fedora End Of Life 2017-02-28 05:36:52 EST
This bug appears to have been reported against 'rawhide' during the Fedora 26 development cycle.
Changing version to '26'.
Comment 33 Petr Kubat 2017-05-03 02:37:14 EDT
Some more information from upstream:

They are going with the basic solution that they had originally proposed (cleaning up the environment files on glibc version change). The main issue is doing this in a dynamic way.

Right now upstream is waiting for a pass on all of their regression tests, so hopefully we will get a patch we can work with soon.
Comment 34 Petr Kubat 2017-05-16 04:07 EDT
Created attachment 1279226 [details]
upstream patch

Earlier today, upstream sent me the promised patch for our version of libdb (attached).
I have only had time to do a quick test, but it seems to work out of the box. I will do a thorough review later.
Comment 35 Fedora Update System 2017-05-23 18:43:34 EDT
libdb-5.3.28-18.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2017-9d674be444
Comment 36 Fedora Update System 2017-05-23 18:44:08 EDT
libdb-5.3.28-18.fc25 has been submitted as an update to Fedora 25. https://bodhi.fedoraproject.org/updates/FEDORA-2017-6e056b68bf
Comment 37 Fedora Update System 2017-05-23 18:44:32 EDT
libdb-5.3.28-18.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-a4c41ecc27
Comment 38 Adam Williamson 2017-05-23 18:50:49 EDT
As this is blocking Fedora 26 beta, and the patch tested out with a scratch build (see https://bugzilla.redhat.com/show_bug.cgi?id=1443415#c19 ), I went ahead and submitted updates for all releases. Petr, of course if you spot anything wrong on review, I will withdraw the updates.

For F24 and F25 I synced up with master, pulling in a couple of other recent fixes; I reviewed them and they seemed to be quite safe to include, but again, please let me know if not.
Comment 39 Petr Kubat 2017-05-24 02:36:10 EDT
As for the patch, I found one thing that bothered me and might also pose some issues:

~~~~~~~~~~~~~~snip~~~~~~~~~~~~~~

diff -r db78da0996b1 src/env/env_region.c
--- a/src/env/env_region.c  Mon Sep 09 11:09:35 2013 -0400
+++ b/src/env/env_region.c  Sat Apr 29 04:10:18 2017 -0700
@@ -14,17 +14,54 @@
 #include "dbinc/log.h"
 #include "dbinc/txn.h"

+#define static
 static int  __env_des_get __P((ENV *, REGINFO *, REGINFO *, REGION **));
 static int  __env_faultmem __P((ENV *, void *, size_t, int));
 static int  __env_sys_attach __P((ENV *, REGINFO *, REGION *));

~~~~~~~~~~~~~~snip~~~~~~~~~~~~~~

I already have a patch for this (and other minor things) prepared, but wanted to first get in touch with upstream to ask whether there is a good reason for the define to stay in (there is not).
I will make new builds for libdb across all the branches that already have the original fix.
Comment 40 Adam Williamson 2017-05-24 02:56:26 EDT
Thanks for the heads-up - please do go ahead and edit my updates with your new builds if you have the power, or submit new updates to supersede them.
Comment 41 Petr Kubat 2017-05-24 03:50:39 EDT
New builds have been added to the existing updates.

As for how libdb now handles environment rebuild checks - when the environment is opened libdb checks the modify timestamp of libpthread.so, compares it with the value stored in the environment and on mismatch exclusively locks the environment and rebuilds it with new variables in place.
Libdb now also rebuilds the environment the same way when either the internal layout of its regions' variables changes (as is the case with the addition of pthreads_timestamp) or when libdb is updated to a newer version.
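A sketch of the timestamp check just described, using plain stat(2); the libpthread path and the way libdb stores the saved value are assumptions for illustration rather than the code in the attached patch.

~~~
#include <sys/stat.h>
#include <time.h>

/* Compare the modification time of the installed libpthread with the value
 * saved in the environment; a mismatch means the regions were created
 * against a different glibc build and should be rebuilt. */
static int pthreads_timestamp_changed(time_t saved)
{
    struct stat st;

    /* The path is an assumption for this sketch; the real check has to find
     * whichever libpthread the running process actually uses. */
    if (stat("/lib64/libpthread.so.0", &st) != 0)
        return 1;               /* cannot tell: err on the side of rebuilding */

    return st.st_mtime != saved;
}
~~~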
Comment 42 Fedora Update System 2017-05-25 15:20:08 EDT
libdb-5.3.28-19.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-a4c41ecc27
Comment 43 Lukas Slebodnik 2017-05-26 03:32:49 EDT
(In reply to Adam Williamson from comment #38)
> As this is blocking Fedora 26 beta, and the patch tested out with a scratch
> build (see https://bugzilla.redhat.com/show_bug.cgi?id=1443415#c19 ), I went
> ahead and submitted updates for all releases. Petr, of course if you spot
> anything wrong on review, I will withdraw the updates.
> 
> For F24 and F25 I synced up with master, pulling in a couple of other recent
> fixes; I reviewed them and they seemed to be quite safe to include, but
> again, please let me know if not.

Adam,
Thank you very much. I really appreciate that you "tested" this update properly.
[root@vm-174 ~]# dnf check-update
error: rpmdb: BDB0113 Thread/process 17153/139837515298560 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db5 -  (-30973)
error: cannot open Packages database in /var/lib/rpm
Error: Error: rpmdb open failed
[root@vm-174 ~]# 
[root@vm-174 ~]# rpm -q libdb
error: rpmdb: BDB0113 Thread/process 17153/139837515298560 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db5 -  (-30973)
error: cannot open Packages database in /var/lib/rpm
error: rpmdb: BDB0113 Thread/process 17153/139837515298560 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db5 -  (-30973)
error: cannot open Packages database in /var/lib/rpm
package libdb is not installed
Comment 44 Adam Williamson 2017-05-26 03:36:59 EDT
Lukas: I didn't test it; people hitting the bug did, and it fixed the bug. Updates go to testing in order to find problems like that. I already unpushed the updates several hours ago. You may want to follow https://bugzilla.redhat.com/show_bug.cgi?id=1443415 , the discussion is more active there.
Comment 45 Lukas Slebodnik 2017-05-26 03:47:45 EDT
Then it would be good to close this BZ as a duplicate or vice versa.
Comment 46 Lukas Slebodnik 2017-05-26 07:58:13 EDT
(In reply to Adam Williamson from comment #44)
> Lukas: I didn't test it; people hitting the bug did, and it fixed the bug.
And that is the biggest problem. It is a critical package and you blindly pushed 5.3.28-18 to updates-testing.

You would not have pushed the package to updates-testing if you had tried dnf update + links from koji.
  e.g. dnf update https://kojipkgs.fedoraproject.org//packages/libdb/5.3.28/18.fc26/x86_64/libdb-5.3.28-18.fc26.x86_64.rpm https://kojipkgs.fedoraproject.org//packages/libdb/5.3.28/18.fc26/x86_64/libdb-utils-5.3.28-18.fc26.x86_64.rpm

Because it does not work. I know it is not just your problem, because the package maintainer didn't test their build either (5.3.28-19). (dnf update + links from koji)

> Updates go to testing in order to find problems like that.
Fedora users should just help with testing packages in updates-testing; they should not take over that role entirely. updates-testing should be used to find corner cases, but this is a basic use case: updating from updates -> updates-testing.

> I already unpushed the updates several hours ago.
Then there is a bug in Fedora release engineering. It is obvious that unpushed packages were released. It would be good to prevent such a situation in the future. If you know where to file a ticket for Fedora releng, I would appreciate it if you could do that.
Comment 47 Petr Kubat 2017-05-26 08:08:33 EDT
Lukas,
libdb has been tested. It is unfortunate that another issue broke rpm in some configurations (I guess yours included) but that is something that most of the time cannot be foreseen, which is exactly the reason why we are pushing packages to updates-testing, where a lot of users with different configurations can install and test the package.

This is not a libdb issue, nor is it a dnf/yum issue. If anything, this is either an issue with rpm, which allows calling rpm commands inside scriptlets, or an issue with the packages that actually do it, even though it is strongly discouraged. But let us not throw blame around and concentrate on fixing things instead.

For anyone interested, here follows a log from my box with the openldap-servers package installed and libdb being updated (via rpm -Uvv "libdb-5.3.28-19.fc26.x86_64.rpm"):

~~~~~~~~~~~~~~snip~~~~~~~~~~~~~~
D: %triggerin(openldap-servers-2.4.44-10.fc25.x86_64): scriptlet start
fdio:       2 writes,      368 total bytes in 0.000009 secs
D: %triggerin(openldap-servers-2.4.44-10.fc25.x86_64): execv(/bin/sh) pid 8412
D: Plugin: calling hook scriptlet_fork_post in selinux plugin
D: setexecfilecon: (/bin/sh) 
+ '[' 2 -eq 2 ']'
++ wc -l
++ sort -u
++ sed 's/\.[0-9]*$//'
++ rpm -q '--qf=%{version}\n' libdb
+ '[' 1 '!=' 1 ']'
+ rm -f /var/lib/ldap/rpm_upgrade_libdb
+ exit 0
D: %triggerin(openldap-servers-2.4.44-10.fc25.x86_64): waitpid(8412) rc 8412 status 0
error: rpmdb: DB_LOCK->lock_put: Lock is no longer valid
error: db5 error(22) from dbcursor->c_close: Invalid argument
Segmentation fault (core dumped)
~~~~~~~~~~~~~~snip~~~~~~~~~~~~~~

Not sure yet why libdb dies here with a segfault. It should be complaining about an environment version mismatch instead.
Comment 48 Panu Matilainen 2017-05-26 09:14:48 EDT
It's not even just scriptlets calling rpm, it's *anything* opening the rpmdb ... during a transaction that happens to change the futex implementation or such. Which occurs once every few years - totally unpredictable except for the fact that the next time WILL come, most likely when everybody has mostly forgotten the issue exists at all :-/
Comment 49 Lukas Slebodnik 2017-05-26 10:54:13 EDT
(In reply to Petr Kubat from comment #47)
> Lukas,
> libdb has been tested. It is unfortunate that another issue broke rpm in
> some configurations (I guess yours included)

It is not broken just in some configuration. It is broken for everyone on 26.

I am sorry, but you didn't test the upgrade of your own package and you are trying to persuade us that there is nothing wrong in libdb. If there is a change in libdb which requires a change in rpm, then both packages should be updated at once, or both maintainers should be in sync. But breaking the upgrade for everyone is not a solution. libdb is a ***critical*** package on Fedora (due to rpm).

> but that is something that most
> of the times cannot be forseen, which is exactly the reason why we are
> pushing packages to updates-testing where a lot of users with different
> configurations can install and test the package.
>

BTW, updates-testing is a poor excuse. I do not want to say that you had to test all test cases with rpm, but upgrading your own package is a minimal use case which should not have to be tested by users.



Are you sure it is not a bug in libdb?
Then please explain why it happens when libdb-5.3.28-17.fc26.x86_64.rpm is upgraded to a newer version. And libdb-5.3.28-17.fc26.x86_64.rpm is the default version in F26: https://apps.fedoraproject.org/packages/libdb

sh# rpm -qa libdb*
libdb-utils-5.3.28-17.fc26.x86_64
libdb-5.3.28-17.fc26.x86_64

sh# rpm -q --scripts libdb libdb-utils
postinstall program: /sbin/ldconfig
postuninstall program: /sbin/ldconfig

sh# rpm -Uvh https://kojipkgs.fedoraproject.org//packages/libdb/5.3.28/19.fc26/x86_64/libdb-5.3.28-19.fc26.x86_64.rpm https://kojipkgs.fedoraproject.org//packages/libdb/5.3.28/19.fc26/x86_64/libdb-utils-5.3.28-19.fc26.x86_64.rpm
Retrieving https://kojipkgs.fedoraproject.org//packages/libdb/5.3.28/19.fc26/x86_64/libdb-5.3.28-19.fc26.x86_64.rpm
Retrieving https://kojipkgs.fedoraproject.org//packages/libdb/5.3.28/19.fc26/x86_64/libdb-utils-5.3.28-19.fc26.x86_64.rpm

Preparing...                          ################################# [100%]
Updating / installing...
   1:libdb-5.3.28-19.fc26             ################################# [ 25%]
error: rpmdb: DB_LOCK->lock_put: Lock is no longer valid
error: db5 error(22) from dbcursor->c_close: Invalid argument
Comment 50 Petr Kubat 2017-05-26 12:49:56 EDT
>Then please explain why it happens when libdb-5.3.28-17.fc26.x86_64.rpm is upgraded to newer version.

If you were to provide proper debugging output (like the example I posted in my last comment) I would be able to answer that question. Alas, you failed to do so, and so I cannot.

>It's not even just scriptlets calling rpm, it's *anything* opening the rpmdb ... 

True, but it feels like the scriptlets use case is one of the more visible ones, given that I seem to hit issues with them pretty often.
In any case I am out of ideas on how to make this work for scriptlets. We cannot remove the environment during scriptlets, as that breaks rpm since it does not expect that a scriptlet might do that. On the other hand, even if we were able to make scriptlets use the old libdb shared library (is that even possible after it is already installed?), that would just cause issues in the packages running those scriptlets (since those would just fail on an environment version mismatch).
This leaves us once again with trying to keep the environment signature the same between those two versions, which, however, would mean we could not remember the pthreads version libdb was built against, as it is currently saved in the environment.

Note You need to log in before you can comment on or make changes to this bug.