Bug 1394862

Summary: libdb: Assumes that internal condition variable layout never changes
Product: Fedora
Component: libdb
Version: 26
Hardware: Unspecified
OS: Unspecified
Status: ASSIGNED
Severity: unspecified
Priority: unspecified
Type: Bug
Reporter: Jan Pazdziora <jpazdziora>
Assignee: Petr Kubat <pkubat>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: arjun.is, codonell, dj, fche, fweimer, hhorak, hyc, jakub, jpazdziora, jstanek, law, lslebodn, mfabian, mhonek, pfrankli, pkubat, pmatilai, rjones, sid, triegel, zing
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=1397087
Bug Blocks: 1346768

Description Jan Pazdziora 2016-11-14 10:35:21 EST
Description of problem:

Building Fedora rawhide container image from Dockerfile

FROM fedora:rawhide
RUN dnf install -y httpd

works fine.

However, adding RUN dnf upgrade -y glibc before the dnf install line causes the build process to hang.

Version-Release number of selected component (if applicable):

glibc-2.24.90-13.fc26.x86_64

How reproducible:

Deterministic for this particular Dockerfile on my machines, although it does not always get stuck on apr-util or on the second package of the transaction.

Steps to Reproduce:
1. Have Dockerfile

FROM fedora:rawhide
RUN dnf upgrade -y glibc
RUN dnf install -y httpd

2. Attempt to build the image: docker build -t rawhide .

Actual results:

Install  7 Packages

Total download size: 1.7 M
Installed size: 4.7 M
Downloading Packages:
--------------------------------------------------------------------------------
Total                                           1.1 MB/s | 1.7 MB     00:01     
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Installing  : apr-1.5.2-4.fc25.x86_64                                     1/7 
  Installing  : apr-util-1.5.4-3.fc24.x86_64                                2/7

and here the process hangs.

Expected results:

All packages installed, image created.

Additional info:

The same hang happens with plain docker run:

docker run --rm -ti fedora:rawhide bash -c 'dnf upgrade -y glibc && dnf install -y httpd'

With glibc-2.24.90-2.fc26.x86_64 (which is on the three-month-old fedora:rawhide image), things work, which is why I am filing this against glibc.
Comment 1 Jan Pazdziora 2016-11-14 10:38:05 EST
The OS and environment on the host do not matter -- I see the same hang on Fedora 24 with docker-1.10.3-54.gite03ddb8.fc24.x86_64 and on RHEL 7.3 with docker-1.12.3-4.el7.x86_64.
Comment 2 Florian Weimer 2016-11-14 10:38:50 EST
Please provide a process tree (ps axuf) when the hang occurs.  Attaching GDB to the hanging subprocess would be helpful, too, but it could prove difficult to get GDB to pick up the proper separate debuginfo.  Alternatively, please create a coredump of the hanging process with gcore and upload it somewhere.  Then we can analyze it offline.
Comment 3 Jan Pazdziora 2016-11-15 05:48:29 EST
The bt against the hung process is

#0  0x00007fd8520f82c1 in futex_wait (private=<optimized out>, expected=4294967295, futex_word=0x7fd842759c04) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=4294967295, futex_word=0x7fd842759c04) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_cond_destroy (cond=cond@entry=0x7fd842759be0) at pthread_cond_destroy.c:54
#3  0x00007fd846d969cf in __db_pthread_mutex_destroy (env=env@entry=0x55ac4daae720, mutex=mutex@entry=340) at ../../src/mutex/mut_pthread.c:757
#4  0x00007fd846d9610f in __db_tas_mutex_destroy (env=env@entry=0x55ac4daae720, mutex=mutex@entry=340) at ../../src/mutex/mut_tas.c:602
#5  0x00007fd846e4dda8 in __mutex_free_int (env=0x55ac4daae720, locksys=locksys@entry=1, indxp=indxp@entry=0x7fd84036f550) at ../../src/mutex/mut_alloc.c:248
#6  0x00007fd846e4e3c5 in __mutex_free (env=env@entry=0x55ac4daae720, indxp=indxp@entry=0x7fd84036f550) at ../../src/mutex/mut_alloc.c:217
#7  0x00007fd846eb329b in __memp_bhfree (dbmp=dbmp@entry=0x55ac4cc383f0, infop=0x55ac4daf5b30, mfp=mfp@entry=0x7fd8403052d0, hp=<optimized out>, 
    bhp=bhp@entry=0x7fd84036f550, flags=flags@entry=1) at ../../src/mp/mp_bh.c:663
#8  0x00007fd846eb5aca in __memp_fget (dbmfp=dbmfp@entry=0x55ac4d9e50f0, pgnoaddr=pgnoaddr@entry=0x7ffc5622bb5c, ip=ip@entry=0x7fd84045f770, txn=txn@entry=0x0, 
    flags=flags@entry=8, addrp=addrp@entry=0x7ffc5622bb60) at ../../src/mp/mp_fget.c:479
#9  0x00007fd846ebb4fc in __memp_ftruncate (dbmfp=dbmfp@entry=0x55ac4d9e50f0, txn=0x0, ip=0x7fd84045f770, pgno=pgno@entry=19, flags=flags@entry=0)
    at ../../src/mp/mp_method.c:856
#10 0x00007fd846e7199e in __db_free (dbc=dbc@entry=0x55ac4d9e8280, h=<optimized out>, flags=flags@entry=0) at ../../src/db/db_meta.c:525
#11 0x00007fd846e75ddb in __db_doff (dbc=dbc@entry=0x55ac4d9e8280, pgno=<optimized out>) at ../../src/db/db_overflow.c:479
#12 0x00007fd846da3a8f in __bam_ditem (dbc=dbc@entry=0x55ac4d9e8280, h=0x7fd84036b388, indx=indx@entry=7) at ../../src/btree/bt_delete.c:138
#13 0x00007fd846da8160 in __bam_iitem (dbc=dbc@entry=0x55ac4d9e8280, key=key@entry=0x7ffc5622c1d0, data=data@entry=0x7ffc5622c1a0, op=6, flags=flags@entry=0)
    at ../../src/btree/bt_put.c:421
#14 0x00007fd846da2e5d in __bamc_put (dbc=0x55ac4d9e8280, key=0x7ffc5622c1d0, data=0x7ffc5622c1a0, flags=14, pgnop=0x7ffc5622c0b4)
    at ../../src/btree/bt_cursor.c:2240
#15 0x00007fd846e5d1dc in __dbc_iput (dbc=0x55ac4daad330, key=0x7ffc5622c1d0, data=0x7ffc5622c1a0, flags=14) at ../../src/db/db_cam.c:2136
#16 0x00007fd846e5f70d in __dbc_put (dbc=dbc@entry=0x55ac4daad330, key=key@entry=0x7ffc5622c1d0, data=data@entry=0x7ffc5622c1a0, flags=<optimized out>, 
    flags@entry=14) at ../../src/db/db_cam.c:2049
#17 0x00007fd846e6c9c1 in __dbc_put_pp (dbc=0x55ac4daad330, key=0x7ffc5622c1d0, data=0x7ffc5622c1a0, flags=14) at ../../src/db/db_iface.c:2751
#18 0x00007fd84754b235 in dbiCursorPut.isra.5 () from /lib64/librpm.so.7
#19 0x00007fd84754d211 in updateIndex.part.8 () from /lib64/librpm.so.7
#20 0x00007fd84754d5e1 in db3_idxdbPut () from /lib64/librpm.so.7
#21 0x00007fd847553168 in tag2index () from /lib64/librpm.so.7
#22 0x00007fd847556c6b in rpmdbAdd () from /lib64/librpm.so.7
#23 0x00007fd84756a86d in rpmpsmRun () from /lib64/librpm.so.7
#24 0x00007fd84757e125 in rpmteProcess () from /lib64/librpm.so.7
#25 0x00007fd847584a3e in rpmtsRun () from /lib64/librpm.so.7
#26 0x00007fd84559d204 in rpmts_Run () from /usr/lib64/python3.5/site-packages/rpm/_rpm.cpython-35m-x86_64-linux-gnu.so
#27 0x00007fd8523b67e9 in PyCFunction_Call (func=<built-in method run of TransactionSet object at remote 0x7fd843940b28>, 
    args=(<method at remote 0x7fd84458a1c8>, '', 64), kwds=<optimized out>) at /usr/src/debug/Python-3.5.2/Objects/methodobject.c:98
#28 0x00007fd85236eeb7 in PyObject_Call (func=<built-in method run of TransactionSet object at remote 0x7fd843940b28>, arg=<optimized out>, kw=<optimized out>)
    at /usr/src/debug/Python-3.5.2/Objects/abstract.c:2165
#29 0x00007fd852425d17 in PyEval_CallObjectWithKeywords (func=func@entry=<built-in method run of TransactionSet object at remote 0x7fd843940b28>, 
    arg=arg@entry=(<method at remote 0x7fd84458a1c8>, '', 64), kw=kw@entry=0x0) at /usr/src/debug/Python-3.5.2/Python/ceval.c:4609
#30 0x00007fd852389b38 in methoddescr_call (descr=<optimized out>, args=(<method at remote 0x7fd84458a1c8>, '', 64), kwds=0x0)
    at /usr/src/debug/Python-3.5.2/Objects/descrobject.c:250
#31 0x00007fd85236eeb7 in PyObject_Call (func=<method_descriptor at remote 0x7fd845bfd708>, arg=<optimized out>, kw=<optimized out>)
    at /usr/src/debug/Python-3.5.2/Objects/abstract.c:2165
#32 0x00007fd85242a491 in do_call (nk=<optimized out>, na=4, pp_stack=0x7ffc5622ca60, func=<optimized out>) at /usr/src/debug/Python-3.5.2/Python/ceval.c:4965
#33 call_function (oparg=<optimized out>, pp_stack=0x7ffc5622ca60) at /usr/src/debug/Python-3.5.2/Python/ceval.c:4761
#34 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x7fd842bae048, for file /usr/lib64/python3.5/site-packages/rpm/transaction.py, line 103, in run (self=<TransactionSet(_probFilter=64) at remote 0x7fd843940b28>, callback=<method at remote 0x7fd84458a1c8>, data=''), throwflag=throwflag@entry=0) at /usr/src/debug/Python-3.5.2/Python/ceval.c:3260
#35 0x00007fd85242dcfb in fast_function (nk=<optimized out>, na=<optimized out>, n=3, pp_stack=0x7ffc5622cba0, func=<optimized out>)
    at /usr/src/debug/Python-3.5.2/Python/ceval.c:4832
#36 call_function (oparg=<optimized out>, pp_stack=0x7ffc5622cba0) at /usr/src/debug/Python-3.5.2/Python/ceval.c:4759
#37 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x55ac4da0bba8, for file /usr/lib/python3.5/site-packages/dnf/base.py, line 735, in _run_transaction (self=<BaseCli(_tempfile_persistor=None, _ds_callback=<DepSolveProgressCallBack(loops=1) at remote 0x7fd843950710>, _comps=None, _group_persistor=None, _plugins=<Plugins(plugin_cls=[<type at remote 0x55ac4bdef758>, <type at remote 0x55ac4bde9c38>, <type at remote 0x55ac4bdef3a8>, <type at remote 0x55ac4bdf24b8>, <type at remote 0x55ac4bdfcdd8>, <type at remote 0x55ac4bdf1f58>, <type at remote 0x55ac4bde8a38>, <type at remote 0x55ac4bdedaa8>, <type at remote 0x55ac4bdf3f08>, <type at remote 0x55ac4bdf3b58>], plugins=[<DebuginfoInstall(cli=<Cli(cmdstring='dnf install -y httpd ', command=<InstallCommand(cli=<...>, opts=<Namespace(ip_resolve=None, debuglevel=None, command=['install'], help=False, color=None, setopts=[], allowerasing=None, rpmverbosity=None, quiet=None, repos_ed=[], showdupesfromrepos=None, releasever=None, verbose=None, plugins=None, assumeno=None, version=None, excludepkgs=[], debugsolver=N...(truncated), throwflag=throwflag@entry=0) at /usr/src/debug/Python-3.5.2/Python/ceval.c:3260
#38 0x00007fd85242f5c3 in _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x55ac4c8bac90, 
    argcount=1, kws=0x55ac4c8bac98, kwcount=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name='_run_transaction', qualname='Base._run_transaction')
    at /usr/src/debug/Python-3.5.2/Python/ceval.c:4047
#39 0x00007fd85242be39 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffc5622cdb0, func=<optimized out>)
    at /usr/src/debug/Python-3.5.2/Python/ceval.c:4842
#40 call_function (oparg=<optimized out>, pp_stack=0x7ffc5622cdb0) at /usr/src/debug/Python-3.5.2/Python/ceval.c:4759
#41 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x55ac4c8baaa8, for file /usr/lib/python3.5/site-packages/dnf/base.py, line 661, in do_transaction (self=<BaseCli(_tempfile_persistor=None, _ds_callback=<DepSolveProgressCallBack(loops=1) at remote 0x7fd843950710>, _comps=None, _group_persistor=None, _plugins=<Plugins(plugin_cls=[<type at remote 0x55ac4bdef758>, <type at remote 0x55ac4bde9c38>, <type at remote 0x55ac4bdef3a8>, <type at remote 0x55ac4bdf24b8>, <type at remote 0x55ac4bdfcdd8>, <type at remote 0x55ac4bdf1f58>, <type at remote 0x55ac4bde8a38>, <type at remote 0x55ac4bdedaa8>, <type at remote 0x55ac4bdf3f08>, <type at remote 0x55ac4bdf3b58>], plugins=[<Debugi---Type <return> to continue, or q <return> to quit---

The ps output in the container is

# ps axuwwf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        11  0.0  0.0  12588  3716 ?        Ss   10:46   0:00 bash
root        34  0.0  0.0  39828  3256 ?        R+   10:48   0:00  \_ ps axuwwf
root         1  1.3  2.2 541256 91700 ?        Ss+  10:46   0:01 /usr/libexec/system-python /usr/bin/dnf install -y httpd
Comment 4 Florian Weimer 2016-11-15 05:56:46 EST
(In reply to Jan Pazdziora from comment #3)
> The bt against the hung process is
> 
> #0  0x00007fd8520f82c1 in futex_wait (private=<optimized out>,
> expected=4294967295, futex_word=0x7fd842759c04) at
> ../sysdeps/unix/sysv/linux/futex-internal.h:61
> #1  futex_wait_simple (private=<optimized out>, expected=4294967295,
> futex_word=0x7fd842759c04) at ../sysdeps/nptl/futex-internal.h:135
> #2  __pthread_cond_destroy (cond=cond@entry=0x7fd842759be0) at
> pthread_cond_destroy.c:54

This could be due to glibc-swbz13165.patch (new condition variable implementation).
Comment 5 Florian Weimer 2016-11-15 08:41:29 EST
I was able to reproduce it, but only in a Docker image.  Here's the condvar state:

(gdb) print *cond
$2 = {__data = {{__wseq = 0, __wseq32 = {__low = 0, __high = 0}}, {
      __g1_start = 0, __g1_start32 = {__low = 0, __high = 0}}, 
    __g_refs = {0, 0}, __g_size = {0, 0}, 
    __g1_orig_size = 4294967295, __wrefs = 4294967295, 
    __g_signals = {0, 0}}, 
  __size = '\000' <repeats 32 times>, "\377\377\377\377\377\377\377\377\000\000\000\000\000\000\000", __align = 0}
Comment 6 Torvald Riegel 2016-11-15 10:19:53 EST
It seems that the root cause is that the program tries to use a condvar instance that has been initialized with a prior version of glibc.  The condvar appears to be in a database file (/var/lib/rpm/__db.002 in our reproducer).  With the (glibc-internal) condvar definition in the old glibc version, the bits represent a valid, unused condvar:
$4 = {__data = {__lock = 0, __futex = 0, __total_seq = 0, 
                __wakeup_seq = 0,  __woken_seq = 0, __mutex = 
                0xffffffffffffffff, __nwaiters = 0,  __broadcast_seq = 0},

The same bits do not represent a valid condvar if given the condvar definition in the new version of glibc.

Storing the condvar in the database file means that effectively, the program is expecting different versions of glibc to use the same bit representations of internal data structures.  This is not guaranteed to work.

It seems that either BDB should be changed to not store the condvar in database files, or dnf needs to reinitialize the databases it relies on when glibc is updated.
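
For illustration, here is a minimal sketch (not libdb code) of how a process-shared condition variable ends up inside a file-backed mapping. The failure mode described above is a later process, running a glibc whose internal condvar layout differs, mapping the same file and calling pthread_cond_*() on the stale bytes:

/* Minimal sketch (not libdb code): a process-shared condvar living in a
 * file-backed mapping.  The bytes written here encode glibc-internal state;
 * a later process running a glibc with a different condvar layout that maps
 * the same file and operates on them hits undefined behaviour, which is what
 * hangs in __pthread_cond_destroy above. */
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("region.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, sizeof(pthread_cond_t)) != 0)
        return 1;

    pthread_cond_t *cv = mmap(NULL, sizeof *cv, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (cv == MAP_FAILED)
        return 1;

    pthread_condattr_t attr;
    pthread_condattr_init(&attr);
    pthread_condattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(cv, &attr);   /* representation is specific to this glibc */
    pthread_condattr_destroy(&attr);

    /* ... use cv together with a process-shared mutex ... */

    munmap(cv, sizeof *cv);         /* per POSIX, the object's state becomes
                                       undefined once the last mapping is gone */
    close(fd);
    return 0;
}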
Comment 7 Florian Weimer 2016-11-15 10:23:57 EST
The hang goes away if I add a “rpm --rebuilddb” step to the container build file:

FROM fedora:rawhide
RUN dnf upgrade -y glibc
RUN rpm --rebuilddb
RUN dnf install -y httpd
Comment 8 Jan Pazdziora 2016-11-15 11:15:42 EST
(In reply to Torvald Riegel from comment #6)
> It seems that the root cause is that the program tries to use a condvar
> instance that has been initialized with a prior version of glibc.

Thank you for the analysis.

> The
> condvar appears to be in a database file (/var/lib/rpm/__db.002 in our
> reproducer).  With the (glibc-internal) condvar definition in the old glibc
> version, the bits represent a valid, unused condvar:
> $4 = {__data = {__lock = 0, __futex = 0, __total_seq = 0, 
>                 __wakeup_seq = 0,  __woken_seq = 0, __mutex = 
>                 0xffffffffffffffff, __nwaiters = 0,  __broadcast_seq = 0},
> 
> The same bits do not represent a valid condvar if given the condvar
> definition in the new version of glibc.

Can the same issue happen during upgrade from (say) Fedora 24 to rawhide?

How hard would it be to make this old unused bit value also understood as unused value in new glibc?

> Storing the condvar in the database file means that effectively, the program
> is expecting different versions of glibc to use the same bit representations
> of internal data structures.  This is not guaranteed to work.
> 
> It seems that either BDB should be changed to not store the condvar in
> database files, or dnf needs to reinitialize the databases it relies on when
> glibc is updated.

It would likely need to be rpm, not dnf.
Comment 9 Carlos O'Donell 2016-11-15 11:46:50 EST
(In reply to Torvald Riegel from comment #6)
> It seems that the root cause is that the program tries to use a condvar
> instance that has been initialized with a prior version of glibc.  The
> condvar appears to be in a database file (/var/lib/rpm/__db.002 in our
> reproducer).  With the (glibc-internal) condvar definition in the old glibc
> version, the bits represent a valid, unused condvar:
> $4 = {__data = {__lock = 0, __futex = 0, __total_seq = 0, 
>                 __wakeup_seq = 0,  __woken_seq = 0, __mutex = 
>                 0xffffffffffffffff, __nwaiters = 0,  __broadcast_seq = 0},
> 
> The same bits do not represent a valid condvar if given the condvar
> definition in the new version of glibc.
> 
> Storing the condvar in the database file means that effectively, the program
> is expecting different versions of glibc to use the same bit representations
> of internal data structures.  This is not guaranteed to work.
> 
> It seems that either BDB should be changed to not store the condvar in
> database files, or dnf needs to reinitialize the databases it relies on when
> glibc is updated.

This is unsupported by POSIX; you must _always_ reinitialize the POSIX thread objects, per:
~~~
2.9.9 Synchronization Object Copies and Alternative Mappings

For barriers, condition variables, mutexes, and read-write locks, [TSH] [Option Start]  if the process-shared attribute is set to PTHREAD_PROCESS_PRIVATE, [Option End]  only the synchronization object at the address used to initialize it can be used for performing synchronization. The effect of referring to another mapping of the same object when locking, unlocking, or destroying the object is undefined. [TSH] [Option Start]  If the process-shared attribute is set to PTHREAD_PROCESS_SHARED, only the synchronization object itself can be used for performing synchronization; however, it need not be referenced at the address used to initialize it (that is, another mapping of the same object can be used). [Option End]  The effect of referring to a copy of the object when locking, unlocking, or destroying it is undefined.

For spin locks, the above requirements shall apply as if spin locks have a process-shared attribute that is set from the pshared argument to pthread_spin_init(). For semaphores, the above requirements shall apply as if semaphores have a process-shared attribute that is set to PTHREAD_PROCESS_PRIVATE if the pshared argument to sem_init() is zero and set to PTHREAD_PROCESS_SHARED if pshared is non-zero.
~~~

The serialization of the POSIX thread object into the database is undefined behaviour as noted under "The effect of referring to a copy of the object when locking, unlocking, or destroying it is undefined."

You may only use initialized objects.

I'm moving this to libdb to get fixed.
Comment 10 Florian Weimer 2016-11-15 11:57:49 EST
(In reply to Carlos O'Donell from comment #9)

Torvald and I looked at the POSIX wording as well, and it is not sufficiently clear if the libdb usage is actually undefined.  POSIX likely intends this to be undefined: some implementations do not have a unified page cache, and the effect of the mutex/condition variable initialization may happen completely outside the context of the file mapping.

The key question is whether you may continue to use the resource after the process which has initialized it unmapped the file and exited.  The historic Berkeley DB use case requires the initialization to persist, although I do think it is problematic, as explained above.  Even the existing database environment liveness checks are likely insufficient because they cannot detect the case where all applications detached in an orderly fashion and then reattach after a glibc update.
Comment 11 Florian Weimer 2016-11-15 15:30:04 EST
I found where POSIX says that the libdb usage is undefined.  It's in the specification for mmap and munmap:

“The state of synchronization objects such as mutexes, semaphores, barriers, and conditional variables placed in shared memory mapped with MAP_SHARED becomes undefined when the last region in any process containing the synchronization object is unmapped.”

If we posit that glibc updates need system reboots, this means that the format of these data structures on disk does not have to be retained across glibc updates.
Comment 12 Carlos O'Donell 2016-11-15 15:58:10 EST
(In reply to Florian Weimer from comment #11)
> I found where POSIX says that the libdb usage is undefined.  It's in the
> specfication for mmap and unmmap:
> 
> “The state of synchronization objects such as mutexes, semaphores, barriers,
> and conditional variables placed in shared memory mapped with MAP_SHARED
> becomes undefined when the last region in any process containing the
> synchronization object is unmapped.”
> 
> If we posit that glibc updates need system reboots, this means that the
> format of these data structures on disk does not have to be retained across
> glibc updates.

Correct. The alignment and size are constrained by the ABI, so those need to remain constant, but the internal details of the type (the bit patterns and their meanings) can and may change from release to release. You must reinitialize the objects before using them.

(In reply to Florian Weimer from comment #10)
> (In reply to Carlos O'Donell from comment #9)
> 
> Torvald and I looked at the POSIX wording as well, and it is not
> sufficiently clear if the libdb usage is actually undefined.  POSIX likely
> intends this to be undefined: some implementations do not have a unified
> page cache, and the effect of the mutex/condition variable initialization
> may happen completely outside the context of the file mapping.

The "why" is covered in the non-normative text under "Alternate Implementations Possible" here:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutex_init.html

> The key question is whether you may continue to use the resource after the
> process which has initialized it unmapped the file and exited.  The historic
> Berkeley DB use case requires the initialization to persist, although I do
> think it is problematic, as explained above.  Even the existing database
> environment liveness checks are likely insufficient because they cannot
> detect the case where all applications detached orderly, and reattach after
> a glibc update.

You cannot continue to use the synchronization resource after the last mapping is removed, as you noted in comment 11.

(In reply to Jan Pazdziora from comment #8)
> (In reply to Torvald Riegel from comment #6)
> > The
> > condvar appears to be in a database file (/var/lib/rpm/__db.002 in our
> > reproducer).  With the (glibc-internal) condvar definition in the old glibc
> > version, the bits represent a valid, unused condvar:
> > $4 = {__data = {__lock = 0, __futex = 0, __total_seq = 0, 
> >                 __wakeup_seq = 0,  __woken_seq = 0, __mutex = 
> >                 0xffffffffffffffff, __nwaiters = 0,  __broadcast_seq = 0},
> > 
> > The same bits do not represent a valid condvar if given the condvar
> > definition in the new version of glibc.
> 
> Can the same issue happen during upgrade from (say) Fedora 24 to rawhide?

Yes, the same issue may happen when upgrading from Fedora 24 to Rawhide. It is conceivable that this issue becomes a blocker for F26 if it reproduces consistently during installs or upgrades.

This needs to be fixed in Berkeley DB; it is an error in the use of the POSIX threads primitives which prevents the implementation from moving forward with beneficial changes, e.g. correctness and performance improvements for condition variables (the reason the internal type representation changed).

> How hard would it be to make this old unused bit value also understood as
> unused value in new glibc?

TL;DR: we would never support detecting the old unused bit value. The cost is too high for other applications.

You would have to detect all possible marshalled bit patterns and reinitialize the condition variable in all API interfaces that might be passed the unmarshalled state. The cost is non-zero, it sits on the hot path for all other applications that use the type (even with symbol versioning, applications pay the price until they are recompiled), and it would only serve to support an undefined use.

> > Storing the condvar in the database file means that effectively, the program
> > is expecting different versions of glibc to use the same bit representations
> > of internal data structures.  This is not guaranteed to work.
> > 
> > It seems that either BDB should be changed to not store the condvar in
> > database files, or dnf needs to reinitialize the databases it relies on when
> > glibc is updated.
> 
> It would likely need to be rpm, not dnf.

It would actually be Berkeley DB, which is where we have assigned this bug (i.e. libdb).
Comment 13 Carlos O'Donell 2016-11-15 16:08:36 EST
(In reply to Carlos O'Donell from comment #12)
> > How hard would it be to make this old unused bit value also understood as
> > unused value in new glibc?
> 
> TLDR; we would never support detecting of the old unused bit value. The cost
> is too high for other applications.
> 
> You would have to detect all possible marshalled bit patterns and
> reinitialize the condition variable in all API interfaces that might be
> passed the unmarshalled state. The cost is non-zero, the cost is on the hot
> path for all other applications that use the type (even with versioning apps
> pay the price until they are recompiled), and it would be only to support an
> undefined use.

FYI.

For the sake of openness and transparency, there are two cases which glibc does not support but could; they are the following:

(1) Mixed implementations with PTHREAD_PROCESS_SHARED.

(a) Process A running glibc X initializes a condition variable in shared memory with PTHREAD_PROCESS_SHARED.
(b) Process B running glibc X+1 (with a different condition variable definition) attaches to that shared memory and tries to use it.

It is clear that process A and process B have two different understandings of the condition variable. Supporting this use of PTHREAD_PROCESS_SHARED would require some kind of versioned data interface, e.g. a condvar with a version field that detects the version in use and switches to the appropriate algorithm. This would incur significant costs when using PTHREAD_PROCESS_SHARED objects, and the use case has never been important enough for us to support it. Either way, the glibc upgrade for process B requires a reboot, so, as Florian argues, you need to reboot, in which case you get a consistent view of the condvar and everything works optimally. The text as written in POSIX suggests the above should be supported, but glibc does not support it.

(2) Mixed 32-bit and 64-bit implementations with PTHREAD_PROCESS_SHARED (similar to (1)).

While this is similar to (1) and can be considered a case of "another implementation of a narrower type", I want to call it out separately.

The other case is a mixed 32-bit and 64-bit process case where the processes share types in shared memory, and this does not work correctly for the obvious reason that the types have different ABIs. Making it work could again be part of making PTHREAD_PROCESS_SHARED truly "process shared" across an ABI boundary for 32-bit and 64-bit code (reserve a bit in the version field for an ABI flag). Again, this use case has never been important enough from a user perspective that we would attempt to fix it (though we face a similar problem in nscd with shared maps of the database caches).
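
As a small illustration of case (2), the following program can be compiled as both 32-bit and 64-bit (for example with gcc -m32 and -m64 on x86_64); the reported sizes differ for some of the pthread types, so the two ABIs cannot share these objects byte for byte even within a single glibc release:

/* Sketch: print pthread object sizes.  Built as 32-bit and 64-bit, the
 * reported sizes differ for some of the types, quite apart from any
 * differences in the internal layout. */
#include <pthread.h>
#include <stdio.h>

int main(void)
{
    printf("pthread_mutex_t:  %zu bytes\n", sizeof(pthread_mutex_t));
    printf("pthread_cond_t:   %zu bytes\n", sizeof(pthread_cond_t));
    printf("pthread_rwlock_t: %zu bytes\n", sizeof(pthread_rwlock_t));
    return 0;
}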
Comment 14 Torvald Riegel 2016-11-16 05:43:58 EST
I had a quick look at src/mutex/mut_pthread.c, and I'd like to stress that this applies to all pthreads synchronization primitives, not just condvars.  libdb should not try to persist instances of these types; if it does, the database becomes tied to a particular glibc version.  A new rwlock implementation is in the queue for testing in Rawhide, and we may alter plain mutexes too in the future.

libdb will have to either implement custom synchronization, or it has to require clients to all use the same glibc version and architecture (including 32b vs. 64b on x86_64, for example).
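
For reference, custom synchronization along those lines is feasible because the kernel futex interface only operates on a 32-bit integer whose meaning the application controls. A deliberately simplified sketch of such a lock, following the well-known three-state futex mutex design, might look like the following; it is illustrative only and omits everything libdb would actually need beyond the locking itself (owner-death recovery, timeouts, condition variables):

/* Sketch of a futex-based lock whose persisted representation is a single
 * 32-bit word controlled by the application, not by glibc internals.
 * Simplified: no owner-death recovery, no error handling, Linux-only. */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef atomic_int db_lock_t;   /* 0 = free, 1 = locked, 2 = locked w/ waiters */

static long futex(db_lock_t *word, int op, int val)
{
    return syscall(SYS_futex, word, op, val, NULL, NULL, 0);
}

void db_lock(db_lock_t *l)
{
    int c = 0;
    if (atomic_compare_exchange_strong(l, &c, 1))
        return;                          /* fast path: lock was free */
    if (c != 2)
        c = atomic_exchange(l, 2);       /* mark the lock as contended */
    while (c != 0) {
        futex(l, FUTEX_WAIT, 2);         /* sleep while still contended */
        c = atomic_exchange(l, 2);
    }
}

void db_unlock(db_lock_t *l)
{
    if (atomic_fetch_sub(l, 1) != 1) {   /* there may be waiters */
        atomic_store(l, 0);
        futex(l, FUTEX_WAKE, 1);
    }
}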
Comment 15 Fedora Admin XMLRPC Client 2016-11-16 08:31:44 EST
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.
Comment 16 Torvald Riegel 2017-01-03 08:29:49 EST
glibc upstream now contains the new condition variable.  Has any work been started on the libdb (users) side to fix this libdb bug?

libdb has to be fixed very soon or we will get into issues when updating glibc.  I can help with how to fix this in libdb, but I do need help from the libdb maintainers, in particular in terms of (1) testing, (2) ensuring that all users of libdb enable whatever solution we come up with, and (3) notifying other distributions about this.

Who are the people that should be involved in this?  I suppose upstream-first isn't possible in this case?
Comment 17 Petr Kubat 2017-01-03 08:54:28 EST
Thanks for bringing this up.
I have not yet managed to take a deeper look at this but I can see how this might prove problematic in the near future. I will definitely appreciate any help you can offer with fixing libdb.
As for testing this change, libdb has some kind of test suite that might be useful but has not been used yet in any way. I will take a look at it and see if I can get it working.

I am not sure whether we are going to get any support from upstream with this issue given that we are running on an older (possibly already unsupported?) release, but I can try asking anyway.
Comment 18 Torvald Riegel 2017-01-03 12:31:42 EST
Note that I haven't ever worked with or on libdb before, so it would be good to get people involved that have some experience.  I don't have the spare time to see this through in detail.  I can help regarding how to use pthreads properly and with general concurrency considerations.

To reiterate, the problem is that libdb uses the glibc-internal binary representation in databases, and either (1) libdb or (2) how dnf uses libdb causes new glibc code to try to work on the binary representation produced by old glibc code.

I'm splitting this up into (1) and (2) because it is not yet clear to me who exactly is at fault here.  libdb's documentation on upgrading libdb (see docs/upgrading/upgrade_process.html in libdb's source) is rather vague and fails to include glibc (or whatever provides pthreads) in its considerations but does mention the compiler; this is odd at least from our perspective, because things like data structure layout are part of the ABI guarantees but glibc-internal representations are not part of the ABI.  libdb could state that any pthreads-implementation update is to be considered a libdb major/minor update -- but it does not do that AFAIK, and it's unclear whether that would help.

libdb could require that only the same build of libdb, including the same build of the pthreads implementation, is active on the same database concurrently.  If it shut down all mutexes when there are no references left and reinitialized them when a new reference comes in, that should be sufficient to avoid triggering the glibc update problem under that requirement.  In this case, it may simply be dnf's fault if dnf does not enforce that no process holds a reference on its databases while glibc is upgraded.
I have also looked a bit at the code for libdb mutexes, mutex regions, and environments, but I haven't figured out yet whether libdb actually destroys all mutexes when the last reference to a database is removed.
Nonetheless, making this scenario work could be a solution.

Another solution could be to say that glibc upgrades are like libdb major/minor upgrades, and then follow (some of) the steps outlined in docs/upgrading/upgrade_process.html for every user of libdb.  This requires touching more packages, but may be easier to enforce than the first solution, which requires ensuring that libdb clients cannot hold a reference across updates of glibc.

A third solution would be to version databases and include glibc's version in there.  But that doesn't provide automatic compatibility, and would probably require quite a few changes to libdb.

I guess it would be best to look at the first solution first, starting with trying to find out what libdb actually guarantees or intends to guarantee.  Any suggestions for where to get this information?
Comment 19 Panu Matilainen 2017-01-04 01:53:47 EST
Just FWIW: dnf does not use libdb directly, rpm does. Dnf is only involved as a librpm API consumer. The rpmdb is never closed during a transaction so there's always a "reference" to the database environment when an upgrade is in process.

From the rpm POV, the simplest option would be to just flick on DB_PRIVATE mode. That sort of equals disabling db locking, but non-root queries (which I'd assume to be the vast majority) already run in DB_PRIVATE mode for the simple reason of not having permission to grab locks, and the world hasn't ended. A single writer is enforced via a separate transaction lock anyway.
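
For context, a minimal sketch of what opening an environment with DB_PRIVATE looks like through the Berkeley DB C API; the flag set and error handling here are illustrative, not rpm's actual configuration:

/* Sketch: open a Berkeley DB environment with DB_PRIVATE, so the region
 * (including its mutexes and condvars) lives in process-private memory and
 * is recreated on every open instead of being persisted in __db.00* files.
 * Flags and error handling are illustrative, not rpm's actual setup. */
#include <db.h>
#include <stdio.h>

int open_private_env(DB_ENV **out, const char *home)
{
    DB_ENV *dbenv;
    int ret;

    if ((ret = db_env_create(&dbenv, 0)) != 0) {
        fprintf(stderr, "db_env_create: %s\n", db_strerror(ret));
        return ret;
    }
    ret = dbenv->open(dbenv, home,
                      DB_CREATE | DB_INIT_MPOOL | DB_PRIVATE, 0644);
    if (ret != 0) {
        fprintf(stderr, "DB_ENV->open(%s): %s\n", home, db_strerror(ret));
        dbenv->close(dbenv, 0);
        return ret;
    }
    *out = dbenv;
    return 0;
}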
Comment 20 Petr Kubat 2017-01-04 06:39:06 EST
(In reply to Torvald Riegel from comment #18)
> Second, I have looked a bit at the code for libdb mutex, mutex regions, and
> environments, but I haven't figured out yet whether libdb actually destroys
> all mutexes when the last reference to a database is removed.
> Nonetheless, making this scenario work could be a solution.

To my knowledge, mutexes located in libdb's regions are not destroyed or re-initialized when the last process that references them exits.
 
> Another solution could be to say that glibc upgrades are like libdb
> major/minor upgrades, and then follow (some of) the steps outlined in
> docs/upgrading/upgrade_process.html for every user of libdb.  This requires
> touching more packages, but may be easier to enforce than the first
> solution, which requires checking whether libdb clients cannot hold a
> reference across updates of glibc.

This would be possible for the packages we know of, but it would not help user applications built on top of libdb, which would find themselves unable to use it and would have to make some manual changes on their own.
This could, however, be used as a workaround for rpm, which would otherwise block the upgrade to F26.

> A third solution would be to version databases and include glibc's version
> in there.  But that doesn't provide automatic capability, and would probably
> require quite a few changes to libdb.
> 
> I guess it would be best to look at the first solution first, starting with
> trying to find out what libdb actually guarantees or intends to guarantee. 
> Any suggestions for where to get this information?

Unfortunately, my knowledge of libdb is also very limited, given that I have not been taking care of it for very long.
I have contacted upstream regarding this issue. The thread is publicly available at:
https://community.oracle.com/message/14170069
My hope is that Oracle might help us answer the questions we have about their implementation at least.
Comment 21 Petr Kubat 2017-01-06 02:49:59 EST
Discussion with Oracle moved to email. They promised to take a look at this issue, but it will take a while before they can work on it actively. Do we know when the new implementation is going to get into an upstream release, so that Oracle knows what timeframe to work with?

I have also verified with Oracle that libdb does not do any reinitialization checks of synchronization primitives on its own, so currently the only way to work around this is to open the environment with the DB_RECOVER (or, as Panu pointed out, DB_PRIVATE) flag set.
Comment 22 Petr Kubat 2017-02-13 04:53:40 EST
Forwarding new information from upstream:

----

We have been looking at this.  this is a bit tricky.   The problem as we see it, is the glibc library changes the pthread_cond_t structure in a way that breaks backward compatibility.   So we cannot look at structure size and make any conclusions.  The idea that we came up with is this ....

   -  add a check in the configure code to see if gnu_get_libc_version() exists on the system -- This should be true for nearly all Linuxes
 - if true, we compile in some extra code that is executed during the environment open
 - This code will clean the mutex region if the glibc version is different.

  consequences of this patch ....

  - we will store the glibc version in the environment and on open we will grab the current version and if we detect a mismatch we will recreate the region files.  This should prevent the hang that you are seeing.

----

This could work as a workaround for the issue; however, I have voiced some concerns regarding already existing environments that do not have the glibc version variable yet (as we have had problems with modifying environment structures before).
Upstream also asks whether it is possible to get into a situation in which two rpm processes access the libdb environment, each using a different glibc version (after the glibc upgrade has been performed).
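
For illustration, the runtime part of the proposed check could look roughly like the sketch below. The stored_version field and the recreate_region_files() helper are assumptions about how such a patch might be wired up, not actual libdb code; only gnu_get_libc_version() is a real glibc interface:

/* Sketch of the proposed check: compare the glibc version recorded in the
 * environment with the running one and force region recreation on mismatch.
 * stored_version and recreate_region_files() are illustrative assumptions. */
#include <gnu/libc-version.h>   /* gnu_get_libc_version() */
#include <string.h>

extern void recreate_region_files(void);   /* hypothetical helper */

void check_glibc_version(char *stored_version, size_t len)
{
    const char *running = gnu_get_libc_version();   /* e.g. "2.24" */

    if (stored_version[0] == '\0') {
        /* Older environment without the field: record it on first open. */
        strncpy(stored_version, running, len - 1);
        stored_version[len - 1] = '\0';
    } else if (strcmp(stored_version, running) != 0) {
        /* Mismatch: the persisted mutex/condvar bits may be stale. */
        recreate_region_files();
        strncpy(stored_version, running, len - 1);
        stored_version[len - 1] = '\0';
    }
}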
Comment 23 Panu Matilainen 2017-02-13 05:15:26 EST
(In reply to Petr Kubat from comment #22)
> Also upstream asks whether it is possible to get into a situation in which
> two rpm processes are accessing libdb environment each using a different
> glibc version (after performing the glibc upgrade).

This is one of the reasons why calling rpm from rpm scriptlets is such a bad idea: while it appears to work most of the time, every once in a blue moon it breaks in that single transaction where glibc innards change; there's no way to predict when this might happen, and reproducing it can be a bit tricky too.

It's also a fine example of the wacko complications involved with the rpm database...
Comment 24 Petr Kubat 2017-02-13 06:09:02 EST
(In reply to Panu Matilainen from comment #23)
> This is one of the reasons why calling rpm from rpm scriptlets is such a bad
> idea: while it appears to work most of the time, every once in a blue moon
> it breaks in that single transaction where glibc innards change and there's
> no way to predict when this might happen and reproducing can be a bit tricky
> too.
> 
> It's also a fine example of the wacko complications involved with the rpm
> database...

Ah, right. I forgot that running rpm inside an rpm scriptlet is a thing. Thanks for reminding me.
Comment 25 Panu Matilainen 2017-02-13 06:15:03 EST
(In reply to Petr Kubat from comment #24)
> 
> Ah, right. I forgot running rpm inside a rpm scriptlet is a thing. Thanks
> for reminding me.

It's very much a discouraged thing. Not that discouraging ever *stopped* people from doing it though...
Comment 26 Florian Weimer 2017-02-13 06:19:34 EST
(In reply to Petr Kubat from comment #22)
> Forwarding new information from upstream:
> 
> ----
> 
> We have been looking at this.  this is a bit tricky.   The problem as we see
> it, is the glibc library changes the pthread_cond_t structure in a way that
> breaks backward compatibility.   So we cannot look at structure size and
> make any conclusions.  The idea that we came up with is this ....
> 
>    -  add a check in the configure code to see if gnu_get_libc_version()
> exists on the system -- This should be true for nearly all Linuxes
>  - if true, we compile in some extra code that is executed during the
> environment open
>  - This code will clean the mutex region if the glibc version is different.
> 
>   consequences of this patch ....
> 
>   - we will store the glibc version in the environment and on open we will
> grab the current version and if we detect a mismatch we will recreate the
> region files.  This should prevent the hang that you are seeing.
> 
> ----

I don't think this is quite sufficient because the data structure layout could change without the glibc version changing.

I suppose we could provide a pthread_layout_version_np function in glibc which returns an int referring to the version number of the current pthread data structure layout.  It is a bit tricky to come up with a single version number which is consistent across all distributions, but we can probably provide something.
Comment 27 Torvald Riegel 2017-02-13 06:57:58 EST
(In reply to Florian Weimer from comment #26)
> (In reply to Petr Kubat from comment #22)
> > Forwarding new information from upstream:
> > 
> > ----
> > 
> > We have been looking at this.  this is a bit tricky.   The problem as we see
> > it, is the glibc library changes the pthread_cond_t structure in a way that
> > breaks backward compatibility.   So we cannot look at structure size and
> > make any conclusions.  The idea that we came up with is this ....
> > 
> >    -  add a check in the configure code to see if gnu_get_libc_version()
> > exists on the system -- This should be true for nearly all Linuxes
> >  - if true, we compile in some extra code that is executed during the
> > environment open
> >  - This code will clean the mutex region if the glibc version is different.
> > 
> >   consequences of this patch ....
> > 
> >   - we will store the glibc version in the environment and on open we will
> > grab the current version and if we detect a mismatch we will recreate the
> > region files.  This should prevent the hang that you are seeing.
> > 
> > ----
> 
> I don't think this is quite sufficient because the data structure layout
> could change without the glibc version changing.
> 
> I suppose we could provide a pthread_layout_version_np function in glibc
> which returns an int which refers to version number of the current pthread
> data structure layout.  it is a bit tricky to come up with a single version
> number which is consistent across all distributions, but we can probably
> provide something.

I agree that we would be able to build such a version number, but I don't think we should.  First, libdb should show that it can't just do reference counting, for example, and re-initialize any condvars it uses whenever a database transitions from not being used to being used.  I'll talk with libdb upstream about this.
Comment 28 Howard Chu 2017-02-13 07:46:51 EST
If you can guarantee that a reboot will occur between version changes, a simple fix is to always configure BDB to use a shared memory region instead of an mmap'd file for its environment. Then the shared memory region is automatically destroyed/recreated by rebooting the system.
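
For reference, Berkeley DB can be pointed at System V shared memory (which does not survive a reboot) via DB_SYSTEM_MEM and a base segment key. A minimal sketch, with an arbitrary key value chosen purely for illustration:

/* Sketch: back the environment regions with System V shared memory
 * (destroyed at reboot) instead of file-backed __db.00* mappings.
 * The key value and flag set are illustrative. */
#include <db.h>

int open_sysv_shm_env(DB_ENV **out, const char *home)
{
    DB_ENV *dbenv;
    int ret;

    if ((ret = db_env_create(&dbenv, 0)) != 0)
        return ret;

    /* Base segment ID for the environment's shared memory regions. */
    if ((ret = dbenv->set_shm_key(dbenv, 664)) != 0) {
        dbenv->close(dbenv, 0);
        return ret;
    }
    ret = dbenv->open(dbenv, home,
                      DB_CREATE | DB_INIT_MPOOL | DB_INIT_LOCK | DB_SYSTEM_MEM,
                      0644);
    if (ret != 0) {
        dbenv->close(dbenv, 0);
        return ret;
    }
    *out = dbenv;
    return 0;
}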
Comment 29 Petr Kubat 2017-02-21 08:46:51 EST
Some more information from the discussion with libdb upstream.

We have 2 ideas we are looking at right now.

1. have BDB store away the glibc version and, on an open, check whether the glibc version changed and force a recovery under the covers
2. DBENV recovery on glibc version change via a %posttrans cleanup action

The first idea I have already mentioned in comment 22.
The second idea is something I have thrown into the discussion as a workaround specifically for rpm. I have tried looking at this idea a bit more to see if it would actually work and found out that there are additional issues connected with it.
AFAIK there are 3 ways to go about forcing the environment to recover:
The first is "rpm --rebuilddb", but this does not work as it needs a transaction lock that is held by the rpm process doing the install/update action.
The second is "db_recover -h /var/lib/rpm", which is the libdb way of doing the recovery. This does work in recovering the environment, but results in the rpm process throwing DB_RUNRECOVERY errors.
The third is removing the problematic region files by force via "rm -f /var/lib/rpm/__db.00*", but it seems rpm recreates them with the old glibc structures in place when the db/dbenv is closed.

Generally libdb upstream is against modifying their code to rebuild the environment every time it is accessed as that (together with the implementation of an "in-use" counter) would be a non-negligible hit to performance.
Comment 30 Petr Kubat 2017-02-22 03:48:49 EST
A correction to the third way of environment recovery ("rm -f /var/lib/rpm/__db.00*"): the region files are not recreated when the db/dbenv is closed, but during dnf's verification process, which is still a problem since the hang is not removed.
Comment 32 Fedora End Of Life 2017-02-28 05:36:52 EST
This bug appears to have been reported against 'rawhide' during the Fedora 26 development cycle.
Changing version to '26'.