Bug 1975706 - Python 2.7 and 3.8 test suite hangs on s390x when building the python/python3 package
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: glibc
Version: 7.9
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Assignee: glibc team
QA Contact: qe-baseos-tools-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-24 09:07 UTC by Victor Stinner
Modified: 2023-07-18 14:17 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-13 16:14:45 UTC
Target Upstream Version:
Embargoed:


Attachments
sem_timedwait.py (2.92 KB, text/plain)
2021-06-29 11:42 UTC, Victor Stinner
strace-bug (146.07 KB, text/plain)
2021-06-29 12:52 UTC, Victor Stinner
strace-ok: strace when sem_timedwait.py works (179.03 KB, text/plain)
2021-06-29 13:08 UTC, Victor Stinner

Description Victor Stinner 2021-06-24 09:07:26 UTC
When building Python 2.7.18 or Python 3.8.10 on RHEL7 using Brew, the Python test suite hangs on s390x.

The following Python 3.8 tests timed out:

* test_concurrent_futures
* test_faulthandler
* test_io
* test_logging
* test_multiprocessing_fork
* test_multiprocessing_forkserver
* test_multiprocessing_spawn
* test_poll
* test_queue
* test_subprocess
* test_thread
* test_threading

Python 3.8 test_daemon_threads_shutdown_stdout_deadlock() of test_io hangs.

Python 2.7 test_multiprocessing hangs: maybe test_notify_all() (it hangs after "test_notify ... ok").

Python 3.8 test_dump_traceback_later() of test_faulthandler fails because the script didn't write any output.


(A) test_daemon_threads_shutdown_stdout_deadlock() case

test_io's test_daemon_threads_shutdown_stdout_deadlock() tests the code which detects a deadlock when a Python "daemon" thread calls sys.stdout.write('.') while Python is exiting. The involved code in Modules/_io/bufferedio.c:
-------------------------------
    relax_locking = _Py_IsFinalizing();
    Py_BEGIN_ALLOW_THREADS
    if (!relax_locking)
        st = PyThread_acquire_lock(self->lock, 1);
    else {
        /* When finalizing, we don't want a deadlock to happen with daemon
         * threads abruptly shut down while they owned the lock.
         * Therefore, only wait for a grace period (1 s.).
         * Note that non-daemon threads have already exited here, so this
         * shouldn't affect carefully written threaded I/O code.
         */
        st = PyThread_acquire_lock_timed(self->lock, (PY_TIMEOUT_T)1e6, 0);
    }
    Py_END_ALLOW_THREADS
    if (relax_locking && st != PY_LOCK_ACQUIRED) {
        PyObject *msgobj = PyUnicode_FromFormat(
            "could not acquire lock for %A at interpreter "
            "shutdown, possibly due to daemon threads",
            (PyObject *) self);
        const char *msg = PyUnicode_AsUTF8(msgobj);
        Py_FatalError(msg);
    }
-------------------------------

Maybe PyThread_acquire_lock_timed() does not time out after 1.0 second, as it should.

In Python 3.8, PyThread_acquire_lock_timed() is implemented with the libc sem_timedwait().

sem_timedwait() has 2 code paths (see the sketch after this list):

* Fast code path using atomic operations
* Slow code path using Linux futex() syscall
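
For context, here is a minimal C sketch (mine, not from the bug report) of how sem_timedwait() is called with an absolute CLOCK_REALTIME deadline; the semaphore value 0 forces the slow futex() code path, and the 1-second timeout is illustrative:

----
#include <semaphore.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    sem_t sem;
    struct timespec deadline;

    /* Semaphore with value 0: the fast path (atomically grabbing a token)
       cannot succeed, so the wait must sleep in the futex() syscall. */
    sem_init(&sem, 0, 0);

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;   /* absolute deadline: now + 1 second */

    if (sem_timedwait(&sem, &deadline) == -1)
        perror("sem_timedwait");   /* expected: ETIMEDOUT after ~1 second */

    sem_destroy(&sem);
    return 0;
}
----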


(B) test_dump_traceback_later() case

test_faulthandler's test_dump_traceback_later() spawns a thread which waits for 0.5 seconds and then writes the Python stacks of all threads to stderr. The C code of the thread:

static void
faulthandler_thread(void *unused)
{
    PyLockStatus st;
    const char* errmsg;
    int ok;
#if defined(HAVE_PTHREAD_SIGMASK) && !defined(HAVE_BROKEN_PTHREAD_SIGMASK)
    sigset_t set;

    /* we don't want to receive any signal */
    sigfillset(&set);
    pthread_sigmask(SIG_SETMASK, &set, NULL);
#endif

    do {
        st = PyThread_acquire_lock_timed(thread.cancel_event,
                                         thread.timeout_us, 0);
        if (st == PY_LOCK_ACQUIRED) {
            PyThread_release_lock(thread.cancel_event);
            break;
        }
        /* Timeout => dump traceback */
        assert(st == PY_LOCK_FAILURE);

        _Py_write_noraise(thread.fd, thread.header, (int)thread.header_len);

        errmsg = _Py_DumpTracebackThreads(thread.fd, thread.interp, NULL);
        ok = (errmsg == NULL);

        if (thread.exit)
            _exit(1);
    } while (ok && thread.repeat);

    /* The only way out */
    PyThread_release_lock(thread.running);
}

It waits on PyThread_acquire_lock_timed() with a timeout of 0.5 seconds. It seems like PyThread_acquire_lock_timed() does not time out after 0.5 seconds, as it should.

Instead, the main thread "cancels" the waiting thread after 2.5 seconds: it releases the cancel_event lock, which unblocks PyThread_acquire_lock_timed(), and so the thread exits (without writing anything to stderr).

Comment 11 Victor Stinner 2021-06-29 11:42:27 UTC
Created attachment 1795761 [details]
sem_timedwait.py

I rewrote test_faulthandler.test_dump_traceback_later() in pure Python, as code compatible with RHEL6 Python 2.6: see the attached sem_timedwait.py script.

On Python 2.6 and 2.7, it uses multiprocessing.synchronize.Lock.acquire(timeout=TIMEOUT) which is implemented with sem_timedwait() in C.

On Python 3, it uses threading.Lock.acquire(timeout=TIMEOUT) which is also implemented with sem_timedwait() in C.

Expected result (no bug):
---
# python2.6 sem_timedwait.py 
0.500 sec: *** DUMP TRACEBACK ***
2.503 sec: cancel_dump_traceback_later
2.503 sec: exit
---

Output when the bug occurs:
---
$ python2 z.py
2.503 sec: cancel_dump_traceback_later
2.504 sec: stop thread
2.504 sec: exit
---

When the bug occurs, sem_timedwait(<timeout of 0.5 seconds>) does *NOT* return with ETIMEDOUT after 0.5 seconds. It only returns with success (lock acquired) after 2.5 seconds, when the main thread releases the cancel_event lock to stop the thread.

Comment 12 Victor Stinner 2021-06-29 11:46:03 UTC
Interesting data to collect about this issue, to identify under which conditions it occurs:

hostname
cat /proc/cpuinfo
cat /etc/redhat-release
uname -r
rpm -q glibc
uptime

Comment 14 Victor Stinner 2021-06-29 12:50:50 UTC
Commented extract of strace logs when the bug occurs:
------
# Main thread
# Python: time.sleep(2.5)
48515 08:35:36.516873 select(0, NULL, NULL, NULL, {2, 500000} <unfinished ...>

    # FaulthandlerThread
    48516 08:35:36.516889 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x3fff617e000
    48516 08:35:36.516924 write(1, "1624970136.516758919 (0.001 sec)"..., 74) = 74
    # Python: cancel_event.acquire(timeout=0.5)
    48516 08:35:36.516956 futex(0x3fff6c00004, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1624970137, 16951000}, ffffffff <unfinished ...>

# ... nothing happens for 2.5 seconds ...

# Python: ... time.sleep(2.5) completes
48515 08:35:39.019500 <... select resumed> ) = 0 (Timeout)
48515 08:35:39.019670 write(1, "1624970139.019623995 (2.504 sec)"..., 62) = 62

# Python: cancel_event.release()
48515 08:35:39.019729 futex(0x3fff6c00004, FUTEX_WAKE, 1) = 1
48515 08:35:39.019770 futex(0x3fff6b80004, FUTEX_WAIT, 0, NULL <unfinished ...>

    # Python: ... cancel_event.acquire(timeout=0.5) completes in 2.5 seconds
    48516 08:35:39.019780 <... futex resumed> ) = 0
------

sem_timedwait() is called with a timeout of 0.5 seconds, but it does *NOT* return ETIMEDOUT after 0.5 seconds. It only returns after 2.5 seconds, when the main thread releases the cancel_event lock.

Around Unix time 1624970136.516758919, sem_timedwait() calls the futex() syscall with the deadline 1624970137.016951000 (FUTEX_CLOCK_REALTIME): a timeout of 0.5 seconds (0.500 seconds if we only use the logged times).

The glibc seems to use the futex() syscall as expected, but the futex() syscall does not complete in 0.5 seconds as it should.

Script output:
---
1624970136.516758919 (0.001 sec): cancel_event.acquire(timeout=<0.5 sec>)
1624970139.019623995 (2.504 sec): cancel_dump_traceback_later
1624970139.019831896 (2.504 sec): cancel_event.acquire(timeout=<0.5 sec>) -> True
1624970139.019917965 (2.504 sec): stop thread
1624970139.020457029 (2.505 sec): exit
---

Again, you can see that Python's Lock.acquire(timeout=0.5) doesn't return False (not acquired) after 0.5 seconds, but only returns after 2.5 seconds when the lock is acquired (because the main thread released the lock).

Comment 15 Victor Stinner 2021-06-29 12:52:32 UTC
Created attachment 1795784 [details]
strace-bug

strace-bug: strace output (RHEL7 mock container, Linux kernel 2.6.32-754.el6.s390x, glibc-2.17-292.el7.s390x) when the bug occurs.

Comment 16 Victor Stinner 2021-06-29 13:07:05 UTC
Commented extract of strace logs when the bug does NOT occur, on RHEL6 (kernel 2.6.32-754.el6.s390x, glibc-2.12-1.212.el6.s390x):
------------------------
# Main thread
# Create FaulthandlerThread
48546 08:54:33.727005 clone(child_stack=0x3fff79ff220, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x3fff79ff9e0, tls=0x3fff79ff910, child_tidptr=0x3fff79ff9e0) = 48547

    # FaulthandlerThread
    48547 08:54:33.727054 set_robust_list(0x3fff79ff9f0, 24 <unfinished ...>

48546 08:54:33.727066 futex(0x801738b0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>

    48547 08:54:33.727074 <... set_robust_list resumed> ) = 0
    48547 08:54:33.727105 mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x3ffef000000
    48547 08:54:33.727156 munmap(0x3ffef000000, 16777216) = 0
    48547 08:54:33.727178 munmap(0x3fff4000000, 50331648) = 0
    48547 08:54:33.727197 mprotect(0x3fff0000000, 135168, PROT_READ|PROT_WRITE) = 0
    48547 08:54:33.727253 futex(0x801738b0, FUTEX_WAKE_PRIVATE, 1) = 1

48546 08:54:33.727281 <... futex resumed> ) = 0
48546 08:54:33.727300 futex(0x801722c0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>

    48547 08:54:33.727332 futex(0x801722c0, FUTEX_WAKE_PRIVATE, 1) = 1

48546 08:54:33.727354 <... futex resumed> ) = 0

    48547 08:54:33.727369 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0

# Python: time.sleep(2.5)
48546 08:54:33.727401 select(0, NULL, NULL, NULL, {2, 500000} <unfinished ...>

    48547 08:54:33.727421 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x3fff6fff000
    48547 08:54:33.727456 write(1, "1624971273.727308989 (0.001 sec): cancel_event.acquire(timeout=<0.5 sec>)\n", 74) = 74

    # Python: cancel_event.acquire(timeout=0.5)
    48547 08:54:33.727488 futex(0x3fff7a80000, FUTEX_WAIT, 0, {0, 499999000}) = -1 ETIMEDOUT (Connection timed out)

    48547 08:54:34.227652 write(1, "1624971274.227641106 (0.502 sec): cancel_event.acquire(timeout=<0.5 sec>) -> False\n", 83) = 83
    48547 08:54:34.227692 write(1, "1624971274.227683067 (0.502 sec): *** DUMP TRACEBACK ***\n", 57) = 57
    48547 08:54:34.227737 madvise(0x3fff7000000, 10465280, MADV_DONTNEED) = 0
    48547 08:54:34.227759 _exit(0)          = ?
    48547 08:54:34.227777 +++ exited with 0 +++

# Python: ... time.sleep(2.5) completes
48546 08:54:36.230018 <... select resumed> ) = 0 (Timeout)
48546 08:54:36.230192 write(1, "1624971276.230144978 (2.504 sec): cancel_dump_traceback_later\n", 62) = 62
48546 08:54:36.230295 write(1, "1624971276.230284929 (2.504 sec): exit\n", 39) = 39
------------------------


The most important syscall is "futex(0x3fff7a80000, FUTEX_WAIT, 0, {0, 499999000}) = -1 ETIMEDOUT".

Here the futex() syscall completes in 0.5 seconds, as expected.

On RHEL6 (glibc-2.12-1.212.el6.s390x), sem_timedwait() calls futex() with the FUTEX_WAIT operation.

On RHEL7 (glibc-2.17-292.el7.s390x), sem_timedwait() calls futex() with the FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME operation.

Script output of this strace extract:
---
1624971273.727308989 (0.001 sec): cancel_event.acquire(timeout=<0.5 sec>)
1624971274.227641106 (0.502 sec): cancel_event.acquire(timeout=<0.5 sec>) -> False
1624971274.227683067 (0.502 sec): *** DUMP TRACEBACK ***
1624971276.230144978 (2.504 sec): cancel_dump_traceback_later
1624971276.230284929 (2.504 sec): exit
---

Comment 17 Victor Stinner 2021-06-29 13:08:02 UTC
Created attachment 1795796 [details]
strace-ok: strace when sem_timedwait.py works

Comment 18 Victor Stinner 2021-06-29 13:49:36 UTC
sem_timedwait() in RHEL6 glibc-2.12-1.212.el6.s390x:

----
#define lll_futex_timed_wait(futex, val, timespec, private) \
  ({                                                                          \
    register unsigned long int __r2 asm ("2") = (unsigned long int) (futex);  \
    register unsigned long int __r3 asm ("3")                                 \
      = __lll_private_flag (FUTEX_WAIT, private);                             \
    register unsigned long int __r4 asm ("4") = (unsigned long int) (val);    \
    register unsigned long int __r5 asm ("5") = (unsigned long int)(timespec);\
    register unsigned long int __result asm ("2");                            \
                                                                              \
    __asm __volatile ("svc %b1"                                               \
                      : "=d" (__result)                                       \
                      : "i" (SYS_futex), "0" (__r2), "d" (__r3),              \
                        "d" (__r4), "d" (__r5)                                \
                      : "cc", "memory" );                                     \
    __result;                                                                 \
  })

int
sem_timedwait (sem_t *sem, const struct timespec *abstime)
{
  struct new_sem *isem = (struct new_sem *) sem;
  int err;

  if (atomic_decrement_if_positive (&isem->value) > 0)
    return 0;

  if (abstime->tv_nsec < 0 || abstime->tv_nsec >= 1000000000)
    {
      __set_errno (EINVAL);
      return -1;
    }

  atomic_increment (&isem->nwaiters);

  pthread_cleanup_push (__sem_wait_cleanup, isem);

  while (1)
    {
      struct timeval tv;
      struct timespec rt;
      int sec, nsec;

      /* Get the current time.  */
      __gettimeofday (&tv, NULL);

      /* Compute relative timeout.  */
      sec = abstime->tv_sec - tv.tv_sec;
      nsec = abstime->tv_nsec - tv.tv_usec * 1000;
      if (nsec < 0)
	{
	  nsec += 1000000000;
	  --sec;
	}

      /* Already timed out?  */
      err = -ETIMEDOUT;
      if (sec < 0)
	{
	  __set_errno (ETIMEDOUT);
	  err = -1;
	  break;
	}

      /* Do wait.  */
      rt.tv_sec = sec;
      rt.tv_nsec = nsec;

      /* Enable asynchronous cancellation.  Required by the standard.  */
      int oldtype = __pthread_enable_asynccancel ();

      err = lll_futex_timed_wait (&isem->value, 0, &rt,
				  isem->private ^ FUTEX_PRIVATE_FLAG);

      /* Disable asynchronous cancellation.  */
      __pthread_disable_asynccancel (oldtype);

      if (err != 0 && err != -EWOULDBLOCK)
	{
	  __set_errno (-err);
	  err = -1;
	  break;
	}

      if (atomic_decrement_if_positive (&isem->value) > 0)
	{
	  err = 0;
	  break;
	}
    }

  pthread_cleanup_pop (0);

  atomic_decrement (&isem->nwaiters);

  return err;
}
----

In the reproducer, lll_futex_timed_wait() is called with isem->private=0. The "svc 238" instruction is called with:

(gdb) p /x $r2   # &isem->value
$7 = 0x3ff9a080000
(gdb) p /x $r3   # private
$8 = 0x0
(gdb) p /x $r4   # value = isem->value
$9 = 0x0
(gdb) p	/x $r5   # &rt (timespec)
$10 = 0x3ff99ffe270

where rt is a *relative* timeout, not an absolute timestamp: see the __gettimeofday() call used to compute it.

Comment 19 Victor Stinner 2021-06-29 14:29:38 UTC
sem_timedwait() code of glibc-2.17-324.el7_9.s390x on RHEL 7.9. This glibc version is more recent than the version used on Brew, but the code should be mostly the same.

// nptl/sysdeps/unix/sysv/linux/sem_timedwait.c
int
sem_timedwait (sem_t *sem, const struct timespec *abstime)
{

  if (abstime->tv_nsec < 0 || abstime->tv_nsec >= 1000000000)
    {
      __set_errno (EINVAL);
      return -1;
    }

  if (__new_sem_wait_fast ((struct new_sem *) sem, 0) == 0)
    return 0;
  else
    return __new_sem_wait_slow((struct new_sem *) sem, abstime);
}

// nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c
static int
__attribute__ ((noinline))
__new_sem_wait_slow (struct new_sem *sem, const struct timespec *abstime)
{
  int err = 0;

#if __HAVE_64B_ATOMICS
  /* Add a waiter.  Relaxed MO is sufficient because we can rely on the
     ordering provided by the RMW operations we use.  */
  unsigned long d = atomic_fetch_add_relaxed (&sem->data,
      1UL << SEM_NWAITERS_SHIFT);

  pthread_cleanup_push (__sem_wait_cleanup, sem);

  /* Wait for a token to be available.  Retry until we can grab one.  */
  for (;;)
    {
      /* If there is no token available, sleep until there is.  */
      if ((d & SEM_VALUE_MASK) == 0)
	{
	  err = do_futex_wait (sem, abstime);
	  /* A futex return value of 0 or EAGAIN is due to a real or spurious
	     wake-up, or due to a change in the number of tokens.  We retry in
	     these cases.
	     If we timed out, forward this to the caller.
	     EINTR could be either due to being interrupted by a signal, or
	     due to a spurious wake-up.  Thus, we cannot distinguish between
	     both, and are not allowed to return EINTR to the caller but have
	     to retry; this is because we may not have been interrupted by a
	     signal.  However, if we assume that only signals cause a futex
	     return of EINTR, we forward EINTR to the caller.

	     Retrying on EINTR is technically always allowed because to
	     reliably interrupt sem_wait with a signal, the signal handler
	     must call sem_post (which is AS-Safe).  In executions where the
	     signal handler does not do that, the implementation can correctly
	     claim that sem_wait hadn't actually started to execute yet, and
	     thus the signal never actually interrupted sem_wait.  We make no
	     timing guarantees, so the program can never observe that sem_wait
	     actually did start to execute.  Thus, in a correct program, we
	     can expect a signal that wanted to interrupt the sem_wait to have
	     provided a token, and can just try to grab this token if
	     futex_wait returns EINTR.  */
	  if (err == ETIMEDOUT ||
	      (err == EINTR && sem_assume_only_signals_cause_futex_EINTR))
	    {
	      __set_errno (err);
	      err = -1;
	      /* Stop being registered as a waiter.  */
	      atomic_fetch_add_relaxed (&sem->data,
		  -(1UL << SEM_NWAITERS_SHIFT));
	      break;
	    }
	  /* Relaxed MO is sufficient; see below.  */
	  d = atomic_load_relaxed (&sem->data);
	}
      else
	{
	  /* Try to grab both a token and stop being a waiter.  We need
	     acquire MO so this synchronizes with all token providers (i.e.,
	     the RMW operation we read from or all those before it in
	     modification order; also see sem_post).  On the failure path,
	     relaxed MO is sufficient because we only eventually need the
	     up-to-date value; the futex_wait or the CAS perform the real
	     work.  */
	  if (atomic_compare_exchange_weak_acquire (&sem->data,
	      &d, d - 1 - (1UL << SEM_NWAITERS_SHIFT)))
	    {
	      err = 0;
	      break;
	    }
	}
    }

  pthread_cleanup_pop (0);
#else
  // (...): code path not taken on s390x
#endif

  return err;
}

// nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c
static int
__attribute__ ((noinline))
do_futex_wait (struct new_sem *sem, const struct timespec *abstime)
{
  int err;

#if __HAVE_64B_ATOMICS
  err = futex_abstimed_wait ((unsigned int *) &sem->data + SEM_VALUE_OFFSET, 0,
                             abstime, sem->private, true);
#else
  // (...): code path not taken on s390x
#endif

  return err;
}

// nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c
static __always_inline int
futex_abstimed_wait (unsigned int* futex, unsigned int expected,
		     const struct timespec* abstime, int private, bool cancel)
{
  int err, oldtype;
  if (abstime == NULL)
    {
      if (cancel)
	oldtype = __pthread_enable_asynccancel ();
      err = lll_futex_wait (futex, expected, private);
      if (cancel)
	__pthread_disable_asynccancel (oldtype);
    }
  else
    {
#if (defined __ASSUME_FUTEX_CLOCK_REALTIME	\
     && defined lll_futex_timed_wait_bitset)
      /* The Linux kernel returns EINVAL for this, but in userspace
	 such a value is valid.  */
      if (abstime->tv_sec < 0)
	return ETIMEDOUT;
#else
      // (...): code path not taken on s390x
#endif
      if (cancel)
	oldtype = __pthread_enable_asynccancel ();
#if (defined __ASSUME_FUTEX_CLOCK_REALTIME	\
     && defined lll_futex_timed_wait_bitset)
      err = lll_futex_timed_wait_bitset (futex, expected, abstime,
					 FUTEX_CLOCK_REALTIME, private);
#else
      // (...): code path not taken on s390x
#endif
      if (cancel)
	__pthread_disable_asynccancel (oldtype);
    }
  switch (err)
    {
    case 0:
    case -EAGAIN:
    case -EINTR:
    case -ETIMEDOUT:
      return -err;

    case -EFAULT: /* Must have been caused by a glibc or application bug.  */
    case -EINVAL: /* Either due to wrong alignment or due to the timeout not
		     being normalized.  Must have been caused by a glibc or
		     application bug.  */
    case -ENOSYS: /* Must have been caused by a glibc bug.  */
    /* No other errors are documented at this time.  */
    default:
      abort ();
    }
}

// nptl/sysdeps/unix/sysv/linux/s390/lowlevellock.h
#define FUTEX_BITSET_MATCH_ANY	0xffffffff
#define lll_futex_timed_wait_bitset(futexp, val, timespec, clockbit, private) \
  ({                                                                          \
    register unsigned long int __r2 asm ("2") = (unsigned long int) (futexp); \
    register unsigned long int __r3 asm ("3")                                 \
      = __lll_private_flag ((FUTEX_WAIT_BITSET | clockbit), private);         \
    register unsigned long int __r4 asm ("4") = (long int) (val);             \
    register unsigned long int __r5 asm ("5") = (long int) (timespec);        \
    register unsigned long int __r6 asm ("6") = (unsigned long int) (NULL);   \
    register unsigned long int __r7 asm ("7")                                 \
      = (unsigned int) (FUTEX_BITSET_MATCH_ANY);                              \
    register unsigned long __result asm ("2");                                \
                                                                              \
    __asm __volatile ("svc %b1"                                               \
                      : "=d" (__result)                                       \
                      : "i" (SYS_futex), "0" (__r2), "d" (__r3),              \
                        "d" (__r4), "d" (__r5), "d" (__r6), "d" (__r7)        \
                      : "cc", "memory" );                                     \
    __result;                                                                 \
  })

lll_futex_timed_wait_bitset() calls "svc    238" instruction with:

(gdb) p /x $r2  # futex
$5 = 0x3fff6c00004
(gdb) p /x $r3  # FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME
$6 = 0x109
(gdb) p /x $r4  # expected (val)
$7 = 0x0
(gdb) p /x $r5  # abstime (timespec)
$8 = 0x3fff6b7e2f8
(gdb) p	/x $r6  # NULL
$14 = 0x0
(gdb) p	/x $r7  # FUTEX_BITSET_MATCH_ANY
$15 = 0xffffffff

It seems like lll_futex_timed_wait_bitset() is something specific to the s390 architecture.

Comment 20 Victor Stinner 2021-06-29 15:46:46 UTC
futex() syscall in the kernel-2.6.32-754.35.1.

./include/linux/futex.h:#define FUTEX_PRIVATE_FLAG	128

// kernel/futex.c
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
		u32 __user *uaddr2, u32 val2, u32 val3)
{
	int clockrt, ret = -ENOSYS;
	int cmd = op & FUTEX_CMD_MASK;
	int fshared = 0;

	if (!(op & FUTEX_PRIVATE_FLAG))
		fshared = 1;

	clockrt = op & FUTEX_CLOCK_REALTIME;
	if (clockrt && cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI)
		return -ENOSYS;

	switch (cmd) {
	case FUTEX_WAIT:
		val3 = FUTEX_BITSET_MATCH_ANY;
	case FUTEX_WAIT_BITSET:
		ret = futex_wait(uaddr, fshared, val, timeout, val3, clockrt);
		break;
        ...
	}
	return ret;
}

// kernel/futex.c
//
// With the glibc-2.17-324.el7_9.s390x on RHEL 7.9, I computed the arguments:
// (please double-check, I'm not sure about my computation)
//
// uaddr = ((unsigned int *)&sem->data + SEM_VALUE_OFFSET) from sem_timedwait() sem argument
// fshared = 1
// val = 0 = glibc futex_abstimed_wait() expected argument
// abs_time = sem_timedwait() abstime argument
// bitset = FUTEX_BITSET_MATCH_ANY (0xffffffff)
// clockrt = 1 (op syscall has FUTEX_CLOCK_REALTIME flag)
static int futex_wait(u32 __user *uaddr, int fshared,
		      u32 val, ktime_t *abs_time, u32 bitset, int clockrt)
{
	struct hrtimer_sleeper timeout, *to = NULL;
	struct restart_block *restart;
	struct futex_hash_bucket *hb;
	struct futex_q q;
	int ret;

	if (!bitset)
		return -EINVAL;

	q.pi_state = NULL;
	q.bitset = bitset;
	q.rt_waiter = NULL;
	q.requeue_pi_key = NULL;

	if (abs_time) {
		to = &timeout;

		hrtimer_init_on_stack(&to->timer, clockrt ? CLOCK_REALTIME :
				      CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
		hrtimer_init_sleeper(to, current);
		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
					     current->timer_slack_ns);
	}

retry:
	/*
	 * Prepare to wait on uaddr. On success, holds hb lock and increments
	 * q.key refs.
	 */
	ret = futex_wait_setup(uaddr, val, fshared, &q, &hb);
	if (ret)
		goto out;

	/* queue_me and wait for wakeup, timeout, or a signal. */
	futex_wait_queue_me(hb, &q, to);

	/* If we were woken (and unqueued), we succeeded, whatever. */
	ret = 0;
	/* unqueue_me() drops q.key ref */
	if (!unqueue_me(&q))
		goto out;
	ret = -ETIMEDOUT;
	if (to && !to->task)
		goto out;

	/*
	 * We expect signal_pending(current), but we might be the
	 * victim of a spurious wakeup as well.
	 */
	if (!signal_pending(current))
		goto retry;

	ret = -ERESTARTSYS;
	if (!abs_time)
		goto out;

	restart = &current_thread_info()->restart_block;
	restart->fn = futex_wait_restart;
	restart->futex.uaddr = (u32 *)uaddr;
	restart->futex.val = val;
	restart->futex.time = abs_time->tv64;
	restart->futex.bitset = bitset;
	restart->futex.flags = FLAGS_HAS_TIMEOUT;

	if (fshared)
		restart->futex.flags |= FLAGS_SHARED;
	if (clockrt)
		restart->futex.flags |= FLAGS_CLOCKRT;

	ret = -ERESTART_RESTARTBLOCK;

out:
	if (to) {
		hrtimer_cancel(&to->timer);
		destroy_hrtimer_on_stack(&to->timer);
	}
	return ret;
}


Note: I computed that futex_wait() gets fshared=1, even though in glibc sem->private=1. It seems like sem->private does not imply FUTEX_PRIVATE_FLAG: glibc __futex_abstimed_wait_common() only uses its private argument to compute the "op" argument of the syscall (0x109 = FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME).
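
For reference, the "op" value 0x109 seen in gdb decomposes into FUTEX_WAIT_BITSET (9) | FUTEX_CLOCK_REALTIME (256); adding FUTEX_PRIVATE_FLAG (128) would give 0x189. A minimal C sketch (not from the bug report) to check the flag arithmetic:

----
#include <linux/futex.h>
#include <stdio.h>

int main(void)
{
    /* FUTEX_WAIT_BITSET = 9, FUTEX_CLOCK_REALTIME = 256,
       FUTEX_PRIVATE_FLAG = 128 */
    printf("shared op:  0x%x\n",
           FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME);                      /* 0x109 */
    printf("private op: 0x%x\n",
           FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME | FUTEX_PRIVATE_FLAG); /* 0x189 */
    return 0;
}
----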

Comment 21 Victor Stinner 2021-06-29 17:41:59 UTC
For comparison, here is the futex() operation used by sem_timedwait() on the x86-64 architecture:

* RHEL6, RHEL7: (FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME)
* RHEL8, Fedora 34: (FUTEX_WAIT_BITSET_PRIVATE | FUTEX_CLOCK_REALTIME)

Tested versions:

* RHEL6: kernel 2.6.32-754.39.1.el6.x86_64, glibc-2.12-1.212.el6_10.3.x86_64
* RHEL7: kernel 3.10.0-1160.6.1.el7.x86_64, glibc-2.17-317.el7.x86_64
* RHEL8: kernel 4.18.0-315.el8.x86_64, glibc-2.28-160.el8.x86_64
* Fedora 34: kernel 5.12.8-300.fc34.x86_64, glibc-2.33-14.fc34.x86_64

Comment 22 Victor Stinner 2021-06-29 18:42:13 UTC
Oh, I didn't notice that sem_timedwait.py uses a different kind of lock on Python 2 and Python 3:

* Python 2 uses multiprocessing.synchronize.Lock(): sem_open(), sem_unlink(),
  sem_close().
* Python 3 uses threading.Lock: sem_init(), sem_destroy().

sem_timedwait() on RHEL7 (3.10.0-1062.51.1.el7.s390x, glibc-2.17-292.el7.s390x):

* Python 2.7.5: sem->private=128 and futex() is called with
  (FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME) operation.
* Python 3.6.8: sem->private=0 and futex() is called with
  (FUTEX_WAIT_BITSET_PRIVATE | FUTEX_CLOCK_REALTIME) operation.

But as written in the issue title, Python package builds hang on Brew for both Python 2 and Python 3. So it seems like both kinds of semaphores are affected by the issue.

In my previous comment, FUTEX_WAIT_BITSET vs FUTEX_WAIT_BITSET_PRIVATE is caused by the "python" command being Python 2 on RHEL6 and RHEL7, but Python 3 on RHEL8 and Fedora 34.

Comment 23 Victor Stinner 2021-06-30 13:03:47 UTC
I'm confident that Python is not responsible for this issue, so I am reassigning it to the glibc component on RHEL7. It's just that building a Python package on Brew triggers the issue, because Python has multiple tests relying on the sem_timedwait() behavior.

In strace, I saw a futex() syscall with a timeout which is not respected by the kernel (it does not fail with ETIMEDOUT after the deadline). It may be a RHEL6 kernel bug in futex() with the (FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME) and (FUTEX_WAIT_BITSET_PRIVATE | FUTEX_CLOCK_REALTIME) operations, whereas I didn't see any bug on RHEL6, where glibc 2.12 calls futex() with the FUTEX_WAIT operation.

Because of the complexity of the issue, and because I failed to get SSH access to a machine where the bug can be reproduced, I was not able to dig deeper, and I couldn't write a C program to fully exclude Python (a sketch of what such a program could look like follows).
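
A minimal sketch (my suggestion; never run on the affected machines): call futex() directly with both the RHEL6-style relative FUTEX_WAIT operation and the RHEL7-style absolute FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME operation on a futex word that is never woken, and check that both waits fail with ETIMEDOUT after about 0.5 seconds:

----
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <time.h>
#include <errno.h>
#include <stdio.h>

static unsigned int futex_word = 0;  /* never posted: every wait must time out */

int main(void)
{
    struct timespec rel = { 0, 500000000 };  /* relative timeout: 0.5 s */
    struct timespec deadline;
    long ret;

    /* RHEL6 glibc 2.12 style: FUTEX_WAIT with a relative timeout. */
    errno = 0;
    ret = syscall(SYS_futex, &futex_word, FUTEX_WAIT, 0, &rel, NULL, 0);
    printf("FUTEX_WAIT: ret=%ld errno=%d (%s)\n",
           ret, errno, errno == ETIMEDOUT ? "OK, timed out" : "BUG?");

    /* RHEL7 glibc 2.17 style: FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME with
       an absolute CLOCK_REALTIME deadline 0.5 seconds in the future. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_nsec += 500000000;
    if (deadline.tv_nsec >= 1000000000) {
        deadline.tv_nsec -= 1000000000;
        deadline.tv_sec += 1;
    }
    errno = 0;
    ret = syscall(SYS_futex, &futex_word,
                  FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME,
                  0, &deadline, NULL, FUTEX_BITSET_MATCH_ANY);
    printf("FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME: ret=%ld errno=%d (%s)\n",
           ret, errno, errno == ETIMEDOUT ? "OK, timed out" : "BUG?");
    return 0;
}
----

On a healthy machine both calls should report ETIMEDOUT after ~0.5 seconds; on an affected machine, the second call would presumably hang, matching the strace logs above.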

Comment 24 Victor Stinner 2021-06-30 13:09:23 UTC
RHEL 7.9 is affected by this issue.

Example with a Python 3.6.8 package build on RHEL 7.9:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=37718960

Multiple Python tests failed on the s390x build:

* "test_concurrent_futures failed": test_timeout() and test_map_timeout() tests failed
* "test_faulthandler failed": see my explanation of this bug in the first message of this issue
* "test_asyncio crashed (Exit code -9)": killed killed because of the timeout
* the whole build hangs for 24 hours and is then killed by Brew

Comment 25 Victor Stinner 2021-06-30 13:44:17 UTC
Summary:

* The bug was only seen on 3 Brew machines. The bug could not be reproduced after 2 of the Brew machines were rebooted. Yesterday, the bug was only reproducible on a single Brew machine with an uptime of 22 days.

* The bug only occurs in a RHEL7 mock container running on RHEL6: RHEL7 glibc on RHEL6 kernel.

* The bug is in the glibc sem_timedwait() function, which calls the kernel futex() syscall with an absolute timeout (CLOCK_REALTIME).

* It seems like the futex() syscall doesn't respect the timeout for the (FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME) and (FUTEX_WAIT_BITSET_PRIVATE | FUTEX_CLOCK_REALTIME) operations: the syscall doesn't fail with ETIMEDOUT as expected. By contrast, the RHEL6 glibc 2.12 calls the futex() syscall with the FUTEX_WAIT operation and a relative timeout, and that works as expected.

Comment 26 Victor Stinner 2021-06-30 14:07:47 UTC
I modified the reproducer to list the kernel timers of the Python thread which waited in cancel_event.acquire(timeout=5.0). Tomas Kopecek ran it on Brew where the bug can be reproduced:
---
$ python sem_timedwait3.py 
1625060410.413096905 (0.000 sec): dump_traceback_later: pid 21642
1625060410.413256884 (0.001 sec): cancel_event.acquire(timeout=<5.0 sec>)
 -- list timers --
grep /21643 -A1 /proc/timer_list
 #6: <0000000000000000>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, python/21643
 # expires at 1625060415413278000-1625060415413328000 nsecs [in 1622993232563147117 to 1622993232563197117 nsecs]
-- list timers --
grep /21643 -A1 /proc/timer_list
 #6: <0000000000000000>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, python/21643
 # expires at 1625060415413278000-1625060415413328000 nsecs [in 1622993230891965230 to 1622993230892015230 nsecs]
-- list timers --
grep /21643 -A1 /proc/timer_list
 #6: <0000000000000000>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, python/21643
 # expires at 1625060415413278000-1625060415413328000 nsecs [in 1622993229220423830 to 1622993229220473830 nsecs]
-- list timers --
grep /21643 -A1 /proc/timer_list
 #6: <0000000000000000>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, python/21643
 # expires at 1625060415413278000-1625060415413328000 nsecs [in 1622993227549032115 to 1622993227549082115 nsecs]
-- list timers --
grep /21643 -A1 /proc/timer_list
 #6: <0000000000000000>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, python/21643
 # expires at 1625060415413278000-1625060415413328000 nsecs [in 1622993225877668850 to 1622993225877718850 nsecs]
-- list timers --
grep /21643 -A1 /proc/timer_list
 #6: <0000000000000000>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, python/21643
 # expires at 1625060415413278000-1625060415413328000 nsecs [in 1622993224206839526 to 1622993224206889526 nsecs]
1625060420.440985918 (10.028 sec): cancel_dump_traceback_later
1625060420.441124916 (10.028 sec): cancel_event.acquire(timeout=<5.0 sec>) -> True
1625060420.441145897 (10.028 sec): stop thread
1625060420.441330910 (10.029 sec): exit
---

At 1625060410.413256884, cancel_event.acquire(timeout=5.0) is called: the deadline should be:

>>> 1625060410413256884 + 5*1000*1000*1000
1625060415413256884

The kernel computed "expires at 1625060415413278000-1625060415413328000 nsecs". Using 1625060410.413256884 as the start time, I again find a timeout of about 5 seconds:

>>> t0=1625060410413256884
>>> (1625060415413278000-t0)/1e9
5.000021116
>>> (1625060415413328000-t0)/1e9
5.000071116

The last part looks wrong: "expires (...) [in 1622993229220423830 to 1622993229220473830 nsecs]". Interpreted as a duration, 1622993229220423830 nanoseconds is about 1622993229 seconds, i.e. roughly 51.4 years (using 31,557,600 seconds per 365.25-day year): the timer claims that it will expire more than 51 years in the future.


Output on an s390x machine where the bug doesn't occur:
---
<mock-chroot> sh-4.2# python3 sem_timedwait.py 
1625061502.821355581 (0.000 sec): dump_traceback_later: pid 51274
1625061502.821593523 (0.000 sec): cancel_event.acquire(timeout=<5.0 sec>)
-- list timers --
grep /51275 -A1 /proc/timer_list
 #0: <000000007c443d08>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, python3/51275
 # expires at 1625061507821674000-1625061507821724000 nsecs [in 1624895456218602583 to 1624895456218652583 nsecs]
-- list timers --
grep /51275 -A1 /proc/timer_list
 #0: <000000007c443d08>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, python3/51275
 # expires at 1625061507821674000-1625061507821724000 nsecs [in 1624895454547324897 to 1624895454547374897 nsecs]
-- list timers --
grep /51275 -A1 /proc/timer_list
 #0: <000000007c443d08>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, python3/51275
 # expires at 1625061507821674000-1625061507821724000 nsecs [in 1624895452875767453 to 1624895452875817453 nsecs]
1625061507.821980476 (5.001 sec): cancel_event.acquire(timeout=<5.0 sec>) -> False
1625061507.822108507 (5.001 sec): *** DUMP TRACEBACK ***
-- list timers --
grep /51275 -A1 /proc/timer_list
-- list timers --
grep /51275 -A1 /proc/timer_list
-- list timers --
grep /51275 -A1 /proc/timer_list
1625061512.850618124 (10.029 sec): cancel_dump_traceback_later
1625061512.850964069 (10.030 sec): exit
---

I see again "expires (...) [in 1624895456218602583 to 1624895456218652583 nsecs]", which again gives an expiry more than 51 years in the future, except that in this case, the wait times out and the futex() syscall fails with ETIMEDOUT as expected.

Maybe it's just that /proc/timer_list miscomputes the "[in ... to ... nsecs]" part in kernel 2.6.32-754.35.1.el6.s390x. Note that "expires at" minus the "[in ...]" value is about 2,067,183 seconds (roughly 23.9 days) in the buggy run, close to the affected machine's uptime, so the "[in ...]" column may simply be measured against a clock that started at boot.

Maybe on the machine where the bug occurs, the REALTIME clock in userland is different from the kernel REALTIME clock. Or maybe there is a confusion between glibc and the kernel about which clock should be used. The glibc passes the FUTEX_CLOCK_REALTIME flag in the futex() operation.

Comment 27 Florian Weimer 2022-05-13 16:14:45 UTC
Victor, I'm closing this bug as CLOSED/WONTFIX because we have no other reports of this issue, and because of the product life-cycle phase that Red Hat Enterprise Linux 7 is in. I'm so sorry we didn't make good use of the detailed information provided.

Comment 28 Victor Stinner 2022-05-13 16:27:31 UTC
> Victor, I'm closing this bug as CLOSED/WONTFIX because we have no other reports of this issue, and because of the product life-cycle phase that Red Hat Enterprise Linux 7 is in. I'm so sorry we didn't make good use of the detailed information provided.

Don't worry, it's good that the bug didn't come back.

I spent significant time digging into the bug because it prevented my team from building any Python package for RHEL7, and so we were no longer able to ship bug fixes or security fixes. But the only affected builder was rebooted, which worked around the issue. Well, it seems like the bug never came back.

