Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1398342

Summary:

sem_post() and sem_timed_wait() block forever

Product:

Red Hat Enterprise Linux 7

Reporter:

Paulo Andrade <pandrade>

Component:

glibc

Assignee:

Carlos O'Donell <codonell>

Status:

CLOSED NOTABUG

QA Contact:

qe-baseos-tools-bugs

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

7.3

CC:

ashankar, fweimer, mnewsome, pandrade, pfrankli, triegel

Target Milestone:

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2016-11-30 18:10:32 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
test.c	none

Description Paulo Andrade 2016-11-24 13:46:22 UTC

Created attachment 1223865 [details]
test.c

This is a regression after correction to rhbz#1027348

  The issue happens on a 32 bit application linked to some
significantly old 32 bit libraries.

  The attached test.c mimics the behaviour of the user
environment where the problem happens.

  Basically, it calls sem_timedwait on an "timedout" timer
on a loop, then, attempts to break the loop by doing a
sem_post, but it never finishes.

  The issue happens only if forcing it to use __old_sem_post.

  The user cannot at the moment rebuild the legacy binaries,
nor rebuild as 64 bit binaries.

  The test case should print:
$ gcc -O0 -g3 -m32 test.c -o test -Wl,--wrap=sem_post -pthread
-bash-4.2$ ./test
.
No error

but instead, in rhel 7.3 (or fedora 24) it just prints nostop
a '.'.

Comment 1 Florian Weimer 2016-11-24 14:22:06 UTC

I looked at why the old compat symbol was needed.  The original sem_t for LinuxThreads looks like this:

typedef struct {
    long int sem_status;
    int sem_spinlock;
} old_sem_t;

So it is only 8 or 16 bytes large (for 32 and 64 bit architectures, respectively).  Our current sem_t is 16 or 32 bytes large.

The current implementation needs 16 bytes on both kinds of architectures, so there is no easy way to fix this on 32-bit architectures.

Comment 3 Carlos O'Donell 2016-11-26 15:56:35 UTC

This is either a bug in the old sem_post, which is entirely plausible, or a test case / user application issue, so we need to triage this to the point where we understand what's the problem with the use case. I expect this is a compat symbol implementation issue, we really need more test cases that exercise the compat symbols. I like the use of the asm to for the symver in the example, that's clever, and I know Florian has been pushing similar things upstream e.g. symbol_version_reference. We need a test just like this one upstream.

Comment 4 Florian Weimer 2016-11-26 17:31:44 UTC

(In reply to Carlos O'Donell from comment #3)
> This is either a bug in the old sem_post, which is entirely plausible, or a
> test case / user application issue, so we need to triage this to the point
> where we understand what's the problem with the use case. I expect this is a
> compat symbol implementation issue, we really need more test cases that
> exercise the compat symbols.

The test case attached to the bug suggests this is an incorrect assumption about compatibility of old and new semaphores.  Apparently, the semaphore object is passed between parts of the system using different library versions.

But the test case may not be representative of what actually goes on in the application (see my next comment).

I think we can improve the situation at reasonable cost.  Here are the current struct definitions with 64-bit atomics available:

struct new_sem
{
  uint64_t data;
  int private;
  int pad;
};

Recall that on 64-bit architectures (with 8-byte alignment), the original definition already needed 16 bytes, so we do not need a different implementation for the compat symbol at all.

For architectures without 64-bit atomics, we have:

struct new_sem
{
  unsigned int value;
  int private;
  int pad;
  unsigned int nwaiters;
};

struct old_sem
{
  unsigned int value;
};

The original semaphore may only have been 8 bytes large (see the comment 1).  So we can introduce a shared flag field only by reordering fields, like this:

struct new_sem
{
  unsigned int value;
  int flags;
  int private;
  unsigned int nwaiters;
};

struct old_sem
{
  unsigned int value;
  int flags;
};

New-style sem_init sets flags to 0 and sem_open would set flags 0, the compat implementation of sem_init sets flags to 1.  The compat implementation of sem_wait and sem_post call the non-compat implementation if flags is 0.

This means that the compat implementations would be able to use semaphore objects created by the current implementation, but not vice versa.  This could be sufficient to address the application issue.

Comment 5 Florian Weimer 2016-11-26 17:40:14 UTC

(In reply to Paulo Andrade from comment #0)
>   The attached test.c mimics the behaviour of the user
> environment where the problem happens.
> 
>   Basically, it calls sem_timedwait on an "timedout" timer
> on a loop, then, attempts to break the loop by doing a
> sem_post, but it never finishes.
> 
>   The issue happens only if forcing it to use __old_sem_post.

How accurate is the test case?

What is puzzling about it is that sem_open became available only *after* we increased the size of struct sem_t and created the compat version for sem_post.  This means that it is bizarre that the application does not end up calling the new version of sem_post.

Does the real application call sem_init instead?

Can you double-check that the shared object which calls sem_post is linked against libpthread?  Another possible explanation is that it was not linked correctly, did not receive symbol version, and ld.so binds all its symbols against compat symbols as a result.  In this case, the issue would be similar to what we experienced in this upstream bug:

  https://sourceware.org/bugzilla/show_bug.cgi?id=20489

The shared object is invalid in such a case and impossible to support.  But we currently lack diagnostics for that, so it is rather difficult to enforce that.

One possible way to check this is to examine the old binaries involved with “eu-readelf -s” and check if they reference any GLIBC_ versions newer than GLIBC_2.0.

Comment 6 Florian Weimer 2016-11-26 18:09:03 UTC

(In reply to Florian Weimer from comment #4)
> The original semaphore may only have been 8 bytes large (see the comment 1).
> So we can introduce a shared flag field only by reordering fields, like this:

This does not work everywhere because the pad field is already used as a mutex on pre-v9 SPARC, but the struct definition was updated to reflect that.

But the question remains: Can we squeeze a flag bit from somewhere?

If there are any architectures with 64-bit atomics and only 4-byte alignment for longs, we have another problem (alignment of sem_t is insufficient for the current non-compat implementation).

Comment 7 Paulo Andrade 2016-11-28 11:29:21 UTC

  The libraries are very old and if I understood correctly, are dlopen'ed;
they are part of mysql compiled stored procedures.

  Stepping with gdb and putting watchpoints showed the same behaviour
in the test case. It calls mostly new_* functions, and the call to
old_sem_post causes the infinite loop.

  The fact that binaries and libraries cannot be rebuilt (it might be
mostly due to bureaucracy) is a possible indication of invalid/unsupported
build.

  I will ask the user to check eu-readelf -s and libpthread linking.

Comment 8 Paulo Andrade 2016-11-28 16:44:16 UTC

  eu-readelf -s have shown newer than GLIBC_2.0 symbols.

  The user code did not call sem_init; user was unsure if mariadb somewhere
could be calling it.

  The library was not linked to libpthread. User was instructed to relink
the library with -lpthread, and now we are waiting user feedback.