RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1623070 - (s390x only) sos skips collecting random data due to pipe returncode None
Summary: (s390x only) sos skips collecting random data due to pipe returncode None
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sos
Version: 7.6
Hardware: s390x
OS: Linux
urgent
high
Target Milestone: rc
: ---
Assignee: Pavel Moravec
QA Contact: Miroslav Hradílek
URL:
Whiteboard:
Depends On:
Blocks: 1627544
TreeView+ depends on / blocked
 
Reported: 2018-08-28 12:39 UTC by Pavel Moravec
Modified: 2018-10-30 10:34 UTC (History)
14 users (show)

Fixed In Version: sos-3.6-6.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1627544 (view as bug list)
Environment:
Last Closed: 2018-10-30 10:33:42 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
reproducer script (1.22 KB, text/plain)
2018-08-28 12:40 UTC, Pavel Moravec
no flags Details
strace from the buggy case (281.34 KB, text/plain)
2018-08-28 12:42 UTC, Pavel Moravec
no flags Details
strace from an OK case (284.02 KB, text/plain)
2018-08-28 12:43 UTC, Pavel Moravec
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github sosreport sos issues 1417 0 None closed [utilities] command returncode can be None in some cases 2020-07-14 08:51:13 UTC
Python 34566 0 None None None 2018-09-03 12:10:15 UTC
Red Hat Product Errata RHEA-2018:3144 0 None None None 2018-10-30 10:34:57 UTC

Description Pavel Moravec 2018-08-28 12:39:17 UTC
Description of problem:
We have a reproducer (bit artificial but narrowed down from a real use case / sosreport tool) where sometimes:

- a spawned thread reads from opened pipe
- pipe read returns EOF
- BUT pipe returncode is still None
  - also pipe poll returns None, so the pipe isnt closed yet

The returncode == None is crucial problem for us, as we rely on the return code from the command executed.

The above holds for a nontrivial amount of time (while we "process" the output somehow, see get_contents method).

The problem (sometimes) happens for empty output from the pipe, BUT also (sometimes) for a nonempty one - see "variant for nonempty stdout:".


Consequence of the bug: sosreport fails to collect absolutely random commands / files, what makes it totally unusable on this architecture.


Version-Release number of selected component (if applicable):
python-2.7.5-69.el7_5.s390x


How reproducible:
approx 16%


Steps to Reproduce:
1. use attached script
2. run it 1000times, count # of returncode values:

for i in $(seq 1 1000); do python repro_pipe_unfinished.py | grep poll; done | sort | uniq -c


Actual results:
    839 output: '', return code: 0, pollstatus=0
    161 output: '', return code: None, pollstatus=None


Expected results:
Just "return code: 0, pollstatus=0" line.


Additional info:
- seems really s390x specific
- very probabilistic
- probability does not seem to be affected by stdout ((non)zero) length

Comment 1 Pavel Moravec 2018-08-28 12:40:05 UTC
Created attachment 1479243 [details]
reproducer script

Comment 2 Pavel Moravec 2018-08-28 12:42:34 UTC
Created attachment 1479244 [details]
strace from the buggy case

strace -Tttfxvo strace.txt-None python repro_pipe_unfinished.py

from a case when returnstatus=None

Comment 3 Pavel Moravec 2018-08-28 12:43:11 UTC
Created attachment 1479245 [details]
strace from an OK case

strace -Tttfxvo strace.txt-OK python repro_pipe_unfinished.py

from a case when returnstatus=0

Comment 4 Pavel Moravec 2018-08-28 12:58:30 UTC
Comparing straces, I *think* this matters (but take the analysis as from a strace noob):

OK: execve was here completed:

..
61682 08:19:05.652309 close(1023)       = -1 EBADF (Bad file descriptor) <0.000005>
61682 08:19:05.652467 execve("/usr/bin/grep", ["/usr/bin/grep", "-q", "localhost", "/etc/hosts"], ["LAB_CONTROLLER=lab-02.rhts.eng.b"..., "XDG_SESSION_ID=35", "HOSTNAME=ibm-z-33.rhts.eng.bos.r"..., "SELINUX_ROLE_REQUESTED=", "TERM=xterm-256color", "SHELL=/bin/bash", "HISTSIZE=1000", "SSH_CLIENT=10.40.205.102 35474 2"..., "SELINUX_USE_CURRENT_RANGE=", "BUILDURL=http://download.eng.bos"..., "DUMPSERVER=netdump-01.eng.bos.re"..., "QTDIR=/usr/lib64/qt-3.3", "QTINC=/usr/lib64/qt-3.3/include", "SSH_TTY=/dev/pts/1", "RHTS_OPTION_COMPATIBLE=", "QT_GRAPHICSSYSTEM_CHECKED=1", "USER=root", "LS_COLORS=rs=0:di=38;5;27:ln=38;"..., "BEAKER_RECIPE_WHITEBOARD=Server "..., "BEAKER_JOB_WHITEBOARD=[TR#340267"..., "MAIL=/var/spool/mail/root", "PATH=/usr/lib64/qt-3.3/bin:/usr/"..., "PWD=/root", "XMODIFIERS=@im=none", "LANG=en_US.UTF-8", "BEAKER=https://beaker.engineerin"..., "LOOKASIDE=http://download.eng.bo"..., "SELINUX_LEVEL_REQUESTED=", "HISTCONTROL=ignoredups", "BEAKER_HUB_URL=https://beaker.en"..., "BEAKER_RECIPE_ID=5584053", "BEAKER_LAB_CONTROLLER=lab-02.rht"..., "SHLVL=1", "HOME=/root", "NFSSERVERS=rhel4-nfs.rhts.eng.bo"..., "BEAKER_LAB_CONTROLLER_URL=http:/"..., "LOGNAME=root", "RHTS_OPTION_COMPAT_SERVICE=", "QTLIB=/usr/lib64/qt-3.3/lib", "SSH_CONNECTION=10.40.205.102 354"..., "LESSOPEN=||/usr/bin/lesspipe.sh "..., "XDG_RUNTIME_DIR=/run/user/0", "_=/usr/bin/strace"]) = 0 <0.000414>


None: execve was interrupted by something else and resumed later on:

61658 08:18:54.097997 close(1023)       = -1 EBADF (Bad file descriptor) <0.000006>
61658 08:18:54.098170 execve("/usr/bin/grep", ["/usr/bin/grep", "-q", "localhost", "/etc/hosts"], ["LAB_CONTROLLER=lab-02.rhts.eng.b"..., "XDG_SESSION_ID=35", "HOSTNAME=ibm-z-33.rhts.eng.bos.r"..., "SELINUX_ROLE_REQUESTED=", "TERM=xterm-256color", "SHELL=/bin/bash", "HISTSIZE=1000", "SSH_CLIENT=10.40.205.102 35474 2"..., "SELINUX_USE_CURRENT_RANGE=", "BUILDURL=http://download.eng.bos"..., "DUMPSERVER=netdump-01.eng.bos.re"..., "QTDIR=/usr/lib64/qt-3.3", "QTINC=/usr/lib64/qt-3.3/include", "SSH_TTY=/dev/pts/1", "RHTS_OPTION_COMPATIBLE=", "QT_GRAPHICSSYSTEM_CHECKED=1", "USER=root", "LS_COLORS=rs=0:di=38;5;27:ln=38;"..., "BEAKER_RECIPE_WHITEBOARD=Server "..., "BEAKER_JOB_WHITEBOARD=[TR#340267"..., "MAIL=/var/spool/mail/root", "PATH=/usr/lib64/qt-3.3/bin:/usr/"..., "PWD=/root", "XMODIFIERS=@im=none", "LANG=en_US.UTF-8", "BEAKER=https://beaker.engineerin"..., "LOOKASIDE=http://download.eng.bo"..., "SELINUX_LEVEL_REQUESTED=", "HISTCONTROL=ignoredups", "BEAKER_HUB_URL=https://beaker.en"..., "BEAKER_RECIPE_ID=5584053", "BEAKER_LAB_CONTROLLER=lab-02.rht"..., "SHLVL=1", "HOME=/root", "NFSSERVERS=rhel4-nfs.rhts.eng.bo"..., "BEAKER_LAB_CONTROLLER_URL=http:/"..., "LOGNAME=root", "RHTS_OPTION_COMPAT_SERVICE=", "QTLIB=/usr/lib64/qt-3.3/lib", "SSH_CONNECTION=10.40.205.102 354"..., "LESSOPEN=||/usr/bin/lesspipe.sh "..., "XDG_RUNTIME_DIR=/run/user/0", "_=/usr/bin/strace"] <unfinished ...>
61657 08:18:54.098693 <... read resumed> "", 1048576) = 0 <0.023026>
61657 08:18:54.098715 mremap(0x3fff6884000, 1052672, 4096, MREMAP_MAYMOVE) = 0x3fff6884000 <0.000011>
61657 08:18:54.098756 close(4)          = 0 <0.000006>
61657 08:18:54.098788 close(6)          = 0 <0.000007>
61657 08:18:54.098812 close(7)          = 0 <0.000011>
61657 08:18:54.098849 munmap(0x3fff6884000, 4096) = 0 <0.000010>
61657 08:18:54.098893 fcntl(3, F_GETFL) = 0 (flags O_RDONLY) <0.000007>
61657 08:18:54.098924 fstat(3, {st_dev=makedev(0, 9), st_ino=1639909, st_mode=S_IFIFO|0600, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2018/08/28-08:18:54.064743374, st_mtime=2018/08/28-08:18:54.064743374, st_ctime=2018/08/28-08:18:54.064743374}) = 0 <0.000007>
61657 08:18:54.098974 fcntl(5, F_GETFL) = 0 (flags O_RDONLY) <0.000006>
61657 08:18:54.099000 fstat(5, {st_dev=makedev(0, 9), st_ino=1639910, st_mode=S_IFIFO|0600, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2018/08/28-08:18:54.064743374, st_mtime=2018/08/28-08:18:54.064743374, st_ctime=2018/08/28-08:18:54.064743374}) = 0 <0.000026>
61657 08:18:54.099226 mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x3fff6184000 <0.000009>
61657 08:18:54.099267 mprotect(0x3fff6184000, 4096, PROT_NONE) = 0 <0.000008>
61657 08:18:54.099294 clone(child_stack=0x3fff6984210, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x3fff69849e0, tls=0x3fff6984910, child_tidptr=0x3fff69849e0) = 61659 <0.000039>
61658 08:18:54.099353 <... execve resumed> ) = 0 <0.001072>


Cant this bad interleaving in "None" case cause the bug?

Comment 5 Pavel Moravec 2018-08-30 13:36:38 UTC
FYI on a different s390x machine (gss5.s390.bos.redhat.com, default beaker password), I can reproduce it much much _rarely_:

# for i in $(seq 1 1000); do python ~/repro_pipe_unfinished.py; done | grep poll | sort | uniq -c
    995 output: '', return code: 0, pollstatus=0
      5 output: '', return code: None, pollstatus=None
#


(for python maintainers: feel free to use that machine to anything - I might test potential workaround in sosreport there but nothing else)

Comment 8 Marcel Plch 2018-09-03 10:59:34 UTC
This bug does not occur when compiled from source using --with-pydebug. (maybe the printing of "[# refs]" is slowing the process down and preventing this?)
There is also quite a difference in likelihood of this happening between versions 2.7.5 and 2.7.15. (Respectively 30-100 in 1000 and 4-22 in 1000.)

Comment 9 Marcel Plch 2018-09-03 11:32:48 UTC
I have reproduced this on actual master (Python 3.8), x86_64.
I'm going to report this upstream and will backport the patch once it will have become available.
----------------------
$ git clone https://github.com/python/cpython.git
$ cd cpython
$ ./configure --enable-shared CFLAGS="-O0" LDFLAGS="-Wl,--rpath=$(pwd)"
$ make -j
$ for i in $(seq 1 1000); do ./python script.py; done | grep poll | sort | uniq -c
    630 output: '', return code: 0, pollstatus=0
    370 output: '', return code: None, pollstatus=None

Comment 18 Pavel Moravec 2018-09-06 07:52:53 UTC
Changing component to sos, as sosreport wrongly calls python's API for pipes.

Comment 22 errata-xmlrpc 2018-10-30 10:33:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:3144


Note You need to log in before you can comment on or make changes to this bug.