Bug 1623070

Summary: (s390x only) sos skips collecting random data due to pipe returncode None
Product: Red Hat Enterprise Linux 7 Reporter: Pavel Moravec <pmoravec>
Component: sosAssignee: Pavel Moravec <pmoravec>
Status: CLOSED ERRATA QA Contact: Miroslav HradĂ­lek <mhradile>
Severity: high Docs Contact:
Priority: urgent    
Version: 7.6CC: agk, bmr, cstratak, dhorak, gavin, jkejda, lmiksik, mhradile, mplch, ovasik, plambri, pmoravec, pviktori, sbradley
Target Milestone: rc   
Target Release: ---   
Hardware: s390x   
OS: Linux   
Whiteboard:
Fixed In Version: sos-3.6-6.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1627544 (view as bug list) Environment:
Last Closed: 2018-10-30 10:33:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1627544    
Attachments:
Description Flags
reproducer script
none
strace from the buggy case
none
strace from an OK case none

Description Pavel Moravec 2018-08-28 12:39:17 UTC
Description of problem:
We have a reproducer (bit artificial but narrowed down from a real use case / sosreport tool) where sometimes:

- a spawned thread reads from opened pipe
- pipe read returns EOF
- BUT pipe returncode is still None
  - also pipe poll returns None, so the pipe isnt closed yet

The returncode == None is crucial problem for us, as we rely on the return code from the command executed.

The above holds for a nontrivial amount of time (while we "process" the output somehow, see get_contents method).

The problem (sometimes) happens for empty output from the pipe, BUT also (sometimes) for a nonempty one - see "variant for nonempty stdout:".


Consequence of the bug: sosreport fails to collect absolutely random commands / files, what makes it totally unusable on this architecture.


Version-Release number of selected component (if applicable):
python-2.7.5-69.el7_5.s390x


How reproducible:
approx 16%


Steps to Reproduce:
1. use attached script
2. run it 1000times, count # of returncode values:

for i in $(seq 1 1000); do python repro_pipe_unfinished.py | grep poll; done | sort | uniq -c


Actual results:
    839 output: '', return code: 0, pollstatus=0
    161 output: '', return code: None, pollstatus=None


Expected results:
Just "return code: 0, pollstatus=0" line.


Additional info:
- seems really s390x specific
- very probabilistic
- probability does not seem to be affected by stdout ((non)zero) length

Comment 1 Pavel Moravec 2018-08-28 12:40:05 UTC
Created attachment 1479243 [details]
reproducer script

Comment 2 Pavel Moravec 2018-08-28 12:42:34 UTC
Created attachment 1479244 [details]
strace from the buggy case

strace -Tttfxvo strace.txt-None python repro_pipe_unfinished.py

from a case when returnstatus=None

Comment 3 Pavel Moravec 2018-08-28 12:43:11 UTC
Created attachment 1479245 [details]
strace from an OK case

strace -Tttfxvo strace.txt-OK python repro_pipe_unfinished.py

from a case when returnstatus=0

Comment 4 Pavel Moravec 2018-08-28 12:58:30 UTC
Comparing straces, I *think* this matters (but take the analysis as from a strace noob):

OK: execve was here completed:

..
61682 08:19:05.652309 close(1023)       = -1 EBADF (Bad file descriptor) <0.000005>
61682 08:19:05.652467 execve("/usr/bin/grep", ["/usr/bin/grep", "-q", "localhost", "/etc/hosts"], ["LAB_CONTROLLER=lab-02.rhts.eng.b"..., "XDG_SESSION_ID=35", "HOSTNAME=ibm-z-33.rhts.eng.bos.r"..., "SELINUX_ROLE_REQUESTED=", "TERM=xterm-256color", "SHELL=/bin/bash", "HISTSIZE=1000", "SSH_CLIENT=10.40.205.102 35474 2"..., "SELINUX_USE_CURRENT_RANGE=", "BUILDURL=http://download.eng.bos"..., "DUMPSERVER=netdump-01.eng.bos.re"..., "QTDIR=/usr/lib64/qt-3.3", "QTINC=/usr/lib64/qt-3.3/include", "SSH_TTY=/dev/pts/1", "RHTS_OPTION_COMPATIBLE=", "QT_GRAPHICSSYSTEM_CHECKED=1", "USER=root", "LS_COLORS=rs=0:di=38;5;27:ln=38;"..., "BEAKER_RECIPE_WHITEBOARD=Server "..., "BEAKER_JOB_WHITEBOARD=[TR#340267"..., "MAIL=/var/spool/mail/root", "PATH=/usr/lib64/qt-3.3/bin:/usr/"..., "PWD=/root", "XMODIFIERS=@im=none", "LANG=en_US.UTF-8", "BEAKER=https://beaker.engineerin"..., "LOOKASIDE=http://download.eng.bo"..., "SELINUX_LEVEL_REQUESTED=", "HISTCONTROL=ignoredups", "BEAKER_HUB_URL=https://beaker.en"..., "BEAKER_RECIPE_ID=5584053", "BEAKER_LAB_CONTROLLER=lab-02.rht"..., "SHLVL=1", "HOME=/root", "NFSSERVERS=rhel4-nfs.rhts.eng.bo"..., "BEAKER_LAB_CONTROLLER_URL=http:/"..., "LOGNAME=root", "RHTS_OPTION_COMPAT_SERVICE=", "QTLIB=/usr/lib64/qt-3.3/lib", "SSH_CONNECTION=10.40.205.102 354"..., "LESSOPEN=||/usr/bin/lesspipe.sh "..., "XDG_RUNTIME_DIR=/run/user/0", "_=/usr/bin/strace"]) = 0 <0.000414>


None: execve was interrupted by something else and resumed later on:

61658 08:18:54.097997 close(1023)       = -1 EBADF (Bad file descriptor) <0.000006>
61658 08:18:54.098170 execve("/usr/bin/grep", ["/usr/bin/grep", "-q", "localhost", "/etc/hosts"], ["LAB_CONTROLLER=lab-02.rhts.eng.b"..., "XDG_SESSION_ID=35", "HOSTNAME=ibm-z-33.rhts.eng.bos.r"..., "SELINUX_ROLE_REQUESTED=", "TERM=xterm-256color", "SHELL=/bin/bash", "HISTSIZE=1000", "SSH_CLIENT=10.40.205.102 35474 2"..., "SELINUX_USE_CURRENT_RANGE=", "BUILDURL=http://download.eng.bos"..., "DUMPSERVER=netdump-01.eng.bos.re"..., "QTDIR=/usr/lib64/qt-3.3", "QTINC=/usr/lib64/qt-3.3/include", "SSH_TTY=/dev/pts/1", "RHTS_OPTION_COMPATIBLE=", "QT_GRAPHICSSYSTEM_CHECKED=1", "USER=root", "LS_COLORS=rs=0:di=38;5;27:ln=38;"..., "BEAKER_RECIPE_WHITEBOARD=Server "..., "BEAKER_JOB_WHITEBOARD=[TR#340267"..., "MAIL=/var/spool/mail/root", "PATH=/usr/lib64/qt-3.3/bin:/usr/"..., "PWD=/root", "XMODIFIERS=@im=none", "LANG=en_US.UTF-8", "BEAKER=https://beaker.engineerin"..., "LOOKASIDE=http://download.eng.bo"..., "SELINUX_LEVEL_REQUESTED=", "HISTCONTROL=ignoredups", "BEAKER_HUB_URL=https://beaker.en"..., "BEAKER_RECIPE_ID=5584053", "BEAKER_LAB_CONTROLLER=lab-02.rht"..., "SHLVL=1", "HOME=/root", "NFSSERVERS=rhel4-nfs.rhts.eng.bo"..., "BEAKER_LAB_CONTROLLER_URL=http:/"..., "LOGNAME=root", "RHTS_OPTION_COMPAT_SERVICE=", "QTLIB=/usr/lib64/qt-3.3/lib", "SSH_CONNECTION=10.40.205.102 354"..., "LESSOPEN=||/usr/bin/lesspipe.sh "..., "XDG_RUNTIME_DIR=/run/user/0", "_=/usr/bin/strace"] <unfinished ...>
61657 08:18:54.098693 <... read resumed> "", 1048576) = 0 <0.023026>
61657 08:18:54.098715 mremap(0x3fff6884000, 1052672, 4096, MREMAP_MAYMOVE) = 0x3fff6884000 <0.000011>
61657 08:18:54.098756 close(4)          = 0 <0.000006>
61657 08:18:54.098788 close(6)          = 0 <0.000007>
61657 08:18:54.098812 close(7)          = 0 <0.000011>
61657 08:18:54.098849 munmap(0x3fff6884000, 4096) = 0 <0.000010>
61657 08:18:54.098893 fcntl(3, F_GETFL) = 0 (flags O_RDONLY) <0.000007>
61657 08:18:54.098924 fstat(3, {st_dev=makedev(0, 9), st_ino=1639909, st_mode=S_IFIFO|0600, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2018/08/28-08:18:54.064743374, st_mtime=2018/08/28-08:18:54.064743374, st_ctime=2018/08/28-08:18:54.064743374}) = 0 <0.000007>
61657 08:18:54.098974 fcntl(5, F_GETFL) = 0 (flags O_RDONLY) <0.000006>
61657 08:18:54.099000 fstat(5, {st_dev=makedev(0, 9), st_ino=1639910, st_mode=S_IFIFO|0600, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2018/08/28-08:18:54.064743374, st_mtime=2018/08/28-08:18:54.064743374, st_ctime=2018/08/28-08:18:54.064743374}) = 0 <0.000026>
61657 08:18:54.099226 mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x3fff6184000 <0.000009>
61657 08:18:54.099267 mprotect(0x3fff6184000, 4096, PROT_NONE) = 0 <0.000008>
61657 08:18:54.099294 clone(child_stack=0x3fff6984210, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x3fff69849e0, tls=0x3fff6984910, child_tidptr=0x3fff69849e0) = 61659 <0.000039>
61658 08:18:54.099353 <... execve resumed> ) = 0 <0.001072>


Cant this bad interleaving in "None" case cause the bug?

Comment 5 Pavel Moravec 2018-08-30 13:36:38 UTC
FYI on a different s390x machine (gss5.s390.bos.redhat.com, default beaker password), I can reproduce it much much _rarely_:

# for i in $(seq 1 1000); do python ~/repro_pipe_unfinished.py; done | grep poll | sort | uniq -c
    995 output: '', return code: 0, pollstatus=0
      5 output: '', return code: None, pollstatus=None
#


(for python maintainers: feel free to use that machine to anything - I might test potential workaround in sosreport there but nothing else)

Comment 8 Marcel Plch 2018-09-03 10:59:34 UTC
This bug does not occur when compiled from source using --with-pydebug. (maybe the printing of "[# refs]" is slowing the process down and preventing this?)
There is also quite a difference in likelihood of this happening between versions 2.7.5 and 2.7.15. (Respectively 30-100 in 1000 and 4-22 in 1000.)

Comment 9 Marcel Plch 2018-09-03 11:32:48 UTC
I have reproduced this on actual master (Python 3.8), x86_64.
I'm going to report this upstream and will backport the patch once it will have become available.
----------------------
$ git clone https://github.com/python/cpython.git
$ cd cpython
$ ./configure --enable-shared CFLAGS="-O0" LDFLAGS="-Wl,--rpath=$(pwd)"
$ make -j
$ for i in $(seq 1 1000); do ./python script.py; done | grep poll | sort | uniq -c
    630 output: '', return code: 0, pollstatus=0
    370 output: '', return code: None, pollstatus=None

Comment 18 Pavel Moravec 2018-09-06 07:52:53 UTC
Changing component to sos, as sosreport wrongly calls python's API for pipes.

Comment 22 errata-xmlrpc 2018-10-30 10:33:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:3144