Bug 1623070
| Summary: | (s390x only) sos skips collecting random data due to pipe returncode None | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Pavel Moravec <pmoravec> | ||||||||
| Component: | sos | Assignee: | Pavel Moravec <pmoravec> | ||||||||
| Status: | CLOSED ERRATA | QA Contact: | Miroslav HradĂlek <mhradile> | ||||||||
| Severity: | high | Docs Contact: | |||||||||
| Priority: | urgent | ||||||||||
| Version: | 7.6 | CC: | agk, bmr, cstratak, dhorak, gavin, jkejda, lmiksik, mhradile, mplch, ovasik, plambri, pmoravec, pviktori, sbradley | ||||||||
| Target Milestone: | rc | ||||||||||
| Target Release: | --- | ||||||||||
| Hardware: | s390x | ||||||||||
| OS: | Linux | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | sos-3.6-6.el7 | Doc Type: | If docs needed, set a value | ||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | |||||||||||
| : | 1627544 (view as bug list) | Environment: | |||||||||
| Last Closed: | 2018-10-30 10:33:42 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Bug Depends On: | |||||||||||
| Bug Blocks: | 1627544 | ||||||||||
| Attachments: |
|
||||||||||
Created attachment 1479243 [details]
reproducer script
Created attachment 1479244 [details]
strace from the buggy case
strace -Tttfxvo strace.txt-None python repro_pipe_unfinished.py
from a case when returnstatus=None
Created attachment 1479245 [details]
strace from an OK case
strace -Tttfxvo strace.txt-OK python repro_pipe_unfinished.py
from a case when returnstatus=0
Comparing straces, I *think* this matters (but take the analysis as from a strace noob):
OK: execve was here completed:
..
61682 08:19:05.652309 close(1023) = -1 EBADF (Bad file descriptor) <0.000005>
61682 08:19:05.652467 execve("/usr/bin/grep", ["/usr/bin/grep", "-q", "localhost", "/etc/hosts"], ["LAB_CONTROLLER=lab-02.rhts.eng.b"..., "XDG_SESSION_ID=35", "HOSTNAME=ibm-z-33.rhts.eng.bos.r"..., "SELINUX_ROLE_REQUESTED=", "TERM=xterm-256color", "SHELL=/bin/bash", "HISTSIZE=1000", "SSH_CLIENT=10.40.205.102 35474 2"..., "SELINUX_USE_CURRENT_RANGE=", "BUILDURL=http://download.eng.bos"..., "DUMPSERVER=netdump-01.eng.bos.re"..., "QTDIR=/usr/lib64/qt-3.3", "QTINC=/usr/lib64/qt-3.3/include", "SSH_TTY=/dev/pts/1", "RHTS_OPTION_COMPATIBLE=", "QT_GRAPHICSSYSTEM_CHECKED=1", "USER=root", "LS_COLORS=rs=0:di=38;5;27:ln=38;"..., "BEAKER_RECIPE_WHITEBOARD=Server "..., "BEAKER_JOB_WHITEBOARD=[TR#340267"..., "MAIL=/var/spool/mail/root", "PATH=/usr/lib64/qt-3.3/bin:/usr/"..., "PWD=/root", "XMODIFIERS=@im=none", "LANG=en_US.UTF-8", "BEAKER=https://beaker.engineerin"..., "LOOKASIDE=http://download.eng.bo"..., "SELINUX_LEVEL_REQUESTED=", "HISTCONTROL=ignoredups", "BEAKER_HUB_URL=https://beaker.en"..., "BEAKER_RECIPE_ID=5584053", "BEAKER_LAB_CONTROLLER=lab-02.rht"..., "SHLVL=1", "HOME=/root", "NFSSERVERS=rhel4-nfs.rhts.eng.bo"..., "BEAKER_LAB_CONTROLLER_URL=http:/"..., "LOGNAME=root", "RHTS_OPTION_COMPAT_SERVICE=", "QTLIB=/usr/lib64/qt-3.3/lib", "SSH_CONNECTION=10.40.205.102 354"..., "LESSOPEN=||/usr/bin/lesspipe.sh "..., "XDG_RUNTIME_DIR=/run/user/0", "_=/usr/bin/strace"]) = 0 <0.000414>
None: execve was interrupted by something else and resumed later on:
61658 08:18:54.097997 close(1023) = -1 EBADF (Bad file descriptor) <0.000006>
61658 08:18:54.098170 execve("/usr/bin/grep", ["/usr/bin/grep", "-q", "localhost", "/etc/hosts"], ["LAB_CONTROLLER=lab-02.rhts.eng.b"..., "XDG_SESSION_ID=35", "HOSTNAME=ibm-z-33.rhts.eng.bos.r"..., "SELINUX_ROLE_REQUESTED=", "TERM=xterm-256color", "SHELL=/bin/bash", "HISTSIZE=1000", "SSH_CLIENT=10.40.205.102 35474 2"..., "SELINUX_USE_CURRENT_RANGE=", "BUILDURL=http://download.eng.bos"..., "DUMPSERVER=netdump-01.eng.bos.re"..., "QTDIR=/usr/lib64/qt-3.3", "QTINC=/usr/lib64/qt-3.3/include", "SSH_TTY=/dev/pts/1", "RHTS_OPTION_COMPATIBLE=", "QT_GRAPHICSSYSTEM_CHECKED=1", "USER=root", "LS_COLORS=rs=0:di=38;5;27:ln=38;"..., "BEAKER_RECIPE_WHITEBOARD=Server "..., "BEAKER_JOB_WHITEBOARD=[TR#340267"..., "MAIL=/var/spool/mail/root", "PATH=/usr/lib64/qt-3.3/bin:/usr/"..., "PWD=/root", "XMODIFIERS=@im=none", "LANG=en_US.UTF-8", "BEAKER=https://beaker.engineerin"..., "LOOKASIDE=http://download.eng.bo"..., "SELINUX_LEVEL_REQUESTED=", "HISTCONTROL=ignoredups", "BEAKER_HUB_URL=https://beaker.en"..., "BEAKER_RECIPE_ID=5584053", "BEAKER_LAB_CONTROLLER=lab-02.rht"..., "SHLVL=1", "HOME=/root", "NFSSERVERS=rhel4-nfs.rhts.eng.bo"..., "BEAKER_LAB_CONTROLLER_URL=http:/"..., "LOGNAME=root", "RHTS_OPTION_COMPAT_SERVICE=", "QTLIB=/usr/lib64/qt-3.3/lib", "SSH_CONNECTION=10.40.205.102 354"..., "LESSOPEN=||/usr/bin/lesspipe.sh "..., "XDG_RUNTIME_DIR=/run/user/0", "_=/usr/bin/strace"] <unfinished ...>
61657 08:18:54.098693 <... read resumed> "", 1048576) = 0 <0.023026>
61657 08:18:54.098715 mremap(0x3fff6884000, 1052672, 4096, MREMAP_MAYMOVE) = 0x3fff6884000 <0.000011>
61657 08:18:54.098756 close(4) = 0 <0.000006>
61657 08:18:54.098788 close(6) = 0 <0.000007>
61657 08:18:54.098812 close(7) = 0 <0.000011>
61657 08:18:54.098849 munmap(0x3fff6884000, 4096) = 0 <0.000010>
61657 08:18:54.098893 fcntl(3, F_GETFL) = 0 (flags O_RDONLY) <0.000007>
61657 08:18:54.098924 fstat(3, {st_dev=makedev(0, 9), st_ino=1639909, st_mode=S_IFIFO|0600, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2018/08/28-08:18:54.064743374, st_mtime=2018/08/28-08:18:54.064743374, st_ctime=2018/08/28-08:18:54.064743374}) = 0 <0.000007>
61657 08:18:54.098974 fcntl(5, F_GETFL) = 0 (flags O_RDONLY) <0.000006>
61657 08:18:54.099000 fstat(5, {st_dev=makedev(0, 9), st_ino=1639910, st_mode=S_IFIFO|0600, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2018/08/28-08:18:54.064743374, st_mtime=2018/08/28-08:18:54.064743374, st_ctime=2018/08/28-08:18:54.064743374}) = 0 <0.000026>
61657 08:18:54.099226 mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x3fff6184000 <0.000009>
61657 08:18:54.099267 mprotect(0x3fff6184000, 4096, PROT_NONE) = 0 <0.000008>
61657 08:18:54.099294 clone(child_stack=0x3fff6984210, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x3fff69849e0, tls=0x3fff6984910, child_tidptr=0x3fff69849e0) = 61659 <0.000039>
61658 08:18:54.099353 <... execve resumed> ) = 0 <0.001072>
Cant this bad interleaving in "None" case cause the bug?
FYI on a different s390x machine (gss5.s390.bos.redhat.com, default beaker password), I can reproduce it much much _rarely_:
# for i in $(seq 1 1000); do python ~/repro_pipe_unfinished.py; done | grep poll | sort | uniq -c
995 output: '', return code: 0, pollstatus=0
5 output: '', return code: None, pollstatus=None
#
(for python maintainers: feel free to use that machine to anything - I might test potential workaround in sosreport there but nothing else)
This bug does not occur when compiled from source using --with-pydebug. (maybe the printing of "[# refs]" is slowing the process down and preventing this?) There is also quite a difference in likelihood of this happening between versions 2.7.5 and 2.7.15. (Respectively 30-100 in 1000 and 4-22 in 1000.) I have reproduced this on actual master (Python 3.8), x86_64. I'm going to report this upstream and will backport the patch once it will have become available. ---------------------- $ git clone https://github.com/python/cpython.git $ cd cpython $ ./configure --enable-shared CFLAGS="-O0" LDFLAGS="-Wl,--rpath=$(pwd)" $ make -j $ for i in $(seq 1 1000); do ./python script.py; done | grep poll | sort | uniq -c 630 output: '', return code: 0, pollstatus=0 370 output: '', return code: None, pollstatus=None Changing component to sos, as sosreport wrongly calls python's API for pipes. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:3144 |
Description of problem: We have a reproducer (bit artificial but narrowed down from a real use case / sosreport tool) where sometimes: - a spawned thread reads from opened pipe - pipe read returns EOF - BUT pipe returncode is still None - also pipe poll returns None, so the pipe isnt closed yet The returncode == None is crucial problem for us, as we rely on the return code from the command executed. The above holds for a nontrivial amount of time (while we "process" the output somehow, see get_contents method). The problem (sometimes) happens for empty output from the pipe, BUT also (sometimes) for a nonempty one - see "variant for nonempty stdout:". Consequence of the bug: sosreport fails to collect absolutely random commands / files, what makes it totally unusable on this architecture. Version-Release number of selected component (if applicable): python-2.7.5-69.el7_5.s390x How reproducible: approx 16% Steps to Reproduce: 1. use attached script 2. run it 1000times, count # of returncode values: for i in $(seq 1 1000); do python repro_pipe_unfinished.py | grep poll; done | sort | uniq -c Actual results: 839 output: '', return code: 0, pollstatus=0 161 output: '', return code: None, pollstatus=None Expected results: Just "return code: 0, pollstatus=0" line. Additional info: - seems really s390x specific - very probabilistic - probability does not seem to be affected by stdout ((non)zero) length