Bug 1819662
Summary: | sosreport hangs on rabbitmq plugin | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Christophe Besson <cbesson> |
Component: | sos | Assignee: | Pavel Moravec <pmoravec> |
Status: | CLOSED ERRATA | QA Contact: | Miroslav HradĂlek <mhradile> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 8.0 | CC: | agk, astupnik, bmr, dwalsh, jhunsaker, mhradile, michele, pkhedeka, plambri, pmoravec, sbradley, vincent.vossier |
Target Milestone: | rc | Keywords: | OtherQA |
Target Release: | 8.0 | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | sos-3.9.1-2.el8 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-11-04 01:57:56 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Is there an internal reproducer I can play with sos? Or could you try alternative fix on 3.9 / upstream sos, via running: rm -rf master.zip sos-master curl -L -O https://github.com/sosreport/sos/archive/master.zip unzip master.zip cd sos-master (now apply patch: --- a/sos/report/plugins/rabbitmq.py +++ b/sos/report/plugins/rabbitmq.py @@ -34,7 +34,8 @@ class RabbitMQ(Plugin, RedHatPlugin, Deb for container in container_names: self.get_container_logs(container) self.add_cmd_output( - self.fmt_container_cmd(container, 'rabbitmqctl report') + self.fmt_container_cmd(container, 'rabbitmqctl report'), + foreground=True ) else: self.add_cmd_output("rabbitmqctl report") ) python3 ./bin/sosreport # optionally, use some cmdline opts here I prefer the foregorund=True patch rather than timeout=None: the later can allow potentially hung podman command to block sosreport execution for redundantly long time About the podman suspected bug: could you try potential reproducer outside sos report: ------8<----------- from subprocess import Popen, PIPE p = Popen("/usr/bin/podman exec -t rabbitmq-bundle-podman-0 uptime", shell=False, stdout=PIPE, stderr=PIPE, bufsize=-1, close_fds=True) out, err = p.communicate() if err: print("Error: %s" % err) print("Output: %s" % out) ------8<----------- ? Thanks Pavel, so here we go. (In reply to Pavel Moravec from comment #8) > Is there an internal reproducer I can play with sos? In theory I can prepare an env, but it is a bit convoluted atm > Or could you try alternative fix on 3.9 / upstream sos, via running: > > rm -rf master.zip sos-master > curl -L -O https://github.com/sosreport/sos/archive/master.zip > unzip master.zip > cd sos-master > > (now apply patch: > --- a/sos/report/plugins/rabbitmq.py > +++ b/sos/report/plugins/rabbitmq.py > @@ -34,7 +34,8 @@ class RabbitMQ(Plugin, RedHatPlugin, Deb > for container in container_names: > self.get_container_logs(container) > self.add_cmd_output( > - self.fmt_container_cmd(container, 'rabbitmqctl report') > + self.fmt_container_cmd(container, 'rabbitmqctl report'), > + foreground=True > ) > else: > self.add_cmd_output("rabbitmqctl report") > ) > > python3 ./bin/sosreport # optionally, use some cmdline opts here > > > I prefer the foregorund=True patch rather than timeout=None: the later can > allow potentially hung podman command to block sosreport execution for > redundantly long time With 'foreground=True' it worked just fine and no timeouts where hit: Setting up archive ... [archive:TarFileArchive] initialised empty FileCacheArchive at '/var/tmp/sos.hn0gcpy3/sosreport-messaging-1-2020-05-05-hgitmih' [sos.report:setup] executing 'sos report --batch --verbose --debug -o rabbitmq' [sos.report:setup] using 'rhosp' preset defaults (--plugopts process.lsof=off,networking.ethtool_namespaces=False,networking.namespaces=200) [sos.report:setup] effective options now: --batch --debug --onlyplugins rabbitmq --plugopts process.lsof=off,networking.ethtool_namespaces=False,networking.namespaces=200 -v Setting up plugins ... [plugin:rabbitmq] added cmd output 'podman logs -t rabbitmq-bundle-podman-1' [plugin:rabbitmq] added cmd output 'podman exec rabbitmq-bundle-podman-1 'rabbitmqctl report'' Running plugins. Please wait ... Starting 1/1 rabbitmq [Running: rabbitmq] [plugin:rabbitmq] collecting path '/var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/inetrc' [plugin:rabbitmq] collecting path '/var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/enabled_plugins' [plugin:rabbitmq] collecting path '/var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/ssl' [plugin:rabbitmq] collecting path '/var/lib/config-data/puppet-generated/rabbitmq/etc/systemd/' [plugin:rabbitmq] collecting path '/var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq-env.conf' [plugin:rabbitmq] collecting path '/var/lib/config-data/puppet-generated/rabbitmq/etc/security/limits.d/' [plugin:rabbitmq] collecting path '/var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq.config' [plugin:rabbitmq] collecting output of 'podman logs -t rabbitmq-bundle-podman-1' [plugin:rabbitmq] collecting output of 'podman exec rabbitmq-bundle-podman-1 'rabbitmqctl report'' Finished running plugins Creating compressed archive... [archive:TarFileArchive] finalizing archive '/var/tmp/sos.hn0gcpy3/sosreport-messaging-1-2020-05-05-hgitmih' using method 'auto' [archive:TarFileArchive] built archive at '/var/tmp/sos.hn0gcpy3/sosreport-messaging-1-2020-05-05-hgitmih.tar' (size=81920) Your sosreport has been generated and saved in: /var/tmp/sosreport-messaging-1-2020-05-05-hgitmih.tar.xz Size 7.39KiB Owner root md5 9343896c3dd05283476045b062d94649 > About the podman suspected bug: could you try potential reproducer outside > sos report: > > > ------8<----------- > from subprocess import Popen, PIPE > > p = Popen("/usr/bin/podman exec -t rabbitmq-bundle-podman-0 uptime", > shell=False, > stdout=PIPE, > stderr=PIPE, > bufsize=-1, > close_fds=True) > > out, err = p.communicate() > > if err: > print("Error: %s" % err) > > print("Output: %s" % out) > ------8<----------- > > ? # Note I had to transform the string into an array of strings otherwise I'd get a FileNotFound Exception. [root@messaging-1 ~]# more repro.py from subprocess import Popen, PIPE p = Popen(["/usr/bin/podman", "exec", "-t", "rabbitmq-bundle-podman-1", "uptime"], shell=False, stdout=PIPE, stderr=PIPE, bufsize=-1, close_fds=True) out, err = p.communicate() if err: print("Error: %s" % err) print("Output: %s" % out) [root@messaging-1 ~]# python3 repro.py Output: b' 20:26:01 up 7:43, 0 users, load average: 0.21, 0.36, 0.34\r\n' Let me know if you need anything else. Thanks for your help, Michele Ah I forget (also) to "timeout 300s" in the reproducer - could you try: ---------------------8<----------------- from subprocess import Popen, PIPE p = Popen(["timeout", "300s", "/usr/bin/podman", "exec", "-t", "rabbitmq-bundle-podman-1", "uptime"], shell=False, stdout=PIPE, stderr=PIPE, bufsize=-1, close_fds=True) out, err = p.communicate() if err: print("Error: %s" % err) ---------------------8<----------------- ? That *should* hang (and can be used as a reproducer of a podman(?) bug that should be fixed as the root cause/problem). As a workaround fix in sosreport, PR 2047 with the foreground=True option is raised. Assuming otherQE can happen (please confirm so), this can be fixed in 8.3. *** Bug 1834691 has been marked as a duplicate of this bug. *** Could either the reporter or Parikshit confirm OtherQE? If so, I can include the bugfix in RHEL 8.3 Thanks for the reproducer. Is the command (that sos report calls) correct? Should we really call: podman exec rabbitmq-bundle-podman-0 'rabbitmqctl report' ? Since: [root@controller-0 sos-master]# podman exec rabbitmq-bundle-podman-0 'rabbitmqctl report' Error: exec failed: container_linux.go:346: starting container process caused "exec: \"rabbitmqctl report\": executable file not found in $PATH": OCI runtime command not found error [root@controller-0 sos-master]# podman exec rabbitmq-bundle-podman-0 'which rabbitmqctl' Error: exec failed: container_linux.go:346: starting container process caused "exec: \"which rabbitmqctl\": executable file not found in $PATH": OCI runtime command not found error [root@controller-0 sos-master]# podman exec rabbitmq-bundle-podman-0 'echo $PATH' Error: exec failed: container_linux.go:346: starting container process caused "exec: \"echo $PATH\": executable file not found in $PATH": OCI runtime command not found error [root@controller-0 sos-master]# All the commands return 127, that sos report interprets as "command not found". Calling "ls" inside the container works, so just some problem with environment inside the container..? Hmm, not sure to understand why, I'm not a podman/osp specialist... Indeed I have the same error by quoting the command, but that works for me without the quotes: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # podman exec rabbitmq-bundle-podman-0 rabbitmqctl report | head Reporting server status of node rabbit@controller-0 ... Status of node rabbit@controller-0 ... [{pid,726}, {running_applications, [{rabbitmq_management,"RabbitMQ Management Console","3.7.23"}, {rabbitmq_management_agent,"RabbitMQ Management Agent","3.7.23"}, {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.7.23"}, {rabbit,"RabbitMQ","3.7.23"}, {mnesia,"MNESIA CXC 138 12","4.15.3.2"}, ... ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ It's not the original issue, which seems to be fixed by your patch, and I won't be able to say whether it is a normal behavior or specific to this setup. Dan, a question from podman newbie: if one wants to execute a command with arguments inside a container, what is the correct or allowed syntax of "podman exec" (I failed to find it in manpages and similar)? Which of below is correct or allowed? podman exec containerID "cmd arg1 arg2" podman exec containerID 'cmd arg1 arg2' podman exec containerID cmd arg1 arg2 (sos report is requested to collect output of command podman exec rabbitmq-bundle-podman-0 rabbitmqctl report that is currently called by sos report as: podman exec rabbitmq-bundle-podman-0 'rabbitmqctl report' that fails - we are not sure if the problem is deployment specific or podman specific or invalid syntax) Hello, could you please promise verification of the fix against candidate build by yourself? This falls beyond scope of sos report QE team.. (once the candidate build will be ready / in one week, I will ask for the verification itself) Hello, thaks for offer to verify this BZ. Please use the below build for verification: A yum repository for the build of sos-3.9.1-2.el8 (task 28860028) is available at: http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9.1/2.el8/ You can install the rpms locally by putting this .repo file in your /etc/yum.repos.d/ directory: http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9.1/2.el8/sos-3.9.1-2.el8.repo RPMs and build logs can be found in the following locations: http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9.1/2.el8/noarch/ The full list of available rpms is: http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9.1/2.el8/noarch/sos-3.9.1-2.el8.src.rpm http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9.1/2.el8/noarch/sos-3.9.1-2.el8.noarch.rpm http://brew-task-repos.usersys.redhat.com/repos/official/sos/3.9.1/2.el8/noarch/sos-audit-3.9.1-2.el8.noarch.rpm The repository will be available for the next 50 days. Scratch build output will be deleted earlier, based on the Brew scratch build retention policy. I've just tested the candidate sos-3.9.1-2.el8 and it works as expected: no timeout and report correctly gathered. # du -hs ./sos_commands/rabbitmq/docker_exec_rabbitmq-bundle-podman-0_rabbitmqctl_report 164K ./sos_commands/rabbitmq/docker_exec_rabbitmq-bundle-podman-0_rabbitmqctl_report Thanks for your time and for the patch. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (sos bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:4534 |
Description of problem: sosreport is hanging on RHOSP 15 with the rabbitmq plugin. Commands executed by this plugin work fine when running them manually. Version-Release number of selected component (if applicable): sos-3.6-10.el8_0.3.noarch podman-1.0.5-1.gitf604175.module+el8.0.0+4017+bbba319f.x86_64 runc-1.0.0-56.rc8.dev.git425e105.module+el8.0.0+4017+bbba319f.x86_64 How reproducible: 100% Steps to Reproduce: 1. Setup a RHOSP 15 with a rabbitmq container. 2. Run an sosreport Actual results: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # sosreport -v --batch -o rabbitmq ---[cut]--- [sos.sosreport:setup] using 'rhosp' preset defaults (--all-logs --plugopts process.lsof=off --threads 4) [sos.sosreport:setup] effective options now: --all-logs --batch --onlyplugins rabbitmq --plugopts process.lsof=off --threads 4 -v Setting up plugins ... [plugin:rabbitmq] added cmd output 'docker logs rabbitmq-bundle-podman-0' [plugin:rabbitmq] added cmd output 'docker exec -t rabbitmq-bundle-podman-0 rabbitmqctl report' Running plugins. Please wait ... Starting 1/1 rabbitmq [Running: rabbitmq] ---[cut]--- [plugin:rabbitmq] collecting output of 'docker logs rabbitmq-bundle-podman-0' [plugin:rabbitmq] collecting output of 'docker exec -t rabbitmq-bundle-podman-0 rabbitmqctl report' <<<hang here during 300s>>> Plugin rabbitmq timed out ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Expected results: The command ends without hanging and not leaving remaining processes. Additional info: # sosreport -v --batch -o rabbitmq ---[cut]--- [plugin:rabbitmq] added cmd output 'docker logs rabbitmq-bundle-podman-0' [plugin:rabbitmq] added cmd output 'docker exec -t rabbitmq-bundle-podman-0 uptime' Running plugins. Please wait ... Starting 1/1 rabbitmq [Running: rabbitmq] ---[cut]--- [plugin:rabbitmq] collecting output of 'docker logs rabbitmq-bundle-podman-0' [plugin:rabbitmq] collecting output of 'docker exec -t rabbitmq-bundle-podman-0 uptime' > /usr/lib/python3.6/site-packages/sos/utilities.py(126)sos_get_command_output() ---[cut]--- /usr/lib/python3.6/site-packages/sos/plugins/__init__.py: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ def get_command_output(...): ---[cut]--- result = sos_get_command_output(prog, timeout=timeout, stderr=stderr, chroot=root, chdir=runat, env=env, binary=binary, sizelimit=sizelimit) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ---[cut]--- /usr/lib/python3.6/site-packages/sos/utilities.py: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ def sos_get_command_output(...): ---[cut]--- p = Popen(expanded_args, shell=False, stdout=PIPE, stderr=STDOUT if stderr else PIPE, bufsize=-1, env=cmd_env, close_fds=True, preexec_fn=_child_prep_fn) reader = AsyncReader(p.stdout, sizelimit, binary) stdout = reader.get_contents() ---[cut]--- class AsyncReader(threading.Thread): def __init__(self, channel, sizelimit, binary): ---[cut]--- self.start() self.join() <<<<<<< hang here ---[cut]--- /usr/lib64/python3.6/threading.py: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ def _wait_for_tstate_lock(self, block=True, timeout=-1): # Issue #18808: wait for the thread state to be gone. # At the end of the thread's life, after all knowledge of the thread # is removed from C data structures, C code releases our _tstate_lock. # This method passes its arguments to _tstate_lock.acquire(). # If the lock is acquired, the C code is done, and self._stop() is # called. That sets ._is_stopped to True, and ._tstate_lock to None. lock = self._tstate_lock if lock is None: # already determined that the C code is done assert self._is_stopped elif lock.acquire(block, timeout): <<<<<< hang here lock.release() self._stop() ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From the rabbitmq plugin, we can see the first command is correctly executed "docker ps -a". And then the python code iterates the list of containers, and execute the command "rabbitmqctl report" on the rabbitmq container through docker/podman exec. For testing purpose, I replaced it with "uptime". The command is executed, we can see it with ps: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [root@controller-0 heat-admin]# ps aux| grep -E "sos|uptime" root 379349 4.4 0.2 578344 68948 pts/2 Sl+ 12:46 0:02 /usr/libexec/platform-python -s /sbin/sosreport -v --batch -o rabbitmq root 380459 0.0 0.0 8636 832 pts/2 S 12:46 0:00 timeout 300s docker exec -t rabbitmq-bundle-podman-0 uptime root 380461 0.1 0.1 1173012 51108 pts/2 Tl 12:46 0:00 /usr/bin/podman exec -t rabbitmq-bundle-podman-0 uptime root 380476 287 0.0 1156068 21620 pts/2 Rl 12:46 2:49 /usr/bin/runc --log /var/lib/containers/storage/overlay-containers/c9d42ebb58fc5bdc284d8eb88d24d6c33c6ef41380ae61cbaf3c9f0d931750f9/userdata/ctr.log exec --pid-file /var/run/containers/storage/overlay-containers/c9d42ebb58fc5bdc284d8eb88d24d6c33c6ef41380ae61cbaf3c9f0d931750f9/userdata/exec_pid_3d95991d6e65eb7706b188278b313072493ab8ff5230cf39ff9b9cb19d692478 --tty c9d42ebb58fc5bdc284d8eb88d24d6c33c6ef41380ae61cbaf3c9f0d931750f9 uptime root 380485 0.0 0.0 0 0 ? Zs 12:46 0:00 [uptime] <defunct> root 383486 0.0 0.0 12112 2508 pts/3 S+ 12:47 0:00 grep --color=auto -E sos|uptime ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ==> the process "uptime" is zombified. While stracing "runc exec uptime": ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [pid 380476] futex(0x5619a7683ce0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> [pid 380828] <... futex resumed>) = -1 EAGAIN (Resource temporarily unavailable) [pid 380481] <... futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) [pid 380480] futex(0x5619a76a04a0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 380479] <... rt_sigreturn resumed>) = 16 [pid 380828] futex(0x5619a76a04a0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> [pid 380494] --- SIGTTIN {si_signo=SIGTTIN, si_code=SI_KERNEL} --- [pid 380480] <... futex resumed>) = 0 [pid 380479] ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost -isig -icanon -echo ...} <unfinished ...> [pid 380494] futex(0x5619a76a04a0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 380480] rt_sigreturn({mask=[]} <unfinished ...> [pid 380494] <... futex resumed>) = 1 [pid 380481] --- SIGTTOU {si_signo=SIGTTOU, si_code=SI_KERNEL} --- [pid 380828] <... futex resumed>) = 0 [pid 380479] <... ioctl resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) [pid 380476] <... futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) [pid 380494] rt_sigreturn({mask=[]} <unfinished ...> [pid 380480] <... rt_sigreturn resumed>) = 202 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ It works without delay if I execute the command manually: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # time podman exec -t rabbitmq-bundle-podman-0 uptime 12:33:56 up 6 days, 4:30, 0 users, load average: 2.37, 6.28, 9.24 real 0m0.288s user 0m0.086s sys 0m0.157s ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~