Description of problem:

When following the official FFU steps, a minor update is required in advance.
Having an environment deployed via the following Jenkins job:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/OSPD-Customized-Deployment-virt/6037/

the steps for the minor update were followed from the official FFU docs:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/fast_forward_upgrades/#updating_the_current_undercloud_packages_for_openstack_platform_10_z

Right at the point of running the minor update with the command
"openstack overcloud update stack -i overcloud", the process remains running
indefinitely, appearing as IN_PROGRESS. No further message is displayed.

[stack@undercloud-0 ~]$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
IN_PROGRESS
WAITING
on_breakpoint: [u'COMPUTE-R00-00', u'CONTROLLER-R00-00']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear f338306e-2efc-414d-a3f4-a0928072e3ea), no=cancel update, C-c=quit interactive mode: ^BqIN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS

When checking the heat resources, the following resources ended up FAILED
(once the job timed out; during the execution they constantly appeared as
IN_PROGRESS):

[stack@undercloud-0 ~]$ openstack stack resource list -n5 overcloud | grep -v COMPLETE
+------------------+--------------------------------------+------------------------------+-----------------+----------------------+--------------------------------------------------+
| resource_name    | physical_resource_id                 | resource_type                | resource_status | updated_time         | stack_name                                       |
+------------------+--------------------------------------+------------------------------+-----------------+----------------------+--------------------------------------------------+
| Controller       | 05a36863-6bd1-4d91-ba52-b05117950721 | OS::Heat::ResourceGroup      | UPDATE_FAILED   | 2018-08-30T08:33:58Z | overcloud                                        |
| Compute          | f394c02b-0196-4eff-b67d-ed94ef154d6a | OS::Heat::ResourceGroup      | UPDATE_FAILED   | 2018-08-30T08:34:00Z | overcloud                                        |
| 0                | 6e39fc55-3022-4a35-af43-b64a66c3f854 | OS::TripleO::Controller      | UPDATE_FAILED   | 2018-08-30T08:34:00Z | overcloud-Controller-2ghxhd7q6nme                |
| UpdateDeployment | f338306e-2efc-414d-a3f4-a0928072e3ea | OS::Heat::SoftwareDeployment | UPDATE_FAILED   | 2018-08-30T08:36:02Z | overcloud-Controller-2ghxhd7q6nme-0-mztvvqqdjmlw |
| 0                | 707adf28-7c55-475b-be56-3636d693ad08 | OS::TripleO::Compute         | UPDATE_FAILED   | 2018-08-30T08:34:03Z | overcloud-Compute-s6qobk26h3ma                   |
+------------------+--------------------------------------+------------------------------+-----------------+----------------------+--------------------------------------------------+

Inside the overcloud nodes, no os-collect-config logs for the minor update are
obtained, even though they would be expected, since a SoftwareDeployment
resource is IN_PROGRESS.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Install the environment with the Jenkins job
   https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/OSPD-Customized-Deployment-virt/6037/
2. Update the undercloud following
   https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/fast_forward_upgrades/#updating_the_current_undercloud_packages_for_openstack_platform_10_z
3. Reboot.
4. Update the overcloud images following
   https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/fast_forward_upgrades/#updating_the_current_overcloud_images_for_openstack_platform_10_z
5. Run overcloud-deploy with --update-plan-only:

-------------------------_SCRIPT_------------------------
openstack overcloud deploy --update-plan-only \
  --templates \
  --debug \
  --stack overcloud \
  -e /home/stack/virt/config_lvm.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/virt/network/network-environment.yaml \
  -e /home/stack/virt/hostnames.yml \
  -e /home/stack/virt/extra_templates.yaml
----------------------------------------------------------

6. Run the minor update:

openstack overcloud update stack -i overcloud

Actual results:
The minor update hangs and ends with a workflow timeout, as the
os-collect-config process never runs the plan scripts.

Expected results:
The minor update finishes successfully on the overcloud nodes.

Additional info:
Hi, so on the live env we could strace the os-collect-config processes and saw:

3344  write(2, "No local metadata found (['/var/"..., 68) = -1 EPIPE (Broken pipe)
3344  --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3344, si_uid=0} ---
3344  write(2, "Traceback (most recent call last"..., 35) = -1 EPIPE (Broken pipe)
3344  --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3344, si_uid=0} ---

os-collect-config is process 3344. It tries to write "No local metadata
found (..." from the local.py collector. Nothing wrong with that. The
problem, though, is that stdout is no longer there:

3344  --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3344, si_uid=0} ---
3344  write(2, "Traceback (most recent call last"..., 35) = -1 EPIPE (Broken pipe)

os-collect-config receives a SIGPIPE, which means that its stdout is gone.
This unix behavior is expected; it is why:

$ yes | time

works as expected: when time closes its stdin, the yes command receives
SIGPIPE on its next write to stdout and the whole pipeline stops.

So back to os-collect-config. It has lost its stdout and restarts from the
beginning. Thus no action is ever taken, and heat on the undercloud waits for
an answer till the end of time.

The reason os-collect-config has lost its stdout may be linked to journald
rotation. In /var/log/messages you can see:

Aug 29 10:54:56 controller-r00-00 os-collect-config: No local metadata found (['/var/lib/os-collect-config/local-data'])
...
Aug 29 10:58:48 controller-r00-00 journal: Permanent journal is using 40.0M (max allowed 2.9G, trying to leave 4.0G free of 23.6G available → current limit 2.9G).
Aug 29 10:58:48 controller-r00-00 journal: Journal started

and then os-collect-config messages are never to be seen again. I cannot seem
to reproduce this using "systemctl restart systemd-journald" or by sending
SIGUSR2 to the systemd-journald pid, so there may be something more to it.

Restarting os-collect-config makes the whole process work again, as it gets a
fresh stdin/stdout from systemd.
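To make the failure mode concrete, here is a minimal, self-contained Python
sketch (illustrative only, not os-collect-config code) of the two ingredients:
CPython ignores SIGPIPE by default, so a write to a pipe whose reader has gone
away fails with EPIPE instead of killing the process:

  import errno
  import os
  import signal

  # CPython installs SIG_IGN for SIGPIPE at interpreter startup, so the
  # signal seen in the strace is delivered but discarded, and write()
  # returns -1 EPIPE instead of the default "kill the writer" behavior.
  print(signal.getsignal(signal.SIGPIPE) is signal.SIG_IGN)   # True

  r, w = os.pipe()
  os.close(r)                     # the reader (here: journald) goes away

  try:
      os.write(w, b"No local metadata found\n")
  except OSError as exc:
      # the same -1 EPIPE as in the strace; the process keeps running
      print("got EPIPE:", exc.errno == errno.EPIPE)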
So I could reproduce this outside of any upgrade process. Go to a node where
os-collect-config is running and do:

journalctl -u os-collect-config -f -n 10
# you should see some output regularly
# ...
# then
systemctl stop systemd-journald

Output stops. Now strace the process:

strace -f -p $(pgrep -f 'python.*os-collect-config') -o /tmp/os.log &

You should see that the process is looping:

3344  select(0, NULL, NULL, NULL, {30, 0}) = 0 (Timeout)   <- the 30s wait times out, then:
3344  write(2, "No local metadata found (['/var/"..., 68) = -1 EPIPE (Broken pipe)
3344  --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3344, si_uid=0} ---
3344  write(2, "Traceback (most recent call last"..., 35) = -1 EPIPE (Broken pipe)
3344  --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3344, si_uid=0} ---

Basically we're looping inside [1], and when we hit [2] os-collect-config gets
SIGPIPE.

Now to reproduce you just have to make sure that "collectors = local" is set
in the os-collect-config configuration and that
/var/lib/os-collect-config/local-data is missing.

[1] https://github.com/openstack/os-collect-config/blob/master/os_collect_config/collect.py#L158..L181
[2] https://github.com/openstack/os-collect-config/blob/master/os_collect_config/local.py#L66
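For illustration, the loop in [1] behaves roughly like this simplified sketch
(hypothetical code, not the real collect.py): the EPIPE raised while trying to
log is swallowed like any other collector error, so the process sleeps and
retries forever and heat never gets its answer:

  import sys
  import time

  def collect_local():
      # stand-in for the local.py collector at [2]: with journald gone,
      # this very write is what raises OSError(EPIPE)
      sys.stderr.write("No local metadata found\n")
      raise RuntimeError("no local metadata")

  while True:
      try:
          collect_local()
      except Exception:
          pass            # the error is "logged" -- which EPIPEs again
      time.sleep(30)      # the select(0, NULL, NULL, NULL, {30, 0})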
Based on the findings in comment #2, I'm moving this to dfg:cloudapp and renaming the ticket.
The versions are:

- systemd-219-57.el7_5.1.x86_64 (from Red Hat Enterprise Linux Server release 7.5 (Maipo))
- os-collect-config-5.2.1-2.el7ost.noarch

os-collect-config didn't change much, so maybe this is reproducible on master.
I followed Sofer's steps on master. I confirm the issue exists on master.
That looks like https://bugs.freedesktop.org/show_bug.cgi?id=84923, it's not specific at all to occ.
Hi Thomas,

(In reply to Thomas Hervé from comment #6)
> That looks like https://bugs.freedesktop.org/show_bug.cgi?id=84923, it's not
> specific at all to occ.

Well, specific maybe not, but the way occ reacts to this SIGPIPE is not the
best. Silently failing and blocking any further action until someone straces
the process or blindly restarts it is bad user experience.

First, a long-running process shouldn't really depend on stdout/stdin/stderr;
it should rather close them all, run from "/" and so on [1]. But even if the
current design doesn't allow for this, we should catch that SIGPIPE and do
something about it (see the sketch below):

1. fail hard (quit the "while true" loop with a nice error message)
2. re-open stdin/stdout/stderr if possible
3. something else user friendly ...

My 2 cents :)

[1] https://www.systutorials.com/docs/linux/man/1-daemonize/
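To illustrate option 1, a minimal sketch (assuming a collect-style main loop;
run_collectors is a hypothetical stand-in, not a real function): detect the
broken pipe instead of swallowing it, and exit non-zero so systemd's
on-failure restart can kick in:

  import errno
  import sys

  def run_collectors():
      # hypothetical stand-in for one iteration of the occ main loop
      sys.stderr.write("No local metadata found\n")

  try:
      run_collectors()
  except (IOError, OSError) as exc:
      if exc.errno == errno.EPIPE:
          # stdout/stderr are gone; logging would just EPIPE again,
          # so exit non-zero and let systemd restart the service
          sys.exit(1)
      raise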
I found the same problem in RHOSP 11. This is quite ugly, as it blocks the
upgrade/update: os-collect-config is hung, and unless it gets restarted there
is no way to go on.

As the solution for this doesn't seem trivial, and it might impact other
customers: why don't we include a task in the upgrade/update_tasks that checks
how long systemd-journald and os-collect-config have been running, and if
systemd-journald has been running for less time than os-collect-config,
restarts os-collect-config? (A rough sketch of such a check follows.)
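A rough sketch of what such a check could look like (hypothetical: the unit
names, the MainPID lookup and the /proc parsing are my assumptions, not an
existing update_task):

  import subprocess

  def start_ticks(unit):
      # start time of the unit's main process, in clock ticks since boot:
      # field 22 of /proc/<pid>/stat (index 19 after the comm field's ')')
      out = subprocess.check_output(['systemctl', 'show', '-p', 'MainPID', unit])
      pid = out.decode().strip().split('=', 1)[1]
      with open('/proc/%s/stat' % pid) as stat:
          return int(stat.read().rsplit(') ', 1)[1].split()[19])

  # journald started after os-collect-config: occ's stdout/stderr point at
  # a socket nobody reads any more, so bounce occ to get fresh ones
  if start_ticks('systemd-journald') > start_ticks('os-collect-config'):
      subprocess.check_call(['systemctl', 'restart', 'os-collect-config'])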
Hey Ben, are there any updates on this? IMHO this issue can harm upgrades and
updates very much, as os-collect-config hangs and the upgrade/update times out.
Okay, did some more investigation into this. There are two problems
contributing to this:

1) Python sets SIGPIPE to SIG_IGN by default. That doesn't allow us to handle
the signal properly. If we fix that, then we hit:

2) The os-collect-config service is set to restart only on-failure. Per the
systemd docs, "If set to on-failure, the service will be restarted when the
process exits with a non-zero exit code, is terminated by a signal (including
on core dump, but ***excluding the aforementioned four signals***)". The
aforementioned four signals are SIGHUP, SIGINT, SIGTERM or SIGPIPE, which
means the process dies and isn't restarted.

So I think the thing to do is to tweak the SIGPIPE handling in
os-collect-config so the signal kills the process, and to tweak the service so
systemd will restart it when that happens (see the sketch below). I'll start
on that tomorrow.
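In code terms, the shape of that fix is something like this (a sketch of the
approach described above, not the merged patch):

  import signal

  # undo Python's default SIG_IGN so the next write to the vanished
  # stdout/stderr terminates the process with SIGPIPE...
  signal.signal(signal.SIGPIPE, signal.SIG_DFL)

  # ...and, since Restart=on-failure explicitly excludes SIGPIPE, the unit
  # needs a matching tweak so systemd brings the service back, e.g.:
  #
  #   [Service]
  #   Restart=always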
Okay, patches are proposed to master. If/when we get agreement on the fix we'll need to backport them to all the affected branches (so, all of them?).
Probably yes: we observed the issue at least in OSP 10 and OSP 11, and if the
code did not change in the newer releases it will probably impact all of them.
Thanks a lot for spending time on this.
VERIFIED

http://pastebin.test.redhat.com/687508
http://pastebin.test.redhat.com/687507

os-collect-config-5.2.1-3.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0075