Description of problem:

When following the official FFU steps, a minor update is required in advance.
Having an environment deployed via the following Jenkins job:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/OSPD-Customized-Deployment-virt/6037/

the steps for the minor update were followed from the official FFU docs:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/fast_forward_upgrades/#updating_the_current_undercloud_packages_for_openstack_platform_10_z

Right at the point of running the minor update with the command
"openstack overcloud update stack -i overcloud", the process remains running
indefinitely, appearing as IN_PROGRESS. No further message is displayed.

[stack@undercloud-0 ~]$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
IN_PROGRESS
WAITING
on_breakpoint: [u'COMPUTE-R00-00', u'CONTROLLER-R00-00']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear f338306e-2efc-414d-a3f4-a0928072e3ea), no=cancel update, C-c=quit interactive mode: ^BqIN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS

When checking the heat resources, the following resources ended up FAILED
(once the job timed out; during the execution they constantly appeared as
IN_PROGRESS):

[stack@undercloud-0 ~]$ openstack stack resource list -n5 overcloud | grep -v COMPLETE
+------------------+--------------------------------------+------------------------------+-----------------+----------------------+--------------------------------------------------+
| resource_name    | physical_resource_id                 | resource_type                | resource_status | updated_time         | stack_name                                       |
+------------------+--------------------------------------+------------------------------+-----------------+----------------------+--------------------------------------------------+
| Controller       | 05a36863-6bd1-4d91-ba52-b05117950721 | OS::Heat::ResourceGroup      | UPDATE_FAILED   | 2018-08-30T08:33:58Z | overcloud                                        |
| Compute          | f394c02b-0196-4eff-b67d-ed94ef154d6a | OS::Heat::ResourceGroup      | UPDATE_FAILED   | 2018-08-30T08:34:00Z | overcloud                                        |
| 0                | 6e39fc55-3022-4a35-af43-b64a66c3f854 | OS::TripleO::Controller      | UPDATE_FAILED   | 2018-08-30T08:34:00Z | overcloud-Controller-2ghxhd7q6nme                |
| UpdateDeployment | f338306e-2efc-414d-a3f4-a0928072e3ea | OS::Heat::SoftwareDeployment | UPDATE_FAILED   | 2018-08-30T08:36:02Z | overcloud-Controller-2ghxhd7q6nme-0-mztvvqqdjmlw |
| 0                | 707adf28-7c55-475b-be56-3636d693ad08 | OS::TripleO::Compute         | UPDATE_FAILED   | 2018-08-30T08:34:03Z | overcloud-Compute-s6qobk26h3ma                   |
+------------------+--------------------------------------+------------------------------+-----------------+----------------------+--------------------------------------------------+

Inside the overcloud nodes, no os-collect-config logs for the minor update are
obtained, even though they would be expected, since a SoftwareDeployment
resource is IN_PROGRESS.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Install the environment with the Jenkins job
   https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/OSPD-Customized-Deployment-virt/6037/
2. Update the undercloud following
   https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/fast_forward_upgrades/#updating_the_current_undercloud_packages_for_openstack_platform_10_z
3. Reboot.
4. Update the overcloud images following
   https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/fast_forward_upgrades/#updating_the_current_overcloud_images_for_openstack_platform_10_z
5. Run overcloud-deploy with --update-plan-only:

-------------------------_SCRIPT_------------------------
openstack overcloud deploy --update-plan-only \
  --templates \
  --debug \
  --stack overcloud \
  -e /home/stack/virt/config_lvm.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/virt/network/network-environment.yaml \
  -e /home/stack/virt/hostnames.yml \
  -e /home/stack/virt/extra_templates.yaml
----------------------------------------------------------

6. Run the minor update:

openstack overcloud update stack -i overcloud

Actual results:
The minor update hangs and ends with a workflow timeout, as the
os-collect-config process never runs the plan scripts.

Expected results:
The minor update finishes successfully on the overcloud nodes.

Additional info:
Hi, so on the live env we could strace the os-collect-config processes and saw:

3344  write(2, "No local metadata found (['/var/"..., 68) = -1 EPIPE (Broken pipe)
3344  --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3344, si_uid=0} ---
3344  write(2, "Traceback (most recent call last"..., 35) = -1 EPIPE (Broken pipe)
3344  --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3344, si_uid=0} ---

os-collect-config is process 3344. It tries to write "No local metadata
found (..." from the local.py collector. Nothing wrong with that. The
problem, though, is that stdout is no longer there:

3344  --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3344, si_uid=0} ---
3344  write(2, "Traceback (most recent call last"..., 35) = -1 EPIPE (Broken pipe)

os-collect-config receives a SIGPIPE, which means that its stdout is gone.
This unix behavior is expected; it is why:

$ yes | time

works as expected: when time closes its stdin, the yes command receives
SIGPIPE on its next write to stdout and the whole pipeline stops.

So back to os-collect-config. It has lost its stdout and restarts from the
beginning. Thus no action is ever taken, and heat on the undercloud waits for
an answer till the end of time.

The reason os-collect-config has lost its stdout may be linked to journald
rotation. In /var/log/messages you can see:

Aug 29 10:54:56 controller-r00-00 os-collect-config: No local metadata found (['/var/lib/os-collect-config/local-data'])
...
Aug 29 10:58:48 controller-r00-00 journal: Permanent journal is using 40.0M (max allowed 2.9G, trying to leave 4.0G free of 23.6G available → current limit 2.9G).
Aug 29 10:58:48 controller-r00-00 journal: Journal started

and then os-collect-config messages are never to be seen again. I cannot seem
to reproduce this using "systemctl restart systemd-journald" or by sending
SIGUSR2 to the systemd-journald pid, so there may be something more to it.

Restarting os-collect-config makes the whole process work again, as it gets a
fresh stdin/stdout from systemd.
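To make the failure mode concrete, here is a minimal, self-contained Python
sketch (illustrative only, not os-collect-config code) of the two ingredients:
CPython ignores SIGPIPE by default, so a write to a pipe whose reader has gone
away fails with EPIPE instead of killing the process:

  import errno
  import os
  import signal

  # CPython installs SIG_IGN for SIGPIPE at interpreter startup, so the
  # signal seen in the strace is delivered but discarded, and write()
  # returns -1 EPIPE instead of the default "kill the writer" behavior.
  print(signal.getsignal(signal.SIGPIPE) is signal.SIG_IGN)   # True

  r, w = os.pipe()
  os.close(r)                     # the reader (here: journald) goes away

  try:
      os.write(w, b"No local metadata found\n")
  except OSError as exc:
      # the same -1 EPIPE as in the strace; the process keeps running
      print("got EPIPE:", exc.errno == errno.EPIPE)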
So I could reproduce this outside of any upgrade process. Go to a node where
os-collect-config is running and do:

journalctl -u os-collect-config -f -n 10
# you should see some output regularly
# ...
# then
systemctl stop systemd-journald

Output stops. Now strace the process:

strace -f -p $(pgrep -f 'python.*os-collect-config') -o /tmp/os.log &

You should see that the process is looping:

3344  select(0, NULL, NULL, NULL, {30, 0}) = 0 (Timeout)   <- the 30s wait times out, then:
3344  write(2, "No local metadata found (['/var/"..., 68) = -1 EPIPE (Broken pipe)
3344  --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3344, si_uid=0} ---
3344  write(2, "Traceback (most recent call last"..., 35) = -1 EPIPE (Broken pipe)
3344  --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3344, si_uid=0} ---

Basically we're looping inside [1], and when we hit [2] os-collect-config gets
SIGPIPE.

Now to reproduce you just have to make sure that "collectors = local" is set
in the os-collect-config configuration and that
/var/lib/os-collect-config/local-data is missing.

[1] https://github.com/openstack/os-collect-config/blob/master/os_collect_config/collect.py#L158..L181
[2] https://github.com/openstack/os-collect-config/blob/master/os_collect_config/local.py#L66
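For illustration, the loop in [1] behaves roughly like this simplified sketch
(hypothetical code, not the real collect.py): the EPIPE raised while trying to
log is swallowed like any other collector error, so the process sleeps and
retries forever and heat never gets its answer:

  import sys
  import time

  def collect_local():
      # stand-in for the local.py collector at [2]: with journald gone,
      # this very write is what raises OSError(EPIPE)
      sys.stderr.write("No local metadata found\n")
      raise RuntimeError("no local metadata")

  while True:
      try:
          collect_local()
      except Exception:
          pass            # the error is "logged" -- which EPIPEs again
      time.sleep(30)      # the select(0, NULL, NULL, NULL, {30, 0})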
Based on the findings in comment #2, I'm moving this to dfg:cloudapp and renaming the ticket.
The versions are:

- systemd-219-57.el7_5.1.x86_64 (from Red Hat Enterprise Linux Server release 7.5 (Maipo))
- os-collect-config-5.2.1-2.el7ost.noarch

os-collect-config didn't change much, so maybe this is reproducible on master.
I followed Sofer's steps on master. I confirm the issue exists on master.
That looks like https://bugs.freedesktop.org/show_bug.cgi?id=84923, it's not specific at all to occ.
Hi Thomas,

(In reply to Thomas Hervé from comment #6)
> That looks like https://bugs.freedesktop.org/show_bug.cgi?id=84923, it's not
> specific at all to occ.

Well, specific maybe not, but the way occ reacts to this SIGPIPE is not the
best. Silently failing and blocking any further action until someone straces
the process or blindly restarts it is bad user experience.

First, a long-running process shouldn't really depend on stdout/stdin/stderr;
it should rather close them all, run from "/" and so on [1]. But even if the
current design doesn't allow for this, we should catch that SIGPIPE and do
something about it (see the sketch below):

1. fail hard (quit the "while true" loop with a nice error message)
2. re-open stdin/stdout/stderr if possible
3. something else user friendly ...

My 2 cents :)

[1] https://www.systutorials.com/docs/linux/man/1-daemonize/
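To illustrate option 1, a minimal sketch (assuming a collect-style main loop;
run_collectors is a hypothetical stand-in, not a real function): detect the
broken pipe instead of swallowing it, and exit non-zero so systemd's
on-failure restart can kick in:

  import errno
  import sys

  def run_collectors():
      # hypothetical stand-in for one iteration of the occ main loop
      sys.stderr.write("No local metadata found\n")

  try:
      run_collectors()
  except (IOError, OSError) as exc:
      if exc.errno == errno.EPIPE:
          # stdout/stderr are gone; logging would just EPIPE again,
          # so exit non-zero and let systemd restart the service
          sys.exit(1)
      raise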
I found the same problem in RHOSP 11. This is quite ugly, as it blocks the
upgrade/update: os-collect-config is hung, and unless it gets restarted there
is no way to go on.

As the solution for this doesn't seem trivial, and it might impact other
customers: why don't we include a task in the upgrade/update_tasks that checks
how long systemd-journald and os-collect-config have been running, and if
systemd-journald has been running for less time than os-collect-config,
restarts os-collect-config? (A rough sketch of such a check follows.)
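A rough sketch of what such a check could look like (hypothetical: the unit
names, the MainPID lookup and the /proc parsing are my assumptions, not an
existing update_task):

  import subprocess

  def start_ticks(unit):
      # start time of the unit's main process, in clock ticks since boot:
      # field 22 of /proc/<pid>/stat (index 19 after the comm field's ')')
      out = subprocess.check_output(['systemctl', 'show', '-p', 'MainPID', unit])
      pid = out.decode().strip().split('=', 1)[1]
      with open('/proc/%s/stat' % pid) as stat:
          return int(stat.read().rsplit(') ', 1)[1].split()[19])

  # journald started after os-collect-config: occ's stdout/stderr point at
  # a socket nobody reads any more, so bounce occ to get fresh ones
  if start_ticks('systemd-journald') > start_ticks('os-collect-config'):
      subprocess.check_call(['systemctl', 'restart', 'os-collect-config'])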
Hey Ben, are there any updates on this? IMHO this issue can harm upgrades and
updates very much, as os-collect-config hangs and the upgrade/update times out.
Okay, did some more investigation into this. There are two problems
contributing to this:

1) Python sets SIGPIPE to SIG_IGN by default. That doesn't allow us to handle
the signal properly. If we fix that, then we hit:

2) The os-collect-config service is set to restart only on-failure. Per the
systemd docs, "If set to on-failure, the service will be restarted when the
process exits with a non-zero exit code, is terminated by a signal (including
on core dump, but ***excluding the aforementioned four signals***)". The
aforementioned four signals are SIGHUP, SIGINT, SIGTERM or SIGPIPE, which
means the process dies and isn't restarted.

So I think the thing to do is to tweak the SIGPIPE handling in
os-collect-config so the signal kills the process, and to tweak the service so
systemd will restart it when that happens (see the sketch below). I'll start
on that tomorrow.
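In code terms, the shape of that fix is something like this (a sketch of the
approach described above, not the merged patch):

  import signal

  # undo Python's default SIG_IGN so the next write to the vanished
  # stdout/stderr terminates the process with SIGPIPE...
  signal.signal(signal.SIGPIPE, signal.SIG_DFL)

  # ...and, since Restart=on-failure explicitly excludes SIGPIPE, the unit
  # needs a matching tweak so systemd brings the service back, e.g.:
  #
  #   [Service]
  #   Restart=always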
Okay, patches are proposed to master. If/when we get agreement on the fix we'll need to backport them to all the affected branches (so, all of them?).
Probably yes: we observed the issue at least in OSP 10 and OSP 11, and if the
code did not change in the newer releases it will probably impact all of them.
Thanks a lot for spending time on this.
VERIFIED

http://pastebin.test.redhat.com/687508
http://pastebin.test.redhat.com/687507

os-collect-config-5.2.1-3.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0075