Bug 1814174 - haproxy isn't spawned in time to serve instances' requests
Summary: haproxy isn't spawned in time to serve instances' requests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z2
Target Release: 16.0 (Train on RHEL 8.1)
Assignee: Jakub Libosvar
QA Contact: Roman Safronov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-17 09:29 UTC by Michele Baldessari
Modified: 2020-05-14 12:16 UTC
CC List: 7 users

Fixed In Version: tripleo-ansible-0.4.2-0.20200321012305.546bc0d.el8ost.noarch.rpm
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-14 12:16:18 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1868125 0 None None None 2020-03-20 09:11:44 UTC
OpenStack gerrit 713852 0 None MERGED Filter out wrapper commands from the ps output 2020-11-10 09:55:40 UTC
OpenStack gerrit 713956 0 None MERGED [ovn] Stricter matching on metadata port binding event 2020-11-10 09:55:59 UTC
Red Hat Product Errata RHBA-2020:2114 0 None None None 2020-05-14 12:16:36 UTC

Description Michele Baldessari 2020-03-17 09:29:08 UTC
Description of problem:
Seen the following error:
[tempest.scenario.manager] Public network connectivity check failed
ConnectionResetError: [Errno 104] Connection reset by peer
tempest.lib.exceptions.SSHTimeout: Connection to the 10.0.0.244 via SSH timed out.

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/phase1-16_director-rhel-8.1-virthost-1cont_1comp_1ceph-ipv4-geneve-ceph/121/console

Tempest logs are here:
http://cougar11.scl.lab.tlv.redhat.com/phase1-16_director-rhel-8.1-virthost-1cont_1comp_1ceph-ipv4-geneve-ceph/121/undercloud-0.tar.gz?undercloud-0/home/stack/tempest-dir/tempest.log


Version-Release number of selected component (if applicable):
RHOS_TRUNK-16.0-RHEL-8-20200316.n.1

Comment 1 Michele Baldessari 2020-03-17 10:04:33 UTC
I see a bunch of these in the logs at c#1:
╰─$ grep " ERROR " compute-0/var/log/containers/neutron/ovn-metadata-agent.log 
2020-03-16 19:36:46.827 29393 DEBUG networking_ovn.agent.metadata_agent [-] logging_exception_prefix       = %(asctime)s.%(msecs)03d %(process)d ERROR %(name)s %(instance)s log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2581
2020-03-16 19:36:48.487 29393 DEBUG oslo_service.service [-] logging_exception_prefix       = %(asctime)s.%(msecs)03d %(process)d ERROR %(name)s %(instance)s log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2581
2020-03-16 19:50:46.895 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid f7d36c66-a00c-4da3-9bd7-e604892e5a55 not found. The process should not have died
2020-03-16 19:50:47.230 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid 1d59fe65-eca7-4341-95a1-8fd2dad007ea not found. The process should not have died
2020-03-16 19:51:46.896 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid f7d36c66-a00c-4da3-9bd7-e604892e5a55 not found. The process should not have died
2020-03-16 19:51:47.181 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid 1d59fe65-eca7-4341-95a1-8fd2dad007ea not found. The process should not have died
2020-03-16 19:52:46.897 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid 1d59fe65-eca7-4341-95a1-8fd2dad007ea not found. The process should not have died
2020-03-16 19:56:46.902 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid a2f93fc3-fd52-44de-bbd5-995479c46c2d not found. The process should not have died
2020-03-16 19:57:46.903 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid a2f93fc3-fd52-44de-bbd5-995479c46c2d not found. The process should not have died
2020-03-16 19:58:46.906 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid a2f93fc3-fd52-44de-bbd5-995479c46c2d not found. The process should not have died
2020-03-16 20:05:46.915 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid 1924594a-09b2-4443-81f5-20ebba148624 not found. The process should not have died
2020-03-16 20:11:46.922 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid b060899a-36c2-46de-a008-15ff982dc155 not found. The process should not have died
2020-03-16 20:12:46.924 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid b060899a-36c2-46de-a008-15ff982dc155 not found. The process should not have died
2020-03-16 20:13:46.926 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid b060899a-36c2-46de-a008-15ff982dc155 not found. The process should not have died
2020-03-16 20:14:46.928 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid b060899a-36c2-46de-a008-15ff982dc155 not found. The process should not have died
2020-03-16 20:17:46.933 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid bee6e682-997b-4401-925f-a2d85613d93f not found. The process should not have died
2020-03-16 20:18:46.934 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid bee6e682-997b-4401-925f-a2d85613d93f not found. The process should not have died

Comment 2 Bernard Cafarelli 2020-03-17 14:15:15 UTC
If it is the same root cause as bug #1812630, there is https://review.opendev.org/#/c/713395/

Comment 3 Michele Baldessari 2020-03-18 11:36:46 UTC
So I tried a deployment with https://review.opendev.org/#/c/713395/ applied and I can no longer see any ERRORS around 'process should not have died':
(undercloud) [stack@undercloud-0 ~]$ for i in 37 30; do ssh 192.168.24.$i "sudo grep ERROR /var/log/containers/neutron/ovn-metadata-agent.log |grep -v NLMSG_ERROR"; done
Warning: Permanently added '192.168.24.37' (ECDSA) to the list of known hosts.
2020-03-18 11:19:13.724 25101 DEBUG networking_ovn.agent.metadata_agent [-] logging_exception_prefix       = %(asctime)s.%(msecs)03d %(process)d ERROR %(name)s %(instance)s log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2581
2020-03-18 11:19:16.622 25101 DEBUG oslo_service.service [-] logging_exception_prefix       = %(asctime)s.%(msecs)03d %(process)d ERROR %(name)s %(instance)s log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2581
Warning: Permanently added '192.168.24.30' (ECDSA) to the list of known hosts.
2020-03-18 11:19:13.698 24976 DEBUG networking_ovn.agent.metadata_agent [-] logging_exception_prefix       = %(asctime)s.%(msecs)03d %(process)d ERROR %(name)s %(instance)s log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2581
2020-03-18 11:19:16.459 24976 DEBUG oslo_service.service [-] logging_exception_prefix       = %(asctime)s.%(msecs)03d %(process)d ERROR %(name)s %(instance)s log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2581

(undercloud) [stack@undercloud-0 ~]$ for i in 37 30; do ssh 192.168.24.$i "sudo grep died /var/log/containers/neutron/ovn-metadata-agent.log "; done
Warning: Permanently added '192.168.24.37' (ECDSA) to the list of known hosts.
Warning: Permanently added '192.168.24.30' (ECDSA) to the list of known hosts.
(undercloud) [stack@undercloud-0 ~]$ 

So I'll push that patch upstream.

Comment 4 Daniel Alvarez Sanchez 2020-03-18 15:59:11 UTC
I did the same analysis, which leads to the same conclusion: haproxy fails to start. In that window the instance has booted but failed to fetch metadata, hence the public-key auth fails.


- The port gets bound to the hypervisor at 19:50:14, so the VM boots and the metadata agent tries to spawn haproxy:

3360 2020-03-16 19:50:14.652 29393 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/1d59fe65-eca7-4341-95a1-8fd2dad007ea.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261

3389 2020-03-16 19:50:14.653 29393 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'ovnmeta-1d59fe65-eca7-4341-95a1-8fd2dad007ea', 'haproxy', '-f', '/var/lib/neutron/ovn-metadata-proxy/1d59fe65-eca7-4341-95a1-8fd2dad007ea.conf'] create_process /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:87
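For reference, a minimal sketch of what this spawn step amounts to (illustrative only, not the actual neutron code; the function name is made up, the command simply mirrors the log line above):

    # Illustrative sketch only -- not the actual neutron implementation.
    import subprocess

    def spawn_metadata_haproxy(network_id):
        """Launch haproxy for one network inside its ovnmeta- namespace."""
        netns = "ovnmeta-%s" % network_id
        conf = "/var/lib/neutron/ovn-metadata-proxy/%s.conf" % network_id
        cmd = ["sudo", "neutron-rootwrap", "/etc/neutron/rootwrap.conf",
               "ip", "netns", "exec", netns, "haproxy", "-f", conf]
        # haproxy daemonizes and is expected to write
        # /var/lib/neutron/external/pids/<network_id>.pid.haproxy,
        # which the process monitor polls afterwards.
        subprocess.check_call(cmd)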



- As per the 'messages' log, it looks like haproxy is not started until 19:52:47 (if it even starts, I can't tell), so the 20 retries from the Cirros image have already expired.


43244 Mar 16 19:52:47 compute-0 sync[40402]: + podman run --detach -v /var/lib/config-data/puppet-generated/neutron:/etc/neutron:ro -v /run/netns:/run/netns:shared -v /var/lib/neutron:/var/lib/neutron:z,shared -v /dev/log:/dev/log --log-driver k8s-file --log-opt path=/var/log/containers/stdouts/ovn_metadata_haproxy-ovnmeta-1d59fe65-eca7-4341-95a1-8fd2dad007ea.log --net host --pid host --privileged -u root --name ovn_metadata_haproxy-ovnmeta-1d59fe65-eca7-4341-95a1-8fd2dad007ea undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200316.1 ip netns exec ovnmeta-1d59fe65-eca7-4341-95a1-8fd2dad007ea /var/lib/neutron/ovn_metadata_haproxy/command -f /var/lib/neutron/ovn-metadata-proxy/1d59fe65-eca7-4341-95a1-8fd2dad007ea.conf
43245 Mar 16 19:52:47 compute-0 podman[40439]: 2020-03-16 19:52:47.445949007 +0000 UTC m=+0.069512561 container create acdf2702c73627565b358e0a7cc864a7ed3fa21a49d3a83017d141a820a87824 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200316.1, name=ovn_metadata_haproxy-ovnmeta-1d59fe65-eca7-4341-95a1-8fd2dad007ea)
43246 Mar 16 19:52:47 compute-0 systemd[1]: Started libpod-conmon-acdf2702c73627565b358e0a7cc864a7ed3fa21a49d3a83017d141a820a87824.scope.
43247 Mar 16 19:52:47 compute-0 systemd[1]: Started libcontainer container acdf2702c73627565b358e0a7cc864a7ed3fa21a49d3a83017d141a820a87824.
43248 Mar 16 19:52:47 compute-0 kernel: SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
43249 Mar 16 19:52:47 compute-0 podman[40439]: 2020-03-16 19:52:47.559238983 +0000 UTC m=+0.182802564 container init acdf2702c73627565b358e0a7cc864a7ed3fa21a49d3a83017d141a820a87824 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200316.1, name=ovn_metadata_haproxy-ovnmeta-1d59fe65-eca7-4341-95a1-8fd2dad007ea)
43250 Mar 16 19:52:47 compute-0 podman[40439]: 2020-03-16 19:52:47.578606151 +0000 UTC m=+0.202169699 container start acdf2702c73627565b358e0a7cc864a7ed3fa21a49d3a83017d141a820a87824 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200316.1, name=ovn_metadata_haproxy-ovnmeta-1d59fe65-eca7-4341-95a1-8fd2dad007ea)
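The 'podman run' above is issued by the sync side of the TripleO systemd wrapper, not by the agent directly: roughly, the 'haproxy' command the agent executes is a wrapper that records the requested command under /var/lib/neutron/ovn_metadata_haproxy/, and a flock-serialized sync loop later starts the real container (the sync script and its flock are discussed in comment 7). A rough Python illustration of that split, where the lock path, command-file layout, image name and poll interval are all assumptions:

    # Rough illustration of the wrapper/sync split -- not the real
    # tripleo-ansible shell templates. Lock path, command-file layout,
    # image name and poll interval are assumptions.
    import fcntl
    import glob
    import os
    import subprocess
    import time

    CMD_DIR = "/var/lib/neutron/ovn_metadata_haproxy"          # assumed layout
    LOCK_PATH = os.path.join(CMD_DIR, "sync.lock")             # assumed path
    IMAGE = "rhosp16-openstack-neutron-metadata-agent-ovn:20200316.1"  # placeholder

    def wrapper(requested_cmd):
        """Stand-in for the wrapped 'haproxy': record what should run, don't run it."""
        with open(os.path.join(CMD_DIR, "command"), "w") as f:
            f.write(" ".join(requested_cmd))

    def sync_once():
        """Stand-in for one pass of the sync service: podman-run the recorded command."""
        with open(LOCK_PATH, "w") as lock:
            fcntl.flock(lock, fcntl.LOCK_EX)   # the lock comment 7 saw held forever
            for cmd_file in glob.glob(os.path.join(CMD_DIR, "command*")):
                with open(cmd_file) as f:
                    cmd = f.read().split()
                subprocess.check_call(["podman", "run", "--detach",
                                       "--net", "host", "--pid", "host",
                                       IMAGE] + cmd)

    def sync_loop(interval=60):
        while True:
            sync_once()
            time.sleep(interval)   # nothing starts between passes, hence a visible delay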


During this time the metadata agent's process monitor detects that haproxy is not running, but the respawn doesn't work.

2020-03-16 19:50:47.230 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid 1d59fe65-eca7-4341-95a1-8fd2dad007ea not found. The process should not have died
2020-03-16 19:51:47.181 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid 1d59fe65-eca7-4341-95a1-8fd2dad007ea not found. The process should not have died
2020-03-16 19:52:46.897 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid 1d59fe65-eca7-4341-95a1-8fd2dad007ea not found. The process should not have died
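Those errors come from the periodic child-process check: the monitor reads the pid file, looks for a matching live process, and when it finds neither it logs the error and attempts a respawn. A condensed sketch of that kind of check (illustrative only, not the actual neutron.agent.linux.external_process code; the function names are made up):

    # Condensed illustration of a pid-file liveness check -- not the
    # actual neutron.agent.linux.external_process implementation.
    import os

    PID_DIR = "/var/lib/neutron/external/pids"

    def is_active(uuid):
        """True if the haproxy recorded for this uuid is still running."""
        pid_file = os.path.join(PID_DIR, "%s.pid.haproxy" % uuid)
        try:
            with open(pid_file) as f:
                pid = int(f.read().strip())
        except (OSError, ValueError):
            return False        # "Unable to access ..." in the agent log
        try:
            with open("/proc/%d/cmdline" % pid) as f:
                cmdline = f.read()
        except OSError:
            return False
        return "haproxy" in cmdline

    def check_and_respawn(uuid, spawn):
        if not is_active(uuid):
            # Logged as "metadata-proxy for metadata with uuid ... not found.
            # The process should not have died", followed by a respawn attempt.
            spawn(uuid)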

Comment 5 Daniel Alvarez Sanchez 2020-03-18 16:04:56 UTC
My questions are:

1) Why is haproxy starting 2 minutes later than it should?
2) Will https://review.opendev.org/#/c/713395/ address it?

The patch makes sense for the SIGHUP case, but does it do anything about the initial start?

Comment 6 Michele Baldessari 2020-03-18 16:54:38 UTC
(In reply to Daniel Alvarez Sanchez from comment #5)
> My questions are:
> 
> 1) Why is haproxy starting 2 minutes later than it should?
> 2) Will https://review.opendev.org/#/c/713395/ address it?
> 
> The patch makes sense for the SIGHUP case, but does it do anything about the
> initial start?

So I reran yet another test and I still see those 'process should not have died' messages. I'm not sure why they disappeared in the other runs, so no, I don't think https://review.opendev.org/#/c/713395/ will fix the root cause of this problem (we should push for it anyway as it removes a useless bash process, but that is about it).

I think I can somewhat reproduce parts of the problem. Right now in my env I see:
2020-03-18 16:44:07.080 25037 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'ovnmeta-bc2c354b-e2e6-4c20-9d4c-d80e32b2a53a', 'haproxy', '-f', '/var/lib/neutron/ovn-metadata-proxy/bc2c354b-e2e6-4c20-9d4c-d80e32b2a53a.conf'] create_process /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:87
2020-03-18 16:44:07.213 25037 DEBUG oslo_concurrency.lockutils [-] Lock "chassis_metadata_lock" acquired by "networking_ovn.agent.metadata.agent.MetadataAgent.update_chassis_metadata_networks" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2020-03-18 16:44:07.214 25037 DEBUG oslo_concurrency.lockutils [-] Lock "chassis_metadata_lock" released by "networking_ovn.agent.metadata.agent.MetadataAgent.update_chassis_metadata_networks" :: held 0.001s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2020-03-18 16:44:07.410 25037 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" released by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: held 0.332s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2020-03-18 16:44:40.052 25037 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: SbGlobalUpdateEvent(events=('update',), table='SB_Global', conditions=None, old_conditions=None) to row=SB_Global(ipsec=False, ssl=[], nb_cfg=98, options={'mac_prefix': '2a:c2:55', 'svc_monitor_mac': '2a:43:05:87:da:87'}, external_ids={}) old=SB_Global(nb_cfg=97) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2020-03-18 16:44:40.054 25037 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): UpdateChassisExtIdsCommand(name=a18b22a8-90ce-4b58-a61e-5b99e185e2ec, external_ids={'neutron:ovn-metadata-sb-cfg': '98'}, if_exists=True) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2020-03-18 16:45:07.078 25037 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2020-03-18 16:45:07.079 25037 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/bc2c354b-e2e6-4c20-9d4c-d80e32b2a53a.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2020-03-18 16:45:07.079 25037 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid bc2c354b-e2e6-4c20-9d4c-d80e32b2a53a not found. The process should not have died
2020-03-18 16:45:07.080 25037 WARNING neutron.agent.linux.external_process [-] Respawning metadata-proxy for uuid bc2c354b-e2e6-4c20-9d4c-d80e32b2a53a
2020-03-18 16:45:07.080 25037 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/bc2c354b-e2e6-4c20-9d4c-d80e32b2a53a.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261

I'll keep the env around so we can debug more tomorrow

Comment 7 Daniel Alvarez Sanchez 2020-03-19 11:32:43 UTC
Kuba and I have been looking into it today and found the answer to 1), i.e. why haproxy doesn't start until about two minutes later.

Full description is on the LP bug [0] and we have a patch up for review [1].

As a side note, and I think it's not *strictly* related to this issue but it looks like an issue anyway: we've seen the sync script hanging forever on the flock [2]. The 'lslocks' output only shows the sync script holding the lock on that file and nothing else, but things didn't work properly until we restarted the systemd unit (and killed the process).

[0] https://bugs.launchpad.net/tripleo/+bug/1868082
[1] https://review.opendev.org/713852
[2] https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/roles/tripleo_systemd_wrapper/templates/service_sync.j2#L37


PS. The issue was introduced by this patch 4 weeks ago: https://review.opendev.org/#/c/706381/
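For what it's worth, the merged fix [1] is summarized upstream as 'Filter out wrapper commands from the ps output', i.e. make sure wrapper/sync invocations that appear in the process list are not mistaken for the real service. A toy Python illustration of that idea (the marker strings are assumptions, not taken from the actual patch):

    # Toy illustration of filtering wrapper commands out of a process listing
    # before matching the real service -- not the actual tripleo-ansible change.
    WRAPPER_MARKERS = (
        "/var/lib/neutron/ovn_metadata_haproxy/command",  # the wrapper command file
        "service_sync",                                   # the sync script itself
    )

    def real_haproxy_lines(ps_lines):
        """Keep only ps lines that describe a genuine haproxy process."""
        return [line for line in ps_lines
                if "haproxy" in line
                and not any(marker in line for marker in WRAPPER_MARKERS)]

    # The wrapper's own invocation is dropped, the real daemon is kept:
    sample = [
        "root 40402 ... sh /var/lib/neutron/ovn_metadata_haproxy/command -f x.conf",
        "42454 41001 ... haproxy -f /var/lib/neutron/ovn-metadata-proxy/x.conf",
    ]
    assert len(real_haproxy_lines(sample)) == 1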

Comment 13 Roman Safronov 2020-04-19 11:34:58 UTC
Verified on 16.0-RHEL-8/RHOS_TRUNK-16.0-RHEL-8-20200406.n.1 with tripleo-ansible-0.4.2-0.20200404124614.67005aa.el8ost.noarch

Verified that haproxy and the metadata service are running and available to serve instances' requests.
Spawned several instances and all of them were able to get metadata.
No errors about a missing metadata-proxy were found in the metadata agent logs.
Also tried rebooting the compute nodes and starting instances after the reboot. Haproxy and the metadata service were running, and existing instances were able to boot and get metadata. New instances could also be spawned and got metadata.

Comment 17 errata-xmlrpc 2020-05-14 12:16:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2114

