Description of problem:
Saw the following errors:

[tempest.scenario.manager] Public network connectivity check failed
ConnectionResetError: [Errno 104] Connection reset by peer
tempest.lib.exceptions.SSHTimeout: Connection to the 10.0.0.244 via SSH timed out.

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/phase1-16_director-rhel-8.1-virthost-1cont_1comp_1ceph-ipv4-geneve-ceph/121/console

Tempest logs are here:
http://cougar11.scl.lab.tlv.redhat.com/phase1-16_director-rhel-8.1-virthost-1cont_1comp_1ceph-ipv4-geneve-ceph/121/undercloud-0.tar.gz?undercloud-0/home/stack/tempest-dir/tempest.log

Version-Release number of selected component (if applicable):
RHOS_TRUNK-16.0-RHEL-8-20200316.n.1
I see a bunch of these in the logs at c#1:

╰─$ grep " ERROR " compute-0/var/log/containers/neutron/ovn-metadata-agent.log
2020-03-16 19:36:46.827 29393 DEBUG networking_ovn.agent.metadata_agent [-] logging_exception_prefix = %(asctime)s.%(msecs)03d %(process)d ERROR %(name)s %(instance)s log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2581
2020-03-16 19:36:48.487 29393 DEBUG oslo_service.service [-] logging_exception_prefix = %(asctime)s.%(msecs)03d %(process)d ERROR %(name)s %(instance)s log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2581
2020-03-16 19:50:46.895 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid f7d36c66-a00c-4da3-9bd7-e604892e5a55 not found. The process should not have died
2020-03-16 19:50:47.230 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid 1d59fe65-eca7-4341-95a1-8fd2dad007ea not found. The process should not have died
2020-03-16 19:51:46.896 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid f7d36c66-a00c-4da3-9bd7-e604892e5a55 not found. The process should not have died
2020-03-16 19:51:47.181 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid 1d59fe65-eca7-4341-95a1-8fd2dad007ea not found. The process should not have died
2020-03-16 19:52:46.897 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid 1d59fe65-eca7-4341-95a1-8fd2dad007ea not found. The process should not have died
2020-03-16 19:56:46.902 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid a2f93fc3-fd52-44de-bbd5-995479c46c2d not found. The process should not have died
2020-03-16 19:57:46.903 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid a2f93fc3-fd52-44de-bbd5-995479c46c2d not found. The process should not have died
2020-03-16 19:58:46.906 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid a2f93fc3-fd52-44de-bbd5-995479c46c2d not found. The process should not have died
2020-03-16 20:05:46.915 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid 1924594a-09b2-4443-81f5-20ebba148624 not found. The process should not have died
2020-03-16 20:11:46.922 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid b060899a-36c2-46de-a008-15ff982dc155 not found. The process should not have died
2020-03-16 20:12:46.924 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid b060899a-36c2-46de-a008-15ff982dc155 not found. The process should not have died
2020-03-16 20:13:46.926 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid b060899a-36c2-46de-a008-15ff982dc155 not found. The process should not have died
2020-03-16 20:14:46.928 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid b060899a-36c2-46de-a008-15ff982dc155 not found. The process should not have died
2020-03-16 20:17:46.933 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid bee6e682-997b-4401-925f-a2d85613d93f not found. The process should not have died
2020-03-16 20:18:46.934 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid bee6e682-997b-4401-925f-a2d85613d93f not found. The process should not have died
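A quick way to triage repeated failures like these is to count the "should not have died" errors per metadata namespace uuid. A sketch of that triage pipeline follows; a tiny sample log is generated inline so the snippet is self-contained, but on a real compute node you would point it at /var/log/containers/neutron/ovn-metadata-agent.log instead:

```shell
# Count "should not have died" errors per metadata uuid.
# The sample log below is illustrative, generated only for this sketch.
log=$(mktemp)
cat > "$log" <<'EOF'
2020-03-16 19:50:46.895 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid f7d36c66-a00c-4da3-9bd7-e604892e5a55 not found. The process should not have died
2020-03-16 19:51:46.896 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid f7d36c66-a00c-4da3-9bd7-e604892e5a55 not found. The process should not have died
2020-03-16 19:50:47.230 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid 1d59fe65-eca7-4341-95a1-8fd2dad007ea not found. The process should not have died
EOF
summary=$(grep 'should not have died' "$log" \
  | grep -oE 'uuid [0-9a-f-]+' | sort | uniq -c | sort -rn)
echo "$summary"
rm -f "$log"
```

This makes it obvious at a glance which proxies are repeatedly failing their liveness check and for how long.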
If it is the same root cause as bug #1812630, there is https://review.opendev.org/#/c/713395/
So I tried a deployment with https://review.opendev.org/#/c/713395/ applied and I can no longer see any ERRORs around 'process should not have died':

(undercloud) [stack@undercloud-0 ~]$ for i in 37 30; do ssh 192.168.24.$i "sudo grep ERROR /var/log/containers/neutron/ovn-metadata-agent.log |grep -v NLMSG_ERROR"; done
Warning: Permanently added '192.168.24.37' (ECDSA) to the list of known hosts.
2020-03-18 11:19:13.724 25101 DEBUG networking_ovn.agent.metadata_agent [-] logging_exception_prefix = %(asctime)s.%(msecs)03d %(process)d ERROR %(name)s %(instance)s log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2581
2020-03-18 11:19:16.622 25101 DEBUG oslo_service.service [-] logging_exception_prefix = %(asctime)s.%(msecs)03d %(process)d ERROR %(name)s %(instance)s log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2581
Warning: Permanently added '192.168.24.30' (ECDSA) to the list of known hosts.
2020-03-18 11:19:13.698 24976 DEBUG networking_ovn.agent.metadata_agent [-] logging_exception_prefix = %(asctime)s.%(msecs)03d %(process)d ERROR %(name)s %(instance)s log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2581
2020-03-18 11:19:16.459 24976 DEBUG oslo_service.service [-] logging_exception_prefix = %(asctime)s.%(msecs)03d %(process)d ERROR %(name)s %(instance)s log_opt_values /usr/lib/python3.6/site-packages/oslo_config/cfg.py:2581
(undercloud) [stack@undercloud-0 ~]$ for i in 37 30; do ssh 192.168.24.$i "sudo grep died /var/log/containers/neutron/ovn-metadata-agent.log "; done
Warning: Permanently added '192.168.24.37' (ECDSA) to the list of known hosts.
Warning: Permanently added '192.168.24.30' (ECDSA) to the list of known hosts.
(undercloud) [stack@undercloud-0 ~]$

So I'll push that patch upstream.
I did the same analysis, which leads to the same conclusion: haproxy fails to start. In the meantime, the instance has booted but failed to fetch metadata, hence the public-key auth fails.

- The port gets bound to the hypervisor at 19:50:14, so the VM boots and the metadata agent tries to spawn haproxy:

3360 2020-03-16 19:50:14.652 29393 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/1d59fe65-eca7-4341-95a1-8fd2dad007ea.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
3389 2020-03-16 19:50:14.653 29393 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'ovnmeta-1d59fe65-eca7-4341-95a1-8fd2dad007ea', 'haproxy', '-f', '/var/lib/neutron/ovn-metadata-proxy/1d59fe65-eca7-4341-95a1-8fd2dad007ea.conf'] create_process /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:87

- As per the 'messages' log, it looks like haproxy is not started until 19:52:47 (if it even starts, I can't tell), so the 20 retries from the Cirros image have already expired:

43244 Mar 16 19:52:47 compute-0 sync[40402]: + podman run --detach -v /var/lib/config-data/puppet-generated/neutron:/etc/neutron:ro -v /run/netns:/run/netns:shared -v /var/lib/neutron:/var/lib/neutron:z,shared -v /dev/log:/dev/log --log-driver k8s-file --log-opt path=/var/log/containers/stdouts/ovn_metadata_haproxy-ovnmeta-1d59fe65-eca7-4341-95a1-8fd2dad007ea.log --net host --pid host --privileged -u root --name ovn_metadata_haproxy-ovnmeta-1d59fe65-eca7-4341-95a1-8fd2dad007ea undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200316.1 ip netns exec ovnmeta-1d59fe65-eca7-4341-95a1-8fd2dad007ea /var/lib/neutron/ovn_metadata_haproxy/command -f /var/lib/neutron/ovn-metadata-proxy/1d59fe65-eca7-4341-95a1-8fd2dad007ea.conf
43245 Mar 16 19:52:47 compute-0 podman[40439]: 2020-03-16 19:52:47.445949007 +0000 UTC m=+0.069512561 container create acdf2702c73627565b358e0a7cc864a7ed3fa21a49d3a83017d141a820a87824 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200316.1, name=ovn_metadata_haproxy-ovnmeta-1d59fe65-eca7-4341-95a1-8fd2dad007ea)
43246 Mar 16 19:52:47 compute-0 systemd[1]: Started libpod-conmon-acdf2702c73627565b358e0a7cc864a7ed3fa21a49d3a83017d141a820a87824.scope.
43247 Mar 16 19:52:47 compute-0 systemd[1]: Started libcontainer container acdf2702c73627565b358e0a7cc864a7ed3fa21a49d3a83017d141a820a87824.
43248 Mar 16 19:52:47 compute-0 kernel: SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
43249 Mar 16 19:52:47 compute-0 podman[40439]: 2020-03-16 19:52:47.559238983 +0000 UTC m=+0.182802564 container init acdf2702c73627565b358e0a7cc864a7ed3fa21a49d3a83017d141a820a87824 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200316.1, name=ovn_metadata_haproxy-ovnmeta-1d59fe65-eca7-4341-95a1-8fd2dad007ea)
43250 Mar 16 19:52:47 compute-0 podman[40439]: 2020-03-16 19:52:47.578606151 +0000 UTC m=+0.202169699 container start acdf2702c73627565b358e0a7cc864a7ed3fa21a49d3a83017d141a820a87824 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200316.1, name=ovn_metadata_haproxy-ovnmeta-1d59fe65-eca7-4341-95a1-8fd2dad007ea)

During this time, the metadata agent's process monitor detects that haproxy is not running, but the respawn doesn't work:

2020-03-16 19:50:47.230 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid 1d59fe65-eca7-4341-95a1-8fd2dad007ea not found. The process should not have died
2020-03-16 19:51:47.181 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid 1d59fe65-eca7-4341-95a1-8fd2dad007ea not found. The process should not have died
2020-03-16 19:52:46.897 29393 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid 1d59fe65-eca7-4341-95a1-8fd2dad007ea not found. The process should not have died
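For context on those 'not found' errors: the external_process monitor treats a proxy as dead when its pid file is missing or the recorded pid is not running, so while haproxy is stuck waiting to start, every check fails. A minimal shell sketch of that liveness test (illustrative only, not the actual neutron code):

```shell
# A proxy counts as "alive" only if its pid file exists and the recorded
# pid refers to a running process; otherwise the monitor logs the error.
proxy_alive() {
  pidfile="$1"
  [ -f "$pidfile" ] || return 1            # no pid file yet -> "not found"
  kill -0 "$(cat "$pidfile")" 2>/dev/null  # signal 0 only checks existence
}

pidfile=$(mktemp)
echo $$ > "$pidfile"                       # our own shell pid: running
proxy_alive "$pidfile" && echo "alive"
rm -f "$pidfile"
proxy_alive "$pidfile" || echo "not found" # missing pid file, as in the logs
```

This is why the errors above appear immediately: haproxy's pid file under /var/lib/neutron/external/pids/ is never written while the container start is delayed, so every monitor pass sees "not found".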
My questions are:

1) Why is haproxy starting 2 minutes later than it should?
2) Will https://review.opendev.org/#/c/713395/ address it?

The patch makes sense for the SIGHUP case, but does it do anything about the initial start?
(In reply to Daniel Alvarez Sanchez from comment #5)
> My questions are:
>
> 1) Why is haproxy starting 2 minutes later than it should?
> 2) Will https://review.opendev.org/#/c/713395/ address it?
>
> The patch makes sense for the SIGHUP case but does it do something about the
> starting?

So I reran yet another test and I still see those 'process should not have died' messages. Not sure why they disappeared in other runs, so no, I don't think https://review.opendev.org/#/c/713395/ will fix the root cause of this problem (we should push for it anyway, as it will remove a useless bash process, but that is about it).

I think I can somewhat reproduce parts of the problem. Right now in my env I see:

2020-03-18 16:44:07.080 25037 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'ovnmeta-bc2c354b-e2e6-4c20-9d4c-d80e32b2a53a', 'haproxy', '-f', '/var/lib/neutron/ovn-metadata-proxy/bc2c354b-e2e6-4c20-9d4c-d80e32b2a53a.conf'] create_process /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:87
2020-03-18 16:44:07.213 25037 DEBUG oslo_concurrency.lockutils [-] Lock "chassis_metadata_lock" acquired by "networking_ovn.agent.metadata.agent.MetadataAgent.update_chassis_metadata_networks" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2020-03-18 16:44:07.214 25037 DEBUG oslo_concurrency.lockutils [-] Lock "chassis_metadata_lock" released by "networking_ovn.agent.metadata.agent.MetadataAgent.update_chassis_metadata_networks" :: held 0.001s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2020-03-18 16:44:07.410 25037 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" released by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: held 0.332s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2020-03-18 16:44:40.052 25037 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: SbGlobalUpdateEvent(events=('update',), table='SB_Global', conditions=None, old_conditions=None) to row=SB_Global(ipsec=False, ssl=[], nb_cfg=98, options={'mac_prefix': '2a:c2:55', 'svc_monitor_mac': '2a:43:05:87:da:87'}, external_ids={}) old=SB_Global(nb_cfg=97) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2020-03-18 16:44:40.054 25037 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): UpdateChassisExtIdsCommand(name=a18b22a8-90ce-4b58-a61e-5b99e185e2ec, external_ids={'neutron:ovn-metadata-sb-cfg': '98'}, if_exists=True) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2020-03-18 16:45:07.078 25037 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2020-03-18 16:45:07.079 25037 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/bc2c354b-e2e6-4c20-9d4c-d80e32b2a53a.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2020-03-18 16:45:07.079 25037 ERROR neutron.agent.linux.external_process [-] metadata-proxy for metadata with uuid bc2c354b-e2e6-4c20-9d4c-d80e32b2a53a not found. The process should not have died
2020-03-18 16:45:07.080 25037 WARNING neutron.agent.linux.external_process [-] Respawning metadata-proxy for uuid bc2c354b-e2e6-4c20-9d4c-d80e32b2a53a
2020-03-18 16:45:07.080 25037 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/bc2c354b-e2e6-4c20-9d4c-d80e32b2a53a.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261

I'll keep the env around so we can debug more tomorrow.
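The minute-by-minute repetition comes from _check_child_processes: each cycle the monitor finds no live haproxy, logs the ERROR, and attempts a respawn. A hedged sketch of one cycle of that decision (the messages mirror the logs above, but the function and its names are illustrative, not neutron's actual implementation):

```shell
# One cycle of the monitor loop: if the proxy is dead, log and respawn.
check_and_respawn() {
  uuid="$1"; pidfile="$2"
  if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
    return 0   # proxy alive, nothing to do
  fi
  echo "ERROR: metadata-proxy for metadata with uuid $uuid not found. The process should not have died"
  echo "WARNING: Respawning metadata-proxy for uuid $uuid"
  # the real code would invoke the process manager's respawn path here,
  # re-running the haproxy wrapper; if that wrapper blocks, the same pair
  # of messages repeats on the next cycle
}

check_and_respawn bc2c354b-e2e6-4c20-9d4c-d80e32b2a53a /nonexistent/pid.haproxy
```

Seen this way, the repeating ERROR/WARNING pairs are a symptom: the respawn itself keeps getting attempted, but whatever the wrapper is blocked on never lets haproxy actually come up.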
Kuba and I have been looking into it today and found the answer to 1) (i.e. why haproxy doesn't start for 2 minutes). The full description is on the LP bug [0] and we have a patch up for review [1].

As a side note, and I think it's not *strictly* related to this issue but looks like an issue anyway: we've seen the sync script hanging forever on the flock [2]. 'lslocks' output only shows the sync script locking on the file and nothing else, but it didn't work well until we restarted the systemd unit (and killed the process).

[0] https://bugs.launchpad.net/tripleo/+bug/1868082
[1] https://review.opendev.org/713852
[2] https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/roles/tripleo_systemd_wrapper/templates/service_sync.j2#L37

PS. The issue was introduced by this patch 4 weeks ago: https://review.opendev.org/#/c/706381/
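On the flock hang specifically: the service_sync template serializes work with a blocking flock(1) on a per-service file, so a stuck holder wedges it indefinitely. A sketch of the same pattern with a bounded wait, which would at least fail loudly instead of hanging (the 5-second timeout and the mktemp lock path are illustrative, not what the shipped script does):

```shell
# Acquire the sync lock with a timeout instead of blocking forever.
lock=$(mktemp)          # stand-in for the per-service sync lock file
exec 9>"$lock"          # open the lock file on fd 9
if flock -w 5 9; then   # -w 5: give up after 5 seconds
  result="lock acquired"
else
  result="lock timeout" # a stuck holder now surfaces as an error
fi
echo "$result"
exec 9>&-               # release the lock by closing the fd
rm -f "$lock"
```

With a timeout in place, a wedged holder would show up as a logged failure that systemd could act on, rather than the silent hang we saw in 'lslocks'.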
Verified on 16.0-RHEL-8/RHOS_TRUNK-16.0-RHEL-8-20200406.n.1 with tripleo-ansible-0.4.2-0.20200404124614.67005aa.el8ost.noarch.

Verified that haproxy and the metadata service are running and available to serve instance requests. Spawned several instances and all of them were able to get metadata. No errors about a missing metadata-proxy were found in the metadata logs.

Also tried rebooting the compute nodes and starting instances after the reboot. Haproxy and the metadata service were running, existing instances were able to boot and get metadata, and new instances could also be spawned and get metadata.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2114