Bug 1666878 - On undercloud some of the neutron containers fail to start at boot time because they cannot mount /run/openvswitch
Keywords:
Status: CLOSED DUPLICATE of bug 1685658
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: RHOS Maint
QA Contact: Gurenko Alex
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-01-16 20:33 UTC by Marius Cornea
Modified: 2019-04-05 08:28 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-05 08:28:17 UTC
Target Upstream Version:
Embargoed:



Description Marius Cornea 2019-01-16 20:33:40 UTC
Description of problem:

On the undercloud, some of the neutron containers (neutron_ovs_agent, neutron_l3_agent, neutron_dhcp) fail to start at boot time because they cannot mount /run/openvswitch.


[root@undercloud stack]# journalctl -l -u tripleo_neutron_l3_agent.service | grep -i -B1 error
Jan 16 14:44:06 undercloud.localdomain podman[7343]: unable to start container 550ab4d11bf342de083019709d55c45a2b157a7975614a340b6e112bd90e9d0a: container create failed: container_linux.go:336: starting container process caused "process_linux.go:399: container init caused \"rootfs_linux.go:58: mounting \\\"/run/openvswitch\\\" to rootfs \\\"/var/lib/containers/storage/overlay/133bf5712b528ae74a84cbf55db3cd28d0ca12eed2ebf1c1dbb68c69285778fb/merged\\\" at \\\"/run/openvswitch\\\" caused \\\"stat /run/openvswitch: no such file or directory\\\"\""
Jan 16 14:44:06 undercloud.localdomain podman[7343]: : internal libpod error

Checking the openvswitch log, we can see that it starts about 1 minute later (the timestamps differ by five hours because the journal is in EST while the openvswitch log is in UTC):

2019-01-16T19:45:31.281Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
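
To make the comparison between the two logs explicit, both timestamps can be converted to epoch seconds. This is only a sketch: it assumes GNU date, assumes the journal year is 2019 (the journal lines do not show a year), and drops the fractional seconds from the openvswitch timestamp. The function name gap_seconds is made up for illustration.

```shell
#!/bin/sh
# gap_seconds EARLIER LATER: print the number of seconds between two
# timestamps, each given in a form GNU date can parse (zone included).
gap_seconds() {
    earlier=$(date -u -d "$1" +%s) || return 1
    later=$(date -u -d "$2" +%s) || return 1
    echo $((later - earlier))
}

# Podman error from the journal (EST) versus the first ovs-vswitchd log line (UTC).
# The year 2019 is an assumption taken from the bug's report date.
gap_seconds '2019-01-16 14:44:06 EST' '2019-01-16 19:45:31 UTC'
```

With these inputs the gap is 85 seconds, which matches the "about 1 minute later" observation.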


Version-Release number of selected component (if applicable):
openstack-tripleo-image-elements-10.1.1-0.20190111004154.99e6a5a.fc28.noarch
openstack-tripleo-puppet-elements-10.0.1-0.20190108154243.bdf1104.fc28.noarch
openstack-tripleo-common-10.3.1-0.20190115214116.f50b35e.fc28.noarch
puppet-tripleo-10.2.1-0.20190115195112.a2c549a.fc28.noarch
ansible-tripleo-ipsec-9.0.1-0.20190115094401.8b37e93.fc28.noarch
openstack-tripleo-heat-templates-10.3.1-0.20190116115308.d747625.fc28.noarch
python3-tripleoclient-heat-installer-11.2.1-0.20190116115554.72a9f50.fc28.noarch
python3-tripleoclient-11.2.1-0.20190116115554.72a9f50.fc28.noarch
python3-tripleo-common-10.3.1-0.20190115214116.f50b35e.fc28.noarch
openstack-tripleo-validations-10.2.1-0.20190115104503.abdd15f.fc28.noarch
openstack-tripleo-common-containers-10.3.1-0.20190115214116.f50b35e.fc28.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20190114124825.d67f1ef.fc28.noarch


How reproducible:
100%

Steps to Reproduce:
1. Install the undercloud on RHEL 8
2. Reboot the undercloud (including the workaround for BZ#1666387)
3. Check the status of the neutron_ovs_agent, neutron_l3_agent, and neutron_dhcp containers

Actual results:
Not started

Expected results:
Started

Additional info:

The tripleo_neutron_l3_agent.service eventually ends up in a failed state (before the openvswitch service has started). Restarting the service manually after openvswitch has started works fine and gets the containers started.

Jan 16 14:44:02 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Main process exited, code=exited, status=125/n/a
Jan 16 14:44:02 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Failed with result 'exit-code'.
Jan 16 14:44:02 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Service RestartSec=100ms expired, scheduling restart.
Jan 16 14:44:02 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Scheduled restart job, restart counter is at 1.
Jan 16 14:44:02 undercloud.localdomain systemd[1]: Stopped neutron_l3_agent container.
Jan 16 14:44:02 undercloud.localdomain systemd[1]: Started neutron_l3_agent container.
Jan 16 14:44:06 undercloud.localdomain podman[7343]: unable to start container 550ab4d11bf342de083019709d55c45a2b157a7975614a340b6e112bd90e9d0a: container create failed: container_linux.go:336: starting container process caused "process_>
Jan 16 14:44:06 undercloud.localdomain podman[7343]: : internal libpod error
Jan 16 14:44:06 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Main process exited, code=exited, status=125/n/a
Jan 16 14:44:06 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Failed with result 'exit-code'.
Jan 16 14:44:07 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Service RestartSec=100ms expired, scheduling restart.
Jan 16 14:44:07 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Scheduled restart job, restart counter is at 2.
Jan 16 14:44:07 undercloud.localdomain systemd[1]: Stopped neutron_l3_agent container.
Jan 16 14:44:07 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Found left-over process 9284 (podman) in control group while starting unit. Ignoring.
Jan 16 14:44:07 undercloud.localdomain systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jan 16 14:44:07 undercloud.localdomain systemd[1]: Started neutron_l3_agent container.
Jan 16 14:44:08 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Main process exited, code=exited, status=125/n/a
Jan 16 14:44:08 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Failed with result 'exit-code'.
Jan 16 14:44:09 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Service RestartSec=100ms expired, scheduling restart.
Jan 16 14:44:09 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Scheduled restart job, restart counter is at 3.
Jan 16 14:44:09 undercloud.localdomain systemd[1]: Stopped neutron_l3_agent container.
Jan 16 14:44:09 undercloud.localdomain systemd[1]: Started neutron_l3_agent container.
Jan 16 14:44:11 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Main process exited, code=exited, status=125/n/a
Jan 16 14:44:11 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Failed with result 'exit-code'.
Jan 16 14:44:11 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Service RestartSec=100ms expired, scheduling restart.
Jan 16 14:44:11 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Scheduled restart job, restart counter is at 4.
Jan 16 14:44:11 undercloud.localdomain systemd[1]: Stopped neutron_l3_agent container.
Jan 16 14:44:11 undercloud.localdomain systemd[1]: Started neutron_l3_agent container.
Jan 16 14:44:13 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Main process exited, code=exited, status=125/n/a
Jan 16 14:44:13 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Failed with result 'exit-code'.
Jan 16 14:44:13 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Service RestartSec=100ms expired, scheduling restart.
Jan 16 14:44:13 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Scheduled restart job, restart counter is at 5.
Jan 16 14:44:13 undercloud.localdomain systemd[1]: Stopped neutron_l3_agent container.
Jan 16 14:44:13 undercloud.localdomain systemd[1]: Started neutron_l3_agent container.
Jan 16 14:44:14 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Main process exited, code=exited, status=125/n/a
Jan 16 14:44:14 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Failed with result 'exit-code'.
Jan 16 14:44:14 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Service RestartSec=100ms expired, scheduling restart.
Jan 16 14:44:14 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Scheduled restart job, restart counter is at 6.
Jan 16 14:44:14 undercloud.localdomain systemd[1]: Stopped neutron_l3_agent container.
Jan 16 14:44:14 undercloud.localdomain systemd[1]: Started neutron_l3_agent container.
Jan 16 14:44:14 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Main process exited, code=exited, status=125/n/a
Jan 16 14:44:14 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Failed with result 'exit-code'.
Jan 16 14:44:15 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Service RestartSec=100ms expired, scheduling restart.
Jan 16 14:44:15 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Scheduled restart job, restart counter is at 7.
Jan 16 14:44:15 undercloud.localdomain systemd[1]: Stopped neutron_l3_agent container.
Jan 16 14:44:15 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Start request repeated too quickly.
Jan 16 14:44:15 undercloud.localdomain systemd[1]: tripleo_neutron_l3_agent.service: Failed with result 'exit-code'.
Jan 16 14:44:15 undercloud.localdomain systemd[1]: Failed to start neutron_l3_agent container.
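
The manual restart described above could be scripted roughly as follows. This is a workaround sketch, not a fix: the helper name wait_for_path and the 120-second timeout are arbitrary choices for illustration, and only the l3 agent unit name is taken from the logs in this report. The systemctl calls are left commented out so that the snippet has no side effects when sourced.

```shell
#!/bin/sh
# wait_for_path PATH [TIMEOUT_SECONDS]
# Poll once per second until PATH exists. Returns 0 as soon as it exists,
# or 1 if it is still missing after TIMEOUT_SECONDS (default 120).
wait_for_path() {
    path=$1
    timeout=${2:-120}
    elapsed=0
    while [ ! -e "$path" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Usage on the undercloud, as root. reset-failed clears the failed state
# left behind by "Start request repeated too quickly" before restarting.
# wait_for_path /run/openvswitch 120 &&
#     systemctl reset-failed tripleo_neutron_l3_agent.service &&
#     systemctl restart tripleo_neutron_l3_agent.service
```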

Comment 1 Bernard Cafarelli 2019-01-17 13:06:04 UTC
It sounds like a missing start dependency between openvswitch and neutron containers?

With docker, the docker service is After=network.target and openvswitch.service is PartOf=network.target, so docker containers always start after openvswitch.

Comment 2 Nate Johnston 2019-01-28 14:48:53 UTC
Brent, can you comment on whether this is more on the Director side or the OVS side?  It seems to me that OVS would need to maintain their existing target for a generic situation, so a different specialized target may be needed.

Comment 3 Brent Eagles 2019-04-01 13:55:37 UTC
This is on the deployment side. The service files responsible for launching the neutron agents should have dependencies to drag openvswitch up.
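
As an illustration of what such a dependency might look like, the sketch below writes a systemd drop-in that pulls in openvswitch.service and orders the agent unit after it. This is not the change that was actually made for bug 1685658. The function name write_ovs_dropin and the file name 10-openvswitch.conf are made up for this example, and Wants= is used here as the weaker alternative to Requires=.

```shell
#!/bin/sh
# write_ovs_dropin UNIT [BASE_DIR]
# Create BASE_DIR/UNIT.d/10-openvswitch.conf (BASE_DIR defaults to
# /etc/systemd/system) so that UNIT wants openvswitch.service and is
# ordered after it.
write_ovs_dropin() {
    unit=$1
    base=${2:-/etc/systemd/system}
    dir="$base/$unit.d"
    mkdir -p "$dir" || return 1
    printf '%s\n' \
        '[Unit]' \
        'Wants=openvswitch.service' \
        'After=openvswitch.service' > "$dir/10-openvswitch.conf"
}

# Usage on the undercloud, as root (commented out to avoid side effects):
# write_ovs_dropin tripleo_neutron_l3_agent.service &&
#     systemctl daemon-reload
```

Note that ordering after openvswitch.service only helps if that unit is considered started after /run/openvswitch exists, which this report does not confirm.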

