Bug 1444803 - SystemExit: 1 seen in openvswitch-agent.log, due to systemd[1]: network.service start operation timed out. Terminating.
Summary: SystemExit: 1 seen in openvswitch-agent.log, due to systemd[1]: network.servi...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 10.0 (Newton)
Hardware: All
OS: Linux
high
medium
Target Milestone: async
: 10.0 (Newton)
Assignee: Assaf Muller
QA Contact: Toni Freger
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-04-24 11:05 UTC by Masaki Furuta ( RH )
Modified: 2020-06-11 13:40 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-07-25 14:47:30 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Comment 3 Flavio Leitner 2017-05-03 12:55:33 UTC
The default timeout is 10 secs.  If the system is loaded enough that it is takes longer, the ovs-ctl will exit allowing the network service to continue without the interface be ready causing the error you are seeing.

Could you please increase the timeout to a few minutes and see if that fixes the issue?

Just add in /etc/sysconfig/network-scripts/ifcfg-vlan2105 the line:
TIMEOUT=600

and try to reproduce it again.

Comment 4 Jakub Libosvar 2017-05-03 13:00:55 UTC
(In reply to Flavio Leitner from comment #3)
> The default timeout is 10 secs.  If the system is loaded enough that it is
> takes longer, the ovs-ctl will exit allowing the network service to continue
> without the interface be ready causing the error you are seeing.
> 
> Could you please increase the timeout to a few minutes and see if that fixes
> the issue?
> 
> Just add in /etc/sysconfig/network-scripts/ifcfg-vlan2105 the line:
> TIMEOUT=600
> 
> and try to reproduce it again.

It looks to me that the SysV script timed out, not the vsctl commands.

"Mar 21 15:48:39 networker01.localdomain systemd[1]: network.service start operation timed out. Terminating."

Which means we should probably change the timeout of either systemd redirection or init script itself?

Comment 5 Flavio Leitner 2017-05-03 13:57:46 UTC
You're right and in that case you might want to increase the default start timeout in systemd.  I am not a systemd expert, so check out systemd-system.conf(5)

Do we know why the load is so high? Because it sounds like the system is just oversubscribed at this point.

Comment 6 Jakub Libosvar 2017-05-03 14:55:12 UTC
(In reply to Flavio Leitner from comment #5)
> You're right and in that case you might want to increase the default start
> timeout in systemd.  I am not a systemd expert, so check out
> systemd-system.conf(5)
> 
> Do we know why the load is so high? Because it sounds like the system is
> just oversubscribed at this point.

The problem is that network.service does not have any service or unit file. It's generated form the init script, using the default of 5 minutes timeout on start. I'm trying to read some docs whether there is a way to tell systemd to override default timeout for services generated by systemd-sysv-generator (like the network.service). I can't find a solution so far.

Comment 7 Flavio Leitner 2017-05-03 20:06:04 UTC
Looks like it's hardcoded:
https://github.com/systemd/systemd/blob/master/src/sysv-generator/sysv-generator.c#L193

Comment 8 Masaki Furuta ( RH ) 2017-05-08 01:04:51 UTC
(In reply to Flavio Leitner from comment #7)
> Looks like it's hardcoded:
> https://github.com/systemd/systemd/blob/master/src/sysv-generator/sysv-
> generator.c#L193

Yup, I agree.

1. ConfirmTimoutSec>

Both Start / Stop seems to be set to 5min.

  ~~~
  [root@el73-nfs ~]# systemctl show network.service |grep -i 'time.*sec='
  TimeoutStartUSec=5min
  TimeoutStopUSec=5min
  TimerSlackNSec=50000
  JobTimeoutUSec=0
  
  [root@el73-nfs ~]# systemctl cat network.service |grep -i 'time.*sec='
  TimeoutSec=5min
  
  [root@el73-nfs ~]# systemctl cat network.service 
  # /run/systemd/generator.late/network.service
  # Automatically generated by systemd-sysv-generator
  
  [Unit]
  Documentation=man:systemd-sysv-generator(8)
  SourcePath=/etc/rc.d/init.d/network
  Description=LSB: Bring up/down networking
  Before=runlevel2.target
  Before=runlevel3.target
  Before=runlevel4.target
  Before=runlevel5.target
  Before=shutdown.target
  Before=network-online.target
  Before=network.target
  After=iptables.service
  After=ip6tables.service
  After=NetworkManager-wait-online.service
  After=NetworkManager.service
  After=network-pre.target
  Wants=network-online.target
  Conflicts=shutdown.target
  
  [Service]
  Type=forking
  Restart=no
  TimeoutSec=5min
  IgnoreSIGPIPE=no
  KillMode=process
  GuessMainPID=no
  RemainAfterExit=yes
  ExecStart=/etc/rc.d/init.d/network start
  ExecStop=/etc/rc.d/init.d/network stop
  ExecReload=/etc/rc.d/init.d/network reload
  ~~~
  
2. Confirm service file.

Network.service comes from sysv serivce, so it looks created runtime.
  
  ~~~
  [root@el73-nfs ~]# ll /run/systemd/generator.late/network.service
  -rw-r--r--. 1 root root 792 Apr 10 03:06 /run/systemd/generator.late/network.service
  
  [root@el73-nfs ~]# date
  Wed Apr 10 03:14:49 JST 2017\
  ~~~

3. Checked code in systemd

Since systemd creates and uses .service file for sysv , I agree that it's created as below in the code.
  
  - ./systemd-219-30.el7_3.6.src.rpmbuild-bp/src/sysv-generator/sysv-generator.c
  
  ...
  175         fprintf(f,↲
  176                 "\n[Service]\n"↲
  177                 "Type=forking\n"↲
  178                 "Restart=no\n"↲
  179                 "TimeoutSec=5min\n"↲                                                                                                                                                                          
  180                 "IgnoreSIGPIPE=no\n"↲
  181                 "KillMode=process\n"↲
  182                 "GuessMainPID=no\n"↲
  183                 "RemainAfterExit=%s\n",↲
  184                 yes_no(!s->pid_file));↲

I also quick checked if we can set new TimeoutSec value as below, it won't work after running deamon-reload..

  [root@el73-osp10-all-virbr1 ~]# ll /run/systemd/generator.late/network.service
  -rw-r--r--. 1 root root 792 Apr 10 19:21 /run/systemd/generator.late/network.service

  [root@el73-osp10-all-virbr1 ~]# cp -v /run/systemd/generator.late/network.service /root/network.service-$(date +%F)
  ‘/run/systemd/generator.late/network.service’ -> ‘/root/network.service-2017-04-10’
  
  [root@el73-osp10-all-virbr1 ~]# vim /run/systemd/generator.late/network.service
  
  [root@el73-osp10-all-virbr1 ~]# diff -u /root/network.service-$(date +%F) /run/systemd/generator.late/network.service 
  --- /root/network.service-2017-04-10	2017-04-10 19:19:33.564082644 +0900
  +++ /run/systemd/generator.late/network.service	2017-04-10 19:19:45.887147580 +0900
  @@ -22,7 +22,7 @@
   [Service]
   Type=forking
   Restart=no
  -TimeoutSec=5min
  +TimeoutSec=10min
   IgnoreSIGPIPE=no
   KillMode=process
   GuessMainPID=no
  
  [root@el73-osp10-all-virbr1 ~]# systemctl show network.service |grep -i 'time.*sec='
  TimeoutStartUSec=5min
  TimeoutStopUSec=5min
  TimerSlackNSec=50000
  JobTimeoutUSec=0
  
  [root@el73-osp10-all-virbr1 ~]# systemctl daemon-reload

  [root@el73-osp10-all-virbr1 ~]# ll /run/systemd/generator.late/network.service
  -rw-r--r--. 1 root root 792 Apr 10 19:24 /run/systemd/generator.late/network.service
  
  [root@el73-osp10-all-virbr1 ~]# systemctl show network.service |grep -i 'time.*sec='
  TimeoutStartUSec=5min
  TimeoutStopUSec=5min
  TimerSlackNSec=50000
  JobTimeoutUSec=0
  
  [root@el73-osp10-all-virbr1 ~]# grep 5min /run/systemd/generator.late/network.service
  TimeoutSec=5min(In reply to Jakub Libosvar from comment #6)

> (In reply to Flavio Leitner from comment #5)
> > You're right and in that case you might want to increase the default start
> > timeout in systemd.  I am not a systemd expert, so check out
> > systemd-system.conf(5)
> > 
> > Do we know why the load is so high? Because it sounds like the system is
> > just oversubscribed at this point.
> 
> The problem is that network.service does not have any service or unit file.
> It's generated form the init script, using the default of 5 minutes timeout
> on start. I'm trying to read some docs whether there is a way to tell
> systemd to override default timeout for services generated by
> systemd-sysv-generator (like the network.service). I can't find a solution
> so far.

Just in case, I'm currently asking customer to add '+x' in /etc/sysconfig/network-scripts/ifup-ovs to get more detailed output when they see timeout and will update on this bz..

  [root@el73-osp10-all-virbr1-2 ~]# vim /etc/sysconfig/network-scripts/ifup-ovs

  [root@el73-osp10-all-virbr1-2 ~]# diff -u /etc/sysconfig/network-scripts/ifup-ovs-$(date +%F) /etc/sysconfig/network-scripts/ifup-ovs 
  --- /etc/sysconfig/network-scripts/ifup-ovs-2017-04-26	2017-04-26 19:43:55.129592730 +0900
  +++ /etc/sysconfig/network-scripts/ifup-ovs	2017-04-26 19:44:01.795637331 +0900
  @@ -1,4 +1,4 @@
  -#!/bin/bash
  +#!/bin/bash -x 
   
   # Copyright (c) 2011 Alexey I. Froloff.
   #

But if it's due to timeout from systemd (systemd[1]: network.service start operation timed out. Terminating.) Should we change this Bz from neutron to systemd and we need to get from systemd expert ?

Btw, I also get feedback from customer that they had tried to modify timeout based on https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/System_Administrators_Guide/chap-Managing_Services_with_systemd.html and confirmed that timeout value had extended (but they said they had not verified it really works), so I'm also confirming how they had done and verify if it'll work.

Comment 9 Flavio Leitner 2017-05-08 18:31:40 UTC
(In reply to Masaki Furuta from comment #8)
> Should we change this Bz from neutron to
> systemd and we need to get from systemd expert ?

There are two paths here:  Maybe the boot is slow because of another problem and increasing the timeout is just masking the real problem. Most probably a systemd expert can help you troubleshoot the initialization (e.g.: systemd-analyze).

Or the system is really loaded and systemd would need to allow to increase timeout in its config file.  At this point I'd say we need a systemd expert again.

Comment 10 Masaki Furuta ( RH ) 2017-05-10 07:16:27 UTC
(In reply to Flavio Leitner from comment #9)
> (In reply to Masaki Furuta from comment #8)
> > Should we change this Bz from neutron to
> > systemd and we need to get from systemd expert ?
> 
> There are two paths here:  Maybe the boot is slow because of another problem
> and increasing the timeout is just masking the real problem. Most probably a
> systemd expert can help you troubleshoot the initialization (e.g.:
> systemd-analyze).

Hi Flavio and Assaf,

Thanks for clarification!

Okay, I agree that we need to move this from openstack-neutron to systemd if network.service in systemd matters.

But I also saw following in the log:

  1) ovs-vsctl[PID]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-port br0 vlanXXXX -- add-port br0 vlanXXXX tag=XXXX -- set Interface vlanXXXX type=internal

Then

  2) network[PID]: Bringing up interface vlanXXXX:  [  OK  ]

Does this mean if ovs-vsctl command is taking longer time ?
If true, this should be openvswitch issue than neutron or systemd ?

  # rpm -qf --qf '%{name}\n' $(which ovs-vsctl )
  openvswitch

Excerpt from Bz description:
~~~
Under high load vs no load:

- high load (Not working):
...
Apr 19 15:29:27 networker01.localdomain ovs-vsctl[6617]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --may-exist add-br br-ex -- set bridge br-ex fail_mode=standalone
Apr 19 15:29:41 networker01.localdomain network[1273]: Bringing up interface br-ex:  [  OK  ]
Apr 19 15:29:41 networker01.localdomain systemd[1]: network.service start operation timed out. Terminating.
Apr 19 15:29:41 networker01.localdomain systemd[1]: Failed to start LSB: Bring up/down networking.
Apr 19 15:29:41 networker01.localdomain systemd[1]: Unit network.service entered failed state.
Apr 19 15:29:41 networker01.localdomain systemd[1]: network.service failed.

- no load (Working):
...
Apr 19 10:40:39 networker01.localdomain ovs-vsctl[3073]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --may-exist add-br br-ex -- set bridge br-ex fail_mode=standalone
Apr 19 10:40:41 networker01.localdomain network[1274]: Bringing up interface br-ex:  [  OK  ]
Apr 19 10:40:42 networker01.localdomain ovs-vsctl[3248]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-port br0 vlan2001 -- add-port br0 vlan2001 tag=2001 -- set Interface vlan2001 type=internal
Apr 19 10:40:47 networker01.localdomain network[1274]: Bringing up interface vlan2001:  [  OK  ]
Apr 19 10:40:48 networker01.localdomain ovs-vsctl[3368]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-port br1 vlan2104 -- add-port br1 vlan2104 tag=2104 -- set Interface vlan2104 type=internal
Apr 19 10:40:53 networker01.localdomain network[1274]: Bringing up interface vlan2104:  [  OK  ]
Apr 19 10:40:54 networker01.localdomain ovs-vsctl[3490]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-port br-ex vlan2105 -- add-port br-ex vlan2105 tag=2105 -- set Interface vlan2105 type=internal
Apr 19 10:40:59 networker01.localdomain network[1274]: Bringing up interface vlan2105:  [  OK  ]
~~~

> Or the system is really loaded and systemd would need to allow to increase
> timeout in its config file.  At this point I'd say we need a systemd expert
> again.

Now I got feedback from customer that they can set longer TimoutStartUsec with following steps:

(set 15min)

  # systemctl show network -p TimeoutStartUSec
  TimeoutStartUSec=5min

  # mkdir -pv /etc/systemd/system/network.service.d/
  mkdir: created directory ‘/etc/systemd/system/network.service.d/’

  # echo -e '[Service]\nTimeoutStartSec=15min' > /etc/systemd/system/network.service.d/timeout.conf

  # systemctl daemon-reload

  # systemctl show network -p TimeoutStartUSec
  TimeoutStartUSec=15min

(reboot)
  
  # reboot
  ...

(confirm it again)
  
  # systemctl show network -p TimeoutStartUSec
  TimeoutStartUSec=15min


Looks like we can set longer timeout with overriding with /etc/systemd/system/network.service.d/timeout.conf file.

Now I'm asking customer if they can work this around with longer timeout and waiting on feedback, but customer replied that it takes some time till they can re-check with longer timeout.

Comment 11 Masaki Furuta ( RH ) 2017-05-10 07:41:01 UTC
(In reply to Masaki Furuta from comment #10)
> (In reply to Flavio Leitner from comment #9)
> > (In reply to Masaki Furuta from comment #8)
...
> Okay, I agree that we need to move this from openstack-neutron to systemd if
> network.service in systemd matters.
> 
> But I also saw following in the log:
> 
>   1) ovs-vsctl[PID]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 --
> --if-exists del-port br0 vlanXXXX -- add-port br0 vlanXXXX tag=XXXX -- set
> Interface vlanXXXX type=internal
> 
> Then
> 
>   2) network[PID]: Bringing up interface vlanXXXX:  [  OK  ]
> 
> Does this mean if ovs-vsctl command is taking longer time ?
> If true, this should be openvswitch issue than neutron or systemd ?
> 
>   # rpm -qf --qf '%{name}\n' $(which ovs-vsctl )
>   openvswitch

Or we firstly need to copy this bz for systemd and then clarify how systemd script (network.service) invoke ovs-vsctl and then move this again to openswitch bz ?

Comment 12 Flavio Leitner 2017-05-11 03:40:19 UTC
(In reply to Masaki Furuta from comment #11)
> > Does this mean if ovs-vsctl command is taking longer time ?

I asked on comment#5 why the load would be so high, but I got no answer yet.
If you don't know, maybe the systemd experts can help you troubleshoot the boot time (comment#9)

> Or we firstly need to copy this bz for systemd and then clarify how systemd
> script (network.service) invoke ovs-vsctl and then move this again to
> openswitch bz ?

The network.service will call ifup, which calls ifup-ovs which calls ovs-vsctl. But for some unknown reason, the whole initialization is taking long.

One idea would be to change ifup-ovs to show dates when starting and exiting to know if ovs is causing the slow down.  Or you could ovs-vsctl with a wrapper script to run strace -ttt ovs-vsctl logging the output, so we can review next.


Note You need to log in before you can comment on or make changes to this bug.