Bug 2060031 - NMCLI OVS connections intermittently get stuck in "activating" state after power cycle or crash [rhel-8]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: NetworkManager
Version: 8.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Thomas Haller
QA Contact: Vladimir Benes
URL:
Whiteboard:
Depends On:
Blocks: 2153429 2153430 2173890
Reported: 2022-03-02 14:43 UTC by Rick Alongi
Modified: 2023-05-16 11:04 UTC
CC List: 10 users

Fixed In Version: NetworkManager-1.40.2-1.el8
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2153429 2153430 2173890
Environment:
Last Closed: 2023-05-16 09:04:54 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
OVS and journalctl logs (180.92 KB, application/gzip)
2022-03-02 14:43 UTC, Rick Alongi


Links
System                  ID                Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker   NMT-269           0        None      None    None     2023-02-08 08:44:39 UTC
Red Hat Issue Tracker   RHELPLAN-114236   0        None      None    None     2022-03-02 14:50:34 UTC
Red Hat Product Errata  RHBA-2023:2968    0        None      None    None     2023-05-16 09:06:21 UTC

Description Rick Alongi 2022-03-02 14:43:52 UTC
Created attachment 1863849
OVS and journalctl logs

Description of problem:
NMCLI OVS connections intermittently get stuck in "activating" state after power cycle

Version-Release number of selected component (if applicable):

Compose: RHEL-8.6.0-20220227.2
Kernel: 4.18.0-369.el8.x86_64

[root@netqe9 ~]# rpm -qa | grep NetworkManager
NetworkManager-libnm-1.36.0-1.el8.x86_64
NetworkManager-tui-1.36.0-1.el8.x86_64
NetworkManager-ovs-1.36.0-1.el8.x86_64
NetworkManager-team-1.36.0-1.el8.x86_64
NetworkManager-1.36.0-1.el8.x86_64

How reproducible:

Intermittent (~50% of the time)

Steps to Reproduce:
1. Provision system with RHEL-8.6
2. Install openvswitch2.15 or openvswitch2.17
3. Start/Enable openvswitch.service
4. Create OVS based NMCLI connections
5. Power cycle system via rhts-power command
6. After the system comes back up, note that one of the connections is stuck in "activating"
7. More detailed steps in Additional info section below

Actual results:
NMCLI connection is stuck in activating state.

Expected results:
NMCLI connections become "activated" after a power cycle without having to manually run "nmcli con up <connection>".
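The manual workaround mentioned in the expected results (running nmcli con up on the stuck profile) can be scripted. The following is a minimal sketch, not part of the original report, assuming the terse NAME:STATE output of `nmcli -t -f NAME,STATE connection show`; the function name list_activating is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the names of connection profiles whose STATE
# is "activating". Expects terse "NAME:STATE" lines on stdin, e.g. from:
#   nmcli -t -f NAME,STATE connection show
list_activating() {
    local line name state
    while IFS= read -r line; do
        state=${line##*:}   # STATE is the last field and contains no ':'
        name=${line%:*}     # nmcli -t escapes any ':' inside NAME as '\:'
        if [[ $state == activating ]]; then
            printf '%s\n' "$name"
        fi
    done
}
```

Piping `nmcli -t -f NAME,STATE connection show` into list_activating and running `nmcli con up` on each printed name would re-activate the stuck profile. That only works around the symptom; it is not the fix tracked in this bug. Profile names containing ':' are printed with nmcli's `\:` escaping and would need unescaping first.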

Additional info:

- This behavior was not observed when using RHEL-8.4 or 8.5.
- Problem has been observed with RHEL-8.6 when using both openvswitch2.15 and openvswitch2.17.
- Problem does not happen after a software reboot (rhts-reboot)
- TRACE enabled for NetworkManager with journalctl -b output attached
- ovs_vswitchd.log, ovsdb-server.log and journalctl.log attached to this BZ
- sos report located here: http://netqe-infra01.knqe.lab.eng.bos.redhat.com/sosreports/sosreport-netqe9-2022-03-02-bdpdsim.tar.xz
- Example beaker job: https://beaker.engineering.redhat.com/jobs/6358018
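The setup.sh script below enables TRACE by uncommenting `#level=TRACE` in /etc/NetworkManager/NetworkManager.conf with sed, which silently changes nothing if that commented line is absent. A sketch of an alternative, assuming NetworkManager's usual conf.d drop-in directory, writes the [logging] settings to a separate file; the helper name and the drop-in file name are hypothetical. Because the bug appears after a power cycle, the setting has to be persistent: a runtime change such as `nmcli general logging level TRACE domains ALL` would not survive the reboot.

```shell
#!/bin/bash
# Hypothetical helper: write a NetworkManager drop-in that enables TRACE
# logging for all domains. DIR defaults to /etc/NetworkManager/conf.d.
# NetworkManager must be restarted afterwards for the setting to apply.
write_trace_dropin() {
    local dir=${1:-/etc/NetworkManager/conf.d}
    local file="$dir/90-trace-logging.conf"   # file name is arbitrary
    cat > "$file" <<'EOF'
[logging]
level=TRACE
domains=ALL
EOF
    printf '%s\n' "$file"   # report which file was written
}
```

Run it as root with no argument, then restart NetworkManager as setup.sh does.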

# Steps to reproduce:

- Provision system with RHEL-8.6
- Create scripts below (and chmod +x) on provisioned system:

[root@netqe9 ~]# cat rhts_power_install.sh 
#!/bin/bash

cat >/usr/bin/rhts-power <<EOF
#!/bin/bash

curl --insecure \\
     --header "Content-Type: text/xml" \\
     --data "<?xml version=\"1.0\"?>
             <methodCall>
               <methodName>power</methodName>
               <params>
                 <param>
                   <value><string>\$(hostname)</string></value>
                 </param>
                 <param>
                   <value><string>reboot</string></value>
                 </param>
               </params>
             </methodCall>" \\
     http://\${LAB_CONTROLLER}:8000/RPC2
EOF
chmod 755 /usr/bin/rhts-power

[root@netqe9 ~]# cat setup.sh 
#!/bin/bash

RPM_OVS=${RPM_OVS:-"http://netqe-infra01.knqe.lab.eng.bos.redhat.com/repo/packages/openvswitch2.17/el8/openvswitch2.17-2.17.0-0.2.el8fdp.x86_64.rpm"}
ovsbr1=ovsbr1
ovsbr2=ovsbr2
vlan_id=10

ovsbr1_ip4addr=192.168.58.2
ovsbr1_ip6addr=2014:58::2
ovsbr2_ip4addr=192.168.78.2
ovsbr2_ip6addr=2014:78::2

function nmcli-install
{
    yum -y install NetworkManager-ovs
    sed -i 's/#level=TRACE/level=TRACE/g' /etc/NetworkManager/NetworkManager.conf
    systemctl daemon-reload
    systemctl restart NetworkManager
}

function ovs-static-config
{
    ovs-vsctl --if-exists del-br $ovsbr1
    nmcli c add type ovs-bridge conn.interface $ovsbr1 con-name $ovsbr1
    nmcli c add type ovs-port conn.interface $ovsbr1 master $ovsbr1 con-name ovs-port-$ovsbr1
    nmcli c add type ovs-interface slave-type ovs-port conn.interface $ovsbr1 master ovs-port-$ovsbr1 con-name ovs-if-$ovsbr1 ipv4.method static ipv4.address $ovsbr1_ip4addr/24 ipv6.method static ipv6.address $ovsbr1_ip6addr/64
    nmcli con up ovs-if-$ovsbr1
    nmcli con up ovs-port-$ovsbr1
    nmcli con up $ovsbr1
}

function ovs-static-config-vlan
{
    ovs-vsctl --if-exists del-br $ovsbr2
    nmcli c add type ovs-bridge conn.interface $ovsbr2 con-name $ovsbr2
    nmcli c add type ovs-port conn.interface vlan$vlan_id master $ovsbr2 ovs-port.tag $vlan_id con-name ovs-port-vlan$vlan_id
    nmcli c add type ovs-interface slave-type ovs-port conn.interface vlan$vlan_id master ovs-port-vlan$vlan_id con-name ovs-if-vlan$vlan_id ipv4.method static ipv4.address $ovsbr2_ip4addr/24 ipv6.method static ipv6.address $ovsbr2_ip6addr/64
    nmcli con up ovs-if-vlan$vlan_id
    nmcli con up ovs-port-vlan$vlan_id
    nmcli con up $ovsbr2
}

function check-config
{
	ovsbr1=ovsbr1
	ovsbr2=ovsbr2
	vlan_id=10
	ovsbr1_ip4addr=192.168.58.2
	ovsbr1_ip6addr=2014:58::2
	ovsbr2_ip4addr=192.168.78.2
	ovsbr2_ip6addr=2014:78::2

	output_file="/home/ip_output.txt"
	rm -f $output_file
	ip a | tee -a $output_file
	if [[ ! $(grep "$ovsbr1_ip4addr" $output_file) ]]; then echo "FAIL"; else echo "PASS"; fi
	if [[ ! $(grep "$ovsbr1_ip6addr" $output_file) ]]; then echo "FAIL"; else echo "PASS"; fi
	if [[ ! $(grep "$ovsbr2_ip4addr" $output_file) ]]; then echo "FAIL"; else echo "PASS"; fi
	if [[ ! $(grep "$ovsbr2_ip6addr" $output_file) ]]; then echo "FAIL"; else echo "PASS"; fi

	if [[ $(nmcli con show ovs-if-$ovsbr1 | grep 'GENERAL.STATE' | awk '{print $NF}') != activated ]]; then
		echo "FAIL"
	else
		echo "PASS"
	fi

	if [[ $(nmcli con show ovs-if-vlan$vlan_id | grep 'GENERAL.STATE' | awk '{print $NF}') != activated ]]; then
		echo "FAIL"
	else
		echo "PASS"
	fi
}

function beaker-install
{
	echo "sslverify=false" >> /etc/yum.conf

	# install wget in case it's missing
	yum -y install wget

	# install beaker-client.repo
	wget -O /etc/yum.repos.d/beaker-client.repo http://download.lab.bos.redhat.com/beakerrepos/beaker-client-RedHatEnterpriseLinux.repo

	# create beaker-tasks.repo file
	(
		echo [beaker-tasks]
		echo name=beaker-tasks
		echo baseurl=http://beaker.engineering.redhat.com/rpms
		echo enabled=1
		echo gpgcheck=0
		echo skip_if_unavailable=1
	) > /etc/yum.repos.d/beaker-tasks.repo

	# create beaker-harness.repo file
	(
		echo [beaker-harness]
		echo name=beaker-harness
		echo baseurl=http://download.eng.bos.redhat.com/beakerrepos/harness-testing/RedHatEnterpriseLinux8/
		echo enabled=1
		echo gpgcheck=0
		echo skip_if_unavailable=1
	) > /etc/yum.repos.d/beaker-harness.repo

	# install beaker related packages
	yum -y install rhts-test-env beakerlib rhts-devel rhts-python beakerlib-redhat.noarch beaker-client beaker-redhat
}

yum -y install http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch-selinux-extra-policy/1.0/29.el8fdp/noarch/openvswitch-selinux-extra-policy-1.0-29.el8fdp.noarch.rpm
yum -y install $RPM_OVS
systemctl start openvswitch && systemctl enable openvswitch
beaker-install
nmcli-install
ovs-static-config
ovs-static-config-vlan

sleep 5

check-config

[root@netqe9 ~]# cat check_config.sh 
#!/bin/bash

function check-config
{
	ovsbr1=ovsbr1
	ovsbr2=ovsbr2
	vlan_id=10
	ovsbr1_ip4addr=192.168.58.2
	ovsbr1_ip6addr=2014:58::2
	ovsbr2_ip4addr=192.168.78.2
	ovsbr2_ip6addr=2014:78::2

	output_file="/home/ip_output.txt"
	rm -f $output_file
	ip a | tee -a $output_file
	if [[ ! $(grep "$ovsbr1_ip4addr" $output_file) ]]; then echo "FAIL"; else echo "PASS"; fi
	if [[ ! $(grep "$ovsbr1_ip6addr" $output_file) ]]; then echo "FAIL"; else echo "PASS"; fi
	if [[ ! $(grep "$ovsbr2_ip4addr" $output_file) ]]; then echo "FAIL"; else echo "PASS"; fi
	if [[ ! $(grep "$ovsbr2_ip6addr" $output_file) ]]; then echo "FAIL"; else echo "PASS"; fi

	if [[ $(nmcli con show ovs-if-$ovsbr1 | grep 'GENERAL.STATE' | awk '{print $NF}') != activated ]]; then
		echo "FAIL"
	else
		echo "PASS"
	fi

	if [[ $(nmcli con show ovs-if-vlan$vlan_id | grep 'GENERAL.STATE' | awk '{print $NF}') != activated ]]; then
		echo "FAIL"
	else
		echo "PASS"
	fi
}

check-config

- Run scripts on system:

./rhts_power_install.sh
./setup.sh

After the configuration is in place via setup.sh, power cycle the system using the rhts-power command:

[root@netqe9 ~]# rhts-power
<?xml version='1.0'?>
<methodResponse>
<params>
<param>
<value><string>netqe9.knqe.lab.eng.bos.redhat.com</string></value>
</param>
</params>
</methodResponse>
[root@netqe9 ~]# 

After the system comes back up from the power cycle, run check_config.sh:

[root@netqe9 ~]# ./check_config.sh 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp130s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 3c:fd:fe:a7:37:54 brd ff:ff:ff:ff:ff:ff
3: enp4s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether f4:e9:d4:ed:aa:64 brd ff:ff:ff:ff:ff:ff
4: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 44:a8:42:32:0c:bd brd ff:ff:ff:ff:ff:ff
5: enp130s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 3c:fd:fe:a7:37:55 brd ff:ff:ff:ff:ff:ff
6: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 44:a8:42:32:0c:bf brd ff:ff:ff:ff:ff:ff
7: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 44:a8:42:32:0c:c1 brd ff:ff:ff:ff:ff:ff
    inet 10.19.15.45/24 brd 10.19.15.255 scope global dynamic noprefixroute eno3
       valid_lft 86012sec preferred_lft 86012sec
    inet6 2620:52:0:130f:46a8:42ff:fe32:cc1/64 scope global dynamic noprefixroute 
       valid_lft 2591978sec preferred_lft 604778sec
    inet6 fe80::46a8:42ff:fe32:cc1/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
8: eno4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 44:a8:42:32:0c:c3 brd ff:ff:ff:ff:ff:ff
9: enp132s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether a0:36:9f:75:08:90 brd ff:ff:ff:ff:ff:ff
10: enp132s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether a0:36:9f:75:08:92 brd ff:ff:ff:ff:ff:ff
11: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether be:a3:8e:fa:3b:4b brd ff:ff:ff:ff:ff:ff
12: ovsbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether ee:14:ac:99:53:4b brd ff:ff:ff:ff:ff:ff
13: ovsbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 5e:e0:37:38:a5:45 brd ff:ff:ff:ff:ff:ff
15: vlan10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether ee:06:34:95:c2:97 brd ff:ff:ff:ff:ff:ff
    inet 192.168.78.2/24 brd 192.168.78.255 scope global noprefixroute vlan10
       valid_lft forever preferred_lft forever
    inet6 2014:78::2/64 scope global noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 fe80::5517:311e:5a93:832/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
16: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 52:54:00:e0:37:3f brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
       valid_lft forever preferred_lft forever
FAIL
FAIL
PASS
PASS
FAIL
PASS
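The six unlabeled PASS/FAIL lines above map, in order, to the ovsbr1 IPv4 and IPv6 addresses, the ovsbr2 (vlan10) IPv4 and IPv6 addresses, and the states of ovs-if-ovsbr1 and ovs-if-vlan10; so here the ovsbr1 addresses are missing and ovs-if-ovsbr1 is not activated. A labeled variant makes that mapping explicit. This is a hypothetical sketch, not the reporter's script; all function names are invented, and the state lookup reuses the report's own GENERAL.STATE parsing.

```shell
#!/bin/bash
# has_addr ADDR: succeed if ADDR (without prefix length) is configured on
# some interface, reading "ip -o addr show" lines from stdin. Comparing the
# whole address field avoids grep's regex dots and substring matches
# (2014:78::2 would also match 2014:78::20).
has_addr() {
    awk -v want="$1" '{ split($4, a, "/"); if (a[1] == want) found = 1 }
                      END { exit !found }'
}

# is_activated NAME: succeed only if GENERAL.STATE is "activated";
# empty output or an nmcli error counts as not activated.
is_activated() {
    [[ $(nmcli connection show "$1" |
         awk '$1 == "GENERAL.STATE:" { print $NF }') == activated ]]
}

# report LABEL CMD...: run CMD and print "PASS: LABEL" or "FAIL: LABEL".
report() {
    local label=$1
    shift
    if "$@"; then echo "PASS: $label"; else echo "FAIL: $label"; fi
}

# check_config_labeled: same six checks as check-config, but labeled.
check_config_labeled() {
    local addrs addr conn
    addrs=$(ip -o addr show)
    for addr in 192.168.58.2 2014:58::2 192.168.78.2 2014:78::2; do
        report "address $addr" has_addr "$addr" <<<"$addrs"
    done
    for conn in ovs-if-ovsbr1 ovs-if-vlan10; do
        report "$conn activated" is_activated "$conn"
    done
}
```

On the failing boot shown above, this would print `FAIL: ovs-if-ovsbr1 activated` rather than a bare FAIL.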

# Note that the ovs-if-ovsbr1 connection is stuck in the "activating" state:

[root@netqe9 ~]# nmcli con show ovs-if-ovsbr1 | grep 'GENERAL.STATE' | awk '{print $NF}'
activating

# Over many reproduction attempts, each of the two connections has shown this problem individually, but never both at the same time.
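check_config.sh samples each connection state once, right after boot, so a profile that is merely slow to activate would be reported the same way as one that is stuck. A minimal sketch of a polling check separates the two cases, assuming a default timeout of 60 seconds chosen arbitrarily; get_state and wait_activated are hypothetical names.

```shell
#!/bin/bash
# get_state NAME: print the GENERAL.STATE of a connection profile, using
# the same parsing as check_config.sh. Kept separate so tests can stub it.
get_state() {
    nmcli connection show "$1" | awk '$1 == "GENERAL.STATE:" { print $NF }'
}

# wait_activated NAME [TIMEOUT]: poll once per second for up to TIMEOUT
# seconds (default 60). Succeed as soon as the profile reports "activated";
# otherwise print the last state seen to stderr and fail.
wait_activated() {
    local name=$1 timeout=${2:-60} state="" i
    for (( i = 0; i < timeout; i++ )); do
        state=$(get_state "$name")
        [[ $state == activated ]] && return 0
        sleep 1
    done
    printf '%s still "%s" after %ss\n' "$name" "$state" "$timeout" >&2
    return 1
}
```

For example, `wait_activated ovs-if-ovsbr1 120` fails only if the profile is still not activated after two minutes, which is a stronger signal of the "stuck" behavior than a single sample.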

Comment 1 Rick Alongi 2022-05-17 17:46:27 UTC
Still seeing this issue using RHEL-8.6 with openvswitch2.17.  This time it is happening after a forced crash as part of a test and can be reproduced manually.  Beaker job link: https://beaker.engineering.redhat.com/jobs/6611291

Comment 2 Rick Alongi 2022-10-18 19:09:50 UTC
Still seeing this issue in FDP 22.J testing using RHEL-8.6 (RHEL-8.6.0-updates-20221014.0) with openvswitch2.15-2.15.0-124.el8fdp and openvswitch2.17-2.17.0-58.el8fdp:

[root@netqe40 ~]# rpm -qa | grep NetworkManager
NetworkManager-libnm-1.36.0-9.el8_6.x86_64
NetworkManager-tui-1.36.0-9.el8_6.x86_64
NetworkManager-ovs-1.36.0-9.el8_6.x86_64
NetworkManager-team-1.36.0-9.el8_6.x86_64
NetworkManager-1.36.0-9.el8_6.x86_64

[root@netqe40 ~]# uname -r
4.18.0-372.32.1.el8_6.x86_64

Comment 3 Thomas Haller 2022-10-18 21:22:39 UTC
Rick, sorry for taking so long to reply.
Thank you for being persistent and for continuing to ping the rhbz :)


This looks to me as if it could be fixed by https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/4f60fe293cd5461c47d218b632753ecdfb50cbab. @Beniamino, what do you think?

Comment 4 Thomas Haller 2022-10-19 08:01:12 UTC
This seems indeed fixed upstream by [1].
[1] got backported to upstream nm-1-40 branch as [2].
[2] was released upstream as 1.40.2.

rhel-8.8 is about to get version NetworkManager-1.40.2-1.el8, which contains [2].

[1] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/4f60fe293cd5461c47d218b632753ecdfb50cbab
[2] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/f702be2992f0f34c82e96b420947f9056a4cb24e



This should be fixed by NetworkManager-1.40.2-1.el8.
If possible, please try that package.

Thanks for the report!!
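Whether a host already carries the upstream fix can be roughly checked by comparing the installed NetworkManager version with the versions named in this bug: 1.40.2 for RHEL 8 (the Fixed In Version) and 1.41.3 for RHEL 9.2. A sketch, assuming GNU coreutils `sort -V`; the helper names are hypothetical.

```shell
#!/bin/bash
# version_ge A B: succeed if version A >= version B. GNU sort -V orders
# version strings; -C only checks whether the input is already sorted.
version_ge() {
    printf '%s\n' "$2" "$1" | sort -V -C
}

# nm_has_fix [MIN]: compare the installed NetworkManager VERSION (not the
# RPM RELEASE) against MIN, default 1.40.2. Returns 2 if rpm query fails.
nm_has_fix() {
    local min=${1:-1.40.2} installed
    installed=$(rpm -q --qf '%{VERSION}\n' NetworkManager) || return 2
    version_ge "$installed" "$min"
}
```

Use `nm_has_fix` on RHEL 8 and `nm_has_fix 1.41.3` on RHEL 9. Because only the upstream version is compared, a Z-stream build that backports the fix under an older version number would be misreported as unfixed, so a negative result is a hint, not proof.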

Comment 5 Rick Alongi 2022-10-19 11:47:05 UTC
Hi Thomas,

I will test this with NetworkManager-1.40.2-1.el8 as soon as it is merged into RHEL-8.8.  I should mention that I am also seeing this same issue with RHEL-9.0:

[root@netqe40 ~]# rpm -qa | grep NetworkManager
NetworkManager-libnm-1.36.0-5.el9_0.x86_64
NetworkManager-1.36.0-5.el9_0.x86_64
NetworkManager-team-1.36.0-5.el9_0.x86_64
NetworkManager-tui-1.36.0-5.el9_0.x86_64
NetworkManager-ovs-1.36.0-5.el9_0.x86_64

Do you know if there is also a fix available for NetworkManager for RHEL-9.0?  Would it make sense for me to log a separate BZ to track this issue for RHEL-9.0?

Thanks!
Rick

Comment 6 Thomas Haller 2022-10-19 14:03:06 UTC
The fix [1] is on the upstream main branch and is included in upstream 1.41.3, which is about to come to rhel-9.2 as "NetworkManager-1.41.3-1.el9".

> [1] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/4f60fe293cd5461c47d218b632753ecdfb50cbab


> Do you know if there is also a fix available for NetworkManager for RHEL-9.0?

This rhbz tracks development for upcoming RHEL releases (in this case, rhel-8.8), where the issue is about to be fixed.
rhel-9.2 is also about to be fixed.

Fixing any older release (rhel-8.7/rhel-9.1 or older) requires following the Z-stream process, which, given the severity, would be appropriate. I will discuss that internally.


It would still be interesting if you could comment on how this issue affects you (or a RH customer), so that we get data about the severity/priority.

Comment 7 Thomas Haller 2022-10-20 13:25:05 UTC
(In reply to Thomas Haller from comment #6)

Rick, although you seem to reproduce the issue easily, Beniamino (who fixed the bug) was not able to reproduce it locally. It seems something about your setup is special. So whether the patch really fixes your issue (or any issue at all) is only a working assumption.

It would be very useful if you could test either the rhel-8.8 or the rhel-9.2 package and see whether the issue is avoided. That might be particularly relevant if we do a Z-stream fix for this bug.

Is that cumbersome for you to do?

Comment 8 Rick Alongi 2022-10-20 17:47:38 UTC
Hi Thomas,

I saw that compose RHEL-9.2.0-20221019.2 contains NetworkManager-1.41.3-1.el9 so I just ran a beaker job using that compose.  I did not see the failure where a connection is stuck in "activating" state so it may be that the fix in question does address the problem.

I'd like to run multiple iterations of the test using a script, on one system running RHEL-9.0 and one running RHEL-9.2.0-20221019.2, to see whether I can still reproduce the issue on the former and confirm that it does not occur on the latter. I'd also like to run similar tests using a RHEL-8.8 compose that contains the fix when it becomes available (the latest stable compose for RHEL-8.8 is RHEL-8.8.0-20221017.2, and that does not appear to have the newer NetworkManager packages yet).
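The multi-iteration run described above can be tallied with two small helpers. This is a hypothetical sketch: it assumes each boot's check_config.sh output is classified once and appended to a results file, and it leaves out how the power cycles and the post-boot check are scheduled. The function names are invented.

```shell
#!/bin/bash
# classify_run: read check_config.sh output on stdin; print "fail" if any
# check printed a line that is exactly FAIL, otherwise "pass".
classify_run() {
    if grep -qx 'FAIL'; then echo fail; else echo pass; fi
}

# summarize: read one "pass"/"fail" word per line (one line per boot) and
# print the number of runs and how many of them failed.
summarize() {
    awk '{ n[$1]++ }
         END { printf "runs=%d fail=%d\n", n["pass"] + n["fail"], n["fail"] }'
}
```

For example, after each boot `./check_config.sh | classify_run >> /root/ovs-results.txt` (hypothetical path), and after the last iteration `summarize < /root/ovs-results.txt`. Comparing the failure counts from the RHEL-9.0 and RHEL-9.2 systems then gives a direct before/after view of the fix.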

I'll let you know what I find.

Thanks,
Rick

Comment 28 errata-xmlrpc 2023-05-16 09:04:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2968

