Created attachment 1863849
OVS and journalctl logs

Description of problem:
NMCLI OVS connections intermittently get stuck in "activating" state after a power cycle.

Version-Release number of selected component (if applicable):
Compose: RHEL-8.6.0-20220227.2
Kernel: 4.18.0-369.el8.x86_64

[root@netqe9 ~]# rpm -qa | grep NetworkManager
NetworkManager-libnm-1.36.0-1.el8.x86_64
NetworkManager-tui-1.36.0-1.el8.x86_64
NetworkManager-ovs-1.36.0-1.el8.x86_64
NetworkManager-team-1.36.0-1.el8.x86_64
NetworkManager-1.36.0-1.el8.x86_64

How reproducible:
Intermittent (~50% of the time)

Steps to Reproduce:
1. Provision system with RHEL-8.6
2. Install openvswitch2.15 or openvswitch2.17
3. Start/Enable openvswitch.service
4. Create OVS based NMCLI connections
5. Power cycle system via rhts-power command
6. After system comes back, note one of the connections is stuck in "activating"
7. More detailed steps in Additional info section below

Actual results:
NMCLI connection is stuck in activating state.

Expected results:
NMCLI connections become activated after a power cycle without having to manually enter "nmcli con up <connection>" to get them to the activated state.

Additional info:
- This behavior was not observed when using RHEL-8.4 or 8.5.
- Problem has been observed with RHEL-8.6 when using both openvswitch2.15 and openvswitch2.17.
- Problem does not happen after a software reboot (rhts-reboot)
- TRACE enabled for NetworkManager with journalctl -b output attached
- ovs_vswitchd.log, ovsdb-server.log and journalctl.log attached to this BZ
- sos report located here: http://netqe-infra01.knqe.lab.eng.bos.redhat.com/sosreports/sosreport-netqe9-2022-03-02-bdpdsim.tar.xz
- Example beaker job: https://beaker.engineering.redhat.com/jobs/6358018

# Steps to reproduce:

- Provision system with RHEL-8.6
- Create scripts below (and chmod +x) on provisioned system:

[root@netqe9 ~]# cat rhts_power_install.sh
#!/bin/bash

cat >/usr/bin/rhts-power <<EOF
#!/bin/bash
curl --insecure \\
  --header "Content-Type: text/xml" \\
  --data "<?xml version=\"1.0\"?>
<methodCall>
  <methodName>power</methodName>
  <params>
    <param>
      <value><string>\$(hostname)</string></value>
    </param>
    <param>
      <value><string>reboot</string></value>
    </param>
  </params>
</methodCall>" \\
  http://\${LAB_CONTROLLER}:8000/RPC2
EOF

chmod 755 /usr/bin/rhts-power

[root@netqe9 ~]# cat setup.sh
#!/bin/bash

RPM_OVS=${RPM_OVS:-"http://netqe-infra01.knqe.lab.eng.bos.redhat.com/repo/packages/openvswitch2.17/el8/openvswitch2.17-2.17.0-0.2.el8fdp.x86_64.rpm"}
ovsbr1=ovsbr1
ovsbr2=ovsbr2
vlan_id=10
ovsbr1_ip4addr=192.168.58.2
ovsbr1_ip6addr=2014:58::2
ovsbr2_ip4addr=192.168.78.2
ovsbr2_ip6addr=2014:78::2

function nmcli-install {
    yum -y install NetworkManager-ovs
    sed -i 's/#level=TRACE/level=TRACE/g' /etc/NetworkManager/NetworkManager.conf
    systemctl daemon-reload
    systemctl restart NetworkManager
}

function ovs-static-config {
    ovs-vsctl --if-exists del-br $ovsbr1
    nmcli c add type ovs-bridge conn.interface $ovsbr1 con-name $ovsbr1
    nmcli c add type ovs-port conn.interface $ovsbr1 master $ovsbr1 con-name ovs-port-$ovsbr1
    nmcli c add type ovs-interface slave-type ovs-port conn.interface $ovsbr1 master ovs-port-$ovsbr1 con-name ovs-if-$ovsbr1 ipv4.method static ipv4.address $ovsbr1_ip4addr/24 ipv6.method static ipv6.address $ovsbr1_ip6addr/64
    nmcli con up ovs-if-$ovsbr1
    nmcli con up ovs-port-$ovsbr1
    nmcli con up $ovsbr1
}

function ovs-static-config-vlan {
    ovs-vsctl --if-exists del-br $ovsbr2
    nmcli c add type ovs-bridge conn.interface $ovsbr2 con-name $ovsbr2
    nmcli c add type ovs-port conn.interface vlan$vlan_id master $ovsbr2 ovs-port.tag $vlan_id con-name ovs-port-vlan$vlan_id
    nmcli c add type ovs-interface slave-type ovs-port conn.interface vlan$vlan_id master ovs-port-vlan$vlan_id con-name ovs-if-vlan$vlan_id ipv4.method static ipv4.address $ovsbr2_ip4addr/24 ipv6.method static ipv6.address $ovsbr2_ip6addr/64
    nmcli con up ovs-if-vlan$vlan_id
    nmcli con up ovs-port-vlan$vlan_id
    nmcli con up $ovsbr2
}

function check-config {
    ovsbr1=ovsbr1
    ovsbr2=ovsbr2
    vlan_id=10
    ovsbr1_ip4addr=192.168.58.2
    ovsbr1_ip6addr=2014:58::2
    ovsbr2_ip4addr=192.168.78.2
    ovsbr2_ip6addr=2014:78::2
    output_file="/home/ip_output.txt"
    rm -f $output_file
    ip a | tee -a $output_file
    if [[ ! $(grep "$ovsbr1_ip4addr" $output_file) ]]; then echo "FAIL"; else echo "PASS"; fi
    if [[ ! $(grep "$ovsbr1_ip6addr" $output_file) ]]; then echo "FAIL"; else echo "PASS"; fi
    if [[ ! $(grep "$ovsbr2_ip4addr" $output_file) ]]; then echo "FAIL"; else echo "PASS"; fi
    if [[ ! $(grep "$ovsbr2_ip6addr" $output_file) ]]; then echo "FAIL"; else echo "PASS"; fi
    if [[ $(nmcli con show ovs-if-$ovsbr1 | grep 'GENERAL.STATE' | awk '{print $NF}') != activated ]]; then
        echo "FAIL"
    else
        echo "PASS"
    fi
    if [[ $(nmcli con show ovs-if-vlan$vlan_id | grep 'GENERAL.STATE' | awk '{print $NF}') != activated ]]; then
        echo "FAIL"
    else
        echo "PASS"
    fi
}

function beaker-install {
    echo "sslverify=false" >> /etc/yum.conf

    # install wget in case it's missing
    yum -y install wget

    # install beaker-client.repo
    wget -O /etc/yum.repos.d/beaker-client.repo http://download.lab.bos.redhat.com/beakerrepos/beaker-client-RedHatEnterpriseLinux.repo

    # create beaker-tasks.repo file
    (
        echo [beaker-tasks]
        echo name=beaker-tasks
        echo baseurl=http://beaker.engineering.redhat.com/rpms
        echo enabled=1
        echo gpgcheck=0
        echo skip_if_unavailable=1
    ) > /etc/yum.repos.d/beaker-tasks.repo

    # create beaker-harness.repo file
    (
        echo [beaker-harness]
        echo name=beaker-harness
        echo baseurl=http://download.eng.bos.redhat.com/beakerrepos/harness-testing/RedHatEnterpriseLinux8/
        echo enabled=1
        echo gpgcheck=0
        echo skip_if_unavailable=1
    ) > /etc/yum.repos.d/beaker-harness.repo

    # install beaker related packages
    yum -y install rhts-test-env beakerlib rhts-devel rhts-python beakerlib-redhat.noarch beaker-client beaker-redhat
}

yum -y install http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch-selinux-extra-policy/1.0/29.el8fdp/noarch/openvswitch-selinux-extra-policy-1.0-29.el8fdp.noarch.rpm
yum -y install $RPM_OVS
systemctl start openvswitch && systemctl enable openvswitch
beaker-install
nmcli-install
ovs-static-config
ovs-static-config-vlan
sleep 5
check-config

[root@netqe9 ~]# cat check_config.sh
#!/bin/bash

function check-config {
    ovsbr1=ovsbr1
    ovsbr2=ovsbr2
    vlan_id=10
    ovsbr1_ip4addr=192.168.58.2
    ovsbr1_ip6addr=2014:58::2
    ovsbr2_ip4addr=192.168.78.2
    ovsbr2_ip6addr=2014:78::2
    output_file="/home/ip_output.txt"
    rm -f $output_file
    ip a | tee -a $output_file
    if [[ ! $(grep "$ovsbr1_ip4addr" $output_file) ]]; then echo "FAIL"; else echo "PASS"; fi
    if [[ ! $(grep "$ovsbr1_ip6addr" $output_file) ]]; then echo "FAIL"; else echo "PASS"; fi
    if [[ ! $(grep "$ovsbr2_ip4addr" $output_file) ]]; then echo "FAIL"; else echo "PASS"; fi
    if [[ ! $(grep "$ovsbr2_ip6addr" $output_file) ]]; then echo "FAIL"; else echo "PASS"; fi
    if [[ $(nmcli con show ovs-if-$ovsbr1 | grep 'GENERAL.STATE' | awk '{print $NF}') != activated ]]; then
        echo "FAIL"
    else
        echo "PASS"
    fi
    if [[ $(nmcli con show ovs-if-vlan$vlan_id | grep 'GENERAL.STATE' | awk '{print $NF}') != activated ]]; then
        echo "FAIL"
    else
        echo "PASS"
    fi
}

check-config

- Run scripts on system:

./rhts_power_install.sh
./setup.sh

After config is in place via setup.sh, power cycle system using rhts-power command:

[root@netqe9 ~]# rhts-power
<?xml version='1.0'?>
<methodResponse>
<params>
<param>
<value><string>netqe9.knqe.lab.eng.bos.redhat.com</string></value>
</param>
</params>
</methodResponse>
[root@netqe9 ~]#

After system comes back up after power cycle, run check_config.sh:

[root@netqe9 ~]# ./check_config.sh
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp130s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 3c:fd:fe:a7:37:54 brd ff:ff:ff:ff:ff:ff
3: enp4s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether f4:e9:d4:ed:aa:64 brd ff:ff:ff:ff:ff:ff
4: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 44:a8:42:32:0c:bd brd ff:ff:ff:ff:ff:ff
5: enp130s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 3c:fd:fe:a7:37:55 brd ff:ff:ff:ff:ff:ff
6: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 44:a8:42:32:0c:bf brd ff:ff:ff:ff:ff:ff
7: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 44:a8:42:32:0c:c1 brd ff:ff:ff:ff:ff:ff
    inet 10.19.15.45/24 brd 10.19.15.255 scope global dynamic noprefixroute eno3
       valid_lft 86012sec preferred_lft 86012sec
    inet6 2620:52:0:130f:46a8:42ff:fe32:cc1/64 scope global dynamic noprefixroute
       valid_lft 2591978sec preferred_lft 604778sec
    inet6 fe80::46a8:42ff:fe32:cc1/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
8: eno4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 44:a8:42:32:0c:c3 brd ff:ff:ff:ff:ff:ff
9: enp132s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether a0:36:9f:75:08:90 brd ff:ff:ff:ff:ff:ff
10: enp132s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether a0:36:9f:75:08:92 brd ff:ff:ff:ff:ff:ff
11: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether be:a3:8e:fa:3b:4b brd ff:ff:ff:ff:ff:ff
12: ovsbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether ee:14:ac:99:53:4b brd ff:ff:ff:ff:ff:ff
13: ovsbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 5e:e0:37:38:a5:45 brd ff:ff:ff:ff:ff:ff
15: vlan10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether ee:06:34:95:c2:97 brd ff:ff:ff:ff:ff:ff
    inet 192.168.78.2/24 brd 192.168.78.255 scope global noprefixroute vlan10
       valid_lft forever preferred_lft forever
    inet6 2014:78::2/64 scope global noprefixroute
       valid_lft forever preferred_lft forever
    inet6 fe80::5517:311e:5a93:832/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
16: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 52:54:00:e0:37:3f brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
       valid_lft forever preferred_lft forever
FAIL
FAIL
PASS
PASS
FAIL
PASS

# Note that NMCLI connection is stuck in activating state:

[root@netqe9 ~]# nmcli con show ovs-if-ovsbr1 | grep 'GENERAL.STATE' | awk '{print $NF}'
activating

# When reproducing this issue over many attempts, both of the connections have reported this problem individually but never both at the same time.
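Because the state check above runs once, at an arbitrary point after boot, it cannot tell a slow activation from one that is permanently stuck. The following is a minimal sketch of a polling variant, assuming the same "nmcli con show" output format shown above. The function names parse_general_state and wait_for_activated and the 60-attempt default are illustrative and not part of the original scripts.

```shell
#!/bin/bash
# Sketch only: helper names and the default retry count are assumptions.

# Read `nmcli con show <name>` output on stdin and print the value of the
# GENERAL.STATE field (prints nothing if the connection is not active).
parse_general_state() {
    awk '$1 == "GENERAL.STATE:" { print $NF }'
}

# Poll a connection once per second until it reports "activated".
# $1 = connection name, $2 = maximum number of attempts (default 60).
# Returns 0 once activated, 1 if it never leaves another state.
wait_for_activated() {
    local con=$1 tries=${2:-60} state=""
    while [ "$tries" -gt 0 ]; do
        state=$(nmcli con show "$con" 2>/dev/null | parse_general_state)
        [ "$state" = "activated" ] && return 0
        tries=$((tries - 1))
        sleep 1
    done
    echo "$con: still '${state:-inactive}' after polling" >&2
    return 1
}

# Example (not executed here):
#   wait_for_activated ovs-if-ovsbr1 120 || echo "FAIL"
```

A connection that is still "activating" after a generous polling window is a much stronger indication of the stuck state described here than a single sample taken shortly after boot.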
Still seeing this issue using RHEL-8.6 with openvswitch2.17. This time it is happening after a forced crash as part of a test and can be reproduced manually. Beaker job link: https://beaker.engineering.redhat.com/jobs/6611291
Still seeing this issue in FDP 22.J testing using RHEL-8.6 (RHEL-8.6.0-updates-20221014.0) with openvswitch2.15-2.15.0-124.el8fdp and openvswitch2.17-2.17.0-58.el8fdp:

[root@netqe40 ~]# rpm -qa | grep NetworkManager
NetworkManager-libnm-1.36.0-9.el8_6.x86_64
NetworkManager-tui-1.36.0-9.el8_6.x86_64
NetworkManager-ovs-1.36.0-9.el8_6.x86_64
NetworkManager-team-1.36.0-9.el8_6.x86_64
NetworkManager-1.36.0-9.el8_6.x86_64

[root@netqe40 ~]# uname -r
4.18.0-372.32.1.el8_6.x86_64
Rick, sorry for taking so long to reply. Thank you for being persistent and continuing to ping the rhbz :)

This looks to me as if it could be fixed by https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/4f60fe293cd5461c47d218b632753ecdfb50cbab.

@Beniamino, what do you think?
This indeed seems to be fixed upstream by [1]. [1] got backported to the upstream nm-1-40 branch as [2], and [2] was released upstream as 1.40.2. rhel-8.8 is about to get version NetworkManager-1.40.2-1.el8, which contains [2].

[1] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/4f60fe293cd5461c47d218b632753ecdfb50cbab
[2] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/f702be2992f0f34c82e96b420947f9056a4cb24e

This should be fixed by NetworkManager-1.40.2-1.el8. If possible, please try that package. Thanks for the report!!
Hi Thomas,

I will test this with NetworkManager-1.40.2-1.el8 as soon as it is merged into RHEL-8.8.

I should mention that I am also seeing this same issue with RHEL-9.0:

[root@netqe40 ~]# rpm -qa | grep NetworkManager
NetworkManager-libnm-1.36.0-5.el9_0.x86_64
NetworkManager-1.36.0-5.el9_0.x86_64
NetworkManager-team-1.36.0-5.el9_0.x86_64
NetworkManager-tui-1.36.0-5.el9_0.x86_64
NetworkManager-ovs-1.36.0-5.el9_0.x86_64

Do you know if there is also a fix available for NetworkManager for RHEL-9.0? Would it make sense for me to log a separate BZ to track this issue for RHEL-9.0?

Thanks!
Rick
The fix [1] is on the upstream main branch and is part of upstream 1.41.3, which is about to come to rhel-9.2 as "NetworkManager-1.41.3-1.el9".

> [1] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/4f60fe293cd5461c47d218b632753ecdfb50cbab

> Do you know if there is also a fix available for NetworkManager for RHEL-9.0?

This rhbz tracks development for upcoming RHEL releases (in this case, rhel-8.8), where the issue is about to be fixed. rhel-9.2 is also about to be fixed. Fixing any older release (rhel-8.7/rhel-9.1 or older) requires following the Z-stream process, which, given the severity, would be appropriate. I will discuss that internally.

It would still be interesting if you could comment on how this issue affects you (or a RH customer), so we get data about the severity/priority.
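For test scripts that need to know whether a host is expected to carry the fix, a rough version gate can be sketched as follows. It assumes GNU sort (for -V) and treats upstream 1.40.2 as the threshold named in this report. The helper names are illustrative, and the comparison is only a heuristic: it knows nothing about Z-stream backports, and development snapshots on main before 1.41.3 would pass the check without necessarily containing the commit.

```shell
#!/bin/bash
# Sketch only: helper names are assumptions; 1.40.2 is the first upstream
# release on the nm-1-40 branch that this report names as containing the fix.

# Return 0 if version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the upstream version of the installed NetworkManager package.
nm_installed_version() {
    rpm -q --qf '%{VERSION}\n' NetworkManager
}

# Return 0 if the given version is at or above the 1.40.2 threshold.
nm_expected_fixed() {
    version_ge "$1" 1.40.2
}

# Example (not executed here):
#   nm_expected_fixed "$(nm_installed_version)" || echo "fix not expected on this host"
```

With the versions quoted in this report, 1.36.0 falls below the threshold, while 1.40.2 and 1.41.3 are at or above it.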
(In reply to Thomas Haller from comment #6)

Rick, although you seem to reproduce the issue easily, Beniamino (who fixed the bug) was not able to reproduce it locally. It seems something is special about your setup. So it is only a working assumption that the patch really fixes your issue (or any issue at all).

It would be very useful if you could test either the rhel-8.8 or the rhel-9.2 package and see whether the issue is avoided. That is particularly relevant if we are to do a Z-stream fix for this bug. Is that cumbersome for you to do?
Hi Thomas, I saw that compose RHEL-9.2.0-20221019.2 contains NetworkManager-1.41.3-1.el9 so I just ran a beaker job using that compose. I did not see the failure where a connection is stuck in "activating" state so it may be that the fix in question does address the problem. I'd like to run multiple iterations of the test using a script on a system using RHEL-9.0 and one using RHEL-9.2.0-20221019.2 to see if I can reproduce the issue and also see no occurrences of the issue. I'd also like to run similar tests using a RHEL-8.8 compose that contains the fix when it becomes available (the latest stable compose for RHEL-8.8 is RHEL-8.8.0-20221017.2 and that does not appear to have the newer NetworkManager packages yet). I'll let you know what I find. Thanks, Rick
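Since every iteration involves a power cycle, results have to survive the reboot. One possible way to keep a tally is sketched below, assuming each iteration pipes the output of check_config.sh into a recorder after boot. The function names and the tally file location are hypothetical, and how the check is triggered after each boot (beaker task, systemd unit, and so on) is outside this sketch.

```shell
#!/bin/bash
# Sketch only: function names and the tally file path are assumptions.

# Read one check_config.sh run on stdin; print FAIL if any line is exactly
# "FAIL", otherwise PASS.
classify_run() {
    if grep -qx 'FAIL'; then echo FAIL; else echo PASS; fi
}

# Append the verdict for one run (read from stdin) to the tally file $1.
record_run() {
    classify_run >> "$1"
}

# Print how many recorded runs in tally file $1 contained a failure.
summarize_runs() {
    local total failed
    total=$(grep -c '' "$1")
    failed=$(grep -cx 'FAIL' "$1")
    echo "failed $failed of $total runs"
}

# Example (not executed here):
#   ./check_config.sh | record_run /root/ovs_powercycle_tally.txt
#   summarize_runs /root/ovs_powercycle_tally.txt
```

Comparing the summary from a RHEL-9.0 host against one from a host running the newer NetworkManager would show both the reproduction rate and whether the fixed build stays at zero failures.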
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:2968