+++ This bug was initially created as a clone of Bug #2055097 +++

Description of problem:

Version-Release number of selected component (if applicable):
ovn21.12-central-21.12.0-25.el8fdp.x86_64
ovn21.12-vtep-21.12.0-25.el8fdp.x86_64
ovn21.12-21.12.0-25.el8fdp.x86_64
ovn21.12-host-21.12.0-25.el8fdp.x86_64

Context: hypershift OVN runs the OVN nbdb and sbdb as a statefulset. Assume the ovndb statefulset pods ovnkube-master-guest-0/1/2 formed the quorum and guest-1 is the NB leader. Delete both the guest-0 and guest-1 pods; guest-2 becomes leader. Because a statefulset is used, guest-0 is re-created first (guest-1 has to wait until guest-0 is ready, and a guest pod's DNS name/hostname is only resolvable while the pod is running). guest-0 finds the new leader guest-2 and then starts nb with the following command (local=guest-0, remote=guest-2):

###
+ echo 'Cluster already exists for DB: nb'
+ initial_raft_create=false
+ wait 71
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=ovnkube-master-guest-0.ovnkube-master-guest.hypershift-ovn.svc.cluster.local --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt --db-nb-cluster-remote-port=9643 --db-nb-cluster-remote-addr=ovnkube-master-guest-2.ovnkube-master-guest.hypershift-ovn.svc.cluster.local --db-nb-cluster-remote-proto=ssl '--ovn-nb-log=-vconsole:dbg -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m' --db-nb-election-timer=10000 run_nb_ovsdb
2022-02-16T03:05:25.330Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
ovsdb-server: ovsdb error: error reading record 12 from OVN_Northbound log: ssl:ovnkube-master-guest-1.ovnkube-master-guest.hypershift-ovn.svc.cluster.local:9643: syntax error in address
[1]+  Exit 1  exec /usr/share/ovn/scripts/ovn-ctl ${OVN_ARGS} --db-nb-cluster-remote-port=9643 --db-nb-cluster-remote-addr=${init_ip} --db-nb-cluster-remote-proto=ssl --ovn-nb-log="-vconsole:${OVN_LOG_LEVEL} -vfile:off -vPATTERN:console:${OVN_LOG_PATTERN_CONSOLE}" ${election_timer} run_nb_ovsdb
###

guest-0 failed because the guest-1 hostname is not resolvable ("syntax error in address").

Expected results:
guest-0 keeps reconnecting until guest-1 becomes running.
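A quick, hedged way to see the triggering condition from guest-0's side (not part of the original report; the hostnames are the ones quoted above): check which peer names actually resolve at the moment ovn-ctl runs.

# Only guest-2 should resolve at this point in the restart ordering; guest-1
# will not, because its pod is not running yet.
for peer in ovnkube-master-guest-1.ovnkube-master-guest.hypershift-ovn.svc.cluster.local \
            ovnkube-master-guest-2.ovnkube-master-guest.hypershift-ovn.svc.cluster.local; do
    getent hosts "$peer" >/dev/null || echo "$peer does not resolve yet"
done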
* Wed Mar 30 2022 Open vSwitch CI <ovs-ci> - 2.15.0-88
- Merging upstream branch-2.15 [RH git: a03b5c62e4]
  Commit list:
  0a3867a9a9 ovsdb: raft: Fix inability to read the database with DNS host names. (#2055097)
Ilya,

Is this something that can be reproduced using openvswitch alone, or does it require a layered product?

Thanks,
Rick
(In reply to Rick Alongi from comment #4)
> Is this something that can be reproduced using openvswitch alone or does it
> require a layered product?

This requires a DNS server. The sequence should be something like this:

1. Create 3 DNS names for 3 servers.
2. Start the ovsdb cluster using these names instead of IP addresses.
3. Stop one of the servers and remove its DNS record.
4. Restart the 2 remaining servers; they should continue to work.

With the issue, the 2 remaining servers will fail to start because they will not be able to resolve the name of the third server.
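A rough standalone sketch of that sequence using only openvswitch's ovsdb-tool/ovsdb-server (host names such as node1.example.com, the /tmp paths, and port 6644 are placeholders, not taken from this report; the full OVN-based reproducer is in the comments below):

# On node1 (the DNS names must resolve on every node, via a DNS server or
# /etc/hosts entries):
ovsdb-tool create-cluster /tmp/ovs1.db /usr/share/openvswitch/vswitch.ovsschema tcp:node1.example.com:6644
ovsdb-server --detach --no-chdir --pidfile=/tmp/ovs1.pid --unixctl=/tmp/ovs1.ctl --log-file=/tmp/ovs1.log --remote=punix:/tmp/ovs1.sock /tmp/ovs1.db

# On node2 (and similarly node3), join the cluster using the DNS names:
ovsdb-tool join-cluster /tmp/ovs2.db Open_vSwitch tcp:node2.example.com:6644 tcp:node1.example.com:6644
ovsdb-server --detach --no-chdir --pidfile=/tmp/ovs2.pid --unixctl=/tmp/ovs2.ctl --log-file=/tmp/ovs2.log --remote=punix:/tmp/ovs2.sock /tmp/ovs2.db

# Then stop node1's ovsdb-server, remove node1's DNS record (or /etc/hosts
# entry) on node2/node3, and restart ovsdb-server there.  Without the fix,
# the restart fails while reading the on-disk cluster log
# ("syntax error in address").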
Reproducer/Verification steps can be found at https://bugzilla.redhat.com/show_bug.cgi?id=2055097#c16

This will be tested when FDP 22.D is ready for testing.
Reproducer/Verification steps below:

# Provision three systems with RHEL-8.6

# Install ovs and ovn packages without the fix included:
yum -y install \
    http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch-selinux-extra-policy/1.0/29.el8fdp/noarch/openvswitch-selinux-extra-policy-1.0-29.el8fdp.noarch.rpm \
    http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch2.15/2.15.0/87.el8fdp/x86_64/openvswitch2.15-2.15.0-87.el8fdp.x86_64.rpm \
    http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn-2021/21.12.0/30.el8fdp/x86_64/ovn-2021-21.12.0-30.el8fdp.x86_64.rpm \
    http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn-2021/21.12.0/30.el8fdp/x86_64/ovn-2021-central-21.12.0-30.el8fdp.x86_64.rpm \
    http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn-2021/21.12.0/30.el8fdp/x86_64/ovn-2021-host-21.12.0-30.el8fdp.x86_64.rpm

# Start ovs and ovn processes:
systemctl start openvswitch
systemctl enable openvswitch
systemctl enable ovn-controller
systemctl enable ovn-northd
systemctl start ovn-controller
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642

# Additional config for ovs:
yum -y install net-tools
host_ip=$(ifconfig -a | grep inet | head -n 1 | awk '{print $2}' | tr -d "addr:")
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:${host_ip}:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=${host_ip}
systemctl restart ovn-controller

# Configure three systems:
host1=netqe9.knqe.lab.eng.bos.redhat.com
host2=netqe20.knqe.lab.eng.bos.redhat.com
host3=netqe21.knqe.lab.eng.bos.redhat.com
host1_ip=$(nslookup $host1 | grep Address | grep -v '#53' | awk '{print $NF}')
host2_ip=$(nslookup $host2 | grep Address | grep -v '#53' | awk '{print $NF}')
host3_ip=$(nslookup $host3 | grep Address | grep -v '#53' | awk '{print $NF}')

# disable DNS client on each system:
mv -f /etc/resolv.conf /etc/resolv.conf_saved

# Create empty resolv.conf:
touch /etc/resolv.conf

# Test with no DNS or /etc/hosts configured:
rm -f ./hosts.txt
echo $host1 >> ./hosts.txt
echo $host2 >> ./hosts.txt
echo $host3 >> ./hosts.txt
ping_list=$(grep -v $(hostname) ./hosts.txt)
for i in $(echo $ping_list); do
    ping -c1 $i
    if [[ $? -ne 0 ]]; then
        echo "Ping of $i failed as expected: PASS"
    else
        echo "Ping of $i should have failed: FAIL"
    fi
done

# add host info to /etc/hosts file:
echo -e "$host1_ip\t$host1" >> /etc/hosts
echo -e "$host2_ip\t$host2" >> /etc/hosts
echo -e "$host3_ip\t$host3" >> /etc/hosts

# Test with /etc/hosts file configured:
for i in $(echo $ping_list); do
    ping -c1 $i
    if [[ $? -eq 0 ]]; then
        echo "Ping of $i was successful: PASS"
    else
        echo "Ping of $i was unsuccessful: FAIL"
    fi
done

# execute on $host1:
/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host1 --db-nb-create-insecure-remote=yes --db-sb-addr=$host1 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host1 --db-sb-cluster-local-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 start_northd

# execute on $host2:
/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host2 --db-nb-create-insecure-remote=yes --db-sb-addr=$host2 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host2 --db-sb-cluster-local-addr=$host2 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 start_northd

# execute on $host3:
/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host3 --db-nb-create-insecure-remote=yes --db-sb-addr=$host3 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host3 --db-sb-cluster-local-addr=$host3 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 start_northd

# stop nb_ovsdb on $host1:
/usr/share/ovn/scripts/ovn-ctl stop_nb_ovsdb

# delete $host1 entry from /etc/hosts file on $host2 and $host3:
sed -i "/$host1_ip/d" /etc/hosts

# Test on $host2 and $host3:
ping -c1 $host1
if [[ $? -ne 0 ]]; then
    echo "Ping of $host1 failed as expected: PASS"
else
    echo "Ping of $host1 should have failed: FAIL"
fi

# restart nb_ovsdb on $host2 and $host3:
# host2:
/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host2 --db-nb-create-insecure-remote=yes --db-sb-addr=$host2 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host2 --db-sb-cluster-local-addr=$host2 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 restart_nb_ovsdb | tee output.log

# host3:
/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host3 --db-nb-create-insecure-remote=yes --db-sb-addr=$host3 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host3 --db-sb-cluster-local-addr=$host3 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 restart_nb_ovsdb | tee output.log

sleep 150s

# Check for problem:
grep '/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)' /var/log/ovn/ovn-northd.log
grep 'Joining /etc/ovn/ovnnb_db.db to cluster' output.log | grep FAILED
grep 'Starting ovsdb-nb' output.log | grep FAILED
grep 'Waiting for OVN_Northbound to come up' output.log | grep FAILED

#######################

# Repro of problem:
openvswitch2.15-2.15.0-87.el8fdp
ovn-2021-21.12.0-30.el8fdp
ovn-2021-central-21.12.0-30.el8fdp
ovn-2021-host-21.12.0-30.el8fdp
RHEL-8.6.0-updates-20220510.0
kernel version: 4.18.0-372.9.1.el8.x86_64

[root@netqe20 ~]# /usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host2 --db-nb-create-insecure-remote=yes --db-sb-addr=$host2 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host2 --db-sb-cluster-local-addr=$host2 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 \
    --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 restart_nb_ovsdb | tee output.log
Exiting ovnnb_db (39208) [  OK  ]
Joining /etc/ovn/ovnnb_db.db to cluster
ovsdb-tool: ovsdb error: tcp:netqe9.knqe.lab.eng.bos.redhat.com:6643: syntax error in address [FAILED]
Starting ovsdb-nb
ovsdb-server: I/O error: /etc/ovn/ovnnb_db.db: open failed (No such file or directory) [FAILED]
Waiting for OVN_Northbound to come up
2022-05-10T13:05:02Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-05-10T13:05:02Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-05-10T13:05:03Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-05-10T13:05:03Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-05-10T13:05:03Z|00005|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: waiting 2 seconds before reconnect
2022-05-10T13:05:05Z|00006|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-05-10T13:05:05Z|00007|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-05-10T13:05:05Z|00008|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: waiting 4 seconds before reconnect
2022-05-10T13:05:09Z|00009|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-05-10T13:05:09Z|00010|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-05-10T13:05:09Z|00011|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: continuing to reconnect in the background but suppressing further logging
2022-05-10T13:05:32Z|00012|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/usr/share/openvswitch/scripts/ovs-lib: line 602: 39579 Alarm clock "$@" [FAILED]

[root@netqe20 ~]# grep '/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)' /var/log/ovn/ovn-northd.log
2022-05-10T13:03:26.276Z|00015|reconnect|INFO|unix:/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-05-10T13:03:28.279Z|00018|reconnect|INFO|unix:/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-05-10T13:03:32.283Z|00021|reconnect|INFO|unix:/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)

[root@netqe20 ~]# grep 'Joining /etc/ovn/ovnnb_db.db to cluster' output.log | grep FAILED
Joining /etc/ovn/ovnnb_db.db to cluster [FAILED]

[root@netqe20 ~]# grep 'Starting ovsdb-nb' output.log | grep FAILED
Starting ovsdb-nb [FAILED]

[root@netqe20 ~]# grep 'Waiting for OVN_Northbound to come up' output.log | grep FAILED
Waiting for OVN_Northbound to come up [FAILED]

[root@netqe20 ~]#

Verification of fix:

# Install ovs package with the fix included (FDP 22.D):
yum -y update http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch2.15/2.15.0/99.el8fdp/x86_64/openvswitch2.15-2.15.0-99.el8fdp.x86_64.rpm

[root@netqe20 ~]# rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-29.el8fdp.noarch
openvswitch2.15-2.15.0-99.el8fdp.x86_64

[root@netqe20 ~]# rpm -qa | grep ovn
ovn-2021-central-21.12.0-30.el8fdp.x86_64
ovn-2021-21.12.0-30.el8fdp.x86_64
ovn-2021-host-21.12.0-30.el8fdp.x86_64

# Follow steps outlined above to set up config for test.
# Database now restarts/reconnects without any problem:

[root@netqe21 ~]# /usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host3 --db-nb-create-insecure-remote=yes --db-sb-addr=$host3 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host3 --db-sb-cluster-local-addr=$host3 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 restart_nb_ovsdb
Exiting ovnnb_db (40605) [  OK  ]
Starting ovsdb-nb [  OK  ]
Waiting for OVN_Northbound to come up
2022-05-10T13:24:55Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-05-10T13:24:55Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
2022-05-10T13:25:25Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/usr/share/openvswitch/scripts/ovs-lib: line 602: 40829 Alarm clock "$@" [FAILED]

Note: Dev has said the intermittent alarm clock failure above is unrelated to this issue.
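An extra sanity check, not part of the original verification steps: after the restart, the RAFT role, current leader, and cluster membership of the local NB server can be inspected through its control socket. The ctl path below is the default used when the database is started via ovn-ctl and is an assumption here.

# Show cluster status of the local OVN_Northbound server (assumed default
# ctl path for an ovn-ctl managed nb_ovsdb):
ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound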
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: openvswitch2.15 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:4787