Description of problem:

Version-Release number of selected component (if applicable):
ovn21.12-central-21.12.0-25.el8fdp.x86_64
ovn21.12-vtep-21.12.0-25.el8fdp.x86_64
ovn21.12-21.12.0-25.el8fdp.x86_64
ovn21.12-host-21.12.0-25.el8fdp.x86_64

Context: hypershift OVN runs the OVN nbdb and sbdb as a statefulset.

Assume the ovndb statefulset pods ovnkube-master-guest-0/1/2 have formed the quorum and guest-1 is the nb leader. Delete both the guest-0 and guest-1 pods; guest-2 becomes the leader. Because a statefulset is used, guest-0 gets re-created first (guest-1 has to wait until guest-0 is ready, and a guest pod's DNS hostname is only resolvable once the pod is running). guest-0 finds the new leader guest-2, then starts nb with the following command (local=guest-0, remote=guest-2):

###
+ echo 'Cluster already exists for DB: nb'
+ initial_raft_create=false
+ wait 71
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=ovnkube-master-guest-0.ovnkube-master-guest.hypershift-ovn.svc.cluster.local --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt --db-nb-cluster-remote-port=9643 --db-nb-cluster-remote-addr=ovnkube-master-guest-2.ovnkube-master-guest.hypershift-ovn.svc.cluster.local --db-nb-cluster-remote-proto=ssl '--ovn-nb-log=-vconsole:dbg -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m' --db-nb-election-timer=10000 run_nb_ovsdb
2022-02-16T03:05:25.330Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
ovsdb-server: ovsdb error: error reading record 12 from OVN_Northbound log: ssl:ovnkube-master-guest-1.ovnkube-master-guest.hypershift-ovn.svc.cluster.local:9643: syntax error in address
[1]+  Exit 1  exec /usr/share/ovn/scripts/ovn-ctl ${OVN_ARGS} --db-nb-cluster-remote-port=9643 --db-nb-cluster-remote-addr=${init_ip} --db-nb-cluster-remote-proto=ssl --ovn-nb-log="-vconsole:${OVN_LOG_LEVEL} -vfile:off -vPATTERN:console:${OVN_LOG_PATTERN_CONSOLE}" ${election_timer} run_nb_ovsdb
###

guest-0 failed because the guest-1 hostname is not yet resolvable ("syntax error in address").

Expected results:
guest-0 keeps retrying the connection until guest-1 becomes running.
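For reference, a possible startup-side workaround is to delay the exec until the peer's DNS name resolves. This is only a hedged sketch, not the ovnkube startup script; `wait_for_dns` and its arguments are illustrative assumptions. It papers over the startup window only — the actual fix is to make ovsdb's raft code accept DNS host names.

```shell
#!/bin/sh
# Hypothetical workaround sketch (not the ovnkube startup script): before
# exec'ing ovn-ctl with a DNS-named cluster remote, wait until the name
# resolves, so ovsdb-server never sees an unresolvable address at startup.
wait_for_dns() {
    name=$1
    tries=${2:-60}
    i=0
    # getent consults the system resolver (NSS), same as ovsdb-server would.
    until getent hosts "$name" >/dev/null 2>&1; do
        i=$((i + 1))
        if [ "$i" -ge "$tries" ]; then
            echo "DNS name $name did not resolve after $tries attempts" >&2
            return 1
        fi
        sleep 1
    done
}
```

Usage would look like `wait_for_dns ovnkube-master-guest-1.ovnkube-master-guest.hypershift-ovn.svc.cluster.local && exec /usr/share/ovn/scripts/ovn-ctl ... run_nb_ovsdb`.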
The "syntax error in address" is also seen while the database is running (not only during the initial read) in the db pods. In that case it is not a fatal error: the pod hostname is eventually resolved once the statefulset pod is fully running. It would be good to fix that case too.

###
2022-02-16T13:02:36.505Z|00633|raft|INFO|ssl:ovnkube-master-guest-0.ovnkube-master-guest.hypershift-ovn.svc.cluster.local:9643: ovsdb error: ssl:ovnkube-master-guest-2.ovnkube-master-guest.hypershift-ovn.svc.cluster.local:9643: syntax error in address
###
ovsdb-tool reports the same error, which makes joining the db to the cluster fail; this happens when the ovndb statefulset pod gets created:

###
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=ovnkube-master-guest-2.ovnkube-master-guest.hypershift-ovn.svc.cluster.local --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt --db-nb-cluster-remote-port=9643 --db-nb-cluster-remote-addr=ovnkube-master-guest-0.ovnkube-master-guest.hypershift-ovn.svc.cluster.local --db-nb-cluster-remote-proto=ssl '--ovn-nb-log=-vconsole:dbg -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m' run_nb_ovsdb
Joining /etc/ovn/ovnnb_db.db to cluster ... failed!
ovsdb-tool: ovsdb error: ssl:ovnkube-master-guest-0.ovnkube-master-guest.hypershift-ovn.svc.cluster.local:9643: syntax error in address
2022-02-22T12:43:25.348Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
ovsdb-server: I/O error: /etc/ovn/ovnnb_db.db: open failed (No such file or directory)
[1]+  Exit 1  exec /usr/share/ovn/scripts/ovn-ctl ${OVN_ARGS} --db-nb-cluster-remote-port=9643 --db-nb-cluster-remote-addr=${init_ip} --db-nb-cluster-remote-proto=ssl --ovn-nb-log="-vconsole:${OVN_LOG_LEVEL} -vfile:off -vPATTERN:console:${OVN_LOG_PATTERN_CONSOLE}" run_nb_ovsdb
###
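The failures above appear to share one root cause: the address validation used when reading the raft log and when joining a cluster accepted only numeric IP literals in the host field, so a DNS name like ssl:ovnkube-master-guest-0...:9643 was rejected as a syntax error. Below is a rough, purely illustrative shell approximation of that accept/reject behaviour — the function name and the regex are assumptions for illustration, not OVS code.

```shell
#!/bin/sh
# Illustrative approximation only (NOT the actual OVS parser): accept
# proto:host:port only when host is a numeric IPv4 literal or a bracketed
# IPv6 literal, mimicking the pre-fix rejection of DNS names.
old_parser_accepts() {
    addr=$1
    case $addr in
        tcp:*|ssl:*) ;;          # only stream protocols considered here
        *) return 1 ;;
    esac
    hostport=${addr#*:}          # strip "tcp:" / "ssl:"
    host=${hostport%:*}          # strip ":port"
    printf '%s\n' "$host" | grep -Eq '^([0-9.]+|\[[0-9a-fA-F:]+\])$'
}
```

Under this approximation, `old_parser_accepts tcp:10.73.145.9:6643` succeeds while `old_parser_accepts ssl:ovnkube-master-guest-0.ovnkube-master-guest.hypershift-ovn.svc.cluster.local:9643` fails, matching the "syntax error in address" seen above; the upstream fix makes the raft code accept DNS host names instead of rejecting them.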
Posted for review: https://patchwork.ozlabs.org/project/openvswitch/patch/20220310223317.4023737-1-i.maximets@ovn.org/
I created a scratch build with the fix here: http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.16/2.16.0/63.bz2055097.el8fdp/ @zshi, could you, please, test it in your setup?
(In reply to Ilya Maximets from comment #4)
> I created a scratch build with the fix here:
>
> http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.16/2.16.0/63.bz2055097.el8fdp/
>
> @zshi, could you, please, test it in your setup?

Forgot to update the bz. I tested the above build in a hypershift deployment and didn't see the syntax error failure. Logs captured in the ovn db pods: https://gist.github.com/b29101296e72efdd58126cc43b07abd1
* Wed Mar 30 2022 Open vSwitch CI <ovs-ci> - 2.16.0-64
- Merging upstream branch-2.16 [RH git: 32008eb008]

Commit list:
1570924c3f ovsdb: raft: Fix inability to read the database with DNS host names. (#2055097)
@Ilya Do you know if the fix is included in ovs version 2.17.0-8.el8fdp (the latest ovnk[1] moved to the 2.17 ovs version)?

[1]: https://github.com/openshift/ovn-kubernetes/blob/master/Dockerfile#L35
Ilya,

Is this something that can be reproduced using openvswitch alone, or does it require a layered product?

Thanks,
Rick
(In reply to zenghui.shi from comment #10)
> @Ilya Do you know if the fix is included in ovs version 2.17.0-8.el8fdp
> (latest ovnk[1] moved to use 2.17 ovs version)?
>
> [1]: https://github.com/openshift/ovn-kubernetes/blob/master/Dockerfile#L35

Hi. The fix is in the build, but I'm not sure at this point if it will be included in the official errata. But the code is there anyway.

(In reply to Rick Alongi from comment #11)
> Is this something that can be reproduced using openvswitch alone or does it
> require a layered product?

Replied to this question here:
https://bugzilla.redhat.com/show_bug.cgi?id=2070343#c6
(In reply to Ilya Maximets from comment #13)
> (In reply to zenghui.shi from comment #10)
> > @Ilya Do you know if the fix is included in ovs version 2.17.0-8.el8fdp
> > (latest ovnk[1] moved to use 2.17 ovs version)?
> >
> > [1]: https://github.com/openshift/ovn-kubernetes/blob/master/Dockerfile#L35
>
> Hi. The fix is in the build, but I'm not sure at this point if it will be
> included in the official errata. But the code is there anyway.

Ok, so the code is in build 2.17.0-8.el8fdp? If so, it is already used in the ovnk image.
(In reply to zenghui.shi from comment #14)
> Ok, so the code is in build 2.17.0-8.el8fdp?

Correct.
Reproducer/Verification steps below:

# Provision three systems with RHEL-8.6

# Install ovs and ovn packages without the fix included:
yum -y install \
  http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch-selinux-extra-policy/1.0/29.el8fdp/noarch/openvswitch-selinux-extra-policy-1.0-29.el8fdp.noarch.rpm \
  http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch2.16/2.16.0/58.el8fdp/x86_64/openvswitch2.16-2.16.0-58.el8fdp.x86_64.rpm \
  http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn-2021/21.12.0/30.el8fdp/x86_64/ovn-2021-21.12.0-30.el8fdp.x86_64.rpm \
  http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn-2021/21.12.0/30.el8fdp/x86_64/ovn-2021-central-21.12.0-30.el8fdp.x86_64.rpm \
  http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn-2021/21.12.0/30.el8fdp/x86_64/ovn-2021-host-21.12.0-30.el8fdp.x86_64.rpm

# Start ovs and ovn processes:
systemctl start openvswitch
systemctl enable openvswitch
systemctl enable ovn-controller
systemctl enable ovn-northd
systemctl start ovn-controller
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642

# Additional config for ovs:
yum -y install net-tools
host_ip=$(ifconfig -a | grep inet | head -n 1 | awk '{print $2}' | tr -d "addr:")
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:${host_ip}:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=${host_ip}
systemctl restart ovn-controller

# Configure three systems:
host1=netqe9.knqe.lab.eng.bos.redhat.com
host2=netqe20.knqe.lab.eng.bos.redhat.com
host3=netqe21.knqe.lab.eng.bos.redhat.com
host1_ip=$(nslookup $host1 | grep Address | grep -v '#53' | awk '{print $NF}')
host2_ip=$(nslookup $host2 | grep Address | grep -v '#53' | awk '{print $NF}')
host3_ip=$(nslookup $host3 | grep Address | grep -v '#53' | awk '{print $NF}')

# Disable the DNS client on each system:
mv -f /etc/resolv.conf /etc/resolv.conf_saved

# Create an empty resolv.conf:
touch /etc/resolv.conf

# Test with no DNS or /etc/hosts configured:
rm -f ./hosts.txt
echo $host1 >> ./hosts.txt
echo $host2 >> ./hosts.txt
echo $host3 >> ./hosts.txt
ping_list=$(grep -v $(hostname) ./hosts.txt)
for i in $(echo $ping_list); do
    ping -c1 $i
    if [[ $? -ne 0 ]]; then
        echo "Ping of $i failed as expected: PASS"
    else
        echo "Ping of $i should have failed: FAIL"
    fi
done

# Add host info to the /etc/hosts file:
echo -e "$host1_ip\t$host1" >> /etc/hosts
echo -e "$host2_ip\t$host2" >> /etc/hosts
echo -e "$host3_ip\t$host3" >> /etc/hosts

# Test with the /etc/hosts file configured:
for i in $(echo $ping_list); do
    ping -c1 $i
    if [[ $? -eq 0 ]]; then
        echo "Ping of $i was successful: PASS"
    else
        echo "Ping of $i was unsuccessful: FAIL"
    fi
done

# Execute on $host1:
/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host1 --db-nb-create-insecure-remote=yes --db-sb-addr=$host1 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host1 --db-sb-cluster-local-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 start_northd

# Execute on $host2:
/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host2 --db-nb-create-insecure-remote=yes --db-sb-addr=$host2 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host2 --db-sb-cluster-local-addr=$host2 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 start_northd

# Execute on $host3:
/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host3 --db-nb-create-insecure-remote=yes --db-sb-addr=$host3 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host3 --db-sb-cluster-local-addr=$host3 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 start_northd

# Stop nb_ovsdb on $host1:
/usr/share/ovn/scripts/ovn-ctl stop_nb_ovsdb

# Delete the $host1 entry from the /etc/hosts file on $host2 and $host3:
sed -i "/$host1_ip/d" /etc/hosts

# Test on $host2 and $host3:
ping -c1 $host1
if [[ $? -ne 0 ]]; then
    echo "Ping of $host1 failed as expected: PASS"
else
    echo "Ping of $host1 should have failed: FAIL"
fi

# Restart nb_ovsdb on $host2 and $host3:
# host2:
/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host2 --db-nb-create-insecure-remote=yes --db-sb-addr=$host2 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host2 --db-sb-cluster-local-addr=$host2 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 restart_nb_ovsdb | tee output.log
# host3:
/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host3 --db-nb-create-insecure-remote=yes --db-sb-addr=$host3 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host3 --db-sb-cluster-local-addr=$host3 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 restart_nb_ovsdb | tee output.log

sleep 150s

# Check for the problem:
grep '/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)' /var/log/ovn/ovn-northd.log
grep 'Joining /etc/ovn/ovnnb_db.db to cluster' output.log | grep FAILED
grep 'Starting ovsdb-nb' output.log | grep FAILED
grep 'Waiting for OVN_Northbound to come up' output.log | grep FAILED

#######################
# Repro of problem:
openvswitch2.16-2.16.0-58.el8fdp
ovn-2021-21.12.0-30.el8fdp
ovn-2021-central-21.12.0-30.el8fdp
ovn-2021-host-21.12.0-30.el8fdp
RHEL-8.6.0-updates-20220424.0
kernel version: 4.18.0-372.9.1.el8.x86_64

# From output.log:
ovnnb_db is not running.
Joining /etc/ovn/ovnnb_db.db to cluster [FAILED]
Starting ovsdb-nb [FAILED]
Waiting for OVN_Northbound to come up [FAILED]

From console window:

[root@netqe20 ~]# /usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host2 --db-nb-create-insecure-remote=yes --db-sb-addr=$host2 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host2 --db-sb-cluster-local-addr=$host2 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 restart_nb_ovsdb | tee ./output.log
ovnnb_db is not running.
Joining /etc/ovn/ovnnb_db.db to cluster ovsdb-tool: ovsdb error: tcp:netqe9.knqe.lab.eng.bos.redhat.com:6643: syntax error in address [FAILED]
Starting ovsdb-nb ovsdb-server: I/O error: /etc/ovn/ovnnb_db.db: open failed (No such file or directory) [FAILED]
Waiting for OVN_Northbound to come up
2022-04-26T15:07:21Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-04-26T15:07:21Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-04-26T15:07:22Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-04-26T15:07:22Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-04-26T15:07:22Z|00005|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: waiting 2 seconds before reconnect
2022-04-26T15:07:24Z|00006|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-04-26T15:07:24Z|00007|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-04-26T15:07:24Z|00008|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: waiting 4 seconds before reconnect
2022-04-26T15:07:28Z|00009|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-04-26T15:07:28Z|00010|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-04-26T15:07:28Z|00011|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: continuing to reconnect in the background but suppressing further logging
2022-04-26T15:07:51Z|00012|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/usr/share/openvswitch/scripts/ovs-lib: line 602: 39987 Alarm clock "$@"
[FAILED]
[root@netqe20 ~]#

# Verification of fix:

# Install the ovs package with the fix included:
yum -y install http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch2.16/2.16.0/70.el8fdp/x86_64/openvswitch2.16-2.16.0-70.el8fdp.x86_64.rpm

[root@netqe20 ~]# rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-29.el8fdp.noarch
openvswitch2.16-2.16.0-70.el8fdp.x86_64
[root@netqe20 ~]# rpm -qa | grep ovn
ovn-2021-central-21.12.0-30.el8fdp.x86_64
ovn-2021-21.12.0-30.el8fdp.x86_64
ovn-2021-host-21.12.0-30.el8fdp.x86_64

# Follow the steps outlined above to set up the config for the test.

# The database now restarts/reconnects without any problem:
[root@netqe20 etc]# /usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host2 --db-nb-create-insecure-remote=yes --db-sb-addr=$host2 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host2 --db-sb-cluster-local-addr=$host2 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 restart_nb_ovsdb | tee output.log
Exiting ovnnb_db (51112) [ OK ]
Starting ovsdb-nb [ OK ]
Waiting for OVN_Northbound to come up
2022-04-27T13:37:02Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-04-27T13:37:02Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
[ OK ]
[root@netqe20 etc]#

[root@netqe21 etc]# /usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host3 --db-nb-create-insecure-remote=yes --db-sb-addr=$host3 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host3 --db-sb-cluster-local-addr=$host3 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 restart_nb_ovsdb | tee output.log
Exiting ovnnb_db (50779) [ OK ]
Starting ovsdb-nb [ OK ]
Waiting for OVN_Northbound to come up
2022-04-27T13:37:35Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-04-27T13:37:35Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
[ OK ]
[root@netqe21 etc]#

I will wait to mark this as Verified until it can be tested using the FDP 22.D openvswitch package.
Verification using FDP 22.D build:

[root@netqe20 ~]# rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-29.el8fdp.noarch
openvswitch2.16-2.16.0-74.el8fdp.x86_64
[root@netqe20 ~]# rpm -qa | grep ovn
ovn-2021-central-21.12.0-30.el8fdp.x86_64
ovn-2021-21.12.0-30.el8fdp.x86_64
ovn-2021-host-21.12.0-30.el8fdp.x86_64

[root@netqe20 ~]# /usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host2 --db-nb-create-insecure-remote=yes --db-sb-addr=$host2 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host2 --db-sb-cluster-local-addr=$host2 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 restart_nb_ovsdb | tee output.log
Exiting ovnnb_db (38591) [ OK ]
Joining /etc/ovn/ovnnb_db.db to cluster [ OK ]
Starting ovsdb-nb [ OK ]
Waiting for OVN_Northbound to come up
2022-05-10T14:50:15Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-05-10T14:50:15Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
2022-05-10T14:50:45Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/usr/share/openvswitch/scripts/ovs-lib: line 602: 38949 Alarm clock "$@"
[FAILED]
[root@netqe20 ~]#

Note: Dev has said the intermittent Alarm clock failure above is unrelated to this issue.
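The manual grep checks on output.log used in the reproducer can be collapsed into a single verdict. This is a hypothetical helper, not part of ovn-ctl or the official steps; `check_restart_log` simply scans the log captured with `| tee output.log`:

```shell
#!/bin/sh
# Hypothetical helper (not part of ovn-ctl): summarize an ovn-ctl
# restart_nb_ovsdb log captured via "| tee output.log" as PASS/FAIL,
# based on the [FAILED] markers ovn-ctl prints for failed steps.
check_restart_log() {
    log=${1:-output.log}
    if grep -q 'FAILED' "$log"; then
        echo "FAIL: $(grep -c 'FAILED' "$log") failed step(s) in $log"
        return 1
    fi
    echo "PASS: no failed steps in $log"
}
```

After each restart, `check_restart_log output.log` would report FAIL on the broken build and PASS on the fixed one (modulo the unrelated Alarm clock failure noted above).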
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: openvswitch2.16 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:4788