Bug 2070343 - Failed to read database with dns hostname address
Summary: Failed to read database with dns hostname address
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch2.15
Version: FDP 22.A
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: FDP 22.D
Assignee: Ilya Maximets
QA Contact: Rick Alongi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-30 21:15 UTC by OvS team
Modified: 2022-05-27 18:15 UTC
CC List: 7 users

Fixed In Version: openvswitch2.15-2.15.0-88.el8fdp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-27 18:14:51 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FD-1868 0 None None None 2022-03-30 21:30:09 UTC
Red Hat Product Errata RHSA-2022:4787 0 None None None 2022-05-27 18:15:05 UTC

Description OvS team 2022-03-30 21:15:50 UTC
+++ This bug was initially created as a clone of Bug #2055097 +++

Description of problem:

Version-Release number of selected component (if applicable):
ovn21.12-central-21.12.0-25.el8fdp.x86_64
ovn21.12-vtep-21.12.0-25.el8fdp.x86_64
ovn21.12-21.12.0-25.el8fdp.x86_64
ovn21.12-host-21.12.0-25.el8fdp.x86_64

Context: HyperShift OVN, running the OVN nbdb and sbdb as a statefulset.

Assume the ovndb statefulset pods ovnkube-master-guest-0/1/2 formed the quorum and guest-1 is the nb leader. Delete both the guest-0 and guest-1 pods; guest-2 becomes the leader.

Since a statefulset is used, guest-0 gets re-created first (guest-1 has to wait until guest-0 is ready, and a guest pod's DNS hostname is only resolvable while the pod is running). Guest-0 finds the new leader guest-2, then starts nb with the following command (local=guest-0, remote=guest-2):


###
+ echo 'Cluster already exists for DB: nb'
+ initial_raft_create=false
+ wait 71
+ exec /usr/share/ovn/scripts/ovn-ctl --db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=ovnkube-master-guest-0.ovnkube-master-guest.hypershift-ovn.svc.cluster.local --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt --db-nb-cluster-remote-port=9643 --db-nb-cluster-remote-addr=ovnkube-master-guest-2.ovnkube-master-guest.hypershift-ovn.svc.cluster.local --db-nb-cluster-remote-proto=ssl '--ovn-nb-log=-vconsole:dbg -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m' --db-nb-election-timer=10000 run_nb_ovsdb
2022-02-16T03:05:25.330Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
ovsdb-server: ovsdb error: error reading record 12 from OVN_Northbound log: ssl:ovnkube-master-guest-1.ovnkube-master-guest.hypershift-ovn.svc.cluster.local:9643: syntax error in address
[1]+  Exit 1                  exec /usr/share/ovn/scripts/ovn-ctl ${OVN_ARGS} --db-nb-cluster-remote-port=9643 --db-nb-cluster-remote-addr=${init_ip} --db-nb-cluster-remote-proto=ssl --ovn-nb-log="-vconsole:${OVN_LOG_LEVEL} -vfile:off -vPATTERN:console:${OVN_LOG_PATTERN_CONSOLE}" ${election_timer} run_nb_ovsdb
###

Guest-0 failed because the guest-1 hostname is not resolvable ("syntax error in address").
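
For reference, a hedged way to confirm what is stored on disk (a sketch only, assuming the default /etc/ovn/ovnnb_db.db path used by ovn-ctl and the pod hostname from the log above) is to dump the raft records of the clustered database and check whether the peer name currently resolves:

# Sketch: list raft records that embed ssl: peer addresses; with the bug, an
# unresolvable DNS name in such a record makes ovsdb-server fail on read.
ovsdb-tool show-log -m /etc/ovn/ovnnb_db.db | grep -i 'ssl:' | head

# Check whether the peer hostname is resolvable at the moment:
getent hosts ovnkube-master-guest-1.ovnkube-master-guest.hypershift-ovn.svc.cluster.local \
    || echo 'peer hostname not resolvable yet'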


Expected results:

Guest-0 keeps retrying the connection until guest-1 becomes running, instead of exiting with an error.

Comment 1 OvS team 2022-03-30 21:15:53 UTC
* Wed Mar 30 2022 Open vSwitch CI <ovs-ci> - 2.15.0-88
- Merging upstream branch-2.15 [RH git: a03b5c62e4]
    Commit list:
    0a3867a9a9 ovsdb: raft: Fix inability to read the database with DNS host names. (#2055097)

Comment 4 Rick Alongi 2022-04-20 17:48:33 UTC
Ilya,

Is this something that can be reproduced using openvswitch alone or does it require a layered product?

Thanks,
Rick

Comment 6 Ilya Maximets 2022-04-25 11:20:38 UTC
(In reply to Rick Alongi from comment #4)
> Is this something that can be reproduced using openvswitch alone or does it
> require a layered product?

This requires a DNS server.  The sequence should be something like this:

1. Create 3 DNS names for 3 servers.
2. Start the ovsdb cluster using these names instead of IP addresses.
3. Stop one of the servers and remove its DNS record.
4. Restart the 2 remaining servers; they should continue to work.

With the issue present, the 2 remaining servers will fail to start because they
cannot resolve the name of the third server (a minimal sketch follows below).
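
A minimal sketch of that sequence, assuming /etc/hosts entries can stand in for the DNS server (as in the verification steps later in this bug) and using placeholder host names:

# Sketch: placeholder names for three test machines that resolve via /etc/hosts.
host1=server1.example.com; host2=server2.example.com; host3=server3.example.com

# 1-2. Start the clustered NB/SB databases on all three hosts by name
#      (see the full ovn-ctl invocations later in this bug).

# 3. Stop the NB server on host1, then drop host1's record on host2 and host3:
/usr/share/ovn/scripts/ovn-ctl stop_nb_ovsdb        # run on host1
sed -i "/$host1/d" /etc/hosts                       # run on host2 and host3

# 4. Restart the NB server on host2 and host3 with the same cluster options
#    used at start. Without the fix this fails with "syntax error in address";
#    with the fix the database comes back up and the cluster keeps working.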

Comment 7 Rick Alongi 2022-04-27 18:04:07 UTC
Reproducer/Verification steps can be found at https://bugzilla.redhat.com/show_bug.cgi?id=2055097#c16

This will be tested when FDP 22.D is ready for testing.

Comment 10 Rick Alongi 2022-05-10 13:50:06 UTC
Reproducer/Verification steps below:

# Provision three systems with RHEL-8.6

# Install ovs and ovn packages without the fix included:

yum -y install \
    http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch-selinux-extra-policy/1.0/29.el8fdp/noarch/openvswitch-selinux-extra-policy-1.0-29.el8fdp.noarch.rpm \
    http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch2.15/2.15.0/87.el8fdp/x86_64/openvswitch2.15-2.15.0-87.el8fdp.x86_64.rpm \
    http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn-2021/21.12.0/30.el8fdp/x86_64/ovn-2021-21.12.0-30.el8fdp.x86_64.rpm \
    http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn-2021/21.12.0/30.el8fdp/x86_64/ovn-2021-central-21.12.0-30.el8fdp.x86_64.rpm \
    http://download-node-02.eng.bos.redhat.com/brewroot/packages/ovn-2021/21.12.0/30.el8fdp/x86_64/ovn-2021-host-21.12.0-30.el8fdp.x86_64.rpm

# Start ovs and ovn processes:

systemctl start openvswitch
systemctl enable openvswitch
systemctl enable ovn-controller
systemctl enable ovn-northd
systemctl start ovn-controller
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642

# Additional config for ovs:

yum -y install net-tools
host_ip=$(ifconfig -a | grep inet | head -n 1 | awk '{print $2}' | tr -d "addr:")
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:${host_ip}:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=${host_ip}
systemctl restart ovn-controller
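
# Optional sanity check (a sketch only): confirm the settings above were applied.
ovs-vsctl get Open_vSwitch . external_ids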

# Configure three systems:

host1=netqe9.knqe.lab.eng.bos.redhat.com
host2=netqe20.knqe.lab.eng.bos.redhat.com
host3=netqe21.knqe.lab.eng.bos.redhat.com

host1_ip=$(nslookup $host1 | grep Address | grep -v '#53' | awk '{print $NF}')
host2_ip=$(nslookup $host2 | grep Address | grep -v '#53' | awk '{print $NF}')
host3_ip=$(nslookup $host3 | grep Address | grep -v '#53' | awk '{print $NF}')

# disable DNS client on each system:

mv -f /etc/resolv.conf /etc/resolv.conf_saved

# Create empty resolv.conf:

touch /etc/resolv.conf

# Test with no DNS or /etc/hosts configured:

rm -f ./hosts.txt
echo $host1 >> ./hosts.txt
echo $host2 >> ./hosts.txt
echo $host3 >> ./hosts.txt

ping_list=$(grep -v $(hostname) ./hosts.txt)
for i in $(echo $ping_list); do
    ping -c1 $i
    if [[ $? -ne 0 ]]; then
        echo "Ping of $i failed as expected: PASS"
    else
        echo "Ping of $i should have failed: FAIL"
    fi
done

# add host info to /etc/hosts file:

echo -e "$host1_ip\t$host1" >> /etc/hosts
echo -e "$host2_ip\t$host2" >> /etc/hosts
echo -e "$host3_ip\t$host3" >> /etc/hosts

# Test with /etc/hosts file configured:

for i in $(echo $ping_list); do
    ping -c1 $i
    if [[ $? -eq 0 ]]; then
        echo "Ping of $i was successful: PASS"
    else
        echo "Ping of $i was unsuccessful: FAIL"
    fi
done

# execute on $host1:

/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host1 --db-nb-create-insecure-remote=yes \
    --db-sb-addr=$host1 --db-sb-create-insecure-remote=yes \
    --db-nb-cluster-local-addr=$host1 --db-sb-cluster-local-addr=$host1 \
    --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 \
    --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 start_northd

# execute on $host2:

/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host2 --db-nb-create-insecure-remote=yes \
    --db-sb-addr=$host2 --db-sb-create-insecure-remote=yes \
    --db-nb-cluster-local-addr=$host2 --db-sb-cluster-local-addr=$host2 \
    --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 \
    --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 \
    --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 start_northd

# execute on $host3:

/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host3 --db-nb-create-insecure-remote=yes \
    --db-sb-addr=$host3 --db-sb-create-insecure-remote=yes \
    --db-nb-cluster-local-addr=$host3 --db-sb-cluster-local-addr=$host3 \
    --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 \
    --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 \
    --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 start_northd

# stop nb_ovsdb on $host1:

/usr/share/ovn/scripts/ovn-ctl stop_nb_ovsdb

# delete $host1 entry from /etc/hosts file on $host2 and $host3:

sed -i "/$host1_ip/d" /etc/hosts

# Test on $host2 and $host3:

ping -c1 $host1
if [[ $? -ne 0 ]]; then
    echo "Ping of $host1 failed as expected: PASS"
else
    echo "Ping of $host1 should have failed: FAIL"
fi

# restart nb_ovsdb on $host2 and $host3:

# host2:

/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host2 --db-nb-create-insecure-remote=yes --db-sb-addr=$host2 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host2 --db-sb-cluster-local-addr=$host2 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 restart_nb_ovsdb | tee output.log

# host3:

/usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host3 --db-nb-create-insecure-remote=yes --db-sb-addr=$host3 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host3 --db-sb-cluster-local-addr=$host3 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 restart_nb_ovsdb | tee output.log

sleep 150s

# Check for problem
grep '/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)' /var/log/ovn/ovn-northd.log
grep 'Joining /etc/ovn/ovnnb_db.db to cluster' output.log | grep FAILED
grep 'Starting ovsdb-nb'  output.log | grep FAILED
grep 'Waiting for OVN_Northbound to come up'  output.log | grep FAILED
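
# In addition to the greps above, a hedged way to look at the NB server's raft
# state directly after the restart (sketch only; the control-socket path below
# matches the run directory seen in the logs in this bug but may differ):
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound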

#######################

# Repro of problem:

openvswitch2.15-2.15.0-87.el8fdp
ovn-2021-21.12.0-30.el8fdp
ovn-2021-central-21.12.0-30.el8fdp
ovn-2021-host-21.12.0-30.el8fdp

RHEL-8.6.0-updates-20220510.0
kernel version: 4.18.0-372.9.1.el8.x86_64

[root@netqe20 ~]# /usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host2 --db-nb-create-insecure-remote=yes --db-sb-addr=$host2 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host2 --db-sb-cluster-local-addr=$host2 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 restart_nb_ovsdb | tee output.log
Exiting ovnnb_db (39208)                                   [  OK  ]
Joining /etc/ovn/ovnnb_db.db to cluster ovsdb-tool: ovsdb error: tcp:netqe9.knqe.lab.eng.bos.redhat.com:6643: syntax error in address
                                                           [FAILED]
Starting ovsdb-nb ovsdb-server: I/O error: /etc/ovn/ovnnb_db.db: open failed (No such file or directory)
                                                           [FAILED]
Waiting for OVN_Northbound to come up 2022-05-10T13:05:02Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-05-10T13:05:02Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-05-10T13:05:03Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-05-10T13:05:03Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-05-10T13:05:03Z|00005|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: waiting 2 seconds before reconnect
2022-05-10T13:05:05Z|00006|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-05-10T13:05:05Z|00007|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-05-10T13:05:05Z|00008|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: waiting 4 seconds before reconnect
2022-05-10T13:05:09Z|00009|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-05-10T13:05:09Z|00010|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-05-10T13:05:09Z|00011|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: continuing to reconnect in the background but suppressing further logging
2022-05-10T13:05:32Z|00012|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/usr/share/openvswitch/scripts/ovs-lib: line 602: 39579 Alarm clock             "$@"
                                                           [FAILED]
[root@netqe20 ~]# grep '/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)' /var/log/ovn/ovn-northd.log
2022-05-10T13:03:26.276Z|00015|reconnect|INFO|unix:/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-05-10T13:03:28.279Z|00018|reconnect|INFO|unix:/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2022-05-10T13:03:32.283Z|00021|reconnect|INFO|unix:/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
[root@netqe20 ~]# grep 'Joining /etc/ovn/ovnnb_db.db to cluster' output.log | grep FAILED
Joining /etc/ovn/ovnnb_db.db to cluster                    [FAILED]
[root@netqe20 ~]# grep 'Starting ovsdb-nb'  output.log | grep FAILED
Starting ovsdb-nb                                          [FAILED]
[root@netqe20 ~]# grep 'Waiting for OVN_Northbound to come up'  output.log | grep FAILED
Waiting for OVN_Northbound to come up                      [FAILED]
[root@netqe20 ~]# 

Verification of fix:

# Install ovs package with the fix included (FDP 22.D):

yum -y update http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch2.15/2.15.0/99.el8fdp/x86_64/openvswitch2.15-2.15.0-99.el8fdp.x86_64.rpm

[root@netqe20 ~]# rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-29.el8fdp.noarch
openvswitch2.15-2.15.0-99.el8fdp.x86_64
[root@netqe20 ~]# rpm -qa | grep ovn
ovn-2021-central-21.12.0-30.el8fdp.x86_64
ovn-2021-21.12.0-30.el8fdp.x86_64
ovn-2021-host-21.12.0-30.el8fdp.x86_64

# Follow steps outlined above to set up config for test.
# Database now restarts/reconnects without any problem:

[root@netqe21 ~]# /usr/share/ovn/scripts/ovn-ctl --db-nb-addr=$host3 --db-nb-create-insecure-remote=yes --db-sb-addr=$host3 --db-sb-create-insecure-remote=yes --db-nb-cluster-local-addr=$host3 --db-sb-cluster-local-addr=$host3 --db-nb-cluster-remote-addr=$host1 --db-sb-cluster-remote-addr=$host1 --ovn-northd-nb-db=tcp:$host1:6641,tcp:$host2:6641,tcp:$host3:6641 --ovn-northd-sb-db=tcp:$host1:6642,tcp:$host2:6642,tcp:$host3:6642 restart_nb_ovsdb
Exiting ovnnb_db (40605)                                   [  OK  ]
Starting ovsdb-nb                                          [  OK  ]
Waiting for OVN_Northbound to come up 2022-05-10T13:24:55Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-05-10T13:24:55Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
2022-05-10T13:25:25Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/usr/share/openvswitch/scripts/ovs-lib: line 602: 40829 Alarm clock             "$@"
                                                           [FAILED]

Note: Dev has said the intermittent alarm clock failure above is unrelated to this issue.

Comment 12 errata-xmlrpc 2022-05-27 18:14:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: openvswitch2.15 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4787

