+++ This bug was initially created as a clone of Bug #2091686 +++

Description of problem:
OVN DB containers (ovn_cluster_south_db_server, ovn_cluster_north_db_server) are not restarted correctly because the PID file remains in the /var/lib/openvswitch/ovn/ folder after the containers are stopped using podman or systemctl commands.

Version-Release number of selected component (if applicable):
RHOS-17.0-RHEL-9-20220526.n.0
ovn-2021-21.12.0-59

How reproducible:

Steps to Reproduce:
systemctl restart tripleo_ovn_cluster_south_db_server
or
podman restart ovn_cluster_south_db_server

Actual results:

Expected results:

Additional info:

--- Additional comment from Terry Wilson on 2022-05-31 16:32:08 CDT ---

Interestingly, it looks like on controller-0, restarting the services works as expected. On controller-1 and controller-2 it fails the pidfile_is_running() check.

What appears to be happening is that on controller-0, which is set up to initially create the cluster, we're hitting ovn-ctl here:
https://github.com/ovn-org/ovn/blob/915e6e0c2b4cb46d16db62fb6155eacdb3a0cb89/utilities/ovn-ctl#L311
where, even though the command we've built is run with 'exec', it is run in the background with an '&'. This means ovn-ctl is running with pid=6, and ovsdb-server is running with pid=66.

BUT, on the other two servers, we are hitting ovn-ctl here:
https://github.com/ovn-org/ovn/blob/915e6e0c2b4cb46d16db62fb6155eacdb3a0cb89/utilities/ovn-ctl#L314
which ends up running the command in the foreground, so using exec means that ovsdb-server ends up running with pid=6. That is the pid that ovn-ctl will get when the container is restarted, so when pidfile_is_running() reads the .pid file and then checks /proc/6, it will find that it *does* exist, and so it thinks that ovsdb-server is running.

One way to fix this might be to pass an optional binary to pidfile_is_running and then check if `readlink /proc/$pid/exe` == $binary.
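As a rough illustration of the idea in the comment above (not the patch that was eventually merged), a binary-aware check could look something like the sketch below. The function name pidfile_is_running_as, the temp-file demo, and the /usr/sbin/ovsdb-server path are assumptions made for this example; only the /proc/$pid/exe comparison comes from the comment.

```shell
#!/bin/sh
# Hypothetical helper, NOT the upstream ovn-ctl code.
# Succeeds only if the PID stored in the pidfile is alive and, when a binary
# path is given, /proc/<pid>/exe resolves to exactly that path.
pidfile_is_running_as() {
    _pidfile=$1
    _binary=${2:-}

    [ -e "$_pidfile" ] || return 1
    _pid=$(cat "$_pidfile" 2>/dev/null) || return 1

    # Reject empty or non-numeric pidfile contents.
    case $_pid in
        ''|*[!0-9]*) return 1 ;;
    esac

    # PID-only check: is any process using this PID?
    [ -d "/proc/$_pid" ] || return 1

    # Without a binary argument, fall back to the PID-only answer.
    [ -n "$_binary" ] || return 0

    # Binary-aware check: is it the program we expect?
    _exe=$(readlink "/proc/$_pid/exe" 2>/dev/null) || return 1
    [ "$_exe" = "$_binary" ]
}

# Demo: a pidfile holding this shell's own PID stands in for a stale
# ovsdb-server pidfile whose PID was taken over by another program.
demo_pidfile=$(mktemp)
echo "$$" > "$demo_pidfile"

# /usr/sbin/ovsdb-server is an assumed install path, used for illustration.
if pidfile_is_running_as "$demo_pidfile" /usr/sbin/ovsdb-server; then
    echo "ovsdb-server is running"
else
    echo "stale pidfile: PID $$ is not ovsdb-server"
fi
rm -f "$demo_pidfile"
```

Called without the second argument, the helper behaves like a plain PID check, so existing callers would not need to change.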
--- Additional comment from Terry Wilson on 2022-06-02 11:55:06 CDT ---

Fix here: https://patchwork.ozlabs.org/project/ovn/list/?series=303093
Moving this to MODIFIED since this has been merged into OVN both up- and downstream.
Hi Mark, which patch solves the problem? Could you give suggestions on how to reproduce the issue? Thanks.
Hi,

Terry originally fixed the issue with these two patches:
https://github.com/ovn-org/ovn/commit/ab7d5b978e4f944f6bd9438ab5749d902164e160
https://github.com/ovn-org/ovn/commit/4cb3f9d9940e936c72cead63023bb74fc84b2cad

Ihar then posted a follow-up that fixed an issue with the original patches:
https://github.com/ovn-org/ovn/commit/cc3c32534ecb00d3547e41e75d2243bc57c7662b

We may want to get confirmation from @twilson, but I think the problem is this:

1. ovn-ctl starts ovsdb-server.
2. ovn-ctl checks the PID of ovsdb-server and writes it to a pidfile.
3. Something stops the ovsdb-server process.
4. A new program (we'll call it "Program X") starts running and now has the same PID that ovsdb-server was using before.
5. ovn-ctl tries to restart ovsdb-server, and reads the PID from the pidfile.
6. ovn-ctl checks if that PID is currently in use.
7. The PID is being used by Program X, and so ovn-ctl assumes ovsdb-server is still running and therefore does not re-start ovsdb-server.

The fix that Terry made was to modify step 6 so that ovn-ctl checks which program is currently using the PID. This way, ovn-ctl can see that Program X is currently using the PID and knows to re-start ovsdb-server.

I'm not sure the best way to reproduce this, because I'm not sure if it's possible to request specific PIDs when starting processes. I also doubt that PID selection is predictable across all kernels/architectures.
It's really that with containers, you'd always get the same PID for ovsdb-server. Since ovn-ctl runs ovsdb-server with exec(), ovsdb-server would end up with the same PID as ovn-ctl, and the .pid file was on a mount, so it survived the container restart.

You should be able to test by creating a stale pid file that matches some other process's PID and ensuring that ovsdb-server successfully starts.
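A minimal sketch of that test idea, for anyone who wants to try it without forcing a specific PID: start an unrelated process, write its PID into the pidfile, and see what a PID-only check concludes. The PIDFILE variable, the sleep decoy, and the variable names are assumptions for illustration; by default the sketch writes to a temp file so it is harmless to run.

```shell
#!/bin/sh
# Sketch only: fabricate a stale pidfile that points at an unrelated live
# process. On a test host PIDFILE could be set to the real pidfile, e.g.
# /var/run/ovn/ovnnb_db.pid; here it defaults to a throwaway temp file.
if [ -z "${PIDFILE:-}" ]; then
    PIDFILE=$(mktemp)
fi

# Stand-in for "Program X": any long-lived process that is not ovsdb-server.
sleep 60 &
decoy_pid=$!
echo "$decoy_pid" > "$PIDFILE"

# A PID-only check looks at /proc/<pid> and nothing else, so it is fooled
# by the decoy.
if [ -d "/proc/$(cat "$PIDFILE")" ]; then
    pid_only_result="running"
else
    pid_only_result="not running"
fi
echo "PID-only check for PID $decoy_pid says: $pid_only_result"

# On a test host, this is the point to run
#   /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
# and confirm with ps that ovsdb-server really started.

# Clean up the decoy process.
kill "$decoy_pid"
wait "$decoy_pid" 2>/dev/null || true
```

Using a live decoy avoids having to reuse the exact PID of the old ovsdb-server; all that matters for the check is that the pidfile names a PID that belongs to some other program.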
Tested with the following steps:

1. Start the NB DB with: /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
2. Get the PID of ovsdb-server.
3. Kill ovsdb-server.
4. Echo the PID of ovsdb-server to /var/run/ovn/ovnnb_db.pid
5. Start a program with the following code:

    #include <sys/stat.h>
    #include <sys/file.h>   /* flock() */
    #include <fcntl.h>
    #include <unistd.h>     /* fork(), write(), close(), sleep() */
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    /* Forks a child that gets the PID given in argv[1] by first writing
     * pid-1 to /proc/sys/kernel/ns_last_pid. Must be run as root. */
    int main(int argc, char *argv[])
    {
        int fd, pid;
        char buf[32];

        if (argc != 2)
            return 1;

        printf("Opening ns_last_pid...\n");
        fd = open("/proc/sys/kernel/ns_last_pid", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror("Can't open ns_last_pid");
            return 1;
        }
        printf("Done\n");

        /* Hold a lock so nobody else changes ns_last_pid under us. */
        printf("Locking ns_last_pid...\n");
        if (flock(fd, LOCK_EX)) {
            close(fd);
            printf("Can't lock ns_last_pid\n");
            return 1;
        }
        printf("Done\n");

        /* The next fork() gets ns_last_pid + 1, so write pid - 1. */
        pid = atoi(argv[1]);
        snprintf(buf, sizeof(buf), "%d", pid - 1);

        printf("Writing pid-1 to ns_last_pid...\n");
        if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf)) {
            printf("Can't write to buf\n");
            return 1;
        }
        printf("Done\n");

        printf("Forking...\n");
        int new_pid;
        new_pid = fork();
        if (new_pid == 0) {
            printf("I'm child!\n");
            exit(0);
        } else if (new_pid == pid) {
            printf("I'm parent. My child got right pid!\n");
        } else {
            printf("pid does not match expected one\n");
        }
        printf("Done\n");

        /* The exited child is not reaped, so it stays around as a zombie
         * holding the requested PID while the parent sleeps. */
        sleep(60);

        printf("Cleaning up...");
        if (flock(fd, LOCK_UN)) {
            printf("Can't unlock");
        }
        close(fd);
        printf("Done\n");
        return 0;
    }

6. Start the NB DB again with: /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
7. Check whether ovsdb-server is started.

Reproduced on ovn22.03-22.03.0-52:

[root@wsfd-advnetlab16 ovn]# rpm -qa | grep -E "ovn22.03|openvswitch2.17"
ovn22.03-host-22.03.0-52.el8fdp.x86_64
openvswitch2.17-2.17.0-33.el8fdp.x86_64
ovn22.03-central-22.03.0-52.el8fdp.x86_64
ovn22.03-22.03.0-52.el8fdp.x86_64
[root@wsfd-advnetlab16 ~]# /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
Starting ovsdb-nb [ OK ]
[root@wsfd-advnetlab16 ~]# ps aux | grep ovs
root 49938 0.0 0.0 71240 24548 ? Ss 22:33 0:00 ovsdb-server: monitoring pid 49939 (healthy)
root 49939 0.0 0.0 71636 29028 ? S 22:33 0:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --unixctl=/var/run/ovn/ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers /etc/ovn/ovnnb_db.db
root 49947 0.0 0.0 11808 1140 pts/1 S+ 22:33 0:00 grep --color=auto ovs
[root@wsfd-advnetlab16 ~]# kill 49939
[root@wsfd-advnetlab16 ~]# echo 49939 > /var/run/ovn/ovnnb_db.pid
[root@wsfd-advnetlab16 ~]# ./test 49939
[root@wsfd-advnetlab16 ~]# /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
[root@wsfd-advnetlab16 ~]# ps aux | grep ovs
root 49961 0.0 0.0 11808 1192 pts/1 S+ 22:34 0:00 grep --color=auto ovs <=== ovs db is not started
[root@wsfd-advnetlab16 ~]# ps aux | grep 49939
root 49939 0.0 0.0 0 0 pts/0 Z+ 22:34 0:00 [test] <defunct>
root 49965 0.0 0.0 11808 1180 pts/1 S+ 22:34 0:00 grep --color=auto 49939
root 49974 0.0 0.0 4368 904 pts/0 S+ 22:34 0:00 ./test 49939

Verified on ovn22.03-22.03.0-62:

[root@wsfd-advnetlab16 ~]# rpm -qa | grep -E "ovn22.03|openvswitch2.17"
openvswitch2.17-2.17.0-33.el8fdp.x86_64
ovn22.03-22.03.0-62.el8fdp.x86_64
ovn22.03-host-22.03.0-62.el8fdp.x86_64
ovn22.03-central-22.03.0-62.el8fdp.x86_64
[root@wsfd-advnetlab16 ~]# /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
Starting ovsdb-nb [ OK ]
[root@wsfd-advnetlab16 ~]# ps aux | grep ovs
root 50415 0.0 0.0 71240 26088 ? Ss 22:35 0:00 ovsdb-server: monitoring pid 50416 (healthy)
root 50416 0.2 0.0 71636 30708 ? S 22:35 0:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --unixctl=/var/run/ovn/ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers /etc/ovn/ovnnb_db.db
root 50424 0.0 0.0 11808 1180 pts/1 S+ 22:35 0:00 grep --color=auto ovs
[root@wsfd-advnetlab16 ~]# kill 50416
[root@wsfd-advnetlab16 ~]# echo 50416 > /var/run/ovn/ovnnb_db.pid
[root@wsfd-advnetlab16 ~]# ps aux | grep 50416
root 50416 0.0 0.0 0 0 pts/0 Z+ 22:35 0:00 [test] <defunct>
root 50420 0.0 0.0 11808 1184 pts/1 S+ 22:36 0:00 grep --color=auto 50416
root 50433 0.0 0.0 4368 904 pts/0 S+ 22:35 0:00 ./test 50416
[root@wsfd-advnetlab16 ~]# /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
Starting ovsdb-nb [ OK ]
[root@wsfd-advnetlab16 ~]# ps aux | grep ovs
root 50446 0.0 0.0 71240 24548 ? Ss 22:36 0:00 ovsdb-server: monitoring pid 50447 (healthy)
root 50447 0.0 0.0 71636 29036 ? S 22:36 0:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --unixctl=/var/run/ovn/ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers /etc/ovn/ovnnb_db.db
root 50455 0.0 0.0 11808 1180 pts/1 S+ 22:36 0:00 grep --color=auto ovs <=== ovsdb is started
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn-2021 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:5787