Bug 2092976
Summary: | [RAFT] OVN DB container unable to restart | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Terry Wilson <twilson> |
Component: | ovn22.03 | Assignee: | OVN Team <ovnteam> |
Status: | CLOSED ERRATA | QA Contact: | Jianlin Shi <jishi> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | FDP 22.C | CC: | akatz, apevec, ctrautma, ekuris, ffernand, ihrachys, jiji, lhh, majopela, mmichels, scohen, twilson |
Target Milestone: | --- | Keywords: | Triaged |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | ovn22.03-22.03.0-55 | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | 2091686 | Environment: | |
Last Closed: | 2022-08-01 14:11:03 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 2022144 | ||
Bug Blocks: | 1503518, 2091686 |
Description
Terry Wilson
2022-06-02 16:56:46 UTC
Moving this to MODIFIED since this has been merged into OVN both upstream and downstream.

Hi Mark, which patch solves the problem? Could you suggest how to reproduce the issue? Thanks.

Hi, Terry originally fixed the issue with these two patches:

https://github.com/ovn-org/ovn/commit/ab7d5b978e4f944f6bd9438ab5749d902164e160
https://github.com/ovn-org/ovn/commit/4cb3f9d9940e936c72cead63023bb74fc84b2cad

Ihar then posted a follow-up that fixed an issue with the original patches:

https://github.com/ovn-org/ovn/commit/cc3c32534ecb00d3547e41e75d2243bc57c7662b

We may want to get confirmation from @twilson, but I think the problem is this:

1. ovn-ctl starts ovsdb-server.
2. ovn-ctl checks the PID of ovsdb-server and writes it to a pidfile.
3. Something stops the ovsdb-server process.
4. A new program (we'll call it "Program X") starts running and now has the same PID that ovsdb-server was using before.
5. ovn-ctl tries to restart ovsdb-server, and reads the PID from the pidfile.
6. ovn-ctl checks if that PID is currently in use.
7. The PID is being used by Program X, so ovn-ctl assumes ovsdb-server is still running and therefore does not restart ovsdb-server.

The fix that Terry made was to modify step 6 so that ovn-ctl checks which program is currently using the PID. This way, ovn-ctl can see that Program X is currently using the PID and knows to restart ovsdb-server.

I'm not sure of the best way to reproduce this, because I'm not sure whether it's possible to request specific PIDs when starting processes. I also doubt that PID selection is predictable across all kernels/architectures.

It's really that with containers, you'd always get the same pid for ovsdb-server: since ovn-ctl runs ovsdb-server with exec(), ovsdb-server would end up with the same pid as ovn-ctl, and the .pid file was on a mount, so it survived the container restart.
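The difference between the original step 6 and the fixed behaviour can be sketched in shell as follows. This is only an illustration of the idea described above, not the code from the linked commits: the function names `pid_exists`, `pid_runs_program` and `pidfile_matches_program` are hypothetical, and the sketch assumes a Linux system where `/proc/<pid>/comm` holds the command name of the process.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and are not taken
# from ovn-ctl or from the patches referenced in this bug.

# Naive check (old step 6): succeeds if any process has PID $1.
pid_exists() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ -d "/proc/$1" ]
}

# Stricter check: succeeds only if PID $2 exists and its command
# name, as reported by /proc/<pid>/comm, equals $1.
pid_runs_program() {
    pid_exists "$2" || return 1
    [ "$(cat "/proc/$2/comm" 2>/dev/null)" = "$1" ]
}

# Succeeds if pidfile $2 is readable and names a live process whose
# command name is $1.
pidfile_matches_program() {
    [ -r "$2" ] || return 1
    pid=$(cat "$2")
    pid_runs_program "$1" "$pid"
}

# Demo: a pidfile holding the PID of this shell stands in for a stale
# pidfile whose PID has been reused by "Program X".
pidfile=$(mktemp)
echo "$$" > "$pidfile"
if pidfile_matches_program ovsdb-server "$pidfile"; then
    echo "ovsdb-server appears to be running; not restarting"
else
    echo "pidfile is stale or PID was reused; ovsdb-server should be started"
fi
rm -f "$pidfile"
```

In the demo the PID in the pidfile is alive, so the naive `pid_exists` check alone would wrongly conclude that ovsdb-server is still running. The name comparison rejects it because the process holding the PID is a shell rather than ovsdb-server.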
You should be able to test by creating a stale pid file that matches some other process's pid and ensuring that ovsdb-server successfully starts.

Tested with the following steps:

1. Start the NB DB with: /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
2. Get the pid of ovsdb-server.
3. Kill ovsdb-server.
4. Echo the pid of ovsdb-server to /var/run/ovn/ovnnb_db.pid
5. Start a program with the following code:

        #include <sys/stat.h>
        #include <sys/file.h>   /* flock() */
        #include <fcntl.h>
        #include <unistd.h>     /* close(), write(), fork(), sleep() */
        #include <stdio.h>
        #include <string.h>
        #include <stdlib.h>

        int main(int argc, char *argv[])
        {
            int fd, pid;
            char buf[32];

            if (argc != 2)
                return 1;

            printf("Opening ns_last_pid...\n");
            fd = open("/proc/sys/kernel/ns_last_pid", O_RDWR | O_CREAT, 0644);
            if (fd < 0) {
                perror("Can't open ns_last_pid");
                return 1;
            }
            printf("Done\n");

            printf("Locking ns_last_pid...\n");
            if (flock(fd, LOCK_EX)) {
                close(fd);
                printf("Can't lock ns_last_pid\n");
                return 1;
            }
            printf("Done\n");

            /* The kernel hands out ns_last_pid + 1 next, so write pid - 1. */
            pid = atoi(argv[1]);
            snprintf(buf, sizeof(buf), "%d", pid - 1);

            printf("Writing pid-1 to ns_last_pid...\n");
            if (write(fd, buf, strlen(buf)) != strlen(buf)) {
                printf("Can't write to ns_last_pid\n");
                return 1;
            }
            printf("Done\n");

            printf("Forking...\n");
            int new_pid;
            new_pid = fork();
            if (new_pid == 0) {
                printf("I'm child!\n");
                exit(0);
            } else if (new_pid == pid) {
                printf("I'm parent. My child got right pid!\n");
            } else {
                printf("pid does not match expected one\n");
            }
            printf("Done\n");

            /* The exited child stays defunct, holding the pid, while we sleep. */
            sleep(60);

            printf("Cleaning up...");
            if (flock(fd, LOCK_UN)) {
                printf("Can't unlock");
            }
            close(fd);
            printf("Done\n");
            return 0;
        }

6. Start the NB DB again with: /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
7. Check whether ovsdb-server is started.

Reproduced on ovn22.03-22.03.0-52:

    [root@wsfd-advnetlab16 ovn]# rpm -qa | grep -E "ovn22.03|openvswitch2.17"
    ovn22.03-host-22.03.0-52.el8fdp.x86_64
    openvswitch2.17-2.17.0-33.el8fdp.x86_64
    ovn22.03-central-22.03.0-52.el8fdp.x86_64
    ovn22.03-22.03.0-52.el8fdp.x86_64
    [root@wsfd-advnetlab16 ~]# /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
    Starting ovsdb-nb [ OK ]
    [root@wsfd-advnetlab16 ~]# ps aux | grep ovs
    root 49938 0.0 0.0 71240 24548 ? Ss 22:33 0:00 ovsdb-server: monitoring pid 49939 (healthy)
    root 49939 0.0 0.0 71636 29028 ? S 22:33 0:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --unixctl=/var/run/ovn/ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers /etc/ovn/ovnnb_db.db
    root 49947 0.0 0.0 11808 1140 pts/1 S+ 22:33 0:00 grep --color=auto ovs
    [root@wsfd-advnetlab16 ~]# kill 49939
    [root@wsfd-advnetlab16 ~]# echo 49939 > /var/run/ovn/ovnnb_db.pid
    [root@wsfd-advnetlab16 ~]# ./test 49939
    [root@wsfd-advnetlab16 ~]# /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
    [root@wsfd-advnetlab16 ~]# ps aux | grep ovs
    root 49961 0.0 0.0 11808 1192 pts/1 S+ 22:34 0:00 grep --color=auto ovs     <=== ovs db is not started
    [root@wsfd-advnetlab16 ~]# ps aux | grep 49939
    root 49939 0.0 0.0 0 0 pts/0 Z+ 22:34 0:00 [test] <defunct>
    root 49965 0.0 0.0 11808 1180 pts/1 S+ 22:34 0:00 grep --color=auto 49939
    root 49974 0.0 0.0 4368 904 pts/0 S+ 22:34 0:00 ./test 49939

Verified on ovn22.03-22.03.0-62:

    [root@wsfd-advnetlab16 ~]# rpm -qa | grep -E "ovn22.03|openvswitch2.17"
    openvswitch2.17-2.17.0-33.el8fdp.x86_64
    ovn22.03-22.03.0-62.el8fdp.x86_64
    ovn22.03-host-22.03.0-62.el8fdp.x86_64
    ovn22.03-central-22.03.0-62.el8fdp.x86_64
    [root@wsfd-advnetlab16 ~]# /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
    Starting ovsdb-nb [ OK ]
    [root@wsfd-advnetlab16 ~]# ps aux | grep ovs
    root 50415 0.0 0.0 71240 26088 ? Ss 22:35 0:00 ovsdb-server: monitoring pid 50416 (healthy)
    root 50416 0.2 0.0 71636 30708 ? S 22:35 0:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --unixctl=/var/run/ovn/ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers /etc/ovn/ovnnb_db.db
    root 50424 0.0 0.0 11808 1180 pts/1 S+ 22:35 0:00 grep --color=auto ovs
    [root@wsfd-advnetlab16 ~]# kill 50416
    [root@wsfd-advnetlab16 ~]# echo 50416 > /var/run/ovn/ovnnb_db.pid
    [root@wsfd-advnetlab16 ~]# ps aux | grep 50416
    root 50416 0.0 0.0 0 0 pts/0 Z+ 22:35 0:00 [test] <defunct>
    root 50420 0.0 0.0 11808 1184 pts/1 S+ 22:36 0:00 grep --color=auto 50416
    root 50433 0.0 0.0 4368 904 pts/0 S+ 22:35 0:00 ./test 50416
    [root@wsfd-advnetlab16 ~]# /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
    Starting ovsdb-nb [ OK ]
    [root@wsfd-advnetlab16 ~]# ps aux | grep ovs
    root 50446 0.0 0.0 71240 24548 ? Ss 22:36 0:00 ovsdb-server: monitoring pid 50447 (healthy)
    root 50447 0.0 0.0 71636 29036 ? S 22:36 0:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --unixctl=/var/run/ovn/ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers /etc/ovn/ovnnb_db.db
    root 50455 0.0 0.0 11808 1180 pts/1 S+ 22:36 0:00 grep --color=auto ovs     <=== ovsdb is started

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (ovn-2021 bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5787