Bug 2092976 - [RAFT] OVN DB container unable to restart
Summary: [RAFT] OVN DB container unable to restart
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn22.03
Version: FDP 22.C
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: OVN Team
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On: 2022144
Blocks: ovsdbclustering 2091686
Reported: 2022-06-02 16:56 UTC by Terry Wilson
Modified: 2022-08-01 14:11 UTC (History)

Fixed In Version: ovn22.03-22.03.0-55
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2091686
Environment:
Last Closed: 2022-08-01 14:11:03 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FD-2003 0 None None None 2022-06-02 17:00:02 UTC
Red Hat Product Errata RHBA-2022:5787 0 None None None 2022-08-01 14:11:14 UTC

Internal Links: 2091686

Description Terry Wilson 2022-06-02 16:56:46 UTC
+++ This bug was initially created as a clone of Bug #2091686 +++

Description of problem:
The OVN DB containers (ovn_cluster_south_db_server, ovn_cluster_north_db_server) are not restarted correctly because the PID file remains in the /var/lib/openvswitch/ovn/ directory after the containers are stopped using podman or systemctl commands.


Version-Release number of selected component (if applicable):
RHOS-17.0-RHEL-9-20220526.n.0
ovn-2021-21.12.0-59


How reproducible:


Steps to Reproduce:
systemctl restart tripleo_ovn_cluster_south_db_server
or
podman restart ovn_cluster_south_db_server


Actual results:


Expected results:


Additional info:

--- Additional comment from Terry Wilson on 2022-05-31 16:32:08 CDT ---

Interestingly, it looks like on controller-0, restarting the services works as expected. On controller-1 and controller-2 it fails the pidfile_is_running() check.

What appears to be happening is that on controller-0, which is set up to initially create the cluster, we're hitting ovn-ctl here:

  https://github.com/ovn-org/ovn/blob/915e6e0c2b4cb46d16db62fb6155eacdb3a0cb89/utilities/ovn-ctl#L311

There, even though the command we've built is run with 'exec', it is run in the background with an '&'. This means ovn-ctl is running with pid=6, and ovsdb-server is running with pid=66.

BUT, on the other two servers, we are hitting ovn-ctl here:

  https://github.com/ovn-org/ovn/blob/915e6e0c2b4cb46d16db62fb6155eacdb3a0cb89/utilities/ovn-ctl#L314

which ends up running the command in the foreground, so using exec means that ovsdb-server ends up running with pid=6. That is the same pid that ovn-ctl will get when the container is restarted, so when pidfile_is_running() reads the .pid file and then checks /proc/6, it finds that it *does* exist and concludes that ovsdb-server is running.
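
The PID difference between the two code paths can be seen outside of OVN with a small shell sketch. It does not call ovn-ctl: the inner `sh -c` commands are only stand-ins for ovsdb-server, and the variable names are made up for illustration.

```shell
# Foreground exec: the exec'd command replaces the calling shell,
# so it keeps the caller's PID (the controller-1/controller-2 case).
fg_pids=$(sh -c 'echo $$; exec sh -c "echo \$\$"')

# Background exec: "exec cmd &" runs in a forked child,
# so cmd gets a new PID (the controller-0 case).
bg_pids=$(sh -c 'echo $$; exec sh -c "echo \$\$" & wait')

# Each variable holds two PIDs: the calling script's, then the exec'd command's.
set -- $fg_pids
echo "foreground exec: script PID $1, exec'd command PID $2"
set -- $bg_pids
echo "background exec: script PID $1, exec'd command PID $2"
```

In the foreground case both PIDs are the same, which is why a container that always starts ovn-ctl with the same PID also always ends up with ovsdb-server on that PID.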

One way to fix might be to pass in an optional binary to pidfile_is_running and then check if `readlink /proc/$pid/exe` == $binary.
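
A minimal sketch of that idea, assuming a Linux /proc filesystem. The function name pidfile_is_running_as and its argument order are hypothetical; this is not the code from the eventual patches.

```shell
# Hypothetical helper, not the actual ovn-ctl/ovs-lib implementation.
# $1 = pidfile, $2 = optional absolute path of the expected binary.
# Succeeds only if the PID in the pidfile is alive and, when a binary
# is given, /proc/<pid>/exe points at that binary.
pidfile_is_running_as() (
    pidfile=$1
    binary=$2
    [ -e "$pidfile" ] || exit 1
    pid=$(cat "$pidfile") || exit 1
    [ -n "$pid" ] || exit 1
    [ -d "/proc/$pid" ] || exit 1      # PID is not in use at all
    [ -n "$binary" ] || exit 0         # no binary given: plain PID check
    exe=$(readlink "/proc/$pid/exe") || exit 1
    [ "$exe" = "$binary" ]             # PID in use, but by which program?
)
```

It could be called as `pidfile_is_running_as /var/run/ovn/ovnnb_db.pid /usr/sbin/ovsdb-server`, where the ovsdb-server path is an assumption and depends on the installation. Reading /proc/<pid>/exe for another user's process needs sufficient privileges, so this sketch assumes it runs as root, as ovn-ctl does in the transcripts in this bug.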

--- Additional comment from Terry Wilson on 2022-06-02 11:55:06 CDT ---

Fix here: https://patchwork.ozlabs.org/project/ovn/list/?series=303093

Comment 3 Mark Michelson 2022-06-28 18:37:32 UTC
Moving this to MODIFIED since this has been merged into OVN both up- and downstream.

Comment 7 Jianlin Shi 2022-07-12 02:20:55 UTC
Hi Mark,

Which patch solves the problem? Could you suggest how to reproduce the issue? Thanks.

Comment 8 Mark Michelson 2022-07-12 13:16:35 UTC
Hi,

Terry originally fixed the issue with these two patches: 

https://github.com/ovn-org/ovn/commit/ab7d5b978e4f944f6bd9438ab5749d902164e160
https://github.com/ovn-org/ovn/commit/4cb3f9d9940e936c72cead63023bb74fc84b2cad

Ihar then posted a follow-up that fixed an issue with the original patches:

https://github.com/ovn-org/ovn/commit/cc3c32534ecb00d3547e41e75d2243bc57c7662b

We may want to get confirmation from @twilson, but I think the problem is this:

1. ovn-ctl starts ovsdb-server.
2. ovn-ctl checks the PID of ovsdb-server and writes it to a pidfile.
3. Something stops the ovsdb-server process.
4. A new program (we'll call it "Program X") starts running and now has the same PID that ovsdb-server was using before.
5. ovn-ctl tries to restart ovsdb-server, and reads the PID from the pidfile.
6. ovn-ctl checks if that PID is currently in use.
7. The PID is being used by Program X, and so ovn-ctl assumes ovsdb-server is still running and therefore does not re-start ovsdb-server.

The fix that Terry made was to modify step 6 so that ovn-ctl checks which program is currently using the PID. This way, ovn-ctl can see that Program X is currently using the PID and knows to re-start ovsdb-server.
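
The false positive in steps 4 to 7 can be sketched as follows, assuming a Linux /proc filesystem. The function name, the temporary pidfile path, and the `sleep` stand-in for "Program X" are all made up for illustration; this is not the ovn-ctl code.

```shell
# Naive liveness check: only asks whether the PID in the pidfile exists.
naive_pidfile_is_running() (
    pid=$(cat "$1" 2>/dev/null) || exit 1
    [ -n "$pid" ] && [ -d "/proc/$pid" ]
)

workdir=$(mktemp -d)
pidfile=$workdir/ovsdb-server.pid    # hypothetical pidfile for the demo

sleep 30 &                           # stand-in for "Program X"
program_x_pid=$!
echo "$program_x_pid" > "$pidfile"   # stale pidfile now points at Program X

if naive_pidfile_is_running "$pidfile"; then
    echo "naive check: PID $program_x_pid is in use, so ovsdb-server is assumed to be running"
fi

# Asking which program owns the PID exposes the mistake. Right after the
# fork this may briefly still report the shell's own name.
echo "program using PID $program_x_pid: $(cat "/proc/$program_x_pid/comm")"

kill "$program_x_pid"
wait "$program_x_pid" 2>/dev/null || true
rm -rf "$workdir"
```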

I'm not sure of the best way to reproduce this, because I'm not sure if it's possible to request specific PIDs when starting processes. I also doubt that PID selection is predictable across all kernels/architectures.

Comment 9 Terry Wilson 2022-07-13 11:06:01 UTC
It's really that with containers, you'd always get the same pid for ovsdb-server. Since ovn-ctl runs ovsdb-server with exec(), ovsdb-server would end up with the same pid as ovn-ctl, and because the .pid file was on a mount, it survived the container restart.

You should be able to test by creating a stale pid file that matches some other process pid and ensure that ovsdb-server successfully starts.

Comment 10 Jianlin Shi 2022-07-14 02:37:14 UTC
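Tested with the following steps:

1. Start the NB DB with: /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
2. Get the PID of ovsdb-server.
3. Kill ovsdb-server.
4. Write the PID of the killed ovsdb-server to /var/run/ovn/ovnnb_db.pid.
5. Start a program built from the following code, passing that PID as its argument, so that an unrelated process takes over the PID: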
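#include <sys/file.h>   /* flock() */
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>     /* write(), fork(), sleep(), close() */

/*
 * Usage: ./test <pid>   (run as root)
 * Sets ns_last_pid to <pid> - 1 so that the next fork() is handed <pid>,
 * then keeps the exited child around as a zombie so that /proc/<pid> exists.
 */
int main(int argc, char *argv[])
{
    int fd, pid;
    char buf[32];

    if (argc != 2)
        return 1;

    printf("Opening ns_last_pid...\n");
    fd = open("/proc/sys/kernel/ns_last_pid", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("Can't open ns_last_pid");
        return 1;
    }
    printf("Done\n");

    /* Hold an exclusive lock so nothing else changes ns_last_pid meanwhile. */
    printf("Locking ns_last_pid...\n");
    if (flock(fd, LOCK_EX)) {
        close(fd);
        printf("Can't lock ns_last_pid\n");
        return 1;
    }
    printf("Done\n");

    /* The kernel hands out last_pid + 1 next, so write the wanted PID minus one. */
    pid = atoi(argv[1]);
    snprintf(buf, sizeof(buf), "%d", pid - 1);

    printf("Writing pid-1 to ns_last_pid...\n");
    if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf)) {
        printf("Can't write to ns_last_pid\n");
        close(fd);
        return 1;
    }
    printf("Done\n");

    printf("Forking...\n");
    int new_pid;
    new_pid = fork();
    if (new_pid == 0) {
        printf("I'm child!\n");
        exit(0);
    } else if (new_pid == pid) {
        printf("I'm parent. My child got right pid!\n");
    } else {
        printf("pid does not match expected one\n");
    }
    printf("Done\n");

    /* The child has exited but is not reaped, so it stays a zombie
     * holding the PID while the parent sleeps. */
    sleep(60);
    printf("Cleaning up...");
    if (flock(fd, LOCK_UN)) {
        printf("Can't unlock");
    }

    close(fd);

    printf("Done\n");

    return 0;
}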
6. Start the NB DB again with: /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
7. Check whether ovsdb-server is started.

reproduced on ovn22.03-22.03.0-52:

[root@wsfd-advnetlab16 ovn]# rpm -qa | grep -E "ovn22.03|openvswitch2.17"                             
ovn22.03-host-22.03.0-52.el8fdp.x86_64                                                                
openvswitch2.17-2.17.0-33.el8fdp.x86_64                                                               
ovn22.03-central-22.03.0-52.el8fdp.x86_64                                                             
ovn22.03-22.03.0-52.el8fdp.x86_64

[root@wsfd-advnetlab16 ~]# /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
Starting ovsdb-nb                                          [  OK  ]
[root@wsfd-advnetlab16 ~]# ps aux | grep ovs
root       49938  0.0  0.0  71240 24548 ?        Ss   22:33   0:00 ovsdb-server: monitoring pid 49939 (healthy)
root       49939  0.0  0.0  71636 29028 ?        S    22:33   0:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --unixctl=/var/run/ovn/ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers /etc/ovn/ovnnb_db.db
root       49947  0.0  0.0  11808  1140 pts/1    S+   22:33   0:00 grep --color=auto ovs
[root@wsfd-advnetlab16 ~]# kill 49939
[root@wsfd-advnetlab16 ~]# echo 49939 > /var/run/ovn/ovnnb_db.pid

[root@wsfd-advnetlab16 ~]# ./test 49939
[root@wsfd-advnetlab16 ~]# /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
[root@wsfd-advnetlab16 ~]# ps aux | grep ovs
root       49961  0.0  0.0  11808  1192 pts/1    S+   22:34   0:00 grep --color=auto ovs

<=== ovs db is not started

[root@wsfd-advnetlab16 ~]# ps aux | grep 49939                                                        
root       49939  0.0  0.0      0     0 pts/0    Z+   22:34   0:00 [test] <defunct>
root       49965  0.0  0.0  11808  1180 pts/1    S+   22:34   0:00 grep --color=auto 49939
root       49974  0.0  0.0   4368   904 pts/0    S+   22:34   0:00 ./test 49939

Verified on ovn22.03-22.03.0-62:

[root@wsfd-advnetlab16 ~]# rpm -qa | grep -E "ovn22.03|openvswitch2.17"
openvswitch2.17-2.17.0-33.el8fdp.x86_64
ovn22.03-22.03.0-62.el8fdp.x86_64
ovn22.03-host-22.03.0-62.el8fdp.x86_64
ovn22.03-central-22.03.0-62.el8fdp.x86_64

[root@wsfd-advnetlab16 ~]# /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
Starting ovsdb-nb                                          [  OK  ]
[root@wsfd-advnetlab16 ~]# ps aux | grep ovs
root       50415  0.0  0.0  71240 26088 ?        Ss   22:35   0:00 ovsdb-server: monitoring pid 50416 (healthy)
root       50416  0.2  0.0  71636 30708 ?        S    22:35   0:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --unixctl=/var/run/ovn/ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers /etc/ovn/ovnnb_db.db
root       50424  0.0  0.0  11808  1180 pts/1    S+   22:35   0:00 grep --color=auto ovs
[root@wsfd-advnetlab16 ~]# kill 50416                                                                 
[root@wsfd-advnetlab16 ~]# echo 50416 > /var/run/ovn/ovnnb_db.pid                                     
[root@wsfd-advnetlab16 ~]# ps aux | grep 50416
root       50416  0.0  0.0      0     0 pts/0    Z+   22:35   0:00 [test] <defunct>
root       50420  0.0  0.0  11808  1184 pts/1    S+   22:36   0:00 grep --color=auto 50416            
root       50433  0.0  0.0   4368   904 pts/0    S+   22:35   0:00 ./test 50416                       
[root@wsfd-advnetlab16 ~]# /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
Starting ovsdb-nb                                          [  OK  ]                                   
[root@wsfd-advnetlab16 ~]# ps aux | grep ovs                                                          
root       50446  0.0  0.0  71240 24548 ?        Ss   22:36   0:00 ovsdb-server: monitoring pid 50447 (healthy)
root       50447  0.0  0.0  71636 29036 ?        S    22:36   0:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --unixctl=/var/run/ovn/ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers /etc/ovn/ovnnb_db.db
root       50455  0.0  0.0  11808  1180 pts/1    S+   22:36   0:00 grep --color=auto ovs

<=== ovsdb is started

Comment 12 errata-xmlrpc 2022-08-01 14:11:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn-2021 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5787

