Bug 2092976

Summary: [RAFT] OVN DB container unable to restart
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Terry Wilson <twilson>
Component: ovn22.03
Assignee: OVN Team <ovnteam>
Status: CLOSED ERRATA
QA Contact: Jianlin Shi <jishi>
Severity: high
Docs Contact:
Priority: high
Version: FDP 22.C
CC: akatz, apevec, ctrautma, ekuris, ffernand, ihrachys, jiji, lhh, majopela, mmichels, scohen, twilson
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ovn22.03-22.03.0-55
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2091686
Environment:
Last Closed: 2022-08-01 14:11:03 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2022144
Bug Blocks: 1503518, 2091686

Description Terry Wilson 2022-06-02 16:56:46 UTC
+++ This bug was initially created as a clone of Bug #2091686 +++

Description of problem:
The OVN DB containers (ovn_cluster_south_db_server, ovn_cluster_north_db_server) do not restart correctly because the PID file remains in the /var/lib/openvswitch/ovn/ folder after the containers are stopped using podman or systemctl commands.


Version-Release number of selected component (if applicable):
RHOS-17.0-RHEL-9-20220526.n.0
ovn-2021-21.12.0-59


How reproducible:


Steps to Reproduce:
systemctl restart tripleo_ovn_cluster_south_db_server
or
podman restart ovn_cluster_south_db_server


Actual results:


Expected results:


Additional info:

--- Additional comment from Terry Wilson on 2022-05-31 16:32:08 CDT ---

Interestingly, it looks like on controller-0, restarting the services works as expected. On controller-1 and controller-2 it fails the pidfile_is_running() check.

What appears to be happening is that on controller-0, which is set up to initially create the cluster, we're hitting ovn-ctl here:

  https://github.com/ovn-org/ovn/blob/915e6e0c2b4cb46d16db62fb6155eacdb3a0cb89/utilities/ovn-ctl#L311

which runs the command in the background with an '&', even though the command we've built starts with 'exec'. This means ovn-ctl is running with pid=6, and ovsdb-server is running with pid=66.

BUT, on the other two servers, we are hitting ovn-ctl here:

  https://github.com/ovn-org/ovn/blob/915e6e0c2b4cb46d16db62fb6155eacdb3a0cb89/utilities/ovn-ctl#L314

which ends up running the command in the foreground, so using exec means that ovsdb-server ends up running with pid=6. That is the same pid that ovn-ctl will get when the container is restarted, so when pidfile_is_running() reads the .pid file and then checks /proc/6, it finds that it *does* exist and concludes that ovsdb-server is already running.
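The difference between the two code paths can be seen with plain sh. This is only a sketch: the inner `sh -c` commands are stand-ins for ovn-ctl and ovsdb-server, not the real commands. With a foreground exec the replacement program keeps the launcher's pid; with '&' the exec happens in a forked child, so the pids differ.

```shell
#!/bin/sh
# Stand-in for the foreground path: the launcher prints its pid, then exec
# replaces it with another program, which prints its own pid.
foreground=$(sh -c 'echo $$; exec sh -c "echo \$\$"')

# Stand-in for the background path: the same exec, but started with '&',
# so it runs in a forked child and the launcher keeps its own pid.
background=$(sh -c 'echo $$; exec sh -c "echo \$\$" & wait')

printf 'foreground exec (launcher pid, then program pid):\n%s\n' "$foreground"
printf 'background exec (launcher pid, then program pid):\n%s\n' "$background"
```

In the foreground case both lines show the same pid; in the background case they differ. Inside a container, where pids are handed out in the same order on every start, the first case is what makes the stale .pid file match a live process again.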

One way to fix this might be to pass in an optional binary to pidfile_is_running and then check if `readlink /proc/$pid/exe` == $binary.
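A rough sketch of that idea follows. It is not the patch that was eventually merged: the function name `pidfile_is_running_as`, its argument order, and the `/usr/sbin/ovsdb-server` path are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not the real ovn-ctl code.
# Usage: pidfile_is_running_as PIDFILE [BINARY]
pidfile_is_running_as () {
    pidfile=$1
    binary=$2
    # No pidfile means nothing is recorded as running.
    test -e "$pidfile" || return 1
    pid=$(cat "$pidfile" 2>/dev/null) || return 1
    # Reject empty or non-numeric content.
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac
    # The pid must belong to a live process.
    test -d "/proc/$pid" || return 1
    # Without a binary argument, fall back to the existence check alone.
    test -n "$binary" || return 0
    # With a binary argument, the process must actually be that executable.
    exe=$(readlink "/proc/$pid/exe" 2>/dev/null) || return 1
    test "$exe" = "$binary"
}

# Demo: a pidfile that names this shell, checked against an assumed
# ovsdb-server path.
demo_pidfile=$(mktemp)
echo $$ > "$demo_pidfile"
if pidfile_is_running_as "$demo_pidfile" /usr/sbin/ovsdb-server; then
    echo "ovsdb-server is running"
else
    echo "stale pidfile, safe to start ovsdb-server"
fi
rm -f "$demo_pidfile"
```

Keeping the binary argument optional means existing callers that only pass a pidfile would behave as before.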

--- Additional comment from Terry Wilson on 2022-06-02 11:55:06 CDT ---

Fix here: https://patchwork.ozlabs.org/project/ovn/list/?series=303093

Comment 3 Mark Michelson 2022-06-28 18:37:32 UTC
Moving this to MODIFIED since this has been merged into OVN both up- and downstream.

Comment 7 Jianlin Shi 2022-07-12 02:20:55 UTC
Hi Mark,

Which patch solves the problem? Could you give suggestions on how to reproduce the issue? Thanks.

Comment 8 Mark Michelson 2022-07-12 13:16:35 UTC
Hi,

Terry originally fixed the issue with these two patches: 

https://github.com/ovn-org/ovn/commit/ab7d5b978e4f944f6bd9438ab5749d902164e160
https://github.com/ovn-org/ovn/commit/4cb3f9d9940e936c72cead63023bb74fc84b2cad

Ihar then posted a follow-up that fixed an issue with the original patches:

https://github.com/ovn-org/ovn/commit/cc3c32534ecb00d3547e41e75d2243bc57c7662b

We may want to get confirmation from @twilson , but I think the problem is this:

1. ovn-ctl starts ovsdb-server.
2. ovn-ctl checks the PID of ovsdb-server and writes it to a pidfile.
3. Something stops the ovsdb-server process.
4. A new program (we'll call it "Program X") starts running and now has the same PID that ovsdb-server was using before.
5. ovn-ctl tries to restart ovsdb-server, and reads the PID from the pidfile.
6. ovn-ctl checks if that PID is currently in use.
7. The PID is being used by Program X, and so ovn-ctl assumes ovsdb-server is still running and therefore does not re-start ovsdb-server.

The fix that Terry made was to modify step 6 so that ovn-ctl checks which program is currently using the PID. This way, ovn-ctl can see that Program X is currently using the PID and knows to re-start ovsdb-server.
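Steps 4 through 7 can be illustrated with a short sketch. The existence-only check below is a simplified assumption about what the old step 6 did, not the actual ovn-ctl code, and `program_using_pid` is a hypothetical helper. The current shell plays the role of Program X.

```shell
#!/bin/sh
# Simplified existence-only check, assumed to resemble the old step 6.
naive_pidfile_is_running () {
    pid=$(cat "$1" 2>/dev/null) && test -n "$pid" && test -d "/proc/$pid"
}

# Hypothetical helper: name of whatever program currently owns a PID.
program_using_pid () {
    cat "/proc/$1/comm" 2>/dev/null
}

pidfile=$(mktemp)
# Steps 3-4: ovsdb-server is gone, and "Program X" (this shell) now owns
# the PID that is still recorded in the pidfile.
echo $$ > "$pidfile"

# Steps 5-7: the existence-only check is fooled by Program X.
if naive_pidfile_is_running "$pidfile"; then
    echo "naive check: PID $(cat "$pidfile") is in use, assuming ovsdb-server is running"
fi

# Looking at which program owns the PID exposes the mismatch.
echo "actual owner of that PID: $(program_using_pid "$(cat "$pidfile")")"
rm -f "$pidfile"
```

The second message prints the shell's name rather than ovsdb-server, which is the information the modified step 6 relies on.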

I'm not sure the best way to reproduce this, because I'm not sure if it's possible to request specific PIDs when starting processes. I also doubt that PID selection is predictable across all kernels/architectures.

Comment 9 Terry Wilson 2022-07-13 11:06:01 UTC
It's really that with containers, you'd always get the same pid for ovsdb-server. Since ovn-ctl runs ovsdb-server with exec(), ovsdb-server would end up with the same pid as ovn-ctl, and the .pid file was on a mount, so it survived the container restart.

You should be able to test by creating a stale pid file that matches some other process pid and ensure that ovsdb-server successfully starts.
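One possible way to set up such a stale pid file, as a sketch: start an unrelated long-lived process and write its pid into the file. The helper name `make_stale_pidfile` and the `STALE_PID` variable are made up for this example, and the demo writes to a scratch file; for a real test the argument would be whatever pid file the deployment uses for ovsdb-server.

```shell
#!/bin/sh
# Hypothetical helper: start an unrelated process and record its pid in
# the given pidfile, so the file names a live process that is not
# ovsdb-server. The pid is also left in STALE_PID for later cleanup.
make_stale_pidfile () {
    sleep 600 >/dev/null 2>&1 &
    STALE_PID=$!
    echo "$STALE_PID" > "$1"
}

# Demo against a scratch file rather than a real ovsdb-server pidfile.
pidfile=$(mktemp)
make_stale_pidfile "$pidfile"
echo "pidfile $pidfile now names live pid $STALE_PID (a sleep, not ovsdb-server)"

# At this point one would run the start command under test and confirm
# that ovsdb-server comes up despite the stale pidfile.

# Clean up the placeholder process and the scratch file.
kill "$STALE_PID"
rm -f "$pidfile"
```

This avoids having to force a specific pid: any live process that is not ovsdb-server should be enough to exercise the check.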

Comment 10 Jianlin Shi 2022-07-14 02:37:14 UTC
Tested with the following steps:

1. Start the NB DB with: /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
2. Get the PID of ovsdb-server
3. Kill ovsdb-server
4. Write that PID to /var/run/ovn/ovnnb_db.pid
5. Compile and run the following program, passing that PID as its argument:
#include <sys/file.h>   /* flock() */
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>     /* fork(), write(), close(), sleep() */

/*
 * Usage: ./test PID   (run as root, as in the transcript below)
 *
 * Writes PID-1 to ns_last_pid so that the next fork() is assigned PID.
 * The child exits at once and is not reaped, so it stays around as a
 * zombie (and /proc/PID exists) while the parent sleeps for 60 seconds.
 */
int main(int argc, char *argv[])
{
    int fd, pid;
    char buf[32];

    if (argc != 2)
        return 1;

    printf("Opening ns_last_pid...\n");
    fd = open("/proc/sys/kernel/ns_last_pid", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("Can't open ns_last_pid");
        return 1;
    }
    printf("Done\n");

    /* Hold an exclusive lock while the last-used PID is being changed. */
    printf("Locking ns_last_pid...\n");
    if (flock(fd, LOCK_EX)) {
        close(fd);
        printf("Can't lock ns_last_pid\n");
        return 1;
    }
    printf("Done\n");

    pid = atoi(argv[1]);
    snprintf(buf, sizeof(buf), "%d", pid - 1);

    printf("Writing pid-1 to ns_last_pid...\n");
    if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf)) {
        printf("Can't write to ns_last_pid\n");
        close(fd);
        return 1;
    }
    printf("Done\n");

    printf("Forking...\n");
    int new_pid;
    new_pid = fork();
    if (new_pid == 0) {
        printf("I'm child!\n");
        exit(0);
    } else if (new_pid == pid) {
        printf("I'm parent. My child got right pid!\n");
    } else {
        printf("pid does not match expected one\n");
    }
    printf("Done\n");

    /* Keep the zombie child (and its PID) alive for the duration of the test. */
    sleep(60);
    printf("Cleaning up...");
    if (flock(fd, LOCK_UN)) {
        printf("Can't unlock");
    }

    close(fd);

    printf("Done\n");

    return 0;
}
6. start nb db again with: /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
7. check if ovsdb is started

reproduced on ovn22.03-22.03.0-52:

[root@wsfd-advnetlab16 ovn]# rpm -qa | grep -E "ovn22.03|openvswitch2.17"                             
ovn22.03-host-22.03.0-52.el8fdp.x86_64                                                                
openvswitch2.17-2.17.0-33.el8fdp.x86_64                                                               
ovn22.03-central-22.03.0-52.el8fdp.x86_64                                                             
ovn22.03-22.03.0-52.el8fdp.x86_64

[root@wsfd-advnetlab16 ~]# /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
Starting ovsdb-nb                                          [  OK  ]
[root@wsfd-advnetlab16 ~]# ps aux | grep ovs
root       49938  0.0  0.0  71240 24548 ?        Ss   22:33   0:00 ovsdb-server: monitoring pid 49939 (healthy)
root       49939  0.0  0.0  71636 29028 ?        S    22:33   0:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --unixctl=/var/run/ovn/ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers /etc/ovn/ovnnb_db.db
root       49947  0.0  0.0  11808  1140 pts/1    S+   22:33   0:00 grep --color=auto ovs
[root@wsfd-advnetlab16 ~]# kill 49939
[root@wsfd-advnetlab16 ~]# echo 49939 > /var/run/ovn/ovnnb_db.pid

[root@wsfd-advnetlab16 ~]# ./test 49939
[root@wsfd-advnetlab16 ~]# /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
[root@wsfd-advnetlab16 ~]# ps aux | grep ovs
root       49961  0.0  0.0  11808  1192 pts/1    S+   22:34   0:00 grep --color=auto ovs

<=== ovs db is not started

[root@wsfd-advnetlab16 ~]# ps aux | grep 49939                                                        
root       49939  0.0  0.0      0     0 pts/0    Z+   22:34   0:00 [test] <defunct>
root       49965  0.0  0.0  11808  1180 pts/1    S+   22:34   0:00 grep --color=auto 49939
root       49974  0.0  0.0   4368   904 pts/0    S+   22:34   0:00 ./test 49939

Verified on ovn22.03-22.03.0-62:

[root@wsfd-advnetlab16 ~]# rpm -qa | grep -E "ovn22.03|openvswitch2.17"
openvswitch2.17-2.17.0-33.el8fdp.x86_64
ovn22.03-22.03.0-62.el8fdp.x86_64
ovn22.03-host-22.03.0-62.el8fdp.x86_64
ovn22.03-central-22.03.0-62.el8fdp.x86_64

[root@wsfd-advnetlab16 ~]# /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
Starting ovsdb-nb                                          [  OK  ]
[root@wsfd-advnetlab16 ~]# ps aux | grep ovs
root       50415  0.0  0.0  71240 26088 ?        Ss   22:35   0:00 ovsdb-server: monitoring pid 50416 (healthy)
root       50416  0.2  0.0  71636 30708 ?        S    22:35   0:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --unixctl=/var/run/ovn/ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers /etc/ovn/ovnnb_db.db
root       50424  0.0  0.0  11808  1180 pts/1    S+   22:35   0:00 grep --color=auto ovs
[root@wsfd-advnetlab16 ~]# kill 50416                                                                 
[root@wsfd-advnetlab16 ~]# echo 50416 > /var/run/ovn/ovnnb_db.pid                                     
[root@wsfd-advnetlab16 ~]# ps aux | grep 50416
root       50416  0.0  0.0      0     0 pts/0    Z+   22:35   0:00 [test] <defunct>
root       50420  0.0  0.0  11808  1184 pts/1    S+   22:36   0:00 grep --color=auto 50416            
root       50433  0.0  0.0   4368   904 pts/0    S+   22:35   0:00 ./test 50416                       
[root@wsfd-advnetlab16 ~]# /usr/share/ovn/scripts/ovn-ctl start_nb_ovsdb
Starting ovsdb-nb                                          [  OK  ]                                   
[root@wsfd-advnetlab16 ~]# ps aux | grep ovs                                                          
root       50446  0.0  0.0  71240 24548 ?        Ss   22:36   0:00 ovsdb-server: monitoring pid 50447 (healthy)
root       50447  0.0  0.0  71636 29036 ?        S    22:36   0:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --unixctl=/var/run/ovn/ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers /etc/ovn/ovnnb_db.db
root       50455  0.0  0.0  11808  1180 pts/1    S+   22:36   0:00 grep --color=auto ovs

<=== ovsdb is started

Comment 12 errata-xmlrpc 2022-08-01 14:11:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn-2021 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5787