Bug 1416014

Summary: rlWaitForSocket --close now waits for incorrect socket
Product: [Fedora] Fedora Reporter: Frantisek Sumsal <fsumsal>
Component: beakerlib Assignee: Jakub Heger <jheger>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: medium    
Version: 30 CC: azelinka, dapospis, hkario, jheger, mkyral, muller, szidek
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: beakerlib-1.18-3 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-05-24 14:51:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1496120    
Bug Blocks: 1388422    

Description Frantisek Sumsal 2017-01-24 11:10:09 UTC
Description of problem:
A patch in the latest testing version of beakerlib (BZ#1388422) causes rlWaitForSocket --close to deadlock or introduce an unwanted delay, because it greps for an incorrect socket.

Waiting for a socket is currently implemented by this snippet:

local cmd="netstat -nla | grep -E '$grep_opt' >/dev/null"

Consider a real-world scenario: a client-server test is executed (the server listens on port 4433 and the client connects to it), then the server process is simply killed and rlWaitForSocket --close 4433 is called. We get the following netstat output:

# netstat -nla
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State   
...
tcp6       0      0 ::1:56226               ::1:4433                TIME_WAIT

The problem is easy to spot: this socket is a *client* socket, which is irrelevant here, but rlWaitForSocket waits for it to close because the grep matches it. In our case this usually takes around 50 seconds, so a phase that originally took around 3 seconds now takes around 53 seconds. This makes a huge difference in the runtime of some of our tests (from ~30 minutes to (dozens of) hours).
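
A minimal sketch of the mismatch, using the netstat line quoted above. The pattern is a hypothetical stand-in, since the real value of $grep_opt is not shown in this report, and the function names are illustrative only:

```shell
# Line copied from the netstat output above.
sample='tcp6       0      0 ::1:56226               ::1:4433                TIME_WAIT'

# Hypothetical stand-in for $grep_opt; the real beakerlib pattern may differ.
pattern=':4433[[:space:]]'

# Whole-line grep: succeeds although 4433 appears only as the foreign port.
whole_line_match() {
    printf '%s\n' "$sample" | grep -E "$pattern" >/dev/null
}

# Column-aware check: field 4 of netstat output is the local address.
local_column_match() {
    printf '%s\n' "$sample" | awk '$4 ~ /:4433$/ { found = 1 } END { exit !found }'
}

whole_line_match   && echo "whole line: matched"
local_column_match || echo "local column: not matched"
```

The whole-line check succeeds on the client socket, while the check restricted to the local address column does not.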

There are a few possible solutions; basically you just need to compare the given socket with local sockets only, not with foreign ones. I would also split the checking of unix sockets and TCP/UDP sockets into two different calls to prevent other weird issues, such as a socket name containing the checked port. I'm not sure how popular awk is among beakerlib developers, but it could be done by something along these lines:

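ss -natu state all | awk '
{
    # $5 is "Local Address:Port"; the port is the text after the last ":".
    # Plain split() is used so that this works in any awk, not only gawk.
    n = split($5, parts, ":");
    if (n > 1)
        print parts[n];
}'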
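
Building on the snippet above, a helper could compare the extracted local port with the requested one. This is only a sketch: local_port_present and the sample data are hypothetical, and the sample merely mimics the column layout printed by ss.

```shell
# Hypothetical helper: reads `ss -natu state all` style output on stdin and
# succeeds only if some socket has exactly the local port given as $1.
local_port_present() {
    awk -v port="$1" '
        {
            n = split($5, parts, ":")          # port follows the last ":"
            # Appending "" forces a string comparison, so the port must
            # match character for character.
            if (n > 1 && (parts[n] "") == (port "")) found = 1
        }
        END { exit !found }'
}

# Made-up sample in the column layout printed by ss.
sample_ss='Netid State     Recv-Q Send-Q Local Address:Port Peer Address:Port
tcp   LISTEN    0      128    0.0.0.0:2222       0.0.0.0:*
tcp   TIME-WAIT 0      0      [::1]:56226        [::1]:4433'

for p in 22 2222 4433 56226; do
    if printf '%s\n' "$sample_ss" | local_port_present "$p"; then
        echo "$p: present as a local port"
    else
        echo "$p: not present as a local port"
    fi
done
```

On a live system the helper would be fed from ss directly, e.g. ss -natu state all | local_port_present 4433.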

Something similar can be done for unix sockets as well:

ss -nax state all | awk '
BEGIN {
    FS=" "
}
{
    print $5;
}'
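
For unix sockets, the same idea could compare the whole path rather than grepping for a substring. Again a sketch only: unix_socket_present and the sample path are hypothetical, and paths containing whitespace are not handled.

```shell
# Hypothetical helper: reads `ss -nax state all` style output on stdin and
# succeeds only if field 5 (the local socket path) equals $1 exactly.
unix_socket_present() {
    awk -v path="$1" '($5 "") == (path "") { found = 1 } END { exit !found }'
}

# Made-up sample in the column layout printed by ss for unix sockets.
sample_unix='Netid State  Recv-Q Send-Q Local Address:Port Peer Address:Port
u_str LISTEN 0      128    /run/test.sock.bak 23456 * 0'

# A substring match would wrongly report /run/test.sock as open here.
if printf '%s\n' "$sample_unix" | unix_socket_present /run/test.sock; then
    echo "/run/test.sock: open"
else
    echo "/run/test.sock: not open"
fi
```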

The matching itself must be exact: the current solution with grep -E matches 22 even if port 2222 is open instead of 22.

Everything written here is just an overall idea of what should be checked/compared with what; any other ideas and improvements are more than welcome.

Version-Release number of selected component (if applicable):
beakerlib-1.12-1.fc25.noarch

Comment 1 Dalibor Pospíšil 2017-01-24 11:15:29 UTC
*** Bug 1416018 has been marked as a duplicate of this bug. ***

Comment 2 Dalibor Pospíšil 2017-01-24 13:39:20 UTC
(In reply to Frantisek Sumsal from comment #0)
> The matching itself must be exact: the current solution with grep -E matches
> 22 even if port 2222 is open instead of 22.
Fortunately this cannot happen, as the regexp is not that dumb. But it may happen in the case of a unix socket.

Comment 3 Jan Kurik 2017-08-15 08:40:13 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 27 development cycle.
Changing version to '27'.

Comment 4 Dalibor Pospíšil 2019-02-07 13:39:29 UTC
I propose limiting monitoring to local ports and adding --remote to request monitoring of a remote port. This would be a backward-incompatible change, but it is the more natural approach.
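
A rough sketch of how such a switch might select the column to check. This is not the beakerlib implementation: port_present, the _col variable, and the sample data are hypothetical, and only the --remote option name comes from the proposal above.

```shell
# Hypothetical helper, not the beakerlib implementation: reads
# `ss -natu state all` style output on stdin and checks the local port,
# or the peer port when called with --remote.
port_present() {
    _col=5                                 # "Local Address:Port" column
    if [ "$1" = "--remote" ]; then
        _col=6                             # "Peer Address:Port" column
        shift
    fi
    awk -v port="$1" -v col="$_col" '
        {
            n = split($col, parts, ":")    # port follows the last ":"
            if (n > 1 && (parts[n] "") == (port "")) found = 1
        }
        END { exit !found }'
}

# Made-up sample: the client socket from the original report.
sample_conn='Netid State     Recv-Q Send-Q Local Address:Port Peer Address:Port
tcp   TIME-WAIT 0      0      [::1]:56226        [::1]:4433'

printf '%s\n' "$sample_conn" | port_present 4433          || echo "4433 is not a local port"
printf '%s\n' "$sample_conn" | port_present --remote 4433 && echo "4433 is a remote port"
```

With the default local-only behaviour, the lingering client socket from the original report would no longer be matched by a wait on port 4433.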

Comment 5 Ben Cotton 2019-02-19 17:11:47 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 30 development cycle.
Changing version to '30'.

Comment 6 Jakub Heger 2019-02-22 14:27:29 UTC
Should be fixed by commit https://github.com/beakerlib/beakerlib/commit/4ac8297be3d6bca718269dab1d5a1c8f4ebfb257
Dalibor, could you please review/test it?