Bug 1517764 - ovirt-imageio: socket.error: Address already in use
Summary: ovirt-imageio: socket.error: Address already in use
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 4.4.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.0
Target Release: ---
Assignee: Nir Soffer
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-27 12:06 UTC by Dafna Ron
Modified: 2020-06-19 09:52 UTC
CC: 14 users

Fixed In Version: ovirt-imageio-2.0
Clone Of:
Environment:
Last Closed: 2020-06-04 06:39:11 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.4?
amureini: blocker?
dron: planning_ack?
pm-rhel: devel_ack+
aefrat: testing_ack-


Attachments


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 85327 0 master ABANDONED Proxy: Allow to reuse the address 2021-02-19 22:44:38 UTC
oVirt gerrit 85557 0 None MERGED Resereving port 54323 for imageio-proxy 2021-02-19 22:44:38 UTC
oVirt gerrit 85586 0 None MERGED Resereving port 54322 for imageio-daemon 2021-02-19 22:44:38 UTC
oVirt gerrit 108943 0 master MERGED packaging: Reserve imageio ports 2021-02-19 22:44:38 UTC
oVirt gerrit 109056 0 master MERGED packaging: Support reserving ports 2021-02-19 22:44:38 UTC

Description Dafna Ron 2017-11-27 12:06:19 UTC
Description of problem:

We are seeing random failures in OST for the test 001_initialize_engine.test_initialize_engine.

The error seems to be caused by ovirt-imageio-proxy: socket.error: Address already in use

We do not see ovirt-imageio-proxy trying to start twice, so it may be that some other process has already started on the same port.

For the full logs, see: http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4055/artifact/


Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine systemd: Starting Session 8 of user root.
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: Traceback (most recent call last):
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/bin/ovirt-imageio-proxy", line 85, in <module>
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: status = image_proxy.main(args, config)
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib/python2.7/site-packages/ovirt_imageio_proxy/image_proxy.py", line 21, in main
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: image_server.start(config)
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib/python2.7/site-packages/ovirt_imageio_proxy/server.py", line 45, in start
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: WSGIRequestHandler)
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/SocketServer.py", line 419, in __init__
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: self.server_bind()
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/wsgiref/simple_server.py", line 48, in server_bind
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: HTTPServer.server_bind(self)
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/BaseHTTPServer.py", line 108, in server_bind
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: SocketServer.TCPServer.server_bind(self)
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/SocketServer.py", line 430, in server_bind
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: self.socket.bind(self.server_address)
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/socket.py", line 224, in meth
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: return getattr(self._sock,name)(*args)
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: socket.error: [Errno 98] Address already in use
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine systemd: ovirt-imageio-proxy.service: main process exited, code=exited, status=1/FAILURE
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine systemd: Failed to start oVirt ImageIO Proxy.
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine systemd: Unit ovirt-imageio-proxy.service entered failed state.
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine systemd: ovirt-imageio-proxy.service failed.
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine systemd: ovirt-imageio-proxy.service holdoff time over, scheduling restart.
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine systemd: Starting oVirt ImageIO Proxy...
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: Traceback (most recent call last):
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/bin/ovirt-imageio-proxy", line 85, in <module>
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: status = image_proxy.main(args, config)
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib/python2.7/site-packages/ovirt_imageio_proxy/image_proxy.py", line 21, in main
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: image_server.start(config)
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib/python2.7/site-packages/ovirt_imageio_proxy/server.py", line 45, in start
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: WSGIRequestHandler)
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/SocketServer.py", line 419, in __init__
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: self.server_bind()
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/wsgiref/simple_server.py", line 48, in server_bind
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: HTTPServer.server_bind(self)
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/BaseHTTPServer.py", line 108, in server_bind
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: SocketServer.TCPServer.server_bind(self)
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/SocketServer.py", line 430, in server_bind
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: self.socket.bind(self.server_address)
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/socket.py", line 224, in meth
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: return getattr(self._sock,name)(*args)
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: socket.error: [Errno 98] Address already in use
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: ovirt-imageio-proxy.service: main process exited, code=exited, status=1/FAILURE
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: Failed to start oVirt ImageIO Proxy.
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: Unit ovirt-imageio-proxy.service entered failed state.
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: ovirt-imageio-proxy.service failed.
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: ovirt-imageio-proxy.service holdoff time over, scheduling restart.
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: start request repeated too quickly for ovirt-imageio-proxy.service
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: Failed to start oVirt ImageIO Proxy.
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: Unit ovirt-imageio-proxy.service entered failed state.
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: ovirt-imageio-proxy.service failed.

Comment 1 Daniel Erez 2017-11-27 13:05:09 UTC
* I couldn't find the proxy log in jenkins - can you please attach it?
* Can you please check that no other process is using port 54323?
* Try to change the proxy's port to see if it mitigates the issue. The configuration file is located at: /etc/ovirt-imageio-proxy/ovirt-imageio-proxy.conf
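
For illustration only - a minimal sketch (not part of the original comment) of checking whether something already holds the proxy port, by simply trying to bind it:

======================================================
#!/usr/bin/python
# Illustration only: report whether port 54323 can currently be bound,
# i.e. whether another socket already occupies it.
import errno
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.bind(("0.0.0.0", 54323))
except socket.error as e:
    if e.errno == errno.EADDRINUSE:
        print("port 54323 is already in use")
    else:
        raise
else:
    print("port 54323 is free")
finally:
    s.close()
======================================================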

Comment 2 Dafna Ron 2017-11-27 13:40:39 UTC
The lago environments are created and deleted for the specific patch being tested, and we have no access to them once the job is done.

(In reply to Daniel Erez from comment #1)
> * I couldn't find the proxy log in jenkins, can you please attach it?

The logs are collected automatically at the end of the test, so if this one does not exist, it is related to when you tell the logs to be created - you can change the code to add further debugging.


> * Can you please check that no other process is using port 54323

Sorry, we no longer have the environment and cannot reproduce since it's random. However, if you add a patch that checks what is occupying the port, we can give you that info once it's available.


> * Try to change the proxy's port to see if it mitigates the issue. The
> configuration file is located at:
> /etc/ovirt-imageio-proxy/ovirt-imageio-proxy.conf

Not possible for us :) 

Adding Nir and Didi since they have been commenting on the list

Comment 3 Yedidyah Bar David 2017-11-27 14:06:51 UTC
(In reply to Dafna Ron from comment #2)
> The lago environments are created and deleted for the specific patch tested
> and we have no access to the environments once the job is done. 
> 
> (In reply to Daniel Erez from comment #1)
> > * I couldn't find the proxy log in jenkins, can you please attach it?
> 
> The logs are collected automatically at the end of the test so if it does
> not exist than this is something that is related to when you guys tell the
> logs to be created - you can change the code to add further debugging.  
> 

I think they are actually collected, but are very hard to find until you have a good understanding of the tree structure (which personally took me many months to start to understand). For now I pushed a patch to also archive them all [1]; if it fails again, IMO we should attach the result to the current bug.

[1] https://gerrit.ovirt.org/84716

> 
> > * Can you please check that no other process is using port 54323
> 
> sorry, we no longer have the environment and cannot reproduce since its
> random. however, if you add a patch that would check what is occupying the
> port we can give you that info once its available. 

Earlier today I merged patches to otopi that should run 'ss -anp' on failures, as part of the otopi-debug-plugins package. I pushed a patch [2] to ovirt-system-tests to install it, and am still fighting jenkins to make it work.

[2] https://gerrit.ovirt.org/84710

> 
> 
> > * Try to change the proxy's port to see if it mitigates the issue. The
> > configuration file is located at:
> > /etc/ovirt-imageio-proxy/ovirt-imageio-proxy.conf

The main issue here is that it's not fully reproducible on CI, so even if we change that, we won't know immediately whether it helps. I'd rather invest in patches that will give us hard data instead of mere guessing.

> 
> Not possible for us :) 
> 
> Adding Nir and Didi since they have been commenting on the list

Comment 4 Yedidyah Bar David 2017-11-29 08:03:28 UTC
Merged to otopi master branch patches to add a debug plugin to list network connections on failure.

Filed bz 1517764 to have this in 4.1, as I was told we see similar failures there too.

Merged to ovirt-system-tests a patch to install otopi-debug-plugins.

Comment 5 Daniel Erez 2017-11-29 13:20:44 UTC
(In reply to Yedidyah Bar David from comment #4)
> Merged to otopi master branch patches to add a debug plugin to list network
> connections on failure.

@Didi - so where can I see the relevant error logs now?

> 
> Filed bz 1517764 to have this in 4.1, as I was told we see similar failures
> there too.
> 
> Merged to ovirt-system-tests a patch to install otopi-debug-plugins.

Comment 6 Yedidyah Bar David 2017-11-29 13:32:41 UTC
(In reply to Daniel Erez from comment #5)
> (In reply to Yedidyah Bar David from comment #4)
> > Merged to otopi master branch patches to add a debug plugin to list network
> > connections on failure.
> 
> @Didi - so where can I see the relevant error logs now?

In the relevant log - engine-setup or host-deploy.

> 
> > 
> > Filed bz 1517764 to have this in 4.1, as I was told we see similar failures
> > there too.
> > 
> > Merged to ovirt-system-tests a patch to install otopi-debug-plugins.

This patch was merged only this morning, so only failures following it will have the plugin.

Comment 7 Allon Mureinik 2017-11-29 15:44:43 UTC
We haven't been able to reproduce this outside of OST. Retargeting to 4.2.1 so as to stop blocking 4.2's GA.

Comment 8 Daniel Erez 2017-12-11 10:11:56 UTC
(In reply to Yedidyah Bar David from comment #6)
> (In reply to Daniel Erez from comment #5)
> > (In reply to Yedidyah Bar David from comment #4)
> > > Merged to otopi master branch patches to add a debug plugin to list network
> > > connections on failure.
> > 
> > @Didi - so where can I see the relevant error logs now?
> 
> In the relevant log - engine-setup or host-deploy.
> 
> > 
> > > 
> > > Filed bz 1517764 to have this in 4.1, as I was told we see similar failures
> > > there too.
> > > 
> > > Merged to ovirt-system-tests a patch to install otopi-debug-plugins.
> 
> This patch was merged only this morning, so only failures following it will
> have the plugin.

@Dafna - How can I find a recent jenkins test that failed on this issue?

Comment 10 Daniel Erez 2017-12-11 12:37:06 UTC
(In reply to Yedidyah Bar David from comment #9)
> Last one I personally looked at - on 4.1, no otopi plugin to help see what
> used the address - is [1], specifically [2]. It's almost 5 days old, now
> pressed 'Keep this build forever' so that you can have a look. Inside it,
> you can check e.g. [3][4].
> 
> [1] http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4313/
> [2]
> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4313/artifact/
> exported-artifacts/upgrade-from-release-suit-master-el7/test_logs/upgrade-
> from-release-suite-master/post-001_initialize_engine.py/lago-upgrade-from-
> release-suite-master-engine/_var_log/
> [3]
> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4313/artifact/
> exported-artifacts/upgrade-from-release-suit-master-el7/test_logs/upgrade-
> from-release-suite-master/post-001_initialize_engine.py/lago-upgrade-from-
> release-suite-master-engine/_var_log/ovirt-engine/setup/ovirt-engine-setup-
> 20171206090556-8ika3z.log
> [4]
> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4313/artifact/
> exported-artifacts/upgrade-from-release-suit-master-el7/test_logs/upgrade-
> from-release-suite-master/post-001_initialize_engine.py/lago-upgrade-from-
> release-suite-master-engine/_var_log/messages

@Didi - are you referring to the db connection error? [*]
Seems unrelated as it uses port 5432, while ovirt-imageio-proxy uses 54323


[*]
2017-12-06 09:06:07 DEBUG otopi.ovirt_engine_setup.engine_common.database database.getCredentials:1179 dbenv: {'OVESETUP_DWH_DB/database': 'ovirt_engine_history', 'OVESETUP_DWH_DB/host': 'localhost', 'OVESETUP_DWH_DB/port': 5432, 'OVESETUP_DWH_DB/securedHostValidation': False, 'OVESETUP_DWH_DB/secured': False, 'OVESETUP_DWH_DB/password': '**FILTERED**', 'OVESETUP_DWH_DB/user': 'ovirt_engine_history'}
2017-12-06 09:06:07 DEBUG otopi.ovirt_engine_setup.engine_common.database database.execute:212 Database: 'None', Statement: '
                select count(*) as count
                from pg_catalog.pg_tables
                where schemaname = 'public';
            ', args: {}
2017-12-06 09:06:07 DEBUG otopi.ovirt_engine_setup.engine_common.database database.execute:217 Creating own connection
2017-12-06 09:06:07 DEBUG otopi.ovirt_engine_setup.engine_common.database database.getCredentials:1211 database connection failed
Traceback (most recent call last):
  File "/usr/share/ovirt-engine/setup/ovirt_engine_setup/engine_common/database.py", line 1209, in getCredentials
    ] = self.isNewDatabase()
  File "/usr/share/ovirt-engine/setup/ovirt_engine_setup/engine_common/database.py", line 405, in isNewDatabase
    transaction=False,
  File "/usr/share/ovirt-engine/setup/ovirt_engine_setup/engine_common/database.py", line 226, in execute
    database=database,
  File "/usr/share/ovirt-engine/setup/ovirt_engine_setup/engine_common/database.py", line 160, in connect
    sslmode=sslmode,
  File "/usr/lib64/python2.7/site-packages/psycopg2/__init__.py", line 164, in connect
    conn = _connect(dsn, connection_factory=connection_factory, async=async)
OperationalError: could not connect to server: Connection refused
	Is the server running on host "localhost" (::1) and accepting
	TCP/IP connections on port 5432?
could not connect to server: Connection refused
	Is the server running on host "localhost" (127.0.0.1) and accepting
	TCP/IP connections on port 5432?

Comment 11 Yedidyah Bar David 2017-12-11 12:58:55 UTC
(In reply to Daniel Erez from comment #10)
> @Didi - are you referring to the db connection error? [*]
> Seems unrelated as it uses port 5432, while ovirt-imageio-proxy uses 54323

No, setup logs this and ignores it. Check the next one in the same setup log:

2017-12-06 09:07:24 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:926 execute-output: ('/bin/systemctl', 'start', 'ovirt-imageio-proxy.service') stderr:
Job for ovirt-imageio-proxy.service failed because the control process exited with error code. See "systemctl status ovirt-imageio-proxy.service" and "journalctl -xe" for details.

2017-12-06 09:07:24 DEBUG otopi.context context._executeMethod:142 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt_imageio_proxy/config.py", line 352, in _closeup
    state=state,
  File "/usr/share/otopi/plugins/otopi/services/systemd.py", line 141, in state
    service=name,
RuntimeError: Failed to start service 'ovirt-imageio-proxy'

And, from /var/log/messages ([4] above):

Dec  6 09:07:23 lago-upgrade-from-release-suite-master-engine systemd: Starting oVirt ImageIO Proxy...
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: Traceback (most recent call last):
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/bin/ovirt-imageio-proxy", line 85, in <module>
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: status = image_proxy.main(args, config)
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib/python2.7/site-packages/ovirt_imageio_proxy/image_proxy.py", line 21, in main
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: image_server.start(config)
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib/python2.7/site-packages/ovirt_imageio_proxy/server.py", line 45, in start
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: WSGIRequestHandler)
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/SocketServer.py", line 419, in __init__
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: self.server_bind()
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/wsgiref/simple_server.py", line 48, in server_bind
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: HTTPServer.server_bind(self)
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/BaseHTTPServer.py", line 108, in server_bind
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: SocketServer.TCPServer.server_bind(self)
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/SocketServer.py", line 430, in server_bind
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: self.socket.bind(self.server_address)
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/socket.py", line 224, in meth
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: return getattr(self._sock,name)(*args)
Dec  6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: socket.error: [Errno 98] Address already in use

However, I'm not sure what else you can dig up by checking the logs there - the proxy's log is empty. Eyal has now merged a patch to run 'ss -anp' after setup if it fails; let's see if it happens again and whether that's enough to see what used the address.

https://gerrit.ovirt.org/85228

Comment 12 Dafna Ron 2017-12-11 16:47:30 UTC
Seems like we have results from the patch:

http://jenkins.ovirt.org/job/ovirt-4.1_change-queue-tester/1432/testReport/junit/(root)/001_initialize_engine/test_initialize_engine/

Seems like we have:
tcp    TIME-WAIT  0      0       ::1:54323               ::1:5432

Is that what we would expect?

Comment 13 Yedidyah Bar David 2017-12-12 09:31:16 UTC
Not really, but it at least explains the failure.

I expected some process to LISTEN on this address.

TIME-WAIT is a normal state after a connection closes; it lasts 60 seconds by default. During these 60 seconds, the kernel prevents other processes from listening on the port.

In this specific case, the port was used on the client side of a connection to postgresql (likely by a psql command run during engine-setup). When a client connects to a server, the kernel allocates a so-called "ephemeral port" for it, from an OS-specific range - on Linux usually 32768 to 61000.

Inside this range, AFAICT, ports are allocated one by one, so if we reached 54323 it means we had around 54323-32768=21555 connections since the (lago engine) machine started, which looks like quite a lot to me.
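
For illustration only - a minimal sketch (not part of the original comment) that reads the ephemeral range from /proc/sys/net/ipv4/ip_local_port_range (the same path listed in the ideas below) and reports whether our ports fall inside it:

======================================================
#!/usr/bin/python
# Illustration only: check whether the oVirt ports fall inside the
# kernel's ephemeral (local) port range used for outgoing connections.

def ephemeral_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    with open(path) as f:
        low, high = f.read().split()
    return int(low), int(high)

low, high = ephemeral_range()
for port in (54321, 54322, 54323):
    where = "inside" if low <= port <= high else "outside"
    print("port %d is %s the ephemeral range %d-%d" % (port, where, low, high))
======================================================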

- How come this started only recently, and only on CI?

I have no idea. Perhaps a change in CI that uses many more connections in initial stages.

- What do we do now?

Some ideas I can think of:

1. Somehow make CI not fail on this. E.g.

- Lower the time - /proc/sys/net/ipv4/tcp_fin_timeout
- Change the range to not include this port - /proc/sys/net/ipv4/ip_local_port_range
- Find what's using so many connections and lower that

The first two each have pros and cons; I'm not sure which is the best way.

2. Somehow lower the number of connections in engine-setup, which mainly come from running many small psql commands on small files. In principle we can combine them into a few batches. This would also help performance - by my measurements from a few weeks ago (recalling from memory), around 5-6 seconds less in the best case (for engine-setup).

3. Change the port used by imageio-proxy (and others):

- As a first step, move to something outside the default ephemeral port range of common OSes (or at least Linux)

- In the longer run, register ports for all of our network-facing services (including imageio-*, vdsm, qemu (spice), etc.) at IANA.

Comment 14 Dafna Ron 2017-12-12 10:28:39 UTC
I vote for #3.

Comment 15 Yedidyah Bar David 2017-12-12 10:49:57 UTC
#3 is the natural choice, but it has its costs - it will have to be handled very carefully, especially upgrades. So it's unlikely to happen in 4.2.

Comment 16 Yedidyah Bar David 2017-12-12 11:24:50 UTC
4. setsockopt(SO_REUSEADDR)

Comment 17 Dafna Ron 2017-12-12 12:03:27 UTC
So we open a bug for OST to fix the test (setsockopt(SO_REUSEADDR) would do that, right?), and use this bug as an RFE to register ports for all our network-facing services in the long run?

Comment 18 Nir Soffer 2017-12-12 12:07:02 UTC
(In reply to Yedidyah Bar David from comment #16)
> 4. setsockopt(SO_REUSEADDR)

The Python builtin HTTP server (which we use) is already doing this:

    class HTTPServer(SocketServer.TCPServer):
        allow_reuse_address = 1
        ...

This is used in TCPServer.server_bind to set the SO_REUSEADDR bit.

    class TCPServer(BaseServer):
        ...
        def server_bind(self):
            ...
            if self.allow_reuse_address:
                self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

I also verified in the past that this SO_REUSEADDR bit is set on our socket.
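
For illustration only - a minimal sketch (not part of the original comment) of one way to verify that, using the same stdlib server machinery (port 0 asks the kernel for any free port):

======================================================
#!/usr/bin/python
# Illustration only: confirm that the stdlib HTTP server sets
# SO_REUSEADDR on its listening socket.
import socket
from wsgiref.simple_server import make_server

def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

server = make_server("127.0.0.1", 0, app)
flag = server.socket.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
print("SO_REUSEADDR on %s:%d -> %s" % (server.server_address + (bool(flag),)))
server.server_close()
======================================================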

I agree that #3 is the right way to fix this.

According to https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.txt:

    8009-8018            Unassigned

I suggest registering:

vdsm: 8010
ovirt-imageio-daemon: 8011
ovirt-imageio-proxy: 8012

Changing the vdsm port can be a big issue, breaking migration between new and old
machines, but we can listen on both 8010 and 54321, passing connections to the
actual server.
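
For illustration only - a minimal sketch (not part of the original comment) of the dual-listen idea: the same handler accepting connections on both the legacy and the proposed port. The port numbers are the ones suggested above; the handler is just a placeholder.

======================================================
#!/usr/bin/python3
# Illustration only: serve the same placeholder handler on both the
# legacy port (54321) and the proposed new port (8010), so old and new
# clients keep working during a transition.
import socketserver
import threading

class Handler(socketserver.StreamRequestHandler):
    def handle(self):
        # A real service would dispatch to the actual server here.
        port = self.server.server_address[1]
        self.wfile.write(("hello from port %d\n" % port).encode())

servers = []
for port in (54321, 8010):
    srv = socketserver.ThreadingTCPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    servers.append(srv)

input("listening on 54321 and 8010, press Enter to stop\n")
for srv in servers:
    srv.shutdown()
    srv.server_close()
======================================================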

Changing the daemon port is an issue because engine assumes the daemon port. We can
solve it in the same way by listening on both ports, or by publishing the daemon port
in vdsm getCapabilities.

Changing the proxy port is easier because it is used internally by engine.

All of these changes may require firewall changes in the user environment; this is
the real issue.

Dan, Piotr, Francesco, what do you think?

Comment 19 Piotr Kliczewski 2017-12-12 12:19:06 UTC
(In reply to Nir Soffer from comment #18)
> (In reply to Yedidyah Bar David from comment #16)
> > 4. setsockopt(SO_REUSEADDR)
> 
> (In reply to Yedidyah Bar David from comment #16)
> > 4. setsockopt(SO_REUSEADDR)
> 
> Python builtin http server (which we use) is already doing this:
> 
>     class HTTPServer(SocketServer.TCPServer):                               
> 
>         allow_reuse_address = 1
>         ...
> 
> This is used in TCPServer.server_bind to set the SO_REUSEADDR bit.
> 
>     class TCPServer(BaseServer):
>         ...
>         def server_bind(self):
>             ...
>             if self.allow_reuse_address:
>                 self.socket.setsockopt(socket.SOL_SOCKET,
> socket.SO_REUSEADDR, 1)
> 
> I also verified in the past that this SO_REUSADDR bit is set on our socket.
> 
> I agree that #3 is the right way to fix this.
> 
> According to
> https://www.iana.org/assignments/service-names-port-numbers/service-names-
> port-numbers.txt:
> 
>     8009-8018            Unassigned
> 
> I suggest to register:
> 
> vdsm: 8010
> ovirt-imageio-daemon: 8011
> ovirt-imageio-proxy: 8012
> 
> Changing vdsm port can be a big issue, breaking migration between new and old
> machines, but we can listen on both 8010 and 54321, passing connections to
> the 
> actual sever.

We introduced the protocol detector because we did not want to change the vdsm port number. VM migration uses config.py to get the port value, which means that both vdsm hosts need to use the same port number. We made it "configurable", but changing it would break many things. Exposing a second port could help solve it, but I am not sure when we would be able to stop using the current port number.

> 
> Changing daemon port is an issue because engine assume the daemon port. We
> can 
> solve it in the same way by listening to both ports, or publish the daemon
> port
> in vdsm getCapabillities.
> 
> Changing proxy port is easier because it is used internally from engine.

+1

> 
> All changes may require changes in firewalls in the user environment, this
> is 
> the real issue.
> 
> Dan, Piotr, Francesco, what do you think?

Comment 20 Piotr Kliczewski 2017-12-12 12:19:58 UTC
Restoring needinfos

Comment 21 Yedidyah Bar David 2017-12-12 16:01:55 UTC
Patch does not work, moving back to NEW.

Verified that this works:

echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle

But it's considered risky, judging by the kernel documentation for it.

Perhaps it's not too risky to use in CI.

Keeping needinfos.

Comment 22 Yedidyah Bar David 2017-12-12 16:07:52 UTC
BTW, the TIME-WAIT ports in at least [1] look weird to me, and I could not reproduce them easily. Compare for example [2] (from the same build).

For 'connect-from-random-source-port', the kernel tries to use even port numbers from within the range. It keeps the odd numbers for 'listen-on-random-port'.

It's easy to connect from a specific port, so it's easy to cause a "reproduction" of the current bug, by:

======================================================
#!/usr/bin/python

import socket

# Connect out using 54323 as the local (source) port, then close; the closed
# connection stays in TIME-WAIT, so the proxy cannot bind 54323 afterwards.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('0.0.0.0', 54323))     # claim the proxy port as our source port
s.connect(('127.0.0.1', 80))   # connect to any local listener
s.close()                      # leaves the socket in TIME-WAIT
======================================================

However, I tried to cause it to happen "at random", by running a loop, and it seems like the kernel tries quite hard to not use odd numbers.

So there is likely something a bit weird going on there.

[1] http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2742/artifact/exported-artifacts/upgrade-from-release-suite-master__logs/lago_logs/lago.log

[2] http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2742/artifact/exported-artifacts/basic-suite-master__logs/lago_logs/lago.log

Comment 23 Nir Soffer 2017-12-12 16:13:21 UTC
(In reply to Yedidyah Bar David from comment #22)
> For 'connect-from-random-source-port', the kernel tries to use even port
> numbers from within the range. It keeps the odd numbers for
> 'listen-on-random-port'.

This means ovirt-imageio-daemon port (54322) is even worse :-)

I guess we don't have any issue in the field yet because vdsm starts the
daemon very early during boot, and nobody restarts it, so the chance of another
process using this port is very low.

Comment 24 Daniel Erez 2017-12-19 10:39:43 UTC
The long-term solution should indeed be changing the ports of imageio-proxy, imageio-daemon and vdsm, as the issue affects all of these modules. Since that is not feasible for 4.2 (it requires port registration at IANA [1], has backwards compatibility issues, etc.), I pushed a patch to OST to reserve the port before starting the engine.

[1] https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.txt

Comment 25 Yedidyah Bar David 2017-12-19 10:48:46 UTC
(In reply to Daniel Erez from comment #24)
> The long-term solution should be indeed changing the ports of imageio-proxy,
> imageio-daemon and vdsm, as the issue affects all of these modules. Since
> it's not feasible for 4.2 (requires port registration in IANA [1], backwards
> compatibility issues, etc), pushed a patch to OST for reserving the port
> before starting the engine.

Thanks.

Since we already got CI failures for ovirt-imageio-daemon, with the same error message and likely for very similar reasons, perhaps patch OST to reserve its port as well.

Comment 26 Daniel Erez 2017-12-19 13:06:37 UTC
(In reply to Yedidyah Bar David from comment #25)
> (In reply to Daniel Erez from comment #24)
> > The long-term solution should be indeed changing the ports of imageio-proxy,
> > imageio-daemon and vdsm, as the issue affects all of these modules. Since
> > it's not feasible for 4.2 (requires port registration in IANA [1], backwards
> > compatibility issues, etc), pushed a patch to OST for reserving the port
> > before starting the engine.
> 
> Thanks.
> 
> Since we already got CI failures on the ovirt-imageio-daemon, with the same
> error message, likely for very similar reasons, perhaps patch OST to reserve
> also its port.

Sure. Is the port also used by postgres? Because I'm not sure whether I can put it in the same script (setup_engine.sh) or it needs to go somewhere before OST starts vdsm. Do you have an example of a failure log?

Comment 27 Allon Mureinik 2017-12-19 13:19:46 UTC
(In reply to Daniel Erez from comment #26)
> (In reply to Yedidyah Bar David from comment #25)
> > (In reply to Daniel Erez from comment #24)
> > > The long-term solution should be indeed changing the ports of imageio-proxy,
> > > imageio-daemon and vdsm, as the issue affects all of these modules. Since
> > > it's not feasible for 4.2 (requires port registration in IANA [1], backwards
> > > compatibility issues, etc), pushed a patch to OST for reserving the port
> > > before starting the engine.
> > 
> > Thanks.
> > 
> > Since we already got CI failures on the ovirt-imageio-daemon, with the same
> > error message, likely for very similar reasons, perhaps patch OST to reserve
> > also its port.
> 
> Sure. Is the port also used by postgres? Cause I'm not sure if I can put it
> in the same script (setup_engine.sh), or somewhere before OST is starting
> vdsm. Do you have an example of a failure log?

The attached patch doesn't fix anything in imageio.
Please either move this bug to OST (where the fix actually is!), or, if you intend to have a longer term solution, clone it to track that.

Comment 28 Yedidyah Bar David 2017-12-19 13:26:03 UTC
(In reply to Daniel Erez from comment #26)
> Sure. Is the port also used by postgres?

It's not used by pg, and it also was not used by pg in the proxy's case.

In the specific failure we diagnosed, it was used by psql as a random source port, which is _random_ - so anything could use it, including e.g. wget/curl or whatever else we run during OST.

> Cause I'm not sure if I can put it
> in the same script (setup_engine.sh), or somewhere before OST is starting
> vdsm.

I guess you need the latter, but I'm not sure where exactly either.

> Do you have an example of a failure log?

http://lists.ovirt.org/pipermail/devel/2017-November/031821.html

Comment 29 Yedidyah Bar David 2017-12-19 13:28:43 UTC
(In reply to Allon Mureinik from comment #27)
> The attached patch doesn't fix anything in imageio.
> Please either move this bug to OST (where the fix actually is!),

The bug is in imageio; the patch to OST is not a fix but a mere workaround.

Real users can in principle run into this bug as well, even though it's not very likely.

> or, if you
> intend to have a longer term solution, clone it to track that.

Not sure we need a bug on OST for this; perhaps 'Related-To' is enough to give context there. As you wish.

Comment 30 Tal Nisan 2018-09-16 12:15:25 UTC
Sandro, please target this - I don't know which version it should be in.

Comment 31 Sandro Bonazzola 2018-09-20 05:52:40 UTC
(In reply to Tal Nisan from comment #30)
> Sandro, please target, I don't know which version it should be in

This is ovirt-system-tests; as far as I know, it doesn't follow any release schema.
Just check whether it's working in the currently running jenkins OST jobs.

Comment 32 Nir Soffer 2020-01-14 22:40:35 UTC
This is a random failure in OST. Nobody has reported any issue for more than a year,
so the patches mentioned in this bug probably fixed it.

Comment 33 Nir Soffer 2020-05-15 17:11:26 UTC
Reopening since we have new reports of this issue.

Comment 34 Nir Soffer 2020-05-15 17:15:15 UTC
Moving to ovirt-engine, since the proxy is configured by ovirt-engine
and ovirt-engine is responsible for making the proxy port available.

Comment 35 Nir Soffer 2020-05-15 18:30:21 UTC
Errors seen recently in OST in Michal's setup (dev-09.rhev.lab.eng.brq):

$ cd ost/ovirt-system-tests/test_logs/basic-suite-master/post-001_initialize_engine_pytest.py

$ grep -rn 'OSError: \[Errno 98\] Address already in use' .
./lago-basic-suite-master-engine/_var_log/ovirt-imageio/daemon.log:16:OSError: [Errno 98] Address already in use
./lago-basic-suite-master-engine/_var_log/ovirt-imageio/daemon.log:32:OSError: [Errno 98] Address already in use
./lago-basic-suite-master-engine/_var_log/ovirt-imageio/daemon.log:48:OSError: [Errno 98] Address already in use
./lago-basic-suite-master-engine/_var_log/ovirt-imageio/daemon.log:64:OSError: [Errno 98] Address already in use
./lago-basic-suite-master-engine/_var_log/ovirt-imageio/daemon.log:80:OSError: [Errno 98] Address already in use


2020-05-14 20:38:22,737 ERROR   (MainThread) [server] Server failed
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 37, in main
    server = Server(cfg)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 83, in __init__
    self.control_service = services.ControlService(self.config, self.auth)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/services.py", line 148, in __init__
    self._server = http.Server(("localhost", port), http.Connection)
  File "/usr/lib64/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib64/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib64/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

2020-05-14 20:38:23,091 ERROR   (MainThread) [server] Server failed
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 37, in main
    server = Server(cfg)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 83, in __init__
    self.control_service = services.ControlService(self.config, self.auth)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/services.py", line 148, in __init__
    self._server = http.Server(("localhost", port), http.Connection)
  File "/usr/lib64/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib64/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib64/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

2020-05-14 20:38:23,492 ERROR   (MainThread) [server] Server failed
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 37, in main
    server = Server(cfg)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 83, in __init__
    self.control_service = services.ControlService(self.config, self.auth)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/services.py", line 148, in __init__
    self._server = http.Server(("localhost", port), http.Connection)
  File "/usr/lib64/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib64/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib64/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

2020-05-14 20:38:23,737 ERROR   (MainThread) [server] Server failed
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 37, in main
    server = Server(cfg)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 83, in __init__
    self.control_service = services.ControlService(self.config, self.auth)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/services.py", line 148, in __init__
    self._server = http.Server(("localhost", port), http.Connection)
  File "/usr/lib64/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib64/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib64/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

2020-05-14 20:38:24,114 ERROR   (MainThread) [server] Server failed
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 37, in main
    server = Server(cfg)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 83, in __init__
    self.control_service = services.ControlService(self.config, self.auth)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/services.py", line 148, in __init__
    self._server = http.Server(("localhost", port), http.Connection)
  File "/usr/lib64/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib64/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib64/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

In all cases the failure is on the control port (54324). We know that even
port numbers are more problematic, as they are used by the kernel as the
source port for connected sockets.

This also explains why we see this error now - ovirt-imageio 2.0.5 listens
on 2 ports: 54323 and 54324. In the past we listened only on 54323. We don't
know why we had this issue back when we used only port 54323, or why
the issue disappeared.

A minimal fix is to change the control port to 54325. This does not 
ensure that the port will be available but makes it less likely to 
be used.

A more robust fix is to change the port to 64323, which is not in the dynamic
port range on Linux (32768-61000). But an admin can change the dynamic port range
using sysctl.

The best solution is to reserve imageio ports, ensuring that both ports
are always available. This is what we do in vdsm:
https://github.com/oVirt/vdsm/blob/6f5df19a96d971b92e4886330c1b856085022352/static/etc/sysctl.d/vdsm.conf#L16
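
For illustration only - a minimal sketch (not part of the original comment) that checks whether the imageio ports discussed here are already excluded from ephemeral allocation via the net.ipv4.ip_local_reserved_ports sysctl, which I assume is the mechanism the vdsm sysctl config linked above uses:

======================================================
#!/usr/bin/python
# Illustration only: parse /proc/sys/net/ipv4/ip_local_reserved_ports
# (a comma-separated list of ports and port ranges) and report whether
# the imageio ports are reserved.

def reserved_ports(path="/proc/sys/net/ipv4/ip_local_reserved_ports"):
    with open(path) as f:
        value = f.read().strip()
    ports = set()
    for item in value.split(",") if value else []:
        if "-" in item:
            low, high = item.split("-")
            ports.update(range(int(low), int(high) + 1))
        else:
            ports.add(int(item))
    return ports

reserved = reserved_ports()
for port in (54322, 54323, 54324):
    print("port %d reserved: %s" % (port, port in reserved))
======================================================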

Comment 36 Nir Soffer 2020-05-15 19:16:01 UTC
Michal, since we reproduced this issue only on your setup, can you try
this build and report if the issue is resolved?
https://jenkins.ovirt.org/job/ovirt-engine_standard-check-patch/5884/

Comment 37 Sandro Bonazzola 2020-05-18 14:55:06 UTC
Moving out to 4.4.1, as this was not approved as a blocker for 4.4.0 and we are preparing for GA.

Comment 38 Avihai 2020-06-02 06:22:25 UTC
Nir, we do not have any way of verifying this bug, as QE is not using OST and this was also seen only in a single setup.

Please provide a verification scenario for QE, or close this bug.

Comment 39 RHEL Program Management 2020-06-04 06:39:11 UTC
Testing has indicated this request is declined. You may appeal this decision by reopening this request.

Comment 40 Nir Soffer 2020-06-04 18:09:00 UTC
Changing to CURRENTRELEASE since this is fixed in ovirt-imageio-2.0.

