Description of problem:

We are seeing random failures in OST for the test 001_initialize_engine.test_initialize_engine. The error seems to come from ovirt-imageio-proxy:

socket.error: Address already in use

We do not see ovirt-imageio-proxy trying to start twice, so it may be that some other process has already bound the same port.

For full logs see: http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4055/artifact/

Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine systemd: Starting Session 8 of user root.
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: Traceback (most recent call last):
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/bin/ovirt-imageio-proxy", line 85, in <module>
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: status = image_proxy.main(args, config)
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib/python2.7/site-packages/ovirt_imageio_proxy/image_proxy.py", line 21, in main
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: image_server.start(config)
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib/python2.7/site-packages/ovirt_imageio_proxy/server.py", line 45, in start
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: WSGIRequestHandler)
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/SocketServer.py", line 419, in __init__
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: self.server_bind()
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/wsgiref/simple_server.py", line 48, in server_bind
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: HTTPServer.server_bind(self)
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/BaseHTTPServer.py", line 108, in server_bind
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: SocketServer.TCPServer.server_bind(self)
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/SocketServer.py", line 430, in server_bind
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: self.socket.bind(self.server_address)
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/socket.py", line 224, in meth
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: return getattr(self._sock,name)(*args)
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: socket.error: [Errno 98] Address already in use
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine systemd: ovirt-imageio-proxy.service: main process exited, code=exited, status=1/FAILURE
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine systemd: Failed to start oVirt ImageIO Proxy.
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine systemd: Unit ovirt-imageio-proxy.service entered failed state.
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine systemd: ovirt-imageio-proxy.service failed.
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine systemd: ovirt-imageio-proxy.service holdoff time over, scheduling restart.
Nov 23 07:30:47 lago-upgrade-from-release-suite-master-engine systemd: Starting oVirt ImageIO Proxy...
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: Traceback (most recent call last):
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/bin/ovirt-imageio-proxy", line 85, in <module>
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: status = image_proxy.main(args, config)
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib/python2.7/site-packages/ovirt_imageio_proxy/image_proxy.py", line 21, in main
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: image_server.start(config)
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib/python2.7/site-packages/ovirt_imageio_proxy/server.py", line 45, in start
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: WSGIRequestHandler)
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/SocketServer.py", line 419, in __init__
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: self.server_bind()
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/wsgiref/simple_server.py", line 48, in server_bind
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: HTTPServer.server_bind(self)
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/BaseHTTPServer.py", line 108, in server_bind
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: SocketServer.TCPServer.server_bind(self)
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/SocketServer.py", line 430, in server_bind
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: self.socket.bind(self.server_address)
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/socket.py", line 224, in meth
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: return getattr(self._sock,name)(*args)
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: socket.error: [Errno 98] Address already in use
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: ovirt-imageio-proxy.service: main process exited, code=exited, status=1/FAILURE
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: Failed to start oVirt ImageIO Proxy.
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: Unit ovirt-imageio-proxy.service entered failed state.
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: ovirt-imageio-proxy.service failed.
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: ovirt-imageio-proxy.service holdoff time over, scheduling restart.
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: start request repeated too quickly for ovirt-imageio-proxy.service
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: Failed to start oVirt ImageIO Proxy.
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: Unit ovirt-imageio-proxy.service entered failed state.
Nov 23 07:30:48 lago-upgrade-from-release-suite-master-engine systemd: ovirt-imageio-proxy.service failed.
* I couldn't find the proxy log in jenkins, can you please attach it?
* Can you please check that no other process is using port 54323.
* Try to change the proxy's port to see if it mitigates the issue. The configuration file is located at: /etc/ovirt-imageio-proxy/ovirt-imageio-proxy.conf
The lago environments are created and deleted for the specific patch tested, and we have no access to the environments once the job is done.

(In reply to Daniel Erez from comment #1)
> * I couldn't find the proxy log in jenkins, can you please attach it?

The logs are collected automatically at the end of the test, so if it does not exist then it depends on when you tell the logs to be created - you can change the code to add further debugging.

> * Can you please check that no other process is using port 54323.

Sorry, we no longer have the environment and cannot reproduce since it's random. However, if you add a patch that checks what is occupying the port, we can give you that info once it's available.

> * Try to change the proxy's port to see if it mitigates the issue. The
> configuration file is located at:
> /etc/ovirt-imageio-proxy/ovirt-imageio-proxy.conf

Not possible for us :)

Adding Nir and Didi since they have been commenting on the list.
(In reply to Dafna Ron from comment #2)
> The lago environments are created and deleted for the specific patch tested,
> and we have no access to the environments once the job is done.
>
> (In reply to Daniel Erez from comment #1)
> > * I couldn't find the proxy log in jenkins, can you please attach it?
>
> The logs are collected automatically at the end of the test, so if it does
> not exist then it depends on when you tell the logs to be created - you can
> change the code to add further debugging.

I think they are actually collected, but are very hard to find until you have a good understanding of the tree structure (which personally took me many months to start to understand). Pushed for now a patch to also archive them all [1]; if it fails again, IMO we should attach the result to the current bug.

[1] https://gerrit.ovirt.org/84716

> > * Can you please check that no other process is using port 54323.
>
> Sorry, we no longer have the environment and cannot reproduce since it's
> random. However, if you add a patch that checks what is occupying the
> port, we can give you that info once it's available.

Merged earlier today patches to otopi that should run 'ss -anp' on failures, as part of the package otopi-debug-plugins. Pushed to ovirt-system-tests a patch [2] to install that, still fighting jenkins to make it work.

[2] https://gerrit.ovirt.org/84710

> > * Try to change the proxy's port to see if it mitigates the issue. The
> > configuration file is located at:
> > /etc/ovirt-imageio-proxy/ovirt-imageio-proxy.conf

The main issue here is that it's not fully reproducible on CI. So even if we change that, we won't know immediately whether it helps. I'd rather invest in patches that give us more hard data than mere guessing.

> Not possible for us :)
>
> Adding Nir and Didi since they have been commenting on the list.
Merged to otopi master branch patches to add a debug plugin to list network connections on failure. Filed bz 1517764 to have this in 4.1, as I was told we see similar failures there too. Merged to ovirt-system-tests a patch to install otopi-debug-plugins.
(In reply to Yedidyah Bar David from comment #4)
> Merged to otopi master branch patches to add a debug plugin to list network
> connections on failure.

@Didi - so where can I see the relevant error logs now?

> Filed bz 1517764 to have this in 4.1, as I was told we see similar failures
> there too.
>
> Merged to ovirt-system-tests a patch to install otopi-debug-plugins.
(In reply to Daniel Erez from comment #5)
> (In reply to Yedidyah Bar David from comment #4)
> > Merged to otopi master branch patches to add a debug plugin to list network
> > connections on failure.
>
> @Didi - so where can I see the relevant error logs now?

In the relevant log - engine-setup or host-deploy.

> > Filed bz 1517764 to have this in 4.1, as I was told we see similar failures
> > there too.
> >
> > Merged to ovirt-system-tests a patch to install otopi-debug-plugins.

This patch was merged only this morning, so only failures following it will have the plugin.
We haven't been able to reproduce this outside of OST. Retargeting to 4.2.1 so as to stop blocking 4.2's GA.
(In reply to Yedidyah Bar David from comment #6)
> (In reply to Daniel Erez from comment #5)
> > (In reply to Yedidyah Bar David from comment #4)
> > > Merged to otopi master branch patches to add a debug plugin to list network
> > > connections on failure.
> >
> > @Didi - so where can I see the relevant error logs now?
>
> In the relevant log - engine-setup or host-deploy.
>
> > > Filed bz 1517764 to have this in 4.1, as I was told we see similar failures
> > > there too.
> > >
> > > Merged to ovirt-system-tests a patch to install otopi-debug-plugins.
>
> This patch was merged only this morning, so only failures following it will
> have the plugin.

@Dafna - How can I find a recent jenkins test that failed on this issue?
Last one I personally looked at - on 4.1, no otopi plugin to help see what used the address - is [1], specifically [2]. It's almost 5 days old; I now pressed 'Keep this build forever' so that you can have a look. Inside it, you can check e.g. [3][4].

[1] http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4313/
[2] http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4313/artifact/exported-artifacts/upgrade-from-release-suit-master-el7/test_logs/upgrade-from-release-suite-master/post-001_initialize_engine.py/lago-upgrade-from-release-suite-master-engine/_var_log/
[3] http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4313/artifact/exported-artifacts/upgrade-from-release-suit-master-el7/test_logs/upgrade-from-release-suite-master/post-001_initialize_engine.py/lago-upgrade-from-release-suite-master-engine/_var_log/ovirt-engine/setup/ovirt-engine-setup-20171206090556-8ika3z.log
[4] http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4313/artifact/exported-artifacts/upgrade-from-release-suit-master-el7/test_logs/upgrade-from-release-suite-master/post-001_initialize_engine.py/lago-upgrade-from-release-suite-master-engine/_var_log/messages
(In reply to Yedidyah Bar David from comment #9)
> Last one I personally looked at - on 4.1, no otopi plugin to help see what
> used the address - is [1], specifically [2]. It's almost 5 days old; I now
> pressed 'Keep this build forever' so that you can have a look. Inside it,
> you can check e.g. [3][4].
>
> [1] http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4313/
> [2] http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4313/artifact/exported-artifacts/upgrade-from-release-suit-master-el7/test_logs/upgrade-from-release-suite-master/post-001_initialize_engine.py/lago-upgrade-from-release-suite-master-engine/_var_log/
> [3] http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4313/artifact/exported-artifacts/upgrade-from-release-suit-master-el7/test_logs/upgrade-from-release-suite-master/post-001_initialize_engine.py/lago-upgrade-from-release-suite-master-engine/_var_log/ovirt-engine/setup/ovirt-engine-setup-20171206090556-8ika3z.log
> [4] http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4313/artifact/exported-artifacts/upgrade-from-release-suit-master-el7/test_logs/upgrade-from-release-suite-master/post-001_initialize_engine.py/lago-upgrade-from-release-suite-master-engine/_var_log/messages

@Didi - are you referring to the db connection error? [*]
Seems unrelated, as it uses port 5432, while ovirt-imageio-proxy uses 54323.

[*]
2017-12-06 09:06:07 DEBUG otopi.ovirt_engine_setup.engine_common.database database.getCredentials:1179 dbenv: {'OVESETUP_DWH_DB/database': 'ovirt_engine_history', 'OVESETUP_DWH_DB/host': 'localhost', 'OVESETUP_DWH_DB/port': 5432, 'OVESETUP_DWH_DB/securedHostValidation': False, 'OVESETUP_DWH_DB/secured': False, 'OVESETUP_DWH_DB/password': '**FILTERED**', 'OVESETUP_DWH_DB/user': 'ovirt_engine_history'}
2017-12-06 09:06:07 DEBUG otopi.ovirt_engine_setup.engine_common.database database.execute:212 Database: 'None', Statement: ' select count(*) as count from pg_catalog.pg_tables where schemaname = 'public'; ', args: {}
2017-12-06 09:06:07 DEBUG otopi.ovirt_engine_setup.engine_common.database database.execute:217 Creating own connection
2017-12-06 09:06:07 DEBUG otopi.ovirt_engine_setup.engine_common.database database.getCredentials:1211 database connection failed
Traceback (most recent call last):
  File "/usr/share/ovirt-engine/setup/ovirt_engine_setup/engine_common/database.py", line 1209, in getCredentials
    ] = self.isNewDatabase()
  File "/usr/share/ovirt-engine/setup/ovirt_engine_setup/engine_common/database.py", line 405, in isNewDatabase
    transaction=False,
  File "/usr/share/ovirt-engine/setup/ovirt_engine_setup/engine_common/database.py", line 226, in execute
    database=database,
  File "/usr/share/ovirt-engine/setup/ovirt_engine_setup/engine_common/database.py", line 160, in connect
    sslmode=sslmode,
  File "/usr/lib64/python2.7/site-packages/psycopg2/__init__.py", line 164, in connect
    conn = _connect(dsn, connection_factory=connection_factory, async=async)
OperationalError: could not connect to server: Connection refused
    Is the server running on host "localhost" (::1) and accepting TCP/IP connections on port 5432?
could not connect to server: Connection refused
    Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP connections on port 5432?
(In reply to Daniel Erez from comment #10)
> @Didi - are you referring to the db connection error? [*]
> Seems unrelated, as it uses port 5432, while ovirt-imageio-proxy uses 54323.

No, setup logs this and ignores it. Check the next one in the same setup log:

2017-12-06 09:07:24 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:926 execute-output: ('/bin/systemctl', 'start', 'ovirt-imageio-proxy.service') stderr: Job for ovirt-imageio-proxy.service failed because the control process exited with error code. See "systemctl status ovirt-imageio-proxy.service" and "journalctl -xe" for details.
2017-12-06 09:07:24 DEBUG otopi.context context._executeMethod:142 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt_imageio_proxy/config.py", line 352, in _closeup
    state=state,
  File "/usr/share/otopi/plugins/otopi/services/systemd.py", line 141, in state
    service=name,
RuntimeError: Failed to start service 'ovirt-imageio-proxy'

And, from /var/log/messages ([4] above):

Dec 6 09:07:23 lago-upgrade-from-release-suite-master-engine systemd: Starting oVirt ImageIO Proxy...
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: Traceback (most recent call last):
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/bin/ovirt-imageio-proxy", line 85, in <module>
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: status = image_proxy.main(args, config)
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib/python2.7/site-packages/ovirt_imageio_proxy/image_proxy.py", line 21, in main
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: image_server.start(config)
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib/python2.7/site-packages/ovirt_imageio_proxy/server.py", line 45, in start
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: WSGIRequestHandler)
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/SocketServer.py", line 419, in __init__
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: self.server_bind()
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/wsgiref/simple_server.py", line 48, in server_bind
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: HTTPServer.server_bind(self)
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/BaseHTTPServer.py", line 108, in server_bind
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: SocketServer.TCPServer.server_bind(self)
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/SocketServer.py", line 430, in server_bind
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: self.socket.bind(self.server_address)
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: File "/usr/lib64/python2.7/socket.py", line 224, in meth
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: return getattr(self._sock,name)(*args)
Dec 6 09:07:24 lago-upgrade-from-release-suite-master-engine ovirt-imageio-proxy: socket.error: [Errno 98] Address already in use

However, I'm not sure what else you can dig up by checking logs there - the proxy's log is empty. Eyal has now merged a patch to run 'ss -anp' after setup if it fails; let's see if it happens again and whether that's enough to see what used the address.

https://gerrit.ovirt.org/85228
Seems like we have results from the patch:
http://jenkins.ovirt.org/job/ovirt-4.1_change-queue-tester/1432/testReport/junit/(root)/001_initialize_engine/test_initialize_engine/

It looks like we have:

tcp TIME-WAIT 0 0 ::1:54323 ::1:5432

Is that what we would expect?
Not really, but it at least explains the failure. I expected some process to LISTEN on this address.

TIME-WAIT is a normal state after a connection closes, which lasts by default 60 seconds. During these 60 seconds, the kernel prevents other processes from listening on the port.

In this specific case, the port was used on the client side of a connection to postgresql (likely by a psql command run during engine-setup). When a client connects to a server, the kernel allocates a so-called "ephemeral port" for it, from a range that is OS-specific, on Linux usually 32768 to 61000. Inside this range, AFAICT, ports are allocated one-by-one, so if we reached 54323, it means we had around 54323-32768=21555 connections since the (lago engine) machine started, which looks like quite a lot to me.

- How come this started only recently, and only on CI? I have no idea. Perhaps a change in CI that uses many more connections in the initial stages.

- What do we do now? Some ideas I can think of:

1. Somehow make CI not fail on this. E.g.:
   - Lower the time - /proc/sys/net/ipv4/tcp_fin_timeout
   - Change the range to not include this port - /proc/sys/net/ipv4/ip_local_port_range
   - Find what's using so many connections and lower that
   The first two both have pros and cons; not sure what's the best way.

2. Somehow lower the number of connections in engine-setup, which come mainly from running many small psql commands on small files. In principle we can combine them into a few batches. This would also help performance - by my measurements, done a few weeks ago (so recalling from memory), around 5-6 seconds less in the best case (for engine-setup).

3. Change the port used by imageio-proxy (and others):
   - As a first step, to be outside the default ephemeral port range of common OSes (or at least Linux).
   - In the longer run, register ports for all of our network-facing services (including imageio-*, vdsm, qemu (spice), etc.) at IANA.
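To illustrate the ephemeral-range point above, here is a minimal standalone sketch (not oVirt code; the port number 54323 is just the proxy port from this report) that checks whether a given port falls inside the kernel's ephemeral port range, i.e. whether the kernel may hand it out as the source port of an outgoing connection:

#!/usr/bin/python
# Illustration only: report whether a port is inside the local ephemeral range.
def in_ephemeral_range(port, path="/proc/sys/net/ipv4/ip_local_port_range"):
    with open(path) as f:
        low, high = (int(x) for x in f.read().split())
    return low <= port <= high

if __name__ == "__main__":
    # On a default Linux configuration (range 32768-60999/61000) this prints True.
    print(in_ephemeral_range(54323))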
I vote for #3.
#3 is the natural choice, but it has its costs - it will have to be handled very carefully, especially upgrades. So unlikely to happen in 4.2.
4. setsockopt(SO_REUSEADDR)
So we open a bug for OST to fix the test (setsockopt(SO_REUSEADDR) would do that, right?), and we use this bug as an RFE to register ports for all our network-facing services in the long run?
(In reply to Yedidyah Bar David from comment #16)
> 4. setsockopt(SO_REUSEADDR)

Python's builtin HTTP server (which we use) is already doing this:

class HTTPServer(SocketServer.TCPServer):

    allow_reuse_address = 1
    ...

This is used in TCPServer.server_bind to set the SO_REUSEADDR bit:

class TCPServer(BaseServer):
    ...
    def server_bind(self):
        ...
        if self.allow_reuse_address:
            self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

I also verified in the past that this SO_REUSEADDR bit is set on our socket.

I agree that #3 is the right way to fix this.

According to https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.txt:

8009-8018 Unassigned

I suggest to register:

vdsm: 8010
ovirt-imageio-daemon: 8011
ovirt-imageio-proxy: 8012

Changing the vdsm port can be a big issue, breaking migration between new and old machines, but we can listen on both 8010 and 54321, passing connections to the actual server.

Changing the daemon port is an issue because the engine assumes the daemon port. We can solve it in the same way by listening on both ports, or publish the daemon port in vdsm getCapabilities.

Changing the proxy port is easier because it is used internally from the engine.

All changes may require changes to firewalls in the user environment; this is the real issue.

Dan, Piotr, Francesco, what do you think?
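For reference, the stdlib behavior quoted above can be confirmed with a small standalone sketch (not oVirt code; it uses Python 3's http.server, while the proxy in the logs runs the Python 2 equivalents, which set allow_reuse_address the same way):

# Sketch: confirm that Python's stdlib HTTPServer sets SO_REUSEADDR
# on its listening socket, as quoted above.
import socket
from http.server import HTTPServer, BaseHTTPRequestHandler

srv = HTTPServer(("127.0.0.1", 0), BaseHTTPRequestHandler)  # port 0: any free port
reuse = srv.socket.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
print("SO_REUSEADDR set:", bool(reuse))  # expected: True
srv.server_close()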
(In reply to Nir Soffer from comment #18)
> I suggest to register:
>
> vdsm: 8010
> ovirt-imageio-daemon: 8011
> ovirt-imageio-proxy: 8012
>
> Changing the vdsm port can be a big issue, breaking migration between new and old
> machines, but we can listen on both 8010 and 54321, passing connections to the
> actual server.

We introduced the protocol detector because we did not want to change the vdsm port number. VM migration uses config.py to get the port value, which means that both vdsms need to use the same port number. We made it "configurable", but changing it would break many things. Exposing a second port could help to solve it, but I am not sure when we will be able to stop using the current port number.

> Changing the daemon port is an issue because the engine assumes the daemon port. We can
> solve it in the same way by listening on both ports, or publish the daemon port
> in vdsm getCapabilities.
>
> Changing the proxy port is easier because it is used internally from the engine.

+1

> All changes may require changes to firewalls in the user environment; this is
> the real issue.
>
> Dan, Piotr, Francesco, what do you think?
Restoring needinfos
Patch does not work, moving back to NEW.

Verified that this works:

echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle

But it's considered risky, judging by the kernel documentation for it. Perhaps it's still acceptable for use in CI.

Keeping needinfos.
BTW, the TIME-WAIT ports in at least [1] look weird to me, and I could not reproduce them easily. Compare for example [2] (from the same build).

For 'connect-from-random-source-port', the kernel tries to use even port numbers from within the range. It keeps the odd numbers for 'listen-on-random-port'. It's easy to connect from a specific port, so it's easy to cause a "reproduction" of the current bug by:

======================================================
#!/usr/bin/python
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('0.0.0.0', 54323))
s.connect(('127.0.0.1', 80))
s.close()
======================================================

However, I tried to cause it to happen "at random" by running a loop, and it seems like the kernel tries quite hard to not use odd numbers. So there is likely something a bit weird going on there.

[1] http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2742/artifact/exported-artifacts/upgrade-from-release-suite-master__logs/lago_logs/lago.log
[2] http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2742/artifact/exported-artifacts/basic-suite-master__logs/lago_logs/lago.log
(In reply to Yedidyah Bar David from comment #22)
> For 'connect-from-random-source-port', the kernel tries to use even port
> numbers from within the range. It keeps the odd numbers for
> 'listen-on-random-port'.

This means the ovirt-imageio-daemon port (54322) is even worse :-)

I guess we don't have any issue in the field yet because vdsm starts the daemon very early during boot, and nobody restarts it, so the chance of another process using this port is very low.
The long-term solution should indeed be changing the ports of imageio-proxy, imageio-daemon and vdsm, as the issue affects all of these modules. Since it's not feasible for 4.2 (requires port registration in IANA [1], backwards compatibility issues, etc.), pushed a patch to OST for reserving the port before starting the engine.

[1] https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.txt
(In reply to Daniel Erez from comment #24)
> The long-term solution should indeed be changing the ports of imageio-proxy,
> imageio-daemon and vdsm, as the issue affects all of these modules. Since
> it's not feasible for 4.2 (requires port registration in IANA [1], backwards
> compatibility issues, etc.), pushed a patch to OST for reserving the port
> before starting the engine.

Thanks.

Since we already got CI failures on the ovirt-imageio-daemon, with the same error message, likely for very similar reasons, perhaps patch OST to also reserve its port.
(In reply to Yedidyah Bar David from comment #25)
> (In reply to Daniel Erez from comment #24)
> > The long-term solution should indeed be changing the ports of imageio-proxy,
> > imageio-daemon and vdsm, as the issue affects all of these modules. Since
> > it's not feasible for 4.2 (requires port registration in IANA [1], backwards
> > compatibility issues, etc.), pushed a patch to OST for reserving the port
> > before starting the engine.
>
> Thanks.
>
> Since we already got CI failures on the ovirt-imageio-daemon, with the same
> error message, likely for very similar reasons, perhaps patch OST to also
> reserve its port.

Sure. Is the port also used by postgres? Because I'm not sure whether I can put it in the same script (setup_engine.sh), or somewhere before OST starts vdsm. Do you have an example of a failure log?
(In reply to Daniel Erez from comment #26)
> (In reply to Yedidyah Bar David from comment #25)
> > (In reply to Daniel Erez from comment #24)
> > > The long-term solution should indeed be changing the ports of imageio-proxy,
> > > imageio-daemon and vdsm, as the issue affects all of these modules. Since
> > > it's not feasible for 4.2 (requires port registration in IANA [1], backwards
> > > compatibility issues, etc.), pushed a patch to OST for reserving the port
> > > before starting the engine.
> >
> > Thanks.
> >
> > Since we already got CI failures on the ovirt-imageio-daemon, with the same
> > error message, likely for very similar reasons, perhaps patch OST to also
> > reserve its port.
>
> Sure. Is the port also used by postgres? Because I'm not sure whether I can put
> it in the same script (setup_engine.sh), or somewhere before OST starts vdsm.
> Do you have an example of a failure log?

The attached patch doesn't fix anything in imageio.

Please either move this bug to OST (where the fix actually is!), or, if you intend to have a longer-term solution, clone it to track that.
(In reply to Daniel Erez from comment #26)
> Sure. Is the port also used by postgres?

It's not used by pg, and it was also not used by pg in the proxy's case. In the specific failure we diagnosed, it was used by psql as a random source port, which is _random_ - so anything could use it, including e.g. a wget/curl or whatever we run during OST.

> Because I'm not sure whether I can put it in the same script (setup_engine.sh),
> or somewhere before OST starts vdsm.

I guess you need the latter, but I'm not sure where exactly either.

> Do you have an example of a failure log?

http://lists.ovirt.org/pipermail/devel/2017-November/031821.html
(In reply to Allon Mureinik from comment #27)
> The attached patch doesn't fix anything in imageio.
> Please either move this bug to OST (where the fix actually is!),

The bug is in imageio, and the patch to OST is not a fix but a mere workaround. Real users can in principle run into this bug as well, even though it's not very likely.

> or, if you
> intend to have a longer-term solution, clone it to track that.

Not sure we need a bug on OST for this; perhaps 'Related-To' is enough to give context there. As you wish.
Sandro, please target, I don't know which version it should be in
(In reply to Tal Nisan from comment #30)
> Sandro, please target, I don't know which version it should be in

This is ovirt-system-tests; as far as I know, it doesn't follow any release schema. Just check whether it's working in the currently running jenkins OST jobs.
This is a random failure in OST. Nobody has reported any issue for more than a year, so the patches mentioned in this bug probably fixed this issue.
Reopening since we have new reports of this issue.
Moving to ovirt-engine, since the proxy is configured by ovirt-engine and ovirt-engine is responsible for making the proxy port available.
Errors seen recently in OST in Michal's setup (dev-09.rhev.lab.eng.brq):

$ cd ost/ovirt-system-tests/test_logs/basic-suite-master/post-001_initialize_engine_pytest.py
$ grep -rn 'OSError: \[Errno 98\] Address already in use' .
./lago-basic-suite-master-engine/_var_log/ovirt-imageio/daemon.log:16:OSError: [Errno 98] Address already in use
./lago-basic-suite-master-engine/_var_log/ovirt-imageio/daemon.log:32:OSError: [Errno 98] Address already in use
./lago-basic-suite-master-engine/_var_log/ovirt-imageio/daemon.log:48:OSError: [Errno 98] Address already in use
./lago-basic-suite-master-engine/_var_log/ovirt-imageio/daemon.log:64:OSError: [Errno 98] Address already in use
./lago-basic-suite-master-engine/_var_log/ovirt-imageio/daemon.log:80:OSError: [Errno 98] Address already in use

2020-05-14 20:38:22,737 ERROR (MainThread) [server] Server failed
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 37, in main
    server = Server(cfg)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 83, in __init__
    self.control_service = services.ControlService(self.config, self.auth)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/services.py", line 148, in __init__
    self._server = http.Server(("localhost", port), http.Connection)
  File "/usr/lib64/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib64/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib64/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

2020-05-14 20:38:23,091 ERROR (MainThread) [server] Server failed
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 37, in main
    server = Server(cfg)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 83, in __init__
    self.control_service = services.ControlService(self.config, self.auth)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/services.py", line 148, in __init__
    self._server = http.Server(("localhost", port), http.Connection)
  File "/usr/lib64/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib64/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib64/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

2020-05-14 20:38:23,492 ERROR (MainThread) [server] Server failed
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 37, in main
    server = Server(cfg)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 83, in __init__
    self.control_service = services.ControlService(self.config, self.auth)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/services.py", line 148, in __init__
    self._server = http.Server(("localhost", port), http.Connection)
  File "/usr/lib64/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib64/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib64/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

2020-05-14 20:38:23,737 ERROR (MainThread) [server] Server failed
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 37, in main
    server = Server(cfg)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 83, in __init__
    self.control_service = services.ControlService(self.config, self.auth)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/services.py", line 148, in __init__
    self._server = http.Server(("localhost", port), http.Connection)
  File "/usr/lib64/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib64/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib64/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

2020-05-14 20:38:24,114 ERROR (MainThread) [server] Server failed
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 37, in main
    server = Server(cfg)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/server.py", line 83, in __init__
    self.control_service = services.ControlService(self.config, self.auth)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/services.py", line 148, in __init__
    self._server = http.Server(("localhost", port), http.Connection)
  File "/usr/lib64/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib64/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib64/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

In all cases the failure is on the control port (54324).

We know that even port numbers are more problematic, as they are used by the kernel as the source port for connected sockets. This also explains why we see this error now - ovirt-imageio 2.0.5 listens on 2 ports: 54323 and 54324. In the past we listened only on 54323.

We don't know why we had this issue in the past, when we used only port 54323, nor why the issue disappeared.

A minimal fix is to change the control port to 54325. This does not ensure that the port will be available, but makes it less likely to be used.

A more robust fix is to change the port to 64323, which is not in the dynamic port range on Linux (32768-61000). But an admin can change the dynamic port range using sysctl.

The best solution is to reserve the imageio ports, ensuring that both ports are always available. This is what we do in vdsm:
https://github.com/oVirt/vdsm/blob/6f5df19a96d971b92e4886330c1b856085022352/static/etc/sysctl.d/vdsm.conf#L16
Michal, since we reproduced this issue only on your setup, can you try this build and report if the issue is resolved? https://jenkins.ovirt.org/job/ovirt-engine_standard-check-patch/5884/
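As background on the "reserve the imageio ports" option discussed above: on Linux, the net.ipv4.ip_local_reserved_ports sysctl keeps the listed ports out of the ephemeral allocation pool, which is presumably what the vdsm sysctl.d file linked above relies on. A minimal, illustrative Python sketch of that idea (not the actual vdsm or OST code; doing this from Python rather than a sysctl.d drop-in, and the exact port list, are assumptions for the example; requires root):

# Illustration only: add the imageio ports (54323/54324, as discussed above)
# to the kernel's reserved list so they are never picked as ephemeral
# source ports. vdsm achieves the equivalent via sysctl configuration.
RESERVED = "/proc/sys/net/ipv4/ip_local_reserved_ports"
PORTS = {54323, 54324}

with open(RESERVED) as f:
    current = {p for p in f.read().strip().split(",") if p}

current |= {str(p) for p in PORTS}

with open(RESERVED, "w") as f:
    # entries may be single ports or ranges like "1000-1010"
    f.write(",".join(sorted(current, key=lambda p: int(p.split("-")[0]))))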
Moving out to 4.4.1, as this was not approved as a blocker for 4.4.0 and we are preparing for GA.
Nir, we do not have any way of verifying this bug, as QE is not using OST and this was also seen only in a single setup. Please provide a verification scenario for QE to verify, or close this bug.
Testing has indicated this request is declined. You may appeal this decision by reopening this request.
Changing to CURRENTRELEASE since this is fixed in ovirt-imageio 2.0.