Bug 1283364

Summary:	utils.run_worker_threads can hang if a consumer thread crashes
Product:	[Fedora] Fedora	Reporter:	Miloslav Trmač <mitr>
Component:	sigul	Assignee:	Patrick Uiterwijk <puiterwijk>
Status:	CLOSED CANTFIX	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	23	CC:	john, kevin, mitr, nmavrogi, opensource
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-10-03 08:01:30 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Miloslav Trmač 2015-11-18 19:55:37 UTC

Version-Release number of selected component (if applicable):
sigul-0.101

* An exception happens in a thread consuming items from a queue.
* utils.WorkerThread’s finally: clause puts an EOF marker into all of the thread’s output queues. Thus, consumers of these queues will terminate.
* The thread exits
* Unlike UNIX pipes, which kill producers with SIGPIPE when the consumers die, the thread-safe queue doesn’t do that. So, the thread producing work items into the queue continues to read requests from its input put them into the crashed thread’s input queue
* … and the queue fills up, and the thread blocks.
* utils.run_worker_threads is waiting for thread exit (necessarily) in the producer→consumer order, so it will will patiently wait for the producer thread to exit… forever.

Because utils.WorkerThread knows about all the involved queues, a probable fix would be:
* Wrap Queue.Queue in an object which carries an additional “consumer exited” flag
* Set the flag during the cleanup in utils.WorkerThread.run
* Have WrappedQueue.put() instead do
> while not queue.consumer_exited:
>    queue.underlying_put(item, True, increasing_timeout)
essentially simulating the SIGPIPE behavior.

Comment 1 Miloslav Trmač 2015-11-26 15:59:16 UTC

Hum, not so easy. The above will work on the server and for the receiving half of the bridge, but if the sending half of the bridge fails, we end up with the server->bridge socket buffers full, server blocked and unable to send more replies, and receiving half of the bridge blocked on the server.  (And for TCP, having buffers full and the receive window set to zero is not a reason to time out, as long as the blocked recipient is still sending zero-window ACKs.)

IOW, adding SIGPIPE-like behavior to in-process queues does not add SIGPIPE-like behavior to the server->bridge half of that TCP connection.

Comment 2 Fedora Update System 2015-11-26 21:02:00 UTC

sigul-0.102-1.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2015-db7d77b801

Comment 3 Fedora Update System 2015-11-27 22:21:33 UTC

sigul-0.102-1.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'dnf --enablerepo=updates-testing update sigul'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-db7d77b801

Comment 4 Fedora Update System 2015-11-28 02:20:37 UTC

sigul-0.102-1.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'dnf --enablerepo=updates-testing update sigul'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-90039a8cc5

Comment 5 Till Maas 2015-11-28 08:26:24 UTC

FYI: Both bridges showed this yesterday after trying batch mode to sign rawhide:

2015-11-27 17:57:36,641 DEBUG: sign-rpms:send requests: Started handling 'id' = '\x00\x00\x03\xe7', 'rpm-arch' = 'noarch', 'rpm-epoch' = '5', 'rpm-name' = 'texlive-gfscomplutum', 'rpm-release' = '15.fc24', 'rpm-sigmd5' = '\xb0@\xe82.\xf2\x83GX\x88\x82\xde\xb1{\xea\xbb', 'rpm-version' = 'svn19469.1.0'
2015-11-27 17:57:36,688 DEBUG: sign-rpms:send requests finished, exc_info: None
2015-11-27 17:57:36,689 DEBUG: Sending final EOFs to sign-rpms:read replies...
2015-11-27 17:57:36,689 DEBUG: Waiting for sign-rpms:read replies...
2015-11-27 18:08:53,456 ERROR: sign-rpms:send replies: I/O error: NSPR connection reset
2015-11-27 18:08:56,265 DEBUG: sign-rpms:koji replies: Sending queue EOF failed, queue already orphaned
2015-11-27 18:08:56,265 INFO: sign-rpms:koji replies: Writing to work queue failed, queue orphaned
2015-11-27 18:08:58,546 DEBUG: sign-rpms:read replies: Exception, shutting down write side as well
2015-11-27 18:08:58,546 DEBUG: sign-rpms:read replies: Sending queue EOF failed, queue already orphaned
2015-11-27 18:08:58,546 INFO: sign-rpms:read replies: Writing to work queue failed, queue orphaned
2015-11-27 18:08:58,547 DEBUG: sign-rpms:read replies finished, exc_info: (<class 'utils.WorkerQueueOrphanedError'>, WorkerQueueOrphanedError('Work queue orphaned by consumer',), <traceback object at 0x7fa2742cf908>)
2015-11-27 18:08:58,547 DEBUG: Sending final EOFs to sign-rpms:koji replies...
2015-11-27 18:08:58,547 DEBUG: Waiting for sign-rpms:koji replies...
2015-11-27 18:08:58,547 DEBUG: sign-rpms:koji replies finished, exc_info: (<class 'utils.WorkerQueueOrphanedError'>, WorkerQueueOrphanedError('Work queue orphaned by consumer',), <traceback object at 0x7fa2742f2e60>)
2015-11-27 18:08:58,547 DEBUG: Sending final EOFs to sign-rpms:send replies...
2015-11-27 18:08:58,547 DEBUG: Waiting for sign-rpms:send replies...
2015-11-27 18:08:58,547 DEBUG: sign-rpms:send replies finished, exc_info: (<type 'nss.error.NSPRError'>, NSPRError(), <traceback object at 0x7fa27c1dfea8>)
2015-11-27 18:08:58,652 INFO: NSPR I/O error: (PR_HOST_UNREACHABLE_ERROR) Host is unreachable.
Traceback (most recent call last):
  File "/usr/share/sigul/bridge.py", line 1241, in bridge_one_request
    server_sock)
  File "/usr/share/sigul/bridge.py", line 1092, in handle_connection
    conn.__terminate_client_connection()
  File "/usr/share/sigul/bridge.py", line 1194, in __terminate_client_connection
    self.client_buf.send_outer_eof()
  File "/usr/share/sigul/double_tls.py", line 890, in send_outer_eof
    self.__socket.send(utils.u32_pack(0))
NSPRError: (PR_HOST_UNREACHABLE_ERROR) Host is unreachable.


Only the client was still version 0.101, I will try now with the client updated to 0.102

Comment 6 Till Maas 2015-11-28 08:57:42 UTC

with client updated to 0.102, I got this error:

2015-11-28 08:43:39,075 INFO: Subrequest: 'id' = '\x00\x00\x03\x87', 'rpm-arch' = 'noarch', 'rpm-epoch' = '', 'rpm-name' = 'unbound-munin', 'rpm-release' = '1.fc24', 'rpm-sigmd5' = '\xd8\xa6x\xe9(\xec?6\xe6\xcb\xda\x9a\x8d\xf9+\xdf', 'rpm-version' = '1.5.6'
2015-11-28 08:52:27,341 ERROR: sign-rpms:send replies: I/O error: NSPR connection reset
2015-11-28 08:52:41,424 DEBUG: sign-rpms:koji replies: Sending queue EOF failed, queue already orphaned
2015-11-28 08:52:41,424 INFO: sign-rpms:koji replies: Writing to work queue failed, queue orphaned
2015-11-28 08:52:42,362 DEBUG: sign-rpms:read replies: Exception, shutting down write side as well
2015-11-28 08:52:42,362 DEBUG: sign-rpms:read replies: Sending queue EOF failed, queue already orphaned
2015-11-28 08:52:42,362 INFO: sign-rpms:read replies: Writing to work queue failed, queue orphaned
2015-11-28 08:52:42,363 ERROR: sign-rpms:send requests: I/O error: NSPR connection reset
2015-11-28 08:52:56,447 DEBUG: sign-rpms:koji requests: Sending queue EOF failed, queue already orphaned
2015-11-28 08:52:56,447 INFO: sign-rpms:koji requests: Writing to work queue failed, queue orphaned
2015-11-28 08:53:12,532 DEBUG: sign-rpms:read requests: Sending queue EOF failed, queue already orphaned
2015-11-28 08:53:12,533 INFO: sign-rpms:read requests: Writing to work queue failed, queue orphaned
2015-11-28 08:53:12,533 DEBUG: sign-rpms:read requests finished, exc_info: (<class 'utils.WorkerQueueOrphanedError'>, WorkerQueueOrphanedError('Work queue orphaned by consumer',), <traceback object at 0x7fb7c0645cf8>)
2015-11-28 08:53:12,534 DEBUG: Sending final EOFs to sign-rpms:koji requests...
2015-11-28 08:53:12,534 DEBUG: Waiting for sign-rpms:koji requests...
2015-11-28 08:53:12,534 DEBUG: sign-rpms:koji requests finished, exc_info: (<class 'utils.WorkerQueueOrphanedError'>, WorkerQueueOrphanedError('Work queue orphaned by consumer',), <traceback object at 0x7fb7c015e6c8>)
2015-11-28 08:53:12,535 DEBUG: Sending final EOFs to sign-rpms:send requests...
2015-11-28 08:53:12,535 DEBUG: Waiting for sign-rpms:send requests...
2015-11-28 08:53:12,535 DEBUG: sign-rpms:send requests finished, exc_info: (<type 'nss.error.NSPRError'>, NSPRError(), <traceback object at 0x7fb7c05ba6c8>)
2015-11-28 08:53:12,535 DEBUG: Sending final EOFs to sign-rpms:read replies...
2015-11-28 08:53:12,535 DEBUG: Waiting for sign-rpms:read replies...
2015-11-28 08:53:12,535 DEBUG: sign-rpms:read replies finished, exc_info: (<class 'utils.WorkerQueueOrphanedError'>, WorkerQueueOrphanedError('Work queue orphaned by consumer',), <traceback object at 0x7fb7c056fa28>)
2015-11-28 08:53:12,536 DEBUG: Sending final EOFs to sign-rpms:koji replies...
2015-11-28 08:53:12,536 DEBUG: Waiting for sign-rpms:koji replies...
2015-11-28 08:53:12,536 DEBUG: sign-rpms:koji replies finished, exc_info: (<class 'utils.WorkerQueueOrphanedError'>, WorkerQueueOrphanedError('Work queue orphaned by consumer',), <traceback object at 0x7fb7bc10f0e0>)
2015-11-28 08:53:12,536 DEBUG: Sending final EOFs to sign-rpms:send replies...
2015-11-28 08:53:12,536 DEBUG: Waiting for sign-rpms:send replies...
2015-11-28 08:53:12,537 DEBUG: sign-rpms:send replies finished, exc_info: (<type 'nss.error.NSPRError'>, NSPRError(), <traceback object at 0x7fb7c06947a0>)
2015-11-28 08:53:12,909 INFO: NSPR I/O error: (PR_HOST_UNREACHABLE_ERROR) Host is unreachable.
Traceback (most recent call last):
  File "/usr/share/sigul/bridge.py", line 1241, in bridge_one_request
    server_sock)
  File "/usr/share/sigul/bridge.py", line 1092, in handle_connection
    conn.__terminate_client_connection()
  File "/usr/share/sigul/bridge.py", line 1194, in __terminate_client_connection
    self.client_buf.send_outer_eof()
  File "/usr/share/sigul/double_tls.py", line 890, in send_outer_eof
    self.__socket.send(utils.u32_pack(0))
NSPRError: (PR_HOST_UNREACHABLE_ERROR) Host is unreachable.
2015-11-28 08:53:12,909 DEBUG: Request handling finished
2015-11-28 08:53:12,909 DEBUG: Waiting for the server to connect
2015-11-28 08:54:29,485 DEBUG: Waiting for the client to connect

Comment 7 Nikos Mavrogiannopoulos 2015-11-30 08:43:33 UTC

As far as I understand this issue was about preventing the deadlock occuring in similar exceptions (i.e., disconnection of the bridge from koji or vault). After this debug log, is the bridge operational again or there is still a deadlock? What is the status of the vault? What happened and caused the host unreachable error?

Comment 8 Miloslav Trmač 2015-11-30 11:27:09 UTC

First of all, Nikos is absolutely right: if a network error causes the current operation to terminate, and the bridge+server to be able to handle a new connection, this bug has been fixed.  What causes the network error would be another item to investigate.

(In reply to Till Maas from comment #5)
> 2015-11-27 17:57:36,688 DEBUG: sign-rpms:send requests finished, exc_info:
> None
> 2015-11-27 17:57:36,689 DEBUG: Waiting for sign-rpms:read replies...

All RPMs have been sent to the vault for signing.

> 2015-11-27 18:08:53,456 ERROR: sign-rpms:send replies: I/O error: NSPR
> connection reset

This is the root cause, when sending data from bridge to the client. The kernel reported one of ECONNRESET, EPIPE, or POLLHUP.

> 2015-11-27 18:08:56,265 DEBUG: sign-rpms:koji replies: Sending queue EOF
> failed, queue already orphaned
> 2015-11-27 18:08:56,265 INFO: sign-rpms:koji replies: Writing to work queue
> failed, queue orphaned
> 2015-11-27 18:08:58,546 DEBUG: sign-rpms:read replies: Exception, shutting
> down write side as well
> 2015-11-27 18:08:58,546 DEBUG: sign-rpms:read replies: Sending queue EOF
> failed, queue already orphaned
> 2015-11-27 18:08:58,546 INFO: sign-rpms:read replies: Writing to work queue
> failed, queue orphaned
> 2015-11-27 18:08:58,547 DEBUG: sign-rpms:read replies finished, exc_info:
> (<class 'utils.WorkerQueueOrphanedError'>, WorkerQueueOrphanedError('Work
> queue orphaned by consumer',), <traceback object at 0x7fa2742cf908>)
> 2015-11-27 18:08:58,547 DEBUG: Sending final EOFs to sign-rpms:koji
> replies...
> 2015-11-27 18:08:58,547 DEBUG: Waiting for sign-rpms:koji replies...
> 2015-11-27 18:08:58,547 DEBUG: sign-rpms:koji replies finished, exc_info:
> (<class 'utils.WorkerQueueOrphanedError'>, WorkerQueueOrphanedError('Work
> queue orphaned by consumer',), <traceback object at 0x7fa2742f2e60>)
> 2015-11-27 18:08:58,547 DEBUG: Sending final EOFs to sign-rpms:send
> replies...
> 2015-11-27 18:08:58,547 DEBUG: Waiting for sign-rpms:send replies...
> 2015-11-27 18:08:58,547 DEBUG: sign-rpms:send replies finished, exc_info:
> (<type 'nss.error.NSPRError'>, NSPRError(), <traceback object at
> 0x7fa27c1dfea8>)

Expected sequence of other threads terminating as the error condition propagates. (They are logged to make sure that no thread goes away without leaving a trace.)

> 2015-11-27 18:08:58,652 INFO: NSPR I/O error: (PR_HOST_UNREACHABLE_ERROR)
> Host is unreachable.
> Traceback (most recent call last):
>   File "/usr/share/sigul/bridge.py", line 1241, in bridge_one_request
>     server_sock)
>   File "/usr/share/sigul/bridge.py", line 1092, in handle_connection
>     conn.__terminate_client_connection()
>   File "/usr/share/sigul/bridge.py", line 1194, in
> __terminate_client_connection
>     self.client_buf.send_outer_eof()
>   File "/usr/share/sigul/double_tls.py", line 890, in send_outer_eof
>     self.__socket.send(utils.u32_pack(0))
> NSPRError: (PR_HOST_UNREACHABLE_ERROR) Host is unreachable.

All threads have terminated, the processing of the request is cleaning up, sending a final “end of stream” marker from the bridge to the client (using the same socket as above) failed again, this time with EHOSTUNREACH or EHOSTDOWN.

The backtrace could be cleaned up, OTOH at the moment it is useful to show what exactly has failed.

Comment 9 Miloslav Trmač 2015-11-30 11:32:24 UTC

(In reply to Till Maas from comment #6)
> with client updated to 0.102, I got this error:

> 2015-11-28 08:52:27,341 ERROR: sign-rpms:send replies: I/O error: NSPR
> connection reset
Same root cause.
<snip>
> 2015-11-28 08:52:42,362 DEBUG: sign-rpms:read replies: Exception, shutting
> down write side as well
As part of the shut down sequence, this ensures that the bridge terminates (comment #1 above) and…
> 2015-11-28 08:52:42,363 ERROR: sign-rpms:send requests: I/O error: NSPR
> connection reset
… this is how “shutting down write side” manifests. This is an expected error log in this situation, not an indication of another problem. Improving the reporting would be slightly awkward but probably possible.
<snip>

> 2015-11-28 08:53:12,909 INFO: NSPR I/O error: (PR_HOST_UNREACHABLE_ERROR)
> Host is unreachable.
> Traceback (most recent call last):
>   File "/usr/share/sigul/bridge.py", line 1241, in bridge_one_request
>     server_sock)
>   File "/usr/share/sigul/bridge.py", line 1092, in handle_connection
>     conn.__terminate_client_connection()
>   File "/usr/share/sigul/bridge.py", line 1194, in
> __terminate_client_connection
>     self.client_buf.send_outer_eof()
>   File "/usr/share/sigul/double_tls.py", line 890, in send_outer_eof
>     self.__socket.send(utils.u32_pack(0))
> NSPRError: (PR_HOST_UNREACHABLE_ERROR) Host is unreachable.
Same EHOSTUNREACH as above.

> 2015-11-28 08:53:12,909 DEBUG: Request handling finished
> 2015-11-28 08:53:12,909 DEBUG: Waiting for the server to connect
> 2015-11-28 08:54:29,485 DEBUG: Waiting for the client to connect

The vauld and the bridge should be ready to accept another request now.

Comment 10 Till Maas 2015-11-30 17:19:36 UTC

After the errors I added in comment:5 and comment:6, the bridge still works. However, the client does not terminate. Just wondering, might the client close a/the connection to the bridge too early causing the EHOSTUNREACH error? I did never see any connection problems between the host I ran sigul on and the bridges. I tested from a host in fedora infrastructure (autosign01).

Comment 11 Miloslav Trmač 2015-12-01 18:22:05 UTC

(In reply to Till Maas from comment #10)
> After the errors I added in comment:5 and comment:6, the bridge still works.
> However, the client does not terminate.

At this stage, the client will not be sending any more data, and it is just waiting for oncoming replies. The kernel doesn’t inherently know how long to wait so the default timeout, if there even is any, is pretty large.

> Just wondering, might the client
> close a/the connection to the bridge too early causing the EHOSTUNREACH
> error?

If the two were still able to communicate, and the client closed the connection, the bridge would be receiving RST packets when trying to send more data. That could explain the ECONNRESET but not EHOSTUNREACH. (EHOSTUNREACH is IIRC the error you typically get when ARP lookups on a local network fail.)

One theory that has come up is that there might be a stateful firewall somewhere in between (in the network, or on either of the hosts), which forcibly terminates even an active connection after a long time (an hour or more?), or perhaps terminates a connection which is already half-closed and only carrying data in one direction.

… Actually the default timeouts in http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/net/netfilter/nf_conntrack_proto_tcp.c#n71 for half-open connections (FIN_WAIT, CLOSE_WAIT) seem pretty short. No idea whether this is relevant but it might be something to look into.

Comment 12 Till Maas 2015-12-02 20:50:31 UTC

(In reply to Miloslav Trmač from comment #11)

> One theory that has come up is that there might be a stateful firewall
> somewhere in between (in the network, or on either of the hosts), which
> forcibly terminates even an active connection after a long time (an hour or
> more?), or perhaps terminates a connection which is already half-closed and
> only carrying data in one direction.
> 
> … Actually the default timeouts in
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/net/
> netfilter/nf_conntrack_proto_tcp.c#n71 for half-open connections (FIN_WAIT,
> CLOSE_WAIT) seem pretty short. No idea whether this is relevant but it might
> be something to look into.

Yes, this seems to be a problem. The last time I noticed this, there was a 5 minute delay in the bridge log:

Here is a timeline, I hope it makes sense to you. Lines like "13966 22:23:32" are about a tcpdump, with the first column being the frame ID. Here it can be seen that there is no data for about 7 minutes send from the bridge to the autosigner after the autosigner sent a FIN packet (entering the FIN_WAIT state afaiu). However it is also unclear why the bridge delays it operations for 5 minutes - there are no log lines left out by me. Also I noticed that it takes a while for the bridge to notice that the communication to the autosigner is broken, the first ICMP host prohibited is sent 16 minutes before the bridge reports the errors and gets ready to process the next requests. 



13966 22:23:32 SYN from 56973

2015-12-01 22:23:36,371 INFO: Client with CN 'till' connected

15076 22:23:44 FIN from 56973 to 44334
17162 22:29:09 to 56973

2015-12-01 22:29:12,861 DEBUG: sign-rpms:send replies: Started handling 'id' = '\x00\x00\x01\xe3', 'status' = '\x00\x00\x00\x00'

!!! 5 minute delay

2015-12-01 22:34:09,266 DEBUG: sign-rpms:send requests: Started handling 'id' = '\x00\x00\x01\xe5', 'rpm-arch' = 'noarch', 'rpm-epoch' = '', 'rpm-name' = 'python3-magic', 'rpm-release' = '3.fc24', 'rpm-sigmd5' = 'F\xad*\xaeE\xb0\x85\x93\xcfc\xf8\x94\x01\xac\x8e\xf4', 'rpm-version' = '5.25'

17164 22:35:59 to 56973

2015-12-01 22:36:03,316 DEBUG: sign-rpms:send replies: Started handling 'id' = '\x00\x00\x01\xe5', 'status' = '\x00\x00\x00\x00'

17165 22:35:59 first icmp host prohibited sent to bridge regarding 56973tcp

2015-12-01 22:39:17,109 DEBUG: sign-rpms:send requests: Started handling 'id' = '\x00\x00\x03\xe7', 'rpm-arch' = 'noarch', 'rpm-epoch' = '5', 'rpm-name' = 'texlive-guitlogo-doc', 'rpm-release' = '16.fc24', 'rpm-sigmd5' = '#\xb9\xfb%L@\xeb\x1e\xd4B\x10dJ\xd1,)', 'rpm-version' = 'svn27458.0.9.2'
2015-12-01 22:39:17,182 DEBUG: sign-rpms:send requests finished, exc_info: None
2015-12-01 22:39:17,183 DEBUG: Sending final EOFs to sign-rpms:read replies...
2015-12-01 22:39:17,183 DEBUG: Waiting for sign-rpms:read replies...

!!! 12 minute delay

2015-12-01 22:51:30,790 ERROR: sign-rpms:send replies: I/O error: NSPR connection reset
2015-12-01 22:51:32,953 DEBUG: sign-rpms:koji replies: Sending queue EOF failed, queue already orphaned
2015-12-01 22:51:32,954 INFO: sign-rpms:koji replies: Writing to work queue failed, queue orphaned
2015-12-01 22:51:34,351 DEBUG: sign-rpms:read replies: Exception, shutting down write side as well
2015-12-01 22:51:34,352 DEBUG: sign-rpms:read replies: Sending queue EOF failed, queue already orphaned
2015-12-01 22:51:34,352 INFO: sign-rpms:read replies: Writing to work queue failed, queue orphaned
2015-12-01 22:51:34,353 DEBUG: sign-rpms:read replies finished, exc_info: (<class 'utils.WorkerQueueOrphanedError'>, WorkerQueueOrphanedError('Work queue orphaned by consumer',), <traceback object at 0x7f3fe82a3ab8>)
2015-12-01 22:51:34,353 DEBUG: Sending final EOFs to sign-rpms:koji replies...
2015-12-01 22:51:34,353 DEBUG: Waiting for sign-rpms:koji replies...
2015-12-01 22:51:34,354 DEBUG: sign-rpms:koji replies finished, exc_info: (<class 'utils.WorkerQueueOrphanedError'>, WorkerQueueOrphanedError('Work queue orphaned by consumer',), <traceback object at 0x7f3fe8136f38>)
2015-12-01 22:51:34,354 DEBUG: Sending final EOFs to sign-rpms:send replies...
2015-12-01 22:51:34,354 DEBUG: Waiting for sign-rpms:send replies...
2015-12-01 22:51:34,354 DEBUG: sign-rpms:send replies finished, exc_info: (<type 'nss.error.NSPRError'>, NSPRError(), <traceback object at 0x7f3fe8247e18>)
2015-12-01 22:51:34,517 INFO: NSPR I/O error: (PR_HOST_UNREACHABLE_ERROR) Host is unreachable.
Traceback (most recent call last):
  File "/usr/share/sigul/bridge.py", line 1241, in bridge_one_request
    server_sock)
  File "/usr/share/sigul/bridge.py", line 1092, in handle_connection
    conn.__terminate_client_connection()
  File "/usr/share/sigul/bridge.py", line 1194, in __terminate_client_connection
    self.client_buf.send_outer_eof()
  File "/usr/share/sigul/double_tls.py", line 890, in send_outer_eof
    self.__socket.send(utils.u32_pack(0))
NSPRError: (PR_HOST_UNREACHABLE_ERROR) Host is unreachable.
2015-12-01 22:51:34,569 DEBUG: Request handling finished
2015-12-01 22:51:34,570 DEBUG: Waiting for the server to connect
2015-12-01 22:51:38,023 DEBUG: Waiting for the client to connect

Comment 13 Kevin Fenzi 2015-12-02 23:40:25 UTC

We could try ruling out contrack issues by exempting sigul ports from conntrack?

Or just drop firewalls for a test run?

Comment 14 Miloslav Trmač 2015-12-04 19:57:43 UTC

(In reply to Till Maas from comment #12)
> However it is also unclear why the bridge delays
> it operations for 5 minutes - there are no log lines left out by me.

The batch of 1000 RPMs, with debugging on the bridge apparently enabled, really only results in these few lines? Or do you mean “left out around the tcpdump lines” or something similar perhaps?

> Also I
> noticed that it takes a while for the bridge to notice that the
> communication to the autosigner is broken, the first ICMP host prohibited is
> sent 16 minutes before the bridge reports the errors and gets ready to
> process the next requests. 

I have no idea how Linux TCP reacts to host prohibited error on a previously established connection; I wouldn’t be too surprised if the behavior were surprising ☺


> 13966 22:23:32 SYN from 56973
> 
> 2015-12-01 22:23:36,371 INFO: Client with CN 'till' connected
> 
> 15076 22:23:44 FIN from 56973 to 44334

The list of 1000 RPMs was sent in 12 seconds, OK, plausible. (This is with NEVRAs referring to Koji, not to local file names?  I am bit surprised that the client was able to get the metadata in about 12 seconds, i.e. about 12 ms per getRPM Koji request.  That’s much faster than what I can get from here, I don’t know, it but might be plausible as well.)

> 17162 22:29:09 to 56973
> 
> 2015-12-01 22:29:12,861 DEBUG: sign-rpms:send replies: Started handling 'id'
> = '\x00\x00\x01\xe3', 'status' = '\x00\x00\x00\x00'

The ID is a counter, this is the 484th RPM. There should be equivalent log lines for the 483 previous ones. And also “sign-rpms:{read requests,koji requests,send requests,read replies,koji replies}:” lines for each of the 484 RPMs should also have appeared by now, each is a prerequisite for the line above appearing.

Have you omitted them, or is there perhaps a thread safety issue to look for in the Python logging framework?

> !!! 5 minute delay

The next “sign-rpms:read replies” line after the delay, if any, might suggest a RPM which took that long to sign. Or not, if other queues are already close to capacity.

(This is all with --koji-only, i.e. the signed RPMs are not being all sent to the client, right?)

> 2015-12-01 22:36:03,316 DEBUG: sign-rpms:send replies: Started handling 'id'
> = '\x00\x00\x01\xe5', 'status' = '\x00\x00\x00\x00'

(So based on the first 486 RPMs, around 1.55s per signed RPM so far, modulo bridge→client socket buffer. Seems reasonable.)
 
> 17165 22:35:59 first icmp host prohibited sent to bridge regarding 56973tcp

So the error is actually “host unreachable” from the start and something in the kernel is turning into ECONNRESET, might be interesting to know.

> 2015-12-01 22:39:17,109 DEBUG: sign-rpms:send requests: Started handling
> 'id' = '\x00\x00\x03\xe7', 'rpm-arch' = 'noarch', 'rpm-epoch' = '5',
> 'rpm-name' = 'texlive-guitlogo-doc', 'rpm-release' = '16.fc24', 'rpm-sigmd5'
> = '#\xb9\xfb%L@\xeb\x1e\xd4B\x10dJ\xd1,)', 'rpm-version' = 'svn27458.0.9.2'
> 2015-12-01 22:39:17,182 DEBUG: sign-rpms:send requests finished, exc_info:
> None

(All 1000 RPMs, including full contents, sent to the vault for signing, modulo bridge→server socket buffer. At ~0.945s / RPM, plausible.)

> !!! 12 minute delay

Yeah.

> 2015-12-01 22:51:30,790 ERROR: sign-rpms:send replies: I/O error: NSPR
> connection reset

… and the reply processing side fails/is terminated.

>   File "/usr/share/sigul/double_tls.py", line 890, in send_outer_eof
>     self.__socket.send(utils.u32_pack(0))
> NSPRError: (PR_HOST_UNREACHABLE_ERROR) Host is unreachable.

(Surprisingly this one does not turn into ECONNRESET.)

Comment 15 Till Maas 2015-12-04 21:13:01 UTC

(In reply to Miloslav Trmač from comment #14)
> (In reply to Till Maas from comment #12)
> > However it is also unclear why the bridge delays
> > it operations for 5 minutes - there are no log lines left out by me.
> 
> The batch of 1000 RPMs, with debugging on the bridge apparently enabled,
> really only results in these few lines? Or do you mean “left out around the
> tcpdump lines” or something similar perhaps?

Sorry, I meant that I left out no lines when I added the markers about the delays.

> > 13966 22:23:32 SYN from 56973
> > 
> > 2015-12-01 22:23:36,371 INFO: Client with CN 'till' connected
> > 
> > 15076 22:23:44 FIN from 56973 to 44334
> 
> The list of 1000 RPMs was sent in 12 seconds, OK, plausible. (This is with
> NEVRAs referring to Koji, not to local file names?  I am bit surprised that
> the client was able to get the metadata in about 12 seconds, i.e. about 12
> ms per getRPM Koji request.  That’s much faster than what I can get from
> here, I don’t know, it but might be plausible as well.)

The metadata was already fetched when the client connected. It is this code:
https://pagure.io/releng/blob/master/f/scripts/sigulsign_unsigned.py#_586


> 
> > 17162 22:29:09 to 56973
> > 
> > 2015-12-01 22:29:12,861 DEBUG: sign-rpms:send replies: Started handling 'id'
> > = '\x00\x00\x01\xe3', 'status' = '\x00\x00\x00\x00'
> 
> The ID is a counter, this is the 484th RPM. There should be equivalent log
> lines for the 483 previous ones. And also “sign-rpms:{read requests,koji
> requests,send requests,read replies,koji replies}:” lines for each of the
> 484 RPMs should also have appeared by now, each is a prerequisite for the
> line above appearing.
> 
> Have you omitted them, or is there perhaps a thread safety issue to look for
> in the Python logging framework?

Yes, I omitted most entries. I uploaded the full log here:
https://till.fedorapeople.org/tmp/2015-12-01-bridge.log

I also noticed that for some ids there is an odd amount of log entries, not sure if they really have a problem, here are some examples:
     27 '\x00\x00\x02\xa4'
     27 '\x00\x00\x02\xa5'
     27 '\x00\x00\x02\xa6'
     27 '\x00\x00\x02\xa7'
     27 '\x00\x00\x02\xa8'


> > !!! 5 minute delay
> 
> The next “sign-rpms:read replies” line after the delay, if any, might
> suggest a RPM which took that long to sign. Or not, if other queues are
> already close to capacity.

The next read replies is:
2015-12-01 22:35:00,173 DEBUG: sign-rpms:read replies: Started handling 'id' = '\x00\x00\x01\xe4', 'status' = '\x00\x00\x00\x00'

It seems to be the texlive src RPM which might take long:
2015-12-01 22:27:52,985 DEBUG: sign-rpms:send requests: Started handling 'id' = '\x00\x00\x01\xe4', 'rpm-arch' = 'src', 'rpm-epoch' = '5', 'rpm-name' = 'texlive', 'rpm-release' = '16.20150728_r37987.fc24', 'rpm-sigmd5' = 'p\x97dr[W\xe8D2\xe0\xa6\x86\x82\xbc\x03\x18', 'rpm-version' = '2015'

> (This is all with --koji-only, i.e. the signed RPMs are not being all sent
> to the client, right?)

The command line was like this:
 sigul --batch --config-file /home/fedora/till/.sigul/client-secondary.conf sign-rpms --store-in-koji --koji-only -k arm --v3-signature fedora-24-secondary [1000 RPMS]

Comment 16 Miloslav Trmač 2015-12-04 21:30:31 UTC

(In reply to Till Maas from comment #15)
> (In reply to Miloslav Trmač from comment #14)
> > (In reply to Till Maas from comment #12)
> > > However it is also unclear why the bridge delays
> > > it operations for 5 minutes - there are no log lines left out by me.
> > 
> > The batch of 1000 RPMs, with debugging on the bridge apparently enabled,
> > really only results in these few lines? Or do you mean “left out around the
> > tcpdump lines” or something similar perhaps?
> 
> Sorry, I meant that I left out no lines when I added the markers about the
> delays.

Ah, OK. That makes sense.

> > > 13966 22:23:32 SYN from 56973
> > > 
> > > 2015-12-01 22:23:36,371 INFO: Client with CN 'till' connected
> > > 
> > > 15076 22:23:44 FIN from 56973 to 44334
> > 
> > The list of 1000 RPMs was sent in 12 seconds, OK, plausible. (This is with
> > NEVRAs referring to Koji, not to local file names?  I am bit surprised that
> > the client was able to get the metadata in about 12 seconds, i.e. about 12
> > ms per getRPM Koji request.  That’s much faster than what I can get from
> > here, I don’t know, it but might be plausible as well.)
> 
> The metadata was already fetched when the client connected. It is this code:
> https://pagure.io/releng/blob/master/f/scripts/sigulsign_unsigned.py#_586

(But the sigul client is sending 1000 individual getRPM requests anyway. Is the Koji Python client, or some proxy, doing some cross-process caching that would be primed by sigulsign_unsigned.py getting data first? Anyway, that’s a tangent.) 


> > > 17162 22:29:09 to 56973
> > > 
> > > 2015-12-01 22:29:12,861 DEBUG: sign-rpms:send replies: Started handling 'id'
> > > = '\x00\x00\x01\xe3', 'status' = '\x00\x00\x00\x00'
> > 
> > The ID is a counter, this is the 484th RPM. There should be equivalent log
> > lines for the 483 previous ones. And also “sign-rpms:{read requests,koji
> > requests,send requests,read replies,koji replies}:” lines for each of the
> > 484 RPMs should also have appeared by now, each is a prerequisite for the
> > line above appearing.
> > 
> > Have you omitted them, or is there perhaps a thread safety issue to look for
> > in the Python logging framework?
> 
> Yes, I omitted most entries. I uploaded the full log here:
> https://till.fedorapeople.org/tmp/2015-12-01-bridge.log
> 
> I also noticed that for some ids there is an odd amount of log entries, not
> sure if they really have a problem, here are some examples:
>      27 '\x00\x00\x02\xa4'
>      27 '\x00\x00\x02\xa5'
>      27 '\x00\x00\x02\xa6'
>      27 '\x00\x00\x02\xa7'
>      27 '\x00\x00\x02\xa8'

Yeah, the “send replies” thread starts handling \x02\xa3, receives the TCP error and terminates, so it will not process, and not log, the remaining items. (In these cases, it seems that they all get processed all the way through “koji replies”, get into the queue between “koji replies” and “send replies”, and are abandoned in there when the “send replies” thread terminates. But that is timing dependent, the processing of in-flight items may be aborted pretty much at any point in the pipeline.)


> 
> > > !!! 5 minute delay
> > 
> > The next “sign-rpms:read replies” line after the delay, if any, might
> > suggest a RPM which took that long to sign. Or not, if other queues are
> > already close to capacity.
> 
> The next read replies is:
> 2015-12-01 22:35:00,173 DEBUG: sign-rpms:read replies: Started handling 'id'
> = '\x00\x00\x01\xe4', 'status' = '\x00\x00\x00\x00'
> 
> It seems to be the texlive src RPM which might take long:
> 2015-12-01 22:27:52,985 DEBUG: sign-rpms:send requests: Started handling
> 'id' = '\x00\x00\x01\xe4', 'rpm-arch' = 'src', 'rpm-epoch' = '5', 'rpm-name'
> = 'texlive', 'rpm-release' = '16.20150728_r37987.fc24', 'rpm-sigmd5' =
> 'p\x97dr[W\xe8D2\xe0\xa6\x86\x82\xbc\x03\x18', 'rpm-version' = '2015'

almost 2 GB? Yeah, very likely.

Comment 17 Miloslav Trmač 2015-12-04 21:46:58 UTC

(In reply to Till Maas from comment #12)
> 2015-12-01 22:29:12,861 DEBUG: sign-rpms:send replies: Started handling 'id'
> = '\x00\x00\x01\xe3', 'status' = '\x00\x00\x00\x00'
> 
> !!! 5 minute delay

So at this point:
* “read requests” is at item 686 (22:27:55,497)
* “koji requests” is at item 585 (22:27:55,473)
* “send requests” is at item 484 (22:27:52,985), the \x01\xe4 AKA texlive.src

i.e. the vault has been buffering and sending texlive.src for <2 minutes + the 5 minute delay; the other two threads on the “request” (client→bridge(+koji)→vault) side have queued 100 items for the next thread and are now blocking until “send requests” is done with texlive.src,, and the three threads on the “reply” (vault→bridge(+koji)→client) side have processed all outstanding work and are waiting for signed texlive.src and following RPMs.

The texlive.src reply only *starts* to arrive by 22:35:00,173 (i.e. in about 8 minutes), and the signed RPM is copied back to the bridge a bit before 22:36:01,552.

> 2015-12-01 22:34:09,266 DEBUG: sign-rpms:send requests: Started handling
> 'id' = '\x00\x00\x01\xe5', 'rpm-arch' = 'noarch', 'rpm-epoch' = '',
> 'rpm-name' = 'python3-magic', 'rpm-release' = '3.fc24', 'rpm-sigmd5' =
> 'F\xad*\xaeE\xb0\x85\x93\xcfc\xf8\x94\x01\xac\x8e\xf4', 'rpm-version' =
> '5.25'
> 
> 17164 22:35:59 to 56973
> 
> 2015-12-01 22:36:03,316 DEBUG: sign-rpms:send replies: Started handling 'id'
> = '\x00\x00\x01\xe5', 'status' = '\x00\x00\x00\x00'
> 
> 17165 22:35:59 first icmp host prohibited sent to bridge regarding 56973tcp
> 
> 2015-12-01 22:39:17,109 DEBUG: sign-rpms:send requests: Started handling
> 'id' = '\x00\x00\x03\xe7', 'rpm-arch' = 'noarch', 'rpm-epoch' = '5',
> 'rpm-name' = 'texlive-guitlogo-doc', 'rpm-release' = '16.fc24', 'rpm-sigmd5'
> = '#\xb9\xfb%L@\xeb\x1e\xd4B\x10dJ\xd1,)', 'rpm-version' = 'svn27458.0.9.2'
> 2015-12-01 22:39:17,182 DEBUG: sign-rpms:send requests finished, exc_info:
> None
> 2015-12-01 22:39:17,183 DEBUG: Sending final EOFs to sign-rpms:read
> replies...
> 2015-12-01 22:39:17,183 DEBUG: Waiting for sign-rpms:read replies...
> 
> !!! 12 minute delay
> 
> 2015-12-01 22:51:30,790 ERROR: sign-rpms:send replies: I/O error: NSPR
> connection reset

(or, over 15 minute delay since first ICMP host prohibited.)

Comment 18 Till Maas 2015-12-07 21:11:23 UTC

I added firewall rules to the signing host to not be stateful. I hope this helps. However, could sigul be changed to also work with default Linux stateful firewalls?

Comment 19 Miloslav Trmač 2015-12-09 13:44:09 UTC

Due to the previous reliability work and moving the termination from TCP to application-level it should be possible, at least as a guess without a detailed analysis.

Step 1 would be figuring out what the firewall is doing, though; so far we only have an educated guess that it is even the cause. AFAICT keeping a half-open connection this way is perfectly legal TCP¹ so we will need to figure out the precise constraints.

I think it has also become clear that there probably are some best practices for TCP WRT non-default but recommended timeouts, keepalives etc. which could make the failure detection, if not the behavior, much smoother, and about which I have no idea. One does not too frequently hear about a TCP connection taking 15 minutes to time out.

(OTOH if signing a single RPM does legitimately take 8 minutes and causes a comparable amount of quiet time on the network, I am not sure we can tighten timeouts much. Still, keepalives could be useful.)


¹ Note to self: This is over TLS: is a long-term half-open connection valid for TLS? Will rekeying etc. eventually break?

Comment 20 Till Maas 2015-12-09 22:14:43 UTC

(In reply to Miloslav Trmač from comment #19)

> Step 1 would be figuring out what the firewall is doing, though; so far we
> only have an educated guess that it is even the cause. AFAICT keeping a
> half-open connection this way is perfectly legal TCP¹ so we will need to
> figure out the precise constraints.

It is certain that it is the firewall. The tcpdump showed the admin prohibited pkgs that come only from the firewall and I noticed a small TTL in iptstate after the send queue went on the sigul client system.

The TCP connection goes to FIN-WAIT-2 according to ss, and the connection state in iptabels becomes CLOSE_WAIT with a TTL of 60 seconds.

Maybe it is also a bug in the iptables defaults since the CLOSE_WAIT timeout used to be 12 hours according to: http://www.iptables.info/en/connection-state.html

Comment 21 Fedora Admin XMLRPC Client 2016-08-09 14:55:31 UTC

This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 22 Patrick Uiterwijk 2016-10-03 08:01:30 UTC

This seems to have been fixed by making a static firewall rule for inbound connections on the server.