1608217 – goferd prints reconnect error whenever qdrouterd is restarted

Summary: goferd prints reconnect error whenever qdrouterd is restarted

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Satellite
Classification:	Red Hat
Component:	katello-agent
Sub Component:
Version:	6.3.2
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	Released
Assignee:	satellite6-bugs
QA Contact:	Jan Hutař
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-07-25 06:45 UTC by Pavel Moravec
Modified:	2019-10-07 17:37 UTC
CC List:	11 users
Fixed In Version:	Satellite 6.4.1
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-12-11 16:20:34 UTC
Target Upstream Version:
Embargoed:

Description Pavel Moravec 2018-07-25 06:45:49 UTC

Description of problem:
Whenever qdrouterd is restarted (e.g. due to a katello-service restart, or when only qdrouterd itself is restarted), all goferd clients directly connected to that qdrouterd log error messages, even though their connection is only bounced and they reconnect successfully shortly afterwards. This needlessly raises alarms along the lines of "some error is happening on all these hundreds/thousands of systems", which an operator must then check manually in a monitoring tool.

I think this can have a simple "code workaround" where goferd makes its first reconnect attempt not immediately but after 10 seconds (i.e. it follows the same retry scheme as after a first failed connection attempt). By then qdrouterd should already be up and, in the case of a Capsule, should (I hope) have its routing table propagated (at least per my testing).

Version-Release number of selected component (if applicable):
Sat 6.3.2 tools:
- gofer 2.7.7-3
- katello-agent 3.1.0-2

How reproducible:
100%

Steps to Reproduce:
1. Have a goferd connected to qdrouterd on a system registered to a Satellite or Capsule
2. Restart qdrouterd on that Satellite / Capsule
3. Monitor /var/log/messages (and check if/when goferd re-creates its connection to qdrouterd on port 5647)

Actual results:
An error like the one below appears before the successful reconnect:
Jul 25 08:41:51 pmoravco-rhel7 goferd: [ERROR][pulp.agent.6be25b22-1aec-4175-b708-a44f2242d22a] gofer.messaging.adapter.proton.reliability:53 - Connection amqps://pmoravec-caps63.gsslab.brq2.redhat.com:5647 disconnected: Condition('amqp:connection:framing-error', 'SSL Failure: Unknown error.')
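
For anyone who has to triage these alarms in the meantime, the log line above can be picked apart with a small filter. This is only a rough sketch: the regular expression is derived from the single sample line in this report, and the names parse_goferd_line and is_reconnect_noise are hypothetical helpers, not part of gofer or any monitoring tool.

```python
import re

# Pattern inferred from the one sample syslog line in this report (assumption:
# other goferd lines follow the same "[LEVEL][thread] module:lineno - message" shape).
GOFERD_LINE = re.compile(
    r"goferd: \[(?P<level>[A-Z]+)\]\[(?P<thread>[^\]]+)\] "
    r"(?P<module>[\w.]+):(?P<lineno>\d+) - (?P<message>.*)$"
)

RELIABILITY_MODULE = "gofer.messaging.adapter.proton.reliability"


def parse_goferd_line(line):
    """Return a dict of the goferd fields in a syslog line, or None if it does not match."""
    match = GOFERD_LINE.search(line)
    return match.groupdict() if match else None


def is_reconnect_noise(line):
    """True if the line looks like the transient disconnect message described here."""
    fields = parse_goferd_line(line)
    return bool(
        fields
        and fields["module"] == RELIABILITY_MODULE
        and "disconnected" in fields["message"]
    )


SAMPLE = (
    "Jul 25 08:41:51 pmoravco-rhel7 goferd: "
    "[ERROR][pulp.agent.6be25b22-1aec-4175-b708-a44f2242d22a] "
    "gofer.messaging.adapter.proton.reliability:53 - Connection "
    "amqps://pmoravec-caps63.gsslab.brq2.redhat.com:5647 disconnected: "
    "Condition('amqp:connection:framing-error', 'SSL Failure: Unknown error.')"
)

if __name__ == "__main__":
    print(parse_goferd_line(SAMPLE)["level"], is_reconnect_noise(SAMPLE))
```

A filter like this only hides the symptom on the monitoring side; it does not change what goferd logs.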

Expected results:
No such error, just a successful reconnect.

Additional info:

Comment 1 Pavel Moravec 2018-07-25 08:01:33 UTC

goferd already waits the 10 seconds before reconnecting, and the error:

Jul 25 09:45:04 pmoravec-caps63 goferd: [ERROR][pulp.agent.d79cc595-3b70-4e8b-8b72-d8482f4b66e9] gofer.messaging.adapter.proton.reliability:53 - Connection amqps://pmoravec-sat63.gsslab.brq2.redhat.com:5647 disconnected: Condition('amqp:connection:framing-error', 'SSL Failure: Unknown error')

is printed by /usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/reliability.py :

def reliable(fn):
    def _fn(messenger, *args, **kwargs):
        repair = lambda: None
        while not Thread.aborted():
            try:
                repair()
                return fn(messenger, *args, **kwargs)
            except LinkDetached, le:
                if le.condition != NOT_FOUND:
                    log.error(utf8(le))
                    repair = messenger.repair
                    sleep(DELAY)
                else:
                    raise NotFound(*le.args)
            except ConnectionException, pe:
                log.error(utf8(pe))               ###### this line
                repair = messenger.repair
                sleep(DELAY)
    return _fn


Both log.error(...) calls here ("log.error(utf8(le))" and "log.error(utf8(pe))") are reached only when a connection had been successfully established and some link or connection error is then hit (other than a missing pulp.agent.* queue, which is important). Since goferd will try to reconnect in 10s, these events should be logged as warnings instead of errors.

Only the missing-queue event deserves error level, as it means something fishy is happening on the Satellite qpidd. But that case is reported either way via the explicit "raise NotFound(*le.args)" and is logged as an error in gofer.messaging.consumer:74.

So I suggest changing both log.error(...) calls in proton/reliability.py to warning level.
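
To illustrate the suggestion, here is a minimal, self-contained sketch of the same retry loop with the two calls switched to log.warning. It is not the actual gofer patch: the exception classes, the NOT_FOUND value, the logger name, and the injectable delay/sleep parameters are stand-ins so the snippet runs on its own under Python 3, and the Thread.aborted() check from the original loop is left out.

```python
import logging
import time
from functools import wraps

log = logging.getLogger("reliability_sketch")  # hypothetical logger name

DELAY = 10                    # seconds between attempts, per this report
NOT_FOUND = "amqp:not-found"  # placeholder value for the real constant


class LinkDetached(Exception):
    """Stand-in for the real exception; carries an AMQP condition."""

    def __init__(self, condition, description=""):
        super().__init__(condition, description)
        self.condition = condition


class ConnectionException(Exception):
    """Stand-in for the real connection-level exception."""


class NotFound(Exception):
    """Stand-in for gofer's NotFound (missing queue)."""


def reliable(fn, delay=DELAY, sleep=time.sleep):
    @wraps(fn)
    def _fn(messenger, *args, **kwargs):
        repair = lambda: None  # nothing to repair on the first attempt
        while True:            # the original loops on "not Thread.aborted()"
            try:
                repair()
                return fn(messenger, *args, **kwargs)
            except LinkDetached as le:
                if le.condition != NOT_FOUND:
                    log.warning(str(le))  # was log.error: transient, will retry
                    repair = messenger.repair
                    sleep(delay)
                else:
                    # A missing queue is still surfaced to the caller as an error.
                    raise NotFound(*le.args)
            except ConnectionException as pe:
                log.warning(str(pe))      # was log.error: transient, will retry
                repair = messenger.repair
                sleep(delay)
    return _fn
```

With this shape, a bounced connection produces a warning followed by a repair attempt, while a missing queue still propagates as NotFound and can be logged as an error by the caller.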

Jeff, do you agree?

Comment 2 Jeff Ortel 2018-07-25 12:35:03 UTC

Agreed.

Comment 3 Jeff Ortel 2018-07-25 14:19:32 UTC

Requested change completed/merged in the upstream project.  It will be tagged and released in Fedora (updates) and Copr shortly.

https://github.com/jortel/gofer/pull/90
https://github.com/jortel/gofer/pull/92

Comment 4 Brad Buckingham 2018-07-27 19:55:29 UTC

Moving to POST as the 2 PRs referenced in comment 3 have been merged upstream.

Comment 16 Mike McCune 2018-12-11 16:20:34 UTC

We shipped this in 6.4.1 with the update here:

https://bugzilla.redhat.com/show_bug.cgi?id=1646736

I tested the updated builds and see the switch to WARNING vs ERROR.

Closing as CURRENTRELEASE.
