1688317 – sporadic timeouts in opening TCP connection prevents Satellite upgrades


Bug 1688317 - sporadic timeouts in opening TCP connection prevents Satellite upgrades


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Satellite
Classification:	Red Hat
Component:	Upgrades
Sub Component:
Version:	6.4.2
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	6.4.3
Assignee:	satellite6-bugs
QA Contact:	Nikhil Kathole
Docs Contact:
URL:
Whiteboard:
Depends On:	1687190
Blocks:

Reported:	2019-03-13 14:28 UTC by sthirugn@redhat.com
Modified:	2019-10-30 17:14 UTC
CC List:	8 users
Fixed In Version:	foreman-1.18.0.41-1
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1687190
Environment:
Last Closed:	2019-04-29 18:15:37 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0903	0	None	None	None	2019-04-29 18:15:39 UTC

Description sthirugn@redhat.com 2019-03-13 14:28:21 UTC

+++ This bug was initially created as a clone of Bug #1687190 +++

Description of problem:
Upgrading Satellite (e.g. from 6.3 to 6.4) can fail in a random rake script with a backtrace like:

RestClient::Exceptions::OpenTimeout: Timed out connecting to server
/opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.1/lib/restclient/request.rb:731:in `rescue in transmit'
/opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.1/lib/restclient/request.rb:642:in `transmit'
/opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.1/lib/restclient/request.rb:145:in `execute'
/opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.1/lib/restclient/request.rb:52:in `execute'
/opt/theforeman/tfm/root/usr/share/gems/gems/rest-client-2.0.1/lib/restclient/resource.rb:67:in `post'
/opt/theforeman/tfm/root/usr/share/gems/gems/runcible-2.8.1/lib/runcible/base.rb:92:in `get_response'
/opt/theforeman/tfm/root/usr/share/gems/gems/runcible-2.8.1/lib/runcible/base.rb:72:in `call'
/opt/theforeman/tfm/root/usr/share/gems/gems/runcible-2.8.1/lib/runcible/resources/repository.rb:147:in `unit_search'
/opt/theforeman/tfm/root/usr/share/gems/gems/runcible-2.8.1/lib/runcible/extensions/repository.rb:199:in `distributions'
/opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.7.0.46/app/models/katello/repository.rb:620:in `import_distribution_data'
/opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.7.0.46/app/models/katello/repository.rb:615:in `block in import_distributions'
/opt/theforeman/tfm-ror51/root/usr/share/gems/gems/activerecord-5.1.6/lib/active_record/relation/delegation.rb:39:in `each'
/opt/theforeman/tfm-ror51/root/usr/share/gems/gems/activerecord-5.1.6/lib/active_record/relation/delegation.rb:39:in `each'

(the rest of the backtrace depends on the use case, i.e. on which rake script failed; for example, seen a few times:

/opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.7.0.46/app/models/katello/repository.rb:614:in `import_distributions'
/opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.7.0.46/lib/katello/tasks/upgrades/3.7/make_all_ks_repos_bootable.rake:7:in `block (4 levels) in <top (required)>'
/opt/rh/rh-ruby24/root/usr/share/gems/gems/rake-12.0.0/exe/rake:27:in `<top (required)>'

)

It usually happens in a rake script that fires REST requests in a tight loop. Occurrences are very random, and more frequent on 6.5.


inecas++ found a fix:

diff --git a/app/models/setting.rb b/app/models/setting.rb
index 72a094024aa135fa41e0f6454d07c1d79fa3ac0a..e19ad55b24afb74c83f9e3033db2756541830fab 100644
--- a/app/models/setting.rb
+++ b/app/models/setting.rb
@@ -93,12 +93,8 @@ class Setting < ApplicationRecord
   def self.[](name)
     name = name.to_s
     cache_value = Setting.cache.read(name)
-    if cache_value.nil?
-      value = find_by(:name => name).try(:value)
-      Setting.cache.write(name, value)
-      return value
-    else
-      return cache_value
+    cache.fetch(name) do
+      find_by(:name => name).try(:value)
     end
   end

That really works.
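The following is only a rough sketch of one behavioural difference between the two patterns in the diff; this report does not say whether it is the mechanism behind the timeouts. It assumes a Rails-style cache in which fetch treats a stored nil as a hit, while read returns nil for both a miss and a stored nil. TinyCache, FakeSettings, and the setting name are hypothetical stand-ins, not Foreman or ActiveSupport code.

```ruby
# Hypothetical stand-in for a Rails-style cache store (NOT the real
# ActiveSupport class): values are wrapped in an entry, so a stored nil
# still counts as a hit for #fetch, while #read returns nil either way.
class TinyCache
  Entry = Struct.new(:value)

  def initialize
    @entries = {}
  end

  def read(name)
    entry = @entries[name]
    entry && entry.value
  end

  def write(name, value)
    @entries[name] = Entry.new(value)
    value
  end

  def fetch(name)
    entry = @entries[name]
    return entry.value if entry
    write(name, yield)
  end
end

# Hypothetical settings table; #lookups counts simulated database queries.
class FakeSettings
  attr_reader :lookups

  def initialize(rows)
    @rows = rows
    @cache = TinyCache.new
    @lookups = 0
  end

  # Old pattern from the diff: read, test for nil, then query and write.
  def get_with_read(name)
    cached = @cache.read(name)
    return cached unless cached.nil?
    value = find_value(name)
    @cache.write(name, value)
    value
  end

  # New pattern from the diff: let the cache decide what counts as a miss.
  def get_with_fetch(name)
    @cache.fetch(name) { find_value(name) }
  end

  private

  def find_value(name)
    @lookups += 1
    @rows[name]
  end
end

before_fix = FakeSettings.new('some_setting' => nil)
3.times { before_fix.get_with_read('some_setting') }

after_fix = FakeSettings.new('some_setting' => nil)
3.times { after_fix.get_with_fetch('some_setting') }

puts "read/write pattern: #{before_fix.lookups} lookups"
puts "fetch pattern:      #{after_fix.lookups} lookups"
```

Under these assumptions, a setting whose value is nil is looked up again on every call with the old pattern, but only once with fetch.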

This BZ is mainly a tracker to ensure we fix it downstream and to decide everywhere it needs to be fixed.


Version-Release number of selected component (if applicable):
Sat 6.4.2


How reproducible:
Very sporadically during upgrade. A somewhat artificial reproducer found by Ivan is provided below.


Steps to Reproduce:
1. Decrease the Katello REST timeout so that the timeouts show up much sooner than after 1 hour:
/etc/foreman/plugins/katello.yaml :

  :rest_client_timeout: 30

2. Run any (or all) of the following scripts, each of which fires REST requests in a tight loop:

a) requests to 127.0.0.1/pub :
mkdir -p ~/repro
cat <<EOF | foreman-rake console | tee ~/repro/pub-$(date -Isec).log
   conf.return_format = ""
   log = Logger.new(STDOUT)
   log.info "=== process pid #{\$\$}"
   url = URI.parse('http://127.0.0.1/pub/')
   threads = 1.times.map do |i|
     Thread.new do
       log.info "Start thread no. #{i+1}"
       iteration = 0
       cycle = 0
       while true
         begin
           cycle += 1
           if cycle == 1000
             iteration += 1
             cycle = 0
             log.info "=== Thread #{i+1}: iteration #{iteration}k"
           end
           Setting['http_cache']
           Net::HTTP.get_response(url)
         rescue Exception => e
           log.info "Thread #{i+1}: error at iteration #{iteration * 1000 + cycle}"
           log.info "#{e.class} #{e.message}"
           log.info e.backtrace.join("\n")
           sleep(1)
         end
       end
     end
   end
   threads.each(&:join)
EOF

b) requests to pulp repo details:
cat <<EOF | foreman-rake console | tee ~/repro/pulp-$(date -Isec).log
   conf.return_format = ""
   log = Logger.new(STDOUT)
   pulp_id = Katello::Repository.first.pulp_id
   Katello.pulp_server.extensions.repository
   threads = 4.times.map do |i|
     Thread.new do
       log.info "Start thread no. #{i+1}"
       iteration = 0
       cycle = 0
       while true        
         begin
           cycle += 1
           if cycle == 1000
             iteration += 1
             cycle = 0
             log.info "=== iteration #{iteration}k"
           end
           Katello.pulp_server.extensions.repository.distributions(pulp_id)
         rescue Exception => e
           log.info "Error at iteration #{iteration * 1000 + cycle}"
           log.info "#{e.class} #{e.message}"
           log.info e.backtrace.join("\n")
         end
       end
     end
   end
   threads.each(&:join)
EOF

c) optionally, run requests against Candlepin, for example (in the rake console):

  4.times do
    Thread.new do
      begin
        org = Organization.first
        label = org.label
        iteration = 0
        cycle = 0
        while true
          cycle += 1
          if cycle == 1000
            iteration += 1
            cycle = 0
            puts "=== iteration #{iteration}k"
          end
          Katello::Resources::Candlepin::Owner.get_ueber_cert(label)
        end
      rescue Exception => e
        puts e.message
        puts e.backtrace.join("\n")
        raise e
      end
    end
  end

3. Wait a few hours and count the number of timeouts.
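
Counting the timeouts by hand gets tedious after a few hours of output, so here is a minimal sketch of a tally script. It assumes the errors were written through Logger as "<ExceptionClass> <message>", as reproducers a) and b) do, and that the timeout exception classes have names ending in "Timeout". The helper name count_timeouts and the pattern are illustrative assumptions, not part of Satellite.

```ruby
# Tally timeout errors in reproducer logs, grouped by exception class.
# Assumption: each error was logged via Logger as "<ExceptionClass> <message>"
# and the class name of a timeout error ends in "Timeout".
TIMEOUT_LINE = /-- : ((?:\w+::)*\w*Timeout)\b/

def count_timeouts(lines)
  counts = Hash.new(0)
  lines.each do |line|
    match = TIMEOUT_LINE.match(line)
    counts[match[1]] += 1 if match
  end
  counts
end

# Example usage against the log directory written by the reproducers:
#   logs  = Dir.glob(File.expand_path('~/repro/*.log'))
#   lines = logs.flat_map { |path| File.readlines(path) }
#   count_timeouts(lines).each { |klass, n| puts "#{klass}: #{n}" }
```

Reproducer c) prints only the exception message with puts, so its output would need a different pattern.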


Actual results:
3. Depending heavily on the system, anywhere from a few to many timeouts occur.


Expected results:
3. no timeouts


Additional info:

--- Additional comment from RHEL Product and Program Management on 2019-03-10 19:18:08 UTC ---

Since this bug report was entered in Red Hat Bugzilla, the 'sat-backlog' flag has been set to ? to ensure that it is properly evaluated for release.

--- Additional comment from RHEL Product and Program Management on 2019-03-10 19:18:08 UTC ---

Since this issue was entered in Red Hat Bugzilla, the pm_ack has been set to + automatically for the next planned release.

Comment 3 Nikhil Kathole 2019-04-11 09:22:47 UTC

VERIFIED

Version tested:
# rpm -q satellite
satellite-6.4.3-1.el7sat.noarch

No timeouts seen with the provided reproducer.

I, [2019-04-11T05:06:38.925736 #19214]  INFO -- : === Thread 1: iteration 100k
I, [2019-04-11T05:06:41.913713 #19214]  INFO -- : === Thread 1: iteration 101k
I, [2019-04-11T05:06:45.022606 #19214]  INFO -- : === Thread 1: iteration 102k
I, [2019-04-11T05:06:48.103804 #19214]  INFO -- : === Thread 1: iteration 103k
I, [2019-04-11T05:06:51.326653 #19214]  INFO -- : === Thread 1: iteration 104k


I, [2019-04-11T05:19:39.623569 #1336]  INFO -- : === iteration 16k
I, [2019-04-11T05:19:40.514961 #1336]  INFO -- : === iteration 16k
I, [2019-04-11T05:20:22.603960 #1336]  INFO -- : === iteration 17k
I, [2019-04-11T05:20:24.864712 #1336]  INFO -- : === iteration 17k
I, [2019-04-11T05:20:25.063272 #1336]  INFO -- : === iteration 17k
I, [2019-04-11T05:20:26.603654 #1336]  INFO -- : === iteration 17k
I, [2019-04-11T05:21:08.424344 #1336]  INFO -- : === iteration 18k

Comment 5 errata-xmlrpc 2019-04-29 18:15:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0903
