Bug 489456 - rhnreg_ks with activation key w/ group assignment times out (3 or 4+min) or fails with key limit due to retries (similarly deleting a system takes 3+ minutes)
Status: CLOSED CURRENTRELEASE
Product: Red Hat Satellite 5
Classification: Red Hat
Component: Registration
Version: 520
Hardware: All
OS: Linux
Priority: high  Severity: high
Assigned To: Michael Mráka
QA Contact: Red Hat Satellite QA List
Depends On:
Blocks: sat540-blockers
Reported: 2009-03-10 04:18 EDT by Issue Tracker
Modified: 2010-10-23 04:12 EDT (History)
6 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-07-16 08:08:30 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Issue Tracker 2009-03-10 04:18:05 EDT
Escalated to Bugzilla from IssueTracker
Comment 3 Issue Tracker 2009-03-10 04:18:09 EDT
Event posted 02-12-2009 08:27am EST by DanBennett
 	
Test for bad activation keys was conducted this morning. My write-up is in
the attached PDF. The actual test starts at the bottom of page 3.

Differences from our initial report last week:
- these keys were created with the web ui; last week's were created via the API
- these keys assign child channels; last week's did not

Since the SID is never installed in the client, no hardware, software, or
other info is updated on the satellite.

QUESTION: From the client used this morning -- would you like the
up2date.log? A sosreport?


This event sent from IssueTracker by vgaikwad  [Support Engineering Group]
 issue 266660
Comment 5 Issue Tracker 2009-03-10 04:18:12 EDT
Event posted 02-13-2009 07:59am EST by DanBennett 
	
See attached snippet from the HTTP error_log.

This SQLSchemaError was produced, I believe, when a system was trying to
register using an Activation Key last night. I've grepped all error_logs,
including those from when the satellite was at debug=5 yesterday; this is
the only SQLSchemaError entry.

Also curious (though I don't know how concerning it is): the timestamps on
the first three entries of the attachment:
18:44:40
19:09:32
18:44:40

What additional info do you need regarding this? Another satellite-debug,
or do you have all the logs from that time?


File uploaded: errlogsnippet.txt

   Event posted 02-12-2009 10:05am EST by rmunilla

	
Thank you Dan,

We will look through the attached information and update the ticket
again.

Regards,
Robert 


This event sent from IssueTracker by vgaikwad  [Support Engineering Group]
 issue 266660
Comment 9 Issue Tracker 2009-03-10 04:18:19 EDT
Dan,


Thank you for providing the oracle dump.  What we see so far is:

3 deadlock errors in the bdump trace:

Thu Jan 29 07:33:21 2009
ORA-00060: Deadlock detected. More info in file
/rhnsat/admin/rhnsat/udump/rhnsat_ora_16149.trc.
Shutting down instance: further logons disabled

Thu Jan 29 11:40:06 2009
ORA-00060: Deadlock detected. More info in file
/rhnsat/admin/rhnsat/udump/rhnsat_ora_17448.trc.

Thu Feb 12 18:44:40 2009
ORA-00060: Deadlock detected. More info in file
/rhnsat/admin/rhnsat/udump/rhnsat_ora_28285.trc.

------------------
06:43:20: register maltese using 1-T-N-US-SN5RHI64 -- 06:43:20
     (client time), 06:43:21 welcome_message in rhn_server_xmlrpc.log
     -- SID = 1000015199
     timeout & return to prompt at 06:46:53 (client time)
------------------

The last one, which happened at 18:44, matches the registration timeout
activity and also the traceback error that was received.

As shown in the log above, the deadlock-related information is available
in the /rhnsat/admin/rhnsat/udump/rhnsat_ora_* files.

Let's get a tarball of the following:

# tar cvjf oracle_full_dump.tar.bz2 /rhnsat/admin/rhnsat

We do not want to miss anything this time.

Thanks,
Robert



Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client

This event sent from IssueTracker by vgaikwad  [Support Engineering Group]
 issue 266660
Comment 11 Issue Tracker 2009-03-10 04:18:23 EDT
Interesting -- rhnsat_ora_28285.trc and its accusation toward the
application (lines 22-25).

Early on I
 - restored the database to a backup satellite
 - registered a client using a "bad" Activation key (one that had failed
the previous day on the live host)

I recall, perhaps incorrectly, entering this in the Issue Tracker. I see
the Issue open 01/28, but only see history back to 02/16 in this Issue
(02/12 in the source SizeLimit Issue).
Maybe an entry is there but I don't know how to find it.

Regardless, given that most keys work, I don't have a theory about a
suspect satellite software installation on our live host.


This event sent from IssueTracker by vgaikwad  [Support Engineering Group]
 issue 266660
Comment 12 Issue Tracker 2009-03-10 04:18:25 EDT
Severity increased to 4. If this problem _blossoms_, it could have severe
impact on our process workflow.

A different symptom than we have seen before ---

To date, either an Activation Key has been "good" (and works as
expected), or "bad".

A recent registration attempt (2009/02/17 11:55:33) would have been the
fourth system to register with a particular key (3-W-N-JDC-SX5RHI64).

This key has a Usage Limit of 25. There were 3 systems previously
registered with the key.

Due to the distributed nature of our customer base, evidence is being
destroyed as I enter this (or I would provide more detail).

Generally, symptoms are similar to what we have previously seen with
"bad" keys.


Severity set to: High

This event sent from IssueTracker by vgaikwad  [Support Engineering Group]
 issue 266660
Comment 15 Issue Tracker 2009-03-10 04:18:32 EDT
The registration with my prior update shows the same symptoms to the
client, but has provided different information.

There is a traceback -- attached.

There is "fuzzy" channel accounting. See attached
EntitlementCountingQuestionable.pdf
Specifically, in this case, on
/rhn/admin/multiorg/SoftwareEntitlementSubscriptions.do the organization
has exceeded the Proposed Total usage allocation.

It seems this fuzzy accounting relates to the tables in the problem SQL in
the /rhnsat/admin/rhnsat/udump/rhnsat_ora_* files.

NOTE This item shows a Xen guest taking up channel entitlements.
With this particular case we can increase the allocation. But as more
guest systems are subscribed, we will encounter a hard limit.


This event sent from IssueTracker by vgaikwad  [Support Engineering Group]
 issue 266660
Comment 16 Issue Tracker 2009-03-10 04:18:34 EDT
Hello Rob,

Thank you for providing the full oracle trace dump.

So as per /rhnsat/admin/rhnsat/udump/rhnsat_ora_17448.trc, I see the
following two queries which triggered the deadlock on the RHN Satellite
server:

*** 2009-01-29 11:40:05.949
*** SERVICE NAME:(SYS$USERS) 2009-01-29 11:40:05.852
*** SESSION ID:(312.1) 2009-01-29 11:40:05.852
DEADLOCK DETECTED ( ORA-00060 )

  Current SQL Statement:
  DELETE FROM rhnServerNeededErrataCache
      WHERE server_id = :1
        AND org_id = :2
        AND errata_id = :3
End of information on OTHER waiting sessions.

Current SQL statement for this session:
DELETE FROM rhnServerNeededErrataCache
      WHERE server_id = :1
        AND org_id = :2
        AND errata_id = :3

This is a known issue (https://bugzilla.redhat.com/show_bug.cgi?id=463155)
in the code base of RHN Satellite 5.1.1, and there's a possibility
that customers running RHN Satellite server 5.2 might hit it. This bug
will be addressed/fixed with the release of RHN Satellite server 5.3.

The following deadlocks look new to me; we might have to dig deeper to
find the culprit code.

*** 2009-02-12 18:44:40.516
*** ACTION NAME:() 2009-02-12 18:44:40.468
*** MODULE NAME:(mod_python) 2009-02-12 18:44:40.468
*** SERVICE NAME:(SYS$USERS) 2009-02-12 18:44:40.468
*** SESSION ID:(303.2) 2009-02-12 18:44:40.468

  Current SQL Statement:
  SELECT SG.GROUP_TYPE, SG.ORG_ID, SG.CURRENT_MEMBERS, SG.MAX_MEMBERS FROM
RHNSERVERGROUP SG WHERE SG.ID = :B1 FOR UPDATE OF SG.CURRENT_MEMBERS
End of information on OTHER waiting sessions.

Current SQL statement for this session:
SELECT CURRENT_MEMBERS FROM RHNPRIVATECHANNELFAMILY WHERE ORG_ID = :B2 AND
CHANNEL_FAMILY_ID = :B1 FOR UPDATE OF CURRENT_MEMBERS

*** 2009-01-29 07:33:21.387
*** ACTION NAME:() 2009-01-29 07:33:21.387
*** MODULE NAME:(/usr/sbin/httpd) 2009-01-29 07:33:21.387
*** SERVICE NAME:(SYS$USERS) 2009-01-29 07:33:21.387
*** SESSION ID:(257.55) 2009-01-29 07:33:21.387

 Current SQL Statement:
  SELECT CURRENT_MEMBERS FROM RHNPRIVATECHANNELFAMILY WHERE ORG_ID = :B2
AND CHANNEL_FAMILY_ID = :B1 FOR UPDATE OF CURRENT_MEMBERS
End of information on OTHER waiting sessions.

Current SQL statement for this session:
UPDATE RHNPRIVATECHANNELFAMILY SET CURRENT_MEMBERS = CURRENT_MEMBERS -1
WHERE ORG_ID IN ( SELECT ORG_ID FROM RHNSERVER WHERE ID = :B1 ) AND
CHANNEL_FAMILY_ID IN ( SELECT RCFM.CHANNEL_FAMILY_ID FROM
RHNCHANNELFAMILYMEMBERS RCFM, RHNSERVERCHANNEL RSC WHERE RSC.SERVER_ID =
:B1 AND RSC.CHANNEL_ID = RCFM.CHANNEL_ID AND NOT EXISTS ( SELECT 1 FROM
RHNCHANNELFAMILYVIRTSUBLEVEL CFVSL, RHNSGTYPEVIRTSUBLEVEL SGTVSL,
RHNSERVERENTITLEMENTVIEW SEV, RHNVIRTUALINSTANCE VI WHERE
VI.VIRTUAL_SYSTEM_ID = :B1 AND VI.HOST_SYSTEM_ID = SEV.SERVER_ID AND
SEV.LABEL IN ('virtualization_host', 'virtualization_host_platform')
AND SEV.SERVER_GROUP_TYPE_ID = SGTVSL.SERVER_GROUP_TYPE_ID AND
SGTVSL.VIRT_SUB_LEVEL_ID = CFVSL.VIRT_SUB_LEVEL_ID AND
CFVSL.CHANNEL_FAMILY_ID = RCFM.CHANNEL_FAMILY_ID ) )

I'll investigate this further and will update you with my findings.

regards,


This event sent from IssueTracker by vgaikwad  [Support Engineering Group]
 issue 266660
Comment 20 Issue Tracker 2009-03-10 04:18:42 EDT
Hello Rob,

At this point I'm gathering information about the deadlock from the
logs.

Deadlocks are very order-sensitive, in the sense that what operations you
do, and in what order, really matters.

Looking at the traceback:

_subscribe_sql() - Calls a SQL procedure "rhn_channel.subscribe_server"

Now if you see the deadlock SQL statements:

Current SQL Statement:
 SELECT CURRENT_MEMBERS FROM RHNPRIVATECHANNELFAMILY WHERE ORG_ID = :B2
AND CHANNEL_FAMILY_ID = :B1 FOR UPDATE OF CURRENT_MEMBERS
End of information on OTHER waiting sessions.

Current SQL statement for this session:
UPDATE RHNPRIVATECHANNELFAMILY SET CURRENT_MEMBERS = CURRENT_MEMBERS -1
WHERE ORG_ID IN ( SELECT ORG_ID FROM RHNSERVER WHERE ID = :B1 ) AND
CHANNEL_FAMILY_ID IN ( SELECT RCFM.CHANNEL_FAMILY_ID FROM
RHNCHANNELFAMILYMEMBERS RCFM, RHNSERVERCHANNEL RSC WHERE RSC.SERVER_ID =
:B1 AND RSC.CHANNEL_ID = RCFM.CHANNEL_ID AND NOT EXISTS ( SELECT 1 FROM
RHNCHANNELFAMILYVIRTSUBLEVEL CFVSL, RHNSGTYPEVIRTSUBLEVEL SGTVSL,
RHNSERVERENTITLEMENTVIEW SEV, RHNVIRTUALINSTANCE VI WHERE
VI.VIRTUAL_SYSTEM_ID = :B1 AND VI.HOST_SYSTEM_ID = SEV.SERVER_ID AND
SEV.LABEL IN ('virtualization_host', 'virtualization_host_platform')
AND SEV.SERVER_GROUP_TYPE_ID = SGTVSL.SERVER_GROUP_TYPE_ID AND
SGTVSL.VIRT_SUB_LEVEL_ID = CFVSL.VIRT_SUB_LEVEL_ID AND
CFVSL.CHANNEL_FAMILY_ID = RCFM.CHANNEL_FAMILY_ID ) )

The first SQL statement is part of the same procedure
("rhn_channel.subscribe_server") and the second one is from a function
rhn_channel.update_family_counts which is again called by the
"rhn_channel.subscribe_server" procedure.

It also seems that when the deadlock occurred the customer had turned off
debugging on the RHN Satellite server, and as a result we were not able
to get much out of it.

I also suspect that the API script being used for the deletion of
servers could be a culprit here. Would it be possible for them to
provide us the scripts used for 'deleting the server
profiles' as well as 'creating the activation key'?

These scripts will help us reproduce the issue.

regards,

Internal Status set to 'Waiting on Support'

This event sent from IssueTracker by vgaikwad  [Support Engineering Group]
 issue 266660
Comment 21 Issue Tracker 2009-03-10 04:18:44 EDT
There were two "dumps" at the time of the traceback, see attached
udumps.tar.bz2. The "Current SQL" items are different from those you
have; the satellite-debug attachment is about a day and a half before the
traceback.

Do you want a current satellite-debug?

Also, please look at the VirtGuestAndActivation.pdf I uploaded yesterday.
I know it is a pain, but that traceback, IMO, is significantly different
from the problem this Issue is addressing. There seems to be overlap, but
more difference than commonality.

Script to delete server profiles -- most frequently that is done via the
web ui. deleteSystem.pl is in the perlscripts.tar.bz2

Activation keys: Remember, Activation Keys used during the debug=5 were
created in the web ui. The attached script/workbook did not create them.

Script to create keys -- (1) it expects groups to be in place from a prior
step in the process (2) it reads an Excel workbook, as that provides a
structure for laying out the data needed for lots of keys.


Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech

This event sent from IssueTracker by vgaikwad  [Support Engineering Group]
 issue 266660
Comment 22 Issue Tracker 2009-03-10 04:18:45 EDT
Ref: My 02-20-2009 10:40am EST post below, earlier report with TRACEBACK,
PDF I uploaded on 02/19/2009, etc

This Activation key has been used successfully by the virtual guest to
register the system.
 - No change was made to the key
 - Channel entitlements for the org were increased so that the limit was
not exceeded by the guest registration
 - The "extra" virtual guest systems were deleted from the dom-0 profile
(there were three jdc181 systems attached to that dom-0)

NOTE -- We still do not know what we are dealing with on this end and will
appreciate info from the bz when it becomes available (from your update:
https://bugzilla.redhat.com/show_bug.cgi?id=463155)


This event sent from IssueTracker by vgaikwad  [Support Engineering Group]
 issue 266660
Comment 25 Issue Tracker 2009-03-10 04:18:54 EDT
One other note on API created Activation Keys -- we now have over 100
systems registered using various keys that were created with the API
script.

Going by memory, I think we've seen three problem keys that were created
with the script.


This event sent from IssueTracker by vgaikwad  [Support Engineering Group]
 issue 266660
Comment 26 Issue Tracker 2009-03-10 04:18:56 EDT
Oops. My response (3) an hour ago should have been:

The deadlock: Activation Key used was 3-W-N-JDC-AX4RHI32.

That key was initially created with the web ui.
It was deleted and re-created with the web ui the morning after the
incident (2009-02-13 08:55:49).

Salient difference in the initial and re-created key: the clustering Child
Channels were included in the original key, they were not included in the
re-created key.

This was, at least in part, an "exceeding the Channel Entitlements an org
has been allotted" incident.




This event sent from IssueTracker by vgaikwad  [Support Engineering Group]
 issue 266660
Comment 29 Issue Tracker 2009-03-10 04:19:05 EDT
Steps to exceed channel entitlements:
 - create the org
 - give it 6 Virtualization Platform  entitlements in the Subscriptions ->
System Entitlements tab.
 - Register systems using an Activation Key that has the corresponding
channel selected in the Child Channels (Activation Key) tab

Those systems can be either standard (e.g., physical hardware) or Xen
guests; either type consumes channel entitlements.

It is likely the Xen entitlement consumption misbehavior will need a
separate Issue Tracker ticket. We are awaiting progress on this Issue
before considering offshoot Issues.

Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech

This event sent from IssueTracker by vgaikwad  [Support Engineering Group]
 issue 266660
Comment 30 Issue Tracker 2009-03-10 04:19:08 EDT
We have had three occurrences of "the problem" in the past two days that
do not fit with prior information recorded in this Issue. The appearance
on the client side is the same -- a timeout. As to most everything else,
it is different.

These are our only occurrences for over a week.

These incidents make me wonder if we have two distinct problems:
(1) hitting a channel entitlement limit causes the problem
(2) all others are a satellite performance problem.

Per recent Issue updates, "SEG" is working on item #1.
NOTE that this aspect of the problem is our lowest priority. While I think
there is an indication of a software problem on the satellite, the
workaround is obvious.

Item #2 is the still un-specified problem. And maybe it is more than one
problem. The following presents a few facts that might support it being
only a performance problem.

Symptoms from today and yesterday:
 - Systems registering to Activation keys that have worked for several
other systems.
   Per prior evidence/supposition, a key is either good or bad.
   The possibility immediately presented with these failures is that we
have three different keys that have gone bad.

 - I was subsequently able to register a different client, with the first
attempt, using two of these Activation keys.

 - With two of them, staff was later able to register the client using the
same procedure (and Activation key) that had failed earlier. We did not
edit the key in the time between the failure and the success.
   (For the third -- they registered with rhnreg_ks and can't be
encumbered with testing for our problem. Even so, I was able to register a
different client using that key.)

 - No TRACEBACKs, no anomalies in /rhnsat/admin/rhnsat/bdump, and so on.

When registering an alternate system, I noticed that the registration step
took a Very Long Time. The following is from the log records our registration
script writes -- one write before the rhnreg_ks, one after. The first two
were business as usual, not thinking about this problem. The third was with
a key that had "failed" an hour earlier (with a different client). The
Activation Limit on this key is 7; six systems had registered using the
key.
----------------------------------------------------------------------
messages.3:Feb 13 13:44:13 maltese register: sending registration request
messages.3:Feb 13 13:44:34 maltese register: system should now be
registered ID-1000015250
--
messages.3:Feb 13 13:48:47 maltese register: sending registration request
messages.3:Feb 13 13:49:10 maltese register: system should now be
registered ID-1000015251
--
messages:Mar  4 12:29:45 maltese register: sending registration request
messages:Mar  4 12:30:37 maltese register: system should now be registered
ID-1000015562
----------------------------------------------------------------------
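The elapsed times in those log pairs can be computed mechanically. A minimal sketch (my own, not part of the registration script; it assumes the grep filename prefix such as "messages.3:" has been stripped and that each "sending" line is followed by its matching "registered" line):

```python
# Sketch: measure rhnreg_ks round-trip time from the registration script's
# syslog markers ("sending registration request" / "system should now be
# registered"), as excerpted above.  Syslog lines carry no year, so one is
# supplied.
from datetime import datetime

def registration_durations(lines, year=2009):
    """Return elapsed seconds between each request/registered marker pair."""
    durations, start = [], None
    for line in lines:
        # syslog timestamp is the first 15 characters: "Feb 13 13:44:13"
        ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        if "sending registration request" in line:
            start = ts
        elif "system should now be registered" in line and start is not None:
            durations.append((ts - start).total_seconds())
            start = None
    return durations
```

On the excerpt above this yields roughly 21 s and 23 s for the Feb 13 registrations versus 52 s for the Mar 4 one, which is what prompts the near-timeout question below.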

Is there some merit to thinking that the Mar 4 request almost timed out?
And if so, would the timeout put the satellite into the self
re-registering loop we see on the satellite?

Additional evidence I recall: with one of the early "bad keys" -- that
is, one we deleted and recreated with the web ui -- I restored the database
containing the "bad key" to a backup satellite server. I was able to register
a system using this key; no problem encountered.

Regarding performance: I do not have an indication that our satellite is
having a performance problem:
Pretty-much standard from w
[root@aosat ~]# w
 13:49:37 up 37 days,  6:32,  3 users,  load average: 0.00, 0.03, 0.00

Pretty-much standard from free
[root@aosat ~]# free -t
             total       used       free     shared    buffers     cached
Mem:       8177312    8125860      51452          0     121992    4458424
-/+ buffers/cache:    3545444    4631868
Swap:      2048248      39608    2008640
Total:    10225560    8165468    2060092

We do not have tools installed for disk activity reporting. While the
corresponding values in /proc/diskstats for /rhnsat and /var/satellite are
dramatically higher than on a backup server that has no clients attached, I
don't know what would constitute big values, or significant changes over time.
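For what it's worth, rough per-device deltas can be pulled from two /proc/diskstats snapshots without extra tooling. A sketch only (field positions per the kernel's iostats documentation; the device name and numbers in the example are made up, not from this satellite):

```python
# Sketch: diff two /proc/diskstats snapshots.  Each line is:
# major minor devname reads-completed reads-merged sectors-read ms-reading
# writes-completed writes-merged sectors-written ms-writing ios-in-progress
# ms-doing-io weighted-ms
def diskstat_deltas(before, after, devices):
    """Delta of (reads completed, writes completed) per device between
    two /proc/diskstats snapshots given as strings."""
    def parse(snapshot):
        stats = {}
        for line in snapshot.splitlines():
            f = line.split()
            if len(f) >= 14 and f[2] in devices:
                # f[3] = reads completed, f[7] = writes completed
                stats[f[2]] = (int(f[3]), int(f[7]))
        return stats
    b, a = parse(before), parse(after)
    return {d: (a[d][0] - b[d][0], a[d][1] - b[d][1]) for d in a if d in b}
```

Taking a snapshot, waiting a fixed interval, and diffing gives a crude I/O rate that can be compared between the live and backup satellites.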

For the disk layout (and /proc info ??) there are previously uploaded
sosreports.




This event sent from IssueTracker by vgaikwad  [Support Engineering Group]
 issue 266660
Comment 31 Issue Tracker 2009-03-10 04:19:10 EDT
Good key, gone bad, now good.
See my prior entry about good keys gone bad.

This incident: we *finally* had "the problem" with a system we didn't
have to get registered right away. (Now that the system is registered,
that staff has work to do; so we are limited in further ability to test.)

This was the fifth usage of the key 1-T-D-AS-SN5AOI64. When starting
registration attempts on 03/05, one system was currently registered with
the key. The key has an activation limit of 2. Thus, there have been
multiple uses of the key, all except one system profile having been
deleted. We have had no previous reports of problems with this key.

The key was created using the API on Jan 14 (same time most other keys
were created). The key has not been modified since.

This system attempted registration using this key at the following times:
    2009/03/05 12:44:10
    2009/03/05 13:47:06
    2009/03/06 10:34:35
    2009/03/06 15:33:59
    2009/03/09 10:10:33

The four attempts last week failed, showing "the problem." No meaningful
mods were done to the system between registration attempts on 03/05 and
03/06. No mods were made between the last attempt on 03/06, and successful
registration this morning (03/09).



This event sent from IssueTracker by vgaikwad  [Support Engineering Group]
 issue 266660
Comment 34 Justin Sherrill 2009-03-10 16:21:19 EDT
Vishal, could we get a tcpdump between a client and the Satellite (with the client not using SSL) during a failure -- from both the client and the satellite?

Something like:

On the client:

tcpdump -s 0 -w  clientfile.bin  host SATELLITE_IP

On the satellite:

tcpdump -s 0 -w  satfile.bin  host CLIENT_IP

making sure that the client is set to use  'http://' and not 'https://'
Comment 35 Justin Sherrill 2009-03-10 16:22:36 EDT
Errr, the 1st tcpdump command was for the client,

the 2nd one is for the sat.
Comment 41 Xixi 2009-05-21 20:09:26 EDT
The issue has recurred for the customer; below is a quick summary and the workaround/fix applied, which resolved the issue for now - this can supersede all of the previous updates as a final bug summary.

Registration with rhnreg_ks and an activation key that involved assigning groups to systems took upwards of 3-4 minutes, eventually timing out (if networkRetries=0 in the client's up2date config) or failing with an error due to the activation key usage limit (if networkRetries>0).

Imported the customer's db dump and reproduced the long registration time in-house; it also turned out that deleting a system took 3+ minutes if group memberships were involved.  In fact, the following plsql function, insert_into_servergroup_maybe, took 3.5 minutes on its own during registration -
...
2009/05/19 08:12:03 -04:00 14322 rhnSQL/int_oracle.__call__('rhn_server.insert_into_servergroup_maybe', ('1000017760', '11'))
2009/05/19 08:15:29 -04:00 14322 rhnServer/server_history.add_history('Subscription via Token',)
...
2009/05/21 00:15:00 -04:00 27340
rhnSQL/int_oracle.__call__('rhn_server.insert_into_servergroup_maybe', ('1000017770', '11'))
2009/05/21 00:18:26 -04:00 27340
rhnServer/server_history.add_history('Subscription via Token',)
...

If we remove the groups from the activation key and system profile respectively, then both registration and deletion take a few seconds.

Per discussion with Justin for a similar case, we rebuilt all indexes on the 3 tables that the insert_into_servergroup_maybe function queried/updated. Now, regardless of groups, both registration and deletion take seconds as opposed to 3-4 minutes.  Gave the customer the following steps:

1. make sure working backup of satellite db exists;
2. sqlplus rhnsat/rhnsat@rhnsat
3. run the following 7 index rebuilds:
alter index rhn_servergroup_oid_name_uq rebuild tablespace DATA_TBS;
alter index rhn_server_dsid_uq rebuild tablespace DATA_TBS;
alter index rhn_server_oid_id_idx rebuild tablespace DATA_TBS;
alter index rhn_server_created_id_idx rebuild tablespace DATA_TBS;
alter index rhn_server_creator_idx rebuild tablespace DATA_TBS;
alter index rhn_sgmembers_sid_sgid_uq rebuild tablespace DATA_TBS;
alter index rhn_sgmembers_sgid_sid_idx rebuild tablespace DATA_TBS;

which also resolved customer's issue -
"Indexes built & response as you indicate; indexes rebuilt in approx 2 seconds.
Successfully registered client system via HTTPS using Activation Key as before. The registration took just under 3 seconds to complete.
This also appears to have cleared up the problem of deleting these profiles. They were taking 3+ minutes to delete; now taking less than 5 seconds. Other (good) system deletes were taking approx 30 seconds; now taking less than 5 seconds."

I also re-imported the customer's db and started from scratch, this time trying a "db-control gather-stats" instead, but it didn't make any difference.  Only rebuilding the indexes helped, and specifically just these 2 of the 7 above -
alter index rhn_sgmembers_sid_sgid_uq rebuild tablespace DATA_TBS;
alter index rhn_sgmembers_sgid_sid_idx rebuild tablespace DATA_TBS;

So now the question is:
If this is somewhat expected with a long-running database, where indexes can reach a certain number of levels or otherwise need rebuilding, can we build in a task or provide a tool to monitor cases such as this, so that either the satellite or the user can do a rebuild if/when necessary?
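As one possible shape for such a tool: Oracle exposes B-tree height as the BLEVEL column of DBA_INDEXES (populated when statistics are gathered), and a periodic task could flag indexes above a threshold and emit the rebuild DDL. This is a sketch only - the threshold value and driving the check from BLEVEL are my assumptions, not an existing Satellite feature:

```python
# Sketch: given (index_name, blevel) pairs - as would come from a query like
#   SELECT index_name, blevel FROM dba_indexes WHERE owner = 'RHNSAT'
# - emit rebuild DDL for indexes whose B-tree height exceeds a threshold.
# The threshold of 3 is an assumption, not a documented Satellite limit.
def rebuild_candidates(index_blevels, threshold=3):
    """index_blevels: iterable of (index_name, blevel); blevel may be None
    if statistics were never gathered.  Returns ALTER INDEX statements."""
    return [
        f"alter index {name} rebuild tablespace DATA_TBS;"
        for name, blevel in index_blevels
        if blevel is not None and blevel > threshold
    ]
```

A cron task (or a Satellite maintenance job) could run the query, pass the rows through this filter, and either alert the admin or apply the rebuilds during a maintenance window.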

Internal reproducer and notes in next update.

Notes -
...
       function insert_into_servergroup_maybe (
               server_id_in in number,
               server_group_id_in in number
       ) return number is
               retval number := 0;
               cursor servergroups is
                       select  s.id    server_id,
                                       sg.id   server_group_id
                       from    rhnServerGroup  sg,
                                       rhnServer               s
                       where   s.id = server_id_in
                               and sg.id = server_group_id_in
                               and s.org_id = sg.org_id
                               and not exists (
                                       select  1
                                       from    rhnServerGroupMembers sgm
                                       where   sgm.server_id = s.id
                                               and sgm.server_group_id = sg.id
                               );
       begin
               for sgm in servergroups loop
                       rhn_server.insert_into_servergroup(sgm.server_id, sgm.server_group_id);
                       retval := retval + 1;
               end loop;
               return retval;
       end insert_into_servergroup_maybe;
...
