Escalated to Bugzilla from IssueTracker
Event posted 02-12-2009 08:27am EST by DanBennett

Test for bad activation keys was conducted this morning. My write-up is in the attached PDF. The actual test starts at the bottom of page 3.

Differences from our initial report last week:
- these keys were created with the web UI; last week's were created via the API
- these keys assign child channels; last week's did not

Since the SID is never installed on the client, no hardware, software, or other info is updated on the satellite.

QUESTION: From the client used this morning -- would you like the up2date.log? A sosreport?

This event sent from IssueTracker by vgaikwad [Support Engineering Group] issue 266660
Event posted 02-13-2009 07:59am EST by DanBennett

See attached snippet from the HTTP error_log. This SQLSchemaError was produced, I believe, when a system was trying to register using an Activation Key last night. I've grep'd all error_logs, including those from when the satellite was in debug=5 yesterday; this is the only SQLSchemaError entry.

Also curious, though I don't know the concern level: the timestamps on the first three entries of the attachment are 18:44:40, 19:09:32, 18:44:40.

What additional info do you need regarding this? Another satellite-debug, or do you have all the logs from that time?

Event posted 02-13-2009 07:59am EST by DanBennett
File uploaded: errlogsnippet.txt

Event posted 02-12-2009 10:05am EST by rmunilla

Thank you Dan,

We will look through the attached information and update the ticket again.

Regards,
Robert

This event sent from IssueTracker by vgaikwad [Support Engineering Group] issue 266660
Dan,

Thank you for providing the oracle dump. What we see so far is:

3 deadlock errors in the bdump trace:

Thu Jan 29 07:33:21 2009
ORA-00060: Deadlock detected. More info in file /rhnsat/admin/rhnsat/udump/rhnsat_ora_16149.trc.
Shutting down instance: further logons disabled

Thu Jan 29 11:40:06 2009
ORA-00060: Deadlock detected. More info in file /rhnsat/admin/rhnsat/udump/rhnsat_ora_17448.trc.

Thu Feb 12 18:44:40 2009
ORA-00060: Deadlock detected. More info in file /rhnsat/admin/rhnsat/udump/rhnsat_ora_28285.trc.

------------------
06:43:20: register maltese using 1-T-N-US-SN5RHI64
-- 06:43:20 (client time), 06:43:21 welcome_message in rhn_server_xmlrpc.log
-- SID = 1000015199
timeout & return to prompt at 06:46:53 (client time)
------------------

The last one, which happened at 18:44, matches the registration timeout activity and also the traceback error that was received. As shown in the above log, the deadlock-related information is available in the /rhnsat/admin/rhnsat/udump/rhnsat_ora_* files. Let's get a tarball of the following:

# tar cvjf oracle_full_dump.tar.bz2 /rhnsat/admin/rhnsat

We do not want to miss anything this time.

Thanks,
Robert

Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client

This event sent from IssueTracker by vgaikwad [Support Engineering Group] issue 266660
Interesting -- rhnsat_ora_28285.trc and the accusation toward the application (lines 22-25).

Early on I:
- restored the database to a backup satellite
- registered a client using a "bad" Activation Key (one that had failed the previous day on the live host)

I recall, perhaps incorrectly, entering this in the Issue Tracker. I see the Issue was opened 01/28, but only see history back to 02/16 in this Issue (02/12 in the source SizeLimit Issue). Maybe an entry is there but I don't know how to find it.

Regardless, given that most keys work, I don't have a theory about a suspect satellite software installation on our live host.

This event sent from IssueTracker by vgaikwad [Support Engineering Group] issue 266660
Severity increased to 4. If this problem _blossoms_, it could have severe impact on our process workflow.

A different symptom than we have seen before --- To date, an Activation Key has either been "good" (and works as expected) or "bad". A recent registration attempt (2009/02/17 11:55:33) would have been the fourth system to register with a particular key (3-W-N-JDC-SX5RHI64). This key has a Usage Limit of 25. There were 3 systems previously registered with the key.

Due to the distributed nature of our customer base, evidence is being destroyed as I enter this (or I would provide more detail). Generally, symptoms are similar to what we have previously seen with "bad" keys.

Severity set to: High

This event sent from IssueTracker by vgaikwad [Support Engineering Group] issue 266660
The registration from my prior update shows the same symptoms to the client, but has provided different information.

There is a traceback -- attached.

There is "fuzzy" channel accounting. See attached EntitlementCountingQuestionable.pdf. Specifically, in this case, on /rhn/admin/multiorg/SoftwareEntitlementSubscriptions.do the organization has exceeded the Proposed Total usage allocation. It seems this fuzzy accounting relates to the tables in the problem SQL in the /rhnsat/admin/rhnsat/udump/rhnsat_ora_* files.

NOTE: This item shows a Xen guest taking up channel entitlements. With this particular case we can increase the allocation. But as more guest systems are subscribed, we will encounter a hard limit.

This event sent from IssueTracker by vgaikwad [Support Engineering Group] issue 266660
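To illustrate the "fuzzy" accounting described above, a rough cross-check is to compare the maintained counter against a direct count of subscribed servers per org and channel family. This is a diagnostic sketch only, not the code path Satellite itself uses; the table and column names are taken from the deadlock SQL quoted later in this ticket.

SELECT pcf.org_id,
       pcf.channel_family_id,
       pcf.current_members,
       (SELECT COUNT(DISTINCT sc.server_id)
          FROM rhnServerChannel sc,
               rhnChannelFamilyMembers cfm,
               rhnServer s
         WHERE cfm.channel_family_id = pcf.channel_family_id
           AND sc.channel_id         = cfm.channel_id
           AND s.id                  = sc.server_id
           AND s.org_id              = pcf.org_id) AS counted_members
  FROM rhnPrivateChannelFamily pcf
 ORDER BY pcf.org_id, pcf.channel_family_id;

A naive count like this also includes Xen guests whose host carries a virtualization entitlement, which the UPDATE statement quoted in the next update deliberately excludes -- consistent with guests appearing to consume channel entitlements.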
Hello Rob,

Thank you for providing the full oracle trace dump. So, as per /rhnsat/admin/rhnsat/udump/rhnsat_ora_17448.trc, I see the following two queries which triggered the deadlock on the RHN Satellite server:

*** 2009-01-29 11:40:05.949
*** SERVICE NAME:(SYS$USERS) 2009-01-29 11:40:05.852
*** SESSION ID:(312.1) 2009-01-29 11:40:05.852
DEADLOCK DETECTED ( ORA-00060 )

Current SQL Statement:
DELETE FROM rhnServerNeededErrataCache WHERE server_id = :1 AND org_id = :2 AND errata_id = :3

End of information on OTHER waiting sessions.
Current SQL statement for this session:
DELETE FROM rhnServerNeededErrataCache WHERE server_id = :1 AND org_id = :2 AND errata_id = :3

This is a known issue: https://bugzilla.redhat.com/show_bug.cgi?id=463155 within the code base of RHN Satellite 5.1.1, and there's a possibility that customers running RHN Satellite server 5.2 might hit it. This bug will be addressed/fixed with the release of RHN Satellite server 5.3.

The following deadlocks look new to me and we might have to dig deeper to find the culprit code.

*** 2009-02-12 18:44:40.516
*** ACTION NAME:() 2009-02-12 18:44:40.468
*** MODULE NAME:(mod_python) 2009-02-12 18:44:40.468
*** SERVICE NAME:(SYS$USERS) 2009-02-12 18:44:40.468
*** SESSION ID:(303.2) 2009-02-12 18:44:40.468

Current SQL Statement:
SELECT SG.GROUP_TYPE, SG.ORG_ID, SG.CURRENT_MEMBERS, SG.MAX_MEMBERS
  FROM RHNSERVERGROUP SG
 WHERE SG.ID = :B1
   FOR UPDATE OF SG.CURRENT_MEMBERS

End of information on OTHER waiting sessions.
Current SQL statement for this session:
SELECT CURRENT_MEMBERS
  FROM RHNPRIVATECHANNELFAMILY
 WHERE ORG_ID = :B2 AND CHANNEL_FAMILY_ID = :B1
   FOR UPDATE OF CURRENT_MEMBERS

*** 2009-01-29 07:33:21.387
*** ACTION NAME:() 2009-01-29 07:33:21.387
*** MODULE NAME:(/usr/sbin/httpd) 2009-01-29 07:33:21.387
*** SERVICE NAME:(SYS$USERS) 2009-01-29 07:33:21.387
*** SESSION ID:(257.55) 2009-01-29 07:33:21.387

Current SQL Statement:
SELECT CURRENT_MEMBERS
  FROM RHNPRIVATECHANNELFAMILY
 WHERE ORG_ID = :B2 AND CHANNEL_FAMILY_ID = :B1
   FOR UPDATE OF CURRENT_MEMBERS

End of information on OTHER waiting sessions.
Current SQL statement for this session:
UPDATE RHNPRIVATECHANNELFAMILY
   SET CURRENT_MEMBERS = CURRENT_MEMBERS - 1
 WHERE ORG_ID IN ( SELECT ORG_ID FROM RHNSERVER WHERE ID = :B1 )
   AND CHANNEL_FAMILY_ID IN (
       SELECT RCFM.CHANNEL_FAMILY_ID
         FROM RHNCHANNELFAMILYMEMBERS RCFM, RHNSERVERCHANNEL RSC
        WHERE RSC.SERVER_ID = :B1
          AND RSC.CHANNEL_ID = RCFM.CHANNEL_ID
          AND NOT EXISTS (
              SELECT 1
                FROM RHNCHANNELFAMILYVIRTSUBLEVEL CFVSL, RHNSGTYPEVIRTSUBLEVEL SGTVSL,
                     RHNSERVERENTITLEMENTVIEW SEV, RHNVIRTUALINSTANCE VI
               WHERE VI.VIRTUAL_SYSTEM_ID = :B1
                 AND VI.HOST_SYSTEM_ID = SEV.SERVER_ID
                 AND SEV.LABEL IN ('virtualization_host', 'virtualization_host_platform')
                 AND SEV.SERVER_GROUP_TYPE_ID = SGTVSL.SERVER_GROUP_TYPE_ID
                 AND SGTVSL.VIRT_SUB_LEVEL_ID = CFVSL.VIRT_SUB_LEVEL_ID
                 AND CFVSL.CHANNEL_FAMILY_ID = RCFM.CHANNEL_FAMILY_ID ) )

I'll investigate this further and will update you with my findings.

regards,

This event sent from IssueTracker by vgaikwad [Support Engineering Group] issue 266660
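The 2009-02-12 trace is consistent with two concurrent token registrations taking the same two row locks in opposite order. The interleaving below is a reconstruction, not an observed sequence (the trace only shows which statement each session was waiting on); :grp, :org and :cf stand for the binds in the statements quoted above.

-- Session A (mod_python) -- registration path, step 1:
SELECT sg.group_type, sg.org_id, sg.current_members, sg.max_members
  FROM rhnServerGroup sg
 WHERE sg.id = :grp
   FOR UPDATE OF sg.current_members;        -- locks the server group row

-- Session B (httpd) -- another request in the same org, step 1:
SELECT current_members
  FROM rhnPrivateChannelFamily
 WHERE org_id = :org AND channel_family_id = :cf
   FOR UPDATE OF current_members;           -- locks the channel family row

-- Session A, step 2:
SELECT current_members
  FROM rhnPrivateChannelFamily
 WHERE org_id = :org AND channel_family_id = :cf
   FOR UPDATE OF current_members;           -- blocks behind Session B

-- Session B, step 2:
SELECT sg.group_type, sg.org_id, sg.current_members, sg.max_members
  FROM rhnServerGroup sg
 WHERE sg.id = :grp
   FOR UPDATE OF sg.current_members;        -- blocks behind Session A -> ORA-00060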
Hello Rob,

So, at this point of time I'm actually gathering information from the logs about the deadlock. Deadlocks are very order-sensitive, in the sense that what operations you do, and in what order, really matters.

Looking at the traceback: _subscribe_sql() calls a SQL procedure, "rhn_channel.subscribe_server".

Now if you see the deadlock SQL statements:

Current SQL Statement:
SELECT CURRENT_MEMBERS
  FROM RHNPRIVATECHANNELFAMILY
 WHERE ORG_ID = :B2 AND CHANNEL_FAMILY_ID = :B1
   FOR UPDATE OF CURRENT_MEMBERS

End of information on OTHER waiting sessions.
Current SQL statement for this session:
UPDATE RHNPRIVATECHANNELFAMILY
   SET CURRENT_MEMBERS = CURRENT_MEMBERS - 1
 WHERE ORG_ID IN ( SELECT ORG_ID FROM RHNSERVER WHERE ID = :B1 )
   AND CHANNEL_FAMILY_ID IN (
       SELECT RCFM.CHANNEL_FAMILY_ID
         FROM RHNCHANNELFAMILYMEMBERS RCFM, RHNSERVERCHANNEL RSC
        WHERE RSC.SERVER_ID = :B1
          AND RSC.CHANNEL_ID = RCFM.CHANNEL_ID
          AND NOT EXISTS (
              SELECT 1
                FROM RHNCHANNELFAMILYVIRTSUBLEVEL CFVSL, RHNSGTYPEVIRTSUBLEVEL SGTVSL,
                     RHNSERVERENTITLEMENTVIEW SEV, RHNVIRTUALINSTANCE VI
               WHERE VI.VIRTUAL_SYSTEM_ID = :B1
                 AND VI.HOST_SYSTEM_ID = SEV.SERVER_ID
                 AND SEV.LABEL IN ('virtualization_host', 'virtualization_host_platform')
                 AND SEV.SERVER_GROUP_TYPE_ID = SGTVSL.SERVER_GROUP_TYPE_ID
                 AND SGTVSL.VIRT_SUB_LEVEL_ID = CFVSL.VIRT_SUB_LEVEL_ID
                 AND CFVSL.CHANNEL_FAMILY_ID = RCFM.CHANNEL_FAMILY_ID ) )

The first SQL statement is part of the same procedure ("rhn_channel.subscribe_server"), and the second one is from a function, rhn_channel.update_family_counts, which is again called by the "rhn_channel.subscribe_server" procedure.

It also seems that when the deadlock occurred the customer had turned off debugging on the RHN Satellite server, and as a result we were not able to get much out of it.

I also suspect that the API script being used for the deletion of servers could be a culprit here. Would it be possible for them to provide us the scripts that are being used for 'deleting the server profiles' as well as 'creating the activation keys'? These scripts will help us reproduce the issue.

regards,

Internal Status set to 'Waiting on Support'

This event sent from IssueTracker by vgaikwad [Support Engineering Group] issue 266660
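For reference while attempting a reproducer: a hypothetical two-session sequence that would produce the 2009-01-29 deadlock quoted above, with a registration locking channel family rows one at a time while a profile deletion decrements several of them in a single UPDATE. The org and channel family IDs are made up, and the order in which the multi-row UPDATE acquires its locks is an assumption (it depends on the access path).

-- Session 1 (registration path):
SELECT current_members FROM rhnPrivateChannelFamily
 WHERE org_id = 2 AND channel_family_id = 1001
   FOR UPDATE OF current_members;           -- locks row (2,1001)

-- Session 2 (deletion path):
UPDATE rhnPrivateChannelFamily
   SET current_members = current_members - 1
 WHERE org_id = 2 AND channel_family_id IN (1002, 1001);
                                            -- locks (2,1002), then waits on (2,1001)

-- Session 1 (continuing the same registration):
SELECT current_members FROM rhnPrivateChannelFamily
 WHERE org_id = 2 AND channel_family_id = 1002
   FOR UPDATE OF current_members;           -- waits on (2,1002) -> ORA-00060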
There were two "dumps" at the time of the traceback; see attached udumps.tar.bz2. The "Current SQL" items are different from those you have; the satellite-debug attachment is from about a day and a half before the traceback. Do you want a current satellite-debug?

Also, please look at the VirtGuestAndActivation.pdf I uploaded yesterday. I know it is a pain, but that traceback, IMO, is significantly different from the problem this Issue is addressing. There seems to be overlap, but more difference than commonality.

Script to delete server profiles -- most frequently that is done via the web UI. deleteSystem.pl is in the perlscripts.tar.bz2.

Activation keys: remember, the Activation Keys used during the debug=5 were created in the web UI. The attached script/workbook did not create them.

Script to create keys -- (1) it expects groups to be in place from a prior step in the process; (2) it reads an Excel workbook, as that provides a structure for laying out the data needed for lots of keys.

Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech

This event sent from IssueTracker by vgaikwad [Support Engineering Group] issue 266660
Ref: my 02-20-2009 10:40am EST post below, the earlier report with TRACEBACK, the PDF I uploaded on 02/19/2009, etc.

This Activation Key has been used successfully by the virtual guest to register the system.
- No change was made to the key
- Channel entitlements for the org were increased so that the limit was not exceeded by the guest registration
- The "extra" virtual guest systems were deleted from the dom-0 profile (there were three jdc181 systems attached to that dom-0)

NOTE -- We still do not know what we are dealing with on this end and will appreciate info from the bz when it becomes available (from your update: https://bugzilla.redhat.com/show_bug.cgi?id=463155).

This event sent from IssueTracker by vgaikwad [Support Engineering Group] issue 266660
One other note on API-created Activation Keys -- we now have over 100 systems registered using various keys that were created with the API script. Going by memory, I think we've seen three problem keys that were created with the script.

This event sent from IssueTracker by vgaikwad [Support Engineering Group] issue 266660
Oops. My response (3) an hour ago should have been:

The deadlock: the Activation Key used was 3-W-N-JDC-AX4RHI32. That key was initially created with the web UI. It was deleted and re-created with the web UI the morning after the incident (2009-02-13 08:55:49). Salient difference between the initial and re-created key: the clustering Child Channels were included in the original key; they were not included in the re-created key.

This was, at least in part, an "exceeding the Channel Entitlements an org has been allotted" incident.

This event sent from IssueTracker by vgaikwad [Support Engineering Group] issue 266660
Steps to exceed channel entitlements:
- create the org
- give it 6 Virtualization Platform entitlements in the Subscriptions -> System Entitlements tab
- register systems using an Activation Key that has the corresponding channel selected in the Child Channels (Activation Key) tab

Those systems can be either standard (e.g., physical hardware) or Xen guests; either type consumes channel entitlements.

It is likely the Xen entitlement consumption misbehavior will need a separate Issue Tracker ticket. We are awaiting progress on this Issue before considering offshoot Issues.

Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech

This event sent from IssueTracker by vgaikwad [Support Engineering Group] issue 266660
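For what it's worth, a quick way to spot the resulting over-allocation on the satellite database. This is a sketch: the current_members/max_members columns on rhnServerGroup appear in the deadlock trace earlier in this ticket, while a max_members column on rhnPrivateChannelFamily is an assumption that should be checked against the schema before use.

-- System entitlement groups (e.g. Virtualization Platform) over their limit:
SELECT sg.org_id, sg.group_type, sg.current_members, sg.max_members
  FROM rhnServerGroup sg
 WHERE sg.max_members IS NOT NULL
   AND sg.current_members > sg.max_members;

-- Channel families over their per-org allocation (max_members assumed to exist):
SELECT pcf.org_id, pcf.channel_family_id, pcf.current_members, pcf.max_members
  FROM rhnPrivateChannelFamily pcf
 WHERE pcf.max_members IS NOT NULL
   AND pcf.current_members > pcf.max_members;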
We have had three occurrences of "the problem" in the past two days that do not fit with prior information recorded in this Issue. The appearance on the client side is the same -- a timeout. As to most everything else, it is different. These are our only occurrences in over a week.

These incidents make me wonder if we have two distinct problems:
(1) hitting a channel entitlement limit causes the problem
(2) all others are a satellite performance problem

Per recent Issue updates, "SEG" is working on item #1. NOTE that this aspect of the problem is our lowest priority. While I think there is an indication of a software problem on the satellite, the workaround is obvious.

Item #2 is the still-unspecified problem. And maybe it is more than one problem. The following presents a few facts that might support it being only a performance problem.

Symptoms from today and yesterday:
- Systems registering with Activation Keys that have worked for several other systems. Per prior evidence/supposition, a key is either good or bad. The possibility immediately presented by these failures is that we have three different keys that have gone bad.
- I was subsequently able to register a different client, on the first attempt, using two of these Activation Keys.
- With two of them, staff was later able to register the client using the same procedure (and Activation Key) that had failed earlier. We did not edit the key in the time between the failure and the success. (For the third -- they registered with rhnreg_ks and can't be encumbered with testing for our problem. Even so, I was able to register a different client using that key.)
- No TRACEBACKs, no anomalies in /rhnsat/admin/rhnsat/bdump, and so on.

When registering an alternate system, I noticed that the registration step took a Very Long Time. The following is from the log records our registration script writes -- one write before the rhnreg_ks, one after. The first two were business-as-usual, not thinking about this problem. The third was with a key that had "failed" an hour earlier (with a different client). The Activation Limit on this key is 7; six systems had registered using the key.

----------------------------------------------------------------------
messages.3:Feb 13 13:44:13 maltese register: sending registration request
messages.3:Feb 13 13:44:34 maltese register: system should now be registered ID-1000015250
--
messages.3:Feb 13 13:48:47 maltese register: sending registration request
messages.3:Feb 13 13:49:10 maltese register: system should now be registered ID-1000015251
--
messages:Mar 4 12:29:45 maltese register: sending registration request
messages:Mar 4 12:30:37 maltese register: system should now be registered ID-1000015562
----------------------------------------------------------------------

Is there some merit to thinking that the Mar 4 request almost timed out? And if so, would the timeout put the satellite into the self re-registering loop we see on the satellite?

Additional evidence I recall: with one of the early "bad keys" -- that is, one we deleted and recreated with the web UI -- I restored the database with the "bad key" to a backup satellite server. I was able to register a system using this key. No problem was encountered.
Regarding performance: I do not have an indication that our satellite is having a performance problem.

Pretty much standard output from w:

[root@aosat ~]# w
 13:49:37 up 37 days, 6:32, 3 users, load average: 0.00, 0.03, 0.00

Pretty much standard output from free:

[root@aosat ~]# free -t
             total       used       free     shared    buffers     cached
Mem:       8177312    8125860      51452          0     121992    4458424
-/+ buffers/cache:    3545444    4631868
Swap:      2048248      39608    2008640
Total:    10225560    8165468    2060092

We do not have tools installed for disk activity reporting. While the corresponding values in /proc/diskstats for /rhnsat and /var/satellite are dramatically higher than on a backup server that has no clients attached, I don't know what values would count as big, or what changes over time would be meaningful.

For the disk layout (and /proc info ??) there are previously uploaded sosreports.

This event sent from IssueTracker by vgaikwad [Support Engineering Group] issue 266660
Good key, gone bad, now good.

See my prior entry about good keys gone bad. This incident: we *finally* had "the problem" with a system we didn't have to get registered right away. (Now that the system is registered, that staff has work to do, so we are limited in further ability to test.)

This was the fifth usage of the key 1-T-D-AS-SN5AOI64. When registration attempts started on 03/05, one system was currently registered with the key. The key has an activation limit of 2. Thus, there have been multiple uses of the key, all except one system profile having been deleted. We have had no previous reports of problems with this key. The key was created using the API on Jan 14 (the same time most other keys were created). The key has not been modified since.

This system attempted registration using this key at the following times:
2009/03/05 12:44:10
2009/03/05 13:47:06
2009/03/06 10:34:35
2009/03/06 15:33:59
2009/03/09 10:10:33

The four attempts last week failed, showing "the problem." No meaningful mods were done to the system between the registration attempts on 03/05 and 03/06. No mods were made between the last attempt on 03/06 and the successful registration this morning (03/09).

This event sent from IssueTracker by vgaikwad [Support Engineering Group] issue 266660
Vishal, could we get a tcpdump between a client and the Satellite (with the client not using SSL) during a failure? From both the client and the satellite. Something like:

From the client:
tcpdump -s 0 -w clientfile.bin host SATELLITE_IP
tcpdump -s 0 -w satfile.bin host CLIENT_IP

making sure that the client is set to use 'http://' and not 'https://'.
Errr, the 1st tcpdump command was for the client, the 2nd one is for the sat.
The issue has recurred for the customer, and below is a quick summary and the workaround/fix applied which resolved the issue for now -- this can supersede all of the previous updates as a final bug summary.

Registration with rhnreg_ks and an activation key which involved assigning groups to systems took upwards of 3-4 minutes, eventually timing out (if networkRetries=0 in the client up2date config) or failing with an error due to the activation key usage limit (if networkRetries>0).

Imported the customer's db dump and reproduced the long registration time in-house; we also found that deletion of systems took 3+ minutes each if there are group memberships involved. In fact, the following plsql function, insert_into_servergroup_maybe, took 3.5 minutes on its own during registration -

...
2009/05/19 08:12:03 -04:00 14322 rhnSQL/int_oracle.__call__('rhn_server.insert_into_servergroup_maybe', ('1000017760', '11'))
2009/05/19 08:15:29 -04:00 14322 rhnServer/server_history.add_history('Subscription via Token',)
...
2009/05/21 00:15:00 -04:00 27340 rhnSQL/int_oracle.__call__('rhn_server.insert_into_servergroup_maybe', ('1000017770', '11'))
2009/05/21 00:18:26 -04:00 27340 rhnServer/server_history.add_history('Subscription via Token',)
...

If we remove the groups from the activation key and system profile respectively, then both registration and deletion take a few seconds.

Per discussion with Justin for a similar case, we rebuilt all indexes on the 3 tables which the insert_into_servergroup_maybe function queried/updated. Now, regardless of groups, both registration and deletion take seconds as opposed to 3-4 minutes.

Gave the customer the following steps:
1. make sure a working backup of the satellite db exists;
2. sqlplus rhnsat/rhnsat@rhnsat
3. run the following 7 index rebuilds:

alter index rhn_servergroup_oid_name_uq rebuild tablespace DATA_TBS;
alter index rhn_server_dsid_uq rebuild tablespace DATA_TBS;
alter index rhn_server_oid_id_idx rebuild tablespace DATA_TBS;
alter index rhn_server_created_id_idx rebuild tablespace DATA_TBS;
alter index rhn_server_creator_idx rebuild tablespace DATA_TBS;
alter index rhn_sgmembers_sid_sgid_uq rebuild tablespace DATA_TBS;
alter index rhn_sgmembers_sgid_sid_idx rebuild tablespace DATA_TBS;

which also resolved the customer's issue -

"Indexes built & response as you indicate; indexes rebuilt in approx 2 seconds. Successfully registered client system via HTTPS using Activation Key as before. The registration took just under 3 seconds to complete. This also appears to have cleared up the problem of deleting these profiles. They were taking 3+ minutes to delete; now taking less than 5 seconds. Other (good) system deletes were taking approx 30 seconds; now taking less than 5 seconds."

I also re-imported the customer's db and started from scratch, this time trying a "db-control gather-stats" instead, but it didn't make any difference. Only rebuilding the indexes helps, and specifically just these 2 out of the 7 above -

alter index rhn_sgmembers_sid_sgid_uq rebuild tablespace DATA_TBS;
alter index rhn_sgmembers_sgid_sid_idx rebuild tablespace DATA_TBS;

So now the question is: if this is somewhat expected with a long-running database, when indexes can reach a certain number of levels or otherwise need rebuilding, then can we build in a task or provide a tool to monitor cases such as this, so either the satellite or the user can do a rebuild if/when necessary?

Internal reproducer and notes in next update.
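On the question of building in a monitoring task or tool: a rough sketch of what such a periodic check could look like for the two indexes that mattered here. The thresholds and the rebuild policy are assumptions, not an agreed-upon tool; BLEVEL comes from the regular optimizer statistics, while INDEX_STATS is only populated by VALIDATE STRUCTURE, which blocks DML on the underlying table, so it would have to run in a quiet window.

SELECT index_name, blevel, leaf_blocks, last_analyzed
  FROM user_indexes
 WHERE index_name IN ('RHN_SGMEMBERS_SID_SGID_UQ', 'RHN_SGMEMBERS_SGID_SID_IDX');

ANALYZE INDEX rhn_sgmembers_sid_sgid_uq VALIDATE STRUCTURE;
SELECT name, height, lf_rows, del_lf_rows,
       ROUND(100 * del_lf_rows / NULLIF(lf_rows, 0), 1) AS pct_deleted
  FROM index_stats;   -- holds one row: the index just validated

-- A crude "rebuild if/when necessary" trigger: a high pct_deleted or an
-- unexpectedly tall tree would prompt the same rebuild used above, e.g.
-- alter index rhn_sgmembers_sid_sgid_uq rebuild tablespace DATA_TBS;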
Notes -

...
function insert_into_servergroup_maybe (
    server_id_in in number,
    server_group_id_in in number
) return number is
    retval number := 0;
    cursor servergroups is
        select s.id server_id, sg.id server_group_id
          from rhnServerGroup sg, rhnServer s
         where s.id = server_id_in
           and sg.id = server_group_id_in
           and s.org_id = sg.org_id
           and not exists (
                   select 1
                     from rhnServerGroupMembers sgm
                    where sgm.server_id = s.id
                      and sgm.server_group_id = sg.id
               );
begin
    for sgm in servergroups loop
        rhn_server.insert_into_servergroup(sgm.server_id, sgm.server_group_id);
        retval := retval + 1;
    end loop;
    return retval;
end insert_into_servergroup_maybe;
...
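As a verification sketch after the rebuild (not part of the Satellite code), the cursor's access path can be checked with EXPLAIN PLAN, substituting the server and group IDs from the rhn_server_xmlrpc.log excerpt above for the PL/SQL parameters; the point is to confirm the NOT EXISTS probe against rhnServerGroupMembers uses one of the rebuilt rhn_sgmembers_* indexes rather than a full scan.

EXPLAIN PLAN FOR
SELECT s.id server_id, sg.id server_group_id
  FROM rhnServerGroup sg, rhnServer s
 WHERE s.id = 1000017760
   AND sg.id = 11
   AND s.org_id = sg.org_id
   AND NOT EXISTS (SELECT 1
                     FROM rhnServerGroupMembers sgm
                    WHERE sgm.server_id = s.id
                      AND sgm.server_group_id = sg.id);

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);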