Bug 652852

Summary: meta data sync to ISS slave failed
Product: Red Hat Satellite 5
Reporter: Luc de Louw <luc>
Component: Satellite Synchronization
Assignee: Michael Mráka <mmraka>
Status: CLOSED ERRATA
QA Contact: Jiri Kastner <jkastner>
Severity: urgent
Docs Contact:
Priority: high
Version: 540
CC: cperry, degts, fdewaley, jfenal, jhutar, jkastner, joshuadfranklin, jwest, marcus.moeller, mmraka, mzazrivec, raud, rvandolson, sandro, stanislav.polasek, stephan.duehr, taw, xdmoon
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: spacewalk-backend-1.2.13-19
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-12-13 14:31:42 UTC
Type: ---
Bug Depends On:    
Bug Blocks: 646488    
Attachments:
  Description: output while trying to sync kickstartable trees from ISS master (Flags: none)
  Description: Comment (Flags: none)

Description Luc de Louw 2010-11-12 23:30:53 UTC
Description of problem:
Believe it or not, the tcsh package can not be updated on systems connected to *SLAVE* satellite.

Version-Release number of selected component (if applicable):
5.4

How reproducible:
Twice: here and at another RHN Satellite user's site (case number to be announced).

Steps to Reproduce:
1. Register a Master Satellite
2. Register a Slave Satellite
3. "yum update tcsh" on systems connected to the master sat -> Works
4. ISS 
5. "yum update tcsh" on any system connected  to the SLAVE sat. -> fails
  
Actual results:

server:~# yum update tcsh
Loaded plugins: rhnplugin, security
Cleaning up Everything
Loaded plugins: rhnplugin, security
nagios-agents | 1.1 kB 00:00
nagios-agents/primary | 2.3 kB 00:00
nagios-agents 10/10
rhel-x86_64-server-5 | 1.3 kB 00:00
rhel-x86_64-server-5/primary | 3.5 MB 00:00
rhel-x86_64-server-5 10155/10155
rhel-x86_64-server-supplementary-5 | 1.3 kB 00:00
rhel-x86_64-server-supplementary-5/primary | 262 kB 00:00
rhel-x86_64-server-supplementary-5 868/868
server-mgmt-rhel-x86_64-server-5 | 1.1 kB 00:00
server-mgmt-rhel-x86_64-server-5/primary | 9.2 kB 00:00
server-mgmt-rhel-x86_64-server-5 24/24
Skipping security plugin, no data
Setting up Update Process
Resolving Dependencies
Skipping security plugin, no data
--> Running transaction check
---> Package tcsh.x86_64 0:6.14-17.el5_5.2 set to be updated
--> Processing Dependency: /bin/csh for package: mtools
--> Finished Dependency Resolution
mtools-3.9.10-2.fc6.x86_64 from installed has depsolving problems
--> Missing Dependency: /bin/csh is needed by package mtools-3.9.10-2.fc6.x86_64 (installed)
Error: Missing Dependency: /bin/csh is needed by package mtools-3.9.10-2.fc6.x86_64 (installed)
You could try using --skip-broken to work around the problem
You could try running: package-cleanup --problems
package-cleanup --dupes
rpm -Va --nofiles --nodigest
server:~#


Expected results:
The tcsh update is installed without dependency errors.


Additional info:
ISS seems to be broken. A friend (another RHN Satellite user) experienced exactly the same behaviour.

The odd thing is that ONLY systems connected to a SLAVE Satellite are affected, and at the moment ONLY the tcsh package.

/var/cache/rhn/* was cleaned and ISS was run again; yum clean did not help either.
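
One way to narrow this down on an affected client is to check whether the repository metadata it downloaded still lists /bin/csh as a file entry. The following is a minimal sketch and not part of the original report: the function name is hypothetical, and the location of the cached metadata (shown in the usage comment) is an assumption that may differ on your systems.

```shell
# Succeeds (exit 0) if the gzip-compressed repodata XML lists the given path
# as a <file> entry, with or without a type attribute.
metadata_lists_path() {
    xml_gz=$1
    path=$2
    # The leading ">" anchors the match so /bin/csh does not match /bin/tcsh.
    gzip -dc "$xml_gz" | grep -F -q ">$path</file>"
}

# Usage (cache path is an assumption):
#   metadata_lists_path /var/cache/yum/rhel-x86_64-server-5/primary.xml.gz /bin/csh \
#       || echo "/bin/csh is not listed in the metadata"
```

If the path is listed in the metadata served by the master but not in the metadata served by the slave, the problem lies in what the slave stored during the sync rather than on the client.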

Comment 1 Luc de Louw 2010-11-12 23:44:13 UTC
Please also see Case #378636

Comment 2 Marcus Moeller 2010-11-13 17:29:05 UTC
We have exactly the same problem on our 5.4 ISS slave

Greets
Marcus

Comment 3 Luc de Louw 2010-11-13 21:09:19 UTC
In the meantime, I detected more affected RPMs:

Python
https://rhn.redhat.com/rhn/errata/details/Details.do?eid=10403

e2fsprogs
https://rhn.redhat.com/rhn/errata/details/Details.do?eid=10400

glibc on some systems
https://rhn.redhat.com/rhn/errata/details/Details.do?eid=10405

I was not able to detect a pattern that determines which packages are affected.

Workarounds: 
- Temporarily sync your slave directly with RHN if you can.
- Download the packages from the (slave) satellite and install them with yum localinstall.

At the moment it seems that only RHEL5 base-channels are affected. RHEL6 untested.

Comment 4 Luc de Louw 2010-11-15 13:25:17 UTC
Once an ISS has been run between two sat540 installations, the database itself seems to be affected. Deleting /var/cache/rhn/* and syncing directly with RHN does not help, because the repodata is built from information in the database.

I wonder if restoring the sat530 db and a subsequent spacewalk-schema-upgrade would help to fix the problem.

Comment 5 Marcus Moeller 2010-11-16 08:19:14 UTC
We have found the cause of this problem: ISS seems to ignore symlinks and directories. Compare the file list of an affected package (e.g. tcsh) on the ISS master and on the slave and you will notice the difference.

Greets
Marcus
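
The comparison suggested above can be sketched as follows. This is an illustration only: it assumes the two file lists have already been saved as plain-text files with one path per line (for example copied from the web UI, or produced with `rpm -qlp` on the package file), and the function name is hypothetical.

```shell
# List paths that appear in the master's file list but not in the slave's.
# Arguments: two plain-text files, one path per line (hypothetical inputs).
missing_on_slave() {
    master_list=$1
    slave_list=$2
    workdir=$(mktemp -d) || return 1
    # comm requires sorted input; LC_ALL=C keeps sort and comm in agreement.
    LC_ALL=C sort -u "$master_list" > "$workdir/master"
    LC_ALL=C sort -u "$slave_list" > "$workdir/slave"
    # -23 suppresses lines unique to the slave and lines common to both.
    LC_ALL=C comm -23 "$workdir/master" "$workdir/slave"
    rm -rf "$workdir"
}

# Usage (file names are placeholders):
#   missing_on_slave tcsh-files-master.txt tcsh-files-slave.txt
```

For tcsh, an entry such as /bin/csh showing up in this output would match the "Missing Dependency: /bin/csh" error in the description.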

Comment 6 Luc de Louw 2010-11-16 17:35:13 UTC
A random pick of packages on the RHEL6 base channel shows more strange things:

On the Master https://sat.expample.com/network/software/packages/file_list.pxt shows that bzip2-1.0.5-7.el6_0.x86_64.rpm only consists of directories and symlinks, no files. abrt-1.1.13-4.el6.x86_64.rpm only consists of directories, so files, no symlinks. 

As Marcus found out, directories and symlinks are ignored. On the Slave satellite this leads to displays such as "This package contains the following files.
No files."

Comment 8 Michael Mráka 2010-11-19 12:11:33 UTC
Fixed in spacewalk nightly:
commit 73c920ce0329b5aa7bde6fbb270b122263ca369e
    652852 - dirs and links have no checksum

Package spacewalk-backend-1.3.6-1.
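
To see which entries of a package fall into the "no checksum" category, the output of `rpm -qp --dump` can be filtered. This is a rough sketch, not the fix from the commit above: it assumes that rpm prints an all-zero digest in the fourth column for directories and symlinks, and the function name is hypothetical.

```shell
# Read `rpm -qp --dump` output on stdin and print the paths whose digest
# column (4th field) consists only of zeros. Paths containing spaces are
# not handled by this simple field split.
entries_without_checksum() {
    awk '$4 ~ /^0+$/ { print $1 }'
}

# Usage (package file name derived from the yum output in the description):
#   rpm -qp --dump tcsh-6.14-17.el5_5.2.x86_64.rpm | entries_without_checksum
```

The paths printed this way are the ones to look for when checking whether a slave has lost part of a package's file list.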

Comment 9 Marcus Moeller 2010-11-19 12:36:38 UTC
Dear Michael,

what happens to already damaged repodata in the db? Is there a way to force recreation?

Greets
Marcus

Comment 10 Michael Mráka 2010-11-19 12:45:51 UTC
Hi Marcus,

unfortunately there is no better way than to remove the channel from the db (spacewalk-remove-channel --justdb should be enough) and let it sync again. 

Regards,
Michael

Comment 11 Marcus Moeller 2010-11-19 13:08:27 UTC
Does that have any influence on registered systems?

Greets
Marcus

Comment 12 Luc de Louw 2010-11-19 14:54:52 UTC
Hi Marcus,

Unfortunately it seems to be the case:

slave:~# spacewalk-remove-channel --justdb -c rhel-x86_64-server-6

Currently there are systems subscribed to one or more of the specified channels.
If you would like to automatically unsubscribe these systems, simply use the --unsubscribe flag.

The following systems were found to be subscribed:
org_id   id             name
--------------------------------
2        1000000000     rhel6-test
slave:~# 

Maybe restoring the sat530 db and a subsequent spacewalk-schema-upgrade would be an option? Of course only if you have not registered too many systems since the upgrade.

Greets,

Luc

Comment 13 Marcus Moeller 2010-11-19 15:07:27 UTC
Hmm,

that's not a working scenario in our case. I cannot force 500+ systems to re-register.

Greets
Marcus

Comment 14 Michael Mráka 2010-11-19 15:26:41 UTC
Marcus and Luc,

hold your horses please. This has only just been fixed in Spacewalk upstream.
The official Satellite erratum will be released soon, once it has gone through the appropriate QA cycle!

Regards,
Michael

Comment 15 Luc de Louw 2010-11-19 23:32:40 UTC
Wild Apache (hardly) can hold back his horses...

Nobody is going to install nightlies on a production (already crippled) satellite

Marcus and I were talking about how to recover from the crippled DB once the fix is available...

Once the fix is officially available on RHN, we will still have crippled slave satellites. Will there be a solution, and what will it look like?

Greets,

Luc

Comment 20 Michael Mráka 2010-11-24 13:54:27 UTC
Spacewalk git:
commit d4bee4ec00fc89e00dd5c74a684298ebf0e2f686
    added --skip-channels to spacewalk-remove-channel
    this is the way how to remove all packages from the channel(s)

spacewalk-remove-channel in Spacewalk nightly now has the ability to delete all packages from a channel without deleting the channel itself. This lets us refresh packages without being forced to remove and re-register all servers.

Comment 21 Michael Mráka 2010-11-24 14:18:18 UTC
How to recover from the bug:
* install new spacewalk-backend package (both on ISS master and slave)
* on ISS slave flush sync package cache: rm -rf /var/cache/rhn/satsync/packages/*
* on ISS slave remove packages sync'ed over ISS from database (without deleting channels themselves): 
    for i in <list of channels sync'ed over ISS> ; do
       spacewalk-remove-channel -c $i --justdb --skip-channels
    done
* resync packages back

Comment 27 Luc de Louw 2010-11-24 23:16:49 UTC
(In reply to comment #21)
> How to recover from the bug:
> * install new spacewalk-backend package (both on ISS master and slave)
> * on ISS slave flush sync package cache: rm -rf /var/cache/rhn/satsync/packages/*
> * on ISS slave remove packages sync'ed over ISS from database (without deleting channels themselves): 
>     for i in <list of channels sync'ed over ISS> ; do
>        spacewalk-remove-channel -c $i --justdb --skip-channels
>     done
> * resync packages back

Dear Michael,

This message greatly relieves my concerns about the database, thanks a lot. Is there a plan for when these fixes will be released? Sorry, I've got our security officers breathing down my neck.

Thanks,

Luc

Comment 28 Clifford Perry 2010-11-24 23:52:49 UTC
Hi - in short, as soon as we can. Beyond just testing this specific bug and path to recovery as outlined, we will also, where appropriate create automated tests for this and run general regression/automated tests as appropriate.

Regards,
Cliff.

Comment 30 Michael Mráka 2010-11-25 09:47:00 UTC
One more forgotten commit in spacewalk git:
commit 4bd2be58dc7da4a43804bd3cf7c8610e5afe284f
    don't require server unsubscribe when --skip-channels is used

Comment 34 Marcus Moeller 2010-11-26 16:17:14 UTC
It's impossible to delete data from base channels as the script always complains that there are existing child channels associated with the base channel:

Error: cannot remove channel rhel-x86_64-ws-4: subchannel(s) exist: 
			rhel-4ws-x86_64-epel-os
			rhel-4ws-x86_64-vmwaretools-4.0
			rhel-x86_64-ws-4-extras
			rhel-x86_64-ws-4-fastrack
			rhn-tools-rhel-4-ws-x86_64

If the --skip-channels parameter is given child channel associations should simply be ignored.

Comment 36 Clifford Perry 2010-11-27 14:59:15 UTC
NOTE - Channel Dumps in RHN - due to the dumps being created from a 5.4 Satellite with this bug, the channel dumps published on RHN Nov-14 which included RHEL 6 dumps also have this bug. As such, if you have a Satellite which imported data from those dumps you will need to remove the package content (as outlined in how to recover comments previously in this bug) and then reimport once new dumps are published (or sync'ing from RHN).

We are going to apply the fix to the Channel Dump creator 5.4 Satellite used and start generating fresh dumps. 

Cliff.

Comment 37 Clifford Perry 2010-11-27 15:03:25 UTC
Created attachment 915169 [details]
Comment

(This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).

Comment 38 Jiri Kastner 2010-11-28 23:12:09 UTC
Verified.
The content of /var/cache/rhn/satsync/* must be deleted on both master and slave; otherwise the generated repodata will still be inconsistent.

Comment 40 Clifford Perry 2010-11-29 17:27:16 UTC
Slight modification from comment #21

* install new spacewalk-backend package (both on ISS master and slave)
* on ISS master flush sync cache: rm -rf /var/cache/rhn/satsync/*
* on ISS slave flush sync package cache: rm -rf /var/cache/rhn/satsync/packages/*
* on ISS slave remove packages sync'ed over ISS from database (without deleting channels themselves): 
    for i in <list of channels sync'ed over ISS> ; do
       spacewalk-remove-channel -c $i --justdb --skip-channels
    done
* on ISS slave resync packages back 
    - this will generate a new cache for export on the master, which is then re-imported to the slave Satellite. 

Basically the old instructions were missing the step that deletes the cache on the master. 

Anyone who imported from Channel Dumps that were bad would run the same steps as outlined for Slave Satellites. 


Cliff

Comment 41 Sandro Mathys 2010-12-01 12:27:35 UTC
This issue does not seem to be fixed after we applied the provided steps after having received hotfix_spacewalk-backend-1.2.13-16.el5sat.tar from Red Hat support. I noticed satellite reports "no files" for certain packages again. So far I noticed this for glibc and python in some but not all channels.

Comment 42 Sandro Mathys 2010-12-01 13:58:41 UTC
I just noticed, that we now miss on the ISS slave:
- all kickstart profiles
- all kickstartable trees

I don't know if this is directly connected to this hotfix or the provided steps but it would be a huge coincidence if not.

Comment 43 Clifford Perry 2010-12-01 15:37:00 UTC
Slight modification from comment #21 & #40 - we need to remove the xml* directories from the master cache directory, not satsync. This was a mistake on my part in reading private comments that wasn't noticed till today. 

* install new spacewalk-backend package (both on ISS master and slave)
* on ISS master flush sync cache: 
    rm -rf /var/cache/rhn/xml*
    rm -rf /var/cache/rhn/satsync/*
* on ISS slave flush sync package cache: 
    rm -rf /var/cache/rhn/satsync/packages/*
* on ISS slave remove packages sync'ed over ISS from database (without deleting channels themselves): 
    for i in <list of channels sync'ed over ISS> ; do
       spacewalk-remove-channel -c $i --justdb --skip-channels
    done
* on ISS slave resync packages back 
    - this will generate a new cache for export on the master, which is then re-imported to the slave Satellite. 

Anyone who imported from Channel Dumps that were bad would run the same steps as outlined for Slave Satellites. 


Cliff
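
The slave-side part of the steps above can be sketched as a small script. This is a hedged illustration, not an official procedure: the `run` helper and the function name are hypothetical, commands are only printed unless DRY_RUN=0 is set, and the channel labels are supplied by the caller.

```shell
# Dry-run helper: prints each command unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Slave-side steps: flush the package sync cache, then remove the packages
# (not the channels) for every channel label given as an argument.
purge_iss_packages() {
    run sh -c 'rm -rf /var/cache/rhn/satsync/packages/*'
    for channel in "$@"; do
        # --justdb is spelled as in the command transcript in comment 12.
        run spacewalk-remove-channel -c "$channel" --justdb --skip-channels
    done
}

# Usage (channel labels are examples taken from this report):
#   purge_iss_packages rhel-x86_64-server-5 rhel-x86_64-server-6
#   DRY_RUN=0 purge_iss_packages rhel-x86_64-server-5
```

Printing the commands first makes it easy to review the channel list before anything is deleted. Note that comment 34 reports base channels with child channels being rejected by the script at that time.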

Comment 44 Clifford Perry 2010-12-01 15:44:47 UTC
A quick note to anyone on cc following this. Due to comments made in this bug earlier today, even though it is VERIFIED by QA, we are re-running some further validation/confirmations before releasing this Errata and bugfix. 

Cliff

Comment 46 Michael Mráka 2010-12-06 10:54:49 UTC
This bug is very closely related to bug 659348. An extension to updatePackages.py has been made to fix data affected by both bugs.

New solution:
* install new spacewalk-backend package (both on ISS master and slave)
* flush sync and exporter cache on both ISS master and slave:
    rm -rf /var/cache/rhn/xml*
    rm -rf /var/cache/rhn/satsync/*
* to fix previously synced data run:
    updatePackages.py --update-package-files
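
The new solution can be sketched as a dry-run script to be used on both master and slave. This is only an illustration under stated assumptions: the `run` helper and the function name are hypothetical, and the update script is passed in as an argument because this report does not confirm its installed name or location.

```shell
# Dry-run helper: prints each command unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Flush the exporter and sync caches, then repair already-synced file lists.
# $1 is the path to the update script named in the comment above.
repair_synced_packages() {
    update_tool=$1
    run sh -c 'rm -rf /var/cache/rhn/xml*'
    run sh -c 'rm -rf /var/cache/rhn/satsync/*'
    run "$update_tool" --update-package-files
}

# Usage (the path is a placeholder):
#   repair_synced_packages /path/to/update-script
#   DRY_RUN=0 repair_synced_packages /path/to/update-script
```

Unlike the earlier procedure, this one does not remove packages from the channels, so subscribed systems are left untouched.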

Comment 48 Michael Mráka 2010-12-06 13:14:50 UTC
An extension to updatePackages.py to fix already synced data (spacewalk git)
commit c73d420b42791122ca482c3f89ffc8e69b790a59
    update-packages: update package file list functionality

    --update-package-files fixes following problems:
    1. Bug #652852 - meta data sync to ISS slave failed. This script will be able
    to insert package files which did not make it into the database as a result
    of corrupted channel / ISS export.

    2. Bug #659348 - file list of rhel6 package shows (Directory) instead of checksum.
    This script will be able to set correct checksum to package files which lack it.

    All package information are being retrieved from packages on the filer.



Backported to Satellite git:
commit 7cbacad481307ccac8cbec17176002236c729822
    update-packages: update package file list functionality
...    
    (cherry picked from commit c73d420b42791122ca482c3f89ffc8e69b790a59)

Comment 49 Marcus Moeller 2010-12-08 07:41:56 UTC
From our observations, neither the process described in comment 43 nor the one in comment 46 syncs the kickstartable trees.

Also RHEL 6 channels still show packages without content.

Comment 50 Marcus Moeller 2010-12-08 07:49:35 UTC
Created attachment 467404 [details]
output while trying to sync kickstartable trees from ISS master

Comment 51 Marcus Moeller 2010-12-08 08:06:55 UTC
A small update concerning RHEL 6 packages with no files: where an erratum/update has been published for a package, the updated package shows content, but the initial package does not.

Comment 54 Clifford Perry 2010-12-08 15:05:57 UTC
Marcus Moeller - Please use Red Hat support to help diagnose this. We suspect you may have issues resulting from an earlier version of the hotfix. While we appreciate the feedback in this bug, it needs to be troubleshot properly via our support channels. 

Regards,
Cliff

Comment 56 errata-xmlrpc 2010-12-13 14:31:42 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0974.html