Bug 63427 - RH 7.2 final uncaught anaconda exception
Status: CLOSED WORKSFORME
Product: Red Hat Linux
Classification: Retired
Component: anaconda
Version: 7.2
Hardware: i386 Linux
Priority: medium
Severity: medium
Assigned To: Jeremy Katz
QA Contact: Brock Organ
Reported: 2002-04-13 13:37 EDT by R P Herrold
Modified: 2007-04-18 12:41 EDT
Last Closed: 2002-04-30 16:26:06 EDT
Attachments
anaconda traceback (68.00 KB, text/plain), 2002-04-13 13:38 EDT, R P Herrold
I have same problem on ide plextor 401240 cd (69.56 KB, text/plain), 2002-04-14 13:28 EDT, Patrick
upgrade traceback off host centurion (73.22 KB, text/plain), 2002-04-15 03:04 EDT, R P Herrold
Anaconda install failure dump. (61.87 KB, text/plain), 2002-04-23 12:49 EDT, Herb Krakau

Description R P Herrold 2002-04-13 13:37:58 EDT
Traceback in a moment -- all SCSI hardware, except the IDE install CD drive.
Comment 1 R P Herrold 2002-04-13 13:38:47 EDT
Created attachment 53744 [details]
anaconda traceback
Comment 2 Jeremy Katz 2002-04-14 02:02:26 EDT
Looks like bad cds/problems reading the CDs.  Did you burn them yourself?  Do
the md5sums match?  Have any problems reading CD-Rs on this drive in the past, etc?
Comment 3 R P Herrold 2002-04-14 10:55:08 EDT
Yes - self-burned
Yes - md5sum checked
Side info -- I was using this CD set to update several servers, using the SAME
physical IDE CD drive -- moving it from host to host, at an ISP where I admin
some hosts.
The CDs worked for hosts before, and hosts after -- I burned them from my
reference copies (which also are md5sum perfect)
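
For anyone repeating the md5sum check asked about in comment 2, here is a minimal sketch in Python. It assumes the ISO image is available as a regular file and that the expected digest comes from the published MD5SUM list. The names `md5_of_file` and `matches_expected` and the chunk size are illustrative only; none of this is part of anaconda.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a full CD image never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare the computed digest with the published one, ignoring case."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A matching digest only shows that the image file is intact; it says nothing about whether a particular drive can read the burned disc reliably, which is the question this bug keeps coming back to.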
--------------

The unusual thing was that this was on an HP E60, and their SCSI approach:


[herrold@swampfox herrold]$ sudo lspci -v -v
Password:
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
        Latency: 32

00:04.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] (rev 01)
        Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0

00:04.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] (prog-if 80 [Master])
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32
        Region 4: I/O ports at 0500 [size=16]

00:06.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 01)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 66 (2000ns min, 14000ns max)
        Interrupt: pin A routed to IRQ 11
        Region 0: Memory at fecfe000 (32-bit, prefetchable) [size=4K]
        Region 1: I/O ports at f8e0 [size=32]
        Region 2: Memory at fed00000 (32-bit, non-prefetchable) [size=1M]
        Expansion ROM at <unassigned> [disabled] [size=1M]

00:07.0 SCSI storage controller: Adaptec AHA-7850 (rev 03)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (1000ns min, 1000ns max), cache line size 08
        Interrupt: pin A routed to IRQ 10
        Region 0: I/O ports at fc00 [size=256]
        Region 1: Memory at fecff000 (32-bit, non-prefetchable) [size=4K]

00:0d.0 VGA compatible controller: Cirrus Logic GD 5446 (prog-if 00 [VGA])
        Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Region 0: Memory at fd000000 (32-bit, prefetchable) [size=16M]
        Expansion ROM at <unassigned> [disabled] [size=32K]

[herrold@swampfox herrold]$

I have great continuing trouble with error retry logic when the SCSI is present
-- this is true with my test unit, "oldpokey", and another test unit, "wildman",
an LD 6/200 (similar SCSI controller; dual controllers, actually).

See also: 56330, 30822, 30414, and especially 30250.

The anaconda error return catch and retry logic on disk I/O seems persistently
fragile here.

The series I was upgrading yesterday are production hosts, but the Skipjack 2
beta has been tossing these on 'oldpokey' this last week -- reliably reproducing
the fault is not happening -- the error point is 'moving around' and I cannot
get a traceback (the option is not offered).  I am considering doing a chrooted
install so I can catch state.
Comment 4 Patrick 2002-04-14 13:28:41 EDT
Created attachment 53789 [details]
I have same problem on ide plextor 401240 cd
Comment 5 R P Herrold 2002-04-15 03:03:15 EDT
Differing hardware -- P-166, 32M, all IDE -- OK md5sums, but LOTS of drive
timeouts/resets
Died 70% through disk 1 in an upgrade from RH 6.1 (heavily patched)
My unit: centurion
------------------
attachment in a moment
------------------
I _hate_ uncaught exceptions -- I am left with a half-upgraded system which is
quite difficult to resuscitate
Comment 6 R P Herrold 2002-04-15 03:04:26 EDT
Created attachment 53824 [details]
upgrade traceback off host centurion
Comment 7 R P Herrold 2002-04-15 03:13:07 EDT
An uncaught exception is particularly painful in that there is a real chance
that one will be left with an ext3 converted filesystem, but not yet holding
the tools in /bin/ to recover the filesystem, or to boot properly.

I really think that Retry and Skip options on an I/O failed package are needed
in any situation which would otherwise be a traceback, for restarting an upgrade
is MUCH more painful than missing one or two random packages.  Yeah, yeah, I
know about backups, and have such on this upgrade, taken immediately prior to
the attempt, but ....

Please pound away at making sure anaconda catches exceptions _everywhere_ once
the RPM transaction is going -- possibly with a series of sub-group installs to
let RPM checkpoint its progress (because otherwise, after a bail on an
upgrade, _all_ the previously installed packages need to be installed again)
[with lost data on post-install scripts and backed-up config data in some
cases].
Comment 8 R P Herrold 2002-04-15 03:26:34 EDT
I see several IDE drive resets in Patrick's traceback as well.  This situation
is symptomatic of the hardware in the market these days.  I get it with Mitsumi
and HP marque drives, as well as 'el cheapo' cruft from CompUSA.  Drive resets
and retries are a fact of life, and need a rethinking on how they are
intercepted and handled.

I have had good results, since the RH 6.2 testing days, with letting the drive
spin down -- and choosing the retry option to spin it up, and when that fails,
immediately retrying while it is spinning ... timing is part of the solution of
'walking' a drive through whatever issue it is experiencing.
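
The 'walk the drive through it' routine above could be sketched as a retry loop with a pause schedule. This is only an illustration of the idea, not anaconda's code: `read_with_retries`, the `delays` values, and the injectable `sleep` are all hypothetical.

```python
import time


def read_with_retries(read_once, delays=(0.0, 5.0, 30.0), sleep=time.sleep):
    """Call `read_once()` until it succeeds or the delay schedule runs out.

    Each entry in `delays` is how long to wait before that attempt, so a
    short pause catches a drive that is still spinning and a long one lets
    it spin down first.  Returns (result, attempts_used).
    """
    if not delays:
        raise ValueError("delays must contain at least one entry")
    last_error = None
    for attempt, delay in enumerate(delays, start=1):
        if delay > 0:
            sleep(delay)
        try:
            return read_once(), attempt
        except OSError as error:  # a failed read (EIO) surfaces as OSError
            last_error = error
    # Every attempt failed: re-raise so the caller can offer Retry/Skip/Abort.
    raise last_error
```

Passing `sleep` in as a parameter keeps the schedule testable without real waiting; the actual delay values would have to be found by experiment on the drive in question.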

-------------

Side note: my latest traceback (RH 6.1 to RH 7.2) also had this error:

Upgrading chkconfig.
unpacking of archive failed on file /etc/init.d: cpio: rename failed - Is a 
directory
Comment 9 R P Herrold 2002-04-15 03:29:01 EDT
Host has caught up to the space checking part of the install -- it is now
'short' 187 M of space in /usr, which I know full well is content left
behind from the prior attempt, which will be over-written this pass through.

There _has_ to be an _"I know it is not recommended -- do it anyway"_ option in
at least a 'text expert' mode, to be able to recover from a situation like
this.
Comment 10 R P Herrold 2002-04-15 03:34:36 EDT
(/usr is 1.3 G -- entire upgrade size is 985M per screen -- I'll blow away 
/usr/share, /usr/man, and /usr/doc to make space ...)
Comment 11 Patrick 2002-04-15 03:48:26 EDT
I solved my problem by using a DVD drive instead of the CD drive; it seems
anaconda was unable to eject the CD.
Comment 12 R P Herrold 2002-04-15 04:48:43 EDT
... just completed a totally hands-off last stage install -- no retries on CDs
at all calling for operator intervention ..

with a broken named, keymap, fonts, and xfs, due to the space problems.

Later today I'll find the prior similar bug that was closed declining to add the
'Skip' option, for comparison -- this is clearly worse than skipping a couple
of packages would be.
Comment 13 Jeremy Katz 2002-04-15 11:50:44 EDT
The uncaught exception is because there have been lots of read errors from the
CD causing package installs to fail and then the CD can't be unmounted cleanly. 

Every time someone asks for a skip button, I respond with the same question.

What do the semantics of "SKIP" mean?  If glibc fails to install, what am I
supposed to do.  Especially on an upgrade.

The fact that the RPM errors weren't caught is fixed in Skipjack but even so,
there's not much we can do when parts of a transaction set fail.  RPM just
doesn't have the semantics for handling cases like that.  

CD hardware seems to suck more and more these days... it's getting close to the
level of floppy drives :(

Do you know if DMA is being used for accessing the CD-ROM drive?
Comment 14 R P Herrold 2002-04-16 11:50:56 EDT
grrr ... concur on the sorry state of CD hardware on the IDE side -- but
Gresham's Law has doomed us there.  The legion IDE DMA blacklist workarounds are
well known and won't be repeated here.  Dunno if it was using DMA -- can we tell
from the dmesg within the anaconda tracebacks?  Seems we should be catching
that info in a traceback -- if not, shall you RFE, or shall I?

(Netscape is horking out -- I lost a 2-hour on-screen compose -- I'll commit
this and continue)
Comment 15 R P Herrold 2002-04-16 12:14:16 EDT
jkatz -- see also my 56330; the issue is broader than hardware

you asked:

>  Every time someone asks for a skip button, I respond with the same question.
> 
>  What do the semantics of "SKIP" mean?  If glibc fails to install, what am I
>  supposed to do.  Especially on an upgrade.

There are about 15 critical packages owning statically linked binaries -- ones
owning items in /bin/ and /sbin/ -- this list is a reverse lookup from /sbin/

[root@oldnews sbin]# cd /sbin; rpm -qf --qf '%{name}\n' `cat ~/rpm-static `| \
      sort -u
dhcpcd
dump
e2fsprogs
glibc
hotplug
initscripts
iproute
kernel-pcmcia-cs
mkbootdisk
mkinitrd
modutils
rmt
shapecfg
-------------

Obviously some are convenience rather than truly 'Critical'.

But for ones in that list, if there is a purported media error, or whatever, we
ARE still in a Unix environment -- if the person doing the upgrade is doing it
'text expert', they are on their own -- they know it -- your support policy can
state it.  But Unix makes impossible things merely hard, unlike some other
OpSys's.

For the Critical, display a variant "Danger Will Robinson -- you will royally
hork up your system if you click 'SKIP'" -- that is where I ended up a couple of
nights ago anyway, with ext3 partitions and a 6.1 fallback kernel.

For the packages NOT on the list, do what is done now:  Complain into
/tmp/upgrade.log about it, and Skip if told to Skip
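
A minimal sketch of that two-tier policy, in Python. The package names are the ones from the rpm -qf list above; treating every one of them as 'Critical' is an assumption made for illustration, and `handle_failed_package`, `confirm_danger`, and the in-memory `log` list are hypothetical stand-ins rather than anything anaconda provides.

```python
# Names taken from the rpm -qf output above; treating all of them as
# critical is an assumption for this sketch.
CRITICAL_PACKAGES = frozenset([
    "dhcpcd", "dump", "e2fsprogs", "glibc", "hotplug", "initscripts",
    "iproute", "kernel-pcmcia-cs", "mkbootdisk", "mkinitrd", "modutils",
    "rmt", "shapecfg",
])


def handle_failed_package(name, confirm_danger, log):
    """Decide what to do after package `name` failed to install.

    Returns "skip" or "abort".  `confirm_danger` is a callable that shows
    the warning text and returns True only if the operator insists.
    """
    if name in CRITICAL_PACKAGES:
        warning = "Skipping %s may leave the system unbootable." % name
        if not confirm_danger(warning):
            return "abort"
    # Non-critical packages, and critical ones the operator insisted on,
    # are logged and skipped so the rest of the upgrade can continue.
    log.append("skipped %s after install failure" % name)
    return "skip"
```

For example, `handle_failed_package("xinetd", confirm_danger, log)` never asks the operator anything, while a failure on glibc goes through the warning first.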

Heck -- in THIS upgrade, that is what was done by RPM and anaconda anyway:  see:
https://bugzilla.redhat.com/bugzilla/showattachment.cgi?attach_id=53824 :

Upgrading chkconfig.
unpacking of archive failed on file /etc/init.d: cpio: rename failed - Is a
directory

---------

because I had backported the initscripts to upgrade sendmail somewhere along the
way, and needed to get /etc/init.d/functions before the LSB conformance changes
of RH 7.0

and:

Upgrading XFree86-100dpi-fonts.
read failed: Input/output error (5)

read failed: Input/output error (5)

----------

again, no big deal -- skip and clean up later

and:


Upgrading xinetd.
read failed: Input/output error (5)

read failed: Input/output error (5)

--------------

harder, but I can fix (and have fixed) ...

==========================

The point is -- by simply bailing out, anaconda denies the sysadmin the
opportunity to choose.  So long as there are big red signs, preserving freedom
of choice is the point.

-----------------------------

As to my RPM transaction set breakup observation, it is customary in the
industry to use Checkpoint/Restart as a way to minimize loss and enhance
recoverability.  It would seem that instead of a mega transaction, a 'Critical'
subset on disk 1, and then a non-critical "the rest" would be fairly
straightforward.  This is however beyond the scope of a RH 7.2 final issue, and
I will split it out and file it as a separate RFE.
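
To make the Checkpoint/Restart idea concrete, here is a rough Python sketch. It is not how RPM transaction sets work today; `run_in_stages`, the `install_one` callback, and the JSON checkpoint file are all hypothetical, chosen only to show a 'Critical' stage followed by "the rest", with a rerun picking up after the last completed stage.

```python
import json
import os


def run_in_stages(stages, install_one, checkpoint_path):
    """Install packages stage by stage, recording each finished stage.

    `stages` is an ordered list of (stage_name, [package, ...]) pairs.
    On a rerun, stages already listed in the checkpoint file are skipped.
    Returns the list of stage names completed so far.
    """
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            done = json.load(handle)
    for stage_name, packages in stages:
        if stage_name in done:
            continue  # finished in an earlier run
        for package in packages:
            # An exception here propagates and leaves the checkpoint as it was.
            install_one(package)
        done.append(stage_name)
        # Write to a temp file and rename, so a crash cannot truncate the record.
        temp_path = checkpoint_path + ".tmp"
        with open(temp_path, "w") as handle:
            json.dump(done, handle)
        os.replace(temp_path, checkpoint_path)
    return done
```

The granularity here is the stage, not the package: a failure partway through "the rest" reruns that whole stage, but never the critical one that already completed.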

-- Russ

Comment 16 Herb Krakau 2002-04-23 12:35:18 EDT
I have the same or a very similar problem.  Anaconda dies (repeatedly) about 
2/3 of the way through an install of 7.2.  I can't get Linux to boot, so I'm 
dead in the water.  I'm green enough not to catch the drift of the other 
comments; is a hardware problem in accessing the CD being suggested as the 
cause?  If that is the case, I would note that I've used the drive with Windows 
2000 for over a year without problems - that is at least circumstantial 
evidence that the drive may be OK.  Any suggestions on how to work around this 
for the time being?

I'll attach anacdump.txt in case it helps.  Is there any other info I can 
supply?
Comment 17 Herb Krakau 2002-04-23 12:49:12 EDT
Created attachment 54985 [details]
Anaconda install failure dump.
Comment 18 Herb Krakau 2002-04-25 14:59:55 EDT
Used a different CD drive, and the install completed successfully.
