Bug 254

Summary: dump failure--slave children out of sync
Product: [Retired] Red Hat Linux Reporter: dhunt
Component: dumpAssignee: David Lawrence <dkl>
Status: CLOSED NOTABUG QA Contact:
Severity: high Docs Contact:
Priority: high    
Version: 5.2   
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 1999-01-14 18:35:29 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description dhunt 1998-12-01 22:40:07 UTC
11/30/98

This dump command leads to an infinite loop in pass III
(directories):

dump 0uBbf 12000000 63 /home/sda1.dump /dev/sda1

/dev/sda1 is the root file system, containing 1.3 Gig of
stuff.
/home/sda1.dump is a test dump file.  This also fails when
writing to my exabyte SCSI 8mm tape.

  DUMP: Date of this level 0 dump: Mon Nov 30 11:18:02 1998
  DUMP: Date of last level 0 dump: the epoch
  DUMP: Dumping /dev/sda1 (/) to /dev/nst0
  DUMP: mapping (Pass I) [regular files]
  DUMP: mapping (Pass II) [directories]
  DUMP: estimated 1454634 tape blocks on 0.04 tape(s).
  DUMP: dumping (Pass III) [directories]

... here for 6 hours, no progress messages, 99% CPU usage on
one cpu
... no disk access noise

My system is a DELL Precision workstation 410, 400 MHz, 2
CPUs.
I have compiled the kernel (2.0.36) with SMP on.  /dev/sda1
is an 8 Gig Ultra 2
SCSI disk.  This dump procedure worked fine in production
under RedHat 5.1.  I recently
upgraded to rh5.2 and it fails.

11/31/98

Debugging:

-- Built dump
-- Built ext2fsprog-devel

Debugging dump:

(gdb) run 0uBbf 12000000 63 /home/sda1.dump /dev/sda1
Starting program: /usr/src/redhat/BUILD/dump-0.3/dump/dump
0uBbf 12000000 63 /home/sda1.dump /dev/sda1

Breakpoint 1, main (argc=6, argv=0xbffffcc0) at main.c:122
(gdb) c
Continuing.
  DUMP: Date of this level 0 dump: Tue Dec  1 10:40:00 1998
  DUMP: Date of last level 0 dump: the epoch
  DUMP: Dumping /dev/sda1 (/) to /home/sda1.dump
  DUMP: mapping (Pass I) [regular files]
  DUMP: mapping (Pass II) [directories]
  DUMP: estimated 1472006 tape blocks on 0.12 tape(s).
  DUMP: dumping (Pass III) [directories]

... hangs here ...  Hit cntl-C in gdb

  DUMP: Interrupt received.
  DUMP: Do you want to abort dump?: ("yes" or "no")
Program received signal SIGINT, Interrupt.
0x8061bf9 in __wait4 ()
(gdb) where
#0  0x8061bf9 in __wait4 ()
#1  0xeff in ?? ()
#2  0x804a8ee in startnewtape (top=1) at tape.c:560
#3  0x8049232 in main (argc=0, argv=0xbffffcd8) at
main.c:493
(gdb)

--Now recompile with -DTDEBUG:

--Rerun from command line:

[root@cosmic dump]# ./dump 0uBbf 12000000 63 /home/sda1.dump
/dev/sda1
  DUMP: pid=3961 Date of this level 0 dump: Tue Dec  1
10:50:57 1998
  DUMP: pid=3961 Date of last level 0 dump: the epoch
  DUMP: pid=3961 Dumping /dev/sda1 (/) to /home/sda1.dump
  DUMP: pid=3961 mapping (Pass I) [regular files]
  DUMP: pid=3961 mapping (Pass II) [directories]
  DUMP: pid=3961 estimated 1472001 tape blocks on 0.12
tape(s).
  DUMP: pid=3961 Tape: 1; parent process: 3961 child process
3962
  DUMP: pid=3962 Child on Tape 1 has parent 3961, my pid =
3962
  DUMP: pid=3962 dumping (Pass III) [directories]
  DUMP:   DUMP: pid=3964 slave 1, pid 3964
  DUMP: pid=3965 slave 2, pid 3965
pid=3963 slave 0, pid 3963
  DUMP: pid=3963 master/slave protocol botched.
  DUMP: pid=3963 pid = 3963 exits with status 3
  DUMP: pid=3962 The ENTIRE dump is aborted.
  DUMP: pid=3962 pid = 3962 exits with status 3
  DUMP: pid=3961 Child 3962 finishes X_ABORT
  DUMP: pid=3961 pid = 3961 exits with status 3
[root@cosmic dump]#

-- Fiddled around some with -DWRITEDEBUG and number of
SLAVES (from tape.c)
   These seemed to have some effect, but it is sporadic.
This leads me
   to believe that there is a timing problem.  When I set
SLAVES to 5 (from 3)
   and compile with -DWRITEDEBUG, it seems to work, but
spews out loads of
   debug output.

   Since it sometimes works when debug output is on, perhaps
my disk is too
   fast and there is a race condition?


-- I am now stuck.

Comment 1 Doug Ledford 1999-01-14 17:23:59 UTC
DUMP: pid=3963 master/slave protocol botched.
               ^^^^^^^^^^^^

This leads me to believe that there are some possible issues between
dump and the timing of process/thread creation on the Sparc systems.
At this point, it doesn't look like and tape read/write cycles have
yet been started, and the disk read stuff has already worked properly
(pass I and II).  So, I would suggest looking into the master/slave
creation/syncronization code in dump first.  There is also the issue
of a possible race condition in the kernel due to the drive sub-system
and SMP combination.  Compiling as UP and/or slowing down the drive
might make a difference and would be worth testing out to make sure.

Recommended course of action, the user needs to let us know if
compiling UP kernel makes a difference.  If not, then we can look into
the dump code.  If it does, then the problem would seem similar to
another known SMP problem in 2.0.36 and is not likely to be fixed
given the pre-emminent release of 2.2.0.

Comment 2 Doug Ledford 1999-01-14 17:24:59 UTC
Sorry, replace Sparc with Intel in my last comments....