Bug 23669 - can't read/write last block of odd-sized disk from userspace
Summary: can't read/write last block of odd-sized disk from userspace
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: parted
Version: 7.1
Hardware: All
OS: Linux
high
high
Target Milestone: ---
Assignee: Michael K. Johnson
QA Contact: Brock Organ
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2001-01-09 20:43 UTC by Matt Domsch
Modified: 2005-10-31 22:00 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2001-05-02 15:54:33 UTC
Embargoed:


Attachments (Terms of Use)
parted-1.4.9-efigpt-blkioctl-010313.patch (80.44 KB, patch)
2001-03-13 21:18 UTC, Matt Domsch
no flags Details | Diff
parted-1.4.10-gpt-010319.patch (79.84 KB, patch)
2001-03-19 17:33 UTC, Matt Domsch
no flags Details | Diff

Description Matt Domsch 2001-01-09 20:43:34 UTC
I'm implementing code to handle the Intel EFI GUID Partition Table 
specification, available at http://developer.intel.com/technology/efi.  
Per the spec, a primary GUID partition table is stored on the disk at 
physical sector 1, and a backup of that partition table is stored on the 
last sector of the disk.  On an even-sized disk, user-space tools can 
read/write the last sector.  On an odd-sized disk, they cannot.

I need to read/write the last 512-byte sector on an odd-sized disk (IDE 
and/or SCSI) from user space.  Employing suggestions from l-k, I have 
implemented two IOCTLs that get/set the blksize_size[MAJOR(dev)][MINOR
(dev)] values (via set_blocksize()).  In my application, I read the 
hardsector size of a disk device (/dev/sdb) via an IOCTL, read the current 
blksize_size, set it to the hardsector size, and then continue, resetting 
blksize_size back to the original value when done.

In between, I use fopen64() to open the device /dev/sdb (an odd-sized 
disk), and lseek64()/fwrite64()/fread64() to write/read the last sector of 
the disk as reported by the BLKGETSIZE ioctl.  The seek succeeds, however, 
the reads/writes fail.  Writes fail with "No space left on device".  Read 
returns 0, indicating EOF.   If I read/write the N-1th sector this way, it 
works just fine.

From kernel space, I can bread() the last block with no problems after 
changing the blksize_size, only from user-space do I still have problems.

On even-sized disks, this succeeds both with and without calling my new 
IOCTLs, as expected.  On odd-sized disks, it fails in both cases.  Is 
there anything else I must do to make my change take affect that you can 
suggest?

I open this against all platforms, as I need it to function on IA-64 first 
in the Florence release, but the problem is generic.

I posed this question to Andries Brouwer and Anton Altaparmakov.  Anton 
responded:

I suspect that you will find the problem in linux/fs/block_dev.c functions
block_write() and block_read() which AFAIK only get called when you use
usermode to read a /dev/* blockdevice as you probably are doing. From the
kernel these functions AFAIK never get called and hence the problem
doesn't exist (either way I am surprised that it works as looking at
generic_make_request() there is a check of simillar type and AFAICS it
would fail as well for the same reason. I probably don't understand
something there and/or generic_make_request() is not being called either,
as it obviously works, since you have tried it).

In block_write() you find this (lines 58ff in the file in 2.4.0):
[snip]
        if (blk_size[MAJOR(dev)])
                size = ((loff_t) blk_size[MAJOR(dev)][MINOR(dev)] <<
BLOCK_SIZE_BITS) >> blocksize_bits;
        else
                size = INT_MAX;
        while (count>0) {
                if (block >= size)
                        return written ? written : -ENOSPC;
[snip]

As you can see when size is calculated blk_size is multiplied by 1024 and
then divided by 512, effectively blk_size is multiplied by 2.

The effect of this is that size has the lowest bit always equal to zero
and hence it always will be even.

The subsequent check "if (block >= size)" of course is then true and we
return with -ENOSPC straight away.

In block_read() you find this (lines 195ff in the file in 2.4.0):
[snip]
        if (blk_size[MAJOR(dev)])
                size = (loff_t) blk_size[MAJOR(dev)][MINOR(dev)] <<
BLOCK_SIZE_BITS;
        else
                size = (loff_t) INT_MAX << BLOCK_SIZE_BITS;

        if (offset > size)
                left = 0;
[snip]
        if (left <= 0)
                return 0;
[snip]

And again size is equal to blk_size multiplied by 1024 and hence always
will be even.

My solution is to have my own kernel patch which changes the constant
BLOCK_SIZE_BITE to 512 so that everything works fine. This also has the
effect that blk_size[] is in units of 512 rather than 1024. - Changing the
constant also results in quite a few other changes in the kernel as
unfortunately it is very often the case that people have assumed that
BLOCK_SIZE_BITS == 1024 and have hard coded either the 1024 as a number or
done something else equally shortsighted resulting in the kernel
blockdevice/vfs side breaking completely when you change the
BLOCK_SIZE_BITS. [Note that my patch only modifies the parts of the kernel
I compile in and by no means fixes all block devices / filesystems. - It
definitely makes both ide and scsi work for me but AFAICS it brakes
software RAID at the very least and potentially a lot of filesystems (my
patch fixes up only ext2 as I don't use anything else except for NTFS
which works fine). - If you are interested in my patch just let me know
and I can email it to you (note that it is for kernel version 2.4.0-test8
so I don't know if it still will apply cleanly to 2.4.0 proper but it
should be easy to fix).

Plese let me know how you get on as I have a strong interest in getting
this to work myself for the NTFStools I am writing!

Best regards,

	Anton

Comment 1 Stephen Tweedie 2001-01-10 11:56:50 UTC
The Linux block device layer is set up from the ground upwards to deal in 1k
block sizes or more.  There are some parts which have been extended to deal to
some extent with 512-byte blocks, but all of the device size accounting is still
1k-based.

Changing this would require auditing and testing just about every block device
driver in the system, and every filesystem, and every partition table decoder.
It would break compatibility with all existing third-party driver modules and
filesystems.

Changing the blk_size[] to be defined in terms of sectors, not blocks, would be
a more limited change, but again we'd have to audit every place where that is
used, and again it would break source-level compatibility with every
block-device-related module.

There is a block-device layer rework scheduled for the 2.5 kernel and this will
certainly be on the agenda.  For now, it's an enhancement that we _can_ make,
theoretically, but it's not trivial, and you should be aware of the
compatibility nightmares that it raises.

Comment 2 Matt Domsch 2001-01-10 13:53:55 UTC
Thanks for the insights Steven.

How about these as alternatives to fixing the block layer in the short-term?
1. RawIO.  I've been unsuccessful in making RawIO work to read/write that last 
block, but could be my own error.  Theoretically, should this work?  As we're 
in a beta cycle, modifying dd to alloc on 512 byte boundries should be easy, 
and could allow simplified testing too.
2. SCSI generic.  Issue read/write CDBs straight to a SCSI disk via sg.  
Haven't investigated here, should bypass block layer entirely.  Won't work with 
IDE drives, but I don't think that matters in IA-64 systems today, they're all 
SCSI.
3. Write IOCTLs to get/set the blk_size for a particular device, just as I'm 
doing for blocksize_size, paying attention to what that might break, but 
knowing it's a limited and temporary change (only for the duration of the disk 
access to read/write the partition table).
4. De-stroke odd-sized disks.  I hate doing it, it's a serious kludge, but we 
*could* tell the disk it's one sector shorter than it is.

Thanks,
Matt


Comment 3 Stephen Tweedie 2001-01-10 15:34:18 UTC
Raw IO is subject to the same limits as other IO, because ultimately it uses the
same route through the kernel to get to the low-level disk IO drivers.

dd should already align its buffers correctly these days.

Changing the blockdev size via ioctls may allow you access to the last sector,
but it will _also_ give you access to the sector beyond that, which may well be
critical for the immediately following partition.

Ignoring the last sector is safe and simple, and is what all existing Linux
filesystems do, but whether or not it is acceptable for you will depend on your
own application.

Accessing /dev/sg ought to work fine, but of course it places much more load on
the application programmer and removes a ton of kernel safety-nets.

Comment 4 Matt Domsch 2001-02-13 18:54:52 UTC
Michael  E. Brown from my team has submitted a patch to linux-kernel to address
this.

http://boudicca.tux.org/hypermail/linux-kernel/this-week/0526.html


Comment 5 Matt Domsch 2001-03-05 19:30:30 UTC
http://boudicca.tux.org/hypermail/linux-kernel/2001week07/0526.html is the
proper URL.

linux-2.4.2-blkioctl-sector.patch included in kernel 2.4.2-0.1.19.

However, one comment received from l-k is that there exists an XFS patch which
also
uses IOCTL IO_(0x12,108).  It was suggested that 109 and 110 be used instead.



Comment 6 Matt Domsch 2001-03-13 20:58:02 UTC
block-ioctl.patch went in to kernel by 2.4.2-0.1.23 or before.
I've got a patch for parted-1.4.9 which uses this, but I'll hold off on
submitting it
until we nail down what the IOCTL numbers are going to be for sure.

Comment 7 Matt Domsch 2001-03-13 21:18:16 UTC
Created attachment 12629 [details]
parted-1.4.9-efigpt-blkioctl-010313.patch

Comment 8 Michael K. Johnson 2001-03-13 21:44:12 UTC
Since we have the kernel patched, this leaves parted to be changed.

Comment 9 Matt Domsch 2001-03-19 17:33:34 UTC
Created attachment 13005 [details]
parted-1.4.10-gpt-010319.patch

Comment 10 Matt Domsch 2001-04-30 21:29:26 UTC
GPT patch for parted 1.4.11 available at http://domsch.com/linux/parted.


Comment 11 Bill Nottingham 2001-05-02 15:54:28 UTC
Added as of parted-1.4.11-1; thanks.


Note You need to log in before you can comment on or make changes to this bug.