I'm implementing code to handle the Intel EFI GUID Partition Table specification, available at http://developer.intel.com/technology/efi. Per the spec, a primary GUID partition table is stored on the disk at physical sector 1, and a backup of that partition table is stored on the last sector of the disk. On an even-sized disk, user-space tools can read/write the last sector. On an odd-sized disk, they cannot. I need to read/write the last 512-byte sector on an odd-sized disk (IDE and/or SCSI) from user space. Employing suggestions from l-k, I have implemented two IOCTLs that get/set the blksize_size[MAJOR(dev)][MINOR (dev)] values (via set_blocksize()). In my application, I read the hardsector size of a disk device (/dev/sdb) via an IOCTL, read the current blksize_size, set it to the hardsector size, and then continue, resetting blksize_size back to the original value when done. In between, I use fopen64() to open the device /dev/sdb (an odd-sized disk), and lseek64()/fwrite64()/fread64() to write/read the last sector of the disk as reported by the BLKGETSIZE ioctl. The seek succeeds, however, the reads/writes fail. Writes fail with "No space left on device". Read returns 0, indicating EOF. If I read/write the N-1th sector this way, it works just fine. From kernel space, I can bread() the last block with no problems after changing the blksize_size, only from user-space do I still have problems. On even-sized disks, this succeeds both with and without calling my new IOCTLs, as expected. On odd-sized disks, it fails in both cases. Is there anything else I must do to make my change take affect that you can suggest? I open this against all platforms, as I need it to function on IA-64 first in the Florence release, but the problem is generic. I posed this question to Andries Brouwer and Anton Altaparmakov. Anton responded: I suspect that you will find the problem in linux/fs/block_dev.c functions block_write() and block_read() which AFAIK only get called when you use usermode to read a /dev/* blockdevice as you probably are doing. From the kernel these functions AFAIK never get called and hence the problem doesn't exist (either way I am surprised that it works as looking at generic_make_request() there is a check of simillar type and AFAICS it would fail as well for the same reason. I probably don't understand something there and/or generic_make_request() is not being called either, as it obviously works, since you have tried it). In block_write() you find this (lines 58ff in the file in 2.4.0): [snip] if (blk_size[MAJOR(dev)]) size = ((loff_t) blk_size[MAJOR(dev)][MINOR(dev)] << BLOCK_SIZE_BITS) >> blocksize_bits; else size = INT_MAX; while (count>0) { if (block >= size) return written ? written : -ENOSPC; [snip] As you can see when size is calculated blk_size is multiplied by 1024 and then divided by 512, effectively blk_size is multiplied by 2. The effect of this is that size has the lowest bit always equal to zero and hence it always will be even. The subsequent check "if (block >= size)" of course is then true and we return with -ENOSPC straight away. In block_read() you find this (lines 195ff in the file in 2.4.0): [snip] if (blk_size[MAJOR(dev)]) size = (loff_t) blk_size[MAJOR(dev)][MINOR(dev)] << BLOCK_SIZE_BITS; else size = (loff_t) INT_MAX << BLOCK_SIZE_BITS; if (offset > size) left = 0; [snip] if (left <= 0) return 0; [snip] And again size is equal to blk_size multiplied by 1024 and hence always will be even. My solution is to have my own kernel patch which changes the constant BLOCK_SIZE_BITE to 512 so that everything works fine. This also has the effect that blk_size[] is in units of 512 rather than 1024. - Changing the constant also results in quite a few other changes in the kernel as unfortunately it is very often the case that people have assumed that BLOCK_SIZE_BITS == 1024 and have hard coded either the 1024 as a number or done something else equally shortsighted resulting in the kernel blockdevice/vfs side breaking completely when you change the BLOCK_SIZE_BITS. [Note that my patch only modifies the parts of the kernel I compile in and by no means fixes all block devices / filesystems. - It definitely makes both ide and scsi work for me but AFAICS it brakes software RAID at the very least and potentially a lot of filesystems (my patch fixes up only ext2 as I don't use anything else except for NTFS which works fine). - If you are interested in my patch just let me know and I can email it to you (note that it is for kernel version 2.4.0-test8 so I don't know if it still will apply cleanly to 2.4.0 proper but it should be easy to fix). Plese let me know how you get on as I have a strong interest in getting this to work myself for the NTFStools I am writing! Best regards, Anton
The Linux block device layer is set up from the ground upwards to deal in 1k block sizes or more. There are some parts which have been extended to deal to some extent with 512-byte blocks, but all of the device size accounting is still 1k-based. Changing this would require auditing and testing just about every block device driver in the system, and every filesystem, and every partition table decoder. It would break compatibility with all existing third-party driver modules and filesystems. Changing the blk_size[] to be defined in terms of sectors, not blocks, would be a more limited change, but again we'd have to audit every place where that is used, and again it would break source-level compatibility with every block-device-related module. There is a block-device layer rework scheduled for the 2.5 kernel and this will certainly be on the agenda. For now, it's an enhancement that we _can_ make, theoretically, but it's not trivial, and you should be aware of the compatibility nightmares that it raises.
Thanks for the insights Steven. How about these as alternatives to fixing the block layer in the short-term? 1. RawIO. I've been unsuccessful in making RawIO work to read/write that last block, but could be my own error. Theoretically, should this work? As we're in a beta cycle, modifying dd to alloc on 512 byte boundries should be easy, and could allow simplified testing too. 2. SCSI generic. Issue read/write CDBs straight to a SCSI disk via sg. Haven't investigated here, should bypass block layer entirely. Won't work with IDE drives, but I don't think that matters in IA-64 systems today, they're all SCSI. 3. Write IOCTLs to get/set the blk_size for a particular device, just as I'm doing for blocksize_size, paying attention to what that might break, but knowing it's a limited and temporary change (only for the duration of the disk access to read/write the partition table). 4. De-stroke odd-sized disks. I hate doing it, it's a serious kludge, but we *could* tell the disk it's one sector shorter than it is. Thanks, Matt
Raw IO is subject to the same limits as other IO, because ultimately it uses the same route through the kernel to get to the low-level disk IO drivers. dd should already align its buffers correctly these days. Changing the blockdev size via ioctls may allow you access to the last sector, but it will _also_ give you access to the sector beyond that, which may well be critical for the immediately following partition. Ignoring the last sector is safe and simple, and is what all existing Linux filesystems do, but whether or not it is acceptable for you will depend on your own application. Accessing /dev/sg ought to work fine, but of course it places much more load on the application programmer and removes a ton of kernel safety-nets.
Michael E. Brown from my team has submitted a patch to linux-kernel to address this. http://boudicca.tux.org/hypermail/linux-kernel/this-week/0526.html
http://boudicca.tux.org/hypermail/linux-kernel/2001week07/0526.html is the proper URL. linux-2.4.2-blkioctl-sector.patch included in kernel 2.4.2-0.1.19. However, one comment received from l-k is that there exists an XFS patch which also uses IOCTL IO_(0x12,108). It was suggested that 109 and 110 be used instead.
block-ioctl.patch went in to kernel by 2.4.2-0.1.23 or before. I've got a patch for parted-1.4.9 which uses this, but I'll hold off on submitting it until we nail down what the IOCTL numbers are going to be for sure.
Created attachment 12629 [details] parted-1.4.9-efigpt-blkioctl-010313.patch
Since we have the kernel patched, this leaves parted to be changed.
Created attachment 13005 [details] parted-1.4.10-gpt-010319.patch
GPT patch for parted 1.4.11 available at http://domsch.com/linux/parted.
Added as of parted-1.4.11-1; thanks.