Bug 1122937
Summary: | aarch64 AES crypto: block misalignment causes crash or data corruption and cryptsetup fails to activate partitions | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Richard W.M. Jones <rjones> | ||||||||||
Component: | kernel | Assignee: | Mikuláš Patočka <mpatocka> | ||||||||||
Status: | CLOSED RAWHIDE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||||||
Severity: | unspecified | Docs Contact: | |||||||||||
Priority: | unspecified | ||||||||||||
Version: | 21 | CC: | agk, gansalmon, gmazyland, herbert.xu, itamar, jonathan, kernel-maint, madhu.chinakonda, mchehab, mpatocka, okozina, rjones | ||||||||||
Target Milestone: | --- | ||||||||||||
Target Release: | --- | ||||||||||||
Hardware: | aarch64 | ||||||||||||
OS: | Unspecified | ||||||||||||
Whiteboard: | |||||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||||
Doc Text: | Story Points: | --- | |||||||||||
Clone Of: | |||||||||||||
: | 1123522 (view as bug list) | Environment: | |||||||||||
Last Closed: | 2014-07-29 13:53:10 UTC | Type: | Bug | ||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||
Documentation: | --- | CRM: | |||||||||||
Verified Versions: | Category: | --- | |||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
Embargoed: | |||||||||||||
Bug Depends On: | |||||||||||||
Bug Blocks: | 910269, 1123522 | ||||||||||||
Attachments: |
|
Description
Richard W.M. Jones
2014-07-24 12:49:33 UTC
First, are you sure that your key file does not contain EOL char (see man page, for historic reason it is so tricky here...)? Because in keyfiles this is processed as part of key but in terminal input it EOL mean end of passphrase input. (IOW try "echo -n 123456 >/tmp/key" and repeat the test ;-) If it itill fails, you can try to eliminate gcrypt by simply recompile cryptsetup with another crypto backend, I would suggest kernel (because it will not use any userspace library - just it will be slow). If you want to five it a try, add --with-crypto_backend=kernel to spec and recompile the whole cryptsetup/libcryptsetup. Also please paste output of working and non-working command with added --debug, there is a lot more info, thanks. Yup, actually I used echo -n, not echo so it's fine: $ hexdump -C /tmp/key 00000000 31 32 33 34 35 36 |123456| 00000006 Using --with-crypto_backend=kernel fails in the same way. Enabling --debug is fun. It *changes* the behaviour of the good case so it fails as well. BTW I really think this is an aarch64-specific code gen bug. For some reason af.c:AF_merge produces different results from exactly the same input in the two different cases. The only difference between the two cases would be memory/stack alignment. Hm. The --debug should not change anything. So this looks like some nasty alignment bug... (Also passphrase and keyfile is for keyslot processing exactly the same, just buffer of data. So that it worked sometimes was just luck I guess...) Unfortunately I do not have access to any aarch64 machine. Are you sure the bug is in af.c? This is something I would like to definitely fix upstream asap. (But that code run on many strange architectures without problem.) Sorry the hardware is covered by an NDA so we cannot give out access to it (yet). I'm not sure that AF_merge is the problem. However does below look right or wrong to you? Should the outputs be different, given that the inputs are the same? The failure happens immediately afterwards in LUKS_verify_volume_key() where the only difference between the two cases is the content of vk->key. vk->key is generated by AF_merge. Hence I'm looking at AF_merge. Another thing: char is unsigned on ARM. -- AF_merge entry (af.c:128) -- good case: Breakpoint 1, AF_merge ( src=src@entry=0x4242f8 "\371h\nյ{\372\315 +\243gw<\264\006h\333\372f\355\374\274\265\241\235o\256\001\177\371\223|\021\275\020\016[3\271]8\034\226FN\206\017\245}\r\340\323\361*\v\224v\262\224XNTl\302\024\220\320T\025\235\203", <incomplete sequence \303>, dst=dst@entry=0x423158 "", blocksize=32, blocknumbers=4000, hash=hash@entry=0x4228d8 "sha1") at af.c:130 failing case: Breakpoint 1, AF_merge ( src=src@entry=0x4233e8 "\371h\nյ{\372\315 +\243gw<\264\006h\333\372f\355\374\274\265\241\235o\256\001\177\371\223|\021\275\020\016[3\271]8\034\226FN\206\017\245}\r\340\323\361*\v\224v\262\224XNTl\302\024\220\320T\025\235\203", <incomplete sequence \303>, dst=dst@entry=0x4230a8 "", blocksize=32, blocknumbers=4000, hash=hash@entry=0x422848 "sha1") at af.c:130 [The inputs look identical to me] -- AF_merge exit (af.c:145) -- good: Breakpoint 2, AF_merge ( src=src@entry=0x4242f8 "\371h\nյ{\372\315 +\243gw<\264\006h\333\372f\355\374\274\265\241\235o\256\001\177\371\223|\021\275\020\016[3\271]8\034\226FN\206\017\245}\r\340\323\361*\v\224v\262\224XNTl\302\024\220\320T\025\235\203", <incomplete sequence \303>, dst=dst@entry=0x423158 "\274\246\"~\264[R϶\267>\231ũ}\033\254\317\322\037\232t\267cZ]\302\v\305\a[\260!", blocksize=32, blocknumbers=4000, hash=hash@entry=0x4228d8 "sha1") at af.c:145 bad: Breakpoint 1, AF_merge ( src=src@entry=0x4233e8 "\371h\nյ{\372\315 +\243gw<\264\006h\333\372f\355\374\274\265\241\235o\256\001\177\371\223|\021\275\020\016[3\271]8\034\226FN\206\017\245}\r\340\323\361*\v\224v\262\224XNTl\302\024\220\320T\025\235\203", <incomplete sequence \303>, dst=dst@entry=0x4230a8 "c\271t\335\026M<8\243\371\024\276\336\353V+J\033e\353\377R\360~\346[?", blocksize=32, blocknumbers=4000, hash=hash@entry=0x422848 "sha1") at af.c:145 With the same header and keyslot this code is deterministic, so it must always return the same key. And id unsigned char is the problem it should fail every time. If you replace calloc() with e.g. posix_memalign() is it still reproducible? BTW does cryptsetup 1.6.4 work or it is old bug? Anyway, if there is no clear solution soon, Ondra can try to debug it once is back from vacation... ;-) (In reply to Milan Broz from comment #5) > If you replace calloc() with e.g. posix_memalign() is it still reproducible? I didn't specifically try this, but since we fixed valgrind[1] I played around with the --alignment option in valgrind and that made no difference, even forcing large alignment. [1] https://bugs.kde.org/show_bug.cgi?id=337762 > BTW does cryptsetup 1.6.4 work or it is old bug? Both commands fail in 1.6.4. This surely must be an obscure code gen bug ... It could be due to a compiler bug or due to undefined behavior (for example, too big shift count). I suggest - compile cryptsetup with -O0 -fsanitize=undefined (you need at least gcc-4.9 for -fsanitize=undefined). If it works with -O0, the failure is likely caused by optimization bug. If you get some error from the sanitizer, it means that you are doing undefined operation. If it still doesn't work (and doesn't print any error), try compiling the crypto library cryptsetup is using with -O0 -fsanitize=undefined too. Thanks will try this. The next private reply contains details of how Red Hat employees can get access to a machine for dev/testing. (In reply to Mikulas Patocka from comment #7) > If it still doesn't work (and doesn't print any error), try compiling the > crypto > library cryptsetup is using with -O0 -fsanitize=undefined too. Just note: the main point of testing with --with-crypto_backend=kernel is that there is no crypto library used, all crypto operations are done through kernel crypto API socket. So it is probably ideal for testing these bugs because it eliminates linked crypto libraries bugs (well, kernel has some bugs too for sure :). I see that current koji builds are using gcc-4.9, so using that options Mikulas is suggesting is very good idea... When testing, I would suggest to run in loop LUKS device open and header keyslots reencrypt - this will utilise the AF code to the maximum with both write and read operations, IOW run echo "123456" | cryptsetup-reencrypt --keep-key --use-urandom --iter-time 1 <device> echo "123456" | cryptsetup luksOpen --test-passphrase <device> Anyway, I couldn't resist but what's the point in reporting Fedora bug on architecture which package maintainer cannot access because it is under NDA? :-) </rant> The real question is: is it reproducible with ARM AArch64 foundation model (which I hope is available for me)? I'm afraid the sanity checker for aarch64 has only just arrived upstream in gcc, and only works with 4K pages (we use 64K pages). Using -O0 didn't fix it. Currently going through warnings. (In reply to Milan Broz from comment #11) > Anyway, I couldn't resist but what's the point in reporting Fedora bug on > architecture which package maintainer cannot access because it is under NDA? > :-) > </rant> You should be able to buy one real soon. AllWinner are going into full production after October, and there are others coming out sooner. But yes it's not an ideal situation. > The real question is: is it reproducible with ARM AArch64 foundation model > (which I hope is available for me)? Assuming this bug doesn't depend on kernel code (which is unlikely), a much easier and faster way to do it is to follow the instructions here: http://rwmj.wordpress.com/2013/12/22/how-to-run-aarch64-binaries-on-an-x86-64-host-using-qemu-userspace-emulation/#content And BTW you can just use upstream qemu now, don't need to use the special qemu branch. After following Alasdair's suggestion, and tracing AF_merge on x86_64 and aarch64, I've found the point where the trace diverges, for me. I am now using cryptsetup from git with the following command line: sudo LD_LIBRARY_PATH=/home/rjones/d/cryptsetup/lib/.libs libtool --mode=execute gdb --args ./src/cryptsetup luksOpen /dev/fedora/tmptest lukstest I placed a breakpoint at af.c:140. I ran both loops until i = 97 at which point it diverges. ---- On x86_64 ---- Breakpoint 1, AF_merge ( src=0x613b88 "\371h\nյ{\372\315 +\243gw<\264\006h\333\372f\355\374\274\265\241\235o\256\001\177\371\223|\021\275\020\016[3\271]8\034\226FN\206\017\245}\r\340\323\361*\v\224v\262\224XNTl\302\024\220\320T\025\235\203", <incomplete sequence \303>, dst=0x6138b8 "", blocksize=32, blocknumbers=4000, hash=0x612848 "sha1") at af.c:140 140 XORblock(src+(blocksize*i),bufblock,bufblock,blocksize); (gdb) print bufblock $58 = 0x632f90 "\272\241w\337Zx\004\341;i\231*rA\345\f\001l\276\231&ʜ\347%\254\346\256@\nƃ" (gdb) cont Continuing. Breakpoint 1, AF_merge ( src=0x613b88 "\371h\nյ{\372\315 +\243gw<\264\006h\333\372f\355\374\274\265\241\235o\256\001\177\371\223|\021\275\020\016[3\271]8\034\226FN\206\017\245}\r\340\323\361*\v\224v\262\224XNTl\302\024\220\320T\025\235\203", <incomplete sequence \303>, dst=0x6138b8 "", blocksize=32, blocknumbers=4000, hash=0x612848 "sha1") at af.c:140 140 XORblock(src+(blocksize*i),bufblock,bufblock,blocksize); (gdb) print bufblock $59 = 0x632f90 "DW\243 \032%\374\200\276؉\217\017\n\"\332\370\f\022\215\330N\026ٔ{\367\331\063q[\204" (gdb) print i $60 = 97 ---- On aarch64 ---- Breakpoint 1, AF_merge ( src=0x4243e8 "\371h\nյ{\372\315 +\243gw<\264\006h\333\372f\355\374\274\265\241\235o\256\001\177\371\223|\021\275\020\016[3\271]8\034\226FN\206\017\245}\r\340\323\361*\v\224v\262\224XNTl\302\024\220\320T\025\235\203", <incomplete sequence \303>, dst=0x4240a8 "", blocksize=32, blocknumbers=4000, hash=0x423848 "sha1") at af.c:140 140 XORblock(src+(blocksize*i),bufblock,bufblock,blocksize); (gdb) print bufblock $59 = 0x424140 "\272\241w\337Zx\004\341;i\231*rA\345\f\001l\276\231&ʜ\347%\254\346\256@\nƃ" (gdb) cont Continuing. Breakpoint 1, AF_merge ( src=0x4243e8 "\371h\nյ{\372\315 +\243gw<\264\006h\333\372f\355\374\274\265\241\235o\256\001\177\371\223|\021\275\020\016[3\271]8\034\226FN\206\017\245}\r\340\323\361*\v\224v\262\224XNTl\302\024\220\320T\025\235\203", <incomplete sequence \303>, dst=0x4240a8 "", blocksize=32, blocknumbers=4000, hash=0x423848 "sha1") at af.c:140 140 XORblock(src+(blocksize*i),bufblock,bufblock,blocksize); (gdb) print bufblock $60 = 0x424140 "N\374\310\033\273\031OYf\366\231p\265\016\371\006\344\235\tN\317K\221\002\nռB3\005\063\266" (gdb) print i $61 = 97 (gdb) print blocksize $62 = 32 (gdb) print src+(blocksize*i) $63 = 0x425008 "EzBA\"\240(<S\325\062\016\311.\343\"\376\377" (gdb) print hash $64 = 0x423848 "sha1" Of course now after posting that I realize my methodology mistake, which is that 'src' is not completely identical for both inputs. The source string is different starting at AfKey[3087]. All previous bytes in AfKey appear to be identical. It's very strange because we know the preceeding call to LUKS_decrypt_from_storage did not fail. I found out that when we enable unconditionally this fallback piece of code in cryptsetup sources, it works on arm64. I will continue to analyze it tomorrow. /* Fallback to old temporary dmcrypt device */ if (r == -ENOTSUP || r == -ENOENT || 1) return LUKS_endec_template(dst, dstLength, cipher, cipher_mode, vk, sector, read_blockwise, O_RDONLY, ctx); /* Fallback to old temporary dmcrypt device */ if (r == -ENOTSUP || r == -ENOENT || 1) return LUKS_endec_template(src, srcLength, cipher, cipher_mode, vk, sector, write_blockwise, O_RDWR, ctx); (In reply to Mikulas Patocka from comment #17) > I found out that when we enable unconditionally this fallback piece of code > in cryptsetup sources, it works on arm64. I will continue to analyze it > tomorrow. But this is why I asked if it works in 1.6.4 - this part is new code in 1.6.5. So then I guess is bug in crypto_storage.c. Please use upstream git then, there are already some fixes (it should not be relevant here but...:) (Also note kernel selinux bug#1115120) The whole part of this code is replacing encryption of keyslot area by userspace code which utilizes kernel crypto API. (In reply to Milan Broz from comment #18) > (Also note kernel selinux bug#1115120) SELinux is permissive on this machine. Just if Mikulas or anyone has access to this hw: please could you run upstream regression tests? (just "make check" with upstream repo, not only 1.6.5 tar). I added recently some LUKS images which should catch these kind of problems. (But these images are specially modified with zeroed keyslots areas to compress well, so it is kind of synthetic test here.) Anyway, it would be nice to know if this problem have reproducer in tests to not repeat this in future. Thanks! Created attachment 920928 [details]
cryptsetup-test-log.txt
Quite a lot of failures. Also a lot of errors in the kernel
messages which I'll attach in a minute.
This is with cryptsetup.git @ 59fdf2a6bb461a39e6db6b7d515873419f8a8ada
Created attachment 920929 [details]
cryptsetup dmesg output during test
Another odd thing, maybe. When I copied the test LUKS partition (created in comment 0) from the aarch64 machine to an x86-64 machine, I could not open it at all. Is that expected? Or does that indicate that the luksFormat command has created a broken LUKS partition? I did `md5sum' on both the source and destination copies of this partition and they are bit for bit identical. Thanks! ok, something is apparently broken on that arch (that segfault is very strange). I am trying to setup qemu here, but I have quite limited time (and my testing environment is Debian now so will see if it works). Anyway, if read fails, I would expect new LUKS header to be corrupted as well. The error messages in syslog (IO error, corrupted verity device) are expected but the crash I see there appears like a kernel bug. Mikulas can analyse this better with access to this system. (In reply to Milan Broz from comment #24) > ok, something is apparently broken on that arch (that segfault is very > strange). I made a valiant attempt to collect core dumps (for stack traces), but somehow even with ulimit -c unlimited and /proc/sys/kernel/core_pattern set correctly, it's not producing any core dumps. The corruption is in crypt_storage_decrypt (and crypt_storage_encrypt). It receives the same data as on x86-64, yet it produces different result. These functions just call the kernel, so the bug is probably in the kernel. The bug is page-alignment-related. The resulting data is corrupted in 80-byte chunks. The stride between the corrupted chunks is 4096. If I modify cryptsetup so that the buffer passed to crypt_storage_decrypt is 4096-byte aligned, the bug goes away and the volume can be activated. For now, it looks like a kernel bug when accessing the userspace in the AF_ALG subsystem. The reason for the bug is that the code in arch/arm64/crypto/aes-glue.c assumes that the pointer is aligned to AES block size (16 bytes). If it isn't, crash or data corruption happens. Created attachment 921078 [details]
A patch for the bug
Created attachment 921079 [details]
A patch for a similar bug in arm32 code
(In reply to Mikulas Patocka from comment #28) > Created attachment 921078 [details] > A patch for the bug This patch (the arm64 one) fixes the original cryptsetup problem I was having in the libguestfs test suite. Thanks Mikulas! (Mainly the arm32 part which is in stable kernel is scary....) Whatever, do we have some built image with new kernel which can I run in Qemu? (similar to that F190 buildroot image) (I have also schroot for aarch64 but it is unusable for these kind of kernel related tests and qemu complains even for some loop ioctl so cryptsetup tests are not directly usable in this chroot.) (In reply to Milan Broz from comment #31) > Thanks Mikulas! > (Mainly the arm32 part which is in stable kernel is scary....) > > Whatever, do we have some built image with new kernel which can I run in > Qemu? > (similar to that F190 buildroot image) > > (I have also schroot for aarch64 but it is unusable for these kind of kernel > related tests and qemu complains even for some loop ioctl so cryptsetup > tests are not directly usable in this chroot.) Last time I tried it, qemu couldn't successfully emulate a whole aarch64 system, just userspace processes. (This is supposed to be fixed/implemented at some point). (In reply to Richard W.M. Jones from comment #32) > Last time I tried it, qemu couldn't successfully emulate a whole > aarch64 system, just userspace processes. (This is supposed to > be fixed/implemented at some point). Yes, seems so. Cryptsetup tests definitely depend on some kernel ioctl emulations. Just FYI I have hw ARM testing environment but it is stuck with 3.4 kernel for now (one of the first ARM Chromebooks) and all cryptsetup tests works there. So seems we need to run test on ARM64 as well regularly, at least in emulated environment. These patches hit Linus' tree today. They'll be in the -rc7.git1 build tomorrow. |