Bug 1171670 - Fails to build document in a non-ASCII directory
Status: NEW
Alias: None
Product: Publican
Classification: Community
Component: publican   
Version: 4.2
Hardware: Unspecified
OS: Unspecified
Assignee: Carlos Munoz
QA Contact: Ruediger Landmann
Reported: 2014-12-08 10:50 UTC by Raphaël Hertzog
Modified: 2018-09-21 23:05 UTC (History)
Doc Type: Bug Fix
Type: Bug

Description Raphaël Hertzog 2014-12-08 10:50:38 UTC
My book directory is ~/écriture/systemd (note the "é", which is not ASCII). When I try to build my book, it fails with this:

$ LANG=C publican build --formats=html --langs=en-US
FATAL ERROR: en-US/Book_Info.xml: No such file or directory at /usr/lib/x86_64-linux-gnu/perl5/5.20/XML/Parser/Expat.pm line 470.
 at /usr/bin/publican line 993.

I changed the Publican->new call on line 993 to contain "debug => 1" in the parameters, but it doesn't print any additional information.

With strace I get a confirmation that the problem is due to mishandling of the path name (besides the fact that it builds if I copy it to /tmp):

stat("en-US/Book_Info.xml", {st_mode=S_IFREG|0644, st_size=1426, ...}) = 0
open("en-US/Book_Info.xml", O_RDONLY)   = 4
ioctl(4, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7fff9dc9af40) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(4, 0, SEEK_CUR)                   = 0
fstat(4, {st_mode=S_IFREG|0644, st_size=1426, ...}) = 0
fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
brk(0x499a000)                          = 0x499a000
read(4, "<?xml version='1.0' encoding='ut"..., 8192) = 1426
read(4, "", 8192)                       = 0
getcwd("/home/rhertzog/\303\251criture/systemd", 4095) = 33
open("/home/rhertzog/\303\203\302\251criture/systemd/en-US/systemd-survival-guide.ent", O_RDONLY) = -1 ENOENT (No such file or directory)
close(4)                                = 0
write(2, "FATAL ERROR: en-US/Book_Info.xml"..., 162FATAL ERROR: en-US/Book_Info.xml: No such file or directory at /usr/lib/x86_64-linux-gnu/perl5/5.20/XML/Parser/Expat.pm line 470.
 at /usr/bin/publican line 993.
) = 162

This is with publican 3.2.6, libxml-parser-perl 2.41-3, libexpat1 2.1.0-6+b3.

I'm not sure that publican is at fault. In fact, it might well be XML::Parser::Expat... if that's the case, it would be nice if you can figure out some simple test case and submit it to the XML::Parser::Expat upstream developers.
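
A minimal sketch of the double encoding visible in the strace output above, written without touching the filesystem. It assumes the entity file name reaches the path-joining code as a Perl character string (internal UTF-8 flag set) while the current directory is raw bytes. The variable names are illustrative and nothing here is taken from Publican or XML::Parser.

```perl
use strict;
use warnings;

# Raw bytes as returned by getcwd(): "é" is \303\251 in UTF-8.
my $cwd_bytes = "/home/rhertzog/\xc3\xa9criture/systemd";

# ASCII-only name, but flagged as a character string (assumption: this is
# how a value decoded from the XML document would arrive).
my $sysid = "en-US/systemd-survival-guide.ent";
utf8::upgrade($sysid);

# Joining bytes with a flagged string upgrades the bytes as if they were
# Latin-1, so \xc3 and \xa9 become the characters U+00C3 and U+00A9.
my $joined = "$cwd_bytes/$sysid";

# Serialise the character string to UTF-8 octets to see the byte sequence
# that the internal buffer would contain.
my $octets = $joined;
utf8::encode($octets);

# Render non-printable bytes as octal escapes, strace style.
(my $escaped = $octets) =~ s/([^\x20-\x7e])/sprintf("\\%03o", ord($1))/ge;
print "$escaped\n";
```

The printed path contains \303\203\302\251 where the directory on disk has \303\251, which matches the failing open() call in the trace.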

Comment 1 Jeff Fearn 🐞 2014-12-08 22:44:54 UTC
Could be a bug in XML::TreeBuilder where it uses File::Spec->rel2abs

Comment 2 Raphaël Hertzog 2014-12-09 07:28:01 UTC
Yeah, I remember we already had a similar issue in the past (non-regression tests... :)), though I couldn't find the bug reference.

Comment 3 Jeff Fearn 🐞 2014-12-09 22:32:00 UTC
648126 & 875021 are similar, but they are publican specific. This looks to be deeper in the stack, and while I'm happy to force UTF8 always on in publican I'm not so sure it's a good idea to do it deeper in the stack.

Comment 4 Raphaël Hertzog 2014-12-17 09:08:25 UTC
So your guess about XML::TreeBuilder was right. I fixed the problem by replacing the line with the File::Spec->rel2abs() call with this:

  my $relpath = $directories ? File::Spec->catfile($directories, $sysid) : $sysid;
  $file = decode_utf8(abs_path(encode_utf8($relpath)));

This requires adding "use Cwd qw(abs_path);" and "use Encode;" at the top. I'm not sure what's the best way forward.

Comment 5 Raphaël Hertzog 2014-12-17 09:59:30 UTC
It looks like I have a better solution. The problem comes from the fact that we pass a UTF-8 Perl string to rel2abs when all the filesystem functions really want raw bytes (independent of encoding).

I just added the following two lines before the rel2abs call:

  $sysid = encode_utf8($sysid) if utf8::is_utf8($sysid);
  $directories = encode_utf8($directories) if utf8::is_utf8($directories);

And it works!

Then I wanted to test what happens if "$directories" contains a non-ASCII character, so I edited publican.cfg to say « xml_lang: "ené-US" » and called « publican build --formats=html --langs=ené-US » and it failed, but not at the above location:

DEBUG: Publican: config loaded
Setting up ené-US
	Processing file tmp/ené-US/xml/Common_Content/Conventions.xml -> tmp/ené-US/xml/Common_Content/Conventions.xml
Can't open file 'tmp/ené-US/xml/Common_Content/Conventions.xml' Couldn't open tmp/ené-US/xml/Common_Content/Conventions.xml:
No such file or directory at /usr/share/perl5/XML/TreeBuilder.pm line 315.
 at /usr/share/perl5/Publican/Builder/DocBook5.pm line 480.

So any parameter that you use to build a path should really be passed through encode_utf8() first if it's read/received as a UTF-8 string.
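
One way to apply that rule consistently is a small wrapper around the same check used in the two-line fix above. This is only a sketch: path_bytes is a hypothetical helper name, not something that exists in Publican or XML::TreeBuilder, and the example assumes a Unix-style path separator.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use File::Spec;

# Hypothetical helper: return the UTF-8 octets for a path component that
# arrived as a flagged character string, and leave raw bytes untouched.
sub path_bytes {
    my ($path) = @_;
    return utf8::is_utf8($path) ? encode_utf8($path) : $path;
}

# "ené-US" as a character string, as it might come out of publican.cfg
# (assumption for illustration).
my $lang = "en\x{e9}-US";
utf8::upgrade($lang);

# Encode before building the path, so every component is plain bytes.
my $file = File::Spec->catfile('tmp', path_bytes($lang), 'xml', 'Book_Info.xml');

printf "flagged: %s, length in bytes: %d\n",
    (utf8::is_utf8($file) ? 'yes' : 'no'), length($file);
```

Note that utf8::is_utf8 only inspects Perl's internal flag, not the content, so a Latin-1 string that was never upgraded passes through this helper unchanged.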

Comment 6 Raphaël Hertzog 2014-12-17 10:00:17 UTC
Obviously you need "use Encode qw(encode_utf8);" for the above fix to work.

Comment 7 Jeff Fearn 🐞 2014-12-17 22:26:39 UTC
(In reply to Raphaël Hertzog from comment #4)
> I'm not sure what's the best way forward.

Just open a CPAN-RT against XML::TreeBuilder and make a pull request against https://github.com/jfearn/XML-TreeBuilder if you are keen.

I'm on good terms with the upstream guy, most of the time anyway, so it shouldn't be a problem to get a patched version out soon! ;)

Comment 8 Raphaël Hertzog 2014-12-19 08:13:30 UTC
:-)

Both done:
https://github.com/jfearn/XML-TreeBuilder/pull/1
https://rt.cpan.org/Ticket/Display.html?id=101006

Note that you might want to review Publican's code to make sure that you don't feed Unicode strings down to other modules which expect paths, because everything that comes from configuration files (or even the command line) seems to end up as a Unicode string.

