Bug 875125

Summary: epub output includes files that are not needed (and not listed in the OPF file)
Product: [Community] Publican
Reporter: Raphaël Hertzog <raphael>
Component: publican
Assignee: Jeff Fearn 🐞 <jfearn>
Status: CLOSED CURRENTRELEASE
QA Contact: tools-bugs <tools-bugs>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.0
CC: rlandman+disabled, rlandman
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: 4.0.0
Doc Type: Bug Fix
Last Closed: 2013-12-19 02:46:37 UTC
Type: Bug

Description Raphaël Hertzog 2012-11-09 15:20:11 UTC
Publican's epub output includes all the files in the "images" directory, even those that are not used (for instance, I have .dia files used to generate some .png files, and those .dia files end up in the .epub). The same is true for the HTML output, but at least there those files are never downloaded by the end user. In the epub case, since everything is in a single archive, they inflate the size of the file for no good reason.

Furthermore, those files are not listed in the OPF file, so epubcheck 3.0-RC1 emits warnings such as:

WARNING: /home/rhertzog/x/tdah/publish/en-US/Debian/6.0/epub/debian-handbook/Debian-6.0-debian-handbook-en-US.epub: item (OEBPS/images/etude-cas.dia) exists in the zip file, but is not declared in the OPF file

This error has been reproduced with the Debian Handbook:
$ git clone git://anonscm.debian.org/debian-handbook/debian-handbook.git
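
The check behind that warning can be approximated in a few lines. The following is a minimal sketch, not epubcheck's actual implementation: it assumes a standard EPUB layout where META-INF/container.xml points at the OPF file, and the function name `undeclared_items` is made up for illustration.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

OPF_NS = "{http://www.idpf.org/2007/opf}"
CONTAINER_NS = "{urn:oasis:names:tc:opendocument:xmlns:container}"


def undeclared_items(epub_path):
    """Return zip entries that the OPF manifest does not declare.

    Hypothetical helper for illustration; not part of publican or epubcheck.
    """
    with zipfile.ZipFile(epub_path) as zf:
        # META-INF/container.xml points at the OPF package document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(f".//{CONTAINER_NS}rootfile").get("full-path")
        opf_dir = posixpath.dirname(opf_path)

        # Manifest hrefs are relative to the directory holding the OPF file.
        opf = ET.fromstring(zf.read(opf_path))
        declared = {
            posixpath.normpath(posixpath.join(opf_dir, unquote(item.get("href"))))
            for item in opf.iter(f"{OPF_NS}item")
        }

        # Skip directories, the mimetype marker, META-INF, and the OPF itself.
        return sorted(
            name
            for name in zf.namelist()
            if not name.endswith("/")
            and name != "mimetype"
            and not name.startswith("META-INF/")
            and name != opf_path
            and name not in declared
        )
```

Calling `undeclared_items()` on the path of a built .epub returns the archive entries that would trigger the "exists in the zip file, but is not declared in the OPF file" warning.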

Comment 1 Jeff Fearn 🐞 2013-07-09 07:20:25 UTC
We used to have a feature that removed unreferenced images from the output, and it resulted in a lot of complaints.

So much so that we reduced it to:

$ publican print_unused_images
List of unused Image files in en-US

pondering ...
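
For readers unfamiliar with the idea, a report of unused images can be sketched as follows. This is not how `publican print_unused_images` is implemented; it is an illustration that assumes images are referenced only through `fileref="images/..."` attributes in the XML sources, with paths relative to the language directory. The function name `unused_images` is hypothetical.

```python
import os
import re

# Assumption: images are referenced via fileref="..." attributes in the XML.
FILEREF = re.compile(r'fileref\s*=\s*["\']([^"\']+)["\']')


def unused_images(lang_dir):
    """List files under <lang_dir>/images that no XML source references."""
    referenced = set()
    for root, _dirs, files in os.walk(lang_dir):
        for name in files:
            if not name.endswith(".xml"):
                continue
            with open(os.path.join(root, name), encoding="utf-8") as fh:
                for ref in FILEREF.findall(fh.read()):
                    # Assumption: filerefs are relative to the language dir.
                    referenced.add(os.path.normpath(os.path.join(lang_dir, ref)))

    unused = []
    for root, _dirs, files in os.walk(os.path.join(lang_dir, "images")):
        for name in files:
            path = os.path.normpath(os.path.join(root, name))
            if path not in referenced:
                unused.append(os.path.relpath(path, lang_dir))
    return sorted(unused)
```

A scan like this only sees references made from XML. Images pulled in from CSS or scripts would be reported as unused, which is one way a removal feature can produce false positives.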

Comment 2 Raphaël Hertzog 2013-07-09 07:32:10 UTC
I don't know what those complaints were… but if they were legitimate, then maybe it means that we need some options?

Comment 3 Jeff Fearn 🐞 2013-07-09 07:44:28 UTC
In the specific case of epubs the inclusion is invalid and unused files should be excluded.

Comment 4 Jeff Fearn 🐞 2013-07-09 07:45:07 UTC
(In reply to Jeff Fearn from comment #3)
> In the specific case of epubs the inclusion is invalid and unused files
> should be excluded.

In the specific case of epubs the inclusion is invalid and unused images should be excluded.


Comment 5 Jeff Fearn 🐞 2013-09-30 06:41:30 UTC
The approach I'm going to take here is to add every file in the OEBPS directory to the manifest. To remove the unused files instead, we'd need to parse every XML, CSS, and JavaScript file to find out what is referenced. I don't think that is realistic.
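
As an illustration of the "list everything" approach, here is a minimal sketch in Python. Publican itself is not written in Python, and this is not the actual fix: the function name `manifest_items`, the generated ids, and the fallback media type are all assumptions made for the example.

```python
import mimetypes
import os
from xml.sax.saxutils import quoteattr


def manifest_items(oebps_dir, opf_name="content.opf"):
    """Build one OPF <item> element per file found under oebps_dir.

    Hypothetical helper; ids and the fallback media type are illustrative.
    """
    items = []
    for root, dirs, files in os.walk(oebps_dir):
        dirs.sort()  # walk subdirectories in a stable order
        for name in sorted(files):
            rel = os.path.relpath(os.path.join(root, name), oebps_dir)
            href = rel.replace(os.sep, "/")  # manifest hrefs use forward slashes
            if href == opf_name:
                continue  # the package document does not list itself
            media_type = mimetypes.guess_type(name)[0] or "application/octet-stream"
            item_id = "item%d" % (len(items) + 1)
            items.append(
                "<item id=%s href=%s media-type=%s/>"
                % (quoteattr(item_id), quoteattr(href), quoteattr(media_type))
            )
    return items
```

Because every file on disk gets an entry, the manifest and the archive contents cannot drift apart, at the cost of still shipping files that nothing references.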

Comment 6 HSS Product Manager 2013-09-30 06:47:40 UTC
HSS-QE has reviewed and declined this request. QE for this bug will be handled by IED.

Comment 7 Jeff Fearn 🐞 2013-10-01 00:07:26 UTC
Made the code include all files in the manifest list. Excluding unused content from the output is a more generic problem, as it affects all HTML output.

To ssh://git.fedorahosted.org/git/publican.git
   8711fbe..9d087c5  HEAD -> devel

Comment 8 Ruediger Landmann 2013-10-11 01:12:37 UTC
Unused images still seem to get included by publican-3.9.9-0.fc19.t4.noarch

The images directory below contains two images, one used and one not. Both of them get included in the .epub:

$ ls en-US/images/
powertop.png  Sun_Conure_on_perch.jpg

$ publican build --formats epub --langs en-US

$ ls tmp/en-US/epub/OEBPS/images/
powertop.png  Sun_Conure_on_perch.jpg

$ unzip -l tmp/en-US/Red_Hat_Enterprise_Linux-6-Power_Management_Guide-en-US.epub |grep OEBPS/images
   125512  10-11-2013 11:10   OEBPS/images/powertop.png
  3362485  10-11-2013 11:10   OEBPS/images/Sun_Conure_on_perch.jpg

Comment 9 Jeff Fearn 🐞 2013-10-11 01:38:01 UTC
To clarify on #7, the fix in this bug is to correctly list all files shipped. A fix for shipping unused files is a much larger issue that covers more than epubs, and it won't be addressed in this bug.

Comment 10 Ruediger Landmann 2013-10-12 03:49:15 UTC
In that case, in an EPUB built with publican-3.9.9-0.fc19.t4.noarch with an unused image:

$ unzip -l tmp/en-US/Red_Hat_Enterprise_Linux-6-Power_Management_Guide-en-US.epub |grep OEBPS/images
   125512  10-12-2013 13:41   OEBPS/images/powertop.png
  3362485  10-12-2013 13:41   OEBPS/images/Sun_Conure_on_perch.jpg

$ grep powertop tmp/en-US/epub/OEBPS/content.opf


$ grep perch tmp/en-US/epub/OEBPS/content.opf

confirm that both these files are listed in the OPF file; and

$ java -jar epubcheck-3.0.1.jar /home/rlandmann/Documents/books/rhel/Power_Management_Guide/trunk/6-trunk/tmp/en-US/Red_Hat_Enterprise_Linux-6-Power_Management_Guide-en-US.epub

is clean