Bug 40553 - Unnecessary conversion of upper case letters to lower case
Summary: Unnecessary conversion of upper case letters to lower case
Status: RESOLVED MOVED
Alias: None
Product: poppler
Classification: Unclassified
Component: pdftohtml
Version: unspecified
Hardware: Other Linux (All)
Importance: medium normal
Assignee: poppler-bugs
 
Reported: 2011-09-01 08:16 UTC by Piotr Bolek
Modified: 2018-08-20 21:34 UTC

Attachments
Fragment of a PDF file in which capital letters are wrongly converted to lowercase (36.39 KB, application/octet-stream)
2011-09-01 08:20 UTC, Piotr Bolek

Description Piotr Bolek 2011-09-01 08:16:10 UTC
Upper case letters are sometimes converted to lower
case. The bug occurs when converting to HTML as well
as to XML.

Conversion to plain text with pdftotext produces correct
results.

An example PDF file is attached.

The first paragraph of the conversion results, as XML (wrong) and as
text (correct, from pdftotext), is shown below:

=============================================================================
XML:
=============================================================================
<text top="67" left="85" width="333" height="21" font="1">wybrałem się więc z całą rodziną do jednego z tych </text>
<text top="85" left="68" width="350" height="21" font="1">sklepów z prezentami-duperelami, gdzie zapach różnych </text>
<text top="103" left="68" width="347" height="21" font="1">suszonych pachnidełek jest tak intensywny, że przypra-</text>
<text top="122" left="68" width="350" height="21" font="1">wia o zeza. nie bacząc na to, że dzieci leżały na podłodze </text>
<text top="140" left="68" width="347" height="21" font="1">dusząc się, spędzałem w sklepie kolejne godziny wybie-</text>
<text top="159" left="68" width="350" height="21" font="1">rając sobie stołek za mały lub w złym kolorze, tak, żebym </text>
<text top="177" left="68" width="350" height="21" font="1">mógł  stracić  jeszcze  trochę  czasu  na  jego  zwrot  do  </text>
<text top="196" left="68" width="43" height="21" font="1">sklepu.</text>

=============================================================================
TEXT:
=============================================================================
   Wybrałem się więc z całą rodziną do jednego z tych
sklepów z prezentami-duperelami, gdzie zapach różnych
suszonych pachnidełek jest tak intensywny, że przypra-
wia o zeza. Nie bacząc na to, że dzieci leżały na podłodze
dusząc się, spędzałem w sklepie kolejne godziny wybie-
rając sobie stołek za mały lub w złym kolorze, tak, żebym
mógł stracić jeszcze trochę czasu na jego zwrot do
sklepu.
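
The comparison above can be automated roughly as follows. This is only a sketch for locating affected words, not part of poppler: the helper names `xml_words` and `case_only_mismatches` are made up for illustration, and it assumes both tools emit the same words in the same order, so that a plain word-by-word pairing is meaningful.

```python
import xml.etree.ElementTree as ET


def xml_words(xml_fragment: str) -> list[str]:
    """Return the whitespace-separated words found in <text> elements."""
    # Wrap the fragment in a dummy root so a bare run of <text> elements parses.
    root = ET.fromstring(f"<page>{xml_fragment}</page>")
    words = []
    for node in root.iter("text"):
        # itertext() also collects text nested in child tags such as <b> or <i>.
        words.extend("".join(node.itertext()).split())
    return words


def case_only_mismatches(xml_fragment: str, plain_text: str) -> list[tuple[str, str]]:
    """Pair words positionally and keep the pairs that differ only in case."""
    # Assumption: the XML output and the pdftotext output contain the same
    # words in the same order; zip() stops at the shorter of the two lists.
    pairs = zip(xml_words(xml_fragment), plain_text.split())
    return [(x, t) for x, t in pairs if x != t and x.casefold() == t.casefold()]
```

Feeding it the XML and TEXT excerpts above would be expected to flag the sentence-initial words whose capital was lost.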
Comment 1 Piotr Bolek 2011-09-01 08:20:28 UTC
Created attachment 50815
Fragment of a PDF file in which capital letters are wrongly converted to lowercase
Comment 2 Nick Ruffilo 2012-12-07 17:46:39 UTC
I can also confirm that this bug exists in pdftohtml version 0.21.0 for Linux (static build). This is quite an annoyance, as I'm converting 100+ page PDFs and have to check and fix capitalization manually.

While I can't provide sample PDFs, I can say that some fonts produce this effect more than others, but the behavior isn't consistent: sometimes capitals are preserved and sometimes not.
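
Until the bug is fixed, the manual check could in principle be replaced by copying the casing from pdftotext output, which is reported above to be correct. A minimal workaround sketch, assuming the two outputs differ mainly in letter case; `restore_case` is a hypothetical helper, not a poppler feature, and it collapses runs of whitespace:

```python
from difflib import SequenceMatcher


def restore_case(damaged: str, reference: str) -> str:
    """Copy word casing from `reference` onto the matching words of `damaged`."""
    dmg = damaged.split()
    ref = reference.split()
    # Align the two word lists case-insensitively.
    matcher = SequenceMatcher(
        a=[w.casefold() for w in dmg],
        b=[w.casefold() for w in ref],
        autojunk=False,
    )
    fixed = list(dmg)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            # Matched words are equal ignoring case, so take the reference form.
            fixed[block.a + k] = ref[block.b + k]
    # Words without a counterpart in the reference are left unchanged.
    return " ".join(fixed)
```

Here `damaged` would be the text taken from one pdftohtml element and `reference` the corresponding pdftotext passage. Anything the alignment cannot match is left as it was, so a manual review is still advisable.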
Comment 3 GitLab Migration User 2018-08-20 21:34:31 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed to further activity.

You can subscribe to and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/poppler/poppler/issues/22.

