Multi-byte Unicode and PDF

ddick on 2008-11-02T08:55:58

After an excellent talk on the Encode module by Stephen Edmonds and a close encounter with the µ symbol, I've been playing with the various Unicode symbols a lot.

A missing component seems to be putting multi-byte unicode into a PDF document. Try this little beggar 狗 on for size.

PDF::API2, htmldoc and html2ps all seem to have problems with characters when the encoding uses more than one byte to represent a single character.
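
As a rough illustration of what "more than one byte" means here, the sketch below uses the core Encode module to count how many bytes a few characters occupy once encoded as UTF-8. The helper name utf8_byte_length is made up for this example and is not part of any module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of bytes a character string occupies in UTF-8.
sub utf8_byte_length {
    my ($chars) = @_;
    return length(encode('UTF-8', $chars));
}

binmode STDOUT, ':encoding(UTF-8)';

# "A" (U+0041), the micro sign (U+00B5) and the CJK character from the post (U+72D7).
for my $char ("A", "\x{B5}", "\x{72D7}") {
    printf "U+%04X %s: %d byte(s) in UTF-8\n",
        ord($char), $char, utf8_byte_length($char);
}
```

The three characters come out at 1, 2 and 3 bytes respectively, so any tool that assumes one byte per character will mangle the last two.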

Anyone know a tool that can do the job?

Funny that you bring this up now...

bart on 2008-11-02T12:38:01

I spent a few days just this week pinning down a bug in PDF::API2 that prints some characters from Unicode strings (in this case, just the Polish character set) on top of each other.

I nailed it. I just have to feed it back to its maintainers.

I have not tried it with your character. I don't even have a clue which character set it belongs to (something CJK, but that's all I know), and thus which font I ought to use.

Re:Funny that you bring this up now...

srezic on 2009-03-15T18:07:18

"I spent a few days just this week pinning down a bug in PDF::API2 that prints some characters from Unicode strings (in this case, just the Polish character set) on top of each other.
I nailed it. I just have to feed it back to its maintainers."
I just stumbled over the same problem, using Croatian characters. Do you have a patch, solution or workaround? If so, can you send it to me?

Re:Funny that you bring this up now...

bart on 2009-03-17T07:51:11
When was the last time you updated PDF::API2? It is fixed in the latest release on CPAN (0.72).

Re:Funny that you bring this up now...

srezic on 2009-03-18T22:24:07
Indeed. I first tried Debian's current package, which has $PDF::API2::VERSION set to 2.015, just like the current CPAN version, so I thought it was the current one. But Debian's libpdf-api2-perl is only based on 0.69.

PDF::Reuse can do Unicode

grantm on 2008-11-03T23:45:50

The maintainer of PDF::Reuse accepted my patch to add this functionality earlier this year.

It's my understanding that if you stick to the built-in PDF fonts you're stuck with characters in the Latin-1 range (roughly speaking). You have to use embedded fonts to get at Unicode characters outside that range.
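
For illustration, here is a minimal sketch of the embedded-font approach. It uses PDF::API2 rather than PDF::Reuse, simply because its ttfont method is the embedding call I can vouch for. The sub name write_unicode_pdf, the font path and the output file name are all assumptions for the example; you need a TrueType font that actually contains the glyphs you want.

```perl
use strict;
use warnings;
use PDF::API2;

# Hypothetical helper: write $string to a one-page PDF using an embedded
# TrueType font. $font_file must point to a font containing the glyphs.
sub write_unicode_pdf {
    my ($font_file, $out_file, $string) = @_;

    my $pdf  = PDF::API2->new();
    my $page = $pdf->page();
    $page->mediabox('A4');

    my $font = $pdf->ttfont($font_file);   # embeds the font in the document
    my $text = $page->text();
    $text->font($font, 24);                # 24 point
    $text->translate(72, 700);             # position in points from bottom left
    $text->text($string);                  # a Perl character string, not bytes

    $pdf->saveas($out_file);
    return $out_file;
}

# Only runs when a font file and an output file are given on the command line,
# e.g.  perl unicode_pdf.pl /path/to/cjk-font.ttf dog.pdf
write_unicode_pdf($ARGV[0], $ARGV[1], "\x{72D7}") if @ARGV == 2;
```

Whether the character renders still depends on the font: if the font has no glyph for it, the PDF will typically show an empty box or nothing at all.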

Re:PDF::Reuse can do Unicode

ChrisDolan on 2008-11-04T05:50:43

That's correct. Appendix D of the PDF Reference explicitly lists the minimum glyphs that must be supported in the 14 standard fonts.
That said, I would not be surprised if non-Latin-1 Unicode characters worked fine in one of the basic fonts on a recent mainstream OS. To get Unicode into strings, you may need to use the hex notation (angle brackets).
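
To make the angle-bracket notation concrete, the sketch below builds a PDF hex string from a Perl character string, encoded as UTF-16BE with a leading byte order mark. That is the form PDF uses for Unicode "text strings" such as document information entries and bookmark titles. The helper name pdf_hex_string is made up for this example. For text drawn on a page, the bytes that belong inside the brackets depend on the encoding of the selected font, so treat this as an illustration of the notation rather than a recipe for page content.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: PDF hex string (angle brackets) holding the text as
# UTF-16BE, prefixed with the byte order mark FE FF.
sub pdf_hex_string {
    my ($chars) = @_;
    my $bytes = encode('UTF-16BE', "\x{FEFF}" . $chars);
    return '<' . uc(unpack('H*', $bytes)) . '>';
}

print pdf_hex_string("\x{72D7}"), "\n";   # prints <FEFF72D7>
```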