Module name wanted

miyagawa on 2007-03-24T00:37:19

I want a name for my new module, which automatically detects the best conservative encoding to use in email messages, based on the strings themselves.

It'll be useful for encoding an email message in iso-2022-jp if all of its content is in Japanese, iso-2022-kr for Korean, etc. Gmail does this by default: http://mail.google.com/support/bin/answer.py?ctx=%67mail&hl=en&answer=22841

I'm thinking of Encode::Email::Best and Encode::Mail::Traditional. Have a suggestion?


Coordinate with RJBS

Alias on 2007-03-24T01:04:47

He looks after Email:: these days, and probably has the best idea of where it would fit.

Re:Coordinate with RJBS

miyagawa on 2007-03-24T01:20:55

Well, I was thinking about the Email:: namespace at first, but the actual code wouldn't do anything specific to email messages.

It tries to encode the message into each of a fixed set of encodings, ordered from narrow to wide, and checks whether all characters encode safely, using Encode and possibly Dan's Encode::InCharset.

Anyway I'll think about it more.

Encode::First

Juerd on 2007-03-24T02:10:07

What a coincidence. I was planning on writing exactly that, this weekend, inspired by Mutt's send_charset option.

I was going to name it Encode::First, and duplicate Encode's encode interface, but with a colon (or perhaps comma) separated list of encodings, of which the first that supports all codepoints will be used. It would return a two-element list: encoding and byte string.

Typical usage would be:

        my ($enc, $buf) = encode_first('us-ascii:iso-8859-1:iso-8859-15:utf-8', $string);

This would encode "2.5" as ascii, "2½" as latin1, "€ 2,50" as latin9, and "€ 2½" as utf-8.
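
A minimal sketch of that interface, for illustration only. The name encode_first and the colon or comma separated list come from the description above; the use of Encode's FB_CROAK and LEAVE_SRC check flags, and returning an empty list when nothing fits, are assumptions.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

# Hypothetical sketch, not a released module.
# Returns (encoding name, byte string) for the first listed encoding
# that can represent every character, or an empty list if none can.
sub encode_first {
    my ($encodings, $string) = @_;
    for my $enc (split /[:,]/, $encodings) {
        # FB_CROAK makes encode() die on an unmappable character;
        # LEAVE_SRC keeps $string intact for the next attempt.
        my $bytes = eval { encode($enc, $string, FB_CROAK | LEAVE_SRC) };
        return ($enc, $bytes) if defined $bytes;
    }
    return;
}

my ($enc, $buf) = encode_first('us-ascii:iso-8859-1:iso-8859-15:utf-8', "2\x{bd}");
print "$enc\n";
```

Unknown encoding names also make encode() die, so in this sketch they are skipped rather than reported.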

I was also considering optimizing "iso-8859-1:utf8" by simply trying utf8::downgrade with FAIL_OK (ignoring the return value), and then examining the UTF8 flag. This would be the default for when the specified encoding list was empty or undef.
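
Roughly, that shortcut could look like the following. This is only a sketch: the helper name latin1_or_utf8 is made up, and it assumes the input is a character string rather than already-encoded bytes.

```perl
use strict;
use warnings;

# Hypothetical helper name; only the utf8:: calls are real core functions.
sub latin1_or_utf8 {
    my ($string) = @_;          # works on a copy, the caller's string is untouched
    # Second argument is FAIL_OK: do not die when a character is above 0xFF.
    utf8::downgrade($string, 1);
    if (utf8::is_utf8($string)) {
        # Downgrade failed, so the string holds at least one wide character.
        utf8::encode($string);  # in-place conversion to UTF-8 octets
        return ('utf-8', $string);
    }
    # Downgrade succeeded: every character fits in one byte, i.e. Latin-1.
    return ('iso-8859-1', $string);
}

my ($enc, $buf) = latin1_or_utf8("2\x{bd}");
print "$enc\n";
```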

I'd be delighted if you would use this interface and module name; it would save me some trouble, while giving me exactly what I've been wanting all week. :)

Re:Encode::First

miyagawa on 2007-03-24T02:52:37

Oh yeah, I like that interface. Maybe I'll add a utility function that takes the string and an array reference and returns the best encoding, and also provide an encode()-compatible function just as you described. Thanks!

Does it really have anything to do with mail?

Aristotle on 2007-03-24T02:16:17

It seems to me that email is just what you want to use the module for. I don’t see how the module’s operation actually has anything whatsoever to do with email. “Best” doesn’t really say anything; maybe Encode::MinCharsetPicker?

(Btw, I’d have the module only suggest the minimal applicable charset, but not actually do the encoding itself (or only if you ask for it by way of a convenience function). Probably the main function should simply take a list of encodings and then try to pick the applicable encoding with the smallest index.)
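
A picker of that shape might be sketched as below. The function name pick_encoding and its argument order are hypothetical; the trial encode is simply one way to test applicability, and the encoded bytes are thrown away so the caller still holds a character string.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

# Hypothetical sketch: return the name of the first (lowest-index)
# candidate that can represent every character in $string,
# or undef if none of them can.
sub pick_encoding {
    my ($string, $candidates) = @_;
    for my $name (@$candidates) {
        # The eval yields 1 only if encode() did not die under FB_CROAK.
        my $ok = eval { encode($name, $string, FB_CROAK | LEAVE_SRC); 1 };
        return $name if $ok;
    }
    return;
}

my $name = pick_encoding("\x{20ac} 2,50", [qw(us-ascii iso-8859-1 iso-8859-15 utf-8)]);
print "$name\n";
```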

Re:Does it really have anything to do with mail?

miyagawa on 2007-03-24T02:54:13

As I said in the other replies, the actual code doesn't have anything to do with email, except that the default "list of encodings known to be safe in emails" is specific to email (which is obviously the point of this module).

I'd probably make two functions: one compatible with encode() (which does the encoding itself), and another like detect_best_encoding(), which returns the name of the encoding but doesn't do the encoding itself.

Re:Does it really have anything to do with mail?

Juerd on 2007-03-24T11:53:56

The easiest way to detect the "best" encoding would be to just encode it, with a CHECK argument to make it fail if impossible. Why create a utility function to throw away the encoded string, if the user can easily choose to do so himself?

        my ($enc) = encode_first(...);

Or, have you found another efficient way of finding a suitable encoding?

Re:Does it really have anything to do with mail?

miyagawa on 2007-03-24T12:03:54

Yes, I was thinking of the exact same logic, as well as using charset tables like the one used in Encode::InCharset. I prefer the easiest, if not the most efficient, so I guess that'll be the same as what you described.

The reason we want the encoding itself back is that we'd like to use it in the email header. If we return only the encoded string, the caller doesn't know which encoding it's actually encoded in.
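
For instance, the caller might plug the returned name straight into the Content-Type header. A small sketch, with a made-up helper name and a plain-text body assumed:

```perl
use strict;
use warnings;

# Hypothetical helper: turn the encoding name returned by the picker
# into a Content-Type header line for a plain-text message.
sub content_type_header {
    my ($charset) = @_;
    return sprintf 'Content-Type: text/plain; charset="%s"', $charset;
}

print content_type_header('iso-2022-jp'), "\n";
```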

Re:Does it really have anything to do with mail?

Aristotle on 2007-03-24T12:49:43

The reason I suggested that sort of interface is that some APIs expect to receive character strings that they will then encode themselves; XML serialisers come to mind. In such a case, giving the caller an encoded string is pretty useless.

Re:Does it really have anything to do with mail?

Juerd on 2007-03-24T11:56:12

Oh, and a list of several email-safe country-specific encodings is of course more common than latin1:utf8, and would make a better default.

Minimal Enclosing

bart on 2007-03-24T20:25:18

I don't really have a good name for your module, but I'm going to put an image in your head, the image I see when I read about your intentions and cut out all the fluff.

What you appear to want is to find the smallest possible character set that contains every single character in your text. That seems related to finding a minimum-size geometric shape that contains every vertex in a set. Terms that spring to mind are minimal enclosing circle or rectangle; the latter is also known as a bounding box.

It's up to you to distill a proper name out of this, but I'm somewhat thinking in the direction of "Minimal Enclosing Charset". Heh.

You have seen this?

jhi on 2007-03-27T02:11:11

http://www.eki.ee/letter/