Str and Buf -- I think I get it now

masak on 2009-07-06T14:06:57

So Str and Buf aren't merely there in Perl 6 to separate out the two related concepts "sequence of characters" and "sequence of bytes", respectively. They're there to institute a Whorfian discipline where it won't even be possible to think the wrong thoughts about strings and sequences of bytes.

This is the natural consequence of working Joel Spolsky's "There Ain't No Such Thing As Plain Text" into the language design. By all means, treat real strings as Str, but make sure byte sequences can't cross that moat without being decoded somehow. Perl 5 totally blurs the distinction, as moritz++ explains. It is not alone among programming languages in that respect. In fact, I'd be interested to hear about some language that makes the same Str/Buf distinction as Perl 6.

It took me a while to reach this point, but now it seems perfectly obvious. I remember, only a few weeks ago, being shaken by TimToady's claim that Strs do not know their byte sequence in the general case. But I see it now. A Perl 6 string is not a sequence of bytes. It's a sequence of characters, at least by default. Likewise, a Buf is not a sequence of characters, not even metaphorically. It's a sequence of integer values. And the difference between the two isn't some picky play with words; it's the encoding/decoding step itself.

This is the act of building knowledge into the class hierarchy of the language itself, so that people's thoughts will be channeled in the right direction. "Arrgh, why can't I get the number of bytes on this Str object? Oh look, the manual says I have to convert to Buf to do that. Oh look, I have to supply an $encoding parameter to do that. I don't see why, but fair enough — if that's the price for using all the other cool stuff in Perl 6."
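To make the "why do I need an encoding to count bytes?" point concrete, here is a minimal sketch in Python 3, used purely as a stand-in because its str/bytes split resembles the Str/Buf one. It is not Perl 6 code, and the sample word is just an illustration. Note that Python's len on a string counts code points, which is not quite the same as the graphemes Perl 6 aims for.

```python
# Python 3 stand-in for the Str/Buf idea; not the Perl 6 API.
text = "na\u00efve"  # "naive" with a precomposed i-diaeresis: a character string

# The byte count only exists once an encoding has been chosen.
utf8 = text.encode("utf-8")
latin1 = text.encode("latin-1")

char_count = len(text)      # counts code points, not bytes
utf8_bytes = len(utf8)      # the i-diaeresis takes two bytes in UTF-8
latin1_bytes = len(latin1)  # and one byte in Latin-1

print(char_count, utf8_bytes, latin1_bytes)
```

The same string yields different byte counts under different encodings, which is why asking a string for "its" byte length has no single answer.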

The battle for encoding-aware programs is won, not necessarily by making the programmers aware of encodings, but by making the language provide primitives that Do The Right Thing.


Graphemes rather than characters

cowens on 2009-07-06T15:21:27

Just a quibble, but I believe it is more correct to say a string in Perl 6 is made up of graphemes rather than characters. The two Perl 5 strings "\x{F6}" (LATIN SMALL LETTER O WITH DIAERESIS) and "\x{6F}\x{308}" (LATIN SMALL LETTER O and COMBINING DIAERESIS) should be the same string in Perl 6 (unless the codes pragma is turned on).
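The two spellings of the same grapheme can be illustrated with a small Python 3 sketch. Python is only a stand-in here: its strings are sequences of code points, so the two forms compare unequal until they are normalized explicitly, which is the step the comment expects Perl 6 to make unnecessary by default.

```python
import unicodedata

precomposed = "\u00f6"  # LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"  # LATIN SMALL LETTER O + COMBINING DIAERESIS

# At the code point level the two strings differ in content and length.
raw_equal = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))

# After NFC normalization both collapse to the same single code point.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

print(raw_equal, lengths, nfc_equal)
```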

Re:Graphemes rather than characters

masak on 2009-07-06T16:28:06

Yes, you are right. The Perl 6 spec is full of references to "characters", but in a few places it mentions that this term defaults to being the same as "graphemes".

I think my growing familiarity with Unicode is not yet at the stage where I immediately reach for the term "graphemes". :) Maybe some day.

Babel-17

jmm on 2009-07-06T17:54:53

This "you can't think the wrong thoughts" target reminds me of the novel Babel-17 by Samuel R. Delany. It uses the same sort of concept, but with a human language that causes anyone who thinks in it to automatically think the right thoughts in a very powerful way.

Re:Babel-17

masak on 2009-09-12T11:48:33

I think the idea is sufficiently old. Umberto Eco details various attempts made over the years in his book The Search for the Perfect Language. Newspeak in 1984 has words selectively pruned from it so that dangerous thought becomes difficult or impossible.

Re:Babel-17

jmm on 2009-09-14T13:28:32

I think it is strongly related to the Sapir-Whorf Hypothesis - http://en.wikipedia.org/wiki/Linguistic_relativity

Re:Babel-17

masak on 2009-09-14T13:52:48

Aye. I even linked to that in my blog post. :)

What of binary manipulation?

Aristotle on 2009-07-06T19:26:09

So if I want to find a “magic number” byte sequence in a binary file, how do I do that in Perl 6?

Re:What of binary manipulation?

masak on 2009-07-06T20:18:53

According to S32/IO, the return type of slurp (which reads a whole file at once) is Str|Buf. A Buf is returned when a parameter :bin is passed to slurp. After that, you can treat the Buf you get as an array (because Buf does Positional), and do whatever advanced indexing operations you need to find your byte sequence.

I wish I could show this with real, working code, but Buf isn't implemented just yet in Rakudo.

Re:What of binary manipulation?

Aristotle on 2009-07-07T00:46:01

What kind of pattern matching facilities does Buf support?

Re:What of binary manipulation?

masak on 2009-07-07T10:54:17

The spec is a bit silent on that point, so I asked on #perl6. The conclusion seems to be "convert it to a string if you want to pattern match".

Then again, if smartmatching with list semantics is what you're after, that should work. Something like $buf ~~ (*, 104, 101, 108, 108, 111, *) to find "hello" in an ASCII-encoded Buf.
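Since Buf wasn't runnable in Rakudo at the time, here is a rough sketch of the same search in Python 3, offered only as an analogy. The sample data is made up; the byte values are the ones from the smartmatch above.

```python
# Python 3 stand-in: bytes plays the role of a Buf read in binary mode.
data = bytes([0, 1, 104, 101, 108, 108, 111, 255])  # made-up file contents
needle = bytes([104, 101, 108, 108, 111])           # "hello" in ASCII

# find returns the offset of the first match, or -1 if there is none.
offset = data.find(needle)
missing = data.find(b"\xde\xad")

# The needle can also be built by encoding a string explicitly.
same_needle = needle == "hello".encode("ascii")

print(offset, missing, same_needle)
```

The search works on integer byte values throughout; a string only enters the picture through an explicit encode step.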

Java does (almost) the same

dakkar on 2009-07-07T08:34:04

In Java, a java.lang.String and a byte[] have nothing in common, and there are no (non-deprecated) ways of converting between them without specifying an encoding.

It's one of the very few features I like about Java…

Re:Java does (almost) the same

Dom2 on 2009-07-09T18:47:03

There are (at least) two things wrong with Java's encoding support:

  1. No way to avoid UnsupportedEncodingException, even for UTF-8, which is guaranteed to be present.
  2. The concept of a "system default encoding" is flawed and leads to bugs in portability. You should be forced to always specify an encoding.

Re:Java does (almost) the same

dakkar on 2009-07-10T07:40:20

  1. true
  2. I thought all methods that converted without specifying an encoding were deprecated… anyway, yes, implicit encodings are a very bad idea

and while we're at it,

3. internal encoding is UTF-16, and it's visible at the language level, so that the "length" method counts 16-bit code units rather than characters, which is completely useless information
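The UTF-16 point can be sketched in Python 3 (again only as a stand-in, not Java code): a character outside the Basic Multilingual Plane takes two 16-bit code units, so a length measured in code units, as Java's String.length() does, disagrees with the number of code points. The sample string is just an illustration.

```python
# 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF, which lies outside the BMP.
text = "a\U0001D11E"

code_points = len(text)  # Python counts code points

# Encode to UTF-16 (little-endian, no byte order mark) and count 16-bit units.
utf16_units = len(text.encode("utf-16-le")) // 2

print(code_points, utf16_units)
```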

Re:Java does (almost) the same

Dom2 on 2009-07-10T08:57:19

Very true about the 16-bit character. Thankfully, it's less of a problem for me right now.