A simple `unfold` in Perl

Aristotle on 2008-11-08T16:49:41

For many years, a very simple issue about writing an unfold in Perl 5 has stymied me.

(An unfold is the opposite of fold, of course. The latter is better known to Perl programmers as reduce from List::Util, which is a function that takes a block and a list and returns a single value by repeatedly applying the block to successive values. An unfold does the opposite: it takes a block and a single value and then repeatedly applies the block to the value, accumulating the return values to eventually return them as a list.)

A good unfold as I want it will allow the same thing as map does: for any iteration to return any number of elements, including zero. Also, the elements returned from an iteration must be passed out of the block as its return value for that iteration – otherwise you might as well write a conventional loop instead.

But then you have a problem: how do you know when the block is finished generating output? There is no possible return value you could use for this. The Python implementations of this function that I’ve seen invariably expect an exception to be thrown to signal the end of the iteration, which struck me as profoundly un-Perlish. But how else do you signal the termination condition out of band?

I could think of no sensible solution given the combination of constraints that I adopted. Until an hour ago, that is, when a flash of inspiration struck me.

I’ll let the code speak for itself.

sub induce (&$) {
    my ( $c, $v ) = @_;
    my @r;
    for ( $v ) { push @r, $c->() while defined }
    @r;
}

In this version, the initial scalar value is made available to the block in $_ (somewhat cleverly done via for, which is necessary because local $_ has lots of subtle pitfals), and the block is called repeatedly as long as $_ is defined. The block therefore signals the detection of its exit condition by undefining $_.

It’s that simple. It’s so simple I cannot believe it took me so many years to think of it, but there it is.

Here is a silly example:

my @reversed = induce { @$_ ? pop @$_ : undef $_ } [ 1 .. 10 ];

Of course, the eagle-eyed will immediately notice a problem: this produces 11 return values, the last one being the undef that got returned from undef $_. The fact that induce will collect the value of the final iteration rather than throw it away was a conscious design decision: there are two cases for the return value of the final iteration, either it is useful or not. What happens if it it’s not useful, but would be collected? Then you have to suppress it, which is easy. What happens if it’s useful, but would be thrown away? Then you have to arrange for the block to remember state so it can return the useful value first and then signal the termination condition on its next invocation. Since it is so much easier to suppress useless values that would be kept than to retain useful values that would be dropped, I chose to have induce keep the final iteration’s value.

There are various ways to arrange the suppression in the above example. One of them is to check more eagerly for whether another iteration will be necessary:

my @reversed = induce { my @l = @$_ ? pop @$_ : (); @$_ or undef $_; @l } [ 1 .. 10 ];

That is clearly extremely awkward. A simpler (and insignificantly less efficient) approach is to suppress the value returned by the undef function:

my @reversed = induce { @$_ ? pop @$_ : ( undef $_, return ) } [ 1 .. 10 ];

That’s much better, but still not pretty. In particular, that return can be awkward to place due to precedence rules. In the above example it requires those annoying parens. (Try it: the code compiles but breaks without them.) Instead, we’ll take a page from Javascript and write another function:

sub void {}

No, seriously. The point of this function is, well, to take any arguments you pass it, throw them away, do nothing, and return nothing – most importantly, to return an empty list in list context.

I admit that writing this, err, function greatly amused me. But the end result is quite satisfying:

my @reversed = induce { @$_ ? pop @$_ : void undef $_ } [ 1 .. 10 ];

So these two functions, induce (named like this instead of unfold, of course, to contrast with List::Util’s reduce) and void, will probably start appearing in my small scripts from now on.


Since $_ is global…

phaylon on 2008-11-17T16:30:20

couldn't you write…

my @reversed = induce { @$_ ? pop @$_ : end_of_fold } [ 1 .. 10 ];

…and have end_of_fold undef $_?

If you used a different localised variable, you might also be able to unfold undef values. But I don't really know if that is wanted.

--phaylon

Re:Since $_ is global…

Aristotle on 2009-01-16T13:00:51

That makes it seem like the end_of_fold function is somehow special. I’d prefer people who use this to understand exactly how it works, and avoiding low-gain abstractions helps.

returning refs?

Eric Wilhelm on 2009-01-16T01:06:53

Why not return []'s which get flattened into the output until undef is returned?

    my @rev = induce {@$_ ? pop @$_ : undef} [1..10];

But maybe I'm missing something as to why the prototype is &$ instead of &@, or am just wanting a better example? I would think that unfold might be more like:

    my @flat = map({map({$_} @$_)} [1..10],[12..20]);

or thereabouts - and the 'reverse' example does not clarify that for me.

Re:returning refs?

Aristotle on 2009-01-16T13:21:32

Why not return []’s which get flattened into the output until undef is returned?

Because then there has to be an escape mechanism for the block to be able to return an actual [] that should be preserved in the output, and the block has to make sure to use it when appropriate.

But maybe I’m missing something as to why the prototype is &$ instead of &@

Because unfold (“induce”) takes a single value and inflates it to a list, as opposed to fold (reduce) which takes a list and deflates it to a single value. In the example I gave, which is admittedly idiotic, the single value is an array ref which gets inflated to an array.

A more reasonable example might be turning a number into a list of factors of powers of some base:

my $number = 4711;
my $base = 12;
my @power = induce {
    my $r = $number % $base;
    $number /= $base;
    $r || void undef $_;
} $number;

Or cutting a string into chunks of 3 characters:

my $str = "foobarbaz";
my @chunk = induce { (length) ? substr $_, 0, 3, '' : void undef $_ } $str;

Re:returning refs?

Eric Wilhelm on 2009-01-17T00:17:44

Why not return []’s which get flattened into the output until undef is returned?

Because then there has to be an escape mechanism for the block to be able to return an actual [] that should be preserved in the output, and the block has to make sure to use it when appropriate.

No, you would not reach inside the returned ref - only flatten it at the first level. It is either [] or undef, so [[]] would be a list of one reference.


    my @power = induce {
        my $r = $_ % 12
        $_ /= 12;
        return $r ? [$r] : undef;
    } 4711;


    my @chunk = induce {
      (length) ? [substr $_, 0, 3, ''] : undef
    } "foobarbaz";

The return $condition ? [@stuff] : undef seems like it might be forming a pattern.

unfoldr

int32 on 2009-01-29T21:44:58

I played a bit with the concept, and found an issue I can't resolve, possibly you can help me here. The problem concerns whether $_ is an alias or a copy to the parameter passed. If it is a copy then you can't modify it unless it is a reference. If it is an alias you can't modify it if it is read-only. You may ask why should it be modified at all, and here comes the classic unfoldr:

unfoldr { $_ ? ($_, $_ - 1) : () } 10;

10,9,8,7,6,5,4,3,2,1

which allows to pass arbitrary values to the callback . Now back to your example, if I want to send values other than the original parameter to myself (e.g. callback), then I can't. Also, I can't send an "undef" back to me that way. That seems too much of a limitation.

So a question. Why not a good old unfoldr?

unfoldr { @$_ ? ( pop @$_, $_ ) : () } [1..10];

10,9,8,7,6,5,4,3,2,1

Re:unfoldr

Aristotle on 2009-01-29T22:24:41

Read my post; I addressed that concern. To reiterate: because the empty list is a valid return value. I want to be able to make compute state transitions on the input value without them having to contribute to the resulting list.

It’s not hard to convert a read-only value to modifiable copy.

if I want to send values other than the original parameter to myself (e.g. callback), then I can’t.

You can. You just have to capture them with a closure. Admittedly that’s cumbersome.

So yeah, it’s not syntactically very nice in certain edge cases. But all versions have tradeoffs of various forms. This one is the most minimally restrictive while maximally idiomatic rendition of unfold that I have seen so far. And yet since writing this entry I have grown too dissatisfied with it to use it.

The problem, ultimately, is you need to return two lists: one for the parameters to the next iteration, one for the result values produced during the current iteration. Unfortunately there’s just no nice way to do this.

Re:unfoldr

int32 on 2009-01-29T22:42:42

Oh well, I could say the same, read my post; I addressed that concern too :) unfoldr actually allows to return empty lists. Here's a hacked quote from haskell description:

The unfoldr function is a `dual' to foldr: while foldr reduces a list to a summary value, unfoldr builds a list from a seed value. The function takes the element and returns empty list if it is done producing the list or returns (@a,$b), in which case, @a is a appended to the list and $b is used as the next element in a recursive call.

As to the problem as you address it that you need to return two lists, I don't think there is a point to do so, at least for unfoldr, because it transforms a scalar to a list, not a list to a list. A list and a scalar is just fine.

Re:unfoldr

Aristotle on 2009-01-29T22:54:21

But then you need to drag it along through all of your return values. That is cumbersome.

Using $_ for the purpose is more idiomatic. The problem is in reusing $_ to signal the last iteration – that prevents you from unfolding undef, should you want to, and it bloats the code by your having to be careful not to return that undef when you meant to stop iterating.

However I can’t think of any good way to signal termination out of band.

Re:unfoldr

Aristotle on 2009-01-29T22:53:45

Btw, no, your point about alias vs copy is mistaken. The induce function above takes a copy of the value and then aliases $_ to that copy, so you are guaranteed to be able to modify the value, although if you want to preserve its last value past the unfolding, you want to arrange for that yourself.