Heading towards PPI 1.200, featuring new features!

Alias on 2006-10-07T05:02:19

Thanks to a large spurt of activity by Chris Dolan (much of it for the purpose of adding new features and policies to Perl::Critic), the 1.118 version of PPI (currently on CPAN) will now be considered the last version in the 1.100 series of releases.

The next few releases will be developer releases, versioned as 1.199_xx, for testing the new features which will appear in 1.200.

At this point I'd expect something like 5-10 releases in the 1.199_xx series, as we'll need to write some even-more-paranoid tests for a couple of the new features and ramp up the line-noise testing again for a while. Please note that this increased line noise testing also means the install time for the PPI dev releases will be relatively long.

Currently, there are three major features scheduled for the 1.200 release.

1. Newline round-tripping

The one current caveat to PPI's "100% round-trip guarantee" is that it will localize newlines to the current platform if your files use the "wrong" platform's newlines.

I did this originally because I was concerned about "regex soup" issues (the regex form of a cross-platform newline is not /\n/ but /\015{1,2}\012|\015|\012/, and it would need to be used in a lot of places), and because no software I'm aware of deals effectively with files containing mixed newlines, so I didn't have a good working system to copy from.
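As a rough illustration only (not the actual PPI internals), the kind of normalisation applied today looks something like this, assuming $source holds the raw file contents:

    # Illustrative sketch: collapse any newline style in $source to the
    # local platform newline, which is roughly what PPI currently does
    # when it loads a document.
    my $local = "\n";
    $source =~ s/\015{1,2}\012|\015|\012/$local/g;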

But Chris and the Perl::Critic guys really want the newlines in the PDOM model to be byte-level correct. In particular, they want to create policies that detect and/or ban wrong-platform or mixed-platform newlines, and currently Perl::Critic can only guarantee access to the document itself, because of things like STDIN/STDOUT pipe mode. They don't necessarily have access to the original file to run their newline checks against.

After much email to'ing and fro'ing, Chris has managed to convince me that he can implement full newline round-tripping, to the point where if your file starts out with mixed newlines, we can output the original newlines correctly on a byte-for-byte basis.

I like to think of this as making PPI 110% round-trip compliant! :)

Obviously this is going to create some issues for people whose current code uses plain-newline regular expressions or simple splits, so the default behaviour will probably remain localized newlines in the model, with some sort of document constructor flag to control newline behaviour.

For example, potentially it may look like this:

    my $doc = PPI::Document->new( 'filename.pm',
        newlines => 'local',     # current behaviour
        newlines => 'cr',        # force to a specific newline type
        newlines => 'mixed',     # line-by-line accurate with auto-default
        newlines => 'cr-mixed',  # line-by-line with explicit default
    );

Most of the hard work is, as usual, dealing with edge cases.

For example, search (Perl::Critic's use case) is fine, but what happens when you insert a tree section into a document, and they have different newlines? What if both the inserted content and the document have mixed newlines? (think cut and paste)

Resolving those weird cases will take some work, but I think we have a conceptual design that will mean that for most people, the API will just Do What You Mean and all current code and behaviours will be back-compatible.

2. Sub-classing of PPI::Token::Number

Numbers in Perl are extraordinarily complicated. While PPI does a pretty good job of identifying numbers, it hasn't typed them very well.

Currently it has a single PPI::Token::Number class, with a ->type method to identify binary, hex, octal, and so on and so forth.

After reading through the code following the 1.000 release, Audrey Tang made the offhand comment that having a ->type method was silly, since for data objects (which is what the PDOM objects are) their class is their type, and that I should just have subclassed the main number class into a series of sub-classes for each type of number.

It was such an obvious thing to do that it's been near the top of my to-do list ever since, but I've never had the time to sit down and implement it.

Fortunately, Chris has decided to take a shot at it as well, and seems to be getting along quite well. The actual sub-classes are yet to be finalised though, because we might end up with some unusual cases like PPI::Token::Number::Unicode (which is something you probably don't want to know exists).

But it should all be resolvable.
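To give a feel for the difference, here is a purely hypothetical sketch; none of the class names or call sites below are final:

    # Assume $token is a number token found somewhere in a parsed document.

    # Today (PPI 1.118): one class, with ->type distinguishing the kinds of number.
    if ( $token->isa('PPI::Token::Number') and $token->type eq 'hex' ) {
        # handle a hexadecimal literal
    }

    # After the refactor: the subclass itself is the type.  The class name
    # here is illustrative only, since the sub-classes aren't finalised yet.
    if ( $token->isa('PPI::Token::Number::Hex') ) {
        # handle a hexadecimal literal
    }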

And with numbers refactored, that creates an opportunity to implement a third long-desired feature that the number problem was blocking.

3. Adding the PPI::Element::literal method

One of the original "really cool uses" for PPI was the idea that you can take features of various modules that are currently suspicious from a security point of view (or can only be used safely in limited situations), and make them completely safe for use everywhere (although of course with a speed penalty).

One of these modules with existing security issues is Data::Dumper. Data::Dumper dumps out Perl structures as Perl code and uses the interpreter to read it back in.

However, reading arbitrary data back in by executing it as code creates a security hole big enough to drive a truck through.

The PPI::Element::literal method is a solution for this sort of problem, and has been in planning a while.

This method will take an element, for example the string 'foo', and return its literal value, as if Perl itself had parsed it as a value.

This seems fairly trivial, until you consider that we can implement the literal method recursively.

So now we can take something like the following...

    $VAR1 = {
        foo => [
            1, 2, 3,
            qw{foo bar},
            {
                this => \'that',
            },
        ],
        qr/\n/,
    };

... and we can determine that something will be assigned to $VAR1, and then determine the literal value of the structure being assigned, exactly as if Perl had parsed it. In the ->literal implementation we can trap anything with an error or illegal content, or catch things that can't be determined at all (think "foo$bar") and throw an exception.

With sane number classes, we can start implementing the literal method for numbers, and then from there also implement it for the string types, hash|array|scalar|ref constructors, and so on and so forth (possibly even including bless( $var, 'class') expressions in the Data::Dumper output in some cases).
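To make the idea concrete, here is a very rough sketch of how a safe Data::Dumper reader might eventually look on top of ->literal. The method doesn't exist yet, and the details (file name, error handling) are purely illustrative:

    use PPI;

    # Hypothetical sketch only: read a Data::Dumper file back in without
    # ever eval'ing it.  The ->literal call is the future API described
    # above, not something PPI provides today.
    my $doc = PPI::Document->new( 'dump.pl' )
        or die "Failed to parse dump.pl";

    # Find the anonymous constructor on the right-hand side of the
    # assignment...
    my $constructor = $doc->find_first( 'PPI::Structure::Constructor' )
        or die "No constructor found";

    # ...and recover its literal value recursively, exactly as Perl would
    # have seen it, with an exception thrown for anything unresolvable.
    my $data = $constructor->literal;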

Other additions for 1.200

As well as the three major features detailed above, 1.200 will also see the PPI::Document::File class upgraded from experimental to supported (in preparation for larger and more complex document-modification work), another layer of self-analysis test scripts, and the usual set of minor parsing bug fixes.

Most notably, with the literal method coming, I'd like to try to support explicit constructors (think +{ foo => 'bar' } and similar things) better, and to handle the "block vs hash constructor" problem a bit better.
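For readers unfamiliar with that problem, the ambiguity is plain Perl, nothing PPI-specific:

    # The classic ambiguity: are these braces an anonymous hash constructor
    # or a code block?  Perl itself has to guess, and so does PPI.
    my @names = qw( foo bar );
    my $hash  = +{ foo => 'bar' };            # the leading '+' forces a hash constructor
    my @pairs = map { ( $_ => 1 ) } @names;   # the parens keep the braces a plain block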

And of course, the line noise tests will now start throwing BOTH types of newlines at us.

As for when 1.200 is expected to be released, I think definitely by Christmas.

Hopefully this Christmas :)


truly awesome

amoore on 2006-10-09T17:02:55

Thanks for all of your work on PPI. It's making it possible (maybe even easy?) to have tons of cool tools like Perl::Critic, more automated refactoring in our editors, static source code analyzers, and all of that stuff I haven't even thought of. Thanks again!