A simplified parseable format for Changes files

hex on 2007-11-09T13:04:23

There's a lot of discussion going on at the moment about machine-readable Changes (or CHANGES) files: miyagawa, LTjake. hanekomu put together a new module, Module::Changes, to parse a "Changes.yml" file; RGiersig made some suggestions for the content of that file.

Discussion so far has mainly been around the use of YAML. Points raised:

  • YAML is less expressive than RDF (me)
  • RDF is hard to write (miyagawa)
  • People want a simple format (everyone)
  • The format should be transformable from human to machine (everyone)
  • Even YAML can have too much chrome (Alias)

Thinking about all of these, I propose the following. Design constraints were (a) granularity (including Skud's suggestions of what to mention), (b) an absolute minimum of chrome, and (c) trivial to transform into other formats (such as RDF).

    v! 1.3
    @ 2007-11-08T11:15
    # This version was codenamed Muffin because we were listening to Frank Zappa at the time.
    m! This project is now maintained by ZIRCON (of Zircon Software fame).
    l! We have switched licenses. This software now uses the Greater Zork Software License.
    Please ensure that you have read the new license before using this software.
    a! New frobnitz() method - save 50 lines of manual frobnitzing by using this instead!
    b! Fixed the error in quack() where it would actually moo instead of quack. [RT 1234]
    c! The calling convention for rumpelstiltskin() has CHANGED. See perldoc.
    t! Test coverage is now 100%! Go us!

    v 1.3_01
    @ 2007-11-07T09:20
    # Developer preview for 1.3 and the CPAN testers.

    v 1.2.1
    @ 2007-11-02T20:08
    d Fixed some POD formatting mistakes.
    c Refactored accessors into AUTOLOAD. Makes no external difference.
    r Removed the deprecated honkhonkhonk() method as warned several versions ago.

As you can see, each version is represented by a block of lines. Double line breaks separate versions. Each line begins with a token denoting what it describes, optionally suffixed with an exclamation mark, which means "important". When applied to a version number, it implies "major release". (Applying it to a date or comment is meaningless and should be ignored by any parser.) The token is followed by \s+. If an item is split onto multiple lines, it is understood to continue until a new token or block break is reached.

These are the tokens:

    @  Release date. In W3C datetime format (a profile of ISO 8601).
    #  A comment.
    a  An addition to the code.
    b  A bugfix. Linking to a ticket here would be nice if it exists.
    c  A change to existing code.
    d  A change to documentation.
    l  A change to licensing.
    m  A change to the maintainer.
    r  A removal of something from the code.
    t  A change to tests.
    v  A version number.
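
The rules above can be sketched as a small parser. This is a rough illustration in Python rather than a reference implementation; the function name `parse_changes` and the dictionary layout are made up for the example, and it handles only what this draft describes (a token, an optional "!", then \s+).

```python
import re

# Token letters from the list above; "@" and "#" are the date and comment tokens.
TOKEN_RE = re.compile(r"^([@#abcdlmrtv])(!?)\s+(.*)$")


def parse_changes(text):
    """Parse first-draft Changes text into a list of version blocks (hypothetical helper)."""
    releases = []
    # Double line breaks separate versions.
    for block in re.split(r"\n\s*\n", text.strip()):
        entries = []
        for raw in block.splitlines():
            line = raw.strip()
            match = TOKEN_RE.match(line)
            if match:
                token, bang, body = match.groups()
                # "!" is meaningless on dates and comments, so it is ignored there.
                important = bool(bang) and token not in "@#"
                entries.append({"token": token, "important": important, "text": body})
            elif entries and line:
                # No token at the start: the line continues the previous item.
                entries[-1]["text"] += " " + line
        releases.append(entries)
    return releases


sample = """\
v! 1.3
@ 2007-11-08T11:15
l! We have switched licenses.
Please read the new license.

v 1.2.1
d Fixed some POD formatting mistakes.
"""

releases = parse_changes(sample)
```

Note that this sketch treats any line that starts with a token letter followed by whitespace as a new item, so a continuation line that happens to begin with a word such as "a" would be misread.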

I haven't gone quite as far as RGiersig did in his specification, as I felt that was a bit heavy. For example, release stability in my scheme is indicated by the version number - that should be implied from the existing convention of underscored version numbers for developer releases.
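
As an illustration of how little a tool needs in order to infer stability from the version string, here is a minimal Python sketch. The helper name `is_developer_release` is hypothetical, and the only rule it encodes is the underscore convention mentioned above.

```python
def is_developer_release(version):
    """Return True if the version follows the underscore convention for developer releases."""
    return "_" in version


dev = is_developer_release("1.3_01")
stable = is_developer_release("1.2.1")
```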

Vague other thoughts - case-insensitive tokens? And maybe a standard block of comments at the beginning of the file explaining what the tokens are to new readers.

Thoughts? I actually like this enough that I might start using it myself.

Update: There's a second draft now.


using x! instead of !x for important items

claes on 2007-11-09T13:11:56

in most programming languages prefix ! is a negation so I think using postfix ! is a better choice.

just my 0.02 EUR

Re:using x! instead of !x for important items

hex on 2007-11-09T13:22:03

Good point - I changed it.

Hard to read...

Alias on 2007-11-09T14:35:06

Unfortunately, due mainly to the lack of indenting, I don't like your format.

Of course, I dislike all the other proposals just as much.

Re:Hard to read...

hex on 2007-11-09T14:47:47

The spec allows you to do this if you want:

a!
  Added some groovy new feature.
b
  Fixed that stupid little bug in the gnomon.

Any better?

Ambiguity

Ovid on 2007-11-09T15:25:28

Each line begins with a token denoting what it describes ... The token is followed by \s+. If an item is split onto multiple lines, it is understood to continue until a new token or block break is reached.

Maybe I missed something, but what do you do if the word 'a' is the first letter of an item split onto multiple lines? How does the parser know that's not a token?

Re:Ambiguity

hex on 2007-11-09T16:30:12

Ooh, good catch. As it stands, it wouldn't. The workaround is not to split a line before an "a" :-)

If anyone can think of a patch to the spec to fix that without adding complexity (I can't off the top of my head) I'd be interested to hear it.

Re:Ambiguity

Ovid on 2007-11-09T17:13:14

As a format which could conceivably be written in other (human) languages, can you guarantee that none of them will have the same issue? Or that someone might refer to their 'd' subroutine and mess things up?

Maybe subsequent lines could be indented or the preceding line could end in a backslash?

Re:Ambiguity

hex on 2007-11-09T17:41:02

confound on IRC suggested starting continued lines with a '.', but that's more chrome to impede a quick visual scan of the document, as are backslashes. On the other hand, the backslash is a well-known line continuation indicator. I prefer your suggestion of indenting, though. Leading whitespace already seems to be commonly used on CPAN to indicate a continued comment.

a We added a new shiny feature that you'll all love:
  a magic automatic doodad configurator.
b! A major bug got fixed. Really major. It was so awful,
  in fact, that I can only talk about it in Latin:
  Lorem ipsum dolor sit amet, consectetuer adipiscing
  elit. Nulla iaculis mi quis mi. Quisque nibh neque,
  gravida quis, bibendum vitae, aliquet ut, enim.
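
With the indentation rule, a parser no longer has to guess whether a leading "a" is a token. A rough Python sketch of that rule follows; the name `parse_block` and the returned structure are invented for the example, and it assumes a token may also stand alone on its line with the text indented beneath it.

```python
import re

# A token only counts at the very start of a line; the text after it is optional.
ITEM_RE = re.compile(r"^([@#abcdlmrtv])(!?)(?:\s+(.*))?$")


def parse_block(block):
    """Parse one version block, treating indented lines as continuations (hypothetical helper)."""
    items = []
    for line in block.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            # Leading whitespace: this line continues the previous item.
            if items:
                joined = items[-1]["text"] + " " + line.strip()
                items[-1]["text"] = joined.strip()
            continue
        match = ITEM_RE.match(line)
        if match is None:
            raise ValueError(f"unindented line without a token: {line!r}")
        token, bang, body = match.groups()
        items.append({"token": token, "important": bang == "!", "text": (body or "").strip()})
    return items


block = """\
a We added a new shiny feature that you'll all love:
  a magic automatic doodad configurator.
b!
  A major bug got fixed.
"""

items = parse_block(block)
```

Here the second line starts with "a" but is indented, so it is joined to the first item instead of opening a new one.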

remove/delete

bart on 2007-11-10T09:22:28

Looking at your list of abbreviations, it makes a lot of sense to me, except for one detail: the "d". I believe I'm not the only one to expect the "d" to mean "delete". Instead, "d" is for docs, "r" is for remove. So you've solved it by choosing another word instead of "delete"...

But I'm sure mistakes are bound to be made: people will accidentally use "d" instead of "r".

Instead I'd prefer to use another letter for "documentation", but I can't think of any other word.

Re:remove/delete

hex on 2007-11-10T14:19:39

Hmm. "d" for "delete" is a good point; however, I find the phrasing "I deleted a feature" a little awkward.

How about we take a leaf out of diff -u's book and circumvent the issue of which word to use?

-  removed something
+  added something

There's no ambiguity in that...

Re:remove/delete

Eric Wilhelm on 2007-11-11T06:26:44

I find the alphabetical codes rather unreadable. They mix too much with the text. Having tags for version number and date seems redundant when those two items are essential (and currently standard anyway.) In my rendition, the version and date are a non-indented header over a set of indented paragraphs which start with sigils.

v0.1.1 2007-11-10

    + added new thing() method

    - removed old deal() method

    * fixed bug #12578

    % changed code for blah()

    ? documentation updates for bop()

    $ license change

    ^ maintainer change

    = fixed tests on VMS

    ! incompatible change notice blah blah

I think most of those sigils are self-explanatory, except maybe 'maintainer' -- it looks like a house.
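
For comparison, a reader for this layout might look roughly like the Python sketch below. The names `parse_sigil_changes` and `SIGILS` are made up for the example, and it assumes that a header is always "v" plus the version, whitespace, then a YYYY-MM-DD date, and that a sigil is always followed by whitespace.

```python
import re

# Sigil meanings as given in the example above.
SIGILS = {
    "+": "addition", "-": "removal", "*": "bugfix", "%": "change",
    "?": "documentation", "$": "license", "^": "maintainer",
    "=": "tests", "!": "incompatible",
}
HEADER_RE = re.compile(r"^v(\S+)\s+(\d{4}-\d{2}-\d{2})$")
ITEM_RE = re.compile(r"^([-+*%?$^=!])\s+(.*)$")


def parse_sigil_changes(text):
    """Parse non-indented version headers followed by indented sigil items."""
    releases = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Non-indented line: must be a "vVERSION DATE" header.
            header = HEADER_RE.match(line.rstrip())
            if header is None:
                raise ValueError(f"bad header: {line!r}")
            releases.append({"version": header.group(1), "date": header.group(2), "items": []})
            continue
        if not releases:
            raise ValueError("item found before any version header")
        items = releases[-1]["items"]
        body = line.strip()
        item = ITEM_RE.match(body)
        if item:
            items.append((SIGILS[item.group(1)], item.group(2)))
        elif items:
            # Indented line without a sigil: continue the previous paragraph.
            kind, so_far = items[-1]
            items[-1] = (kind, so_far + " " + body)
        else:
            raise ValueError(f"indented text before any item: {line!r}")
    return releases


sample = """\
v0.1.1 2007-11-10

    + added new thing() method

    - removed old deal() method

    * fixed bug #12578
"""

parsed = parse_sigil_changes(sample)
```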

Re:remove/delete

Aristotle on 2007-11-11T07:49:03

I thought the metaphor for the maintainer symbol was that someone changed their “hat” (as in “putting on my group leader hat, I say that […]”). That seemed funnily apt to me.

Compatibility and security

Juerd on 2007-11-11T19:14:59

I think that *incompatible changes* and *security fixes* are very important to indicate separately in a machine readable way. Just like security fixes make you install the new version asap, incompatible changes make you wait until you have tuits for updating your code. (And when they're there together, good luck.)

For these, I suggest "i" and "s".

Actually, single letters make bad identifiers. How about the following self-descriptive tags:

new
fix
doc
incompatible
license
maint
security
tests

Where changes to and removals of code are "incompatible"; if a change is not incompatible, it's an addition ("new") or a fix.

And please allow whitespace instead of T in the timestamp.

v 1.00
@ 1234-12-34 12:34
# I'm so happy with this release
security: fixed buffer overflow in Foo->cookies
fix: orange should have been blue, not red.
incompatible: removed emacs support

Could even uppercase the tags...

v 1.00
@ 1234-12-34 12:34
# I'm so happy with this release
SECURITY: fixed buffer overflow in Foo->cookies
FIX: orange should have been blue, not red.
INCOMPATIBLE: removed emacs support

Result: writable AND readable, and important things stand out because they're longer
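
A rough Python sketch of how a tool could read the word-tag variant and pick out the entries that matter most to a machine. Everything here is illustrative: the names `parse_release` and `needs_attention` are invented, tags are matched case-insensitively so both spellings above work, and the date is kept as a plain string.

```python
import re

# The self-descriptive tags suggested above.
TAGS = {"new", "fix", "doc", "incompatible", "license", "maint", "security", "tests"}
# Either a one-character marker (v, @, #) plus whitespace, or "tag:" before the text.
LINE_RE = re.compile(r"^(?:(v|@|#)\s+|([A-Za-z]+):\s*)(.*)$")


def parse_release(block):
    """Parse one release written with word tags (hypothetical helper)."""
    release = {"version": None, "date": None, "comments": [], "changes": []}
    for raw in block.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognised line: {line!r}")
        marker, tag, body = match.groups()
        if marker == "v":
            release["version"] = body
        elif marker == "@":
            release["date"] = body
        elif marker == "#":
            release["comments"].append(body)
        else:
            tag = tag.lower()  # accept "FIX:" as well as "fix:"
            if tag not in TAGS:
                raise ValueError(f"unknown tag: {tag!r}")
            release["changes"].append((tag, body))
    return release


def needs_attention(release):
    """Return the security and incompatible entries of a parsed release."""
    return [change for change in release["changes"] if change[0] in ("security", "incompatible")]


sample = """\
v 1.00
@ 2007-11-11 19:14
# I'm so happy with this release
SECURITY: fixed buffer overflow in Foo->cookies
fix: orange should have been blue, not red.
incompatible: removed emacs support
"""

release = parse_release(sample)
urgent = needs_attention(release)
```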

Re:Compatibility and security

Skud on 2007-11-11T23:11:45

I agree with Juerd on all points, but most especially that using a word rather than a letter helps a lot with readability. So, what he said.

Re:Compatibility and security

hex on 2007-11-12T00:18:41

Compatibility, security, fix: agree that these are necessary splits to "bug fix" ("b" in my original scheme).

Timestamps: these follow the format specified in ISO 8601, where the "T" is a mandatory separator. I'd like to stick to an existing standard of date representation if possible.

I think uppercasing is too shouty... adding the important marker would make you end up with "FIX! SECURITY! NEW!". It's a bit tabloid newspaper. :-)

With all this in mind I'm going to post a revised spec shortly for a second round of comments. Thanks!

Re:Compatibility and security

Juerd on 2007-11-12T10:45:38

I think the important marker itself is not important if you split out security/incompatible. If something new is important, bump the version number.

As for the timestamp, you'd have two things, whitespace separated, instead of one. dateTtime may be the standard, but date time is much more commonly seen in the wild. And for a very good reason.
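
Whichever separator the spec settles on, a reader can cheaply accept both. A minimal Python sketch, assuming the timestamp is either a bare date or a date and a time separated by "T" or whitespace; the helper name `parse_release_date` is hypothetical.

```python
from datetime import datetime


def parse_release_date(value):
    """Parse 'YYYY-MM-DDTHH:MM' or 'YYYY-MM-DD HH:MM' into a datetime."""
    parts = value.split()
    if len(parts) == 2:
        # Normalise "date time" to the "T" form before parsing.
        value = "T".join(parts)
    return datetime.fromisoformat(value)


with_t = parse_release_date("2007-11-08T11:15")
with_space = parse_release_date("2007-11-08 11:15")
```

Both calls produce the same datetime value, so the choice of separator only affects what people type, not what tools see.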