Filter::QuasiQuote 0.01 is now on CPAN

agent on 2008-08-05T10:27:07

After reading Audrey's blog post mentioning GHC's upcoming quasiquoting feature (as well as that quasiquoting paper), I quickly hacked up a (simple) quasiquoting mechanism for Perl, hence the Filter::QuasiQuote module, now on CPAN:



http://search.cpan.org/perldoc?Filter::QuasiQuote

I'm looking forward to using sensible filters in my production code (e.g. OpenResty) and eliminating ugly Perl code with embedded DSLs. For example, instead of writing



    my $sql = "alter table " . quote_identifer($table) . " drop column " . quote($column) . ";";

I can simply write

    use OpenResty::QuasiQuote::SQL;
    my $sql = [:sql| alter table $table drop column $column; |];



Also, a JSON-like DSL can be used to describe valid Perl data structures and to generate the Perl code that does the validation.
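For example, a validator quasiquote could look something like this (a purely hypothetical sketch: the module name, the DSL syntax, and the generated interface are all made up for illustration):

    use OpenResty::QuasiQuote::Validator;   # hypothetical filter subclass
    # The filter would expand the quasiquote below into plain Perl checks at
    # compile time, say an anonymous sub that dies on invalid input:
    my $check = [:validator| { name: STRING :required, age: INT } |];
    $check->({ name => 'Tom', age => 32 });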

Filter::QuasiQuote supports subclassing, so the OpenResty::QuasiQuote::SQL module mentioned above could be derived from it. Also, multiple concrete filter classes could be composed in a single Perl source file. Just put a series of use statements together:



    use MyQuote1;
    use MyQuote2;

and it should work. Because filters are required to always return Perl source on a single line, line numbers won't get corrupted.
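A derived filter looks roughly like this (a simplified sketch: translate_my_dsl is a made-up helper, and the exact name and signature of the overridden method may differ from what's shown here, so check the module's POD):

    package MyQuote1;
    use strict;
    use warnings;
    use base 'Filter::QuasiQuote';

    sub filter {
        my ($self, $dsl, $file, $line) = @_;
        # Translate the DSL text into Perl source; keep the result on a single
        # line so line numbers in the enclosing file stay intact.
        my $perl = translate_my_dsl($dsl);   # made-up translator
        $perl =~ s/\n/ /g;
        return $perl;
    }

    1;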

Of course, many of the nice properties of Haskell's quasiquotations, such as type safety, are lost in my implementation. But the Perl version is much more flexible and powerful (by some definition) ;)



It's still in alpha and could be buggy. Feel free to report bugs or send wishlist items to the CPAN RT site or directly to me ;)

Enjoy!


Don't use that on a high volume site

btilly on 2008-08-05T18:44:26

At least not against a good database (e.g. Oracle, PostgreSQL, etc. - but MySQL would be fine).

The problem is that there is a trade-off between the time spent preparing a query and query performance. It would not be inappropriate to think of preparing a query as a "compile and optimize this SQL" step. MySQL spends very little energy preparing, and therefore its ability to handle complex queries suffers. Most other databases put a lot of energy into preparing, and so it is very important to try to avoid recompiling the same query over and over again.

When you use placeholders you get the opportunity to let the database reuse query plans, which cuts query preparation costs. But your approach quotes things over and over again, which forces a lot of extra prepares and can leave the database falling over because it runs out of CPU. I've seen it happen - it is not pretty.

It would be good if you could build the same thing with an option to do some kind of late binding.
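To make the trade-off concrete, here is a small DBI sketch (the table and column names are made up):

    # Placeholders: prepare once, let the database cache the plan, execute many times.
    my $sth = $dbh->prepare('SELECT * FROM posts WHERE author = ?');
    for my $author (@authors) {
        $sth->execute($author);
        # ... fetch rows ...
    }

    # Interpolated quoting: every distinct value produces distinct SQL text,
    # so the database has to parse and plan each query from scratch.
    for my $author (@authors) {
        my $sql = 'SELECT * FROM posts WHERE author = ' . $dbh->quote($author);
        my $rows = $dbh->selectall_arrayref($sql);
    }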

Personally at work I faced a problem kind of like that, and my solution was to pass my SQL through a template engine (allowing me to do things like conditionally include complex chunks of SQL) and then pass variables in by name (I'm using DBD::Pg, which supports binding variables by name). I've been very happy with how it has worked out for me. However that suggests that you might want to think about how to set things up to expand a quasi-quoted thing multiple times...

Re:Don't use that on a high volume site

Aristotle on 2008-08-05T21:30:05

Also, source filters. Yes, Module::Compile, but still, yuck.

Personally I prefer SQL::Interp (via DBIx::Simple actually), which lets me write code very nearly like the quasiquoting snippet, except it’s still plain Perl and it uses placeholders and does binding under the covers. It’s a fabulous duo.
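A small example of what that looks like (iquery is DBIx::Simple's SQL::Interp front end; the table and columns here are made up):

    use DBIx::Simple;
    my $db = DBIx::Simple->connect( $dsn, $user, $pass );
    my @posts = $db->iquery(
        'SELECT * FROM posts WHERE author =', \$author,
        'AND created >', \$since,
    )->hashes;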

Re:Don't use that on a high volume site

btilly on 2008-08-05T23:40:11

If my situation were simpler, that would look good to me as well. Unfortunately I have big chunks of SQL that are included or not based on one condition, and if included require specific parameters to be bound or not as well. Templating syntax works well for this, but standard template tools only return a string, so I really, really want to work with a string by itself, not a string plus parameters.

Using :param_name isn't portable, but it is highly effective for my job.
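For reference, the named-placeholder style looks roughly like this with DBD::Pg (names are illustrative; see the DBD::Pg docs for the exact bind_param semantics):

    my $sth = $dbh->prepare('SELECT * FROM some_table WHERE x = :x AND y = :y');
    $sth->bind_param( ':x', $x );
    $sth->bind_param( ':y', $y );
    $sth->execute;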

Re:Don't use that on a high volume site

Aristotle on 2008-08-06T07:44:58

That is actually exactly the situation that SQL::Interp addresses. It lets you do something like this:

my ( $sql, @bind_value ) = sql_interp(
    'SELECT foo FROM bar',
    ( $do_include_baz
        ? ( 'LEFT JOIN baz ON baz.foo = foo.id AND baz.quux =', \$quux )
        : ()
    ),
    'WHERE', \%condition,
);

sql_interp will combine all those SQL snippets into a single string, duly putting in placeholders for all the variables, and shuffling all the values into a single list according to the positions of their corresponding placeholders. (It’ll also translate that hash into a list of ANDed equality comparisons.) So even though the database sees positional placeholders, you don’t need to repeat the conditional logic from the SQL construction when building the bind list, and the query is written with the Perl variables inline – precluding any bugs in lining up placeholders with bind values, no matter how complex the query composition gets.
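For example, with $do_include_baz true and %condition = ( author => 'joe', id => 42 ), the result would be roughly this (modulo whitespace and hash key order):

    # $sql        = 'SELECT foo FROM bar LEFT JOIN baz ON baz.foo = foo.id'
    #             . ' AND baz.quux = ? WHERE (author = ? AND id = ?)';
    # @bind_value = ( $quux, 'joe', 42 );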

Re:Don't use that on a high volume site

btilly on 2008-08-06T15:14:08

Yes, you can do that - but the result is unreadable. Which would you prefer to read?

my ( $sql, @bind_value ) = sql_interp(
        'SELECT foo ',
            ( $include_bar
                    ? ', bar'
                    : ()
            ),
            ' , count(*) as records '
    , 'FROM some_table
          WHERE x = ', \$x
    , ' GROUP BY foo ',
          ( $include_bar
                    ? ', bar'
                    : ()
          ),
          ' ORDER BY foo ',
            ( $include_bar
                    ? ', bar'
                    : ()
            ),
);

(Geez, I can't even figure out how to indent that usefully. Certainly the output isn't indented properly!) or

push @queries,
    {
        sql => qq{
            SELECT foo
[% IF include_bar -%]
                , bar
[% END -%]
            FROM some_table
            WHERE x = :x
[% FOREACH section = ["GROUP", "ORDER"] -%]
            [% section %] BY foo
[%   IF include_bar -%]
                , bar
[%   END -%]
[% END -%]
        },
        params => {
            x => $x,
        },
        template_vars => {
            include_bar => $include_bar,
        }
    };

Now pretend for a second that you have a whole bunch of these variables, a large group by section, and you're writing a lot of these. Oh, and that the GROUP BY and ORDER BY sections are kind of long. Oh, and that during debugging you need to look at the generated SQL.

If you're imagining all of that, then you'll start to see what my life is like. :-)

Re:Don't use that on a high volume site

Aristotle on 2008-08-07T09:18:49

Neither. :-)

But the first example shows why I specifically pointed out that the list you pass is a normal Perl list and so is amenable to all of the facilities Perl provides. You can factor it just in the same way you would factor any other part of the code:

my $columns = join ', ', 'foo', ( $include_bar ? 'bar' : () );

push @queries, [
    "SELECT $columns, count(*) as records",
    'FROM some_table',
    'WHERE', { x => $x },
    map { "$_ BY $columns" } qw( GROUP ORDER ),
];

And then processing @queries consists merely of passing each entry through sql_interp.
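Something like this, say (a quick DBI sketch; error handling omitted):

    for my $q (@queries) {
        my ( $sql, @bind ) = sql_interp(@$q);
        my $rows = $dbh->selectall_arrayref( $sql, undef, @bind );
        # ... use $rows ...
    }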

This does leave the SQL indenting issue, admittedly. I wish SQL::Tidy had gone anywhere, but it appears not to have.

Re:Don't use that on a high volume site

agent on 2008-08-06T02:54:54

Actually I am going to use it on a very very busy site ;) SQL preparing does not make much sense for me, since our platform will usually work in the context of a PostgreSQL PL/Proxy cluster *AND* the query could be dynamic enough to defeat ordinary DBI param binding.

Actually, runtime performance is the reason for choosing a source-filter solution in the first place ;)

Also, the SQL example is, well, merely an example... Filter::QuasiQuote's power reveals itself in the context of true DSLs ;)

Re:Don't use that on a high volume site

btilly on 2008-08-06T04:43:04

My experience says that with a sane design you can run one of the top couple thousand busiest websites on the internet on a handful of webservers, paying only a modest amount of attention to performance of the code on your webservers.

That same experience says that tiny mistakes in how you handle your database can cause that same site to melt unexpectedly.

The lesson is to not worry about webserver performance, but be paranoid about database performance. Which means use placeholders properly. If you do it dynamically, sure, you might get query variants that are not in cache, and you have to pay the overhead of preparing. But what matters is that most of the time for most of your queries you avoid that overhead.

Sure, clustering the database helps. But from what I've heard, splitting such a cluster after you run out of headroom is not really fun. And why buy yourself problems that you don't need to have? There are plenty of ways to dynamically build complex queries and pass in parameters. There is a huge performance benefit to doing so. That performance benefit comes at a known major bottleneck. Why wouldn't you do this?

Re:Don't use that on a high volume site

agent on 2008-08-06T06:15:51

I must say that all you say is indeed true for an ordinary web application :)

But unfortunately I can't use prepare+execute in my OpenResty platform in particular. Why? Because it must scale by design to serve lots of apps here in Yahoo! China and Alibaba. So it must be a cluster or something like that.

The PL/Proxy database server requires frontend queries to be of the following form:

      select xquery('account', 'select * from posts...', 1);

That is, the user SQL query itself must be a dynamic string (by definition), passed to a PL/Proxy function ("xquery" here).
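So the inner query has to be assembled as a plain string and shipped inside the xquery call, roughly like this (a purely illustrative sketch, not the actual OpenResty code):

      my $user_sql = "select * from posts where id = " . $dbh->quote($id);
      my $rows = $dbh->selectall_arrayref(
          "select xquery('account', " . $dbh->quote($user_sql) . ", 1)"
      );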

Silly me, I don't see how preparing such things in advance would help in terms of performance ;)

Also, we use PgBouncer extensively. According to the DBD::Pg documentation, "programs such as PgBouncer which cache connections at a low level should not use prepared statements via DBD::Pg".
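(As far as I know, the usual way to follow that advice is simply to turn off server-side prepares on the handle:)

      $dbh->{pg_server_prepare} = 0;   # ask DBD::Pg not to use server-side prepared statements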

Furthermore, in other contexts, like importing lots of data into Pg, I've found that directly sending multiple insert statements at a time can be *much* faster than preparing an insert statement first and then executing the statement handle to do the actual insertions one by one. And yeah, it's a "special" case :)

I believe, however, that preparing a common, not-so-tricky SQL query once for many subsequent reuses will indeed be a big win ;) I won't argue against this ;)

But my project sadly falls out of that category, it seems. Or am I taking things terribly wrong here?

Thanks!

Re:Don't use that on a high volume site

btilly on 2008-08-06T06:43:43

Responding out of order.

On sending multiple insert statements at once. Yes, that can be a big win because you're cutting down on round trips to the database. Each round trip takes unavoidable resources on the client, server, and network. With network latency typically being the biggest deal. However there is an upper limit to the win from that. A compromise that works fairly well is to prepare a bulk insert that inserts multiple records, thereby bypassing the prepare and reducing round trips. YMMV. But better still is to use the COPY functionality. If need be, COPY to a temp table, then run an insert from that. It is more work because you will need to create said table though, and get the datatypes right. Also with the version of DBD::Pg that I'm using you have to guess the size of buffer to use. But I think the new one automagically gets it right.
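A sketch of both approaches with a reasonably recent DBD::Pg (the table and columns are made up; pg_putcopydata/pg_putcopyend are the COPY helpers):

    # Multi-row INSERT prepared once, executed per batch of three rows:
    my $sth = $dbh->prepare(
        'INSERT INTO posts (id, title) VALUES (?, ?), (?, ?), (?, ?)'
    );
    $sth->execute( map { @$_ } @batch );   # @batch holds three [ $id, $title ] pairs

    # COPY straight into the table (or a temp staging table) for bulk loads:
    $dbh->do('COPY posts (id, title) FROM STDIN');
    $dbh->pg_putcopydata( join("\t", @$_) . "\n" ) for @rows;
    $dbh->pg_putcopyend();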

I haven't used PgBouncer. What I've found sufficient for my needs is just having a limited number of pre-forked Apache servers, and then have proxies in front of them in accelerator mode. The children reuse their connections from request to request. Avoiding the complexities of connection pooling simplifies things and seems like a design win to me, though I grant that there are situations where you would want connection pooling.

I have not used PL/Proxy. Glancing at the documentation, it does seem that it does not support preparing queries. I guess that the resulting overhead is gained back from having multiple machines. However my immediate reaction is to think that you're not going to be able to use it with any relational features of the database. And once you give up the relational features of the database, I'd be inclined to stop pretending to be relational and look for a more exotic database that did exactly what I wanted.

If you're doing an insert rarely, select often, then what would be really nice to have is a queryable hot backup. The technology for that exists in Oracle and is really good. It doesn't exist for postgres. $work is willing to help fund a PostgreSQL hacker to add it, but the core pg team hasn't found anyone to take it. (More precisely, they have found someone who is working on it, but he won't be tackling that part of the project any time soon...)

Re:Don't use that on a high volume site

agent on 2008-08-06T07:17:42

Right, preparing a bulk insert statement first would be faster :) I merely have to handle the last few rows specially :) Thanks for the tip.

We use PgBouncer at the PL/Proxy level to cache connections to the data nodes. On the FastCGI level, a pre-forked lighttpd is used. Database connections to the PL/Proxy nodes are reused across FastCGI loops in a similar fashion to what you described :)

Well, I don't think the use of PL/Proxy necessarily means loss of relationality. Relational constraints still hold for data residing on the same data node. We just partition the relational data according to application or even actual queries. For example, if two tables in a single app never join in any query, we then have a choice to put them on two different nodes.

Moving data intelligently across the data nodes in the same big cluster and automatically producing backup nodes to distribute load on hot spots is something we'll be working on ;)

Regarding the queryable hot backup technique, yeah, some members of our team are actively working on it. Not sure if it'll get opensourced though.

Thanks!

Re:Don't use that on a high volume site

btilly on 2008-08-06T14:45:49

Tell me if a donation could help the open-sourcing decision. I'm btilly, at gmail dot com.

Re:Don’t use that on a high volume site

Aristotle on 2008-08-06T07:49:23

You are aware of DBD::Gofer I suppose?

Re:Don’t use that on a high volume site

agent on 2008-08-06T08:38:09

Oh, not yet :)) Thanks for the info, I'll try it out to see if it'll fit in my OpenResty architecture :)

Re:Don’t use that on a high volume site

Aristotle on 2008-08-06T09:03:54

Then be sure to check out the slides from Tim’s OSCON ’07 DBD::Gofer talk.