XML: SAX and Twig, also TWiki

toma on 2003-01-01T10:10:16

XML: SAX and Twig


I have been reading about XML::SAX, and it is starting to make sense. I am concerned about its memory usage compared to XML::Twig. I need to process an XML file that is at least 10MB and possibly as large as 100MB. I would like to keep RAM usage under 256MB, although 512MB might be okay. I have a hard limit of 2GB of RAM, since I am using 32-bit perl. It looks like XML::Twig can be set up to work with SAX, so this might help to solve the memory problem.
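To make the memory argument concrete, here is a minimal sketch of the streaming idea. It uses Python's standard xml.sax module as a stand-in for XML::SAX, not the Perl API itself. The element name "record", the flat field layout inside it, and the names RecordHandler and stream_records are all assumptions made up for illustration. The point is that only one record is held in memory at a time, so RAM usage should not grow with file size.

```python
import io
import xml.sax


class RecordHandler(xml.sax.ContentHandler):
    """Build one <record> at a time, pass it to a callback, then drop it."""

    def __init__(self, on_record):
        super().__init__()
        self.on_record = on_record
        self.current = None  # fields of the record being read, or None
        self.field = None    # name of the field element being read
        self.text = []       # character chunks for the current field

    def startElement(self, name, attrs):
        if name == "record":          # hypothetical record element name
            self.current = {}
        elif self.current is not None:
            self.field = name
            self.text = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self.field is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == "record":
            self.on_record(self.current)
            self.current = None       # nothing from this record is kept
        elif name == self.field:
            self.current[name] = "".join(self.text)
            self.field = None


def stream_records(source, on_record):
    """Parse a filename or binary file object, calling on_record per record."""
    xml.sax.parse(source, RecordHandler(on_record))
```

Something like stream_records("big.xml", process) would then call process once per record. XML::Twig gets a similar effect with per-element handlers that purge what they have already seen.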



These file sizes are spooky. I think of XML as one-big-happy-file-that-describes-a-thing. Perhaps "the-thing" is too complicated for a single file. If so, I will need a new approach. I may need to learn about namespaces or some other way to partition a large XML dataset.



I thought up a way to eliminate the redundancy in the XML reader/writer for my flat/lumpy files. I can have a data structure that specifies the flat file in XML. Redundant portions of the XML reader and writer can be generated from this file.
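A rough sketch of what I mean, in Python for brevity. The spec format (a "format" element holding "field" elements with name and width attributes), the fixed-width layout, and all function names are assumptions invented for this example, not an existing module. Both directions are driven by the same spec, so a new file format only needs a new spec.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML description of one fixed-width flat-file format.
SPEC_XML = """
<format>
  <field name="id" width="4"/>
  <field name="name" width="10"/>
  <field name="qty" width="5"/>
</format>
"""


def load_spec(xml_text):
    """Return [(field_name, width), ...] from the XML spec."""
    return [(f.get("name"), int(f.get("width")))
            for f in ET.fromstring(xml_text).iter("field")]


def flat_to_xml(lines, spec, root_tag="data", record_tag="record"):
    """Reader: slice each fixed-width line into one element per field."""
    root = ET.Element(root_tag)
    for line in lines:
        record = ET.SubElement(root, record_tag)
        pos = 0
        for name, width in spec:
            ET.SubElement(record, name).text = line[pos:pos + width].strip()
            pos += width
    return root


def xml_to_flat(root, spec, record_tag="record"):
    """Writer: pad or truncate each field back to its column width."""
    return ["".join((record.findtext(name) or "").ljust(width)[:width]
                    for name, width in spec)
            for record in root.iter(record_tag)]
```

The round trip only reproduces the original line when the fields are left-aligned and space-padded. Real formats would need per-field alignment and type rules in the spec, which is exactly where the design tradeoffs start.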



It would be nice if someone had already written this. There are many tradeoffs in the design of such a thing, and I don't want to get bogged down in it. I will look at some of the SAX drivers for non-XML data sources.
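For reference, the idea behind a SAX driver for a non-XML source can be sketched as follows, again in Python rather than Perl. The function name csv_to_sax and the tag names are made up for the example, and it assumes the CSV header row contains valid XML element names. The driver reads no XML at all. It simply fires the same events a parser would, so any SAX handler downstream cannot tell the difference.

```python
import csv
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def csv_to_sax(lines, handler, root_tag="data", record_tag="record"):
    """Fire SAX events for CSV rows; the first row supplies the field names."""
    rows = csv.reader(lines)
    header = next(rows)
    no_attrs = AttributesImpl({})
    handler.startDocument()
    handler.startElement(root_tag, no_attrs)
    for row in rows:
        handler.startElement(record_tag, no_attrs)
        for name, value in zip(header, row):
            handler.startElement(name, no_attrs)
            handler.characters(value)
            handler.endElement(name)
        handler.endElement(record_tag)
    handler.endElement(root_tag)
    handler.endDocument()


if __name__ == "__main__":
    out = io.StringIO()
    # XMLGenerator is a stock handler that serializes events back to XML text.
    csv_to_sax(["id,name", "1,widget"], XMLGenerator(out, encoding="utf-8"))
    print(out.getvalue())
```

Because rows are pushed through one at a time, this keeps the same constant-memory property as reading XML with SAX.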



I think removing reader and writer redundancy will be worthwhile, since I have at least a dozen and perhaps thirty of these file formats to translate to and from XML. As my buddy Steve says, "Make things that are the same the same and things that are different different."



TWiki
One of the things I like about PerlMonks is that I get new ideas that have nothing to do with what I am working on. Today, for example, I downloaded, built, and ran TWiki. Suddenly I get it, and I hope that I will be using TWiki for something useful and yet disruptive. At work there is a large dataset of free-text startup content, which is duct-taped to the side of an exquisitely normalized database. This text is the output of an extensive ongoing collaboration. It looks like a great opportunity for a wiki.

The main challenge will be scalability. I plan on evaluating this within the next few months.



New Modules
I am still trying to get new-user creation working in TWiki. I didn't have any email set up on the machine where I was running TWiki, and that seemed to be a problem. I got the email working, but I still have the same problem. I rebuilt perl 5.8.0 in the process, and updated a bunch of modules as recommended by the output of the r command in the cpan shell.