Progress, of sorts

aurum on 2008-09-02T03:35:11

I now have a script which produces output that looks like this. Each capital letter represents a manuscript. (OK, so in real life the words are lined up in columns, but I can't make use.perl play nicely with Unicode characters inside an <ecode> tag, which is the one that would preserve spacing.)


Word variation!  Context: 
մինչեւ ցայս վայրս բազմաջան եւ եւ աշխատաւոր քննութեամբ գտեալ գրեցաք >> ի զշարագրական գրեալս զհարիւրից ամաց, զորս ի << բազում ժամանակաց հետա հետաքննեալ հասու եղաք։ ընդ այնքանեաց տեսողացն եւ

Base  ի զշարագրական գրեալս  զհարիւրից ամաց, զորս ի
----                                              
ABH:    զշարագրական գրեալս  զհարիւրից ամաց  զորս ի
G:      զշարագրական գրեալսն հարիւրից  ամաց  զորս ի
C:    ի ժամանակական գրեալս  հարիւրից  ամացն զորս  
J:      զշարագրական գրեալս  զճից      ամաց  զոր  ի
DFI:    զշարագրական գրեալս  զճից      ամաց  զորս ի
E:      զշարագրական գրեալս  զճ        ամաց  զորս ի

Of course it doesn't take any input yet. One thing at a time.

Formatting -- yours and use.perl.org's

theorbtwo on 2008-09-02T07:36:11

I had a long reply using Text::Unidecode here, but use.perl.org *really* doesn't want to format things the way I want it to (half the time it seems to double-encode my unicode, and never do multiline code or pre tags!), so I'll try using words instead of pictures to explain what I'm trying to talk about.

First, the easy question: How are those alternate readings sorted? It doesn't seem to be by first ms with that reading, nor by number of readings -- is it just hash order?

Second, the hard question -- what do you mean by alignment? Does it align each word with its other variants, or just the beginning of each alternate reading? How do you keep the forms of zsharagrakan (thanks, Text::Unidecode) aligned with each other and treat the i as a word that is sometimes not there -- edit distance?

Re:Formatting -- yours and use.perl.org's

aurum on 2008-09-02T11:08:51

Q1) Alternate readings are unsorted. That is intentional - I don't want to inadvertently give priority to the reading in ms A, or the reading with the most words, or anything.
Q2) Alignment can occur in one of two ways. The first is by a small enough edit distance, as you say - it's how I keep the instances of "zsharagrakan" aligned. That was my "fuzzy match." The second is what is called a "negative variant" - the words aren't alike at all, but they coincide in placement. This is why "zhamanakakan" is lined up with "zsharagrakan". In the case of the 'i' at the beginning, it means that none of the manuscripts besides C have a word in that place, so it is in fact being treated as a word that is sometimes not there.
What this doesn't display (yet) is the "fuzzymatch" and "variant" relationships that have been calculated between words. I'm hoping it will become obvious what to do with that information when I start handling user input.
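
For anyone following along, here is a minimal sketch of the edit-distance side of this. It is not the actual script: the function names, the relative threshold of 0.25, and the ASCII transliterations are all assumptions made for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Classic Levenshtein distance, computed one row at a time.
# For real manuscript text, pass decoded character strings, not raw bytes.
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # substitution or exact match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical rule: two words "fuzzy match" when the distance is at most
# a quarter of the longer word's length. The 0.25 is an assumed value.
sub is_fuzzy_match {
    my ($s, $t, $threshold) = @_;
    $threshold = 0.25 unless defined $threshold;
    my $longest = max(length($s), length($t));
    return 1 if $longest == 0;
    return edit_distance($s, $t) / $longest <= $threshold ? 1 : 0;
}

# Rough transliterations of words from the sample output above.
print is_fuzzy_match('greals', 'grealsn')            ? "fuzzy match\n" : "no match\n";
print is_fuzzy_match('zsharagrakan', 'zhamanakakan') ? "fuzzy match\n" : "no match\n";
```

The first pair differs by one trailing letter and passes the threshold. The second pair is too far apart to pass, so it could only be lined up by position, which is the "negative variant" case.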

Re:Formatting – yours and use.perl.org's

Aristotle on 2008-09-02T19:10:32

half the time it seems to double-encode my unicode

Put posts and comments through “encode 'us-ascii', $your_post, Encode::HTMLCREF”. That will make them come out as intended.
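
A minimal sketch of that call, wrapped in a helper whose name is made up for illustration. It uses Encode::FB_HTMLCREF, which is the HTMLCREF flag combined with LEAVE_SRC so the input string is left untouched; treat the rest as an assumption about how you would wire it into your own posting workflow.

```perl
use strict;
use warnings;
use utf8;                  # this file contains a literal Armenian letter
use Encode qw(encode);

# Hypothetical helper: turn every non-ASCII character into a decimal
# HTML character reference, leaving plain ASCII (including & and <) alone.
sub to_ascii_entities {
    my ($text) = @_;
    return encode('us-ascii', $text, Encode::FB_HTMLCREF);
}

print to_ascii_entities("ի"), "\n";    # U+056B comes out as &#1387;
```

Ampersands and angle brackets that are already in the text pass through unchanged, so they still need escaping separately if they are meant literally.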

and never do multiline code or pre tags!

That’s on purpose; Slashcode has its own special <ecode> tag for that purpose (whose distinguishing features are: 1. you can write raw angle brackets and ampersands inside, and Slash will turn them into entities for you; 2. it doesn’t use <pre>, so very long lines will wrap properly (something that you can achieve in modern browsers via CSS by saying white-space: pre-wrap)).
(IMO it should nowadays just use Markdown. (Slash is older than Markdown, mind.) But since I have global shortcuts to translate the clipboard from Markdown to HTML, I don’t personally care either way.)

Re:Formatting – yours and use.perl.org's

theorbtwo on 2008-09-02T20:41:25

Ah. Perlmonks does that too, but it calls the new tag <code>. (Or <c>, for the lazy.)

Re:Formatting – yours and use.perl.org's

aurum on 2008-09-03T11:06:45

Slashcode has its own special <ecode> tag for that purpose (whose distinguishing features are: 1. you can write raw angle brackets and ampersands inside, and Slash will turn them into entities for you;

This is the part that doesn't play nicely with UTF-8, actually, although the <ecode> tag is almost always what I want - the Armenian characters get converted into entities upon comment submit, and those entities themselves have their ampersands turned into entities upon ecode conversion.
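
The double encoding can be reproduced in a few lines. This is only a simulation under assumptions: the first step stands in for whatever turns the characters into entities on submit, the second for the ampersand escaping inside <ecode>, and both helper names are invented here.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Step 1 (simulated): non-ASCII characters become numeric entities on submit.
sub entities_on_submit {
    my ($text) = @_;
    return encode('us-ascii', $text, Encode::FB_HTMLCREF);
}

# Step 2 (simulated): inside <ecode>, every ampersand is escaped again.
sub escape_ampersands {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    return $text;
}

my $submitted = entities_on_submit("\x{56B}");    # the Armenian letter "i"
my $displayed = escape_ampersands($submitted);

print "$submitted\n";    # &#1387;
print "$displayed\n";    # &amp;#1387;
```

The second line is what the reader ends up seeing as literal text instead of the letter.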

Re:Formatting – yours and use.perl.org's

Aristotle on 2008-09-04T18:07:39

The conversion to entities is your browser’s doing, actually. It sees that the form should be submitted in ISO-Latin1, so it turns all the non-Latin1 characters into entities. Slashcode can’t actually know that you didn’t mean to send them that way. There is therefore no way to get around this.
All you can do is use plain <code> tags with <br> tags for linebreaks, sequences of &nbsp; for tabs, and manual escaping for ampersands and less-thans. It’s a pain to do manually, but a tolerable amount of work with a good editor.
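
If the editor route gets tedious, the manual steps can be scripted. The following is a rough sketch, not anything Slashcode provides: the helper name, the tab width of four, and the choice to convert only runs of two or more spaces are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: prepare preformatted text for a plain <code> tag.
sub to_code_markup {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;                            # ampersands first, so later entities survive
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;                             # not strictly required, but harmless
    $text =~ s/\t/'&nbsp;' x 4/ge;                   # assumed tab width: 4
    $text =~ s/( {2,})/'&nbsp;' x length($1)/ge;     # keep column padding intact
    $text =~ s/\n/<br>\n/g;
    return "<code>$text</code>";
}

print to_code_markup("Base  text\nG:\ta < b"), "\n";
```

Single spaces are left alone so ordinary prose can still wrap; only padding that carries alignment is made non-breaking.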
