Extracting From: addresses

brian_d_foy on 2004-07-03T00:00:14

Because I am too stupid to pull this out of my shell history or make an alias for it, I post the one-liner I use to extract From addresses from email that my spam blocker incorrectly flags as spam. Those addresses end up in my white list.

grep ^From: fix | perl -pe 's/From:\s+//; s/.*<(.*@.*)>.*/$1/' | sort | uniq >> ~/.procmail/goodfile


I suppose that is rather unixy, although those other programs have Perl implementations somewhere.

Now I'll probably forget how to find this post, too.


That can be trimmed down

Aristotle on 2004-07-03T11:33:22

No need for grep when you're using Perl, nor for substitutions when you're not using sed.
perl -lne '/From:\s+<(.*@.*)>/ and print $1' | sort -u >> ~/.procmail/goodfile

Re:That can be trimmed down

merlyn on 2004-07-03T14:09:52

perl -lne '/From:\s+<(.*@.*)>/ and $u{$1}++; END { print for sort keys %u}' inputfile

Re:That can be trimmed down

brian_d_foy on 2004-07-03T15:37:36

This one won't work either (and you should know that :). Most of the time that regex does not match.

Re:That can be trimmed down

Aristotle on 2004-07-03T17:33:38

Yeah, but that's a mouthful, both to remember and type, as opposed to the transition from Perl to grep.

Re:That can be trimmed down

Aristotle on 2004-07-03T18:36:55

Actually, since the sort in the original code is only required to support uniqueness, you could simplify that code using the old !$seen{$key}++ trick.

Re:That can be trimmed down

brian_d_foy on 2004-07-03T15:35:58

You can't do the substitution in one step because the pattern does not always match, and even the pattern that you use will mostly fail rather than mostly succeed.

The address lines look like:
From: "Fred Flinstone" <fred@example.com>
From: barney@example.com

I may not need the grep, but it sure makes things easier to figure out when there is a problem. When I try to do such things in one big step, I usually find that I miss something special about the input and then have to waste a lot of time figuring out why 10% of the data foil the script.

I could do the sort -u, though :)

Re:That can be trimmed down

Aristotle on 2004-07-03T17:41:17

Ah. Well, you can still do something like
perl -lne 's/^From:\s+// or next; print /<(.*@.*)>/ ? $1 : $_' | sort -u >> ~/.procmail/goodfile
That keeps the "conceptual grep" separate from the matching, also. :)

Re:That can be trimmed down

brian_d_foy on 2004-07-03T18:19:42

This one doesn't work either, unless I type the email into standard input.

Not to be mean, but there really are good reasons for not trying to be too clever. My kludgey looking one liner works, it is easy to understand, and is even faster. Remember, test any code that you post because once it gets online, it lives on forever. :)

Just for giggles, I timed these against a file of over 1,000 email. I added the filename as a command line argument after your perl invocation, and on average, it took about 0.25 seconds on my PowerBook. My original one liner took on average 0.060 seconds. That is all the same time really, but if I had to sort through a million email addresses, things would be different.

Certainly, for some people without fast grep or sort programs, the all-in-one Perl solution is probably faster.

Re:That can be trimmed down

Aristotle on 2004-07-03T18:32:45

In the cases where speed is a concern, this is a very sed-able task. Jibing about the lack of filename was superfluous, btw.
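
One way the sed version might look, as a rough sketch rather than anything posted in the thread: it applies the same two substitutions as the original one-liner, restricted to lines that start with From:. The sample input and the variable names headers and sed_addresses are made up for illustration, and no claim is made here about how its speed compares.

```shell
# Hypothetical sample input with one non-matching header mixed in.
headers='From: "Fred Flinstone" <fred@example.com>
Subject: lunch
From: barney@example.com'

# -n suppresses default output; only From: lines are edited and printed.
sed_addresses=$(printf '%s\n' "$headers" |
    sed -n '/^From:/{s/^From:[[:space:]]*//;s/.*<\(.*@.*\)>.*/\1/;p;}' |
    sort -u)

printf '%s\n' "$sed_addresses"
```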

Re:That can be trimmed down

brian_d_foy on 2004-07-03T18:58:29

I'm sorry you think I was "jibing", but not everyone is going to realize why your one-liner just sits there appearing to do nothing. Leaving out the input is not a minor detail. :)

Re:That can be trimmed down

Aristotle on 2004-07-03T19:14:00

Yes, I realized that the jibe wasn't entirely undeserved after I posted that. I should have been consistent and left out the output redirection as well. Whether the data source and destination are relevant is a matter of view, but specifying one and not the other makes no sense.