String::Approx - match and substitute approximately (aka fuzzy matching)

NAME

SYNOPSIS

        use String::Approx qw(amatch asubstitute);

Approximate is defined here as k-differences. One difference is an insertion, a deletion, or a substitution of one character. The k in the k-differences is the maximum number of differences.

For example 1-difference means that a match is found if there is one character too many (insertion) or one character missing (deletion) or one character changed (substitution). Those are exclusive ors: that is, not one of each type of modification but exactly one.

The default approximateness

The default approximateness is 10 % of the length of the approximate pattern or at least 1: 0-differences being the exact matching which can be done very effectively using the usual Perl function index() or normal regular expression matching.

amatch

        use String::Approx qw(amatch);

        amatch("PATTERN");
        amatch("PATTERN", @LIST);
        amatch("PATTERN", [ @MODS ]);
        amatch("PATTERN", [ @MODS ], @LIST);

The PATTERN is a string, not a regular expression. The regular expression metanotation (. ? * + {...,...} ( ) | [ ] ^ $ \w ...) will be understood as literal characters, that is, a * means in regex terms \*, not "match 0 or more times".

The LIST is the list of strings to match against the pattern. If no LIST is given matches against $_.

The MODS are the modifiers that tell how approximately to match. See below for more detailed explanation. NOTE: The syntax really is [ @MODS ], the square brackets [ ] must be in there. See below for examples.

In scalar context amatch() returns the number of successful matches. In list context amatch() returns the strings that had matches.

Example:

        use String::Approx qw(amatch);

        open(WORDS, '/usr/dict/words') or die;

        while (<WORDS>) {
            print if amatch('perl');
        }

or the same ignoring case:

        use String::Approx qw(amatch);

        open(WORDS, '/usr/dict/words') or die;

        while (<WORDS>) {
            print if amatch('perl', ['i']);
        }

asubstitute

        use String::Approx qw(asubstitute);

        asubstitute("PATTERN", "SUBSTITUTION");
        asubstitute("PATTERN", "SUBSTITUTION", @LIST);
        asubstitute("PATTERN", "SUBSTITUTION", [ @MODS ]);
        asubstitute("PATTERN", "SUBSTITUTION", [ @MODS ], @LIST);

Also the SUBSTITUTION is a string, not a regular expression. Well, mostly. Most of the regular expression metanotation (., ?, *, +, ...) will be not understood as literal characters, that is, a * means in regex terms \*, not "match 0 or more times". The understood notations are

$`: the part before the approximate match
$&: the approximately matched part
$': the part after the approximate match

The MODS are the modifiers that tell how approximately to match. See below for more detailed explanation. NOTE: Yes, the syntax is really [ @MODS ], the square brackets [ ] must be in there. See below for examples.

The LIST is the list of strings to substitute against the pattern. If no LIST is given substitutes against $_.

In scalar context asubstitute() returns the number of successful substitutions. In list context asubstitute() returns the strings that had substitutions.

Examples:

        use String::Approx qw(asubstitute);

        open(WORDS, '/usr/dict/words') or die;
        while (<WORDS>) {
            print if asubstitute('perl', '($&)');
        }

or the same ignoring case:

        use String::Approx qw(asubstitute);

        open(WORDS, '/usr/dict/words') or die;
        while (<WORDS>) {
            print if asubstitute('perl', '($&)', [ 'i' ]);
        }

Modifiers

The MODS argument both in amatch() and asubstitute() is a list of strings that control the matching of PATTERN. The first two, i and g, are the usual regular expression match/substitute modifiers, the rest are special for approximate matching/substitution.

i: Match/Substitute ignoring case, case-insensitively.
g: Substitute globally, that is, all the approximate matches, not just the first one.
k: The maximum number of differences. For example 2.
Ik: The maximum number of insertions. For example 'I2'.
Dk: The maximum number of deletions. For example 'D2'.
Sk: The maximum number of substitutions. For example 'S2'.
k%: The maximum relative number of differences. For example '10%'.
Ik%: The maximum relative number of insertions. For example 'I5%'.
Dk%: The maximum relative number of deletions. For example 'D5%'.
Sk%: The maximum relative number of substitutions. For example 'S5%'.

The regular expression modifiers o m s x are not supported because their definitions for approximate matching are less than clear.

The relative number of differences is relative to the length of the PATTERN, rounded up: if, for example, the PATTERN is 'bouillabaise' and the MODS is ['20%'] the k becomes 3.

If you want to disable a particular kind of difference you need to explicitly set it to zero: for example 'D0' allows no deletions.

In case of conflicting definitions the later ones silently override, for example:

        [2, 'I3', 'I1']

equals

        ['I1', 'D2', 'S2']

EXAMPLES

The following examples assume the following template:

        use String::Approx qw(amatch asubstitute);

        open(WORDS, "/usr/dict/words") or die;
        while (<WORDS>) {
                # <---
        }

and the following examples just replace the above '# <---' line.

Matching from the `$_`

Match 'perl' with one difference

        print if amatch('perl');

The one difference is automatically the result in this case because first the rule of the 10 % of the length of the pattern ('perl') is used and then the at least 1 rule.

Match 'perl' with case ignored

        print if amatch('perl', [ 'i' ]);

The case is ignored in matching (i).

Match 'perl' with one insertion

        print if amatch('perl', [ '0', 'I1' ]);

The one insertion is easiest achieved with first disabling any approximateness () and then enabling one insertion (I1).

Match 'perl' with zero deletions

        print if amatch('perl', [ 'D0' ]);

The zero deletion is easily achieved with simply disabling any deletions (D0), the other types of differences, the insertions and substitutions, are still enabled.

Substitute 'perl' approximately with HTML emboldening

        print if amatch('perl', '<B>$&</B>', [ 'g' ]);

All (g) of the approximately matching parts of the input are surrounded by the HTML emboldening markup.

Matching from a list

The above examples match against the default variable $_. The rest of the examples show how the match from a list. The template is now:

        use String::Approx qw(amatch asubstitute);

        open(WORDS, "/usr/dict/words") or die;
        @words = <words>;
        # <---

and the examples still go where the '# <---' line is.

Match 'perl' with one difference from a list

        @matched = amatch('perl', @words);

The @matched contains the elements of the @words that matched approximately.

Substitute 'perl' approximately with HTML emphasizing from a list

        @substituted = asubstitute('perl', '<EM>$&</EM>', [ 'g' ], @words);

The @substituted contains with all (g) the substitutions the elements of the @words that matched approximately.

ERROR MESSAGES

amatch: $_ is undefined: what are you matching against?

asubstitute: $_ is undefined: what are you matching against?

These happen when you have nothing in $_ and try to amatch() or asubstitute(). Perhaps you are using the Perl option -e but you did forget the Perl option -n?

amatch: too long pattern.

This happens when the pattern is too long for matching.

When matching long patterns, String::Approx attempts to partition the match. In other words, it tries to do the matching incrementally in smaller parts.

If this fails the above message is shown. Please try using shorter match patterns.

See below for LIMITATIONS for more detailed explanation why this happens.

asubstitute: too long pattern.

This happens when the pattern is too long for substituting.

The partitioning scheme explained above that is used for matching long patterns cannot, sadly enough, be used substituting.

Please try using shorter substitution patterns.

See below for LIMITATIONS for more detailed explanation why this happens.

VERSION

Version 2.1.

LIMITATIONS

Fixed pattern

The PATTERNs of amatch() and asubstitute() are fixed strings, they are not regular expressions. The SUBSTITUTION of asubstitute() is a bit more flexible than that but not by much.

Pattern length

The approximate matching algorithm is very aggressive. In mathematical terms it is O(exp(n) * x**2). This means that when the pattern length and/or the approximateness grows the matching or substitution take much longer time and memory.

For amatch() this can be avoided by partitioning the pattern, matching it in shorter subpatterns. This makes matching a bit slower and a bit more fuzzier, more approximate. For asubstitute() this partitioning cannot be done, the absolute maximum for the substitution pattern length is 19 but sometimes, for example it the approximateness is increased, even shorter patterns are too much. When this happens, you must use shorter patterns.

Speed

Despite the about 20-fold speed increase from the String::Approx version 1 agrep is still faster. If you do not know what agrep is: it is a program like the UNIX grep but it knows, among other things, how to do approximate matching. agrep is still about 30 times faster than Perl + String::Approx. NOTE: all these speeds were measured in one particular system using one particular set of tests: your mileage will vary.

For long patterns, more than about 40, the first

Incompatibilities with `String::Approx` v1.*

If you have been using regular expression modifiers (i, g) you lose. Sorry about that. The syntax simply is not compatible. I had to choose between having amatch() match and asubstitute() substitute elsewhere than just in $_ and the old messy way of having an unlimited number of modifiers. The first need won.

There is a backward compability mode, though, if you do not want to change your amatch() and asubstitute() calls. You have to change your use line, however:

        use String::Approx qw(amatch compat1);

That is, you must add the compat1 symbol if you want to be compatible with the String::Approx version 1 call syntax.

AUTHOR

Jarkko Hietaniemi <jhi@iki.fi>

ACKNOWLEDGEMENTS

Nathan Torkington <gnat@frii.com>