RegEx Golf
Inspired by the famous XKCD cartoon that started it and by Google’s Peter Norvig over on O’Reilly.com I decided to play some RegEx Golf myself this Easter…
If you’re not familiar with the game – the idea is to write the shortest regular expression possible that will select every element of a list, without selecting any elements from a second list. But what are regular expressions, you ask? Well, they’re a special sequence of characters that specify a particular textual patter – so you can use the appropriate regular expression when programming to automatically extract specific text from a list, file, or other text. There are some great tutorials on the web if you want to learn regular expressions (and you really should – because they’re great) – for example regexone.com…
I didn’t want to use lists of Presidents of the United States (as that had been done) – so I decided to have a go at the age-old challenge: animal, mineral, or vegetable…
Having three lists to work with makes things a bit tougher – as we have two sources of potential mismatches to avoid – meaning we’ll have to be more specific than we otherwise might be…
After some after a quick web search I came up with the following lists:
Animals
alligator ant bear bee bird camel cat cheetah chicken chimpanzee cow crocodile deer dog dolphin duck eagle elephant fish fly fox frog giraffe goat goldfish hamster hippopotamus horse kangaroo kitten lion lobster monkey octopus owl panda pig puppy rabbit rat scorpion seal shark sheep snail snake spider squirrel tiger turtle wolf zebra
Minerals
agate alexandrite amethyst aquamarine aventurine azurite basalt calcite carnelian citrine coal diamond dolomite emerald feldspar fluorite garnet gold granite gypsum hematite howlite jade jasper jet kunzite labradorite limestone magnetite malachite mica nephrite obsidian opal peridot pumice pyrite quartz ruby salt sandstone sapphire sard silica soapstone sodalite talc topaz tourmaline tsavorite turquoise unakite wairakite yeatmanite zircon zoisite
Vegetables
artichoke asparagus avocado bamboo basil bean broccoli capers carrot cauliflower celeriac celery chard chestnut cabbage chives cress cucumber fennel garlic ginger gourd kale kohlrabi leek lentils lettuce maize okra olive onion parsley parsnip pea pepper potato pumpkin radicchio radish rhubarb rocket romaine rutabaga seaweed shallot sorrel spinach sprouts salsify squash succotash tomato turnip watercress yam zucchini
Note that I played around slightly with the lists – to include a few tricky words – because this is no fun if it’s too easy, right? 🙂
Inspired by Peter Norvig – I too used Python, and a function (based on Peter’s) to automatically check my expression for me.
def mistakes(regex, first_set, second_set, third_set): "The set of mistakes mades by the regex in classifying things in the first_set from the second_set and the third_set" return ({"Should have matched: " + F for F in first_set if not re.search(regex, F)} | {"Should not have matched: " + S for S in second_set if re.search(regex, S)} | {"Should not have matched: " + T for T in third_set if re.search(regex, T)})
Now we have that we can start playing. One easy technique might be to try to form a regex that matches all of the animals which start with ‘a’ – for example, ^a(ll|n) would match any word where a is the first letter, and is followed by either a double-l or an n… We can then do the same thing for words starting with b – adding it to the a’s using an OR…
^a(ll|n)|^b(i|ee|ear)
We need to do a bit more this time – to match bee & bear but not bean…
If we do this for the whole string (avoiding matching minerals and vegetables) we might get:
initial_animal = "^a(ll|n)|^b(i|ee|ear)|^c(h(ee|i(c|m))|ro|ow|a(m|t))|^d(e|u|o(g|lp))| ^e[al]|^f(i|o|r|ly)|^g(ir|o(a|ldf))|^h(a|i|or)|^k(an|i)|^l(io|o)| ^mo|^o[cw]|^p(an|i|up)|^ra[bt]|^s(c|eal|h(ar|ee)|n|qui|pid)| ^t(i|urt)|^wo|^ze"
And if we try it:
print(mistakes(initial_animal, animal, mineral, vegetable))
we get:
set([])
Showing us it’s a valid solution for these lists.
But it’s not very short – and we can do better than that.
Let’s take a look at the minerals. Here we can see straight away that there are some common factors we can exploit to help us. For example, quite a lot of words end -ite. So we could use "ite$"
to find all words ending -ite – but in fact we don’t need to do that, since none of the animals or vegetables contain ite, we can save ourselves a character an just use "ite"
.
Going through the word-list using this technique (spotting things like no animals or vegetables begin with ‘j’, and although goldfish contains the word gold – it’s not at the end of the string) I came up with the following:
re_mineral = "ire|ite|ise|(r|l)ine|ian|one|[at]z|pal|ic[ae]|^j| sum|sar|con|alt|gold$|net|agat|ys|by|alc|ond|oal|dsp|rid|ald"
Using this substring matching technique for the animals gets us a shorter list too…
re_animal = "be(e|ar)|nk|^ow|ige|ark|ors|tte|ide|ant|ly|roo| ig|pan|pus|o(x|g|w|at|lf)$|lio|ppo|qui|rtl|ee[rtp]|ken|fish| ppy|ird|ebr|cr]a[mt]|eal|ff|ake|bit|uck|pio|ms|ail|glbst|cod|hin$"
And we can do the same thing for the veg too…
re_vegetable = "ess|ok|bba|ya|[pe]k|llo|iz|iac|ut|cc|rro|voc|bean| urd|ppe|ves|ea$|(r|le|f)y$|agu|nac|low|ve$|orr|ts]uc|i[pc]$|mai|rh| ock|ato$|boo|uas|til|eed|cha|nn|ing|lr|kal|oni|ers|umb|asi|dis"
We can now use an assertion to test all of the regular expressions together
def verify (re_a, re_m, re_v, animals, minerals, vegetables): assert not (mistakes(re_a, animals, minerals, vegetables) | mistakes(re_m, minerals, vegetables, animals) | mistakes(re_v, vegetables, animals, minerals)) return True verify (re_animal, re_mineral, re_vegetable, animal, mineral, vegetable)
No output means we pass…
So that’s my attempt at this – I don’t think it’s too bad – but I suspect you could get shorter if you tried…
If you want the full source-code (such as it is!) – then go to https://github.com/Auctoris/regex_golf
If you find any shorter solutions, or have any comments about my solutions, then let me know…
Filed under: Computing,Python,Uncategorised - @ March 30, 2016 14:30