Monday, 18 June 2012

Anagram Finder

Picking a problem at random today, I find http://www.4clojure.com/problem/77 the objective is to take a list of words and return a set of sets of words where each set of words contains two or more words from the original list that are anagrams of each other.

For example starting from this list

user=> (def s ["meat" "mat" "team" "mate" "eat"])
#'user/s
user=> s
["meat" "mat" "team" "mate" "eat"]


We want to get this:-

#{#{"mate" "meat" "team"}}

Now, strings in Clojure are sequences and therefore they can be sorted.  If two words are anagrams then when you sort them they return the same list of letters.  So the quick test is #(= (sort %1) (sort %2))

user=> (#(= (sort %1) (sort %2)) "meat" "team")
true
user=> (#(= (sort %1) (sort %2)) "meat" "vale")
false


To rearrange our source list we call upon the built-in list function group-by, which gathers together items from a list that return the same value to a given function.  For our purposes this function required is just sort, because we want to group together words with the same letters:-

user=> (group-by sort s)
{(\a \e \m \t) ["meat" "team" "mate"], (\a \m \t) ["mat"], (\a \e \t) ["eat"]}


Well, here we get back a complete map.  We don't need the keys to the map - that is to say the actual list of letters that each word contains - we just want the values, the lists of matching words: the map function vals does this:-

user=> (vals (group-by sort s))
(["meat" "team" "mate"] ["mat"] ["eat"])


But I do not want to see ["mat"] nor ["eat"] in the output because we are asked for sets only where there are two or more anagrams.  So I am going to filter these out, and the required predicate - well, seq will eliminate empty lists, and I can combine this with rest to eliminate lists with only one element - so my predicate is going to be (comp seq rest):-

user=> (filter (comp seq rest) (vals (group-by sort s)))
(["meat" "team" "mate"])


The requirements ask for a set of sets, so I will map the set creator function set into my returned sequence and apply it to the whole list, so:-

user=> (set (map set (filter (comp seq rest) (vals (group-by sort s)))))
#{#{"mate" "meat" "team"}}


So the function than answers the problem is

#(set (map set (filter (comp seq rest) (vals (group-by sort %)))))

Hmm. The disturbing thing about that code is that because it is just a sequence of faceless off-the-shelf built in functions it offers no clue as to what it does. You could be tempted to rewrite to use some local functions with descriptive names - not to change the function but to make the code less Sphinxlike.  Maybe

(defn anagram-set [word-list]
  (let
    [anagram sort,
    two-or-more (comp seq rest)]
    (set (map set (filter two-or-more (vals (group-by anagram word-list)))))))
     

Or something.  I don't know.