Multiword expression lookup: multiset subset retrieval
2024-03-23

I've recently spent quite a bit of time thinking about how to find multiword expressions (MWEs) in a sentence. MWEs are a pretty messy topic and there is a lot of ambiguity about what even counts as an MWE, but for today I want to put that aside and talk about approaches to automatically identifying MWEs. I am a fan of lexicon-based approaches to MWE identification, which just means that given a very large list of MWEs, you are trying to figure out which of them might be present in a given sentence. This can be broken down into a pipeline that looks something like this:

  1. Retrieve all of the MWEs that could be present in a sentence (the "possible MWEs") from the lexicon; this can also be thought of as filtering the lexicon down to just entries whose constituents are all present in the sentence. The majority of this blog post will be about how to do this efficiently, because with a poorly structured lexicon this can be quite slow.
  2. Gather all combinations of constituent words which could form a possible MWE in the sentence as "candidates"; this just means finding each combination of words in the sentence that corresponds to a possible MWE. We will cover how to do this at the end of the post.
  3. Decide if each "candidate" is actually an MWE - that is, whether its constituents take on an idiomatic/non-compositional meaning. This requires a system capable of making judgements about meaning in context, which typically means machine learning. I published a paper last year about one possible method to do this, but there are a variety of possible approaches, which are beyond the scope of this blog post.
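
Sketched as code, the three steps above might look roughly like the skeleton below. The function and parameter names are placeholders I'm using for illustration; the rest of this post is about implementing steps #1 and #2.

# illustrative skeleton only - the lexicon format, helper names, and
# classifier are stand-ins, not a real implementation
def find_mwes(sentence_words: list[str], lexicon, classifier) -> list:
    possible = retrieve_possible_mwes(sentence_words, lexicon)        # step 1
    candidates = [
        (mwe, word_group)
        for mwe in possible
        for word_group in candidate_word_groups(sentence_words, mwe)  # step 2
    ]
    return [c for c in candidates if classifier.is_idiomatic(c)]      # step 3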

Example sentence

For the above sentence, these three steps look like this:

  1. Retrieve run_down, run_over, fall_down, fall_over as possible MWEs - this is every MWE in our lexicon with all of its constituents present in the sentence.
  2. Find candidates for these MWEs by mapping each of them to groups of words in the sentence, as pictured in the above diagram.
  3. Filter these so that we keep only the candidates whose meaning is that of the relevant MWE. fall_down and run_over are obviously wrong, and run_down as an MWE means something like (of a vehicle) to hit a person and knock them to the ground, so we are left with fall_over.

Note that there are sometimes multiple candidate word groups for a single MWE. For example, if we replace the last over with down, giving I ran down the stairs and fell down, there are now two combinations of words that can form run_down - one for ran and the first down, and another for ran and the second down. This is also why it is convenient to keep steps #1 and #2 separate.

Retrieving possible MWEs

Now, the main topic: retrieving possible MWEs for step #1. While some MWEs have constraints on how they can be formed in a sentence, if we include verbal MWEs then there are very few guarantees. They do not have to be contiguous - see put_down in She put her beloved dog down - and worse, they do not even have to be in order - see the beans have been spilled for spill_the_beans. Finally, the constituent words of an MWE are not always unique, such as in face_to_face.

Given that constituent words are neither required to be in order nor unique, our possible-MWE retrieval problem can be formalized as: given the multiset S of words in the input sentence, and a set L containing one multiset of constituent words per MWE in the lexicon, find all members of L that are submultisets of S (subsets with multiplicity taken into account).
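
Concretely, for a single MWE this is just a multiset subset test, which Python's Counter supports directly from version 3.10 onward (the explicit all(...) check in the next snippet does the same thing on older versions):

from collections import Counter

# lemmas of "I ran down the stairs and fell over ."
S = Counter(["I", "run", "down", "the", "stairs", "and", "fall", "over", "."])
run_down = Counter(["run", "down"])

print(run_down <= S)  # True: run_down's constituents are a submultiset of S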

MWE retrieval equation

This means a worst case runtime of O(M * |L|) where M is the average size of an MWE multiset, which is a pretty expensive upper bound. The naive approach of checking if every MWE in the lexicon is a subset of the words in the sentence will end up processing every MWE multiset for every sentence, and is consequently very slow.

from collections import Counter

class NaiveApproach:
    def __init__(self):
        # get_mwes() loads the lexicon; each entry has a lemma such as "run_down"
        # and its constituent words, e.g. ["run", "down"]
        self.data = [
            (mwe['lemma'], Counter(mwe['constituents']))
            for mwe in get_mwes()
        ]

    def search(self, words: list[str]) -> list[str]:
        word_counter = Counter(words)
        return [
            mwe for mwe, constituents in self.data
            if all(
                word_counter[constituent] >= count
                for constituent, count in constituents.items()
            )
        ]

This code takes an average of 28 seconds on my laptop to process (call search() on) 1,000 sentences. Fortunately, we can make this much faster using a trie[1]. Tries are prefix trees most commonly built out of characters, but because we are dealing with words and not characters, we will build ours out of words.

MWE trie

Using the MWE trie as our lexicon, we can gather possible MWEs with a depth-first search starting at the root, which aborts when we hit a node for a word missing from the sentence. That is, we can traverse only the parts of the trie that are subsets of the words in the sentence.

from typing import Optional

class TrieNode:
    __slots__ = ['lemma', 'children']

    def __init__(self, lemma: Optional[str]):
        # lemma is the MWE that terminates at this node, if any
        self.lemma = lemma
        self.children = {}


class Trie:
    def __init__(self):
        self.tree = self._build_tree(get_mwes())

    def _build_tree(self, mwes: list[dict[str, str]]):
        root = TrieNode(None)
        for mwe in mwes:
            curlevel = root
            for word in mwe['constituents']:
                if word not in curlevel.children:
                    curlevel.children[word] = TrieNode(None)
                curlevel = curlevel.children[word]

            curlevel.lemma = mwe['lemma']

        return root

    def search(self, sentence: list[str]) -> list[str]:
        counter = Counter(sentence)
        results = []
        self._search(self.tree, counter, results)
        return results

    def _search(self, cur_node: TrieNode, counter: Counter, results: list):
        possible_next_constituents = [c for c in counter if counter[c] > 0 and c in cur_node.children]

        for constituent in possible_next_constituents:
            next_node = cur_node.children[constituent]
            counter[constituent] -= 1
            if next_node.lemma is not None:
                results.append(next_node.lemma)
            self._search(next_node, counter, results)
            counter[constituent] += 1
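
As a quick usage sketch - assuming, as in the earlier snippet, that get_mwes() returns entries shaped like {'lemma': 'run_down', 'constituents': ['run', 'down']} - searching looks like this:

trie = Trie()

# lemmatized words of "I ran down the stairs and fell over ."
print(trie.search(["I", "run", "down", "the", "stairs", "and", "fall", "over", "."]))
# something like ['run_down', 'run_over', 'fall_down', 'fall_over'],
# with the order depending on traversal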

This allows us to store only a single copy of any prefixes shared between multiple MWEs in our lexicon, but the main benefit is that searching this way means we will expend no compute on MWEs whose first word is not present in the sentence. This is much faster, and gets through 1,000 sentences in 0.8 seconds on average. However, we can still make it a little faster.

Word frequency in English is very imbalanced, and many MWEs start with common words. For example, my relatively small lexicon has 169 MWEs starting with in, such as in_theory, in_unison, in_vain, etc. Since we only want MWEs whose constituents are all present in the sentence, it makes more sense to look at the words least likely to be present first - that is, the lowest frequency words. We can do this by sorting the constituent words in the MWEs before we insert them into the trie using precomputed word frequency, such that the lowest frequency words come first. This does mean that in rare cases where MWEs share the same words and are differentiated only by order (like roast_pork and pork_roast) we will need to attach multiple MWEs to one node in the trie, but this requires only minor changes.

from collections import defaultdict

class OrderedTrie:
    # not pictured here:
    # 1) OrderedTrieNode holds a list of lemmas instead of a single lemma
    # 2) _search needs one line changed to return all lemmas on a node
    # (a sketch of both changes follows this block)

    def __init__(self, word_data: dict[str, int]):
        # word_data maps each word to its position in a frequency-ranked list;
        # any missing words are treated as last in the frequency list
        self.word_freqs = defaultdict(lambda: len(word_data), word_data)
        self.tree = self._build_tree(get_mwes())

    def _reorder(self, words: list[str]) -> list[str]:
        # sort so that the lowest-frequency (highest-rank) words come first;
        # the word itself is a tiebreaker so the order stays deterministic
        # even when several words are missing from word_freqs
        return sorted(words, key=lambda w: (self.word_freqs[w], w), reverse=True)

    def _build_tree(self, mwes: list[dict[str, str]]):
        root = OrderedTrieNode([])
        for mwe in mwes:
            curlevel = root
            for word in self._reorder(mwe['constituents']):
                if word not in curlevel.children:
                    curlevel.children[word] = OrderedTrieNode([])
                curlevel = curlevel.children[word]

            curlevel.lemmas.append(mwe['lemma'])

        return root
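
The pieces "not pictured" above might look something like this - a sketch rather than my exact code, but it shows the two changes the comment describes:

class OrderedTrieNode:
    __slots__ = ['lemmas', 'children']

    def __init__(self, lemmas: list[str]):
        # more than one MWE can terminate at the same node (e.g. roast_pork / pork_roast)
        self.lemmas = lemmas
        self.children = {}


# inside _search, instead of:
#     if next_node.lemma is not None:
#         results.append(next_node.lemma)
# we now do:
#     results.extend(next_node.lemmas)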

Using this sorted constituent trie approach, it takes only 0.5 seconds on average to process 1,000 sentences, which is about a 40% speedup over the aforementioned trie. The average time for each of the three methods can be seen in the graph below (log scale).

Average time by method

Moving from the naive approach to using a trie is arguably a fairly obvious optimization; I think the interesting part is the further speedup we get from using word frequency to inform trie construction. More importantly, it's a good demonstration of how much it can help to have a solid understanding of the data and domain you are trying to process: this further speedup was only possible because of thinking about what the distribution of the input data (words in English sentences) looks like.

Mapping retrieved possible MWEs to candidate word groups

Now that we know how to retrieve our possible MWEs, let's look briefly at step #2: finding every combination of words in the sentence that could constitute a given MWE. For the sentence I ran down the stairs and fell down and the MWE run_down, we start by building simple representations of our sentence as tokens and our MWE as a multiset.

from collections import namedtuple, defaultdict
from itertools import combinations, product

token = namedtuple("Token", ["form", "idx", "lemma"])
sentence = [
    token("I", 0, "I"),
    token("ran", 1, "run"),
    token("down", 2, "down"),
    token("the", 3, "the"),
    token("stairs", 4, "stairs"),
    token("and", 5, "and"),
    token("fell", 6, "fall"),
    token("down", 7, "down"),
    token(".", 8, "."),
]

# build a map of lemmas to tokens
# so we can look up tokens by their lemma
lemma_to_tokens = defaultdict(list)
for t in sentence:
    lemma_to_tokens[t.lemma].append(t)

# mwe: "run_down"
lemma_counter = {
    "run": 1,
    "down": 1,
}

The next part is confusing to look at, but what we're doing isn't actually that complicated. We represent the token choices for each lemma in the MWE as lists of tuples, and want to gather all possible options. This is just N choose K for each lemma, where N is the number of times the given lemma appears in the sentence and K is the number of times it appears in the MWE. These tuples will usually contain only one element, except in MWEs that have repeated constituents, such as face in face_to_face.

candidate_word_combos = [
    list(combinations(lemma_to_tokens[lemma], lemma_counter[lemma]))
    for lemma in lemma_counter
]

Running this on our example input gives us:

[
    [
        (Token(form='ran', idx=1, lemma='run'),)
    ], 
    [
        (Token(form='down', idx=2, lemma='down'),), 
        (Token(form='down', idx=7, lemma='down'),)
    ]
]

Finally, we take the cartesian product of these lists of tuples, and unpack the tuples. Because each tuple represents one way of choosing tokens for a given lemma, this effectively enumerates all combinations of ways to choose words for each lemma, which gives us our original objective - every combination of words that could constitute this MWE. To finish, we sort each result so that its tokens are in sentence order.

mwe_combinations = {
    tuple(x for y in p for x in y) 
    for p in product(*candidate_word_combos)
}

sorted_mwe_combinations = [
    sorted(raw_combo, key=lambda t: t.idx) 
    for raw_combo in mwe_combinations
]

The final result:

[
    [
        Token(form='ran', idx=1, lemma='run'), 
        Token(form='down', idx=2, lemma='down')
    ], 
    [
        Token(form='ran', idx=1, lemma='run'), 
        Token(form='down', idx=7, lemma='down')
    ]
]
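
Putting step #2 together, a small helper along these lines (reusing the Token namedtuple from above) covers the general case, including MWEs with repeated constituents:

from collections import Counter, defaultdict
from itertools import combinations, product

def candidate_word_groups(sentence_tokens, mwe_constituents):
    # map each lemma in the sentence to the tokens that have it
    lemma_to_tokens = defaultdict(list)
    for t in sentence_tokens:
        lemma_to_tokens[t.lemma].append(t)

    lemma_counter = Counter(mwe_constituents)

    # if any constituent lemma doesn't appear often enough, there are no candidates
    if any(len(lemma_to_tokens[lemma]) < count for lemma, count in lemma_counter.items()):
        return []

    combos = [
        list(combinations(lemma_to_tokens[lemma], count))
        for lemma, count in lemma_counter.items()
    ]
    return [
        sorted((tok for group in p for tok in group), key=lambda t: t.idx)
        for p in product(*combos)
    ]

# candidate_word_groups(sentence, ["run", "down"]) returns the two candidates above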

 


[1]

Note that while the trie-based approach runs much faster on average, its theoretical worst case runtime is the same as the naive approach. However, getting anywhere near this upper bound with the trie would require a sentence containing most or all of the MWEs in the lexicon, which is not realistic.

My article for Seiken News
2021-05-09

The Institute of Industrial Science, where I am currently doing research, publishes a newsletter called Seiken News every two months. I had the opportunity to write an article for the April issue's "PROMENADE" section, which is reserved for researchers from abroad. This was my first time writing anything substantial in Japanese, so I thought it was worth holding on to. The article primarily discusses my arrival in Japan and my life since then, but if you're interested, you can read it in "Seiken News #189" at the link provided below.

Incidentally, the proverb written in both Japanese and Yiddish at the beginning of the article is one I originally learned in English as "Man makes plans, and God laughs" from my parents. Only during the research for this article did I discover that it's originally a Yiddish proverb. I of course knew that my ancestors were Eastern European Jews, but I was surprised to find a direct Yiddish translation in the English I normally use. Made me feel a bit closer to my roots.

You can view the article here

Categorizing grammatical error explanations
2021-04-12

I recently worked on a research project attempting to group grammatical errors into categories that could be explained similarly, and develop a process to automatically generate explanations for a particular category. This ultimately didn't pan out as a research idea, but the time I spent working on it gave me some perspective on how one might categorize errors for the purpose of explaining them. This post will be about some of the categorizations I settled on for trying to explain errors, and what I found interesting about them - it will not include technical implementation details from said research, although I will link to a few papers and other outside resources. A couple disclaimers up front:

  1. These categories are subjective, and only really useful as a way to group together errors that can be explained similarly.
  2. There is very rough information about the distribution of these categories at the end of the post, but they are mostly presented in the order I thought would be enjoyable to read, not from most to least common.
  3. A post covering every possible type of error would be too long for most people to read, and too long for me to write, so this post won't attempt to do that.
  4. These categories aren't limited to English, but the examples are English and some categories may be primarily applicable to English.

How are errors normally classified?

There isn't really a standard set of categories for grammatical errors, so categorizations can vary wildly depending on the goals of the project. The closest thing that the natural language processing community has to a standard set of categories is provided by an automatic classification tool called ERRANT, which categorizes errors in pairs of erroneous and corrected sentences based on the content that was changed in the corrected version.
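
For reference, the errant Python package can be used along these lines - this is based on my memory of its documented quick-start, so the exact interface may differ between versions:

import errant

annotator = errant.load("en")
orig = annotator.parse("This are a error .")
cor = annotator.parse("This is an error .")

for edit in annotator.annotate(orig, cor):
    # each edit carries the original span, the corrected span, and an error type
    print(edit.o_str, "->", edit.c_str, edit.type)
# prints something like: are -> is  R:VERB:SVA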

ERRANT classification examples

ERRANT is a great tool for analysis, but its categories are not exceptionally useful for explaining to someone why their error is an error, and it classifies about 20% of errors as OTHER[1]. Note that this prevalence of OTHER type errors is not a reflection of some kind of failure on ERRANT's part; some portion of errors simply do not fit cleanly into any kind of category.

The theoretical linguistics community has a wider variety of categorizations to offer, although most are still aimed at analysis, and I don't know of any that specifically target being comprehensible for language learners or are widely used in a teaching context. If you do, please let me know.

Some errors are beyond help

As the prevalence of ERRANT's OTHER category hinted at, some errors are simply not classifiable or explainable. This is most often the case when the content produced by a learner is so garbled that it looks near random, and restoring it to a correct state would require something closer to a rewrite. Below are some excerpts taken from publicly available grammatical error correction data[2]:

Unfortunately I had ever been in the US so I have any idea about how much money I will probably need and about what kind of clothes I have to take with me.

Regarding the programme you have organised, it is great ideas everything that you have planed, but I would like to suggest to you something that the students saw in an advertisement.

How would you explain what's wrong with the above sentences? The first seems to be a case of the author dropping negations in places where they are necessary, but not in any way that fits cleanly into a category: ever should be never, have should be don't have, and so on. The second sentence is surprisingly comprehensible despite being a grammatical mess, but the stretch around it is great ideas everything doesn't reduce cleanly to a specific type of error. That said, there are plenty of sentences with errors that are quite a bit more amenable to classification.

As my research focused on explaining errors to language learners, the classifications I present below will be heavily focused on the reasons behind an error, such as what specific rule it breaks or why the learner might have made the mistake. These classifications are not necessarily intended to be explanations in and of themselves, but each category represents a group of errors that can be explained similarly. Additionally, many errors can arguably fit into more than one category.

Semantic errors

These are my favorite kind of error, partly because in the strict sense they often have nothing to do with actual grammar, and partly because they can be pretty fun. This category can really be summarized as: "you produced (mostly) valid output that doesn't communicate your intended meaning".

Let's look at an example:

Interestingly, nowadays, I live in the scientific world which I read in the book.

This sentence obviously looks weird, but what's wrong? Everything looks fine grammatically - we get a reasonable sentence by just switching out one transitive verb for another.

Interestingly, nowadays, I live in the scientific world which I saw in the book.

This is still slightly unnatural, mostly due to article use, but we will come back to articles later. The point I want to make is that the fundamental problem in the original sentence is in the phrase the scientific world which I read in the book. This should be even more obvious if we take it from a relative clause to a normal English sentence: I read the scientific world in the book. The English verb read takes a direct object, as in I read a magazine, but there is no sensible interpretation of read a world. The author of this sentence was just one "about" away from what they likely wanted to say, which was the scientific world which I read about in the book.

Here are a couple more examples of sentences with semantic errors:

I would prefer to live in log cabins because I think they are more comfortable than tents.

She has to do it , and is a sacrifice .

If I look back in the past I can find that computer is following the same street of television, telephon and a lot of other things...

The semantic issues, in order, are:

  1. Living in multiple log cabins at once
  2. The subject of the sentence being a sacrifice, when the intended meaning is that the action is a sacrifice
  3. follow the same street doesn't have the idiomatic meaning in English that the author is looking for, in the way that follow the same route would

Word choice and native language interference

A lot of semantic errors boil down to choosing the wrong word - e.g. read instead of saw, or street instead of route. Oftentimes even if the learner knows of the correct word, they don't know that it's the correct word for their specific sentence, and they end up making the choice using information from their first language. Concretely, this usually means taking the word with the intended meaning in their first language and using its first available translation. Let me give an example.

The meaning of the Japanese word おもしろい (omoshiroi) encompasses the standard uses of both funny and interesting in English. While these are sufficiently basic words that most learners know better, if you see a native Japanese speaker say something like the comedy was so interesting or the documentary was so funny, there's a good chance that the root of the problem is that both these words map back to おもしろい and they picked the wrong translation. In order to be able to identify errors of this particular type you obviously need to speak the native language of the learner whose errors you're correcting, but if you can identify the word in the learner's native language which has been (mis)translated into the incorrect word it can make explaining the problem much easier (and also may help clarify whether something is actually expressing the intended meaning or not).

Graph of おもしろい mistranslation

This kind of native language interference based on mistranslation seems like the kind of thing that should be fairly accessible to automated approaches as well, although I can only find one paper in that vein.

Words that don't play well together

This category lumps together a fairly wide range of grammatical phenomena that arguably don't have a lot to do with each other, except that they can all be explained as an incompatibility between a small number of words or phrases.

The prototypical example of this in English is the subject/verb agreement error. English requires present tense verbs to match the plurality and/or person of their subject; this is the reason the dog runs is fine, but the dogs runs is not.

SVA error example

Tense agreement is another good one; English often doesn't like it when verbs in a subordinate clause are in a different tense than the verb that governs that clause. In simple terms, this is why you can't say I thought that I will go to the store. This phenomenon also extends to sequences of verbs - switching tense partway through a sequence of verbs doesn't work in English.

tense error example

This could even include cases where coreferring nouns and pronouns mismatch, such as plural subjects and singular pronouns like My friends took his dog home, although these can often also be explained as semantic errors.

Case frames

Oftentimes, the words that can reasonably be used with a given word are not defined by broad grammatical rules like subject/verb agreement, but by the word itself. Linguistics - in particular, semantics - has a concept for this, which it calls case frames. Wikipedia has some relatively brief articles on the topic, but the idea is that a given word (usually a verb) has a finite set of "frames" that it can invoke, each with its own meaning and information about what words can reasonably be used along with that word. This is all very abstract, so let's look at an example (from this paper):

  • afford: V to-inf (afford to pay/miss)
  • afford: V n (afford a lawyer/attorney/car/house)
  • afford: V n n (afford them the opportunities/protection)

The V is standing in for the actual verb afford here (because it could be conjugated, e.g. afforded), but what you should take from this is that afford can be used with the following:

  1. A to-infinitive verb.
  2. A single noun as a direct object.
  3. A noun direct object and another noun indirect object.

Arguably some other parts of speech in English can also have case frames, such as good <at> <ing-verb> ("good at running"). All that said, how is this all relevant to explaining grammatical errors? The answer is that a very large number of errors that would otherwise defy explanation can be reduced to violations of a case frame. For example:

There were so many seats that it took long to finish clearing .

This is perfectly understandable, but sounds weird because the appropriate case frame for take (as well as all the other case frames for take) doesn’t include an adjective. Instead, this take requires a noun - its direct object. Other examples of case frame violations include things like I am interesting in..., or ...asking you a full refund. interesting has no case frames that accept any other words, and while ask has a case frame that takes a direct object like ask a question, it’s nonsensical here and should be ask for.

I am not suggesting that every learner should receive a lengthy explanation of case grammar, but I do think that it is a good way to formalize a lot of word incompatibility issues. Even if the term "case frame" never comes up, showing a learner whose mistake has violated a case frame a set of examples from the common case frames for that word is a quick and easy way to communicate acceptable usage. Also, case frames are deeply tied in with preposition usage - the preposition(s) following a verb typically reflect a specific case frame - and at least from the data we have publicly available, preposition errors account for about one in ten errors[1].

Case frames can be fairly easily computed from a large corpus of text, so there has been some work on using automated methods to extract correct case frames and then using them to explain errors to learners[3]. Kyoto University also has a great database of Japanese case frames, which can be searched here, although I know of no equivalent resources for English.
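
As a rough illustration of the extraction idea (not the method used in the work cited above), you can pull crude syntactic frames out of a dependency-parsed corpus with something like spaCy. Everything below - the toy corpus, the choice of dependency labels, the frame representation - is just for demonstration:

from collections import Counter
import spacy

# requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

corpus = [
    "We cannot afford a lawyer.",
    "She afforded them the opportunity to respond.",
    "He could not afford to pay the rent.",
]

frame_counts = Counter()
for doc in nlp.pipe(corpus):
    for tok in doc:
        if tok.pos_ == "VERB":
            # the dependency labels of the verb's complements stand in
            # for a (very crude) syntactic case frame
            frame = tuple(sorted(
                child.dep_ for child in tok.children
                if child.dep_ in {"dobj", "dative", "xcomp", "prep"}
            ))
            frame_counts[(tok.lemma_, frame)] += 1

print(frame_counts.most_common())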

A full explanation of the possible applications of case frames is beyond the scope of this post, but it's worth noting that while the frames here contain syntactic information about acceptable parts of speech and verb forms, the original idea behind case frames was focused on semantic information, such as the type of object that is semantically acceptable for a verb (e.g. the object of drink is typically a liquid). This gets fairly subjective fairly quickly, but if you like semantic case frames, they can explain quite a few errors that would otherwise fall into the Semantic errors category above. The first example about reading a world is a pretty clear semantic case frame violation. A great resource for semantic case frames is FrameNet.

Determiners and mass nouns

This is the one category which corresponds nearly 1:1 with an existing ERRANT category. Determiner errors are a hard problem - they account for approximately one in ten errors[1] - and unfortunately, they are often also hard to explain. Note that determiners include possessives like your, quantifiers like all, and a few other things, but the main offenders here are really the articles the and a.

Let's return to a previous example:

If I look back in the past I can find that computer is following the same street of television, telephon and a lot of other things...

There are two determiner problems here:

  1. There should be a the after the first that, for I find that the computer. As is, that computer reads as a demonstrative plus a noun, making computer the direct object of find, and while find has a case frame that accepts a direct object, it's not the right one here.
  2. Either television or telephone needs a the before it. Normally when listing nouns, the comes at the beginning of the list, but it can go either way here because television is a mass noun.

I bring up mass nouns here because while many determiner errors are resistant to any kind of reasonable explanation, one of the common mistakes that's relatively straightforward to explain is that mass nouns like water, information, etc. don't take a in front of them (and often take no article at all). Outside of that, this is a category of errors that I often find myself unable to explain, so I don't have any sage advice here.

Most explanations reduce to something along the lines of "the is appropriate for referencing specific things, while a is for non-specific things", but I've yet to see one that doesn't break down, or at least stretch pretty thin, for edge cases. If you can confidently justify why the is more appropriate than a in the below sentence[4] without resorting to "that's just how English is spoken", then you know something I don't. It sure seems to me like there's more than one forecast for tomorrow, and the person asking the question isn't asking about one in particular.

What's the forecast for tomorrow?

The last thing I will mention is that in regard to article errors in particular, their frequency can depend heavily on whether the learner's native language includes them or not.

Spelling errors

This is a fairly trivial category, but I included it because it's such a common one that it felt negligent to leave it out. In the overwhelming majority of cases, all you can really do with spelling errors is provide the correct spelling for the misspelled word and, if you're teaching English, apologize for the incredibly inconsistent spelling. That said, there are occasionally errors that look like misspelled words but have something more interesting going on.

As they know about your interestings and personality , it is easy to help you .

interestings isn't a word in English, but it sure looks like it could be. One way to look at this is that the person writing this has correctly spelled a word that just happens not to exist, and might benefit more from an explanation of why that word doesn't exist (interesting is an adjective and has no plural) than a simple spelling correction.

In closing

The error examples here were all marked as errors by professional annotators in publicly available data, but it's not just categorizations of errors that are subjective - even native speakers can disagree about what constitutes an error at all. Additionally, some things marked as errors really just boil down to nonstandard language usage or unusual collocations. fast car is arguably preferable to quick car only because the former is more common, and musical show only sounds weird to native speakers because English has a noun musical that means the same thing.

I promised that I would provide rough information about the distribution of these error types at the end of the article, so below is a table showing the percentage of erroneous sentences that contained each type of error when I annotated a couple hundred sentences at random. This is not the percentage of errors - that would probably be more useful, but this was much easier to annotate and I was in a hurry - and since categories can overlap, it's not going to sum to 100%.

Error Category                         Present in % of examples
Semantic error                         24%
Words that don't play well together    25%
Determiners and mass nouns             19%
Spelling errors                        22%
Unclassifiable                         15%

Thanks for reading until the end. I hope that this was interesting, if not useful.


[2]

Example sentences are sampled from the BEA-2019 FCE v2.1 and NUCLE datasets.

[4]

Unlike other examples, I made this one up.