Generating compound words for artificial languages
2025-02-16

One of the things that struck me about learning Japanese is that once you know a decent number of kanji, there are a lot of words you can read and understand without ever having seen them before. This is because kanji often convey specific meanings: the word for building (建物) is build+thing, confidence (自信) is self+believe, etc. In some cases you can even guess the kanji of a word you hear for the first time, and get to its meaning that way.

There are cases in English where you can guess a word you've never seen before, but most of them are just instances of productive1 affixes like -ness as in kindness or re- in rewrite, or words with a recognizable Latin root. Arguably, the closest thing English has to words that are guessable from their kanji is compound words. Not all of these are easy to guess from their constituents, but some are: we have mailbox, firefighter, waterfall and more. This matters because easily guessable compound words reduce the effort required for memorizing new vocabulary. For example, Esperanto - the most widely used constructed language - makes extensive use of compounding as well as productive affixes (although we are only talking about compounding today).

The question now is whether there is a decent way to automate the process of generating compounds. This involves considerations about both individual compounds - the combined words have to be semantically related to the idea expressed by the compound in order to be guessable - and the vocabulary as a whole, since the more compounds a word is used in, the less useful it becomes as a hint. So, we need a method that can consider multiple criteria while searching for an optimal vocabulary of compounds.

As a search problem

Our starting point is a set of meanings that we want to represent with words (compound or not), which we will call ideas and represent with English words. Given this base set of ideas I, we can think of constructing a vocabulary as selecting a set of base words B from I (B ⊂ I) and expressing the remaining elements of I as compounds of two words from B2. This results in a set of compounds C, where each element is a tuple of an expressed idea and two base words: C = { (i, b1, b2) ∣ b1 ∈ B, b2 ∈ B, i ∈ I }.

For example, say that I was the three words below:

sky

water

rain

I would say that the optimal outcome here is to take sky and water as base words and express rain as skywater, giving us two base words and a single compound:

B = {sky, water}

C = {(rain, sky, water)}
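For a toy example like this, the whole vocabulary can be written out by hand; the snippet below is just to make the notation concrete and is not taken from the actual implementation.

# ideas, base words, and compounds for the toy example above
I = {"sky", "water", "rain"}
B = {"sky", "water"}
C = {("rain", "sky", "water")}  # (idea, base word 1, base word 2)

# every idea is either a base word or expressed by exactly one compound
assert I == B | {idea for idea, _, _ in C}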

Unfortunately, for most actual languages I is going to contain thousands or tens of thousands of elements3, making it infeasible to do this manually. However, building B and C from I can be thought of as a lengthy sequence of choices, where at each step we choose either to copy an element of I to B, or to express it as a compound of elements from B and add that combination to C. This means that while the decision space is very large, we can search it automatically. We just need some criteria to score possible results.

For example, a scoring function S(I, B, C) could try to:

  1. Maximize the quality of each compound in C
  2. Minimize the number of base words used in a large number of compounds
  3. Maximize the number of compounds to the extent possible without violating #24

The Monte Carlo Tree Search (MCTS) algorithm is best known for its use in game-playing AI, but it is usable for anything that can be modeled as a sequence of decisions, as long as you have a way to score outcomes. It seemed like a natural choice for this problem, although I'm not an expert in optimization methods. This page provides a good overview of MCTS, so I will cover only the basic ideas here. I will borrow the graphic they used though, because it communicates the core ideas pretty well.

MCTS graphic

MCTS essentially consists of four steps repeated iteratively.

  1. Selection: starting from the root, travel down the tree looking at node scores to find a promising leaf node.
  2. Expansion: generate a single child for that leaf node.
  3. Simulation: repeatedly generate children to build out a transient subtree of the aforementioned child node, up to completion or a maximum depth, and score it.
  4. Backpropagation: update the scores of each node along the path to the child node generated in step #2 based on the result of the simulation.
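To make those four steps concrete, here is a minimal generic sketch of the loop in Python. This is not the implementation used for this project: legal_moves, apply_move and score are stand-ins for whatever hooks the problem supplies, and c is the usual UCT exploration constant.

import math
import random


class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.untried = None   # moves not yet expanded (filled lazily)
        self.visits = 0
        self.total = 0.0


class MCTS:
    def __init__(self, legal_moves, apply_move, score, c=1.4):
        # the three callables are supplied by the caller; legal_moves(state)
        # should return a sequence of moves, apply_move(state, move) a new
        # state, and score(state) a number to maximize
        self.legal_moves = legal_moves
        self.apply_move = apply_move
        self.score = score
        self.c = c

    def iterate(self, root, rollout_depth=20):
        # 1. Selection: descend through fully expanded nodes via UCT
        node = root
        self._ensure_moves(node)
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: self._uct(node, ch))
            self._ensure_moves(node)

        # 2. Expansion: create a single child for one untried move
        if node.untried:
            move = node.untried.pop()
            child = Node(self.apply_move(node.state, move), parent=node)
            node.children.append(child)
            node = child

        # 3. Simulation: random rollout from the new node, then score it
        state = node.state
        for _ in range(rollout_depth):
            moves = self.legal_moves(state)
            if not moves:
                break
            state = self.apply_move(state, random.choice(moves))
        result = self.score(state)

        # 4. Backpropagation: update every node on the path back to the root
        while node is not None:
            node.visits += 1
            node.total += result
            node = node.parent

    def _ensure_moves(self, node):
        if node.untried is None:
            node.untried = list(self.legal_moves(node.state))

    def _uct(self, parent, child):
        exploit = child.total / child.visits
        explore = self.c * math.sqrt(math.log(parent.visits) / child.visits)
        return exploit + explore


# Typical usage: root = Node(initial_state); call mcts.iterate(root) many
# times, then follow the most-visited children to read off the best path.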

The idea is that if you have a good scoring function and do a decent job tuning hyperparameters to balance exploration and exploitation, MCTS will find a path down the tree that leads to a high-scoring state. In our case, that means a good compound vocabulary, built up by the choices made at each node along the path. To keep the implementation simple, I only tried to optimize compound selection and creation, fixing the target number of generated compounds as a percentage of the size of I. That is, I only tried to optimize #1 and #2 of the criteria from the end of the previous section. Consequently, each node in the MCTS tree corresponded to selecting one as-yet-unused idea from I, choosing two base words to combine, and adding the resulting compound to C.

Generating and Evaluating Compounds

In order to make sure that the words combined to generate compounds were semantically related to the idea that they represented, I used ConceptNet to gather candidates for words to combine. ConceptNet is a knowledge graph connecting English word nodes with edges representing semantic relationships; traversing the edges connected to the idea for a compound to find base words to combine ensures there is some kind of semantic relationship between the idea expressed by the compound and the words that compose it. That is, we want something self-explanatory like firefighter and not something like honeymoon.

Concept net

Concretely, each time a compound is generated, an idea is chosen randomly from the as-yet-unused ideas in I, and the words to combine into a compound are chosen randomly from the subset of ConceptNet candidates that are also in I. Any words chosen this way are added to B. I experimented with using scoring to choose better child nodes during expansion and simulation, but this dramatically slowed down processing; since scoring can in theory be done only at the end of simulation, with backpropagation taking care of individual node scores, I scored only at the end of each simulation.
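To make the generation step concrete, here is a sketch of what expanding a single child might look like. A state here is just (unused ideas, B, C), and candidates_for is assumed to be a precomputed mapping from each idea to its ConceptNet-related words that are also in I; these names are illustrative rather than taken from the actual code.

import random


def generate_child(state, candidates_for):
    # state: (unused ideas, base words B, compounds C), all as sets
    unused_ideas, base_words, compounds = state

    idea = random.choice(sorted(unused_ideas))
    candidates = [w for w in candidates_for.get(idea, []) if w != idea]
    if len(candidates) < 2:
        return None  # no usable ConceptNet candidates for this idea

    b1, b2 = random.sample(candidates, 2)
    return (
        unused_ideas - {idea},
        base_words | {b1, b2},        # any chosen words become base words
        compounds | {(idea, b1, b2)},
    )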

Developing a good scoring function for evaluating word combinations as compound words is a difficult task on its own, so I opted for the simplest thing I could think of: word vector cosine similarity between the words used in the compound and the word representing the idea. That is, the score was the average of the cosine similarities for the two base words and a relation score based on manually assigned values for the types of relations in ConceptNet (i.e. x is a y should score higher than x desires y, etc.). So we end up with score(compound) = (similarity(b1, i) + similarity(b2, i) + relationScore) / 3.
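In code, that per-compound score looks roughly like the sketch below, where word_vectors is assumed to be a precomputed mapping from words to vectors and relation_scores holds the manually assigned value for each ConceptNet relation type; both names are illustrative.

import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def compound_score(idea, b1, b2, relation, word_vectors, relation_scores):
    # (similarity(b1, i) + similarity(b2, i) + relationScore) / 3
    sim1 = cosine(word_vectors[b1], word_vectors[idea])
    sim2 = cosine(word_vectors[b2], word_vectors[idea])
    return (sim1 + sim2 + relation_scores[relation]) / 3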

MCTS Scoring function

The scoring function used for MCTS needs to be able to score states resulting from a given path down the tree, as opposed to specific nodes or compounds. In this case, a state is just a set of generated compounds C, along with counts for each base word in B, where we define base words as any words used in compounds. I tried to optimize the aforementioned criteria #1 (compound quality) and #2 (avoiding base words used in too many compounds) by computing the score as the sum of the scores for the compounds divided by the sum of squares of the base word usage counts.

Scoring function
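As a sketch, assuming the per-compound scores have already been computed (compound_scores maps each (idea, b1, b2) tuple to its quality), the state score could be computed like this:

from collections import Counter


def state_score(compound_scores: dict[tuple[str, str, str], float]) -> float:
    # count how many compounds each base word appears in
    usage = Counter()
    for _idea, b1, b2 in compound_scores:
        usage[b1] += 1
        usage[b2] += 1

    quality = sum(compound_scores.values())                # criterion #1
    penalty = sum(count ** 2 for count in usage.values())  # criterion #2
    return quality / penalty if penalty else 0.0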

Results

As is likely obvious from my explanation up until now, I took some shortcuts with implementation in order to get to a working proof of concept. Consequently, between a fairly flimsy scoring function and not having enough compute on my laptop to run large numbers of MCTS iterations, I did not end up with something I would call a good solution to the problem. However, the process did yield a few outputs that I think are worth sharing. The code is available here for anyone interested in trying to do this better.

Cherrypicked Outputs

These are all actual outputs of the process, although the combinations are technically unordered.

  • segment+year = month
  • crime+theft = robbery
  • beach+edge = shore
  • act+wedding = marriage
  • cry+vegetable = onion
  • computer+storage = disk
  • air+crime = pollution
  • bottom+dress = skirt

Some of the combinations are kind of abstract, like act+wedding for marriage, but I quite like cry+vegetable for onion. In all cases though we can see that the compounds make some kind of sense for the ideas they represent; onions are vegetables that make you cry, a month is a segment of a year, etc. That said, most of the outputs are not this good - below are a few examples I don't think turned out very well.

  • class+senior = freshman
  • kitchen+meal = cook
  • chicken+male = hen

Like many of the other outputs, these three are made up of words that show up in similar contexts to the idea word but are not a good fit to express the intended idea, which is a symptom of using a scoring method based on word vectors (distributional semantics). Full results can be seen here.

Final Thoughts

A robust implementation of MCTS for generating compound words, including good scoring function(s), could probably be its own paper. I think to get something working well, you would at least need:

  • A good compound scoring function
  • An MCTS implementation that could also explore the number of ideas expressed as compounds
  • A lot of compute

You might also want:

  • Intermediate scoring so that child node generation in MCTS is smarter than completely random
  • More tools than just ConceptNet for finding words with semantic relations to the idea being expressed

I am not going to take this project all the way to a paper, but I do think it's an interesting project that has the potential to be useful for artificial language construction. If you happen to want to pick up the torch and flesh this out though, please don't hesitate to reach out.


1

Productivity is a linguistics term.

2

Obviously compounds of more than two words exist, but two-word compounds are much more common, and limiting ourselves to compounds of only two words helps keep the problem framing manageable.

3

The exception being Toki Pona, with only a little more than a hundred words.

4

There is a natural tension between minimizing base words used in many compounds and maximizing the number of compounds, because I is finite - the more ideas from I we express as compounds, the fewer base words we have to choose from for making any given compound.

Multiword expression lookup: multiset subset retrieval
2024-03-23

I've recently spent quite a bit of time thinking about how to find multiword expressions (MWEs) in a sentence. MWEs are a pretty messy topic and there is a lot of ambiguity about what even counts as an MWE, but for today I want to put that aside and talk about approaches to automatically identifying MWEs. I am a fan of lexicon-based approaches to MWE identification, which just means that given a very large list of MWEs, you are trying to figure out which of them might be present in a given sentence. This can be broken down into a pipeline that looks something like this:

  1. Retrieve all of the MWEs that could be present in a sentence (the "possible MWEs") from the lexicon; this can also be thought of as filtering the lexicon down to just entries whose constituents are all present in the sentence. The majority of this blog post will be about how to do this efficiently, because with a poorly structured lexicon this can be quite slow.
  2. Gather all combinations of constituent words which could form a possible MWE in the sentence as "candidates"; this just means finding each combination of words in the sentence that correspond to a possible MWE. We will cover how to do this at the end of the post.
  3. Decide if each "candidate" is actually an MWE - that is, whether its constituents take on an idiomatic/non-compositional meaning. This requires a system capable of making judgements about meaning in context, which typically means machine learning. I published a paper last year about one possible method to do this, but there are a variety of possible approaches, which are beyond the scope of this blog post.

Example sentence

For the above sentence, these three steps look like this:

  1. Retrieve run_down, run_over, fall_down, fall_over as possible MWEs - this is every MWE in our lexicon with all of its constituents present in the sentence.
  2. Find candidates for these MWEs by mapping each of them to groups of words in the sentence, as pictured in the above diagram.
  3. Filter these so that we keep only the candidates whose meaning is that of the relevant MWE. fall_down and run_over are obviously wrong, and run_down as an MWE means something like (of a vehicle) to hit a person and knock them to the ground, so we are left with fall_over.

Note that there are sometimes multiple candidate word groups for a single MWE. For example, if we replace the final over with down, giving I ran down the stairs and fell down, there are now two combinations of words that can form run_down - one for ran and the first down, and another for ran and the second down. This is also why it is convenient to keep steps #1 and #2 separate.

Retrieving possible MWEs

Now, the main topic: retrieving possible MWEs for step #1. While some MWEs have constraints on how they can be formed in a sentence, if we include verbal MWEs then there are very few guarantees. They do not have to be contiguous - see put_down in She put her beloved dog down - and worse, they do not even have to be in order - see the beans have been spilled for spill_the_beans. Finally, the constituent words of an MWE are not always unique, such as in face_to_face.

Given that constituent words are neither required to be in order nor unique, the formalization of our possible MWE retrieval problem is: given a multiset S of words in the input sentence, and a set L containing a multiset for each MWE in the lexicon, find all members of L that are submultisets of S (that is, contained in S with multiplicity).

MWE retrieval equation
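For a single MWE, that condition is easy to express with a Counter; the helper below only illustrates the multiset check and is not the retrieval code itself.

from collections import Counter


def is_possible(mwe_constituents: list[str], sentence_words: list[str]) -> bool:
    # every constituent must appear at least as many times as the MWE needs it
    need = Counter(mwe_constituents)
    have = Counter(sentence_words)
    return all(have[word] >= count for word, count in need.items())


# face_to_face needs "face" twice, so a single "face" is not enough
print(is_possible(["face", "to", "face"], "we met face to face".split()))  # True
print(is_possible(["face", "to", "face"], "I went to face him".split()))   # False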

This means a worst case runtime of O(M * |L|) where M is the average size of an MWE multiset, which is a pretty expensive upper bound. The naive approach of checking if every MWE in the lexicon is a subset of the words in the sentence will end up processing every MWE multiset for every sentence, and is consequently very slow.

from collections import Counter


class NaiveApproach:
    # get_mwes() loads the lexicon as dicts with 'lemma' and 'constituents' keys
    def __init__(self):
        self.data = [
            (mwe['lemma'], Counter(mwe['constituents']))
            for mwe in get_mwes()
        ]

    def search(self, words: list[str]) -> list[str]:
        word_counter = Counter(words)
        return [
            mwe for mwe, constituents in self.data
            if all(
                word_counter[constituent] >= count
                for constituent, count in constituents.items()
            )
        ]

This code takes an average of 28 seconds on my laptop to process (call search() on) 1,000 sentences. Fortunately, we can make this much faster using a trie1. Tries are prefix trees most commonly built out of characters, but because we are dealing with words and not characters, we will build ours out of words.

MWE trie

Using the MWE trie as our lexicon, we can gather possible MWEs with a depth-first search starting at the root, which aborts when we hit a node for a word missing from the sentence. That is, we can traverse only the parts of the trie that are subsets of the words in the sentence.

from collections import Counter
from typing import Optional


class TrieNode:
    __slots__ = ['lemma', 'children']

    def __init__(self, lemma: Optional[str]):
        # lemma represents a possible MWE that terminates at this node
        self.lemma = lemma
        self.children = {}


class Trie:
    def __init__(self):
        self.tree = self._build_tree(get_mwes())

    def _build_tree(self, mwes: list[dict[str, str]]):
        root = TrieNode(None)
        for mwe in mwes:
            curlevel = root
            for word in mwe['constituents']:
                if word not in curlevel.children:
                    curlevel.children[word] = TrieNode(None)
                curlevel = curlevel.children[word]

            curlevel.lemma = mwe['lemma']

        return root

    def search(self, sentence: list[str]) -> list[str]:
        counter = Counter(sentence)
        results = []
        self._search(self.tree, counter, results)
        return results

    def _search(self, cur_node: TrieNode, counter: Counter, results: list):
        possible_next_constituents = [c for c in counter if counter[c] > 0 and c in cur_node.children]

        for constituent in possible_next_constituents:
            next_node = cur_node.children[constituent]
            counter[constituent] -= 1
            if next_node.lemma is not None:
                results.append(next_node.lemma)
            self._search(next_node, counter, results)
            counter[constituent] += 1

This allows us to store only a single copy of any prefixes shared between multiple MWEs in our lexicon, but the main benefit is that searching this way means we will expend no compute on MWEs whose first word is not present in the sentence. This is much faster, and gets through 1,000 sentences in 0.8 seconds on average. However, we can still make it a little faster.

Word frequency in English is very imbalanced, and many MWEs start with common words. For example, my relatively small lexicon has 169 MWEs starting with in, such as in_theory, in_unison, in_vain, etc. Since we only want MWEs whose constituents are all present in the sentence, it makes more sense to look at the words least likely to be present first - that is, the lowest frequency words. We can do this by sorting the constituent words in the MWEs before we insert them into the trie using precomputed word frequency, such that the lowest frequency words come first. This does mean that in rare cases where MWEs share the same words and are differentiated only by order (like roast_pork and pork_roast) we will need to attach multiple MWEs to one node in the trie, but this requires only minor changes.

from collections import defaultdict


class OrderedTrieNode:
    __slots__ = ['lemmas', 'children']

    def __init__(self):
        # a node can now terminate multiple MWEs (e.g. roast_pork and pork_roast)
        self.lemmas = []
        self.children = {}


class OrderedTrie:
    # not pictured here: search/_search are the same as in Trie, except that
    # _search needs one line changed to collect all lemmas on a node

    def __init__(self, word_data: dict[str, int]):
        # word_data maps each word to its rank in a frequency list;
        # any missing words are treated as last in the frequency list
        self.word_freqs = defaultdict(lambda: len(word_data), word_data)
        self.tree = self._build_tree(get_mwes())

    def _reorder(self, words: list[str]) -> list[str]:
        # sort so the rarest (highest-rank) words come first, using the word
        # itself as a deterministic tiebreak for words missing from word_freqs
        return sorted(words, key=lambda w: (self.word_freqs[w], w), reverse=True)

    def _build_tree(self, mwes: list[dict[str, str]]):
        root = OrderedTrieNode()
        for mwe in mwes:
            curlevel = root
            for word in self._reorder(mwe['constituents']):
                if word not in curlevel.children:
                    curlevel.children[word] = OrderedTrieNode()
                curlevel = curlevel.children[word]

            curlevel.lemmas.append(mwe['lemma'])

        return root

Using this sorted constituent trie approach, it takes only 0.5 seconds on average to process 1,000 sentences, which is about a 40% speedup over the aforementioned trie. The average time for each of the three methods can be seen in the graph below (log scale).

Average time by method

Moving from the naive approach to using a trie is arguably a fairly obvious optimization; I think the interesting part is the further speedup we get from using word frequency to inform trie construction. More importantly, it's a good demonstration of how much it can help to have a good understanding of the data and domain you are trying to process: this further speedup was only made possible by thinking about what the distribution of the input data (words in English sentences) looks like.
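For reference, timings like the ones above could be collected with a loop along these lines; this is only a sketch, and get_sentences() and load_word_freqs() are hypothetical helpers standing in for however you load test sentences and word frequency ranks.

import time


def average_seconds(searcher, sentences, runs=5):
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        for sentence in sentences:
            searcher.search(sentence)
        total += time.perf_counter() - start
    return total / runs


sentences = get_sentences()          # hypothetical: 1,000 tokenized sentences
word_freqs = load_word_freqs()       # hypothetical: word -> frequency rank
for impl in (NaiveApproach(), Trie(), OrderedTrie(word_freqs)):
    print(type(impl).__name__, average_seconds(impl, sentences))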

Mapping retrieved possible MWEs to candidate word groups

Now that we know how to retrieve our possible MWEs, let's look briefly at step #2: finding every combination of words in the sentence that could constitute a given MWE. For the sentence I ran down the stairs and fell down and the MWE run_down, we start by building simple representations of our sentence as tokens and our MWE as a multiset.

from collections import namedtuple, defaultdict
from itertools import combinations, product

Token = namedtuple("Token", ["form", "idx", "lemma"])
sentence = [
    Token("I", 0, "I"),
    Token("ran", 1, "run"),
    Token("down", 2, "down"),
    Token("the", 3, "the"),
    Token("stairs", 4, "stairs"),
    Token("and", 5, "and"),
    Token("fell", 6, "fall"),
    Token("down", 7, "down"),
    Token(".", 8, "."),
]

# build a map of lemmas to tokens
# so we can look up tokens by their lemma
lemma_to_tokens = defaultdict(list)
for t in sentence:
    lemma_to_tokens[t.lemma].append(t)

# mwe: "run_down"
lemma_counter = {
    "run": 1,
    "down": 1,
}

The next part is confusing to look at, but what we're doing isn't actually that complicated. We represent the token choices for each lemma in the MWE as lists of tuples, and want to gather all possible options. This is just N choose K for each lemma, where N is the number of times the given lemma appears in the sentence and K is the number of times it appears in the MWE. These tuples will usually contain only one element, except in MWEs that have repeated constituents, such as face in face_to_face.

candidate_word_combos = [
    list(combinations(lemma_to_tokens[lemma], lemma_counter[lemma]))
    for lemma in lemma_counter
]

Running this on our example input gives us:

[
    [
        (Token(form='ran', idx=1, lemma='run'),)
    ], 
    [
        (Token(form='down', idx=2, lemma='down'),), 
        (Token(form='down', idx=7, lemma='down'),)
    ]
]

Finally, we take the cartesian product of these lists of tuples and unpack the tuples. Because each tuple represents one way of choosing tokens for a given lemma, this is effectively looking at all combinations of ways to choose words for each lemma, which gives us our original objective - every combination of words that could constitute this MWE. To finish, we sort the results to make sure that the resulting tokens are in order.

mwe_combinations = {
    tuple(x for y in p for x in y) 
    for p in product(*candidate_word_combos)
}

sorted_mwe_combinations = [
    sorted(raw_combo, key=lambda t: t.idx) 
    for raw_combo in mwe_combinations
]

The final result:

[
    [
        Token(form='ran', idx=1, lemma='run'), 
        Token(form='down', idx=2, lemma='down')
    ], 
    [
        Token(form='ran', idx=1, lemma='run'), 
        Token(form='down', idx=7, lemma='down')
    ]
]



1

Note that while the trie-based approach runs much faster on average, its theoretical worst case runtime is the same as the naive approach. However, getting anywhere near this upper bound with the trie would require a sentence containing most or all of the MWEs in the lexicon, which is not realistic.

My article for Seiken News
2021-05-09

The Institute of Industrial Science, where I am currently doing research, publishes a newsletter called Seiken News every two months. I had the opportunity to write an article for the April issue's "PROMENADE" section, which is reserved for researchers from abroad. This was my first time writing anything substantial in Japanese, so I thought it was worth holding on to. The article primarily discusses my arrival in Japan and my life since then, but if you're interested, you can read it in "Seiken News #189" at the link provided below.

Incidentally, the proverb written in both Japanese and Yiddish at the beginning of the article is one I originally learned in English as "Man makes plans, and God laughs" from my parents. Only while researching this article did I discover that it is originally a Yiddish proverb. I of course knew that my ancestors were Eastern European Jews, but I was still surprised to find that a saying from the English I normally use is a direct translation from Yiddish. It made me feel a bit closer to my roots.

You can view the article here