In this blogpost, I will describe a formalism for discussing how a given language treats numerals. Let me begin with a warning: this post is a slapdash, amateur attempt to summarize some essential material from linguistics.
The goal of this post is to develop some language for discussing how numbers are treated in the language spoken by the native people of Santa Cruz, California in the early 1800s. Recall that the previous blogpost concluded with a brief look at the information available about this language. The record was a short list of vocabulary words, eleven of which were words for numbers, that was recorded in 1878 (approximately three generations after contact with Europeans).
A second source is an April 1860 article published in the California Farmer and Journal of Useful Science. The relevant portion is the vocabulary list displayed below.
Figure: Vocabulary list from the California Farmer and Journal of Useful Science (from the California Digital Newspaper Collection)
How can we analyze a list like this? We could directly compare this list to other lists of numbers, but this ignores some important structure. Many of the numbers appear to be built out of others. The word for "thirteen" is "capan-üsh" which appears to be formed by appending "üsh" to the word for "three," which is "capan." The words for all of the teens appear to be formed in a similar manner.
This structure is similar to the structure of words for numerals in English. The English word for 21, "twenty-one" is given by concatenating the words "twenty" (for 20) and "one" (for 1). Furthermore, "twenty" is built by modifying "two" to "twen-" and then concatenating this with "-ty."
This type of recursive structure can be conveniently described using formal language theory. The idea is best introduced with an example. Consider the palindromes in the letters "x" and "y." We can generate all palindromes by starting with the symbol "S" (for "Start") and repeatedly applying the following rules until the "S" has been replaced by x's and y's:
S --> x
S --> y
S --> xx
S --> yy
S --> xSx
S --> ySy
The idea is formalized as follows. A formal language L is a subset of the set of words over a finite set ∑, called the alphabet. In the example of palindromes, ∑ = { x, y } and L = { x, y, xx, yy, xxx, xyx, yxy, yyy, ... }.
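To make the palindrome example concrete, here is a minimal Python sketch that applies the rules mechanically and collects every word up to a given length. The rule set in the code is S --> x | y | xx | yy | xSx | ySy; the two-letter rules are what make even-length palindromes such as "xx" reachable. The names `RULES` and `generate` are my own and are only for illustration.

```python
from collections import deque

# Right-hand sides of the production rules for the start symbol "S".
RULES = ["x", "y", "xx", "yy", "xSx", "ySy"]


def generate(max_len):
    """Return every terminal word of length <= max_len derivable from "S"."""
    words = set()
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        if "S" not in form:
            # No non-terminal left: this is a word of the language.
            words.add(form)
            continue
        for rhs in RULES:
            new_form = form.replace("S", rhs, 1)
            # No rule removes letters, so forms whose letters already
            # exceed max_len can never shrink back and are dropped.
            if len(new_form.replace("S", "")) <= max_len:
                queue.append(new_form)
    return sorted(words, key=lambda w: (len(w), w))


print(generate(3))
```

Every word produced this way reads the same forwards and backwards, which can be checked with `w == w[::-1]`.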
We are interested in languages generated by repeatedly applying simple rules like "S --> x." Thus we define a formal grammar G for the alphabet ∑ to be a tuple consisting of (1) a finite set N, called the non-terminal symbols, that is disjoint from the alphabet ∑, (2) a finite set M of elements of the alphabet ∑, called the terminal symbols, (3) a distinguished non-terminal symbol S (for start), and (4) a finite collection of production rules. A production rule is a rewriting rule of the form
(word in N union ∑) (element of N) (word in N union ∑) --> (word in N union ∑).
In other words, the left-hand side is a word that contains at least one non-terminal symbol, and applying the rule replaces an occurrence of the left-hand side with the right-hand side.
The language generated by G is the set of all words constructed by starting with the start symbol S and then repeatedly applying production rules until all non-terminal symbols have been removed.
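The definition can be written down as a small data structure. The following Python sketch is only one possible encoding, and the class name `Grammar` and the method `problems` are hypothetical. It stores the four pieces of the tuple and checks the conditions in the definition: the non-terminals are disjoint from the terminals, the start symbol is a non-terminal, and the left-hand side of every rule contains at least one non-terminal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset  # the set N
    terminals: frozenset     # the terminal symbols, drawn from the alphabet
    start: str               # the distinguished symbol S
    rules: tuple             # pairs (left, right), each side a tuple of symbols

    def problems(self):
        """Return a list of ways in which this tuple fails the definition."""
        found = []
        if self.nonterminals & self.terminals:
            found.append("non-terminals and terminals overlap")
        if self.start not in self.nonterminals:
            found.append("start symbol is not a non-terminal")
        symbols = self.nonterminals | self.terminals
        for left, right in self.rules:
            if not all(s in symbols for s in left + right):
                found.append(f"unknown symbol in {left} --> {right}")
            if not any(s in self.nonterminals for s in left):
                found.append(f"no non-terminal on the left of {left} --> {right}")
        return found


# The palindrome rules S --> x, S --> y, S --> xSx, S --> ySy.
palindrome_grammar = Grammar(
    nonterminals=frozenset({"S"}),
    terminals=frozenset({"x", "y"}),
    start="S",
    rules=(
        (("S",), ("x",)),
        (("S",), ("y",)),
        (("S",), ("x", "S", "x")),
        (("S",), ("y", "S", "y")),
    ),
)

# A deliberately broken example: the rule x --> y rewrites a terminal.
broken_grammar = Grammar(
    nonterminals=frozenset({"S"}),
    terminals=frozenset({"x", "y"}),
    start="S",
    rules=((("x",), ("y",)),),
)

print(palindrome_grammar.problems())
print(broken_grammar.problems())
```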
A slightly more complicated example is the following:
N = { SENTENCE, NOUN-PHRASE, VERB-PHRASE, VERB, DETERMINER, NOUN }
and
∑ = { all valid English words }.
The distinguished non-terminal symbol is SENTENCE, and the production rules are:
SENTENCE --> NOUN-PHRASE VERB-PHRASE
NOUN-PHRASE --> DETERMINER NOUN
VERB-PHRASE --> VERB
VERB-PHRASE --> VERB NOUN-PHRASE
as well as production rules involving terminal symbols
NOUN --> people
NOUN --> world
NOUN --> artwork
VERB --> earns
VERB --> gone
VERB --> came
VERB --> had
DETERMINER --> the
DETERMINER --> a
By repeatedly applying the production rules, we get:
SENTENCE
NOUN-PHRASE VERB-PHRASE
DETERMINER NOUN VERB-PHRASE
the NOUN VERB-PHRASE
the people VERB-PHRASE
the people VERB NOUN-PHRASE
the people had NOUN-PHRASE
the people had DETERMINER NOUN
the people had the NOUN
the people had the artwork
The idea is that a formal grammar can help us understand how speakers create well-formed sentences. The formal grammar just handles the manner in which sentences are subdivided into phrases. The meaning of sentences is ignored. In the last example, we could also form the sentence "a world had the people," which seems nonsensical. The grammar that we have exhibited also produces sentences in which the subject and verb do not agree. An example of such a sentence is "the people earns the artwork" (instead of the correct "the people earn the artwork").
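Because this grammar has no recursion, the full set of sentences it generates is finite and can be listed by machine. The Python sketch below is an illustration under my own naming (`RULES`, `expand`, `sentences`). The verb list includes "had" because the derivation uses it. The sketch confirms that the grammar produces the sensible sentence from the derivation alongside the nonsensical and ungrammatical ones.

```python
from itertools import product

RULES = {
    "SENTENCE": [["NOUN-PHRASE", "VERB-PHRASE"]],
    "NOUN-PHRASE": [["DETERMINER", "NOUN"]],
    "VERB-PHRASE": [["VERB"], ["VERB", "NOUN-PHRASE"]],
    "NOUN": [["people"], ["world"], ["artwork"]],
    # "had" is included because the derivation in the text uses it.
    "VERB": [["earns"], ["gone"], ["came"], ["had"]],
    "DETERMINER": [["the"], ["a"]],
}


def expand(symbol):
    """Return all word sequences (as tuples) derivable from a symbol."""
    if symbol not in RULES:
        # A terminal symbol, that is, an English word.
        return [(symbol,)]
    results = []
    for rhs in RULES[symbol]:
        # Expand each symbol on the right-hand side, then combine choices.
        for parts in product(*(expand(s) for s in rhs)):
            results.append(sum(parts, ()))
    return results


sentences = {" ".join(words) for words in expand("SENTENCE")}
print(len(sentences))
for s in ["the people had the artwork",
          "a world had the people",
          "the people earns the artwork"]:
    print(s, "->", s in sentences)
```

The grammar has no way to tell these three sentences apart, since it only tracks how a sentence is divided into phrases.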
In general, it is difficult to construct a formal grammar that fully captures the sentences of a spoken language. Formal grammars work well for describing just the numerical system of a spoken language because the numerical system is much simpler and often strongly exhibits the type of recursive structure that formal grammars capture.
The two formal grammars that we have looked at are examples of context-free grammars. These are grammars in which each production rule has the form "non-terminal symbol --> word." An example of a grammar that is not context-free is:
Figure: A formal grammar (from Wikipedia)
An example of a word in this language is "aabbcc" which can be produced as
S
a S B C
a a B C B C
a a B C Z C
a a B W Z C
a a B W C C
a a B B C C
a a b B C C
a a b b C C
a a b b c C
a a b b c c
This grammar generates the language L of words like "aaabbbccc" which consist of the same number of the letters "a," "b," and "c," arranged in alphabetical order. The grammar fails to be context-free because some of the production rules have more than one symbol on the left-hand side. Consider the non-terminal symbol "C." The rules show that this can be replaced by "c" if it is preceded by a "b" or by a "c." Thus the production rules involving "C" depend on the context in which "C" appears. A now standard result in formal language theory (the pumping lemma) shows that the language generated by this grammar cannot be generated by any context-free grammar.
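The derivation of "aabbcc" can be replayed by machine. In the Python sketch below, the rule list is an assumption: it is read off from the steps of the derivation itself, with one rule for each kind of rewriting step that appears there. The function names are my own. The sketch checks that each line of the derivation follows from the previous one by a single rule, that the rules are not of context-free form, and which words of length at most 9 can be produced.

```python
# Rules inferred from the derivation steps (an assumption, see the lead-in).
# Upper-case letters are non-terminals, lower-case letters are terminals.
RULES = [
    ("S", "aSBC"), ("S", "aBC"),
    ("CB", "CZ"), ("CZ", "WZ"), ("WZ", "WC"), ("WC", "BC"),
    ("aB", "ab"), ("bB", "bb"), ("bC", "bc"), ("cC", "cc"),
]


def successors(form):
    """All forms reachable from `form` by one rule applied at one position."""
    out = set()
    for left, right in RULES:
        start = form.find(left)
        while start != -1:
            out.add(form[:start] + right + form[start + len(left):])
            start = form.find(left, start + 1)
    return out


def is_context_free(rules):
    """Context-free form: every left-hand side is a single non-terminal."""
    return all(len(left) == 1 and left.isupper() for left, _ in rules)


def words_up_to(max_len):
    """Terminal words of length <= max_len. No rule shortens a form,
    so the search over forms of bounded length is finite."""
    seen, stack, words = {"S"}, ["S"], set()
    while stack:
        form = stack.pop()
        if form.islower():
            words.add(form)
        for nxt in successors(form):
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return sorted(words, key=len)


DERIVATION = ["S", "aSBC", "aaBCBC", "aaBCZC", "aaBWZC", "aaBWCC",
              "aaBBCC", "aabBCC", "aabbCC", "aabbcC", "aabbcc"]
steps_ok = all(b in successors(a) for a, b in zip(DERIVATION, DERIVATION[1:]))
print(steps_ok)
print(is_context_free(RULES))
print(words_up_to(9))
```

Rules such as "bC --> bc" are the ones that fail the context-free test, since the symbol "b" on the left is exactly the context that the replacement of "C" depends on.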
What about the languages of interest: the well-formed words for numbers in a spoken language such as English? We need to take some care in how we formulate this question. In all spoken languages that have been studied, only finitely many numbers can be described. In the list of Ohlone words that we studied at the beginning of this post, the largest number is 100. The website dictionary.com has words for all numbers smaller than 10^36 (a "decillion" is 10^33) but no word for 10^36.
We can describe the numbers in the Ohlone language by writing down one production rule for each of the twenty-five words:
S --> impeach
S --> uthin
S --> caphan
etc.
We can do the same thing in English, only now we need to write down 10^36-1 rules. This method of describing the numbers is just an overly complicated way of writing down all the numbers.
A more interesting context-free grammar for the numbers in English is given by production rules that are listed below as I.6 through I.17.
Figure: The first part of a context-free grammar for numbers in English (from "Grammars for Number Names")
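To give a feel for how a handful of rules can replace a long list, here is a toy Python sketch for the English number names from 1 to 99. These rules are my own illustrative assumption and are not the rules I.6 through I.17 from "Grammars for Number Names." The grammar is NUMBER --> DIGIT | TEEN | TENS | TENS "-" DIGIT, together with one rule for each of the 27 basic words.

```python
# Basic words, each paired with its value. Every entry corresponds to one
# rule of the form DIGIT --> one, TEEN --> eleven, TENS --> twenty, etc.
DIGIT = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEEN = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
        "fourteen": 14, "fifteen": 15, "sixteen": 16, "seventeen": 17,
        "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def number_names():
    """Expand NUMBER --> DIGIT | TEEN | TENS | TENS "-" DIGIT."""
    names = {}
    names.update(DIGIT)
    names.update(TEEN)
    names.update(TENS)
    # The only rule that combines two words: concatenate and add values.
    for tens_word, tens_value in TENS.items():
        for digit_word, digit_value in DIGIT.items():
            names[f"{tens_word}-{digit_word}"] = tens_value + digit_value
    return names


names = number_names()
print(len(names))
print(names["twenty-one"])
```

In this sketch, 27 word rules and 4 combining rules cover 99 numbers, whereas the one-rule-per-number approach would need 99 rules.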
Figure: Hurford's phrase structure rules for English (from The Linguistic Theory of Numerals)
Figure: Another example of a phrase structure tree, with nodes labeled NUMBER, PHRASE, and M