Blind man with a math degree: December 2023

In this blogpost, I will describe a formalism for discussing how a given language treats numbers. Let me begin with a warning: this post is a slapdash, amateur attempt to summarize some essential material from linguistics.

The goal of this post is to develop some language for discussing how numbers are treated in the language spoken by the native people of Santa Cruz, California in the early 1800s. Recall that the previous blogpost concluded with a brief look at the information available about that language. The record was a short list of vocabulary words, eleven of which were words for numbers, that was recorded in 1878 (approximately three generations after contact with Europeans).

A second source is an April 1860 article published in the California Farmer and Journal of Useful Science. The relevant portion is the vocabulary list displayed below.

Vocabulary list from the California Farmer and Journal of Useful Science
From the California Digital Newspaper Collection

How can we analyze a list like this? We could directly compare this list to other lists of numbers, but this ignores some important structure. Many of the numbers appear to be built out of others. The word for "thirteen" is "capan-üsh" which appears to be formed by appending "üsh" to the word for "three," which is "capan." The words for all of the teens appear to be formed in a similar manner.

This structure is similar to the structure of words for numerals in English. The English word for 21, "twenty-one," is formed by concatenating the words "twenty" (for 20) and "one" (for 1). Furthermore, "twenty" is built by modifying "two" to "twen-" and then concatenating this with "-ty."

This type of recursive structure can be conveniently described using formal language theory. The idea is best introduced with an example. Consider the palindromes in the letters "x" and "y." We can generate all palindromes by starting with the symbol "S" (for "Start") and repeatedly applying the following rules until the "S" has been replaced by a word in x and y:

S --> x
S --> y
S --> xSx
S --> ySy
S --> xx
S --> yy

The idea is formalized as follows. A formal language L is a subset of the set of words over a finite set ∑, called the alphabet. In the example of palindromes, ∑ = { x, y } and L = { x, y, xx, yy, xxx, xyx, yxy, yyy, ... }.
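The palindrome example can be checked with a short Python sketch. This is only an illustration: the function names are made up for this post, and the rule list in the code includes S --> xx and S --> yy, which are what make even-length palindromes such as "xx" reachable. The sketch generates every word up to a given length and compares the result with a brute-force list of palindromes.

```python
from itertools import product

# Right-hand sides of the production rules for the start symbol S.
# The last two entries are what allow even-length palindromes such as "xx".
RULES = ["x", "y", "xSx", "ySy", "xx", "yy"]


def generate(max_len):
    """Return every terminal word of length <= max_len derivable from S."""
    words = set()
    frontier = ["S"]
    while frontier:
        form = frontier.pop()
        if "S" not in form:
            words.add(form)
            continue
        for rhs in RULES:
            new_form = form.replace("S", rhs, 1)
            # No rule shortens a form, so forms that are already too long
            # can never lead to a word of length <= max_len.
            if len(new_form) <= max_len:
                frontier.append(new_form)
    return words


def palindromes(max_len):
    """Brute-force set of the non-empty palindromes in x and y."""
    return {
        "".join(letters)
        for n in range(1, max_len + 1)
        for letters in product("xy", repeat=n)
        if letters == letters[::-1]
    }


print(sorted(generate(3), key=lambda w: (len(w), w)))
# ['x', 'y', 'xx', 'yy', 'xxx', 'xyx', 'yxy', 'yyy']
print(generate(5) == palindromes(5))  # True
```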

We are interested in languages generated by repeatedly applying simple rules like "S --> x." Thus we define a formal grammar G for the alphabet ∑ to be a tuple consisting of (1) a finite set N, called the non-terminal symbols, that is disjoint from the alphabet ∑, (2) a finite set M, called the terminal symbols, whose elements belong to the alphabet ∑, (3) a distinguished non-terminal symbol S (for start), and (4) a finite collection of production rules. A production rule is a rewriting rule of the form

(words in N union ∑) (N) (words in N union ∑) --> (words in N union ∑).

The language generated by G is the set of all words constructed by starting with the start symbol S and then repeatedly applying production rules until all non-terminal symbols have been removed.

A slightly more complicated example is the following:

N = { SENTENCE, NOUN-PHRASE, VERB-PHRASE, VERB, DETERMINER, NOUN }

and

∑ = { all valid English words }.

The distinguished non-terminal symbol is SENTENCE, and the production rules are:

SENTENCE --> NOUN-PHRASE VERB-PHRASE
NOUN-PHRASE --> DETERMINER NOUN
VERB-PHRASE --> VERB
VERB-PHRASE --> VERB NOUN-PHRASE
as well as production rules involving terminal symbols:

NOUN --> people
NOUN --> world
NOUN --> artwork
VERB --> earns
VERB --> gone
VERB --> came
VERB --> had
DETERMINER --> the
DETERMINER --> a

By repeatedly applying the production rules, we get:

SENTENCE
NOUN-PHRASE VERB-PHRASE
DETERMINER NOUN VERB-PHRASE
the NOUN VERB-PHRASE
the people VERB-PHRASE
the people VERB NOUN-PHRASE
the people had NOUN-PHRASE
the people had DETERMINER NOUN
the people had the NOUN
the people had the artwork

The idea is that a formal grammar can help us understand how speakers create well-formed sentences. The formal grammar just handles the manner in which sentences are subdivided into phrases. The meaning of a sentence is ignored. In the last example, we could also form the sentence "a world had the people," which seems nonsensical. The grammar that we have exhibited also produces sentences in which the subject and verb do not agree. An example of such a sentence is "the people earns the artwork" (instead of the correct "the people earn the artwork").
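Because this toy grammar has no recursion, it is small enough to enumerate completely. Here is a rough Python sketch that does so; the dictionary layout and the function name `expand` are my own choices, not standard notation. The verb list includes "had," which the sample derivation relies on.

```python
from itertools import product

# The toy grammar, one entry per non-terminal symbol.
# "had" is included among the verbs because the sample derivation uses it.
GRAMMAR = {
    "SENTENCE": [["NOUN-PHRASE", "VERB-PHRASE"]],
    "NOUN-PHRASE": [["DETERMINER", "NOUN"]],
    "VERB-PHRASE": [["VERB"], ["VERB", "NOUN-PHRASE"]],
    "NOUN": [["people"], ["world"], ["artwork"]],
    "VERB": [["earns"], ["gone"], ["came"], ["had"]],
    "DETERMINER": [["the"], ["a"]],
}


def expand(symbol):
    """Return all word sequences (as tuples) derivable from symbol."""
    if symbol not in GRAMMAR:  # a terminal symbol, i.e. an English word
        return [(symbol,)]
    results = []
    for rhs in GRAMMAR[symbol]:
        # Expand each symbol on the right-hand side, then combine the pieces.
        for parts in product(*(expand(s) for s in rhs)):
            results.append(tuple(word for part in parts for word in part))
    return results


sentences = {" ".join(words) for words in expand("SENTENCE")}
print(len(sentences))                                 # 168
print("the people had the artwork" in sentences)      # True
print("a world had the people" in sentences)          # True
print("the people earns the artwork" in sentences)    # True
```

The grammar happily produces the nonsensical and the ungrammatical sentences alongside the sensible one, which is exactly the limitation described above.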

In general, it is difficult to construct a formal grammar that fully captures the sentences of a spoken language. Formal grammars work well for describing just the numerical system of a spoken language because the numerical system is much simpler and often strongly exhibits the type of recursive structure that formal grammars capture.

The two examples of formal grammars that we have looked at are context-free grammars. These are grammars where each production rule has the form "Non-terminal Symbol --> word." An example of a grammar that is not context-free is:

A formal grammar
From Wikipedia

An example of a word in this language is "aabbcc" which can be produced as

S
a S B C
a a B C B C
a a B C Z C
a a B W Z C
a a B W C C
a a B B C C
a a b B C C
a a b b C C
a a b b c C
a a b b c c

This grammar generates the language L of words like "aabbcc," which consist of the same number of the letters "a," "b," and "c," arranged in alphabetical order. The grammar fails to be context-free because some production rules have more than one symbol on the left-hand side. Consider the non-terminal symbol "C." The rules show that it can be replaced by "c" if it is preceded by a "b" or by a "c." Thus the production rules involving "C" depend on the context in which "C" appears. A now standard result in formal language theory (the pumping lemma) shows that the language generated by this grammar cannot be generated by any context-free grammar.
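To experiment with this grammar, here is a Python sketch that applies the rewriting rules mechanically. The rule list is my reconstruction from the steps of the derivation of "aabbcc" above (the figure is not reproduced in the code), so treat it as an assumption rather than a verbatim copy. The function names are made up for this post.

```python
# Rewriting rules read off from the derivation of "aabbcc".
# This list is a reconstruction, not a verbatim copy of the figure.
RULES = [
    ("S", "aSBC"), ("S", "aBC"),
    ("CB", "CZ"), ("CZ", "WZ"), ("WZ", "WC"), ("WC", "BC"),
    ("aB", "ab"), ("bB", "bb"), ("bC", "bc"), ("cC", "cc"),
]


def successors(form):
    """All forms reachable from form by applying one rule at one position."""
    out = set()
    for lhs, rhs in RULES:
        start = form.find(lhs)
        while start != -1:
            out.add(form[:start] + rhs + form[start + len(lhs):])
            start = form.find(lhs, start + 1)
    return out


def language(max_len):
    """Terminal words of length <= max_len that can be derived from S."""
    seen, frontier, words = {"S"}, ["S"], set()
    while frontier:
        form = frontier.pop()
        if form.islower():  # only the terminal letters a, b, c remain
            words.add(form)
            continue
        for nxt in successors(form):
            # No rule shortens a form, so longer forms can be discarded.
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return words


print("aaBCZC" in successors("aaBCBC"))  # True: the step that uses CB --> CZ
print(sorted(language(9), key=len))      # ['abc', 'aabbcc', 'aaabbbccc']
```

Note that a rule such as CB --> CZ only fires when "C" is followed by "B," which is the dependence on context that a context-free rule cannot express.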

What about the languages of interest: the well-formed words for numbers in a spoken language such as English? We need to take some care in how we formulate this question. In all spoken languages that have been studied, only finitely many numbers can be described. In the list of Ohlone words that we studied at the beginning of this post, the largest number is 100. The website dictionary.com has words for all numbers smaller than 10^36 (a "decillion" is 10^33) but no word for 10^36.

We can describe the numbers in the Ohlone language by writing down one production rule for each of the twenty-five words:

S --> impeach
S --> uthin
S --> caphan
etc.

We can do the same thing in English, only now we need to write down 10^36-1 rules. This method of describing the language is just an overly complicated way of writing down all the numbers.

A more interesting context-free grammar for the numbers in English is given by production rules that are listed below as I.6 through I.17.

The first part of a context-free grammar for numbers in English
From "Grammars for Number Names"

This context-free grammar produces the words for numbers up to about 10^34. However, the grammar fails to reflect some important features of the language. For example, the words "twenty," "thirty," ..., "ninety" all appear to be produced by combining the words "two," "three," ..., "nine" with the suffix "-ty." The words "thirteen," "fourteen," ..., "nineteen" are produced in a similar way using the suffix "-teen." This structure is not reflected in the formal grammar.
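To make the trade-off concrete, here is a hypothetical miniature grammar for the English number names below one hundred, written as a Python sketch. It is not the grammar I.6 through I.17 from the paper; the non-terminal names NUMBER, DIGIT, TEEN, and TENS are made up for this illustration. It needs far fewer rules than one rule per number, but it still treats words like "thirteen" and "thirty" as unanalyzed terminal symbols.

```python
# A hypothetical miniature grammar for English number names from 1 to 99.
GRAMMAR = {
    "NUMBER": [["DIGIT"], ["TEEN"], ["TENS"], ["TENS", "-", "DIGIT"]],
    "DIGIT": [[w] for w in
              "one two three four five six seven eight nine".split()],
    "TEEN": [[w] for w in
             ("ten eleven twelve thirteen fourteen fifteen "
              "sixteen seventeen eighteen nineteen").split()],
    "TENS": [[w] for w in
             "twenty thirty forty fifty sixty seventy eighty ninety".split()],
}


def expand(symbol):
    """Return every string of terminal symbols derivable from symbol."""
    if symbol not in GRAMMAR:  # terminal symbol: a word or the hyphen
        return [symbol]
    words = []
    for rhs in GRAMMAR[symbol]:
        partial = [""]
        for s in rhs:
            partial = [p + e for p in partial for e in expand(s)]
        words.extend(partial)
    return words


names = expand("NUMBER")
rule_count = sum(len(alternatives) for alternatives in GRAMMAR.values())
print(rule_count)          # 31
print(len(set(names)))     # 99
print("twenty-one" in names, "thirteen" in names)  # True True
```

Thirty-one rules cover ninety-nine names, but nothing in the rules says that "thirteen" has anything to do with "three."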

The linguist James R. Hurford proposed a more sophisticated model of how numbers are expressed in a language. He splits the process of forming a word for a number into two pieces: the formation of the semantic component (the value being expressed) and the phonological component (the word you say). Both are constructed from a formal grammar that Hurford calls the "phrase structure rules." The formal grammar is displayed below. (The symbol "/" represents the abstract number "1." A sequence of the symbols "/" represents the abstract number obtained by counting how often "/" appears. For example, "///" represents the abstract number "3.")

Hurford's phrase structure rules for English
From The Linguistic Theory of Numerals

(The brackets "{" and "}" indicate that one of the bracketed symbols must be chosen. A pair of parentheses indicates that the symbol can optionally be included.)

The language produced by this formal grammar is not very interesting. It is just all non-empty strings in the terminal symbol "/." However, we get interesting information by keeping track of how a word in the language is generated. This is usually encoded by a binary tree in a natural way. Consider the derivation:

NUMBER
/ NUMBER
/ PHRASE
/ NUMBER M
/ / NUMBER M
/ / / M
/ / / / / / / / / / / / /

Here's a crudely drawn picture of the phrase structure tree:

A phrase structure tree

The tree should be interpreted as follows. The value of each node is determined recursively. The value of a leaf is just the number of "/" marks. Thus, reading from left to right, the leaves in the above example have values "1," "1," "1," and "10." Each node labeled "NUMBER" has value equal to the sum of the values of the nodes directly below it (its "immediate constituents"). The value of a node labeled "PHRASE" is the product of the values of the nodes directly below it. Finally, the value of a node labeled "M" is equal to the value of the second immediate constituent raised to the power of the first immediate constituent. (If there is only one immediate constituent, the node has value equal to "10.")

In the above example, the only node labeled "M" has value "10," and the only node labeled "Phrase" has value "20." Working our way up to the root, we see that the root has value "21." This is the semantic component.
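The evaluation rules can be written as a short recursive function. The following Python sketch is my own encoding, not Hurford's notation: a leaf is a string of "/" marks, an inner node is a tuple whose first entry is the label, and the tree literal is my reading of the derivation above.

```python
def value(node):
    """Evaluate a phrase structure tree given as nested tuples.

    A leaf is a string of "/" marks; an inner node is (label, child, ...).
    """
    if isinstance(node, str):
        return node.count("/")
    label, *children = node
    vals = [value(child) for child in children]
    if label == "NUMBER":      # sum of the immediate constituents
        return sum(vals)
    if label == "PHRASE":      # product of the immediate constituents
        result = 1
        for v in vals:
            result *= v
        return result
    if label == "M":
        if len(vals) == 1:     # a lone constituent: the value is 10
            return 10
        return vals[1] ** vals[0]  # second constituent to the power of the first
    raise ValueError(f"unknown label: {label}")


TEN = "/" * 10

# The tree for the derivation above: / plus (// times ten).
tree = ("NUMBER", "/",
        ("NUMBER",
         ("PHRASE",
          ("NUMBER", "/", ("NUMBER", "/")),
          ("M", TEN))))

print(value(tree))  # 21
```

Reading the intermediate values off the recursion gives 10 for the node labeled "M," 20 for the node labeled "PHRASE," and 21 for the root.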

Hurford then has a separate set of rules for using the phrase structure tree to produce the phonological component (i.e. the word that you say). These are more complicated, so I will omit them for now. One important fact is that there are, in general, many phrase structure trees for the same number. For example, both the example we just looked at and the example below have semantic component equal to "21." The rules produce the phrase "one and twenty" for the first tree and "twenty-one" for the second tree. Most English speakers would say that the second is correct while the first is incorrect (or at least old-fashioned).

Another example of a phrase structure tree

One issue that I am completely glossing over is why Hurford's formalism for describing the linguistic theory of numerals is correct or reasonable. In fact, in his book, Hurford recognizes that other linguists have proposed alternative theories, and the evidence for/against any particular theory was limited at the time he was writing.

For my purposes, what's most useful is that his formalism will help us understand how numbers are expressed in languages, especially how to compare and contrast different languages.

Thursday, December 7, 2023

History of Mathematics and the native people of Santa Cruz, Part 2: A formalism for numerical systems
