I think this might be a first for this site. Yes, we have a lot of intelligent talk over here about a variety of different subjects, but I don't think anyone has ever tried to legitimately teach a course here, or at least offer the kind of material that would be used in such a thing.

But you've interpreted the title correctly if you guessed that I plan to give a crash course in basic linguistics here.

What Is Linguistics?

Linguistics is not just the learning of many languages, although many linguists do end up polyglots. Linguistics is the scientific study of language and all the processes behind it, and it can be a very broad subject. Most of what I'm going to put in this thread consists of fundamentals that one starts with when first venturing into linguistics. We won't be talking about things like Optimality Theory or Transformationalist Minimalism or anything hardcore-theoretical like that; more like how to analyse and describe any language. The five areas I'll focus on are:

- Phonetics (properties of speech sounds). More specifically, I'll be focussing on articulatory phonetics, which pertains to the production of speech sounds within the vocal tract.
- Phonology (how speech sounds pattern).
- Morphology (word formation).
- Syntax (sentence formation).
- Typology (cross-linguistic language tendencies).

Of course, if you were to go deeper, not only would you see subdivisions in these (such as auditory phonetics, acoustic phonetics, generative phonology, generative grammar vs. functionalist grammar, etc.) but there are also other areas of linguistics.

- Sociolinguistics (the interaction between language and culture)
- Psycholinguistics (the interaction between language and other cognitive processes)
- Semantics (I think you can figure this one out)
- Pragmatics (language in usage)
- Discourse (the structuring of language beyond the level of the sentence, this and pragmatics often go together)
- Applied Linguistics (how to put the theory into practice; this would include things like principles of translation, principles of literacy, application of phonetics and phonology for speech-pathological purposes, application of auditory phonetics for audiology and hearing, etc. For Christians this also includes things like hermeneutics - the science of interpretation - which ties in very closely to things like semantics, pragmatics, and even sociolinguistics)
- Historical and Comparative Linguistics (I think you can figure this one out as well; a lot of the basic analytical methods are actually rooted in phonology, since it is regular sound correspondences that tip one off to common parentage of two words from separate languages. I think this is my favourite part of linguistics overall!)

I'll start with phonetics, but I need to scrounge up some sound files before I do. You can't really talk about individual sounds without having sound files to link them to!


Phonetics is, as I said, the study of the properties of speech sounds. Basically, you study the final product, looking at the different speech sounds individually to discern how they are produced, and when not done just for curiosity, it is done with the purpose of eventually being able to mimic the sounds. That's what articulatory phonetics in particular is good for, and that's what will be included here.

First, let's lay down some groundwork.

The International Phonetic Alphabet

This is what linguists use to write words down, even when there's no written form of the language. It's been in development for well over a century now, having first been published back in 1888 and having had several changes made to it since. I'm going to have to use the nocode tag a lot in the phonetics section as well, because in IPA, phonetic transcriptions are written using square brackets. All 26 letters of the Roman alphabet are in it, although they don't all sound like they would in English. The ones that sound exactly the same in every scenario are [b], [d], [f], [g], [h], [l], [m], [n], [s], [v], [w], and [z]. Ones that sound like English in certain contexts are [k], [p], and [t]; ones that sound close to English are [a], [o], and [u]; and ones that sound nothing at all like their English spelling counterparts are [c], [e], [i], [j], [q], [r], [x], and [y].
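As a quick sanity check on the four groups of letters just listed, here is a minimal Python sketch. The group labels and the function name are my own illustrative choices, not IPA terminology; the letters themselves are copied from the paragraph above. It simply confirms that the groups account for every Roman letter exactly once.

```python
import string

# Groupings copied from the paragraph above. The dictionary keys are
# illustrative labels only, not official IPA terminology.
IPA_LETTER_GROUPS = {
    "same_as_english": "bdfghlmnsvwz",
    "same_in_some_contexts": "kpt",
    "close_to_english": "aou",
    "unlike_english": "ceijqrxy",
}


def check_partition(groups):
    """Return True if every Roman letter appears in exactly one group."""
    all_letters = "".join(groups.values())
    no_duplicates = len(all_letters) == len(set(all_letters))
    covers_alphabet = set(all_letters) == set(string.ascii_lowercase)
    return no_duplicates and covers_alphabet


print(check_partition(IPA_LETTER_GROUPS))
```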

Just to get your feet wet, here's a standard IPA chart. If you don't understand everything, don't worry; this will all be revealed in time.

[Image: screen-shot-2015-02-04-at-11-04-40.png]

Consonants: places of articulation, and the structure of the vocal tract

This is one of a few things I'm going to add that isn't straight from memory (the history of the IPA is another such thing) - it's a vertical cross-section of the vocal tract. I can't draw worth crap, so here it is. ;)

[Image: phoneticsvocaltract.jpg]

These days, when first explaining places of articulation, phonetics profs have a strong tendency to start with the lips and go down the tract. This is evident on standard IPA charts as well. While classifying places of articulation can sometimes get incredibly precise to the point of being nitpicky, there's a more basic set of classifications that phoneticians will use in most cases:

Bilabial - using both lips to produce the sound.
Labiodental - upper teeth against lower lip to produce sound.
Linguolabial - tongue-tip against upper lip to produce sound.
Dental - tongue tip against the back of the upper teeth, or between the teeth (interdental).
Alveolar - you know that big bump behind your teeth, centred in your mouth? That's the alveolar ridge.
Postalveolar - behind the alveolar ridge. There are three subtypes of these:
- Palato-Alveolar; the body of the tongue is right behind the alveolar ridge
- Alveolo-Palatal; the body of the tongue approaches the alveolar ridge and the back of the tongue is up against or close to the hard palate; the tongue tip is lowered somewhat
- Retroflex - same place as palato-alveolar, but with the tongue curled back further
Palatal - back of the tongue against or related to the hard palate
Velar - back of the tongue against or related to the soft palate (velum)
Uvular - back of the tongue against or related to the uvula (the punching bag thingy in the back of your throat)
Pharyngeal - in the throat between the uvula and the vocal folds; these are the only sounds that use the root of the tongue.
- Includes epiglottals, which are made with a flap just above the vocal folds.
Glottal - no usage of the tongue at all, completely relating to the vocal folds.

So you ask, "what about nasal sounds?" Those are included in manners of articulation.

- Plosives (oral stops) mean a complete stop is made to the airflow, and the sound is made when that stop is released. Unreleased plosives occur utterance-finally fairly commonly.
- Fricatives are made when the airflow is restricted to the point of friction, causing noise.
- Affricates are a combination of a plosive and a fricative, where a quick plosive followed by a fricative release results in a single sound. Note: these aren't to be confused with aspirated plosives. More on those later.
- Nasals (nasal stops) mean that the airflow is stopped in the mouth but the velum is lowered, allowing air to freely pass through the nose. Pharyngeal and glottal nasals are impossible.
- Approximants form when the airstream is restricted, but not to the point of creating friction. The approximants near the back of the mouth more or less always have an equivalent vowel.
- Flaps require rapid contact from and release from a particular protruding surface and as such cannot be produced in certain places of articulation; palatal, velar, purely pharyngeal (that is, with no epiglottis), and glottal flaps are impossible.
- Trills are basically extended sequences of flaps. Rather than just quick contact, the tongue is held close enough to the point of contact that the airstream causes the tongue to make rapid on-off contact with the surface. Again, in certain contexts they are impossible, and trying to do it will merely result in a fricative or an approximant.
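The place and manner lists above amount to a small grid with a few holes in it. Here is a rough Python sketch of that idea. The names PLACES, IMPOSSIBLE, and is_possible are made up for illustration, and only the gaps stated outright in the text are recorded (the trill gaps are not spelled out, so they are left out).

```python
# Places of articulation, front to back, as in the list above.
PLACES = [
    "bilabial", "labiodental", "linguolabial", "dental", "alveolar",
    "postalveolar", "palatal", "velar", "uvular", "pharyngeal", "glottal",
]

# Only the gaps stated explicitly in the text are recorded here.
# "pharyngeal" means purely pharyngeal; epiglottals are not modelled.
IMPOSSIBLE = {
    "nasal": {"pharyngeal", "glottal"},
    "flap": {"palatal", "velar", "pharyngeal", "glottal"},
}


def is_possible(place, manner):
    """Return False for place/manner pairs the text calls impossible."""
    if place not in PLACES:
        raise ValueError(f"unknown place of articulation: {place}")
    return place not in IMPOSSIBLE.get(manner, set())


print(is_possible("uvular", "nasal"))
print(is_possible("glottal", "nasal"))
print(is_possible("velar", "flap"))
```

Any manner without an entry in IMPOSSIBLE is treated as possible everywhere, which is a simplification rather than a claim about real languages.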

There are extensions of these basic manners as well.

- When a sound is lateral, it means that the airstream goes around the sides of the tongue. Fricatives, affricates, approximants, and flaps can be lateral.
- Sibilants (or stridents, or grooved fricatives) are formed when the air is pushed towards the sharp edge of the teeth via a groove in the tongue, resulting in an extra noticeable layer of noise. Only alveolar and post-alveolar sounds can be strident, and stridence is only noticeable in fricatives and affricates. It should be noted, too, that the sibilant/non-sibilant difference never means sound contrast. I'll talk more about that when I get to phonology.

Airstream mechanics - voicing and "non-pulmonic" consonants

The last basic is voicing. Ever hear a [s] and a [z] and think "Hey, these are really similar! But how are they different?" The answer is voicing. Now the most basic types of voicing are voiced and voiceless - either the vocal folds are held apart when the sound is made (voiceless) or they are vibrating (voiced). But the bigger picture is actually more complicated than that, since the vocal folds don't work in a simple on/off fashion: voicing ranges from breathy voice, which has almost no vibration of the vocal folds, to creaky voice, in which a stiffening of the cartilage in the larynx causes an almost complete blockage. I won't go into too great a detail here, as that would be beyond the scope of a basic linguistics class.

There is also aspiration, which is considered part of voicing. This happens when the glottis opens briefly after (or sometimes before) a plosive, causing an extra bit of air to be released. English has aspiration. I'll talk more about that when I get to phonology.

What has been brought up so far is the range of pulmonic consonants, where the source of the final airstream is the lungs. Sure, the air almost always comes from the lungs, but sometimes the airstream is held up en route and therefore the final airstream source is either partially or fully from elsewhere. These are the "non-pulmonic sounds."

The most common kind of non-pulmonic sound is called an ejective. These are formed when the glottis shuts before a plosive then opens during its release, causing the air to rush out and producing a "heavy-hitting" variation of a plosive, affricate, or in rarer cases even a fricative. Ejectives are always voiceless. Rather than having a pulmonic airstream, these are said to have a glottalic airstream because the glottis is the final source of the air. Ejectives are widespread in terms of the languages that have them, but few major languages do, and they are noticeably absent from European languages, if you don't count the Caucasus.

Implosives are always stops, and they're what I was referring to when I said "partially from elsewhere." They have an ingressive glottalic airstream - meaning that air rushes inward at the opening of the glottis rather than outward - caused by the larynx being lowered during production as the glottis is closed, but there is also an egressive pulmonic airstream. The odd combination results in something of a gulping sound, and they are regularly voiced. While they do occur in languages elsewhere (a particularly noteworthy case of this is Sindhi, a major language of southern Pakistan), they are most commonly found in sub-Saharan Africa.

And then there are clicks, which have a lingual airstream. This is what I mean when I say the air almost always initially comes from the lungs. It doesn't with clicks. Instead, the back of the tongue seals off the airflow at the soft palate while a second closure is made further forward in the mouth. Lowering the body of the tongue between the two closures creates a partial vacuum, and the sound comes from the release of the forward closure. These are generally somewhat loud, although dental clicks aren't. Languages that use these in everyday meaningful speech are exclusive to Africa and almost exclusive to the southern quarter of the continent; there was one attested language in Australia with clicks, but it is a) a created register and b) extinct. Clicks can be heard outside of the realm of actual words in our culture, as dental and alveolar lateral clicks are used to call animals, and dental clicks are used to show pity or disapproval (this was eventually written down as tsk-tsk).
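To keep the airstream types straight, here is a small Python summary of this section. The class and dictionary names are my own, and "ingressive" for clicks is my label for the vacuum release rather than wording from this section; treat it as a study aid, not a formal model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Airstream:
    initiator: str  # where the final airstream is set in motion
    direction: str  # "egressive" (outward) or "ingressive" (inward)


# One entry per sound type discussed in this section.
AIRSTREAMS = {
    "pulmonic consonant": [Airstream("pulmonic", "egressive")],
    "ejective": [Airstream("glottalic", "egressive")],
    # Implosives combine two airstreams, as described above.
    "implosive": [
        Airstream("glottalic", "ingressive"),
        Airstream("pulmonic", "egressive"),
    ],
    "click": [Airstream("lingual", "ingressive")],
}


def describe(sound_type):
    """Return a readable description of the airstream(s) for a sound type."""
    parts = [f"{a.direction} {a.initiator}" for a in AIRSTREAMS[sound_type]]
    return " + ".join(parts)


for sound_type in AIRSTREAMS:
    print(f"{sound_type}: {describe(sound_type)}")
```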

Secondary articulations and double articulations

Occasionally you'll get times when there is a secondary element to an articulation. The sound [w] (spelt with the same letter in English and a number of other languages) is probably the most common example of this, as it has a primary velar articulation with a secondary labial articulation - the lips are rounded. French has a similar sound phonetically, with a primary palatal articulation and a secondary labial articulation, denoted by the symbol [ɥ]. These are both examples of "labialisation." There's actually a more common secondary articulation than this (although labialisation is quite common) and that is palatalisation; this is also the name of a very widespread phonological assimilation process. (More on that later.) East Slavic languages (Russian/Ukrainian/Belarusian/Rusyn) are ready examples of this.

But then you have situations where two places of articulation are hit by exactly the same manner at more or less exactly the same time. This is double articulation, where the articulations are equally audible and equally important. Most of these are either plosives or nasals, and the process is most common in West Africa. I cringe at people pronouncing NHL hockey player Kyle Okposo's last name. :p In English our usual tendency is to treat it as two separate consonants, but given how the co-articulated version sounds, I've heard more commentators treat the velar half of that labial-velar stop like it didn't even exist! Examples of languages that have this in their very names are the Liberian language Kpelle and the Nigerian language Igbo.

There is a hard-to-classify double-articulated fricative in Swedish, called the "sje-sound" in Swedish grammar literature and [ɧ] in the IPA.

Naming conventions for consonants

When writing the technical names for consonants (which you will have to do sometimes if you specialise in phonetics or phonology), the convention is generally voicing-place-manner. Sub-manners like "lateral" or "ejective" come between main place and main manner. Secondary articulations and aspiration come between voicing and main place. Like this:

[k] is a voiceless velar plosive. [kʰ] is a voiceless aspirated velar plosive. [l] is a voiced alveolar lateral approximant. [ɫ] is a voiced velarised alveolar lateral approximant. [k͡p] is a voiceless labial-velar plosive, while [kʷ] is a voiceless labialised velar plosive.
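If it helps to see the ordering mechanically, here is a tiny Python sketch that assembles a name in the order used by the examples above. The function name and its parameter names ("modifier" for aspiration or a secondary articulation, "sub_manner" for things like "lateral") are invented for this illustration.

```python
def consonant_name(voicing, place, manner, modifier=None, sub_manner=None):
    """Join the parts in the order: voicing, modifier, place, sub-manner, manner."""
    parts = [voicing, modifier, place, sub_manner, manner]
    # Skip any optional part that was not supplied.
    return " ".join(part for part in parts if part)


print(consonant_name("voiceless", "velar", "plosive"))
print(consonant_name("voiceless", "velar", "plosive", modifier="aspirated"))
print(consonant_name("voiceless", "velar", "plosive", modifier="labialised"))
print(consonant_name("voiceless", "labial-velar", "plosive"))
print(consonant_name("voiceless", "alveolar", "fricative", sub_manner="lateral"))
```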


Vowels are generally voiced sounds produced with next to no tension, in the back half of the vocal tract. All languages have 'em.

There are three features that divide base segmental vowels up in a phonetic sense - tongue height, frontness/backness, and roundedness. The convention for these is obviously different, too. For example, [ɑ] is a low back unrounded vowel, while [y] is the high front rounded vowel, which is found in languages like French and Finnish.

There are two naming conventions for height, though. One can either use high, near-high, high-mid, mid, low-mid, near-low, and low, or close, near-close, close-mid, mid, open-mid, near-open, and open. The IPA prefers the latter in academia, but the former is good for starters or if you want to keep things simpler.
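Since the two sets of height terms line up one-to-one, converting between them is just a lookup. Here is a minimal Python sketch of that, with a vowel-naming helper in the height-backness-rounding order used for [ɑ] and [y] above. All the names in it (HEIGHT_TERMS, vowel_name, the "scheme" parameter) are hypothetical.

```python
# Paired height terms, highest to lowest: (high/low scheme, close/open scheme).
HEIGHT_TERMS = [
    ("high", "close"),
    ("near-high", "near-close"),
    ("high-mid", "close-mid"),
    ("mid", "mid"),
    ("low-mid", "open-mid"),
    ("near-low", "near-open"),
    ("low", "open"),
]

TO_CLOSE_OPEN = dict(HEIGHT_TERMS)
TO_HIGH_LOW = {close_open: high_low for high_low, close_open in HEIGHT_TERMS}


def vowel_name(height, backness, rounding, scheme="high-low"):
    """Name a vowel as height-backness-rounding in the requested scheme."""
    if scheme == "close-open":
        height = TO_CLOSE_OPEN.get(height, height)
    elif scheme == "high-low":
        height = TO_HIGH_LOW.get(height, height)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return f"{height} {backness} {rounding} vowel"


print(vowel_name("low", "back", "unrounded"))
print(vowel_name("low", "back", "unrounded", scheme="close-open"))
print(vowel_name("close", "front", "rounded"))
```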


And now comes the fun part - the sounds themselves! (I couldn't find a consistent sound chart for all of them - you'll have to search the individual sounds on Wikipedia, because most if not all have a sound file with them.)


All languages have plosives. Clean voiceless plosives are the most common, although many of these can also be aspirated.

[p] - voiceless bilabial plosive. Occurs in a large percentage of languages. The English citation form is actually the aspirated [pʰ] but [p] does exist in certain environments (more about this in phonology); [pʰ] and [p] are considered different sounds in many Indo-Aryan languages, the Chinese languages, and Scottish Gaelic, among others. However, Standard Arabic lacks the sound, as do many regional Arabic dialects.

[t̼] - voiceless linguolabial plosive. Incredibly rare; only attested in disordered speech until it was discovered being used contrastively in a group of languages of Vanuatu. (Yes, I looked this one up.)

[t] - voiceless alveolar plosive. Citation form is aspirated in English. Almost every language has it, with Hawaiian being among the very few exceptions. As with [p] and also [k], the aspirated and non-aspirated variants are contrasted in Indo-Aryan languages, the Chinese languages, and Scottish Gaelic, among others. Can vary somewhat in placement, from dental to post-alveolar.

[t͡p] - voiceless labial-alveolar plosive. VERY rare; only decisively attested in one specific language in Papua New Guinea.

[ʈ] - voiceless retroflex plosive. Found primarily contrastively in Asia and the Pacific, especially in South Asia and Australia. Also occurs in Swedish and Norwegian.

[c] - voiceless palatal plosive. Exists in English as a variant of [k] happening before [i] and [e]. Considered a different sound in Hungarian and Albanian among others.

[k] - voiceless velar plosive. Citation form is aspirated in English. VERY common, almost as much so as [t] and more than [p]. Tahitian is a rare counter-example.

[k͡p] - voiceless labial-velar plosive. Fairly rare, occurring mainly in West and Central West Africa. NHL player Kyle Okposo has this sound in his last name, or at least in the original pronunciation thereof - his father is from Nigeria. Also occurs in the language name Kpelle.

[q] - voiceless uvular plosive. Occurs in a number of non-Indo-European languages, which are many but dispersed. Several Turkic languages, Inuktitut, a large number of Salishan languages (if not all of them), all Wakashan languages, many Northeast and Northwest Caucasian languages, some dialects of Arabic and Hebrew (indeed, it's posited that Biblical Hebrew had this sound), and several other indigenous languages of North America have this sound. Iranic languages have it as well, under influence from Arabic.

[q͡ʡ] - voiceless uvular-epiglottal plosive. Supposedly occurs in Somali; could actually just be a clean uvular plosive [q].

[ʡ] - voiceless epiglottal plosive. Quite rare. Occurs in Haida and Archi, and some linguists believe it occurs in Nuu-Chah-Nulth.

[ʔ] - glottal stop. Better understood as an absence of sound. Almost every word in human language that is perceived as beginning in a vowel actually begins with a glottal stop phonetically when at the beginning of an utterance. Many languages use this word-medially or finally as a contrastive sound. Believe it or not, English is among these. "Uh-oh" is transcribed [ʔʌʔɔw] phonetically in Western American/Canadian English.

Voiced plosives are less common than their voiceless counterparts, but still common.

[b] - voiced bilabial plosive. Occurs in numerous languages, including English.

[d̼] - voiced linguolabial plosive. Incredibly rare. Had to look this up, too. Attested in Vanuatu, and the Kakojo dialect of Bijago. And disordered speech.

[d] - voiced alveolar plosive. Occurs in numerous languages, including English. Can vary somewhat in placement, from dental to post-alveolar. Is the only voiced stop to occur in Finnish, as a variant of [t] in certain environments.

[ɖ] - voiced retroflex plosive. Occurs in languages of India, as well as in Swedish and Norwegian.

[ɟ] - voiced palatal plosive. Occurs primarily in Eastern Europe as a contrastive sound, most notably in Hungarian, Albanian, Czech, Slovak, and Latvian.

[g] - voiced velar plosive. Common, but the least frequent of the "common six plosives" ([p, t, k, b, d, g]).

[ɡ͡b] - voiced labial-velar plosive. Rare, primarily found in West and Central West Africa, as in the language names Igbo and Gbe.

[ɢ] - voiced uvular plosive. Quite rare. Attested in Mongolian, some dialects of Arabic (non-contrastive), and Canadian indigenous language Kwak'wala, among some others.


Very few languages lack fricatives phonetically, although some language families lack them contrastively. More on that in phonology.

[ɸ] - voiceless bilabial fricative. Not particularly common in contrast, although African language Ewe has it. But as a variant of other sounds it is surprisingly common, and occurs in Spanish and Japanese in this manner, among others.

[f] - voiceless labiodental fricative. Fairly common. Occurs in most Indo-European languages, including English, French, Italian, German, etc.

[θ̼] - voiceless linguolabial fricative. VERY rare. Only attested in Vanuatu.

[θ] - voiceless interdental fricative. Fairly rare. Occurs contrastively in English, Icelandic, Castilian Spanish, Albanian, Greek, and Bashkort, among others.

[s] - voiceless alveolar (grooved) fricative. Most common fricative. Most languages said to not have this sound are in the Pacific, and include Hawaiian and Maori.

[ɬ] - voiceless alveolar lateral fricative. Found primarily in North America, the Caucasus, and southern Africa, also attested contrastively in Welsh and some languages of East and Southeast Asia.

[ʃ] - voiceless palato-alveolar (grooved) fricative. Quite common. Most Indo-European and Turkic languages, and many indigenous languages of the Americas, have this sound.

[ʂ] - voiceless retroflex (grooved) fricative. Not super-common, but not exactly rare, either. Occurs in languages of India, North Germanic languages, Chinese languages, Polish, and the East Slavic languages (Russian/Ukrainian/Belarusian/Rusyn).

[ꞎ] - voiceless retroflex lateral fricative. Attested only in Toda, a Dravidian language of southern India.

[ɕ] - voiceless alveolo-palatal (grooved) fricative. A bit less common. Occurs contrastively in Polish, Russian, Chinese languages, and some languages of the Caucasus; also attested in Japanese as a variant of [s].

[ç] - voiceless palatal fricative. Rare in contrast. Does, however, occur semi-frequently as a variant of another sound, even in certain English dialects (as a variant of [h]). German, Greek, Dutch, and Finnish are among the languages that have it in this way.

[ʎ̥˔] - voiceless palatal lateral fricative. Attested only in a couple of Afro-Asiatic languages of Central Africa.

[x] - voiceless velar fricative. Quite common. A number of Indo-European languages have this in dialectal inventory; Spanish, Russian, and Mandarin are three particularly major languages that have this sound. Old English had this sound.

[ɧ] - voiceless postalveolo-velar fricative. This one is rare and also somewhat controversial. It is only clearly attested in Swedish as the "sje-sound" which is written "sj." Officially it is considered a co-articulation of [ʃ] and [x], but this is still a point of argument amongst those who study Swedish phonetics and phonology. Supposedly also occurs in the Kölsch dialect in western Germany.

[ʟ̝̊] - voiceless velar lateral fricative. Attested in some languages of the Chimbu-Wahgi branch of the Trans New Guinea family, and also in Northeast Caucasian language Archi.

[χ] - voiceless uvular fricative. A rather harsh sound, not quite as common. Fairly common in indigenous languages of western North America, the Caucasus, the Middle East, and several dialects of German.

[ħ] - voiceless pharyngeal fricative. Fairly rare. Occurs mainly in Semitic languages and languages of the Caucasus, also in some Interior Salish languages.

[h] - voiceless glottal fricative. Occurs in a large variety of languages, English included.

And of course, your voiced fricatives as well:

[β] - voiced bilabial fricative. Is attested as a contrastive sound (most notably in Ewe) but occurs much more frequently as a variant of another sound, as in Spanish, Japanese, and Korean, among others.

[v] - voiced labiodental fricative. Occurs contrastively primarily in Europe, the Middle East, and Siberia, but is attested elsewhere. (I call this the "hard v.")

[ð̼] - voiced linguolabial fricative. VERY rare, attested only in Vanuatu.

[ð] - voiced interdental fricative. Fairly rare. Attested contrastively in English, Icelandic, Arabic, Bashkort, Welsh, and a number of indigenous North American languages. Also a variant in Greek, Spanish, and some dialects of Hebrew.

[z] - voiced alveolar (grooved) fricative. MUCH less common than its voiceless counterpart. Mainly found contrastively in Europe, Africa, and the Middle East.

[ɮ] - voiced alveolar lateral fricative. Quite rare; attested contrastively in Northwest Caucasian languages, Zulu, Xhosa, and a small number of other scattered languages from diverse families.

[ʒ] - voiced palato-alveolar (grooved) fricative. Not particularly common. Found contrastively in many Slavic languages, French, Portuguese, many Na-Dene languages, and various Turkic and Uralic languages. English has it as a variant of [s] or [z].

[ʐ] - voiced retroflex (grooved) fricative. Kinda rare. Occurs in Russian, Polish, Vietnamese, and Mandarin, among others.

[ʑ] - voiced alveolo-palatal (grooved) fricative. Very rare. Occurs in Chinese languages, Polish, Sorbian, and some Northwest Caucasian languages contrastively. Also occurs in some languages as a variant, such as Russian, Portuguese, and Catalan.

[ʝ] - voiced palatal fricative. Incredibly rare in contrast with other sounds - only attested contrastively in Scottish Gaelic and some Berber languages of Saharan Africa. A number of others have this as a variant.

[ɣ] - voiced velar fricative. Surprisingly common. Occurs contrastively in a number of Caucasian languages and languages of the Americas (especially North), and also in Scottish and Irish Gaelic, Arabic, and certain dialects of Hebrew, among others.

[ʟ̝] - voiced velar lateral fricative. Attested in some languages of the Chimbu-Wahgi branch of the Trans New Guinea family, and also in Northeast Caucasian language Archi.

[ʁ] - voiced uvular fricative. Occurs mainly in Europe, the Middle East, and North America; in Europe it is often referred to as the "guttural r" which is used in French, Portuguese, German, many dialects of Dutch, and Danish - in this regard it is also used in Hebrew and Inuktitut. It occurs in Turkic and various Caucasian languages contrastively as well.

[ʕ] - voiced pharyngeal fricative/approximant. Hard to tell in most cases whether there is friction or not since it is so close to the glottis and has no trilling feature. Fairly rare. Occurs contrastively in Arabic, Interior Salish and Wakashan languages (North America), and Caucasian languages. Is the consonantal equivalent of [ɑ].

[ɦ] - voiced (most often breathy-voiced) glottal fricative. Exists in numerous languages, but seldom contrasting with the voiceless [h].


Affricates are usually written with digraphs (that is, two symbols) connected by a tie bar.
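If you ever need to type these yourself, the joining mark in a transcription like [t͡s] is a Unicode combining character, U+0361, placed right after the first symbol. Here is a short Python sketch; the helper name "tie" is my own, while the code point and the standard-library unicodedata module are real.

```python
import unicodedata

# U+0361 COMBINING DOUBLE INVERTED BREVE, which spans two symbols.
TIE_BAR = "\u0361"


def tie(first, second):
    """Join two IPA symbols with a tie bar placed after the first one."""
    return first + TIE_BAR + second


affricate = tie("t", "s")
print(affricate)
# List the Unicode name of each character to show how the string is built.
print([unicodedata.name(ch) for ch in affricate])
```

The same helper covers the double articulations mentioned earlier, for example tie("k", "p"). How well the bar lines up over the two letters depends on the font.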

[p͡ɸ] - voiceless bilabial affricate. VERY rare. Unattested in contrast, and most of the languages in which it is attested - mainly on a dialectal level - are West Germanic (English, German, Dutch)

[p̪͡f] - voiceless labiodental affricate. VERY rare. Most of the languages it is attested in at all are West Germanic (German, Lëtzebuergesch, Bavarian) and it apparently also exists contrastively in Tsonga, one of the official languages of South Africa.

[t͡θ] - voiceless interdental affricate. Quite rare. Exists in a small number of indigenous North American languages - some Coast Salish languages have this sound, as do a few Na-Dene languages of the Northwest Territories in Canada.

[t͡s] - voiceless alveolar affricate. Very common, although non-contrastive in English. Best-known in contrast from German, Italian, Slavic languages, and Chinese languages; also exists in a number of indigenous languages of the Americas, the Caucasus, and southern Africa.

[t͡ɬ] - voiceless alveolar lateral affricate. Somewhat rare. Attested primarily in North American indigenous languages (especially in the west) and southern Africa, but also occurs in Icelandic, and Mexican Spanish (through borrowings from Nahuatl).

[t͡ʃ] - voiceless palato-alveolar affricate. Quite common - this is the English "ch" sound. Fairly widespread. Might actually be more widespread an affricate than [t͡s], which is typologically unusual.

[ʈ͡ʂ] - voiceless retroflex affricate. Rarer. Attested mainly in pockets of languages - Slavic languages (primarily West and East Slavic), Northwest Caucasian languages, and Chinese languages.

[t͡ɕ] - voiceless alveolo-palatal affricate. Very rare in contrast (Chinese languages, Polish, and Serbo-Croatian, for example) but more common as a variant (Russian, Korean, even certain dialects of English).

[c͡ç] - voiceless palatal affricate. Very rare. Languages that use it often have the plosive [c] as a free variant of it. Some Samic languages of Northern Europe have this attested as contrastive, as does Hungarian.

[c͡ʎ̥˔] - voiceless palatal lateral affricate. Only a couple African languages have this sound contrastively.

[k͡x] - voiceless velar affricate. Quite rare. Most commonly seen in contrast in Southern Africa - Tswana for example.

[k͡ʟ̝̊] - voiceless velar lateral affricate. VERY rare. Attested contrastively in Archi and the Laghuu language of Vietnam.

[q͡χ] - voiceless uvular affricate. Very rare. Mainly found contrastively in the Caucasus and in the Pacific Northwest. I call this the "death rattle" sound. :p

[ʡ͡ħ] - voiceless pharyngeal-epiglottal affricate. Apparently attested in Haida, a language of British Columbia and Alaska. Otherwise unattested.

[ʔ͡h] - glottal affricate. Never occurs contrastively. Attested as a variant in Queen's English (aka Received Pronunciation) and certain dialects of Chinese languages.

And the voiced ones.

[b͡β] - voiced bilabial affricate. Unattested in contrast and very rare otherwise. Some British dialects and this one specific African language have it attested as a variant.

[b̪͡v] - voiced labiodental affricate. Very few languages have this attested in contrast, and these are primarily in southern Africa. Occurs in some West Germanic languages as a variant.

[d͡ð] - voiced interdental affricate. VERY rare and only attested as a variant in languages where dental variations of the alveolar plosive [d] or certain dental sound combinations occur.

[d͡z] - voiced alveolar affricate. Not the most common of sounds, but more common than most voiced affricates. Occurs regularly in languages of Eastern Europe - Slavic languages either have it contrastively (most South Slavic languages) or as a variant (East and West Slavic languages); also attested in Albanian, Caucasian languages, Hungarian, Italian, Armenian, and a few Chinese languages and dialects.

[d͡ɮ] - voiced alveolar lateral affricate. VERY rare, and not known to occur contrastively. Attested in Xhosa and Séliš (aka Kalispel-Flathead).

[d͡ʒ] - voiced palato-alveolar affricate. The most common voiced affricate, occurring in a wide range of locations.

[ɖ͡ʐ] - voiced retroflex affricate. Rare. Seen mainly in Slavic languages and languages of China.

[d͡ʑ] - voiced alveolo-palatal affricate. Rare, and usually seen as a variant. Contrastive in some Chinese languages, Polish, and Serbo-Croatian.

[ɟ͡ʝ] - voiced palatal affricate. VERY rare. Hungarian and Samic languages have it in free variation with the plosive [ɟ]. Some dialects of Albanian have it as well.

[ɡ͡ɣ] - voiced velar affricate. VERY rare and unattested contrastively.

[ɡ͡ʟ̝] - voiced velar lateral affricate. Attested contrastively only in Laghuu in Vietnam, and Hiw in Vanuatu.

[ɢ͡ʁ] - voiced uvular affricate. Unattested but considered possible.

[ʡ͡ʕ] - voiced pharyngeal-epiglottal affricate. Unattested but considered possible. How, I really don't know. :lol: I blame Eric.


Nasals are typically voiced.

[m] - voiced bilabial nasal. Incredibly common, with very few languages (among them the ironically named Mohawk - in its own language it is Kanien’kéha; Rotokas also lacks it) not having it.

[ɱ] - voiced labiodental nasal. Almost nonexistent contrastively, but any language with both [m], and [f] and/or [v], WILL have this as a variant. Very hard to tell apart from [m] to my ears!

[n̼] - voiced linguolabial nasal. You guessed it. VANUATU! :lol:

[n] - voiced alveolar nasal. Incredibly common, with very few languages (among them Samoan and Rotokas) lacking it.

[n͡m] - voiced labial-alveolar nasal. VERY rare; only attested in one language of Papua New Guinea.

[ɳ] - voiced retroflex nasal. For the most part, limited to India, Australia, and Scandinavia. Vietnamese also has it as a variant.

[ɲ] - voiced palatal nasal. Surprisingly common, especially in Europe. A number of languages have this contrastively (most Romance languages, Hungarian, Czech, Slovak, most South Slavic languages, Irish Gaelic, Albanian) while others (including English) have this as a variant.

[ŋ] - voiced velar nasal. Not hugely common but not super-rare, either. Occurs frequently as a variant of [n] before velar plosives/fricatives/affricates. In languages where it occurs contrastively it is sometimes consigned to the part of a syllable after the vowel (as in languages like English; this is more common amongst Indo-European languages with the sound) but sometimes not (certain African languages, Samoyedic languages, some Caucasian languages, Chinese languages, several languages of the Philippines).

[ŋ͡m] - voiced labial-velar nasal. Rare, and primarily occurs in West and Central West Africa.

[ɴ] - voiced uvular nasal. Rarer, and usually occurs as a variant of [n] or [ŋ] preceding another uvular. Attested as contrastive in some Eskaleut languages (Inuktitut, Kalaallisut) and one language of Papua New Guinea.

(remember, anything beyond uvular is impossible for a nasal)

Voiceless nasals are usually variants, but not always. On rare occasion, they are contrastive. The Hmong-Mien language family is known for its voiceless nasals - heck, there's one even in the name of the family!

[m̥] - voiceless bilabial nasal. Occurs contrastively in the Hmong-Mien family, Burmese, Yupik, Shixing, Kildin Sami, and Mazatecan languages, and a few others.

[n̥] - voiceless alveolar nasal. Occurs contrastively in the Hmong-Mien family, Burmese, Yupik, Shixing, Kildin Sami, and Mazatecan languages, and a few others.

[ɳ̊] - voiceless retroflex nasal. Only occurs contrastively in one language of New Caledonia.

[ɲ̊] - voiceless palatal nasal. Occurs contrastively in the Hmong language, Burmese, Shixing, and the Mazatecan languages.

[ŋ̊] - voiceless velar nasal. Occurs contrastively in Burmese, Yupik, Shixing, and Mazatecan languages.
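
The voiceless symbols above are built from the voiced ones by adding the IPA voicelessness ring: below the letter by default, and above it when the letter has a descender that would collide with it (as in [ɳ̊], [ɲ̊], and [ŋ̊]). Here is a minimal Python sketch of that notation. The function name `devoice` and the descender set are my own assumptions for illustration, and the set only covers symbols that show up in this post.

```python
RING_BELOW = "\u0325"  # combining ring below, as in the bilabial [m] + ring
RING_ABOVE = "\u030A"  # combining ring above, used over letters with descenders

# Letters with a descender, where a ring below would be hard to read.
# Assumption: this hand-picked set only covers symbols used in this post.
DESCENDERS = set("ɳɲŋɭjʄ")

def devoice(symbol: str) -> str:
    """Return a voiced IPA base letter with the voiceless ring attached."""
    ring = RING_ABOVE if symbol in DESCENDERS else RING_BELOW
    return symbol + ring

# Print the voiceless nasal series from the list above.
for voiced in ["m", "n", "ɳ", "ɲ", "ŋ"]:
    print(voiced, "->", devoice(voiced))
```

This only handles the spelling. Whether the resulting sound is contrastive in a given language is a separate question entirely.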


There are five approximants that are actually known to contrast with or at least differ from their fricative counterparts:

[ʋ] - voiced labiodental approximant. The "soft V" found in Dutch, North Germanic languages, Finnish, Czech, Slovak, and some South Slavic languages. Not particularly common outside of Europe, but it does occur.

[ɹ] - voiced alveolar/postalveolar approximant. Your stereotypical "English R," fairly rare and tends to be dialectal and a variant in the languages it does occur in. Primarily occurs in Indo-European languages, but is also attested in Burmese, Vietnamese, and Igbo. Is considered a vowel in some dialects of Mandarin.

[ɻ] - voiced retroflex approximant. Quite rare, mainly occurs in India and Australia, but is also attested in southern South America, and dialectally in certain Germanic languages, including English.

[j] - voiced palatal approximant. One of the two most common approximants, occurring in a large number of languages; is the consonantal equivalent of [i]. (Written with "y" in English.) Sometimes takes on noise and becomes [ʝ], but this is a variant.

[ɰ] - voiced velar approximant. Very rare contrastively, primarily occurring as a variant of [k], [g], or [ɣ]. The consonantal equivalent of [ɯ]. Turkish did have this, but has since lost it.

And then there're these gems:

[l] - voiced alveolar lateral approximant. VERY common. Often has a velarised variant; occasionally this [lˠ] is contrastive with [l] (Albanian, for example)

[ɭ] - voiced retroflex lateral approximant. Rare. Attested primarily in Dravidian and North Germanic languages, also in Australia, and in Korean and Khanty.

[ʎ] - voiced palatal lateral approximant. Not really common, not really rare. A large number of Indo-European languages (of Europe, anyway) have this sound, as do Basque and Hungarian. Outside of Europe, the best examples are in Quechua, Aymara, and more neutral American and Canadian dialects of English.

[ɥ] - voiced labialised palatal approximant. VERY rare contrastively, occurring in Abkhaz and Iaai. (Not sure if there are others.) However, it is more common as a variant, with languages with the vowel [y] and [w] usually having it as a variant of [w] (such as some Chinese languages, Korean, and Shixing); occurs in French as a variant of [y].

[ʟ] - voiced velar lateral approximant. Acoustically almost indistinguishable from the velarised alveolar; the one that is simply velar is much rarer and attested only in the Pacific, and in Scots.

[w] - voiced labio-velar approximant. One of the two most common approximants, occurring in a wide variety of languages. Is the consonantal equivalent of [u].

[ʟ̠] - voiced uvular lateral approximant. Unattested contrastively but apparently possible, and tentatively attested as a variant in certain American Englishes. How, I don't know.

Contrastive voiceless approximants are generally very rare, as they tend to mainly occur as variants of their voiced counterparts following voiceless plosives or affricates. Only the ones that are attested as contrastive are listed here.

[ʋ̥] - voiceless labiodental approximant. Attested contrastively only amongst English-speaking South Africans of Indian extraction.

[l̥] - voiceless alveolar lateral approximant. Usually a variant of [l], only attested contrastively in Shixing, Tibetan, and Moksha.

[ɭ̊] - voiceless retroflex lateral approximant. Attested contrastively only in the Iaai language of New Caledonia and the Dravidian language Toda.

[j̊] - voiceless palatal approximant. Attested contrastively in some Samic languages, Moksha, and Jalapa Mazatec (possibly other Mazatecan languages as well).

[ʎ̥] - voiceless palatal lateral approximant. Attested contrastively only in Shixing.

[ʍ] - voiceless labio-velar approximant. Attested contrastively in a number of English dialects (including Southern American English; these dialects preserve an audible contrast between the words "whine" and "wine," of which this is the one in "whine") and Hupa; Old English had this sound; many languages have this take on noise and become [xʷ] instead.


As with approximants and nasals, flaps are usually voiced.

[ⱱ̟] - voiced bilabial flap. VERY rare - attested primarily in scattered African languages - and not yet proven to be contrastive in any language, rather being a variant of...

[ⱱ] - voiced labiodental flap. Quite rare - attested in scattered African languages, mainly in Central Africa.

[ɾ] - voiced coronal flap. Don't worry about the term "coronal" for now - I'll expand on that when I do phonology. The typical place of articulation for this is alveolar, but it can also be post-alveolar or dental, and these never contrast. Anyway, these are somewhat common as an "r-sound" in language, as in Spanish, Turkish, and Arabic, and also as a variant in Russian. It does occasionally occur as a variant of [t] and [d] as well, as in North American Englishes, Estuary English, and Danish. There's even a nasalised variant of this - [ɾ̃] - that occurs in several North American Englishes as a merger of [n] and [t]. (It isn't as common in Canada.)

[ɺ] - voiced coronal lateral flap. VERY rare, the only major language with this sound is Japanese; occurs as a postalveolar in one particular dialect of Norwegian.

[ɽ] - voiced retroflex flap. As with most retroflex sounds, this is most common to North Germanic, and languages of India and Australia, but in the cases of the North Germanic languages it occurs as a variant of laterals.

[ɭ̆] - voiced retroflex lateral flap. Very rare; occurs primarily in South and Southeast Asia, the most notable contrastive example being Pashto.

[ʟ̆] - voiced velar lateral tap. Only attested in two languages of Papua New Guinea, and not contrastively; it can be a tap on the soft palate because air can still pass around the edges of the tongue, while the time of contact is still very short.

[ɢ̆] - voiced uvular flap. Very rare and never occurs in contrast. German, Dutch, and Limburgish are attested to have it.

[ʡ̮] - voiced epiglottal flap. Only attested in one language - Dahalo - and even there it isn't contrastive.

Voiceless flaps do occur, but only as variants. They've never been attested in contrast.


[ʙ] - voiced bilabial trill. Like mimicking a horse, except with voicing! :p Occurs in diverse languages on most continents, but is quite rare.

[r] - voiced coronal trill. Usually alveolar, and very common as an "r-sound."

[ɽ͡r] - voiced retroflex trill. VERY rare - attested contrastively only in three languages and tenuously in some dialects of Dutch.

[ʀ] - voiced uvular trill. Never contrasts with the fricative [ʁ], and is found as the "guttural R" of languages like French, Portuguese, German, some dialects of Dutch, and Hebrew, among others; most of the languages in question are Indo-European.

[ʢ] - voiced epiglottal trill. Basically a growl; only attested as a speech sound in the Aghul language of Dagestan, Russia, and in some dialects of Arabic.

Voiceless trills are relatively rare in contrast. Only those attested in contrast are shown.

[r̥] - voiceless coronal trill. This does occur in contrast in some languages, such as Icelandic, Moksha, and Welsh.

[ʜ] - voiceless epiglottal trill. Patterns like a fricative; still quite rare, being attested in languages like Chechen, Dahalo, Haida, and some dialects of Arabic.


Ejectives come in three subtypes. There are ejective plosives, ejective affricates, and ejective fricatives. These are always voiceless. All are fairly rare, but ejective plosives and affricates are more common than ejective fricatives.

Ejective plosives

If a language has ejectives, it WILL have ejective plosives.

[p'] - bilabial ejective plosive. Found in scattered languages, but particularly common in the Americas, southern Africa, the Cape Horn area, and the Caucasus (including even Armenian, which is an Indo-European language, quite possibly the only one to have contrastive ejectives).

[t'] - alveolar/dental ejective plosive. While these are written with the same symbol in most cases, the Dahalo language of Kenya has both. Otherwise, in much the same way as the bilabial, these are particularly common in the Americas, southern Africa, the Cape Horn area, and the Caucasus.

[ʈʼ] - retroflex ejective plosive. VERY rare, supposedly occurs in Gwich'in, a Na-Dene language primarily spoken in northwestern Canada and western Alaska.

[cʼ] - palatal ejective plosive. Very rare, occurring mainly in North American indigenous languages, like Haida for example; also attested in isolated languages in southern Africa and the Caucasus, and in Hausa, a major trade language of central west Africa.

[k'] - velar ejective plosive. Found in scattered languages, but particularly common in the Americas, southern Africa, the Cape Horn area, and the Caucasus. Labialised versions of this are common in the Caucasus and western North America.

[q'] - uvular ejective plosive. Limited mainly to the Caucasus and western North America; labialised versions of this occur in Northwest Caucasian, Salishan, Wakashan, and some Na-Dene languages. Palatalised versions occur in Abkhaz and occurred in now-extinct Ubykh, which also had pharyngealised and labialised-pharyngealised versions as well!

[ʡʼ] - epiglottal ejective plosive. VERY rare, only attested in Dargwa, a Northeast Caucasian language.

Ejective affricates

[t͡sʼ] - alveolar ejective affricate. Primarily attested in languages of the Caucasus and the Pacific Northwest.

[t͡ɬ'] - alveolar lateral ejective affricate. I LOVE this sound. As with the above, it is primarily attested in languages of the Caucasus and the Pacific Northwest.

[t͡ʃʼ] - palato-alveolar ejective affricate. Primarily attested in languages of the Caucasus and the Pacific Northwest.

[ʈ͡ʂʼ] - retroflex ejective affricate. Very rare. Attested in Adyghe, a Northwest Caucasian language. Proposed for Avar and Yokutsan languages.

[c͡ʎ̝̥ʼ] - palatal lateral ejective affricate. VERY rare. Attested in Dahalo, and an African language isolate called Hadza.

[k͡xʼ] - velar ejective affricate. Fairly rare. Occurs mainly in southern Africa, esp. Zulu and Xhosa, but also attested in Haida and Hadza.

[k͡ʟ̝̊ʼ] - velar lateral ejective affricate. VERY rare, only occurring contrastively in Archi, a Northeast Caucasian language, and in that language having plain and labialised forms thereof. Occurs in some southern African languages as a variant of [k͡xʼ].

[q͡χʼ] - uvular ejective affricate. The ultimate throat-killer. :p Occurs as a variant of [q'] in Northeast Caucasian and also Salishan languages. It's actually easy to go from one to the other because the force of an ejective causes vibration of the uvula and therefore some noise.

Ejective fricatives - these are all VERY rare.

[fʼ] - labiodental ejective fricative. Attested in Kabardian, a Northwest Caucasian language. Proposed for Yapese.

[θʼ] - dental ejective fricative. Attested in South Arabian languages (esp. Mehri) and also in Yapese.

[sʼ] - alveolar ejective fricative. Mostly found in scattered North American indigenous languages; attested dialectally in Adyghe and Hausa.

[ɬ’] - alveolar lateral ejective fricative. Found in most Northwest Caucasian languages (not Abkhaz, though, which actually doesn't have ejective fricatives); also attested in Tlingit.

[ʃʼ] - palato-alveolar ejective fricative. Attested in Adyghe and the Keresan languages.

[ʂʼ] - retroflex ejective fricative. Attested in the Keresan languages.

[x’] - velar ejective fricative. Attested in Tlingit.

[χ’] - uvular ejective fricative. Attested in Tlingit, and supposedly Georgian as a variant.


Implosives only have one form - they pattern in the same way as plosives. Implosives are more commonly voiced. They occur most frequently in Sub-Saharan Africa and Southeast Asia, but also happen in the Sindhi language and the Saraiki dialect of Punjabi, in Pakistan.

[ɓ] - voiced bilabial implosive. Occurs in a number of African and Southeast Asian languages, and also in Sindhi and Saraiki. Apparently attested in Southern American English at the beginning of words. (Not sure I buy that.) Notable languages with it include Vietnamese, Khmer, and Hausa. Also attested in some Mayan languages of Guatemala.

[ɗ] - voiced alveolar implosive. Occurs in a number of African and Southeast Asian languages, and also in Sindhi and Saraiki. Notable languages with it include Vietnamese, Khmer, and Hausa.

[ᶑ] - voiced retroflex implosive. Much rarer; attested in Ngadha, a language of Indonesia, and Oromo, a major vernacular language of Ethiopia.

[ʄ] - voiced palatal implosive. Not as common as some implosives, and not attested in Southeast Asia, being mainly used in Africa. Does appear in Sindhi and Saraiki, however.

[ɠ] - voiced velar implosive. Occurs primarily in Africa, but also occurs in Sindhi and Saraiki.

[ʛ] - voiced uvular implosive. Occurs in actual language in a very odd place; the Mam language of Guatemala. Apparently no other language has it, which is odd in the sense that it is outside the regular occurrence zones of implosives. Used by many other languages, though, as a way to mimic gulping sounds!

There are voiceless implosives attested; those few languages confirmed using them in contrast are mostly in Africa.

[ɓ̥] - voiceless bilabial implosive. Attested contrastively in Serer-Sine, a Senegalese language, and in the Owere dialect of Igbo in Nigeria.

[ɗ̥] - voiceless alveolar implosive. Attested contrastively in Serer-Sine and in the Owere dialect of Igbo.

[ʄ̊] - voiceless palatal implosive. Attested contrastively only in Serer-Sine.

[ʛ̥] - voiceless uvular implosive. Attested in Kaqchikel and Q’anjob’al, both Mayan languages spoken in Guatemala. These are also the only languages on the books that are claimed to have a voiceless implosive without having the voiced counterpart.


Only one language outside of southern Africa has ever been attested having clicks, and that language (in Australia) is now extinct; clicks have numerous contrastive forms, but only the baseline forms are shown here.

[ʘ] - bilabial click. Not a kissing sound, rather a lip-smacking sound done with non-puckered lips. Attested in the Tuu and Kxa languages of southern Africa.

[ǀ] - dental click. Used as an actual speech sound in a number of southern African languages, including Zulu, Xhosa, and the languages formerly designated as "Khoisan." (This term fell out of use when it was determined that those languages didn't necessarily form a provably cohesive language family.) Used paralinguistically in English to denote pity or disapproval, written "tsk-tsk."

[ǃ] - alveolar click. Occurs in a number of southern African languages, including Sesotho, which has no clicks in any other place of articulation, Xhosa, Zulu, and the languages formerly designated as "Khoisan."

[ǂ] - palatal click. Only occurs in languages formerly designated as "Khoisan."

[ǁ] - lateral click, formed by making the click on the side of the mouth rather than in the middle or across the whole mouth. Used in many southern African languages, including Xhosa (the name of which has the aspirated variant of this), Zulu, and the languages formerly designated as "Khoisan."

[ǃ˞] - retroflex click. Extremely rare, actually only being attested in the Central !Kung language of Namibia.


Unlike consonants, which occur in more or less a specific spot in the mouth, vowels have a larger range and are often relative to the language. Still, there are specific formant frequency ranges that are set out for each vowel. You would never hear an [i] with the formants of an [ɑ], for example. Anyway.

Keep in mind as well, there are two different conventions for vowel height. Vowels can also be contrastively long and short, which they were in Latin and Old English, and remain in languages like Finnish, Japanese, Navajo, etc. They can be contrastively nasalised in languages like the Na-Dene languages of North America (Navajo, Tlingit, Apache, Dakelh, Tsilhqut'in, etc.), French, most dialects of Portuguese, Polish, etc. Furthermore, phonation can vary; some languages have contrastive creaky voice.

What are shown here are your basic segmental vowels.

Front Vowels

[i] - close/high front unrounded vowel. Among the most common vowel sounds - most languages have this sound in contrast. Can't actually think of a language off the top of my head that doesn't. It isn't always a "clean" [i], though; some dialects of English have a slight diphthong instead, [ɪj].

[y] - close/high front rounded vowel. Considerably less common than its unrounded counterpart. Occurs primarily in Europe and North/Central Asia. English is probably the only major Germanic language that doesn't have it cross-dialectally, and even some dialects of English (most stereotypically Scottish) have it. Major languages with this sound include French, German, Mandarin, Cantonese, Wu, Dutch, Turkish, Hungarian, and Finnish.

[ɪ] - near-close/near-high front unrounded vowel. Not as common as its close/high counterpart, but still occurs fairly widely, particularly in Europe and Africa. In English, this replaced the short [i] after vowel-shift, and prescriptive grammars still call it "short-I," but there is more than just length that is a factor here; other West Germanic languages follow suit in this regard.

[ʏ] - near-close/near-high front rounded vowel. Fairly rare. Almost all the languages that use it are Indo-European, as a short variant of [y], although it is also attested in Turkish and Hungarian. On top of this, the bulk of the languages from Indo-European that use it are either themselves Germanic (Dutch, Icelandic, Swedish, Faeroese, German, Limburgish, Norwegian) or have a historically heavy Germanic influence (French; one could possibly make this argument for non-Indo-European Hungarian as well).

[e] - Close-mid/high-mid front unrounded vowel. As with [i], this is sometimes not a "clean vowel" but a diphthong. Fairly common; occurring in some "5-vowel systems" and more or less all "7-vowel" and "9-vowel" systems. (I'll talk about that in phonology). Noteworthy examples of this being found clean are Cantonese, German, French, Hindi, Scottish English, and Arabic.

[ø] - Close-mid/high-mid front rounded vowel. A little less common, occurring primarily in Europe and Northern/Central Asia. Occurs in most Germanic languages and in French (why the crap isn't it in more standard English, then?) and also occurs in Turkish, Wu, and Hungarian, among others.

[e̞] or [ɛ̝] - Mid front unrounded vowel. This sound occurs in languages that don't have a contrast between two mid-range front unrounded vowels, such as Spanish, Finnish, Japanese, Romanian, a number of Slavic languages, Hebrew, and Tagalog, among others.

[ø̞] or [œ̝] - Mid front rounded vowel. Fairly rare. Uralic languages with a front rounded vowel in the mid-range, like Finnish, Estonian, Võro, and Hungarian, have this sound, as does Turkish. In a number of Germanic languages, English included, it occurs dialectally, although in some cases this is argued.

[ɛ] - open-mid/low-mid front unrounded vowel. Fairly common. Occurs in some "5-vowel," "7-vowel," and "9-vowel" systems, although in some of these, it is supplanted by [ə].

[œ] - open-mid/low-mid front rounded vowel. Rarer, and typically a variant of [ø]. The only language I can think of where these two actually contrast is French, which is a nightmare for me, because I have trouble telling the two apart even now after years of training!

[æ] - near-open/near-low front unrounded vowel. Fairly rare. In a number of the languages that have it, it only occurs in certain dialects. Some languages that consistently have it are English, Finnic languages (Finnish, Estonian, etc.), Northern Azeri, Farsi, and Tsilhqut'in (I can verify this from personal experience, as I studied the language with two speakers in my undergrad). In Southern American English, it contrasts with [a].

[a] - open/low front unrounded vowel. Fairly common. Since most languages have only a single low-range vowel, this is either the default form of it, as in Spanish, Mandarin, Arabic etc., or a variant thereof. A number of "5-vowel," "7-vowel," and "9-vowel" systems have this, with this being the only one that is unpaired in those cases.

[ɶ] - open/low front rounded vowel. Extremely rare and only attested in a few Germanic languages as a variant. Probably because it would make you look like a contortionist trying to pronounce the thing while keeping your lips rounded!

Central vowels

With the exception of [ə], the mid-range of the central vowels is hard to find in contrast. This is because in usage, they seldom if ever contrast with their back counterparts.

[ɨ] - close/high central unrounded vowel. [ə] aside, this might be the most common central vowel attested in language, appearing contrastively in a large range of languages in various places around the world. Russian, Uzbek, Mongolian, Irish Gaelic, and Mandarin have the prototypical examples of this sound.

[ʉ] - close/high central rounded vowel. Typically a variant of [u]; dialectal in English, contrastive in Swedish and Norwegian, and in fact Swedish contrasts three high rounded vowels, a rarity in language.

[ɘ] - close-mid/high-mid central unrounded vowel. Rare in contrast, occurs dialectally to a degree in English, but only consistently in languages like Kazakh and Skolt Sami. Often freely variant with back vowel [ɤ].

[ɵ] - close-mid/high-mid central rounded vowel. Rare in contrast, occurring definitively in languages such as Cantonese, Mongolian, and Tajik but being conflated with other vowels in a number of other supposed attestations.

[ə] - mid central vowel. If a language reduces vowels, it will have this as a variant, and a LOT of languages reduce vowels! It is much less common contrastively, but still occurs fairly frequently; Northwestern and Northeastern Caucasian languages, indigenous languages of Western North America, Indo-Iranian languages (Hindi, Punjabi, Marathi, Kurdish), Armenian, Albanian, Palauan, and even French have this contrastively.

[ɚ] - rhotacised mid central vowel. This can also be written as a syllabified [ɹ]. Very rare in terms of number of languages that use it, but for number of speakers? Most North American Englishes (New England or certain New York or Southern sub-dialects notwithstanding) have this, as do some widely-spoken dialects of Mandarin.

[ɜ] - open-mid/low-mid central unrounded vowel. Rare in contrast, but surprisingly, Queen's English (RP) is one of the languages that does use it, replacing the sequence [ə] + [ɹ]. Minority languages like Paicî (New Caledonia) and Ladin (northern Italy) have it in contrast as well.

[ɞ] - open-mid/low-mid central rounded vowel. Mainly occurs as a variant, occasionally occurring dialectally, and only occurring contrastively in one language, the Kashubian language of Poland.

[ɐ] - near-open/near-low central unrounded vowel. Surprisingly common as a variant; rarer in contrast, but is considered the "citation form" corresponding to the letter "a" in languages like Catalan, Cantonese, and the Baltic languages. A rounded equivalent does occur in one language, the Sabiny language of Uganda.

Back vowels

[ɯ] - close/high back unrounded vowel. While it never contrasts with [ɨ], the two are quite distinct nonetheless. Somewhat common, occurring in many Turkic and Mongolic languages, some Chinese languages, some languages of southeast Asia, Korean, and Scottish Gaelic.

[u] - close/high back rounded vowel. Very common, easily the most common rounded vowel in any world language. English doesn't have a clean [u] per se, with it instead being partly diphthongised. Some languages, like Japanese, Wu, Swedish, and Norwegian, have a form where the lips are rounded but don't stick out (called "compressed"), giving it a slightly different sound. But typically the rounding of this vowel results in protruded lips.

[ʊ] - near-close/near-high back rounded vowel. Fairly common either in five/seven/nine-vowel systems, as a variant of [u] as in Russian or Québecois French, or as a development from a historic short [u] as in English; does contrast in English (compare the words soot [sʊt] and suit [sut]). There is an unrounded vowel of this height and backness, but it occurs strictly as a variant or a dialectal sound.

[o] - close-mid/high-mid back rounded vowel. Quite common. Many English dialects don't have a clean [o]; languages that do include French, German, some dialects of English (Scottish English, Singlish, Indian English), and a number of others, many of which have 7-vowel or 9-vowel systems. Wu Chinese has a "compressed" form of this vowel.

[ɤ] - close-mid/high-mid back unrounded vowel. Fairly rare. Languages that do have it include some languages of East and Southeast Asia, most notably Mandarin, Taiwanese, and Thai.

[o̞] - mid back rounded vowel. As with the front vowels, back vowels "go mid" when they don't distinguish multiple vowels in the mid-range. Finnic languages, Spanish, several Slavic languages, Japanese, Turkish, and Hebrew are among those that have this vowel.

[ɤ̞] - mid back unrounded vowel. Quite rare. Estonian, Võro, Danish, and Bulgarian have this sound attested, as do certain dialects of English (Norfolk and Cardiff supposedly) and Vietnamese.

[ɔ] - open-mid/low-mid back rounded vowel. Common, and from what I've seen it's probably more common than [o]. This "open O" occurs in all 7-vowel and 9-vowel systems and in a fairly large number of 5-vowel systems. Its place in English is dialectally varied, but in R-retaining dialects it is generally the variant of [o] before [ɹ]. In R-drop dialects, "or" generally becomes a long [o], usually written [o:].

[ʌ] - open-mid/low-mid back unrounded vowel. Not really that common. Occurs in a large number of English dialects (especially in North America), though, as the vowel in words like "butt," "shut," "fronting," etc. Also appears in Standard Korean, Tamil, and as a variant of [ə] in Salishan languages.

[ɑ] - open/low back unrounded vowel. The only unrounded back vowel that is more common than its rounded counterpart, and the continuum between it and centralised [ä] gives us quite possibly the most common vowel sound area in language. Arguably all languages have a vowel somewhere in this formant range; the only other vowel that is in the same range of commonality is [i]. A language contrasting three low vowels is almost unheard of, although apparently Skolt Sami does, contrasting this, [æ], and [ɐ].

[ɒ] - open/low back rounded vowel. In contrast to its unrounded counterpart, this is very rare; it's usually dialectal or a variant of something else; an exception to this is its frequency in a large number of dialects of English (including my own, "Western Canadian"), and it is also attested in Farsi, Uzbek, Hungarian, the dialects of Western Desert in Australia, and Assamese. In most cases, the vowel-rounding is less pronounced than in other rounded vowels.
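
The vowel labels in this list all follow one pattern: height, then backness, then rounding. As a rough illustration, here is a small Python sketch that stores a subset of the vowels above by those three features and looks up a vowel's description or its rounded/unrounded partner. The table is deliberately incomplete (no near-close, mid, or near-open rows), and the function names are just ones I picked for the example.

```python
# (height, backness, rounded) -> IPA symbol, for a subset of the vowels above.
# Only the "close/open" naming convention is used here, not "high/low".
VOWELS = {
    ("close", "front", False): "i",
    ("close", "front", True): "y",
    ("close", "central", False): "ɨ",
    ("close", "central", True): "ʉ",
    ("close", "back", False): "ɯ",
    ("close", "back", True): "u",
    ("close-mid", "front", False): "e",
    ("close-mid", "front", True): "ø",
    ("close-mid", "back", False): "ɤ",
    ("close-mid", "back", True): "o",
    ("open-mid", "front", False): "ɛ",
    ("open-mid", "front", True): "œ",
    ("open-mid", "back", False): "ʌ",
    ("open-mid", "back", True): "ɔ",
    ("open", "front", False): "a",
    ("open", "back", False): "ɑ",
    ("open", "back", True): "ɒ",
}

# Reverse lookup: symbol -> feature triple.
FEATURES = {symbol: feats for feats, symbol in VOWELS.items()}

def describe(symbol):
    """Build the descriptive label for a vowel symbol in the table."""
    height, backness, rounded = FEATURES[symbol]
    rounding = "rounded" if rounded else "unrounded"
    return f"{height} {backness} {rounding} vowel"

def rounding_counterpart(symbol):
    """Return the vowel that differs only in rounding, or None if not in the table."""
    height, backness, rounded = FEATURES[symbol]
    return VOWELS.get((height, backness, not rounded))

print(describe("y"))
print(rounding_counterpart("ɔ"))
```

Keying the table on features rather than symbols makes pairs like [i]/[y] or [ʌ]/[ɔ] fall out of a single flipped value, which is the same logic behind calling one the "rounded counterpart" of the other.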

Welp, that's the end of the basics of phonetics.
So having seen all these sounds, you think, "Okay, seems simple enough! I think I can learn all these languages now!" Not so fast. A language may use a large number of phonetic sounds, but you'll find out quickly enough that pairs or groups of these can count as the same sound in that language, or even in various contexts within it. When someone asks how many sounds are in a language, they usually want to know how many distinct sounds there are, and that comes down to how the sounds pattern and make sense within each individual language.

This is in the realm of what is called "phonology."

You know how I used the word "contrastively" a lot when describing the individual sounds? Well, that was actually cheating a little. That's something you would hear more in a phonology class than in a phonetics class. And consistent sound contrast gives the listener/linguist an idea of what sounds are actually perceived in a language.

The first word you need to know is phoneme. You run into this word everywhere in phonology. It is the underlying abstract sound in the brain that can surface as one or multiple sounds on the surface. The surface sounds that are tied to your underlying phoneme are called allophones. Every time I used the word "variant" in the section on phonetics and individual sounds, you could replace every last one of those with "allophone." The great thing about phonemes is that you write them down differently than you do allophones, or sounds in phonetic transcription as you would in a phonetics class or lab session. And for the rest of this series on linguistics, I intend to write words "phonemically" to simplify things a bit. Phonetics uses square brackets for everything, but phonology only uses them for allophones. For phonemes /ðe ɑɹ ɹɪtn̩ bətwi:n slæʃəz/!

Sound Classes

I'll lay some groundwork before talking about phonemes a little more. Sounds pattern in various ways, and there are actually classes of sounds that lump together a lot of the places or manners of sounds.

There are four major place feature classes, although sometimes a sound will have more than one of these (co-articulated sounds generally do). First there are labial sounds, which comprise any sound using the lips as one of the articulators. Coronal sounds use the tip, blade, or front of the tongue, while dorsal sounds use the back. Laryngeal sounds are either made using the root of the tongue or just the glottis. As far as the different phonetic places go, here's how they line up:

Labial: bilabial, labiodental, linguolabial; rounded vowels are said to have a secondary labial feature.
Coronal: linguolabial, dental, alveolar, palato-alveolar, retroflex, alveolo-palatal
Dorsal: alveolo-palatal, palatal, velar, uvular; furthermore, all vowels are considered dorsal - perhaps controversially so in the case of low-back vowels - because the formants get their resonance from the area between the hard palate and the upper pharynx.
Laryngeal: uvular (rarely), pharyngeal, epiglottal, glottal
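
If it helps to see the four lists above as data, here's a rough sketch in Python. The dictionary just restates the lists; the names PLACE_CLASSES and places_in_class are my own for illustration (they don't come from any library), and vowels are left out to keep it short.

```python
# Major place classes for each phonetic place of articulation,
# restating the lists above. Some places belong to two classes.
PLACE_CLASSES = {
    "bilabial": {"labial"},
    "labiodental": {"labial"},
    "linguolabial": {"labial", "coronal"},
    "dental": {"coronal"},
    "alveolar": {"coronal"},
    "palato-alveolar": {"coronal"},
    "retroflex": {"coronal"},
    "alveolo-palatal": {"coronal", "dorsal"},
    "palatal": {"dorsal"},
    "velar": {"dorsal"},
    "uvular": {"dorsal", "laryngeal"},  # laryngeal only rarely
    "pharyngeal": {"laryngeal"},
    "epiglottal": {"laryngeal"},
    "glottal": {"laryngeal"},
}


def places_in_class(major_class):
    """Return the phonetic places that fall under one major class, sorted."""
    return sorted(p for p, classes in PLACE_CLASSES.items() if major_class in classes)


print(places_in_class("dorsal"))
print(PLACE_CLASSES["linguolabial"])
```

Storing a set per place is what lets linguolabial, alveolo-palatal, and uvular sit in two classes at once.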

There is a two-way sonority contrast as well. Sonorants are sounds that are produced without turbulent airflow in the vocal tract, while obstruents have turbulence, resulting in noise or complete airflow stoppage. This is a grouping of manners.

Obstruents: plosives, affricates, fricatives
Sonorants: all other consonants, and also vowels.

And then there's continuancy. Continuants have continuous airflow through the mouth, while non-continuants do not. Notice I said, "through the mouth." Nasals are NOT continuants because the airflow goes through the nose. Other than this, plosives and affricates are non-continuants as well, while all other consonants, and vowels, are considered continuants.
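
Sonority and continuancy can be tabulated the same way. This is a minimal sketch, assuming a simplified list of manners; the manner labels and the natural_class helper are made up for illustration.

```python
# Each manner maps to (sonorant, continuant), following the groupings above.
MANNERS = {
    "plosive": (False, False),
    "affricate": (False, False),
    "fricative": (False, True),
    "nasal": (True, False),  # airflow goes through the nose, so not a continuant
    "tap": (True, True),
    "trill": (True, True),
    "approximant": (True, True),
    "vowel": (True, True),
}


def natural_class(sonorant=None, continuant=None):
    """List the manners matching the given feature values (None = don't care)."""
    matches = []
    for manner, (is_sonorant, is_continuant) in MANNERS.items():
        if sonorant is not None and is_sonorant != sonorant:
            continue
        if continuant is not None and is_continuant != continuant:
            continue
        matches.append(manner)
    return sorted(matches)


print(natural_class(sonorant=False))                  # the obstruents
print(natural_class(sonorant=True, continuant=False)) # sonorant non-continuants
```

The second call comes back with just the nasals, which is exactly why they are the odd ones out in this system.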

More About Phonemes

Here's an example from Canadian English (the following is also true of most of Western American English, North Central American English, Northern American English, and "General American") of a phoneme and its allophones.

Let's take /t/ for example. When asked to pronounce this, most people will say [tʰ]. That's because English has what's called "aspiration." But we don't say /bɪt/ as [bɪtʰ]. Instead, we say either [bɪt] or [bɪt̚] (the diacritic on the second one is for an unreleased stop). On top of this, we never say "stick" [stʰɪk] or "schtick" [ʃtʰɪk]. There's no aspiration in those contexts. Some people don't even aspirate the first /t/ in "antidote." (I do, but the aspiration is weaker.) We do, however, aspirate the "t" in "tick" [tʰɪk] and "catastrophe" [kətʰæstɹ̥əfi].

So what happens is that /t/ becomes [tʰ] at the very beginning of a syllable (in some dialects it's even more specific than that, taking on the aspirated form at the beginning of a stressed syllable - I'll talk about stress later, because there are actually multiple types thereof) and [t] elsewhere, with unreleased [t̚] as a free variant when it is the only sound after a vowel at the end of an utterance. (I'll talk about free variation later.) (Aspiration also exists in other Englishes, but the distribution of it is different.)

If only it were that simple. :p

In North American Englishes, if a /t/ is between a vowel and a syllabic approximant, it will instead become [ɾ]. So an underlying /bʌtɹ̩/ "butter" will actually surface as [bʌɾɹ̩], with the sound represented by the "tt" being relatively short in length. This is a "flapping process", which mainly occurs with /r/ in languages but also occurs with /t/, /d/, and sometimes even the sequence /nt/, which happens in some American Englishes and with certain speakers in Canada. Flapping actually falls under a larger group of processes called lenition, in which a sound becomes "weakened" for articulatory reasons.

As for aspiration, it is the opposite, which is called fortition. The sound becomes stronger. You'll need to remember this for when I talk about syllable structure, because both fortition and lenition play into it somewhat.

Back on track - the phoneme /t/ actually has more than two allophones - [t], [tʰ], [t̚], and [ɾ]. (There are actually more than this, but I'll get to this later)

Phonological processes

Last time I introduced the concept of the phonological process, but now I'll explain it a bit. It's how a phoneme becomes an allophone. Sometimes nothing happens at all! This would result in an elsewhere form (not sure if this is an actual linguistic term, but it's a helpful and accurate one nonetheless) - sound X becomes Y in environment A, Z in environment B, and X elsewhere.
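
The "Y in environment A, Z in environment B, X elsewhere" idea is really just an ordered list of checks with a fallback. Here's a minimal sketch using the /t/ example from earlier. Everything about it is simplified: the environment is boiled down to three hypothetical arguments (the previous sound, the next sound, and whether /t/ starts its syllable), the vowel list is a small made-up sample, and the unreleased free variant is ignored.

```python
VOWELS = set("aeiouæɑɪʌə")       # small sample, not a full inventory
SYLLABIC_R = "ɹ\u0329"            # ɹ with the syllabic diacritic
SYLLABIC_L = "l\u0329"            # l with the syllabic diacritic
SYLLABIC_APPROXIMANTS = {SYLLABIC_R, SYLLABIC_L}


def realise_t(prev, nxt, syllable_initial):
    """Pick a surface allophone for /t/ from a very simplified environment."""
    # Environment A: flapping between a vowel and a syllabic approximant.
    # This check comes first because it has to win over aspiration.
    if prev in VOWELS and nxt in SYLLABIC_APPROXIMANTS:
        return "ɾ"
    # Environment B: aspiration at the very beginning of a syllable.
    if syllable_initial:
        return "tʰ"
    # Elsewhere: nothing happens and the phoneme surfaces unchanged.
    return "t"


print(realise_t(None, "ɪ", True))         # "tick"
print(realise_t("s", "ɪ", False))         # "stick"
print(realise_t("ʌ", SYLLABIC_R, True))   # "butter"
print(realise_t("ɪ", None, False))        # "bit"
```

The order of the checks matters: the /t/ in "butter" starts its syllable too, so if aspiration were checked first the flap would never surface.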

I mentioned lenition and fortition last time. There are actually several processes that fall under these two umbrellas.


There is flapping, which I mentioned last time, but this isn't the most common one. It does happen with /t/ and /d/ in English, and often happens with /r/ in languages that have it. This has a tendency to happen between vowels.

There is what is called spirantisation, where a plosive or affricate becomes a fricative, typically between vowels. We have a bit of this in English, but there's actually more going on than just simple spirantisation, and I'll bring examples of this up later. An example of pure spirantisation would look more like this:

/bat͡sa/ -> [basa]

/baga/ -> [baɣa]

Intervocalic voicing happens when a voiceless sound takes on voicing between two vowels. It's common, and actually, Old English had it - the remnants of it can be seen in very old words where the second vowel has since become silent, as in "knife" vs. "knives."

Examples would look like this:

/kasa/ -> [kaza]
/tupo/ -> [tubo]

There's also post-nasal voicing. This doesn't happen in English (or any Germanic language for that matter) but it does happen in some languages - an underlying voiceless obstruent will become voiced after a nasal.

/sompa/ -> [somba]
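
Both voicing rules can be sketched as simple find-and-replace operations on a string. This assumes a toy language with only the five vowels and the handful of consonants listed below (the inventories and function names are my own, purely for illustration), with one character per sound.

```python
import re

VOWELS = "aeiou"
NASALS = "mn"
# Voiceless obstruents and their voiced counterparts in the toy inventory.
VOICED = {"p": "b", "t": "d", "k": "g", "s": "z", "f": "v"}
VOICELESS = "".join(VOICED)


def intervocalic_voicing(form):
    """Voice a voiceless obstruent that sits between two vowels."""
    pattern = rf"(?<=[{VOWELS}])[{VOICELESS}](?=[{VOWELS}])"
    return re.sub(pattern, lambda m: VOICED[m.group()], form)


def post_nasal_voicing(form):
    """Voice a voiceless obstruent that directly follows a nasal."""
    pattern = rf"(?<=[{NASALS}])[{VOICELESS}]"
    return re.sub(pattern, lambda m: VOICED[m.group()], form)


print(intervocalic_voicing("kasa"))
print(intervocalic_voicing("tupo"))
print(post_nasal_voicing("sompa"))
```

The lookbehind and lookahead check the neighbouring sounds without consuming them, so the word-initial consonants in "kasa" and "tupo" are left alone.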


Fortition processes are less common in general, and many of them happen to sounds over time rather than in active usage. However, there is one particularly common fortition process called final devoicing. It's pretty self-explanatory - you get an underlyingly voiced segment that becomes voiceless at the end of a word. An example from German:

/hand/ -> [hant] "Hand" (gee, I wonder what this means? :P)

It also happens in Slavic languages. An example from Russian:

/ɔlʲɛg/ -> [ɐlʲɛk] "Oleg" (bloke's name - this name will come up as an example later as well)
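
Final devoicing is about as simple as a process gets, so here's a quick sketch. It assumes one character per sound and a small made-up table of voiced/voiceless pairs, and it only handles the devoicing itself (the vowel change in the Russian example is a separate process and isn't modelled).

```python
# Voiced obstruents and their voiceless counterparts (toy inventory).
DEVOICED = {"b": "p", "d": "t", "g": "k", "z": "s", "v": "f"}


def final_devoicing(form):
    """Devoice a voiced obstruent at the very end of a word."""
    if form and form[-1] in DEVOICED:
        return form[:-1] + DEVOICED[form[-1]]
    return form


print(final_devoicing("hand"))
print(final_devoicing("hande"))  # hypothetical form: the /d/ is no longer final
```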

Some languages have post-nasal fortition; while the lenition process usually deals with voice, this process deals with continuancy, and usually gives it the ol' heave-ho. Examples I've seen generally follow a pattern like this:

/kumva/ -> [kumba]
/kinzi/ -> [kindi]
/nrasa/ -> [ndasa]

Place Assimilation

Now these two process types could both be regarded as assimilation of features of one sound to another, in terms of continuancy (or lack thereof) or voicing. But there's also place assimilation. Although consonants will interact with one another from time to time, the most common triggers are actually vowels. A very common one is palatalisation, where the tongue is moved towards the hard palate. In some languages this is actually contrastive, such as East Slavic languages (Russian, Ukrainian, etc.) and a number of Uralic languages (not Finnish, but still including Estonian to a degree), but in others it is actually an allophonic process. Japanese is one of the easiest examples to explain, where underlying /si/ will surface as [ɕi]. This also happens in Polish (exactly the same) and Korean (slightly different, as the surface form is [ʃi] - untrained anglo ears wouldn't really be able to tell the difference). This is why L1 speakers of Japanese and Korean struggle to differentiate "see" from "she" when learning English at first.

Velarisation occurs when the tongue is pulled back to the soft palate (velum). Aside from languages like Irish Gaelic and Russian, where velarisation is actually contrastive, this usually only happens with /l/, but is it ever common! Usually what happens is that /l/ will surface as [lˠ] at the end of a syllable. Now in some dialects of English (including many North American dialects), [lˠ] is actually the elsewhere form, if not the form that occurs everywhere. In Queen's English, though, [lˠ] will only occur after back vowels at the end of a syllable.

Labialisation will typically occur in the environment of a rounded vowel, and usually before it. The labial feature of a vowel will be passed back to the preceding consonant, resulting in a secondary articulation. This is actually not all that common, as usually when labialised consonants are talked about, they are contrastive.

Other contiguous assimilation processes

Not all voicing assimilation processes happen because of lenition. In Russian, there is an intuitive process where the second consonant in a cluster determines the voicing of both.

/futbɔl/ -> [fudbɑlˠ]

English has this going in the opposite direction, but it's only at the boundary of word parts. (I'll talk more about this in morphology)

/tɑkd/ -> [tʰɑkt] "talked." (Canadian/Western American/North Central American - the process is the same in other Englishes, but the vowels are different!)

These two actually demonstrate another point. Assimilation can move in either direction. When it moves towards the end of the word (as in the English examples), it is referred to as progressive assimilation. If it moves towards the beginning of the word (as in the examples from Russian or Japanese), it is called regressive assimilation. I find regressive assimilation is more common, but progressive isn't exactly rare.
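
To make the direction difference concrete, here's a small sketch that handles voicing agreement in a two-consonant cluster either way. It's deliberately crude: the transcriptions are simplified to plain letters (so the vowel changes in the real examples aren't modelled), the obstruent table is a made-up sample, and longer clusters aren't handled properly.

```python
VOICE = {"p": "b", "t": "d", "k": "g", "s": "z", "f": "v"}
DEVOICE = {voiced: voiceless for voiceless, voiced in VOICE.items()}
OBSTRUENTS = set(VOICE) | set(DEVOICE)


def match_voicing(target, trigger):
    """Return target with its voicing changed to agree with trigger."""
    if trigger in DEVOICE:               # trigger is voiced
        return VOICE.get(target, target)
    return DEVOICE.get(target, target)   # trigger is voiceless


def assimilate(form, direction):
    """Apply voicing assimilation to adjacent obstruent pairs.

    direction="regressive": the second consonant wins (Russian pattern).
    direction="progressive": the first consonant wins (English pattern).
    """
    segs = list(form)
    for i in range(len(segs) - 1):
        first, second = segs[i], segs[i + 1]
        if first in OBSTRUENTS and second in OBSTRUENTS:
            if direction == "regressive":
                segs[i] = match_voicing(first, second)
            else:
                segs[i + 1] = match_voicing(second, first)
    return "".join(segs)


print(assimilate("futbol", "regressive"))
print(assimilate("takd", "progressive"))
```

Same rule, same sounds; the only thing that changes is which member of the cluster gets to decide.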


Dissimilation usually happens in morphophonology, where a sound becomes less like its preceding sound for ease of pronunciation. More on that later.


Epenthesis is when a sound is inserted, usually a vowel into a tough consonant cluster, for ease of pronunciation. It doesn't happen quite as much in English, although the "e" (an epenthetic schwa [ə]) in "kisses", which separates two alveolar fricatives, is one example.


Metathesis is when any two sounds are reversed for ease of pronunciation. This is more common in English than one might think, especially dialectally. For example, the oft-lampooned North American (especially Southern American) pronunciation of nuclear, that is, [nukj̊əlɚ], is actually metathesis - the more "accepted" pronunciation is [nukl̥i(ə)ɹ] (so you have not only metathesis of the [i], which is shortened to [j], with the [l], but also epenthesis of a schwa after the fact). Now in Hebrew, metathesis is an active part of the verb system, with the reflexive hitpael construction metathesising the first root consonant with the last consonant of the prefix hit- if the root consonant is coronal. In the SENĆOŦEN language of the Greater Victoria area in BC, metathesis in and of itself has a grammatical meaning, marking the "actual" aspect (similar to the English present progressive).

Next lesson will feature vowel harmony and a few other long-distance processes, plus a discussion of what's called "phonotactics."

Vowel Harmony

I could talk about this one for a while, but I'll keep it simple. Vowel harmony is a process where all vowels in a word or word root will share one or more specific feature. This is typically binary, and the most common types of it are backness (front vs. back vowels), height (high vs. non-high vowels), roundedness (rounded vs. unrounded vowels), and tongue-root position (advanced vs. retracted; commonly called ATR harmony).

Backness harmony is common in languages of northern Eurasia, especially Uralic and Turkic languages. Standard Estonian is the only Uralic language known to have completely lost its vowel harmony, but the most comprehensive system is found in Finnish, where [a, o, u] and [æ, ø, y] cannot occur in the same word root. [i, e] are considered "neutral vowels" as they do not have a back counterpart; however, when a suffix attaches to a word of all neutral vowels, it will take the front-vowel surface form if it exists. For example, the inessive case suffix -ssa/-ssä will be -ssä on a word with all neutral vowels, but -ssa if there's even a single back vowel present in the root. Now, in the case of a compound word, the very last root will determine how the suffix harmonises. If the last root is front or all neutral vowels, the suffix will come out with the front suffix. If it has back vowels, or neutrals with one back vowel, it will come out with the back suffix. I yanked a list from Wikipedia for examples:

(B) kaura → kauralla
(B) kuori → kuorella (ignore the i → e change; I'll come back to that later)
(N) sieni → sienellä (again, because the root has all neutral vowels)
(F) käyrä → käyrällä
(B) tuote → tuotteessa (something else going on here; will come to it later)
(F) kerä → kerällä
(B) kera → keralla
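
The suffix choice in the list above boils down to one question: is there a back vowel in the root? Here's a minimal sketch of that, assuming standard Finnish spelling (a, o, u for the back vowels). The function name add_suffix is made up, you have to hand it the already-inflected stem yourself (so "siene" rather than "sieni"), and for a compound you would pass only the last root.

```python
import unicodedata

BACK_VOWELS = set("aou")


def add_suffix(stem, back_form, front_form):
    """Attach the back or front variant of a suffix according to backness harmony."""
    # Normalise first so that "ä" typed as "a" + combining dots
    # is not mistaken for a back vowel.
    stem = unicodedata.normalize("NFC", stem)
    if any(ch in BACK_VOWELS for ch in stem.lower()):
        return stem + unicodedata.normalize("NFC", back_form)
    return stem + unicodedata.normalize("NFC", front_form)


for stem in ["kaura", "käyrä", "kerä", "kera", "siene"]:
    print(add_suffix(stem, "lla", "llä"))
```

Neutral vowels never need to be listed anywhere: a stem with only [i, e] simply has no back vowel, so it falls through to the front suffix.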

Roundedness harmony is scattered, but is perhaps best known from the Turkic languages, where it interacts with backness harmony. It is only the high vowels that are directly affected by the roundedness harmony, meaning that [i, y] and [ɨ, u] cannot occur in the same root. As with Finnish and most other vowel harmony languages, compounds will buck the trend and any affixes will harmonise to an adjacent root.

ATR harmony is mainly found in Africa, but occurs elsewhere as well. According to Dr. Rod Casali (my advanced phonology professor who is a relative expert on this system), ATR languages have a strong tendency to have 7-vowel or 9-vowel systems, with [a] being the ATR-neutral vowel. With this said, the typical ATR contrasts would have [ɪ, ʊ, ɔ, ɛ] vs. [i, u, o, e] in a 9-vowel inventory (and from what I remember of Dholuo from when I studied it, this is their vowel inventory). 7-vowel systems will typically drop the mid, [+ATR] pair, leaving two neutral vowels so no [e] or [o]. But some languages do other things. It could be argued that the "pharyngeal harmony" in Mongolian is ATR harmony.

Height harmony is a bit rarer and can be trickier to explain. In feature-based phonology, there are binary features [± high] and [±low], meaning that it is both theoretically possible, and actually the case, that a vowel can be [-high] and [-low] at the same time. (These are the mid vowels, categorised as open-mid/close-mid and sometimes just mid in phonetics.) However, as Dr. Casali pointed out to us in AP class, harmony where the harmonising feature is the [±low] feature, while theoretically possible, does not actually exist. So it would actually be high vowels and non-high vowels (lows and mids) being separated. I've heard Nez Percé (a Penutian language of the Pacific Northwest), Coeur D'Alene (a nearly extinct Salishan language of northern Idaho) and a few African languages, among them Rwanda-Rundi, cited as examples of height harmony.

There are also nasal and rhotic harmony systems - the latter is incredibly rare, only being attested in the recently-extinct Yurok language of northern California.

Syllable structure and phonotactics

You'd be amazed how much diversity there is in syllable structure. Cross-linguistically, there is a whole host of different combinations of permissible syllable structure, ranging from languages that only allow "CV" syllables (that is, one consonant and one vowel) to languages like Nuxalk and certain Tamazight languages that don't even have syllabic sonorants in some words - in Nuxalk, [ɬχʷtʰɬt͡sʰxʷ] "you spat on me" has all voiceless obstruents!

Now any syllable will have at least an onset and a nucleus. The nucleus is generally what helps one perceive a syllable, and in the vast majority of cases, it will be a vowel. (Even those which aren't, which are sonorants in most cases, are understood to "have a vowel," sometimes transcribed with a schwa [ə].) Many languages do allow coda consonants, but many do not. Even those that do sometimes limit how large a coda can be and/or what consonants can go in it. Onsets and codas can be simple (one segment, and in the case of an underlying starting vowel, that will almost always be a glottal stop [ʔ]) or complex (more than one segment - in the case of codas this is surprisingly rare although English is notorious for it).

Most languages allow only simple codas; some of them (Finnish for example) allow two in certain scenarios, and a scant few allow more. English is peculiar in that it allows fairly large codas, sometimes up to four consonants. Consider the word strengths [stɹ̥ɛŋkθs] for example. Already odd for having a large onset, the English phonological rule of putting an epenthetic (inserted) voiceless plosive between a nasal and a voiceless obstruent of a different place of articulation leads to a four-sound coda, very rare even in English. In fast speech for a number of English speakers, though, this is dodged by either assimilating the nasal to the place of the obstruent (giving us [stɹ̥ɛnθs]) or deleting the obstruent after insertion (giving us [stɹ̥ɛŋks]).

An example of a very restrictive language for codas is Japanese. Only nasals (underlyingly /n/ and assimilating to the following onset) or geminate consonants (basically double-length consonants) can end a syllable, and only vowels or nasals can end a word. (Japanese folks get around this to some degree by using voiceless vowels when trying to pronounce foreign words that end in a consonant.) Finnish, on the other hand, does allow syllables to end in just about anything, provided it's not at the end of a word - only vowels or alveolar consonants (obstruents or sonorants) can end a word.

Is there rhyme or reason to this? Absolutely. With the exception of some languages of the Pacific Northwest and the Caucasus Mountains, languages tend to base what they can do with a syllable - their phonotactics - on what's called the "Sonority Sequencing Hierarchy." (Will talk about this in more detail later.) The most sonorous thing in a syllable will be the nucleus, and can only be followed by something with equal or less sonority. The usual ranking, from most to least sonorous, has traditionally been vowel - semivowel approximant ([w], [j], etc.) - other approximant ([l], [ɹ], etc.) - trill, nasal - fricative - affricate/plosive. The behaviour of /s/ in many languages, though, has led me to believe that sibilant fricatives are actually less sonorous than plosives or affricates.

This isn't the only thing that contributes to phonotactics, because there are some perfectly SSH-compatible sequences of sounds, not found in native English words, that English speakers have trouble with when learning other languages. One such sequence, /ʃt/, has actually become part of English phonotactics because of the long-standing influence of German and Yiddish. But take the Bulgarian word for "mayor," kmet. Most English speakers would pronounce this kuh-MET [kʰəmɛt]. It is pronounced as a single syllable in the original language ([km̥ɛt]). Inversely, English's large codas are problematic for speakers of most other languages, who will insert vowels to compensate. English also doesn't like having two plosives in a row in one syllable, and will either insert a vowel between the two (as in trying to pronounce the Russian word for "what," kto, properly) or delete one (usually the first) plosive entirely, as in "ptarmigan." Nasals also contribute to this, as in the word "tmesis" or the Norwegian name "Knut," or in codas, the word "hymn."

For the same reason English codas are hard for speakers of many languages, English speakers have real trouble with the vowelless words of Salishan languages, or even the seven-consonant onsets of Georgian (there's actually been a study or two done on how Georgians cope with this).

Syllable weight and stress assignment

Oftentimes, how a syllable is composed will also determine how stress is assigned in a word. Now sometimes, primary stress patterns much more regularly - take Finnish or Hungarian, for example, where primary stress is word-initial, Polish, where stress is on the penultimate (second-to-last) syllable, or Permian languages (Komi, Udmurt), where primary stress is word-final. But you'll get languages where primary stress patterns according to syllable weight; one such language is Latin.

So what determines syllable weight? In Latin, for example, it's all in the vowels. Latin distinguishes vowel length; in a word with all short vowels, the antepenultimate (third-to-last) syllable is stressed, but in a word with at least one long vowel before the end, it's the penultimate syllable. When measuring syllable weight, the unit used is the mora, marked in linguistic notation by the Greek letter mu (μ).

While it doesn't do this for primary stress, Finnish secondary stress is weight-based. Finnish also distinguishes short and long vowels (and consonants as well, but this is inconsequential to stress), and while the primary stress is always initial, it is every second mora that gets secondary stress. (It is worth noting that Finnish secondary stress can be hard to perceive in fast speech!)

Of course, then there's English, which does have some stress rules, but they're too complex for the scope of this lesson and might as well be a form of "lexical stress," where you just have to memorise it. A language that has legit lexical stress is Russian.


Now we get into those pesky pitch-related things. Let's start with tone. Tone can be defined as "variant pitch that results in change of the core meaning of a word or word root." You'd be surprised how many languages actually have tone. Conservative estimates suggest that half of the world's seven thousand plus languages have tone, and I've heard estimates as high as 70%. When most people think of tone, they immediately think of the Chinese languages, and perhaps rightly so - Mandarin is the world's most spoken mother tongue, and it is a tonal language. But you'd be surprised to know that a large number of tone languages are actually spoken in Africa; also, a number of Central American/Mexican indigenous languages have tone, and a smaller number of indigenous languages in Canada and the USA, most notably in the Na-Dene language family (Navajo, Apachean languages, Gwich'in, many Dene languages of British Columbia).

How does tone work, exactly, though? Like stress and intonation (which I'll bring up later), it is suprasegmental, meaning that it affects more than a single sound. This may seem odd to people who are only familiar with a Chinese language, but hear me out. Tone doesn't always attach to sounds, per se. It moves parallel to word parts, or morphemes (remember this term, because I'll use it a LOT when I get to morphology). Tones tend to work a little differently in so-called "Asian-type" tone systems than they do in so-called "African-type" systems, too. I'll expand on this as I go along.

Tone starts with registers, and that's all some languages actually have. The simplest register systems have two tones, high and low. Languages with three registers are common as well, with a mid tone or neutral tone involved; as far as registers go, Mandarin falls into this category. There have been languages analysed as having five registers, though - extra-high, high, mid, low, and extra low!

There are also contours. In African tone languages, these are often analysed merely as sequences of register tones, but this is a little trickier to do with Asian tone languages because word parts are generally monosyllabic, whereas in African tone languages they very often aren't.

Regardless, my tone professor gets irked when people ask the question, "how many tones are in your language?" The question probably stems from a knowledge of Chinese prescriptive grammar, where this is the terminology used. Said professor prefers the term "tone melody" in this context, even for an Asian language. There are a number of factors that play into African and Western Hemisphere tone languages that are the reason for this.

If you were to analyse tone as parallel to Mandarin, for example, you'd think the Mende language of West Africa had an absurd number of tones, because there are high, low, rising, falling, and rise-fall tones, which can change absolute pitch as one progresses in a word. However, there's a method to how these tones are distributed. There is an underlying melody that goes with a morpheme, and it is assigned to tone-bearing units (which are typically vowels, but as I found when I studied Chumburung, they can be coda consonants as well) in a fixed pattern - in Mende, it is one tone per TBU from beginning to end until the last TBU, then all remaining tones in the melody are assigned to the last TBU. If there is only one tone in the melody, it is assigned to all syllables; in the case of a two-tone melody in a trisyllabic morpheme, the first tone is only assigned to the first syllable and the second to the other two. I forget what each of the words means, but I'll use the actual attested sequences /mba/ and /ɲaha/ as well as a hypothetical /kiguɾu/ to demonstrate the five melodies that exist in the language:

L - [mbà], [ɲàhà], [kìgùɾù]
H - [mbá], [ɲáhá], [kígúɾú]
LH - [mbǎ], [ɲàhá], [kìgúɾú]
HL - [mbâ], [ɲáhà], [kígùɾù]
LHL - [mba᷈], [ɲàhâ], [kìgúɾù]
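
The assignment pattern is mechanical enough to write out as a few lines of Python. This is just a sketch of the procedure as I described it above, with the melody given as a string of H's and L's and the word reduced to a count of tone-bearing units; the function name assign_melody is made up.

```python
def assign_melody(melody, n_tbus):
    """Distribute a tone melody over n_tbus tone-bearing units, left to right.

    Returns one string per TBU; a string with more than one letter is a contour.
    """
    tones = list(melody)
    if len(tones) >= n_tbus:
        # One tone per TBU; any leftover tones all dock onto the last TBU.
        return tones[:n_tbus - 1] + ["".join(tones[n_tbus - 1:])]
    # Fewer tones than TBUs: the last tone spreads over the remaining TBUs.
    return tones + [tones[-1]] * (n_tbus - len(tones))


for melody in ["L", "H", "LH", "HL", "LHL"]:
    print(melody, [assign_melody(melody, n) for n in (1, 2, 3)])
```

So "LHL" on one TBU comes out as a single rise-fall contour, on two TBUs as low plus falling, and on three as plain low, high, low, which matches the last row of the table.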

But then there's this pesky little thing called downstep. Tones, within a melody or otherwise, influence one another. In most cases, it's the lows that have the influence, pushing everything slightly lower every time they surface. It was a couple of tonologists, one of whom was my tone professor, Dr. Keith Snider, who had to come up with a new theory of tone analysis just to explain it. If you're interested, you can look up "register tier theory." But within an utterance, low tones will pull the overall pitch of any following lows and highs down a notch. (Obviously this has its limits from a purely mechanical point of view!) This happens in African and Western Hemisphere tone languages, and on rare occasion even Asian tone languages.

There are two types of downstep as well. Automatic downstep is when the tones initiating the downstep are there to be heard. Non-automatic downstep, on the other hand, means there's something more going on. Occasionally there will be a morpheme whose segments will delete for whatever reason, leaving just the tone behind. But since it can't attach to the TBUs of another morpheme, it just "floats" there. (Yes, the technical term used for such a tone is floating tone.) But its effects are felt. Floating low tones cause downstep, even if the tone itself is not actually pronounced. So if one has a high followed by a "mid" in these languages, the best practice is to start looking for anything that could've left a floating tone! Now here's a real kicker for you - sometimes, a floating tone can be a morpheme in and of itself. The majority of the time, this tone is low. Non-automatic downstep happens primarily in West Africa.
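
Here's a toy model of both kinds of downstep, just to show the bookkeeping. To be clear, this is NOT register tier theory or any real analysis: the pitch numbers are arbitrary, the notation "(L)" for a floating low is my own, and the assumption that every low drops the register by exactly one notch is a simplification.

```python
def pitch_track(tones, high_offset=2, step=1):
    """Return a relative pitch value for each pronounced tone.

    tones is a list of "H", "L", or "(L)" (a floating low: it lowers the
    register like any other low but is not pronounced itself).
    """
    register = 0
    track = []
    for tone in tones:
        if tone == "H":
            track.append(register + high_offset)
        elif tone == "L":
            track.append(register)
            register -= step   # everything after this low sits a notch lower
        elif tone == "(L)":
            register -= step   # same lowering, but nothing is pronounced
        else:
            raise ValueError(f"unknown tone: {tone!r}")
    return track


print(pitch_track(["H", "L", "H", "L", "H"]))  # automatic downstep
print(pitch_track(["H", "(L)", "H"]))          # non-automatic downstep
```

In the second call the two highs come out at different levels even though no low is heard between them, which is the "high followed by a mid" situation that should send you looking for a floating tone.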

Much rarer than this is upstep. For whatever reason, low tones trigger upstep. I'll have to talk with Keith to get my facts straight on this, but the only language I recall this happening in from memory is Krache, which is a close relative to Keith's language of study, Chumburung. (He was a Bible translator in Ghana.)

Moving back to Asian tone languages for a moment, very often, low tone is accompanied by creaky voice.

Languages with tone include: the Chinese languages, the entire Kra-Dai family (Thai, Lao, and those related), Vietnamese and some of its relatives (not Khmer, though), some Tibeto-Burman languages (including Tibetan, Burmese, and Dzongkha), Punjabi and a couple of minority languages fairly closely related (the only Indo-European languages with full-on tone, btw), the entire Hmong-Mien family, the entire Oto-Manguean family (now spoken exclusively in Mexico), Nilotic languages like Luo and Dinka, a large number of Niger-Congo languages (including major ones like Bambara, Igbo, Yoruba, Lingala, Zulu, Sesotho, Setswana, but not Swahili, Fula, or Wolof), about half of the Na-Dene languages (see above), Iroquoian languages (primarily spoken in Ontario and Quebec, but also in New York), Chadic and Omotic languages (most notably Hausa, a major language of Niger and northern Nigeria), and many languages of Papua New Guinea. This is by no means an exhaustive list.

Next lesson will include intonation, which unlike tone, doesn't change the core meaning per se, just adds nuance. Not only that, but intonation spreads out over an entire utterance.


Intonation is always a tricky one to explain because there is so much variation. But intonation is the use of pitch at the level of the utterance to add nuance to the overall meaning of the idea. In some languages, intonation works in more or less fixed patterns, while other languages can use it with more versatility. In Finnish, which has freer word order than English due to its extensive case system (I'll touch on this in morphology and syntax), shifting word order or adding discourse particles is often used where we in English would use changes in intonation.

Here's an example of English intonation changing the overall meaning of an utterance:

"You went to the ball and didn't even think to invite me?" is going to be our sample sentence. ;) Capitals will indicate rising pitch and/or volume to denote a focus on that word via intonation.

YOU went to the ball and didn't even think to invite me? -> Speaker implying that subject did something that others didn't.
You WENT to the ball and didn't even think to invite me? -> Speaker implying that subject wasn't going to go, but did anyway.
You went to the BALL and didn't even think to invite me? -> Speaker implying that subject was thinking of possibly going someplace else, or that the ball was very important to him/her (speaker, that is).
You went to the ball and DIDN'T even think to invite me? -> Speaker really offended that subject didn't think to invite him/her. Possibly implying that somebody else invited him/her.
You went to the ball and didn't EVEN think to invite me? -> Not much different than above, but perhaps a bit angrier, and no implication that someone else did.
You went to the ball and didn't even THINK to invite me? -> Speaker implying that subject might have been thinking about something else, such as excluding speaker, or something completely different.
You went to the ball and didn't even think to invite ME? -> Speaker implying that subject invited others to his/her exclusion.

Now yes-no questions in English have a distinct intonation (even though this actually varies between North American Englishes and other Englishes). Questions with an interrogative word (this is considered a pronoun in the case of "what" or "who(m)" and an adverb in the case of "where," "when," "why," and "how") in their basest form use a statement intonation. Some non-standard questions like "who did what where to who(m)?" may have a question intonation.

There is a link between intonation and certain grammatical functions. Many Indo-European languages have a "question intonation." (Not sure if they all do - I haven't researched this that much.) But some other languages instead mark questions with a particle or an affix. Finnish is an example of this, where the first word of an interrogative utterance takes the -ko or -kö marker depending on vowel harmony. Salishan languages also do this.

Some languages, like French, have distinguishable list intonations as well. When you are listing things, there is a rising intonation for each item except the last one, which has a falling intonation.

Back to English. Intonation is a very versatile thing in this language. Sometimes stretching a word out in length (which wouldn't work in languages that have actual contrastive vowel length, such as Finnish - wouldn't've worked in Old English for this exact reason) adds emphasis. Also using pitch height or vowel length to indicate surprise is common, especially with the word "what," where a flat high pitch indicates sheer shock, but a high-falling pitch is usually substituted for the full sentences "what's your problem?" or "what the crap are you looking at?" Sometimes even epiglottalisation (aka growling) of a word, as well as a rise in pitch (it can be very slight) can indicate anger or disgust with a particular subject. That's part of intonation.


Now the Sonority Sequencing Principle (what I called the "Sonority Sequencing Hierarchy" back under syllable structure) is a bit complicated, and some parts of it are actually controversial. It does have to do somewhat with syllable structure, though. Languages whose syllables are limited to CV or CVC at most don't have this issue, but every other language does. The principle was devised to answer the question, "Is there a logical order in which different types of segments are ordered in a single syllable?"

Now scroll back to where I talked about syllable structure if you forgot what onset, coda, and nucleus were. (Or use the "find" function - since on the new forum these are all a single post!) Although the onset and the coda aren't treated the same with regard to syllable weight (in most languages, anyway - I had a theory at one point that Estonian perhaps did treat them the same, due to an advanced phonology project I was doing), what they generally do share is the pattern where the farther a segment is from the nucleus (which is typically a vowel), the less sonorous it is.

The initial sequence I was taught was plosive/affricate < fricative < nasal < lateral/trill < approximant < vowel. Upon further investigation I think this is actually wrong, or at least insufficient. While plosive-fricative onset beginnings are found and fricative-plosive coda endings are quite common, I find that in onset position, fricative-plosive sequences are more common. This is especially true of sibilants. In English, we seem to have issues with things that start with /ts/ (affricate or not) like the place name "Tsawwassen" or the original pronunciation of "tsunami," and so we often delete the opening /t/ (I don't, but I have enough linguistics training that my native speaker intuition has been somewhat compromised :P), but /ʃt/ (as in "stein" or "schtick"), even when it is only found in more recent borrowings, comes as easy as 1-2-3, and is quite a common sequence in those languages that do allow complex onsets (more than one consonant).

In short, I'd actually argue that fricatives, from a standpoint more consistent with what I see cross-linguistically, are actually less sonorous than plosives, even though from a purely phonetic point of view they are more so because they involve continuous airflow either through the mouth or (in the case of nasal consonants) through the nose. Just ftr, ejectives, implosives, and clicks are all treated as plosives in this, because they pattern the same way.
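If it helps to see the taught sequence as something mechanical, here's a rough Python sketch of a sonority check. The ranks come straight from the sequence quoted above; the tiny segment inventory and the function name obeys_ssp are made up for illustration, and the "strictly rising, then strictly falling" test is a simplifying assumption (real analyses often allow plateaus).

```python
# Sonority ranks follow the sequence quoted above:
# plosive/affricate < fricative < nasal < lateral/trill < approximant < vowel.
SONORITY = {
    "plosive": 0, "affricate": 0, "fricative": 1, "nasal": 2,
    "lateral": 3, "trill": 3, "approximant": 4, "vowel": 5,
}

# Hypothetical, deliberately tiny segment inventory (illustration only).
SEGMENT_CLASS = {
    "p": "plosive", "t": "plosive", "k": "plosive",
    "s": "fricative", "ʃ": "fricative", "f": "fricative",
    "m": "nasal", "n": "nasal",
    "l": "lateral", "r": "trill",
    "j": "approximant", "w": "approximant",
    "a": "vowel", "e": "vowel", "i": "vowel", "o": "vowel", "u": "vowel",
}

def obeys_ssp(syllable):
    """Return True if sonority strictly rises to the peak, then strictly falls."""
    ranks = [SONORITY[SEGMENT_CLASS[seg]] for seg in syllable]
    peak = ranks.index(max(ranks))
    rising = all(a < b for a, b in zip(ranks[:peak], ranks[1:peak + 1]))
    falling = all(a > b for a, b in zip(ranks[peak:], ranks[peak + 1:]))
    return rising and falling

print(obeys_ssp("plan"))   # True: 0 < 3 < 5 > 2
print(obeys_ssp("tsu"))    # True: the taught ranking is happy with /ts/
print(obeys_ssp("ʃtik"))   # False: fricative before plosive breaks the rise
```

Notice that the taught ranking accepts the /ts/ onset and rejects /ʃt/, which is exactly the opposite of what English speakers seem to find easy. That mismatch is the reason for doubting the ranking.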

And here's where things get controversial. Certain languages have words that seem to either have very unsonorous things as nuclei, like fricatives in Berber languages, or even seem to have no nuclei at all, such as in some Salishan languages. How then do we determine what is actually a syllable? The only plausible answer I've heard is to rely on L1 speaker intuition. On another note, the people that came up with this probably studied a huge corpus of words. One never really sees a segment from every single sonority class in the same word, because of phonotactic limits on the number of consonants in an onset or coda. Maybe in Georgian onsets, but those buck the whole thing to a degree. When it comes to "linguistic universals" (and I'll talk about this when I FINALLY get to typology), most so-called universals are more tendencies than absolute rules. You'll always find an exception.

Morphosyntax is actually a blend of two things - morphology, the study of wordforms, and syntax, the study of sentence formation. Why they are more and more commonly grouped together is that the relation between the two has become clearer and clearer as more languages become documented and studied.

Here's the thing - languages that make greater use of morphology have far fewer syntactic restrictions, while languages that make very little use of morphology have very rigid syntactic rules. There's actually a scale to determine this, but first, one needs to learn about the building blocks of words - morphemes. The scale entirely depends on morphemes, which are the smallest units that can carry meaning of any sort. And morphemes aren't restricted to just things that can stand alone as words. They include such things as affixes, clitics, "tonemes", "chronemes", or even morphophonological processes such as reduplication, stem vowel changes, or weak suppletion. On occasion, you'll even get strong suppletion, which sees two words with the same core meaning but different inflections look completely different.

But back to the scale for a second. There are rough boundaries as to where each of the categories lie, but one could see two of these four as extremes of the other two. Languages such as modern English that have a lower morpheme-per-word ratio on the whole and rely more on syntax are classified as analytic languages. The extreme of this more base category is the isolating language, which has no inflectional morphemes used to denote grammatical relations. The example I often see given for a purely isolating language is Vietnamese. Mandarin is still isolating in the sense that it doesn't use inflectional morphemes, but it is starting to trend away from being isolating because of how much it (and indeed other Chinese languages) is using compounding.

But then you have synthetic languages, which many Indo-European languages still are to one degree or another, which rely on inflectional morphology to some extent to denote grammatical relations. Some have more rigid syntax (like French for example) while others have freer word order (Finnish, Baltic languages, to a lesser extent East and West Slavic languages). But there's actually another split within the ranks here - while morphemes-per-word is the main component of the scale, synthetic languages (and even analytic languages that still use inflectional morphemes) can be described as being either fusional or agglutinative, depending on the meanings encoded in a single morpheme. English is very fusional, for example, and Indo-European languages tend to have a high degree of fusionality. It's a spectrum, though, and not a hard bipolar system. Finnish, for example, is "somewhere in the middle," having some agglutinative morphemes, but other fusional morphemes (the best example of the latter in Finnish is the number-person and tense-aspect-mood affixes - more about these things later). Some more extreme examples of agglutinative languages include indigenous languages such as Na-Dene, Algic, and Salishan languages, and also languages of the North Caucasus.

Some Important Terms

Before I get into the discussion about different kinds of affixes, it's important to make a distinction between root, stem, and base. I honestly had to look this one up again, because there is so much seeming overlap between the three (especially in English) that sometimes it's a fuzzy distinction to make.

A root is the absolute core of a word, and sometimes it has the same surface form as a base. But a base can also include derivational affixes, which I will talk about in a moment. A stem is simply the form of a word that can be inflected for grammatical purposes. Now a stem has to have a lexical meaning, ie, it has to have meaning standing alone. Not all roots, or even all bases, are stems for this exact reason. With this in mind, let's move ahead.

Inflection vs. Derivation

When dealing with morphological processes, there are categories within categories. The first one I'm going to talk about is function-based. Inflections add grammatical/discursive information while adding very little if anything to the actual core meaning of the word. They also don't typically change the part of speech of a word. (The most controversial "inflectional" category is the gerund, which while not changing the core meaning, does actually change the function of a verb so that it will function nominally.)

In English, we still have a bit of inflectional morphology, as opposed to languages like the Chinese languages and Vietnamese, which don't have any at all, or Japanese, which uses separate words to mark inflection. English is odd in that it actually marks third-person singular on verbs with -(e)s, a form which in most languages that use inflection is either completely unmarked, or the least marked form. Aside from our irregular forms, most of which are from older words, -(e)s is also the noun plural, but in spite of having the same shape, it is considered a different suffix from the third-person-singular of verbs.

Another such "homophonous set" of suffixes is the present participle -ing and the gerund -ing. While both are verb forms, the present participle serves a markedly different function. Gerunds can be modified by the sorts of things you would expect to modify a noun, such as the plural -s (never taking the alternate form -es because it ends in /ŋ/ - more on that when I go into morphophonology), articles, and adjectives. Participles, while you can use them as adjectives, serve verbal purposes, and when used adjectivally are basically the equivalent of a super-reduced relative clause - "rolling stones" and "stones that/which roll" mean basically the same thing.

We also have our "regular past tense" -ed. Our irregular past tenses are from older verbs and include an alveolar ending and/or a stem change.

By linguistic definitions, -'s is not actually a suffix but a clitic. I'll talk more about those later.

There's also the less common but still somewhat productive -en past participial suffix. A lot of past-parts these days are -ed instead.

Finally, there are the two degree suffixes for adjectives: -er for comparative and -est for superlative.

Many languages have more complicated inflectional systems, with affixes for case (grammatical/adverbial function), gender/noun class, and animacy in nouns and adjectives, and tense (timing), aspect (state of action), mood (reality vs. intent/wish), negation, and number/person, in verbs.

Derivation, on the other hand, makes a change to the core meaning of the word, and typically - but not always - changes the part of speech. While English inflectional affixes are exclusively suffixes, derivational affixes can be prefixes, although obviously there are derivational suffixes as well, and even one derivational circumfix (think prefix + suffix together) as in the case of en- -en, as with "enlighten." The primary adverbial suffix -ly is also derivational.

In English, there are derivational affixes that can change:

Verbs - into other verbs (ex-, in-, re-, un-, mis-), into nouns (controversially the gerund -ing, -(at)ion, -ance/-ence/-ancy/-ency, -or/-er, -ist, -ism, -ment, -age), and into adjectives (-ant/-ent do count even though they didn't in the original Latin, as they were present participle markers, -able/-ible).

Nouns - into other nouns (anti-, -phobe, -phobia, neo-, proto-, mis- (rare), dys-, -ite, -ist, -ian, -age), verbs (-ise, -ate, de-, un-, dis-, -ify), and adjectives (-ful, -less, -y, -like/-ly (not the same as the adverbial), -al, -ous, -ic/-ac/-iac, -ish).

Adjectives - into other adjectives (-ish), into verbs (en-/em-, -en, en- -en/em- -en, -ise, -ify), and nouns (-ness, -ity)

Morphological Processes

Now how do languages build words? Obviously they have roots and bases and stems and everything like that, but there are also processes that occur to bring about the final word forms. In English we have prefixes and suffixes, ONE circumfix, and some stem changes. These are only a few of the possibilities. The main morphological processes are referred to as the "Big Ten" by Payne (2006).

1-4. Affixation
1. Prefixation
2. Suffixation
3. Infixation
4. Circumfixation

5. Reduplication
6. Transfixation
7. Stem change
8. Autosegmental variation
9. Compounding
10. Deletion/Subtractive morphology.

Now affixation comes easy for us English speakers. We do it all the time even in our analytical language, as the examples in my previous post show. Although we do have our fair share of prefixes (and our one circumfix going "EN-YAY ME-EN!" in the background :P), English is largely a suffix-heavy language. You can draw a comparison with a language like Finnish, whose affixes are almost completely suffixes, German, which has a much richer overall mix, or, to the other extreme, a Na-Dene language like Tsilhqot'in, which is very prefix-heavy. There are actually typological tendencies that each sort of language goes along with. But I digress.

So we know what a prefix is. You take your root/base/stem do, you tack on un-, and you get the new base/stem undo. Not exactly rocket science. In languages like the aforementioned Tsilhqot'in and relatives such as Navajo and Dakelh, there will be whole sequences of prefixes attached to the root, and there's a logic to them that is unique to the language or family, aside from the typological universal (I think) of derivational suffixes/prefixes/circumfixes always being closer to the root, and inflectional ones being farthest away.

How about a language other than English for suffixes, though? Indo-European languages, even the most analytical of the lot, have a number of them. Take Russian, for example. Its case system (which I'll talk more about after I've touched on syntax) has a whole host of suffixes to go with it. Let's do something easy to start. The word for "book" is kniga, which is a feminine noun, complete with its set of case endings. Here are the singular forms:

nominative (subject): knig-a
accusative (direct object): knig-u
genitive (possessor): knig-i
dative (indirect object): knig-e
prepositional: knig-e
instrumental: knig-oj

It gets more complicated after this, so for the purposes of this thread, I'm done with this example.
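For the programmers in the room, the singular paradigm above is basically a lookup table. Here's a minimal Python sketch; the names KNIGA_SINGULAR_ENDINGS and decline_singular are made up for this illustration, and the table only covers the endings listed for kniga, not Russian declension in general.

```python
# Singular endings exactly as listed above for kniga (stem knig-).
KNIGA_SINGULAR_ENDINGS = {
    "nominative": "a",
    "accusative": "u",
    "genitive": "i",
    "dative": "e",
    "prepositional": "e",
    "instrumental": "oj",
}

def decline_singular(stem, case):
    """Attach the singular case ending to the stem (suffixation)."""
    return stem + KNIGA_SINGULAR_ENDINGS[case]

for case in KNIGA_SINGULAR_ENDINGS:
    print(case, decline_singular("knig", case))
```

The point is just that the stem stays put and the suffix carries the grammatical information.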

Infixes are common in Austronesian languages (especially in the Malay Archipelago). They're a bit odd in that they go inside the stem. In Ilocano, for example, the infix -in- marks the perfective aspect - the completeness of an action - which is typically used to express the past tense. So take the root patay for example. It actually means "death" when used alone, but one can add verbal affixes to it, and pinatayko means "I killed (something)." There are a few others in Ilocano.
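A quick Python sketch of what "going inside the stem" looks like, using the patay example. The rule here (put -in- right before the first vowel) is a simplifying assumption that happens to fit this form, not a full account of Ilocano, and infix_in is a made-up helper name.

```python
VOWELS = set("aeiou")

def infix_in(root):
    """Insert -in- immediately before the first vowel of the root (toy rule)."""
    for i, ch in enumerate(root):
        if ch in VOWELS:
            return root[:i] + "in" + root[i:]
    return root  # no vowel found, so leave the root unchanged

# The final -ko is copied from the example form above; treating it as a
# first-person marker is an assumption made for this illustration.
print(infix_in("patay"))         # pinatay
print(infix_in("patay") + "ko")  # pinatayko
```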

I have to make a note here. You know the phenomenon where you can insert one word (usually a vulgarity or invective) into another? Such as fan-flipping-tastic? This "expletive infixation" isn't true infixation, because it's inserting an entire standalone word, and where it goes is determined by stress, whereas in typical infixation stress really has no bearing. It's closer to what's called tmesis, where parts of a semantic word are morphophonologically separated by another semantic word. English periphrasis and German separable prefix verbs are other examples of tmesis. Okay, away from the rabbit trail. :P

Circumfixes are harder to pin down, because sometimes a combination of a separate prefix and suffix can be misconstrued as a circumfix, and sometimes elements of the "circumfix" can change for non-phonological reasons, leading some to analyse it as a separate prefix and suffix. An example of the latter is in German, where ge- -t and ge- -en are the two most common forms of the past participial affix - since ge- never changes, some linguists do analyse it as being separate affixes. The same can be true of the superlative degree "circumfix" of certain Slavic languages and Hungarian, since the suffix part of it, used alone, forms the comparative, and a prefix on top of that forms the superlative.

Less controversial circumfixes can be found in Berber languages, where they are used to create feminine forms, or in languages like some Arabic dialects, Guarani, and Chukchi, as negation.

Reduplication is a process where all or part of a word is duplicated to render grammatical meaning or some other nuance. In terms of stricter grammatical reduplication, this is most common in Austronesian languages and in indigenous languages of North America, and it also happens in Greek and Somali. Partial reduplication is more common, since you'd expect most languages that use morphology to this extent to have polysyllabic words! In Lushootseed for example, pastəd "white person" can be pluralised as paspastəd. In Ilocano, reduplication frequently shows up in the verbal system - taking a verb base and using the same kind of partial reduplication forms the imperfective aspect. (Most roots in Ilocano are better classified as nouns when used alone.)

Total reduplication has a variety of uses as well. In Malay languages within Austronesian, it is used for pluralisation. Orang "person, man" is pluralised as orang-orang "people" in Malay and Indonesian, for example. In Halkomelem, there is a "dispositional" aspect, which functions more like an English adjective but is technically a verb (a number of languages actually form their descriptive words this way rather than having a separate class of adjectives), which is formed by total reduplication of a root and means "prone/inclined to do X." Wikipedia provided a nice example of this: [qʷél] "to speak" becomes [qʷélqʷel] "talkative."
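Both kinds of reduplication are easy to mimic in a few lines of Python. This is only a sketch: reduplicate_partial copies everything up to and including the segment right after the first vowel, which is an assumed toy rule that reproduces the Lushootseed form above, and both function names are made up.

```python
VOWELS = set("aeiouə")

def reduplicate_partial(word):
    """Prefix a copy of the word up to and including the segment after its first vowel."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            end = min(i + 2, len(word))
            return word[:end] + word
    return word  # no vowel found, so nothing to copy

def reduplicate_total(word, sep="-"):
    """Repeat the whole word, joined by a separator."""
    return word + sep + word

print(reduplicate_partial("pastəd"))  # paspastəd
print(reduplicate_total("orang"))     # orang-orang
```

The Halkomelem example isn't covered here, since the copy there also loses its stress mark.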

Besides having grammatical functions, reduplication can add lexical nuance. We do this all the time in English. First, there's just straight total reduplication. For example, there are contexts where "home-home" is used to distinguish one's family homestead from one's current place of residence. It can be used to distinguish a word's primary sense from a secondary sense ("funny-funny" vs. "'I-dunno-about-this-bloke'-funny"), literal from figurative, and so forth. We also have what's called "ablaut reduplication," which is a mix of total reduplication with a stem vowel change, and generally indicates repetitiveness of an action (often called "iterative" in linguistic terminology). Think things like "jibber-jabber," "wibbly-wobbly," "pitter-patter," "chit-chat," "clink-clank," and so forth. There are other forms of partial reduplication used lexically even in English.

Transfixation is probably the hardest thing to get used to when studying Hebrew or Arabic, or indeed any Semitic language. The basis of verbs and nouns is found in the "triconsonantal root," which is represented by the actual letters in the Hebrew and Arabic writing systems. Vowels, which are either written as points, or left out entirely in Hebrew, are grammatical in function and can render either verbs or nouns depending on the root in question. Take the ubiquitous k-t-v root in Hebrew (note: the "v" can change to "b" depending on position in the final word). katav means "(he) wrote," and from the same root you get ketuvim, which is the name of the Jewish Wisdom writings such as Psalms, Proverbs, Job, Ecclesiastes, and the Song of Solomon (in the Christian Old Testament and Jewish Tanakh).
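One way to picture transfixation is as a template with slots for the root consonants. Here's a small Python sketch of that idea; the template strings "CaCaC" and "CeCuCim" are just read off the two forms above, apply_pattern is a made-up name, and the v/b alternation mentioned in the note is ignored.

```python
def apply_pattern(root, pattern):
    """Fill each 'C' slot in the pattern with the next root consonant, in order."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

# Assumes the pattern has exactly as many 'C' slots as the root has consonants.
print(apply_pattern("ktv", "CaCaC"))    # katav
print(apply_pattern("ktv", "CeCuCim"))  # ketuvim
```

The root supplies the lexical core and the vowel pattern supplies the grammar, which is the reverse of what English speakers are used to.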

Stem changes are interesting, but also very frustrating for initial language learners. West Germanic languages such as English, German, and Dutch are infamous for them, because the past tenses of these languages often rely on such things. Oftentimes these are historical remnants of old verbs whose forms were at one point explained by a productive system, but now they are considered "irregular verbs" because their forms are not morphophonologically predictable given the current sound-scheme of the language. Think, for example, goose vs. geese, or more thoroughly, catch vs. caught. (There's a legitimate explanation for how this worked in Old English, but it's a bit lengthy. I'll spare you for now. :whistle: Also, there are so many historical morphophonological processes that cause the stem change that I'll spare you that as well.)

Obviously this isn't unique to Indo-European languages, or even to vowels. Finnish "consonant gradation" falls under this category as well, with the root lahte- surfacing in its "dictionary form" as lahti "bay", but if the case suffix closes or geminates the second syllable (geminates are analysed as being one segment spread across two syllables oftentimes), the /t/ becomes a [d], in some of their fifteen cases. While the surface form of the partitive surfaces as lahtea (because consonant gradation doesn't occur), you get case forms such as lahden (genitive), lahdessa (inessive), lahdeksi (translative), and lahdella (adessive). This is true of all Finnish voiceless stops, long or short. But because of Finnish's general lack of voiced stops (even as allophones only [d] actually exists) and the historical Uralic existence of voiced fricatives, which have all but disappeared from Finnic languages, the patterning seems irregular.

/p/ -> [v~ʋ] (historically it was actually [w])
/t/ -> [d]
/k/ -> Ø (it disappears completely, and was [ɣ] historically)
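To make the gradation pattern concrete, here's a rough Python sketch that weakens the last stop of a stem whenever the suffix closes the syllable. It's a toy under loud assumptions: closes_syllable just checks whether the suffix is a lone consonant or starts with two consonants, only the three short-stop mappings above are handled (no geminates, no cluster-specific outcomes), and the function names are made up.

```python
# Weak-grade mappings as listed above (using [v] for /p/).
WEAK_GRADE = {"p": "v", "t": "d", "k": ""}
VOWELS = set("aeiouyäö")

def closes_syllable(suffix):
    """Toy test: a lone consonant or an initial consonant cluster closes the syllable."""
    if not suffix or suffix[0] in VOWELS:
        return False
    return len(suffix) == 1 or suffix[1] not in VOWELS

def attach(stem, suffix):
    """Attach a suffix, weakening the last stop of the stem if the syllable closes."""
    if closes_syllable(suffix):
        for i in range(len(stem) - 1, -1, -1):
            if stem[i] in WEAK_GRADE:
                stem = stem[:i] + WEAK_GRADE[stem[i]] + stem[i + 1:]
                break
    return stem + suffix

for suffix in ["a", "n", "ssa", "ksi", "lla"]:
    print(attach("lahte", suffix))
# lahtea, lahden, lahdessa, lahdeksi, lahdella
```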

Autosegmental variation involves a change outside of segments to give a change in meaning, dealing with stress, tone, nasalization, and length, among other things. For stress, English has an entire class of bisyllabic noun-verb pairs where the verb (which in a lot of cases was the original form) is stressed finally, while the noun is stressed initially. Adding confusion to the mix for learners, stress is not represented in the spelling system like it is (at times) in Spanish. Take the written word record. /ɹɪ'kʰoɹd/ is the verb, to put data into a physical or digital medium, while /'ɹɛkʰɚd/ is the noun, which is the resulting combination of data and the medium in question. Of course, the change in stress also leads to vowel reduction in English, adding further to the confusion (but that's pure phonology by that point). English is wack. :lol:

In the case of tone, many African languages (Niger-Congo, and also some of those historically classified as Nilo-Saharan and Khoisan) actually mark a change in tense with a "toneme," that is, a tone melody that is otherwise unattached to a tone-bearing unit that has an inflectional usage. I remember observing this when I read up on the Dholuo language (which is Nilotic) in my Advanced Field Methods class, because that is the language that we were working on. The example Payne (2006) gives of nasalisation is from a language of Gujarat in India, where nasality of the first vowel determines whether a pronoun is singular or plural. Length can be used in the same sense as tone in some places in Finnish, and is sometimes referred to in this usage as a "chroneme." The third-person singular form of most verbs is formed simply by lengthening the last vowel of the stem. The illative case also has a chroneme in the singular, where you double the last vowel of a noun stem and add /n/.

Compounding is something a large number of languages do. You take two roots and smoosh 'em together to make a new word stem. Given the nature of compounding, it's strictly derivational in nature. More synthetic languages such as Finnish (which has the world's longest palindrome when it's in the nominative case, three roots long) or German can go pretty crazy with this, but you can have polysynthetic languages that put even those two to shame! It's worth noting, as well, that not all compounds are written as such, either as a contiguous word or with hyphens. English actually has to write them as a single word at times, to state in writing what stress does in speech. Take black bird vs. blackbird. The former has the phrasal head emphasised, so bird. The latter has black stressed. This differentiates between a bird that is black and a particular species of bird.

Even some isolating languages compound. It's about the only morphological process that Mandarin Chinese actually uses (one could make an argument for derivational reduplication in some very specific contexts), doing everything else through syntax and context. But what I was told in Contrastive Linguistics in my undergrad - which focussed on comparing and contrasting Mandarin and English - is that Mandarin loves its two-root compounds.

Subtractive morphology, which is sometimes just called "deletion," is by far the rarest of these, attested in some Nilo-Saharan languages and the Muskogean languages of the USA (originally from The South). This is where one deletes a segment or two that is actually in the root to create meaning. Payne (2006) uses the example of Murle, a Surmic language of South Sudan, where there is no affix for either singular or plural, and the plural is formed by deleting the last consonant, regardless of what it is. In Alabama, a Muskogean language, one can drop the last two segments of the penultimate syllable to indicate that the verb has a plurality of undergoers (transitive object or intransitive subject) in what's called the pluractional aspect. (Not a term I'm going to use very often. )

Besides the Big Ten, there's also the ever-annoying (to prescriptive grammarians) zero-conversion, which means you don't do a perkeleen thing to the word, you just use it as a different part of speech than its prototypical usage! :lol: An infamous example of this is the word derp :herpderp: . Initially an interjection, I have since seen it used (and indeed used it) as a verb, a noun, an adjective (although derpy is probably more common), and even an adverb (although I've never used it in this last fashion)!

Free vs. bound morphemes

So we've looked at a couple dimensions of morphemes - where they fit in a word (morphological processes) and what their function is (derivational vs. inflectional). There's also the issue of whether or not a morpheme can stand on its own. Remember that scale from earlier, as to how one distinguishes isolating, analytic, synthetic, and polysynthetic languages? It's a continuum, based on the average of morphemes per word. Directly correlating with this (and quite possibly caused by it) is the ratio of free morphemes to bound morphemes. The closer one gets to the extremes of isolating languages (Vietnamese, for example), the more the ratio skews towards free morphemes. The opposite is also true. Polysynthetic languages have a very high proportion of bound morphemes, which cannot stand on their own as words. For example, in a typical Na-Dene language such as Tsilhqot'in or Navajo, verb roots are never free. They will always have tense-aspect-mood markers, and typically will also have pronominal markers (person-number) for at least the subject and sometimes even the object.

English, on the other hand, is a good middle ground. It has a good number of bound morphemes, and also a good number of free morphemes. Obviously, suffixes, prefixes, and our ONE productive circumfix are bound morphemes. But even if you discount the number of Latinate roots and affixes that became lexicalised together (ie, the parts have no meaning by themselves in English - ones that are more productive are sometimes called "cranberry morphemes" if Wikipedia is to be believed :P), there are still a few roots in English that could be considered bound. Probably the ones we use most are -cracy and -archy, which Wikipedia very annoyingly lists as suffixes. No, they're not suffixes. They're bound noun roots, coming from the Greek words for "power" and "rule" respectively. "Democracy" could be interpreted as two bound roots, with -dem(o)- ("people-related") also turning up in demographic and epidemic, while -cracy "power/rule of X" can turn up in a number of places. A suffix usually adds to a core meaning. Bound roots have their own. So semantically, democracy is actually a compound - "rule of the people." Now can affixes attach to bound roots? You bet they can. The words acracy and anarchy are both examples of this. (They can mean the same thing, but acracy as a word is usually only used in political philosophy.) -archy is much more frequently prefixed, though. Number prefixes give us a whole host of words, ranging from monarchy (originally meaning rule of one, with mon(o)- being the prefix) to something like decarchy, meaning a nation with ten rulers. Hyperarchy ("excessive government," literally "overgovernment") is another such prefixed use of a bound root.

Now the thing about bound morphemes is that they invariably attach to specific parts of speech. Sure, we have forms that phonologically overlap and even attach to the same part of speech, like the gerundive and present participial forms in English. But what do you do with an "affix" that can supposedly attach to anything and still mean the same thing? From a syntactic and semantic point of view, it isn't bound. "But aren't all affixes bound?" Yes they are. What you're dealing with is a clitic; these are "syntactically free but phonologically bound." They typically play a functional role at the level of the phrase, sentence, or larger unit of discourse, and as such they can attach (phonologically speaking) to any word regardless of part of speech. Probably the most common clitics in English are the contracted forms of the verbs "to be" or "to have," especially when used as auxiliary verbs (more on those when I discuss parts of speech), and can modify/be modified by entire phrases. The English possessive -'s is also a clitic that can modify entire noun phrases, which can include relative clauses post-modifying the head noun.


Not all phonological processes happen "just because." There is a certain amount of interaction between phonology and morphology in languages that utilise morphology to greater extents, which is referred to as morphophonology or morphophonemics.

English does have a bit of this. But the morphophonology in English can actually be split into two groups - one where the native speaker can actually perceive the difference, and one where (barring linguistic training) they can't. In certain phonological theories that I won't dive into because it is way beyond the scope of a beginning linguistics class, there are actually specific named categories for both kinds of morphophonology.

Let me explain. Consider the /k/ -> [s] phenomenon found in English words of Greek/Latin origin, or the /t/ -> [ʃ] palatalisation process in words of Latinate origin. (Transcriptions represent Western American and Canadian Englishes; this does happen in other Englishes but the surface forms are slightly different)

plastic /plæstɪk/ + -ity /ɪti/ -> plasticity [pl̥ˠæstɪsɪɾi:]
reprobate /ɹɛpɹobe:t/ + -ion /jɑn/ -> reprobation [ɹɛpɹ̥əbɛjʃən]

These are examples of words where a speaker can actually discern a change in the sound when the morphophonological process is applied. Now compare that to the forms of the English simple-past, where the untrained English speaker actually cannot tell the difference in surface form.

talk /tɑ:k/ + -ed /d/ -> talked [tʰɑ:kt]
slog /slɑ:g/ + -ed /d/ -> slogged [slɑ:gd]

There's a small caveat in examples with alveolars. If the final consonant of the base is an alveolar plosive (/t/ or /d/), which the suffix is as well, this process doesn't occur, instead being resolved by epenthesis - remember, that means addition - of a schwa between the base and the suffix, and in most North American Englishes, this triggers the flapping process, resulting in waded and waited sounding exactly the same. This doesn't happen in a number of Englishes, including Received Pronunciation (aka Standard British English), meaning that one can tell waded and waited apart in those dialects.

The exact same process occurs with the English plural -/z/ and third-person singular -/z/, where it becomes voiceless when affixed to a base ending in a voiceless consonant, and epenthesis occurs after sibilants (/s/, /z/, /ʃ/, and /ʒ/, including the affricates /tʃ/ and /dʒ/) for distinction purposes.
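Since these two processes are completely rule-governed, they're easy to sketch in Python. This assumes you hand the function the final sound of the base (not its spelling), writes the epenthetic vowel as a schwa as described above, and uses a small hand-picked set of voiceless consonants. The affricates /tʃ/ and /dʒ/ are counted with the sibilants here because they end in a sibilant release. The names past_suffix and plural_suffix are made up.

```python
# Hand-picked sets for illustration; every other sound is treated as voiced.
VOICELESS = {"p", "t", "k", "f", "θ", "s", "ʃ", "tʃ", "h"}
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}

def past_suffix(final_sound):
    """Surface form of underlying -/d/ after a base ending in final_sound."""
    if final_sound in {"t", "d"}:
        return "əd"   # epenthesis between two alveolar plosives
    return "t" if final_sound in VOICELESS else "d"

def plural_suffix(final_sound):
    """Surface form of underlying -/z/ after a base ending in final_sound."""
    if final_sound in SIBILANTS:
        return "əz"   # epenthesis after a sibilant
    return "s" if final_sound in VOICELESS else "z"

print(past_suffix("k"), past_suffix("g"), past_suffix("t"))        # t d əd
print(plural_suffix("k"), plural_suffix("g"), plural_suffix("s"))  # s z əz
```

So talk ends in /k/ and gets [t], slog ends in /g/ and gets [d], and wait and wade both get the schwa.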

Is this unique to English? Not by a flipping long shot. Finnish is loaded with this stuff. Not only does one have the "weak suppletion" found in most Finnish noun and verb roots, referred to in Finnish grammars and Finnicist linguistics works as "gradation" (and I touched on this in my last post), but the final form of any suffix or clitic ALWAYS harmonises to the last root of a base, and some Finnicists will use shorthand to denote that the front-back feature is left unspecified in the underlying form of the suffix. Others will argue that, since stems with all neutral vowels always take a front-vowel variation of a harmonising suffix, the suffixes are underlyingly front and harmonise when the root has one or more back vowels. Consider:

lahti -> lahdessa (inessive), lahdesta (elative), lahtea (partitive)
tyhmä ("stupid") -> tyhmässä, tyhmästä, tyhmää
risti ("cross, sharp-sign (in music)") -> ristissä, rististä, ristiä (/i/, represented by the letter I in Finnish, is a neutral vowel.)
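Here's a minimal Python sketch of the harmony rule behind those forms. It assumes a capital "A" in the suffix stands for the a/ä vowel whose front-back value is left unspecified (one way of writing the shorthand mentioned above), takes the already-graded stem as input, and ignores compounds, where only the last root would count. The name harmonise is made up.

```python
BACK_VOWELS = set("aou")

def harmonise(stem, suffix):
    """Realise 'A' as 'a' if the stem has a back vowel, otherwise as 'ä'."""
    vowel = "a" if any(ch in BACK_VOWELS for ch in stem) else "ä"
    return stem + suffix.replace("A", vowel)

print(harmonise("lahde", "ssA"))  # lahdessa
print(harmonise("tyhmä", "stA"))  # tyhmästä
print(harmonise("risti", "A"))    # ristiä (neutral vowels only, so front)
```

Written this way, the default is front and back vowels in the stem override it, which lines up with the second analysis mentioned above.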

You want a case of wacky morphophonology? Check out the verb tehdä on Wiktionary. But here's the clincher - it is considered a regular verb, because the processes involved in its different forms are very regular in Finnish - the root in its underlying form is actually /tek/- but it is subject to consonant gradation and the /k/ -> [h] / _[t,d] rule.

Of course, numerous other languages (most that use morphology) also have morphophonological processes, that is, processes that are triggered by the formation of words. And morphophonology need not be restricted to segments, either - in Chumburung (and many other languages of Africa), for example, tonal changes in a word can occur because of downstep or even upstep, typically that of the automatic variety, because an affix with an underlying tone changes the surface tone melody of the entire word.

Parts of Speech

Every language eventually classifies its words. Now the boundaries between classifications get a little fuzzy at times, but languages at least tend to distinguish words that denote the action of a statement (verbs) and words that denote actors of a statement, or undergoers, or whatever (nouns), even if the roots of the word fall into one category or the other, or even in between. These are called parts of speech, and there are two types of these: closed-class parts of speech, which aren't generally open to regular new additions (although additions do happen over protracted periods of time), and open-class parts of speech, to which new words are being added all the time.

Open-class parts of speech: Nouns, verbs, adjectives, adverbs

Closed-class parts of speech: Particles, determiners, pronouns, adpositions, conjunctions, auxiliaries

Unsure: Interjections

Let's start with the obvious open-class ones.

Nouns are prototypically objects (person, place, thing), or abstract concepts, and in languages that use morphology they take typical nominal inflections such as pluralisation and case. As I said of all open classes, there are always new nouns being added - perhaps daily, and I'd argue that the class of "noun" is the most-added-to of any part of speech in language.

English more or less sticks to this prototypical description of a noun, and derivational suffixes can turn them into other parts of speech. But in other languages this isn't necessarily the case. In Philippine languages, for example, the root of a word could be "used alone" as a noun but could turn into a verb with the addition of the Philippine verbal affixes (which could be prefix, suffix, infix, circumfix, or reduplicant). Dholuo (a Nilotic language primarily spoken in Kenya, along the eastern shores of Lake Victoria) isn't quite this extreme, but the roots of all Luo adjectives are actually nouns! A variety of prefixes exist (such as ja- and ra-) that turn certain nouns into adjectives. It's like an inverse of the English suffix -ness in a sense.

The verb not only denotes the main action of the sentence but functions as the syntactic core, outside of null-copular constructions which occur in some languages. There are occasional verbs which don't denote action, but rather existence or equivalency (most notably be) or possession (most notably have). Where a verb takes a so-called "subject complement" or an adjective rather than a nominal object, it is called a copula, and English verbs like be or become are considered thus. Languages such as Hebrew and Russian lack a verb equivalent of "to be" in their present tense, as do other languages, and these are where your null-copular constructions happen. A strange feature of most Uralic languages is the lack of a verb "to have," although Hungarian does have one; this has also spilled over into East Slavic languages. Semitic languages, Celtic languages, and Burmese also do this.

As with nouns, verbs are always being added to, as we find new ways to describe actions, or coin words to describe actions we'd not seen quite in this manner before. They take verbal inflections, for things like tense, aspect, mood, person (first, second, third, and sometimes even others!), and number (singular, dual, plural, sometimes others), as well as affixes denoting participles and infinitive. In more synthetic or polysynthetic languages, voice (active and either passive or antipassive are practically universal - others include "middle," causative, applicative, reflexive, reciprocal, etc.), object person-number, negation, discourse particles, and fun things like that are added to this.

Adjectives modify nouns by adding attributes. Now unlike nouns and verbs, adjectives are by no means universal. In Navajo for example, adjectival functions are carried by verbs. In Dholuo, on the other hand, what English would get by adding -ness to an adjective root is actually itself the root, and a prefix is attached to make the adjective! Besides taking the same inflections nouns can in many languages - not English, since we don't have what's called agreement and therefore a noun doesn't have to take an adjective with the same inflections, such as pluralisation, case, gender, or what-the-crap-ever - adjectives have the degree inflection: the comparative degree along with the conjunction than (in English we sometimes replace the comparative degree with the prepositive adverb more) means that something has more of a certain quality than the thing it is being compared to.

Adjectives can also be used in a predicative fashion in the aforementioned copular constructions ("the dog is black"), or can even completely replace a noun in a few restricted instances, usually having to do with an inherent personality quality (and in these cases, typically accompanied by a definite article). ("The meek shall inherit the earth.")

There are different categories of adjectives, naturally. They can pertain to subjective opinion, colour, size, shape, age, origin, source material, etc.

One could be tricked into thinking adverbs modify only verbs, just going by the name, but there are certain sorts of adverbs that modify adjectives or other adverbs rather than verbs; those adverbs that do modify verbs are sometimes said to modify entire clauses. Of course, there are some adverbs that are easily spotted in English by their -ly ending; similar basic adverbial endings occur in most European languages (-lich in German, -ment in French, -o in numerous Slavic languages, -sti in Finnish, etc.). In more polysynthetic languages these are often incorporated into the verb. In languages with more extensive case systems, nouns are made to function as adverbs, or the equivalents of an adpositional phrase (more on that in Syntax), in that they modify verbs, verb phrases, or entire clauses.

Here's where the open classes end... or is it? One could argue that interjections are also an open class, although this class is much less frequently added to. These are basically spontaneous expressions of feeling, whether that be emotion or uncertainty or whatever. Probably the most commonly discerned interjections are swear words, but we need to remember that greetings ("hi!") or other such terse emotional responses ("whoops," "ouch," "derp," :herpderp: "duh," "wow," "aaugh," "grr," etc.) are interjections as well. One odd thing about interjections is that they can be multiple words as well. Probably my personal favourite to use is "WHAT THE CRAP??" :wtf:

When we actually get into the definitive closed-class section, these are areas where additions are seldom if ever made, and when they are they are practically never borrowed. They also tend to have the same purpose as inflectional affixes, or they have a meaning that is strictly grammatical. Let's start with adpositions, which occur in all but perhaps the most polysynthetic of languages. Even borderline-polysynthetic languages such as Finnish have these. Depending on the typology of the language, these can be prepositions (English has primarily prepositions, as do many other Indo-European languages such as French, German, Russian, and so forth), postpositions (as in Finnish; German has a couple, as does English, believe it or not), or, rarely, even circumpositions, in which a prepositive word and a postpositive word combine for meaning. I actually encountered one such word when I was doing a paper on Kurmanji Kurdish for a class back in 2010. ;)

Adpositions tend to have the same function as non-grammatical cases: they usually denote a phrase that adds information to a sentence that could theoretically be left out in context, such as location, source, goal, instrument, recipient, beneficiary, and so forth. In one case in English, a preposition shares the form of a particle - "to" can indicate a phrase of destination location, goal, or recipient, but it is also used as the infinitive particle with verbs. Other examples of prepositions in English are "of" (possession or association), "from" (source or benefactor), "by" (this would take a long time to properly capture all the semantic ranges), "at," "beside," "over," "under(neath)," "for" (beneficiary - this has the same form as a somewhat dated but still used conjunction), "with" (accompaniment or instrument) and "behind." This list is by no means exhaustive. Postpositions, on the other hand, are limited to "ago" (used in expressions of past time), "aside" (used to exclude something from a list of things), and "notwithstanding" (which has more or less the same meaning as the preposition "despite").

Auxiliaries are often classed as a type of verb in prescriptive grammars, because to some degree, they behave like them (one feature of both that always seems to hold up consistently is that they both can be negated), and some, such as "will," share form with an actual verb (at least in the present tense). Some actual verbs are used with an auxiliary function as well: "have," for example, is used for perfect aspect, "like," "want," and "need" indicate a desire to do something, and "go" is used to form the future tense. But pure auxiliaries include words like "can," "shall," "must," "may," and "will" (the one that is an auxiliary). They cannot EVER stand alone outside of a context. You can't just say "I can" without having some sort of main verb to go along with it within the context of a conversation. The past tenses of some of these have taken on a life of their own as irrealis versions of their realis presents. The difference between the two is the difference between something that is absolutely assured and something that is not. ("Must" lost its historical past.)

Will (future tense) -> "Would" (conditional)
Can (ability) -> "Could" (conditional)
May (in the sense of permission) -> "Might" (expresses possibility, but with an element of doubt)
Shall (former future tense, now used either as a command or as an element of definiteness that the action is going to be completed) -> "Should" (Not as strong as "must," but implies heavy recommendation and prescription of an action).

In the case of "could," it is still used as a past tense in certain contexts.

In much the same way that auxiliaries behave like verbs, so pronouns behave like nouns. But there's a big difference in why they are used. Auxiliaries add to the main verb. Pronouns exist to replace nouns in a sentence. They don't inflect for number like regular nouns in their standard usage; rather, they have completely separate forms, based on person (who the subject is), number, and in some languages, gender and/or animacy, and case.

Person is usually denoted as first (where the speaker is the subject, or one within the subject group), second (where the listener is the subject or the subject group), or third (where the subject is neither speaker nor listener), although in some languages (*cough*FINNISH*cough*) there is the concept of "zero person" (where the semantic subject is left out entirely, much like a passive). The numbers that are practically universal in pronouns are singular and plural, with some languages having dual (two people), trial (three people), and/or "paucal" (a few people). Dual existed in Sanskrit, Ancient Greek, and Old English, is reconstructed for Proto-Indo-European, and is still attested in Slovene; it also exists in many Austronesian and (according to Wikipedia) Semitic languages (I know Hebrew has some dual forms). Trial is largely limited to the Oceanic branch of Nuclear Malayo-Polynesian within Austronesian, while paucal occurs sporadically, being attested in such languages as Fijian, Kurmanji Kurdish, Arabic, and apparently some Cushitic languages in pronouns.

English has the following personal pronouns: I (1sg), you (2nd person - English is VERY weird in not having separate singular and plural forms for the second person in the standard, although some dialects have tried to "correct" this - Southern US English "y'all" for example), he (3sg male), she (3sg female), it (3sg inanimate), we (1pl), and they (3pl, sometimes used as a 3sg nonstandardly). Finnish has different pronouns for third-person plural depending on animate or inanimate objects. French has a grammatical gender difference in its third-plurals.

In polysynthetic languages, pronouns as separate words are eschewed in favour of incorporation into the verb, and in less synthetic languages that mark person and number on the verb, they are often dropped anyway. (Finnish, German, Hebrew, Italian, etc.)

There also exist possessive (in English, "mine, yours, his, hers, ours, theirs," and in Southern US English, "y'all's"), demonstrative pronouns (which have the exact same form as the demonstrative determiners which I'll talk about below), interrogative ("who" and "what" - the other "wh-question" words are considered adverbs since the answer requires an adpositional/adverbial answer), reflexive (the "-self" words in English), reciprocal ("each other" is the main one in English), and indefinite ("anyone," "someone," "none," "nobody," etc.) pronouns. In Finnish, case-inflected pronouns outside the grammatical cases (so nominative, accusative, genitive, and partitive) function adverbially, and it is so for any such pronouns in other languages.

Determiners are words that are part of noun phrases, and they are simply there for identification purposes and have no lexical descriptive meaning. There are several types of these.

Articles, for example, indicate definiteness or lack thereof. In English, we have "the," the definite article, and "a(n)," the indefinite article. Similar articles appear in West Germanic languages and sometimes in North Germanic languages (although in Norwegian and Danish definiteness is sometimes marked by a suffix), Romance languages, Albanian, Greek, Semitic languages, and many Austronesian languages. But many other languages don't use them at all: Slavic languages, Uralic languages, and Turkic languages are just a few examples.

Demonstratives, or as linguists sometimes call them, deictics (from the Greek word meaning "to point"), are words that add a certain degree of specificity to what one is talking about. In English, for example, the demonstratives are "this," "that," "these," and "those," depending on number and distance (which in English is distal/far away and proximal/close). Several languages make a three-way distinction, either making a distinction between closeness to speaker and closeness to hearer, or, as with Ilocano, making a distinction between something within reach and something not within reach but still relatively close. Ilocano also throws a bit of a curveball into the mix, as well, with separate deictics for things out of sight and things that no longer exist. :wtf:

Quantifying determiners specify the size of a group without actually using a number. Anything resembling "many," "(a) few," "each," "every," and even "no" falls into this category.

Numbers as determiners specify exact amounts (cardinal numbers) or ranking (ordinal numbers). Cardinal numbers may also be used pronominally.

Rabbit trail: There are other uses for numbers that fall more into the range of adverbs, such as multiplicatives (once, twice, thrice) and distributives (which exist in English but are seldom used; they're more productive in languages like Romanian and Georgian), or adjectives, such as multipliers (single, double, triple) which aren't usually regarded as numbers in English by prescriptive grammarians.

There are also interrogative determiners. English actually only has one of these in its standard, "which," but "what" is often used in this manner as well. In Russian, kakoj and its corresponding gender-case forms are interrogative determiners.

Finally, possessive determiners - sometimes called "possessive adjectives" - indicate... well... possession. English has these; one needs to be careful to distinguish them from possessive pronouns, which stand alone. They're "my," "your," "his," "her," "its" (NOT "IT'S" WITH AN APOSTROPHE - THIS DRIVES ME BATCRAP WHEN PEOPLE CONFLATE OR MIX UP THE TWO :headbrick: ), "our," and "their."

Conjunctions link two clauses or phrases together. It's as simple as that. There are three types of conjunctions: coordinating conjunctions, which link two equal clauses or phrases together; correlative conjunctions, which generally link two phrases together to relate the two closely somehow; and subordinating conjunctions, which are used exclusively with clauses and make one clause dependent on the other. Also, subordinating conjunctions generally have more meaning behind them than the simple logical linkage of the coordinating conjunction.

"And" is the most common coordinating conjunction used in English, presenting a simple linkage of two non-contrasting ideas. "Or" presents an exclusive alternative, while the mashup "and/or" provides for the possibility of both the inclusion and the exclusion of both options. "But" and "yet" are primarily used with two clauses and show a contradiction or contrast of some kind. "So" demonstrates consequence.

For correlative conjunctions in English, there's "either... or" which is used to present two options, contrasting or otherwise, but one can use their negative counterpart "neither... nor" to rule both out. "Both... and" would be a true opposite to "neither... nor" because it implies that both options are included. "Whether... or" is a weird one in that it's both correlative and subordinating, in the sense that you can't have a clause that has just such a contrast in it - there has to be a main clause. Let me demonstrate:

"Whether Jarkko continues this thread or not," by itself, is ungrammatical. "People will study linguistics whether Jarkko continues this thread or not" is perfectly grammatical. The point of this particular correlative conjunction set is either to introduce options that could just as easily be left out, or to express ignorance or apathy about the two options (in this latter case, the main verb is usually either to do with cognition, like know, understand, hear, etc., or to do with feelings, like care, give a rip/hoot/crap/whatever).

A commonly-used subordinating conjunction (probably the most common one, actually) is "if," which denotes an unsatisfied condition. (The "satisfied" counterpart is "since.") "When" can be used to denote an unsatisfied condition that is absolutely known to be satisfied in the future (lending itself to the statement "it's not a matter of if, but when"), but can also be used to simply denote a time-bound condition - in these contexts it would make perfect sense if replaced with "if." "Therefore" is another popular one - it's like the inside-out of another subord, "because." "Because" indicates a cause - heck, "cause" is in the etymology of the buggering word! "Therefore," on the other hand, indicates a consequence of what has just been said. "Moreover" and "furthermore" draw attention to extra information, while "nevertheless" indicates a contrast.

Anything that doesn't fit into any of the above categories is lumped under the catch-all term "particle." These often have little to no lexical meaning, simply existing for grammatical purposes. While the answer to a yes-no question may be considered an interjection by some, its lack of spontaneity instead puts such an answer in the "particle" category, although "yes" in other contexts is actually considered an interjection. A more prototypical particle in English is the "to" which marks a verbal infinitive. More isolating languages are full of grammatical particles that function in the way one would expect a case-ending to.

I never really liked doing syntax that much, even though I realised (and still realise) its importance, especially when one is dealing with isolating languages or languages with more rigid word order. The reason is it involves too much drawing. :P Because with syntax comes the syntax tree, which while providing an elegant visual in which to present the overarching structure of a sentence, is a pain in the neck to draw!

But syntax is interesting in some ways. This is where words are built up into phrases centred around a syntactic head - usually a verb, noun, or adposition, sometimes even an adjective - and those phrases are further built into clauses, which contain an action with a subject/topic and often an object/comment, and if the clause is by itself (that is, without a conjunction), then it is a sentence.

Phrase Structure Rules

Every language has at least some phrase structure rules, but the less a language relies on morphology, the greater the number and potential rigidity of these rules will be. In English, for example, there are a whole host of variants on the basic rules. Before I get into these, here are some basic abbreviations I will be using in these formulas:

N - Noun
Pro - Pronoun
PN - Proper Noun
V - Verb (copulas tend to fall under this, even in null-copula languages like Russian or Hebrew)
VPart - Verb particle (this is something I would use in the case of a phrasal verb, like "set up")
Aux - Auxiliary
Adj - Adjective
Adv - Adverb
Adp - Adposition
Det - Determiner
Conj - Coordinating conjunction
Sub - Subordinating conjunction
Cor - Correlative conjunction

Particles are seldom marked in English syntax trees - "yes" and "no" are, but the "infinitival to" is often included as part of the verb.

NP - Noun Phrase
VP - Verb Phrase
AdpP - Adpositional Phrase
AdjP - Adjective Phrase
C - Clause
S - Sentence

Now the way of making a formulaic expression is as follows. What in prose would be written as "a clause consists of a noun phrase followed by a verb phrase" is written as C = NP VP.

To have something optional included in a rule, one puts the constituent in parentheses. Take, for example, the English noun phrase and verb phrase rule:

NP = (Det) (AdjP) N (AdpP)
VP = V (NP|AdjP) (AdpP)

The pipe is added in the second one to denote an either-or situation; while only technically true of copulas like "be" and "become" in English, it is true that if a complement is adjectival, it cannot also contain a noun phrase.

Just an aside here, I know there is notation to denote that one can pile up adjectives in an NP or adpositional phrases in a VP, but I don't remember quite what that is.

One thing about an AdpP, though, is that it sets the stage for recurring noun and adpositional phrases, theoretically ad infinitum, because noun phrases can contain adpositional phrases and adpositional phrases ALWAYS contain noun phrases. If a word that looks like a preposition is at the end of a clause, it isn't a preposition at all even though it may have the same form. It's either going to be a verb particle (this is typically the case) or one of English's three postpositions (ago, aside, or notwithstanding; what makes the last one tricky is that it is sometimes used as a preposition as well! English is wack, what can I say?) OR it indicates "the gap" left by an interrogative or a relative pronoun.
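
For anyone who'd rather see the rules run than draw trees, here's a rough sketch that encodes the phrase structure rules as data and lists every part-of-speech sequence a phrase can expand to. The encoding is made up for illustration (each slot is a symbol plus an "optional" flag, and a tuple of symbols stands in for the pipe), and two rules are assumptions not given above: AdjP = Adj and AdpP = Adp NP, the latter in prepositional order. The depth limit is there because NP and AdpP contain each other and would otherwise recurse forever.

```python
from itertools import product

# label -> list of (symbol or tuple of alternatives, optional?)
RULES = {
    "C": [("NP", False), ("VP", False)],
    "NP": [("Det", True), ("AdjP", True), ("N", False), ("AdpP", True)],
    "VP": [("V", False), (("NP", "AdjP"), True), ("AdpP", True)],
    "AdjP": [("Adj", False)],                 # assumption: bare adjective
    "AdpP": [("Adp", False), ("NP", False)],  # assumption: preposition + NP
}


def expand(symbol, depth):
    """All part-of-speech sequences `symbol` can expand to within `depth`
    levels of phrase nesting, returned as a set of tuples."""
    if symbol not in RULES:
        return {(symbol,)}  # a plain part of speech expands to itself
    if depth == 0:
        return set()  # no nesting budget left for another phrase
    per_slot = []
    for sym, optional in RULES[symbol]:
        alternatives = sym if isinstance(sym, tuple) else (sym,)
        options = set()
        for alt in alternatives:
            options |= expand(alt, depth - 1)
        if optional:
            options.add(())  # an optional slot may be left empty
        if not options:
            return set()  # a required slot cannot be filled at this depth
        per_slot.append(options)
    return {sum(combo, ()) for combo in product(*per_slot)}


for sequence in sorted(expand("NP", 2)):
    print(" ".join(sequence))
```

Raising the depth by one lets an NP hold an AdpP that itself holds a small NP, which is the recursion described above; "the dog is black" (Det N V Adj) shows up among the clauses at depth 3.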

One more term you need to know is argument. This is any NP that has a direct connection to the VP within a given clause.

Syntactic Roles

This is actually pretty simple. There are only really four major base components of any clause. The subject is the syntactic focus of the sentence and a stand-alone argument within the clause. Then there's the main verb - the action of the clause. If the verb is intransitive, these two are all one needs, because it indicates an action that the subject is either doing or experiencing. That said, a transitive verb requires both an argument performing the action, and one that the action is performed on, which is called the object and is a sub-constituent of the main VP of a clause. A few English verbs ("put" and "give" for example) are what are called ditransitive, meaning that there is a required AdpP or second NP within the VP. This is called the indirect object. Ditransitives aside, though, any other NPs, more or less all within AdpPs, are called adjuncts, and are considered optional.

Which brings us to syntactic typology. Adjuncts are always omitted from basic word order typology because their distribution is so varied that it would make a mess of things. Subject, verb, and object are abbreviated S, V, and O respectively, and are used to denote the six logical possibilities of word order, all of which do exist in language. But there's a surprising lopsidedness in how each word order is represented! In more synthetic languages where the word order is freer, the "most neutral" word order is taken as where the language stands.

- SVO (Germanic languages inc. English, almost all modern Romance languages, Albanian, Slavic languages, Modern Greek, Modern Hebrew, some Uralic languages like Finnic languages and Hungarian, the Chinese languages, Khmer, Thai, Lao, Vietnamese) is the most-spoken word order and second-most attested in numbers of languages. The majority of creole languages are also SVO.

- SOV (Turkic and Mongolic languages, some Uralic languages like the Mordvinic, Maric, and Permic languages, Na-Dene languages, Cushitic languages, Siouan languages, Indo-Iranian languages, etc.) is the most-attested word-order in terms of number of languages. It is also attested in Sanskrit, Ancient Greek, and Latin, and is proposed for Proto-Indo-European.

Subject-initial languages account for anywhere from 75% to 87% of the world's languages, depending on which papers you reference.

- VSO languages (Celtic languages, a number of Afro-Asiatic languages including Arabic, Berber languages, and ancient Egyptian, Tagalog and many other Austronesian languages, Salishan languages, and most of the languages of southern and central Mexico and Central America) are the third-most spoken and third-most attested grouping, but have far fewer attestations than the subject-initial crowd.

- VOS languages are primarily Austronesian as far as neutral word-order is concerned. Austronesianists have speculated that Proto-Austronesian was VOS; Fijian and Malagasy are the best-known examples. Mayan language Tzotzil also has this order, although most Mayan languages are VSO-neutral.

- OVS is a very rare word order, and most of the attestations of it are from South America - and endangered at that. The Guarijo language of Northern Mexico, a Uto-Aztecan language, also employs this word order. Klingonese was constructed as an OVS language.

- OSV (aka Yoda-speak) is supposedly even rarer, occurring primarily in the Amazon Basin.

Languages with the subject before the object account for the overwhelming majority of the world's languages, with numbers of up to 95%.
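
The "six logical possibilities" and the two groupings used above (subject-initial, and subject before object) can be checked mechanically. This is only a sketch of the combinatorics with made-up variable names; it says nothing about how many languages fall into each group.

```python
from itertools import permutations

# Every ordering of Subject, Verb, and Object.
orders = ["".join(p) for p in permutations("SVO")]

# Orders that begin with the subject.
subject_initial = [o for o in orders if o[0] == "S"]

# Orders where the subject comes somewhere before the object.
subject_before_object = [o for o in orders if o.index("S") < o.index("O")]

print(orders)
print(subject_initial)
print(subject_before_object)
```

So the subject-before-object group is just the two subject-initial orders plus VSO, which is why its share of the world's languages is the larger of the two figures.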

Morphosyntactic alignment

Besides word order, another typological feature that determines syntax is morphosyntactic alignment, that is, the grammatical relationship between arguments - which varies based on distinctions made in marking of nouns in the language (which can be done either morphologically or with a separate word). More specifically, it differentiates based on how the subject of an intransitive, the subject of a transitive, and the object of a transitive are marked.

The most common one, by an overwhelming margin, is the nominative-accusative system, which marks all subject arguments alike (typically leaving said arguments unmarked) and the object differently. Almost every language of Europe, the Chinese, Uralic, Mongolic, Turkic, Tungusic, and Semitic language families, and a whole host of others on every continent, are nominative-accusative. A subtype of this is called "marked nominative," where the subjects are marked and the object is not. This is quite a bit rarer, occurring (according to Wikipedia) primarily in the Cushitic languages of the Horn of Africa (so Somali and its relatives) and the Yuman languages of the US Southwest and northwestern Mexico.

Another somewhat common alignment is ergative-absolutive, where the subject of an intransitive and the object of a transitive are marked the same (primarily unmarked), and the subject of a transitive is marked differently. Hard ergativity is fairly limited in its distribution; it does exist in Basque, Chukchi, a number of languages of the Americas (esp. North America), and the Northwest and Northeast Caucasian languages. More common is what is called split-ergativity, where a language flip-flops between alignments based on a morphosyntactic feature; many of these are due to tense or aspect (Kurmanji Kurdish, for example, is nom-acc in the present tense and erg-abs in the past) but other languages, such as Old Sumerian and Georgian, are more complex in the split-ergativity.

Much rarer are alignments that either mark all three differently (tripartite), which is attested system-wide in Penutian language Nez Percé and a few languages of Australia, and in the pronominal systems of Dyirbal (Australia) and Kalaw Lagaw Ya (Torres Strait Islands), OR don't mark any at all (direct).
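
Since these alignments boil down to which of the three arguments get marked alike, the comparison can be sketched in a few lines. The function name, parameter names, and marker labels below are all made up for illustration; the inputs are just whatever marking each argument carries.

```python
def alignment(intr_subj, tr_subj, tr_obj):
    """Classify an alignment from the marking on the intransitive subject,
    the transitive subject, and the transitive object."""
    if intr_subj == tr_subj == tr_obj:
        return "direct"                 # nothing is distinguished
    if intr_subj == tr_subj:
        return "nominative-accusative"  # both subjects alike, object differs
    if intr_subj == tr_obj:
        return "ergative-absolutive"    # transitive subject stands apart
    if tr_subj == tr_obj:
        return "unclassified"           # a pattern not covered above
    return "tripartite"                 # all three marked differently


print(alignment("he", "he", "him"))              # English pronouns
print(alignment("unmarked", "ERG", "unmarked"))  # abstract ergative pattern
print(alignment("INTR", "ERG", "ACC"))           # abstract tripartite pattern
```

This only compares markers for sameness, so it can't tell ordinary nominative-accusative from the subtype where the subject is the marked one, and a split-ergative language would need to be run once per tense or aspect.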

Then there's the Austronesian alignment, which primarily occurs in the Philippines but also in certain other Austronesian languages - not all (many are nominative-accusative). The word order can remain the same but the syntactic role changes based on the marking of the verb. A simpler example from Ilocano:

Siak pinatayko ti uleg - I killed the snake

Siak pinataynak ti uleg - The snake killed me

Finally, split intransitivity has been attested, where an intransitive subject can be marked two different ways based on a semantic role.

Fun fact: If English were ergative, OSV, and null-copular, GrieferLord's old motto "Something me doing?" would be standard fare. :lol:

Semantic Roles

While there are syntactic roles within an utterance, there are also semantic roles (also called thematic relations). When dealing with different morphosyntactic voices, such as the active (which is typically the default), the passive, the middle, the applicative, and so on, the syntactic roles change, but the semantic roles do not. Heck, some languages mark morphology based on semantic roles. (One example I know of from personal experience is the Lushootseed language of the Puget Sound area in Washington state.)

The most common ones to start with are agent and patient, most often corresponding to the subject and object of an active-voice sentence. An agent is willfully responsible for an action while a "patient" is directly affected by the action. But here's where we have to split hairs. Someone or something being a "patient" implies something about them has changed. This is a little out of the ordinary for those who would lump under "patient" those things which also include theme (undergoer of an action that doesn't change a state) or experiencer (undergoer of sensory input). The term more increasingly used as a catch-all amongst linguists for these things is undergoer, especially amongst those who subscribe to functionalist theories of grammar. Furthermore, "agent" is a subclass of "actor," which also includes force (a usually inanimate actor that does not do something willfully) and stimulus (a more or less always inanimate actor that provokes a sensory response). Any such semantic roles can apply to the basic arguments of a clause.

Indirect objects marked without prepositions are most commonly recipients. English "dative shift" seen in such expressions as "I gave Haya a flower" (in which Haya is the recipient/indirect object) is one example. In German and Russian this would be marked with the "dative" case. In a more typical, "non-shifted" word order, this dative is marked with the preposition "to" in English. But be careful. "Dative shift" can also mark a beneficiary - one who benefits from an action without actually receiving something. There is some overlap between the two semantically, although in English a beneficiary in a non-shifted construction is generally marked with the preposition "for."

In adjuncts, there are other roles. Many of them are easy to spot. Instrument is marked by case-ending in many languages, including Russian, and to a lesser extent, Finnish and Hungarian; in English it is often marked with the preposition "with," less frequently occurring with "by" (usually in modes of transportation). Location is marked by a whole host of English prepositions but is discernible by context. In Finnish, the "static locative cases," that is, the inessive and adessive cases, do a lot of the morphosyntactic legwork. In Russian, it is generally the prepositional case with a static-locative preposition. Direction is the active counterpart, denoting, of course, the direction that something or someone is going, and Finnish actually has no fewer than four cases that deal with this (illative, elative, allative, and ablative). A subtype of this is goal, when referring to the absolute end of one's movement.

Source is the role for an origin of something. In English this is very often denoted using "from," occasionally "of" (although this usually indicates possession or even mere association). Time is a semantic role, and typically expressed through adverbs, or when using more specific times, with the preposition "at" plus the number or time-noun (such as "noon," "midnight," "twilight," "dusk," etc.) Manner is pretty self-explanatory. A lot of the time, manner can be expressed through adverbs that 99% of the time end in -ly (a notable exception is well), but can also have a pseudo-instrumental construction of sorts using the prepositions "with" (or even "without") as well, as in the phrases "with care," "with thanksgiving," "without hesitation," etc.

The last two may overlap a bit, but should be distinguished. Purpose answers the question "why?" while cause tends to answer the question "how?" and/or "because of what?"

Valence and transitivity

These two go hand-in-hand, but aren't always the same thing. Valence is the number of arguments a verb takes, while transitivity denotes how many objects a verb takes (so the latter doesn't count the subject).

Transitive verbs have two categories. Monotransitives (often just called transitive verbs) take an obligatory direct object and have a valence of two (S and DO), while the smaller category of ditransitives - which have the maximum English valence of three - take both a direct object and an indirect object mandatorily. If you consider semantic roles, the most common indirect object role is recipient.

It is in intransitive verbs that things get a little tricky. Some intransitive verbs have a valence of zero, usually ones concerning weather - "it rained," "it snowed," "it poured" (in this sense of the word), "it thundered" (in the word's original sense), and "it hailed" are examples, which use a dummy subject so that the syntactic requirement of a subject is fulfilled - in some languages, one can get away with just having the verb. Most intransitives, though, have a valence of one.

Even monovalent intransitives are sometimes classified differently. Remember when I talked about split intransitivity? This splits the marking of an intransitive along the line of semantic role of its syntactic subject. Unergative verbs have an agent (or sometimes a stimulus) as a subject, while unaccusative verbs have an undergoer (usually a theme or experiencer and less frequently a patient) as the subject.
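To make the difference between valence and transitivity concrete, here is a minimal sketch in Python. The `Verb` class, its field names, and the tiny lexicon are all made up for illustration ("sleep" is my own example of a monovalent intransitive); it only encodes the counting rules described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verb:
    lemma: str
    has_subject_argument: bool  # False for weather verbs with a dummy "it"
    objects: int                # 0, 1 (DO) or 2 (DO + IO)

    @property
    def transitivity(self) -> str:
        # Transitivity only looks at objects, never at the subject.
        return ("intransitive", "monotransitive", "ditransitive")[self.objects]

    @property
    def valence(self) -> int:
        # Valence counts every real argument; a dummy subject does not count.
        return int(self.has_subject_argument) + self.objects


# Hypothetical mini-lexicon, one verb per category from the text.
LEXICON = [
    Verb("rain", False, 0),   # zero-valent intransitive
    Verb("sleep", True, 0),   # monovalent intransitive
    Verb("eat", True, 1),     # monotransitive
    Verb("give", True, 2),    # ditransitive
]

for verb in LEXICON:
    print(f"{verb.lemma}: {verb.transitivity}, valence {verb.valence}")
```

Note that "rain" and "sleep" come out with the same transitivity but different valence, which is exactly why the two terms aren't interchangeable.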

Phrase Structure Rules Revisited

So this brings us back to phrase structure rules again. For the sake of time, I'll stick to English, because other languages do things a whole slew of different ways.


Clauses are pretty intuitive. You need a noun phrase and a verb phrase. Whee. :P There are adverbs that can function at the level of the clause, and in English these are always located after the main verb or before the subject.

Verb phrases

Verb phrases vary in complexity. The only mandatory element of any verb phrase is a main verb, but the only times you'll get just a main verb are with an intransitive verb in the simple-present (which is in practice a present habitual) and simple-past tenses. There is an ordering of things, too.

VP = (Aux) V (NP) (NP) (AdpP) (SubC)

Now this is going to sound a bit odd, because English is kind of screwy anyway. But English and other West Germanic languages have dative shift, which allows for the insertion of an indirect object without use of a preposition. I talked about this earlier. It is possible to have numerous adjunct adpositional phrases as well as an indirect object, and this gets really confusing when you have a ditransitive verb with two different possible grammatical orderings. Consider this pairing:

I gave Haya a flower for her birthday.

I gave a flower to Haya for her birthday.

They mean the exact same thing and are both grammatically correct. The difference is, the first one has two NPs and one AdpP after the verb, while the second has one NP and two AdpPs. Furthermore, the second one has an inomissible adpositional phrase for the indirect object, as "give" is one of those verbs that almost always requires an indirect object. (The only counterexample I can think of is "I gave blood," and when used monotransitively like this, an indirect object is contextually inferred, that is, a blood bank or hospital.)
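If it helps, the VP rule can be treated as a pattern over constituent labels. The sketch below is only a rough illustration: the function name and label strings are mine, and I'm assuming the AdpP slot may repeat, since several adjunct AdpPs are possible even though the rule writes (AdpP) once.

```python
import re

# VP = (Aux) V (NP) (NP) (AdpP) (SubC), with AdpP allowed to repeat
# (an assumption, because a VP can carry more than one adjunct AdpP).
VP_PATTERN = re.compile(r"^(Aux )?V( NP){0,2}( AdpP)*( SubC)?$")


def is_valid_vp(labels):
    """Return True if a list of constituent labels fits the VP rule."""
    return VP_PATTERN.match(" ".join(labels)) is not None


# "gave Haya a flower for her birthday"
shifted = ["V", "NP", "NP", "AdpP"]
# "gave a flower to Haya for her birthday"
unshifted = ["V", "NP", "AdpP", "AdpP"]
# The rule as written does not allow an NP after an AdpP.
scrambled = ["V", "AdpP", "NP"]

print(is_valid_vp(shifted), is_valid_vp(unshifted), is_valid_vp(scrambled))
# prints: True True False
```

Both orderings of the "give" sentence pass, which is the point: the rule doesn't care whether the recipient shows up as an NP or inside an AdpP, only about the order of the slots.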

But here's the thing. Say you wanted to tack on another adjunct prepositional phrase or two, that modified the verb.

I gave a flower to Haya for her birthday under the tree in the back, because I love(d) her.

Now things really get messy. Within this sentence we have three omissible adjuncts - two adpositional phrases and a subordinate clause, all of which modify the action of the verb - but one of them contains another adpositional phrase that is modifying the noun phrase! In theory, it is possible to go on ad infinitum with such a loop, because of what has been called recursion. And because I found a syntax tree generator, I can actually show you what one looks like without banging my flipping head against the wall :headbrick:


I'm gonna have to come up with a sentence that shows how recursion works within a single AdpP. The thing about ordering, though, is that subordinate clauses will come last without exception.
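In the meantime, recursion is easy to show in code, because the rule "an NP may contain an AdpP, and an AdpP contains an NP" translates directly into a function that calls itself. This is just a sketch: the function name is made up, it always uses "the" as the determiner, and "yard" is my own addition to stretch "the tree in the back" one level deeper.

```python
def nested_np(nouns, preps):
    """Bracket an NP whose AdpP holds another NP, as deep as the lists go."""
    np = f"[NP the {nouns[0]}"
    if len(nouns) > 1:
        # Recursive step: AdpP = Adp NP, and that NP may hold another AdpP.
        np += f" [AdpP {preps[0]} {nested_np(nouns[1:], preps[1:])}]"
    return np + "]"


print(nested_np(["tree", "back", "yard"], ["in", "of"]))
# prints: [NP the tree [AdpP in [NP the back [AdpP of [NP the yard]]]]]
```

Nothing in the function limits the depth; only the length of the input lists does, which mirrors the "ad infinitum in theory" point.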

Anyway, verb phrases can take numerous adjuncts. When AdpPs and subordinate clauses modify a verb, they are said to be being used adverbially. But there are other ways to use these!

Phrasal verbs

English has this annoying category of verbs, though, that make language learners groan. Phrasal verbs are akin to separable-prefix verbs in German, where part of the verb's infinitive form (usually in the shape of one of the adpositions) breaks off and goes to the end of the sentence. English is even weirder. While not all phrasal verbs do this, many shift word order (at least, in standard speech) depending on whether the object is a common noun or a pronoun, and what's really wacky is that these verbs can sometimes use either order if the object is a proper noun (i.e. a name)!

The basic structure of a phrasal verb, not including adjuncts, is VP = (Aux) V VPart NP in the case of a common noun. For example, "The terrorists blew up the downtown core," "This group cast aside their old ways in favour of..." "The planners dummied up a rough model that they could come back to later when doing the final product."

On the other hand, with a pronoun, the "shifted" structure is VP = (Aux) V Pro VPart. "YOU MANIACS! YOU BLEW IT UP!" "So you just cast her aside like some kind of worn-out chew toy?" "I'm sure the guys can dummy something up before the deadline. It doesn't have to be perfect."
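The two orderings can be summed up as a small decision function. This is a deliberately simplified sketch of the pattern as described here, not a full account of English particle placement; the function name and the `obj_type` labels are my own.

```python
def phrasal_vp(verb, particle, obj, obj_type):
    """Return the word orders described above for a phrasal verb.

    obj_type is "common", "pronoun" or "proper" (labels are illustrative).
    """
    unshifted = f"{verb} {particle} {obj}"   # V VPart NP
    shifted = f"{verb} {obj} {particle}"     # V Pro VPart
    if obj_type == "pronoun":
        return [shifted]
    if obj_type == "proper":
        # With a name, either order can sometimes be used.
        return [unshifted, shifted]
    return [unshifted]


print(phrasal_vp("blew", "up", "the downtown core", "common"))
print(phrasal_vp("blew", "up", "it", "pronoun"))
print(phrasal_vp("cast", "aside", "Haya", "proper"))
```

It returns a list rather than a single string so that the proper-noun case can carry both options.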

Noun phrases

So as we have seen, verb phrases in English can get pretty wacky. Guess what? So can noun phrases! They too can take adjuncts. But there's a limit on which clauses can actually be used. Sure, adpositional phrases can be used adjectivally as well as adverbially. On the other hand, while almost any kind of subordinate clause can be used on a verb, there's a special kind that can only be used in noun phrases - and it's also the only type that can be used in noun phrases. We'll talk about relative clauses in a second.

Furthermore, nouns can take determiners and/or adjectives. Unlike prepositional phrases, these always occur before the head noun. A very small number of adverbs can modify determiners of quantity or numerals. Articles cannot be modified. On the other hand, a larger number of adverbs can modify adjectives. So things get a bit messy in noun phrases as well. But adverbs aside, here's the basic structure for common nouns:

NP = (Det) (AdjP) N (AdpP) (Rel)

Now proper nouns can take adjectives and demonstrative determiners (no other types in standard speech when talking about a single thing; in some contexts numerals or indeterminates are allowed). Example: "That crazy Murdock put a whole bunch of junk in my van!" Pronouns, however, never take either. In speech, most personal pronouns do at least take relative clauses (the 3rd-person singular inanimate, aka "it," sounds clunky as all hells and damnations, but I've heard it, believe it or not!) but adpositional phrases are rarer in this context and usually only accompany subject pronouns in English.

Relative clauses

Let's step aside for a second here and talk about relative clauses. These take a complete thought and use it to describe a specific noun, or in some cases they stand alone as a NP, and they typically use the relative pronoun to mark them. In English, they are sometimes replaced by the subordinating conjunction that or even omitted entirely (where relative clauses can be identified by context), but the most general relative pronouns are "who" for people (and sometimes animals) or "which" for inanimate (and sometimes other non-human) things. It's worth noting that these share forms with the English animate interrogative pronoun and the interrogative determiner respectively. "Whom" (the object form of "who") is rapidly falling out of use and is considered dated.

But a basic relative clause works like this. Say you have the two sentences. "This is D'Arcy. He gave me the keys." The second one can be turned into a relative clause. "This is D'Arcy, who gave me the keys." Or, to use an example where neither of the sentences has a copula, say, "I like Jerry. Jerry gives me free stuff all the time." It becomes, "I like Jerry, who gives me free stuff all the time." So this is how it functions where the subject of the eventual relative clause is what is being modified.

If, however, it is either the direct or indirect object, or even an adjunct of the relative clause being modified, a gap is left in the relative clause in English (and many other languages), since the relative pronoun always starts the relative clause. Consider the following sentences.

"She decided to give it to Bruce, who(m) I saw earlier"

"Oh, that was the newspaper article which I said 'what the crap' to yesterday."

"The guy that I gave the box to yesterday seemed a bit sketchy, don't you think?"

So you have in order, direct object of a relative clause, adjunct of a relative clause, and indirect object of a relative clause. Now the capital O with a slash, that is, Ø, is very often used as a null sign, to denote zero morphemes in morphology (that is, morphemes that have no surface form, such as the dictionary form of most nouns in English) or gaps in syntax when talking about relative clauses. By placing these null signs in the above sentences, you can get an idea as to where in the relative clause the modified noun would go if it were an independent clause:

"She decided to give it to Bruce, who(m) I saw Ø earlier" (I saw Bruce earlier)

"Oh, that was the newspaper article which I said 'what the crap' to Ø yesterday." (I said "what the crap" to the newspaper article yesterday.)

"The guy that I gave the box to Ø yesterday seemed a bit sketchy, don't you think?" (I gave the box to the guy yesterday.)

The noun or pronoun that would have gone in the gap if the relative clause were independent is called an antecedent.
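The gap notation lends itself to a very small sketch: if you mark the gap with Ø, getting back to the independent clause is just a matter of putting the antecedent into that slot. The function name is made up, and it assumes exactly one gap per clause.

```python
GAP = "Ø"


def fill_gap(relative_clause_body, antecedent):
    """Put the antecedent back where the gap is, giving an independent clause."""
    if relative_clause_body.count(GAP) != 1:
        raise ValueError("expected exactly one gap")
    return relative_clause_body.replace(GAP, antecedent)


print(fill_gap("I saw Ø earlier", "Bruce"))
# prints: I saw Bruce earlier
print(fill_gap("I gave the box to Ø yesterday", "the guy"))
# prints: I gave the box to the guy yesterday
```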

So those are your basic relative clauses. Now what happens when the gap refers to a possessor? While that is used for both animate and inanimate, who is for animate, and which is for inanimate, the relative pronoun referring to the possessor of a relative clause is, without exception, "whose." This throws some people for a loop, so they try to get away from using possessor-based relative clauses at all.

It's not so bad when using it with people, because it sounds natural.

"Oh, did you see the guy whose car broke down?" (antecedent possesses the RC subject)

"Do you know whose house the robber broke into Ø last night?" (antecedent possesses the RC adjunct's NP)

"That was that guy whose kid we gave that balloon to Ø!" (antecedent possesses the RC indirect object)

"Remember the woman whose flowers we planted Ø?" (antecedent possesses the RC direct object)

But with inanimate objects? It sounds wacky, but it's totally grammatical!

"Oh boy... the car whose back blinker burnt out just got smacked!" (antecedent possesses the RC subject)

"The flowers, whose fragrance could be smelled for miles, were first planted in 1999." (antecedent possesses the RC direct object)

"We just sold the house whose basement suite we'd lived in Ø when we were first married." (antecedent possesses the RC adjunct's NP)

(It's really hard to come up with an inanimate indirect object, so I'll leave it be for now. )

So we've got our basic, determinate relative clauses out of the way. There are also standalone relative clauses, which function as NPs by themselves rather than modifying a head noun. Rather than which, inanimate indeterminate relatives are generally denoted with "what." Animates are still marked with "who." Examples where the missing antecedent is the object:

"You don't know what you're getting into." (Adjunct NP of relative)

"Obi-Wan never told you what happened to your father." (Subject of relative)

"Do you have any idea what you've just unleashed?" (Direct object of relative)

"I think he knows who to give it to." (Indirect object of relative.)

A subset of these are the indeterminate relative clauses, which rather than pointing to a specific something, leave the meaning more open. These begin with:

- "whatever" when inanimate options are potentially limitless.
- "whichever" when there is an implied or contextual limit to one's inanimate options
- "whoever" for animates. "Whomever" is used more and more sparsely these days to denote a non-subject.
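Pulling the choices so far together, the selection of a relative pronoun can be sketched as a handful of yes/no questions. The function and its parameter names are hypothetical, and it leaves out "that," omitted relative pronouns, and "whom/whomever" on purpose.

```python
def relative_pronoun(animate, possessor=False, standalone=False,
                     open_ended=False, limited=False):
    """Pick a relative pronoun following the distinctions described above."""
    if possessor:
        return "whose"            # same form for animate and inanimate
    if open_ended:                # the indeterminate "-ever" forms
        if animate:
            return "whoever"
        return "whichever" if limited else "whatever"
    if animate:
        return "who"
    return "what" if standalone else "which"


print(relative_pronoun(animate=False))                                 # which
print(relative_pronoun(animate=False, possessor=True))                 # whose
print(relative_pronoun(animate=False, standalone=True))                # what
print(relative_pronoun(animate=False, open_ended=True, limited=True))  # whichever
```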

"You can do whatever you want to get the job done. Just do it!"
"Whichever ones you choose will determine your ultimate fate."
"She said that whoever wins the contest will get to marry her."
"You can't just snatch up whatever you feel like! Pick one and get outta there!"

Constituency Tests

So how can you tell if something is grammatical from a purely syntactic viewpoint? There is a way. Constituency testing generally involves substituting a phrase or even a single word for another. These are primarily for NPs, but there are tests that exist for VPs and AdpPs as well. (Thank you Wikipedia for reminding me of some of these!)

Pronoun substitution

Also sometimes called proform substitution or replacement (an "it-cleft" is something different, and belongs with clefting), pronoun substitution is a quick and easy way to identify a NP constituent. The pronoun substitution is usually "it," but if it refers to a person it will obviously be either "he/him" or "she/her."

Consider the sentence: "I knew the woman who was rushed to hospital yesterday."

If you were to say, "I knew her," it's a perfectly grammatical sentence.

But say you were to try sentences like, "I knew the her," "I knew the woman her," or "I knew her who was rushed to hospital yesterday." If you tried these sentences on a native English speaker, you'd get laughed at at least a little, because they are ungrammatical. In fairness, though, this is actually a pretty good tool to use in language learning! Here is why they are ungrammatical: In all three cases, the pronoun is not replacing the entire NP, which includes the determiner. In fact, that is the sole reason the first of those three sentences is ungrammatical. The second sentence is probably the most egregiously ungrammatical, since it's replacing a relative clause that isn't standalone with a pronoun while leaving the head noun of the SINGLE original noun phrase intact. The last one might seem okay to some, because "the woman" in and of itself could actually be an NP. There's just one small problem. Relative clauses are ALWAYS within NPs; the standalone relative clauses mentioned before actually function as NPs on their own.

Clefting and pseudoclefting

Both of these tests involve taking the object NP of a sentence and shifting it in the sentence. Clefting turns the subject of a sentence into the subject of a subordinate clause, and the object into the complement of a simple existential; in English, "it was."

Take the sentence, "Bassam al-Wadud scored the winning goal." An actual cleft would render this as, "It was the winning goal that Bassam al-Wadud scored."

A pseudocleft is a little different, turning the object into the subject of the existential, and has a relative clause as the complement. So in the case of the above example, "The winning goal is what Bassam al-Wadud scored." The sentence means the same thing, the test shows the same thing, but the syntax is different enough that different NP constituency tests could further be applied to the end result, since relative clauses that start with "what" tend to be standalones.
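Both rearrangements are mechanical enough to write down as templates. The sketch below assumes a simple past-tense transitive sentence already split into subject, verb, and object; the function names are mine.

```python
def cleft(subject, verb_past, obj):
    """Object becomes the complement of 'it was'; the rest goes in a that-clause."""
    return f"It was {obj} that {subject} {verb_past}."


def pseudocleft(subject, verb_past, obj):
    """Object becomes the subject; a 'what' relative clause is the complement."""
    return f"{obj[0].upper()}{obj[1:]} is what {subject} {verb_past}."


args = ("Bassam al-Wadud", "scored", "the winning goal")
print(cleft(*args))        # It was the winning goal that Bassam al-Wadud scored.
print(pseudocleft(*args))  # The winning goal is what Bassam al-Wadud scored.
```

If the string you pass in as `obj` gives a natural-sounding result in both templates, that's decent evidence it is an NP constituent.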


In many languages, passivisation (making an active sentence passive) can be used to determine NPs. So consider, "That crazy driver almost flattened the kids crossing in front of the school bus." Make it passive. "The kids crossing in front of the school bus were almost flattened by that crazy driver."


Omission and deletion: two names meaning basically the same thing. This works in determining constituency status of most AdpPs - the only exceptions would be indirect objects of ditransitives, which are absolutely required. But adjunct AdpPs, like adverbs, can be omitted without affecting the grammaticality of the sentence. Sure, it might leave additional information out of the sentence, but it is still understood!

"D'Arcy gave a book to Cassie in the classroom yesterday."
"D'Arcy gave a book to Cassie yesterday."
"D'Arcy gave a book to Cassie in the classroom."

These are all grammatical. BUT...

"D'Arcy gave a book?" Nope. The immediate question would be, "Who'd he give it to?" It's just the ditransitive nature of the verb. It's not that it isn't a constituent, but since this isn't Finnish, there's no adequate constituency test for indirect object AdpPs, just adjuncts.


Fronting (moving something to the start of the sentence) is another good one for adjunct AdpPs, and can also be used for non-relative subordinate clauses. Basically, you take one or the other from the end of a sentence and stick it at the beginning. The example Wikipedia gives for AdpPs is pretty good:

"He is going to attend another course to improve his English." -> "To improve his English, he is going to attend another course."

As for subordinate clauses:

"Brittney couldn't make it to class because the ferry broke down." -> "Because the ferry broke down, Brittney couldn't make it to class."

Question tests

These work primarily in English, and can isolate verb phrases by answering a yes-no or pronominally-based question (that is, one with "what" or "who") with "X do(es)/did."

"Did Jeremy complete the assignment?" "He did."
"What caused the explosion?" "This gas leak did."
"Who took my stapler?" "Ghislain did."

There are other constituency tests as well, but I think I'll leave you with that.
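Two of the tests above, dropping adjuncts and moving a sentence-final constituent to the front, are mechanical enough to sketch in code. Both functions are made-up helpers that only build the candidate sentences; judging whether the result is grammatical is still up to a speaker. The `keep_capital` flag is there because a clause that starts with a name shouldn't be lowercased.

```python
from itertools import combinations


def omission_variants(core, adjuncts):
    """Every sentence you get by dropping zero or more adjuncts, order kept."""
    variants = []
    for keep in range(len(adjuncts), -1, -1):
        for subset in combinations(adjuncts, keep):
            variants.append(" ".join([core, *subset]) + ".")
    return variants


def front(main_clause, constituent, keep_capital=False):
    """Move a sentence-final constituent to the front of the sentence."""
    start = main_clause if keep_capital else main_clause[0].lower() + main_clause[1:]
    return f"{constituent[0].upper()}{constituent[1:]}, {start}."


for sentence in omission_variants("D'Arcy gave a book to Cassie",
                                  ["in the classroom", "yesterday"]):
    print(sentence)

print(front("He is going to attend another course", "to improve his English"))
print(front("Brittney couldn't make it to class",
            "because the ferry broke down", keep_capital=True))
```

Notice that "to Cassie" is kept inside the core string rather than in the adjunct list, because that is the one AdpP the omission test can't remove.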

Voice: Morphosyntactic valence-changing operations

I talked about voice earlier as something that is added to verbal morphology, but its real importance lies in what it does to valence, and also in the shift of focus it provides to a sentence.

The main valence-decreasing function depends on the morphosyntactic alignment of the language. Direct languages don't really have any. Nominative-accusative languages will have a passive voice, which demotes the subject/agent of a transitive verb to an adjunct (therefore making it omissible) and promotes the object/undergoer to subject, thus decreasing the valence from two to one. It does also work for ditransitives as long as the indirect object remains, decreasing valence from three to two. Ergative-absolutive languages have antipassives, which change the subject of a transitive (ergative) to the subject of an intransitive (absolutive), and make the object optional. Or in other words, the patient goes bye-bye. Antipassives also exist in many Austronesian-type languages. Not sure about tripartite. I'll have to do some digging on that!

Valence-increasing constructions, on the other hand, don't hang on the morphosyntactic alignment. A syntactic construction in English that is considered valence-increasing is the causative construction. Sure, English has pairs of verbs where an intransitive is the base form ("rise") and a monotransitive is the causative form ("raise"), but a construction using the verb "to make" can be used with just about anything. Case in point:

"Ethan ate pizza." -> "I made Ethan eat pizza." Here you've got a statement with a valence of two increasing to three with the introduction of a causer. And here's one for you. The only English statements with an actual valence of four are causatives, as in "I made Melissa give Jubilee the frisbee."

A number of agglutinative/synthetic languages have what's called an applicative voice, where a non-core argument (in English usually an AdpP) is promoted to a core argument, increasing the valence by one. It's harder to go beyond that in explaining in layman's terms because English just does not have a consistent equivalent to it.
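The passive and the causative are easier to pin down in code than the applicative, since English has both. The sketch below treats a clause as a dictionary of core arguments plus a list of adjuncts; that representation, the role labels, and the function names are all my own assumptions, and valence is simply the number of core arguments.

```python
def passive(clause):
    """Demote the agent to an optional adjunct and promote the undergoer."""
    core = dict(clause["core"])
    agent = core.pop("subject")
    core["subject"] = core.pop("direct_object")
    return {"core": core, "adjuncts": clause["adjuncts"] + [("by", agent)]}


def causative(clause, causer):
    """Add a causer as the new subject; the old subject becomes the causee."""
    core = dict(clause["core"])
    core["causee"] = core.pop("subject")
    core["subject"] = causer
    return {"core": core, "adjuncts": list(clause["adjuncts"])}


def valence(clause):
    """Valence here is just the count of core arguments."""
    return len(clause["core"])


ate = {"core": {"subject": "Ethan", "direct_object": "pizza"}, "adjuncts": []}
gave = {"core": {"subject": "Melissa", "indirect_object": "Jubilee",
                 "direct_object": "the frisbee"}, "adjuncts": []}

print(valence(ate), valence(passive(ate)), valence(causative(ate, "I")))
# prints: 2 1 3
print(valence(gave), valence(passive(gave)), valence(causative(gave, "I")))
# prints: 3 2 4
```

The semantic roles never change in either operation; only which slot each participant sits in does, which is the whole point of treating voice as a valence-changing operation.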

Syntactically, dative shift is considered valence-increasing. I agree for verbs which are typically monotransitive, but for ditransitives? No, not really.

Language Families

The amount of ink that has been spilled trying to figure out which languages are related to which other languages is probably more than the rest of linguistics combined, as before the 20th century, much more linguistic and philological study was devoted to it than to any other category, and it still commands a fair bit of attention. Language genetics are determined largely by what is called the comparative method, which is basically applied phonology with a component of semantics. I won't go too much into this, though.

What I will instead do is give you a look at families themselves.

Let's start with one that is familiar to most people: Indo-European. Indo-European is the first language family to have been posited and established, even though there are still discoveries being made about some of its more ancient - and now extinct - members. It is also the language family with the most mother-tongue speakers of any in the world, thanks largely to the colonial spread of languages like English, French, Spanish, Portuguese, and Russian, as well as the sheer number of speakers of languages like Hindi, Bengali, Punjabi, Nepali, Marathi, Farsi, Italian, and so forth. Nearly three and a half billion people have Indo-European languages as their mother tongue.

Indo-Iranian is actually the branch with the most speakers, which further divides into Indo-Aryan, Iranic, and Nuristani. There are a number of major languages within the Indo-Aryan branch. Foremost among these is Hindi-Urdu, a "pluricentric" language with two different standards that are practically mutually intelligible save for a few technical and religious terms. Bangla, or Bengali, is the official language of Bangladesh and one of the official languages of the Indian state of West Bengal. Punjabi is spoken in both India and Pakistan, and has two major dialects which are decreasing in mutual intelligibility; the Western dialect is spoken primarily by Punjabis in Pakistan and their diaspora in Commonwealth (and other) countries, while the Eastern dialect is spoken in India and is the dialect you'd be most likely to hear a Sikh speak. A number of other languages of India and Pakistan, major and minor alike, fall under the Indo-Aryan banner. Marathi, Gujarati, Assamese, Bhojpuri, Sinhala, Sindhi, Marwari, Chhattisgarhi, Rangpuri, Chittagonian, the Lahnda languages, and Magahi are some of the more substantial ones. Also, Dhivehi, the official language of the Maldives, the Romani languages (of what were called "Gypsies"), and Kashmiri are included.

As for the Iranic branch, probably the first language that would spring to mind is Farsi. Tajiki (from Tajikistan) and Dari (from Afghanistan) are largely considered dialects along with Farsi of a Persian language, but some "splitters" consider them different languages for sociolinguistic reasons. But there are other major languages within this branch, including Pashto, Balochi (from western Pakistan), and the Kurdish languages - most notably Kurmanji. Ossetian, a fairly major language of the Caucasus Mountains, is also Iranic.

Working our way northwest, we come to the Balto-Slavic languages, at least where they were said to be originally spoken. For some people, the Slavic branch has an indelible connection to Communism, because without exception, all the Slavic nation-states were communist at one point or another. But their traditions go back much farther than that. Indeed, some of the finest Romantic classical music came from Eastern Europe. The Russians and the Poles in particular had large empires. We'll come back to their languages in a minute.

First, there are the Baltic languages, of which only Latvian and Lithuanian still exist, plus some divergent dialects thereof like Latgalian. Old Prussian was a Baltic language as well, but it has long been extinct. And now back to the Slavic languages, which are divided into East, West, and South. The West Slavic languages are the most distant from the rest of the family as well as from each other internally; they are Polish, Czech, Slovak, and the Sorbian languages of eastern Germany. East Slavic comprises Russian, Ukrainian, Belarusian, and the minority language Rusyn, spoken primarily in western Ukraine and northeastern Slovakia, and also in Vojvodina (northern Serbia). South Slavic is interesting in that four of its "languages" are completely mutually intelligible, but for political reasons, they are often still classified as four languages: Croatian, Serbian, Bosnian, and Montenegrin. Serbo-Croatian does have three distinct dialects that could be considered languages, but here's the thing - the four "languages" I just mentioned are all based on the same dialect! :P Both of the more divergent dialects are spoken primarily in Croatia, and one of them (the "Chakavian" lect) also has a small community in Austria. Slovenian is close to the Serbo-Croatian dialect continuum, but distinct enough that there's no questioning its status as a separate language. Further removed from this lot are Bulgarian and Macedonian, which are closely related to one another, but distinct enough to be considered separate languages (even though some Bulgarian homers try their best not to).

Hellenic has had a large impact on many other Indo-European languages, primarily through the Attic dialect, which has since become known as Ancient Greek; this later developed into New Testament Koine Greek, and through many more changes over the last two thousand years, Modern Greek. Ionic Greek apparently had some influence on the Classical Greek languages as well. There is a remnant of Doric Greek (the Greek of Sparta and much of the southern Peloponnese), in the form of the severely endangered Tsakonian language.

Armenian is on its own. Other than occasional attempts to place it with Greek or as transitional between Greek and the Indo-Aryan languages, there's still no real place for it other than Indo-European.

One could make the same argument about Albanian, except in the case of Albanian it's gotten to the point that, if not consisting of four languages, it is at least a dialect continuum by now. The intelligibility isn't uniform between the four main "dialects," and some people count Tosk Albanian, Gheg Albanian, Arvanitika, and Arbëreshë as four different languages. (This might be oversimplifying it a bit.) In any case, Standard Albanian is based on Tosk, which is primarily spoken in southern Albania. Arvanitika is spoken by the Arvanites in Greece, and is close to Tosk but has more Greek influence, while Arbëreshë of Italy is a bit farther removed (albeit still closer to Tosk than to Gheg) and has much more Italian influence. Gheg is spoken in Northern Albania, oddly enough, including the capital Tiranë, plus Macedonia, Kosovo (which in spite of using the standard as its official language actually speaks the Northwestern sub-dialect of Gheg as a vernacular), and the southern tip of Montenegro.

Celtic used to be far more widespread than it is now, with the continental Celtic languages having been more or less wiped out either by Romans, Germanic tribes (especially the Goths), or Huns. The existing ones are split into two sub-branches, that is, Goidelic (Irish and Scottish Gaelic, and Manx) and Brittonic (Welsh, Cornish, and Breton). Manx and Cornish were once considered extinct but are now once again being spoken as mother-tongue thanks to revitalisation efforts; Welsh is the most-spoken as a mother tongue, with around half a million, and it and Scottish Gaelic have small communities outside of Europe, with Welsh being spoken by about twenty-five thousand people in Argentina and a distinct dialect of Scottish Gaelic existing on Cape Breton Island in Canada.

Italic is today limited to the Romance languages, which descend from the vernacular Latin of the later Roman Empire. But in antiquity, there was a plurality of Italic languages in much the same way as there was one of Hellenic languages. Older Italic languages included Oscan, Umbrian, Faliscan, and a number of other lects that were basically overrun by Latin thanks to the Roman Empire. Of course, the uniformity of Latin during the Pax Romana gave way to dialects, and later, languages. In keeping with the Roman tradition of being a strong influence, for better or for worse, most of these languages were spoken by people who would end up being colonisers, but a notable exception was the Vlachs - ancestors to today's Romanians - who were frequently subjugated. Today's Romance influence is felt worldwide, with hundreds of millions of people speaking French, Spanish, and Portuguese as a first language outside of Europe, and hundreds of millions more as a second language. Italian is considered a language of high culture because of the Renaissance influence of Florence and Venice - although it was the former's dialect that eventually became the standard for Italy. But there are also several minority languages within Romance. One odd case of political intrigue saw Romansch - a language with about thirty-six thousand six hundred current native speakers per Swiss census - become one of the four official languages of Switzerland. Catalan is a particularly noteworthy language, spoken by 4.1 million people. And then there's Sardinian, which is the most divergent Romance language, often placed in a "South Romance" branch all by itself.

Last but not least, we have Germanic. Germanic is almost ubiquitous these days because of the influence of English, which maintains largely Germanic grammar behaviours in spite of the massive influx of Latinate (and other) vocabulary. But there is obviously more to Germanic than English! Yes, Germanic is split into three branches, but one of them was extinct before the Renaissance. That was East Germanic, which comprised the Gothic languages, Old Burgundian, and Vandalic. West Germanic, which includes English, Dutch, Afrikaans, Frisian, and German, and North Germanic, which includes Swedish, Danish, BOTH Norwegians (there are two - Bokmål and Nynorsk - both of which have official status in Norway), Icelandic, and Faeroese, are both still very much alive.

West Germanic is quite polydialectal and it remains a bit difficult to define where a dialect ends and a language begins. Some of the West Germanic lects include Alemannisch (which is the basis for Swiss German and also has a good number of speakers in southern Baden-Württemberg and Alsace in France), Schwäbisch (from most of Southern BW), Bayrisch (Bavarian - also the main vernacular dialect in Austria), the Low Saxon dialects (which include Plautdietsch, aka Mennonite Low German), Kölsch, and Luxembourgish. Yiddish is also a Germanic language - basically a Judaised version of an older High German dialect.

Extinct languages have been placed into entirely different branches of Indo-European, such as Anatolian (Hittite, Lydian, and others) and the somewhat poorly-attested Tocharian, which was spoken in what is now Xinjiang in China. Other such languages, such as Phrygian, Illyrian, Dacian, and Thracian, have not been accurately classified. On a Biblical note, it is speculated that the original Philistines spoke an Indo-European language.

The world's largest language family in terms of distinct languages by most counts is Niger-Congo, which dominates most of Sub-Saharan Africa, although in terms of speakers, it ranks third behind Indo-European and Sino-Tibetan. Some branches of Niger-Congo are a bit controversial, though, whether it be internal classification issues (as is the case with the tenuous Adamawa-Ubangi group) or doubt over the status of the group as part of the family at all (as is the case with the Mande languages).

Most of these languages have some sort of noun class system, and this is especially the case with the famous Bantu group of languages, which is spoken in territory that covers much of southern and central Africa. The most notable Bantu language is Swahili, which, while only spoken by a couple million people as a first language, primarily in Tanzania, is spoken by as many as a hundred million people as a trade language throughout Central and East Africa. The sheer diversity of languages in Africa means that fewer languages have eight-digit numbers in terms of native speakers, but a few do; closely-related Kinyarwanda and Kirundi both fall into this category, with the former having as many as 12.5 million and the latter having as many as 10.5 million. Also in the category are Zulu (mainly in South Africa, with around 11 million) and Shona (a major language of Zimbabwe, around 12 million). Other languages in the Bantu group with over a million speakers include (but aren't restricted to) Umbundu, Kimbundu, Tswana (actually more speakers of this in South Africa than in Botswana!), Fang (pronounced like Fong), Lingala, Kituba, Kikongo/Kisikongo, Tshiluba, Kiluba, Lusonge, Nande, Gikuyu (#1 L1 in Kenya), Luhya, Kamba, Kimeru, Gusii, the two Sotho languages (Southern being the official language of Lesotho, Northern being almost exclusively spoken in South Africa), Nyanja/Chichewa (same language, called the former in Zambia and the latter in Malawi), Makhuwa, Tsonga, Oshiwambo, Swazi (also a somewhat major language of South Africa), Venda, Ndebele (major in both Zimbabwe and South Africa), Sukuma (#1 mother tongue in Tanzania), Ganda (spoken by the people from which Uganda gets its name), and Bemba (a major language of Zambia).

Southern Bantu languages have brought clicks into their phonetic repertoire due to contact with the Khoisan areal group, which are non-NC languages that at one point were thought to be genetically related. The truth is a little bit more complex. More on that much later.

Inside, and more so, outside of Bantu, classifying the languages has been a bit of a chore. Kay Williamson did prove that "Southern Bantoid" - sometimes called "Wide Bantu" as opposed to "Narrow Bantu" (Bantu proper) - was legitimate genetically, but not enough study has been done to go much further beyond that. Groupings of various languages have emerged within the "non-Narrow zone," but which languages belong in which groups is still a large matter of debate. Some of the major languages in this grouping include Tiv, the most-spoken of the Tivoid languages, and a few of the "Grassfields" languages in the hundreds of thousands, including Ghomala, Bamun, Kom, and Nso, all spoken primarily in Cameroon. Some linguists tenuously group those languages formerly grouped under the now-disavowed "Northern Bantoid" grouping under a larger "Bantoid" heading, but even this is somewhat spurious. What isn't is the larger Benue-Congo group, which along with Southern Bantoid contains the ex-"Northern Bantoid" languages, the Cross-River languages of southeastern Nigeria, which include the Efik-Ibibio dialect continuum with its 9 million speakers, and the more distantly related Platoid languages of Nigeria, whose most-spoken language is Berom, with a million speakers at last count.

Zoom out again, and you have Atlantic-Congo, which again is subject to rigorous debate as to who is included, and how they are included. Some linguists believe that the languages formerly grouped geographically as "Kordofanian" have no place in Niger-Congo at all, but some of those find themselves here, and the minority-language Talodi-Heiban branch has been somewhat established as legitimately Atlantic-Congo. The Senoufo branch is primarily spoken in Côte D'Ivoire, with three particular languages - Supyire, Cebaara, and Minyanka - having speaker numbers in at least the mid-hundreds of thousands, with Minyanka having some semblance of official status in Mali. The Kru branch of languages is spoken in Liberia and southwestern Côte D'Ivoire, with three of these - Bassa, Grebo, and Kru - being major vernaculars in Liberia, and two more - Guere and Dida - having a decent foothold in Côte D'Ivoire.

Now the Senegambian branch is of particular note because it includes the Fula macrolanguage and its 20+ million speakers, which territorially covers a swath of land stretching from Senegal all the way over to Cameroon, and is known variously as Fulani, Fulfulde, Pula, or Pulaar, depending on where you are in the continuum! It also includes two other major languages of Senegal - Wolof and Serer - which have over one million L1 speakers (and indeed, I've heard that the majority of Senegalese speak Wolof). It is an oddity within Niger-Congo in general in that it is non-tonal, but its robust noun-class system has been traced back to a common Niger-Congo (and indeed Atlantic-Congo) ancestor, and its own contiguity as a branch is based on a consistent pattern of word-initial consonant mutation.

The Bak branch of southern Senegal, the Gambia, and Guinea-Bissau is also non-tonal but lacks the consonant mutation of the Senegambian languages. Jola-Fonyi, Manjack, and Balanta are the branch's major languages, but it is also known for the Bijago language; one dialect of this is quite possibly the only language on the books outside of Vanuatu that has a linguolabial phoneme in its inventory!

The Mel branch is fairly small, but contains a couple of important languages - Themne, spoken by two million in Sierra Leone, and its close relative Kissi, spoken by just over half a million across national borders in the same area.

The Savannas branch, a recently-established branch, is still a little controversial as to its own internal classification, but is established as a legitimate genetic unit nonetheless. It is also the branch of several major languages. The Gur subbranch alone contains Mòoré (Mossi), a major vernacular language of Burkina Faso spoken by seven and a half million people or more, Kabiyé, the primary language of central and northern Togo, Dagaare (also from Burkina Faso, has just over a million speakers), Gourmanchéma, a major language of Burkina Faso and Niger, Konkomba, spoken in Ghana, and Dagbani, a major language of northeastern Côte D'Ivoire. Baatonum, an isolate within Savannas, is a major language of northern Benin. Outside Gur, Ngbaka has a million speakers in the DR of Congo, and the controversial Ubangian branch (which some linguists say isn't Niger-Congo at all) includes million-plus languages like Banda and Sango, both major languages of the Central African Republic. Finally, there's Zande, which also has over a million speakers.

But now we come to the two most important non-Benue-Congo subdivisions of Atlantic-Congo.

The Kwa branch, while it doesn't cover a particularly large swath of territory, running from southeastern Côte D'Ivoire to southern Benin, has within that territory a number of large and influential cities: Abidjan, Yamoussoukro, Accra, Kumasi, Lomé, Porto-Novo, and Cotonou, and with them, millions of speakers of Kwa languages. The most noteworthy of these is Akan, the language of the Ashanti Confederacy (a major African power) and still a dominant language of Côte D'Ivoire and Ghana, with 22 million speakers total. Baoulé and Anyin, two other Kwa languages with more than a million speakers, are also spoken primarily in Côte D'Ivoire. There are a number of other languages in six digits in the family.

Which brings us to Volta-Niger. Concentrated in southern Nigeria, Benin, and southeastern Ghana, this group has no fewer than fifty million speakers, and could easily have more. Yoruba is the most-spoken Niger-Congo language by a margin of a few million people! Seriously, all the world's L1 Yoruba speakers would account for about three quarters of the population of Canada - 28 million people speak that language alone! Another major language of Nigeria, Igbo, adds another 24 million! Other major languages include Ewe (3 million), Fon (2 million), Edo, Igala, Ebira, and Nupe.

So that's it for Atlantic-Congo, but we have one last zoom-out, to the outliers of the Niger-Congo family, many of which are considered controversial inclusions. The Katla and Rashad families (both from Sudan and formerly classified within a Kordofanian branch) lack the noun-class system (being connected to Niger-Congo by the comparative method) and are relatively minor outliers. The Ijoid languages have a bit more clout in terms of speaker numbers, with around a million speakers of Izon and roughly another six hundred thousand of Kalabari, plus several other minor relatives. They also lack the noun-class system and have an SOV word-order (most NC language branches are SVO). Another such branch is Mande, and this is a relatively important one in West Africa. Several languages have more than a million speakers, and a couple - notably Bambara in Mali (4 million L1 speakers), Mandinka in the Gambia and Senegal, Maninkakan in Guinea (5 million L1 speakers across national borders), Jula in northwestern Côte D'Ivoire (2.5 million L1 speakers), and Kpelle in Liberia - have some semblance of official status. Other million-plus notables include Susu, Soninke, Mende, and Kassonke.

And there's Niger-Congo! Not all the sub-branches necessarily fit, but until we get better data, this is how it's regarded by most linguists who don't really study this family.

Austronesian languages are pretty aptly named, since the name comes from the Latin word for "south wind" combined with the Greek word meaning "island;" the languages are native to the Southern Hemisphere and almost all of them have traditional territory on islands. While the bulk of the languages are considered Malayo-Polynesian, the most divergent Austronesian languages (the earliest splits) were languages formerly grouped together geographically as the "Formosan" languages of Taiwan, now considered to be discontiguous. Several theories have been posited as to how these languages are divided, and the most recent has three particularly divergent languages - Tsou, Puyuma, and Rukai - as being the oldest splits, the rest being contained within "Nuclear Austronesian." This isn't completely supported. What is widely supported is that the Austronesian languages began to split in Taiwan. All of the Austronesian languages of Taiwan are endangered to some degree, with only Amis, Paiwan, and Atayal having more than fifty thousand speakers.

Malayo-Polynesian itself is subject to some branching controversy; for the sake of simplicity I will adopt the multi-branch approach of Blust (1993). Let's start with the Philippine branch of MP, which morphosyntactically isn't really all that diverse in spite of 150 or more languages being identified within the branch. Most of the major languages within come from the Greater Central Philippine sub-branch, including Tagalog and Cebuano, the two most-spoken languages of the lot. Other major Greater Central Philippine languages include Hiligaynon, Waray-Waray, Central Bikol, Tausug, and Kinaray-a within Central Philippine proper, and Maranao and Maguindanao within the Mindanao clade. More distantly related are major languages such as Ilocano (a language I've actually studied!) and Pangasinan within the Northern Luzon sub-branch, and Kapampangan within the Central Luzon sub-branch. Other sub-branches of Philippine include Batanic, Northern Mindoro, Kalamian, South Mindanao, Sangiric, and Minahasan; there are also four unclassified languages within Philippine.

The minority Sama-Bajaw branch is split between the Philippines and Indonesia for the most part. The most spoken language is Sinama, a language I've had some exposure to (one of my profs worked on this language), which has around 410 thousand speakers. Some include this as a sub-branch of Barito, but it isn't set in stone.

The North Bornean branch has a large number of small languages, with only a couple (Melanau and Dusun) having over a hundred thousand speakers. Kayan-Murik is even smaller, with only ten languages by the most liberal counts, none of them in six digits; this is also the case with the Land Dayak branch, spoken by the Bidayuh Dayaks of Indonesia and Malaysia. The Rejang language of Sumatra is tenuously counted among these as well, and if legit, they'd account for more than half the speakers of the entire branch!

The Barito branch would be such a branch as well, since most of the languages are spoken by Dayak groups of Borneo. Ngaju is a major language of Borneo, spoken by almost nine hundred thousand people - about three-eighths of the population of Central Kalimantan province in Indonesia. But that pales in comparison to the most-spoken language of the branch: Malagasy, the official language of Madagascar, with its eighteen million speakers, a number which is growing very rapidly! It is actually the fifth-most-spoken of the languages in the family as a mother tongue, behind Javanese, Malay, Tagalog, and Cebuano.

Northwest Sumatran is from... well... Northwest Sumatra. But the branch is actually quite well-represented in terms of speakers, with the Batak sub-branch in particular having a few languages over the one-million speaker mark. There are two main Malayo-Polynesian branches in Sulawesi (that funny-looking island in Indonesia) - Celebic and South Sulawesi, and while the former has more languages (and a few languages spoken by two, three, even four hundred thousand people), the latter has more speakers, thanks to two major languages - Buginese and Makassarese, which have five million and two million speakers respectively.

Before we jump into the big branches, I want to look at a few languages - some of them quite major - that have defied classification within Malayo-Polynesian. The most notable one happens to be the most-spoken L1 of the lot - Javanese, whose L1 speaker base is approaching the hundred-million mark per recent estimates. Attempts have been made at connecting Javanese with the Celebic branch and Sundanese, but nothing convincing has come about one way or the other. The same can be said for Sundanese (which has 40 million L1 speakers), although a proposal by Alexander Adelaar linking it, and another major language, Madurese, to Malay and its more proven relatives, has started to gain some popularity. Moken, Chamorro (the indigenous language of Guam), and Palauan are also a bit of a puzzle to fit into any sub-branch, although they are confirmed as Malayo-Polynesian.

Malayo-Sumbawan is the name of the theory that puts Sundanese and Madurese together with the more accepted Malayo-Chamic branch of MP; the number of languages is fairly large, too. Within the Chamic sub-branch, Cham is one of the earliest-attested written examples of an Austronesian language, while Acehnese (spoken in northern Sumatra) has around three and a half million speakers at last count. The Malayic sub-branch is further divided. Malay, the most-spoken language (macrolanguage?) of the family overall and second-most spoken as a mother tongue, is in this branch, as are the Iban languages of the so-called "Sea Dayaks" of Borneo. A third branch comprises Balinese, Sasak, and Sumbawa.

Hundreds of languages are supposedly contained within a Central-Eastern Malayo-Polynesian branch, and as is very much the case with the "Narrow Bantu vs. Bantoid" discussion in Niger-Congo, the Eastern Malayo-Polynesian's two subgroups are very well-established, but the link between the two is tenuous and the case for a cohesive "Central Malayo-Polynesian" is next-to-nonexistent. And as with the non-Narrow Bantu Bantoid languages, not many of them are considered particularly influential. Some exceptions include the Bima and Manggarai languages in the Sumba-Flores clade, and the Timoric languages Uab Meto and Tetun (aka Tetum), the latter of which is one of the official languages of Timor-Leste.

The South Halmahera-West New Guinea languages are limited in their scope, being spoken on the western tip of New Guinea and having a maximum in the lower tens of thousands of speakers. As with all languages in West Papua, they are under threat from an increasingly hostile process of Indonesianisation in the West Papua area. The Oceanic group, on the other hand, covers much of the South Pacific, and is made up of linguistic linkages for which no direct proto-language can be constructed but which share a lot of similarities, more than can be attributed to mere contact. There are about 450 languages in Oceanic in total, and some of the more principal ones include East Fijian, Samoan, Tongan, Tahitian, Kiribati, and Maori, all of which, except Kiribati, are Central Oceanic languages (also called Fijian-Polynesian). Micronesian is another branch with a lot of official status, as Nauruan, Kiribati, Marshallese, and all but one of the traditional languages of Micronesia (the exception being Yapese) are contained within.

Sino-Tibetan is a language family whose internal classification between the main family level and some forty or so established groups is completely up in the air. So guess what? I'm going to say "bugger it" and go for those groups!

Sinitic or Chinese is the most-spoken group of the lot, constituting over a billion L1 speakers between the ten dialect clusters, of which some are unitary languages. Contrary to popular belief and Chinese government policies, there is no ONE Chinese language in speech, even though the logographic written language is used for all of the clusters. Mandarin, with around 960 million L1 speakers, is the #1 mother tongue in the world. This doesn't include closely-related Dungan, which is written in Cyrillic Script and spoken primarily in the former Soviet Union, especially in Kyrgyzstan and Kazakhstan, but it wouldn't make much difference even if it was, since Dungan only has around a hundred thousand speakers. By a large margin, Wu is second, and its prestige dialect is "Shanghainese." Yue (Cantonese) is spoken primarily in southeastern China and has a large diasporic presence as well, with a significant population of Cantonese speakers living in Canada and the USA (actually, I think Cantonese may be the #1 mother-tongue of people from Richmond, part of Greater Vancouver. Kinda cool, really)

Min is the most diverse of the dialect clusters, and the most divergent, representing not only a number of distinct languages, but some that are more distant from the direct descendants of Middle Chinese (the previous three dialect groups). The most commonly-spoken language of the Min languages is Min Nan, which is called Taiwanese in Taiwan - even though Mandarin is the de facto official language of Taiwan, Taiwanese Min Nan is the majority L1. Min Nan also has a significant foothold in coastal south-central China, just north of Yue territory. Most Min languages (Min Bei, Min Dong, Puxian, Min Zhong) are spoken in Fujian province, with Min Nan getting into Guangdong, while Leizhounese and Hainanese are spoken farther southwest. Hakka is also a significant cluster - a dialect continuum, if you will - but doesn't have the same kind of diversity that Min has (or Mandarin for that matter).

Other clusters include Gan (which is fairly close to Hakka), Xiang (also called Hunanese), Jin (tenuous; sometimes lumped with Mandarin), Huizhou, and Pinghua.

The Greater Bai languages are sometimes considered as Sinitic languages as well, as older splits from Old Chinese, but this is controversial. Bai proper has over a million speakers, but that's small potatoes compared to the sheer number of speakers of the definitive Chinese languages; two of the three other languages in this group are presumed extinct, while the Caijia language is endangered.

Bodish is basically Tibetan, named after the Tibetans' name for themselves. It's split into two groups - Tibetic and East Bodish - plus a couple of additional isolates. Tibetic is where the more influential Tibetan languages are situated; Central Tibetan is the literary language of Tibet, while Khams Tibetan and Amdo Tibetan also hit seven digits. Other notables in the Tibetic group include Sikkimese, Balti, Dzongkha (the official language of Bhutan), and Sherpa; that said, the East Bodish branch is a bunch of minority languages of Bhutan, none with more than fifty thousand speakers. Tshangla is an isolate that, while drawing much of its vocabulary from Classical Tibetan, is a separate split.

West Himalayish is a group of minority languages of India, Nepal, and Bhutan - the most-spoken language is Kinnauri, which has 65K. Tamangic is a bit more widely-spoken, mostly in Nepal, with Tamang proper (more of a dialect continuum) accounting for 1.3 million or so speakers, and Gurung (two separate languages going by what Wikipedia is saying about intelligibility) accounts for another 360K.

Newar is a major language of Nepal (over a million people) that some Sino-Tibetanists try to group with the Kiranti group of languages (whose most-spoken language is Limbu), but nothing definitive has been established in this regard. Lepcha is another such language, although it is only spoken by about thirty thousand people; Baram and Thangmi, two languages of Bhutan, are grouped together and also sometimes lumped together with Kiranti. Yet another language that gets lumped with Kiranti sometimes is Lhokpu.

Another case of "lumping" is found with regards to a few small language groups of Nepal: Magaric (which includes Magar, spoken by about 840 thousand people), Chepangic, Raji-Raute, and the extinct Dura language. Most of these languages are legitimate minorities, spoken by fewer than a hundred thousand people; Raji-Raute (a three-language group) in particular is spoken by isolated groups of hunter-gatherers that don't even get up into the tens of thousands.

'Ole and Gongduk are endangered isolates spoken in Bhutan by about five hundred and about two thousand people, respectively.

More lumping, this time in northeastern India. Siangic may be its own language family or a branch of Sino-Tibetan, but for simplicity's sake, it is included here. It consists of two minority languages, Koro and Milang. Cambridge historical linguist Roger Blench lumps this together with the Digaro languages of the India-Tibet border area, and the much larger Tani group, which includes some fairly major languages of Arunachal Pradesh and Assam, such as Mishing (over half a million speakers), Nishi (220 thousand and the most-spoken language within Arunachal Pradesh), and Adi (in six digits somewhere - another major language of AP).

Another small group that may be its own family is Kho-Bwa (also called Kamengic), which is yet another group of minority languages from Arunachal Pradesh. These languages are all endangered, as the most speakers any one of them has is 3,100 in the case of Sherdukpen. Some add the Puloik language, but Blench says that this is an actual language isolate (as opposed to an isolate within a family) and has no living relatives. Hrusish, comprising two other languages of AP, is another potential separate family, and isn't lumped with other languages within Sino-Tibetan.

One clade that is argued about internally as well as externally is Midzu. Some linguists consider them directly related as a group within Sino-Tibetan, some consider the Zakhring language alone to be Sino-Tibetan and the Kaman (or Miju) language to be a language isolate, and Blench considers them to both be isolates!

Moving out of the foggy area of potential isolates from Arunachal Pradesh and into more confirmed Sino-Tibetan languages, we start with the Dhimal group, which a linguist named van Driem at one point lumped into a larger Brahmaputran group, a claim he has since withdrawn after further research. It only has two languages, Dhimal proper (spoken in Nepal) and the endangered Toto language (spoken in West Bengal, India). Brahmaputran still exists according to van Driem, but now only includes the Bodo-Koch languages and the Konyak languages. This group is VERY vigorous, including Bodo, one of the official languages of Assam (about 1.3 million speakers), Kokborok, the ethnic official language of Tripura state (about 1.5 million speakers), Garo, which has some official status in Meghalaya (around a million speakers), Rabha (around 170 thousand speakers), Konyak, the second-most-spoken indigenous language of Nagaland (around a quarter million speakers), Phom (120 thousand or so speakers), and Tangsa (around a hundred thousand between India and Myanmar). Kachin-Luic, a minor group within Sino-Tibetan, was in van Driem's original proposal but has since been left out. It does have the Jingpho language, though, which has nearly a million speakers.

The Ao languages of Nagaland are spoken by about six hundred thousand people total, with anywhere from 141 to 261 thousand people falling under the Ao-proper umbrella; this language has several very divergent dialects such as Chungli and Mongsen, which could even be considered separate languages. While none of these languages has below fifty thousand speakers, the Ao-proper "dialects" and Lotha (with about 166 thousand speakers) are particularly vigorous. The internal structure of Ao is fairly well-defined. Another branch of Sino-Tibetan spoken completely within Nagaland is Angami-Pochuri, which comprises no fewer than eight languages. The most notable languages of that bunch are Sopvoma (about 170 thousand speakers) and Angami proper (about 130 thousand).

Another separate group spoken by the Naga group of ethnicities is the Tangkhul branch, which is spoken both in Nagaland and in northern Myanmar. Tangkhul proper accounts for 140 thousand speakers.

The Zeme languages... guess what? Also spoken in Nagaland! Some classifications list as many as nine languages, but three or four of these could be considered dialects of Zeme proper, which is spoken by roughly 130 thousand speakers. Meitei (aka Manipuri) is presently unclassified within Sino-Tibetan, but is a major language within the Indian East, being an official scheduled language of India and the official state language of Manipur, spoken by more than half of the state's population (around 1.3 million). The Karbi language of northeastern India is spoken by about 416 thousand people, but nobody seems to know precisely where it goes within Sino-Tibetan! Tujia is also in this predicament, and could be considered endangered because most of the eight million ethnic Tujias don't speak the language; only about seventy thousand people do.

One of the major branches that has been afforded a significant internal structure by historical linguists is Lolo-Burmese. I'll follow Lama (2012), because that seems to be the closest thing to a consensus at this point. The Mondzish sub-branch is apparently more divergent, and many of the languages are endangered, some critically so - only Mantsi has more than ten thousand speakers (around 37 thousand). Burmish and Loloish (also called Ngwi) are closer to one another. Now the only major language in Burmish is Burmese, with its 33 million L1 speakers, but there are a couple of Burmish languages spoken between Myanmar and China that get into six digits. The Loloish family is quite diverse, with different linguists taking different approaches to the splits, but it does feature some fairly significant languages, most notably two of the "Yi" languages spoken in China - Nuosu (2 million speakers) and Nasu (1 million). Other notables include Lisu (940 thousand), Lahu (600 thousand), Lolopo (570 thousand), and Nisu (over 400 thousand). This also includes the Phula languages, which were the language group that my ethnography/historical and comparative linguistics prof worked on!

The Qiangic languages are listed by van Driem as being several branches within Sino-Tibetan, but a few other linguists have one unified branch with several sub-branches. None of the languages included in all of these proposals is particularly major, although the Naic languages, included as a branch of Qiangic in some proposals, have the Naxi language, which has around 350 thousand speakers.

The Nungish languages are spoken in China and Burma, and not by particularly many people. The Mruic group is also quite small, with just two languages, neither of which is particularly major - Mru and Anu-Hkongso.

The Karenic languages are spoken by the Karen peoples of Myanmar and Thailand. The most-spoken Karen language is Sgaw, which is often just called Karen; this has over four million speakers, primarily in Myanmar. Pa'O and Eastern Pwo also top the million mark, while Western Pwo, Padaung, and "Red Karen" are in six digits.

Finally, you have the Kukish languages, which are many in number, and in some cases many in speaker, although none of them are in the millions. The most-spoken Kukish language is Mizo, one of the official languages of Mizoram state in India and also spoken in Myanmar; other notables are Tedim (340 thousand), Thadou (270 thousand), Hakha Chin (130 thousand), Hmar (110 thousand), and Falam (about 100 thousand).

Afro-Asiatic is a language family having almost as many speakers of its languages as Austronesian, but with only roughly 30% of the number of languages. Six well-defined branches exist, although the interrelation between them on an "in-between" level (that is, above the branch level but below Afro-Asiatic itself) isn't particularly established.

From a sociolinguistic perspective as well as based on sheer numbers, the Semitic branch is the most important. Known (and perhaps infamous) for its well-developed nonconcatenative morphology (go back to the morphology section if you need a refresher), which is unique amongst world languages, it includes such important languages as Arabic, Amharic, and Hebrew. The number of sub-branches within Semitic is argued, but I'll stick with the three Wikipedia uses. Now East Semitic, once a vigorous branch including the ancient Akkadian language, is now completely extinct, having finally died out in the first century CE, even after being the original language of the Assyrians and the Babylonians. Lack of data makes classifying it beyond this next to impossible. The other two, though, are fairly well-attested. South Semitic, for example, includes a number of languages of Ethiopia and South Arabia. While the number of languages is disputed due to the fuzzy boundary between a dialect and a language, even the most ardent lumpers would suggest that there are at least nine languages in the Western branch of South Semitic, which includes Amharic, plus Tigré and Tigrinya, the two major languages of Eritrea (also spoken in northern Ethiopia), and Ge'ez, a liturgical language. Outside the languages of Ethiopia, there is also the Razihi language of Yemen. As for the Eastern branch, these are the so-called South Arabian languages, which are spoken in Yemen and Oman but are under pressure from Arabic, and actually, only two of them, Mehri (which straddles the Yemen-Oman border) and Soqotri (on the isolated island of Socotra) have over fifty thousand speakers. Bathari could be extinct.

Central Semitic is fairly notorious for being where the original texts of most of the major monotheistic religions have their origin. Arabic is the most-spoken Afro-Asiatic language, representing a gigantic chunk of Afro-Asiatic L1 speakers with over three hundred million L1 speakers. But here's something to consider: Arabic is in the process of splitting into separate languages of its own, and could be considered a dialect continuum rather than a monolithic language, in spite of common folk within the Arab world adamantly arguing against this. There are some "dialects" of Arabic that could be considered different languages entirely, and one - Maltese - actually is, because of its history and different alphabet. The Hassaniya Arabic of Mauritania and Western Sahara could also fall in this category. I had a good chat with my Jordanian friend Haya about this one day. She may end up doing her Master's thesis on this! An older form of Arabic is the liturgical language of Islam, and a more recent form, the liturgical language of the Druze faith.

Another branch of Central Semitic is Northwest Semitic, which is in turn divided into Aramaic, Canaanite, and the extinct languages Amorite (yes, of the infamous Amorites of the Bible ;) ) and Ugaritic. Aramaic isn't just one language, although all these languages descend from Ancient Aramaic, which at one point became the main official language of the Assyrian Empire (supplanting Akkadian, apparently), a major (yet not official) language in the Neo-Babylonian Empire, and also co-official within the Persian Empire under Darius (explaining why part of the Bible/Tanakh book of Ezra is in Aramaic). Nowadays, it lives on through such languages as Assyrian Neo-Aramaic, Chaldean Neo-Aramaic, and Turoyo. Syriac is a liturgical language of certain Oriental Orthodox Churches, while Mandaic is the liturgical language of the minority Mandaean religion. The modern survivor of the Canaanite branch is Hebrew, and only because it was the subject of perhaps the only truly successful full-blown language revitalisation project in the 19th century, after having fallen completely out of use other than as a liturgical language over a millennium and a half earlier. The other attested Canaanite languages (within this branch anyway) were Edomite, Moabite, and Ammonite, all of which died out long ago. Edomite was probably the last to go, and conflicting accounts would suggest dates anywhere from 200 BCE to some time in the first century CE. We just don't know.

The Egyptian language is a favourite of historical linguists simply because it shows just how much change a language can go through. Attested from two and a half millennia BCE, Egyptian has been said to have gone around the entire "morphological clock," being at various times agglutinative-synthetic, fusional-synthetic, and isolating. It hasn't been spoken as a vernacular for a few centuries now, but is still used as a liturgical language in the Coptic Orthodox Church.

Cushitic languages are another major branch of Afro-Asiatic, with the homeland of the speakers being the Horn of Africa. North Cushitic comprises but a single language, Beja (spoken mostly between Egypt and Sudan by about 1.3 million people), and could even be viewed as an internal isolate rather than an actual branch since there are no closer relatives to the language. Central Cushitic, or Agaw, has a few minority languages of primarily Ethiopia (one is spoken in Eritrea), with the most-spoken of these being Awngi at around 350 thousand. South Cushitic languages are exclusively spoken in Tanzania, with only the Iraqw language (with close to half a million speakers) having a substantial number, but none being endangered per se... in the West branch, anyway. Two languages forming an East branch per some linguists are extinct.

It's East Cushitic that has the bulk of both languages and speakers. It's further subdivided into Highland East Cushitic, Lowland East Cushitic, Yaaku-Dullay (a group of minority languages of Ethiopia and Kenya), and the endangered Dahalo language. Even Highland East Cushitic doesn't have much sway in terms of numbers of speakers compared to Lowland East, where all of the truly major languages of the entire Cushitic branch are situated. Still, within Highland East, there are such vigorous languages as Sidaama (3 million), Gedeo (close to a million), and Kambaata (around nine hundred thousand).

Lowland East is again split; one branch pairs the Afar language (with just shy of 2 million speakers), which is one of the nine recognised national languages of Eritrea and an official ethnic minority language in Djibouti (which before independence was called Afars and Issas), with another RNL of Eritrea, Saho, which has about 220 thousand. Somali, the second-most-spoken of the Cushitic languages at around fifteen million speakers, constitutes its own branch, although if one were looking at intelligibility as a definitive marker of a language, one would also include the unstandardised Maay Somali (which has just shy of three million, counted in the above-mentioned figure for Somali) as a separate language within a Somali branch. According to Lecarme and Maury (1987; as linked by Wikipedia), Somali is also the most thoroughly documented language of the branch, with work on the language going back into the 1800s. The Oromoid languages comprise two branches, Oromo and the minority Konsoid languages. Oromo, with roughly 25.5 million speakers, is not only the most-spoken Cushitic language, but also outnumbers Ethiopia's official language Amharic as the most-spoken mother tongue in the country. Finally, there are the minority Western Omo-Tana languages, of which one, El Molo, is possibly extinct, having just eight speakers 25 years ago!

Omotic is the most divergent branch of Afro-Asiatic, spoken almost exclusively in Ethiopia, with only the Ganza language (partially) of Sudan being an exception. Gamo-Gofa-Dawro is listed as one language, although it could actually be three - the total speakers in this group are two million; Wolaytta has 1.6 million, Kafa has 830 thousand, Bench - known for its complex tone system (even by Omotic standards) and its whistled speech - has 348 thousand, and the only South Omotic language in six digits, Aari, has around 240 thousand.

The Chadic languages are spoken in Central West Africa, primarily between Niger, Chad, Cameroon, and Nigeria, and getting into the Central African Republic a bit as well. Although small in range compared to Semitic or Berber, there are a very large number of languages. There are four accepted branches - East Chadic, Central Chadic (also called Biu-Mandara), West Chadic, and Masa. Within Central Chadic (spoken primarily in northeastern Nigeria and northern Cameroon), there are a few languages that top six digits, with Kamwe (seven hundred thousand speakers) being the largest single language in terms of speaker numbers. The Dangla language of central Chad is the most-spoken East Chadic language, but only has around sixty thousand speakers. Closely-related Masa and Musey are the most-spoken Masa languages, numbering over two hundred thousand each. Now West Chadic stands out more because of the Hausa language, the most-spoken Chadic language as a mother tongue with at least twenty-seven million speakers, and probably far more (census numbers are ridiculously outdated in that part of the world) between Niger and Nigeria, the two countries in which it has some government recognition. There are a number of other languages with over two hundred thousand speakers, although none of these tops the million mark - the distant second to Hausa within West Chadic is the Ngas language of central Nigeria, which has around four hundred thousand speakers.

Finally, there's Berber. The Berber languages are native to northwestern Africa, and the subject of much debate due to their high intelligibility with one another, and also because of outdated census data and the nomadic nature of some of the speakers. Still, there are enough speakers of these languages for them to be considered official in Morocco and Algeria. The Atlas Berber variants account for at least six and a half million speakers by the last Moroccan census, for example; Riffian (another separate variant spoken in Morocco) has around 1.4 million per last census, while Taqbaylit (Kabylie Berber) has 5.6 million per census and some estimates have them as high as seven million! These are within Northern Berber. Western Berber, in contrast, is very endangered - between the two languages there aren't even ten thousand speakers per latest data. The Tuareg languages (could be called Southern Berber but aren't) aren't as vigorous as Northern, either, but they do have a substantial number of speakers. Although there are conflicting numbers on how many speakers there are, most of these languages have over two hundred thousand speakers. The Eastern Berber languages of western Libya are disputed as a genealogical unit, and while they aren't in as dire straits as Western Berber (the Nafusi language alone has over 140 thousand speakers), the languages still aren't as widespread as the dominant Northern branch.

Dravidian is held to be the original language family of India before the Aryans invaded millennia ago, and is primarily spoken in southern India and Sri Lanka.

It is split into four geographically-named branches. I'll do South Dravidian last because it includes not only most of the languages, but also three of the four most relevant languages.

There are a few languages that are known to be Dravidian but are currently considered internal isolates. Two of these, Allar and Vishavan, are well under a thousand speakers and could even be considered critically endangered. Others, such as Bharia and Bazigar, have significantly outdated census figures (by longer than I've been alive :wtf: ), and one other, Malankuravan, could be a dialect of Malayalam and requires further study.

Though few in number, the North Dravidian languages are fairly vigorous, with two of the three languages being spoken by over a million people. Brahui, a language far removed geographically from the bulk of Dravidian languages (it's spoken primarily in western Pakistan) is spoken by around four million people. Its closest linguistic relatives are a couple thousand kilometres removed, in the Indian states of Odisha (formerly Orissa), Jharkhand, and West Bengal; Kurukh is spoken by close to two million people as a mother tongue, while Malto is spoken by roughly 117 thousand.

Of the four defined branches, it is actually Central Dravidian that has the fewest speakers, with only one language in it, Kolami, having more than a hundred thousand speakers.

South-Central Dravidian has the second-most speakers of the branches, but it contains the language that has the most speakers, that is, Telugu, with anywhere from 74 to 77 million native speakers, depending on what source you go by! It is an official language of the states of Andhra Pradesh and Telangana in India. This branch has 14 languages total, with a couple of other languages having significant speaker bases: Muria has roughly a million speakers, Kui has around 916 thousand, and Madia has around 340 thousand.

South Dravidian languages collectively have over 140 million speakers, but although there are 32 languages in this branch (give or take a couple, perhaps), the bulk of the speakers speak either Tamil, Malayalam, or Kannada as their mother tongue.

Tamil in particular has surprisingly widespread influence, being an official language of Sri Lanka and Singapore, as well as a state official language of Tamil Nadu within India, a widely spoken language in southern India in general, and a recognised minority language in South Africa, Malaysia, and Mauritius. It has the second-most mother tongue speakers of the Dravidian languages at about 70 million, but the overall speaker numbers may well exceed that of Telugu due to its widespreadness.

Malayalam has around 38 million speakers, largely spoken in Kerala state in India (where it is official). To the north is the state of Karnataka, where Kannada is official, and this language has around 51 million speakers total per the last census, although some figures peg the number of native speakers at up to 67 million. (Given how it's from a tourist site, I'll take this with a grain of salt.) Still, it has significant sway in southern India.

Two other fairly vigorous languages in the branch are Beary (1.5 million) and Tulu (close to 2 million).

If you like a language family with lots of languages and few speakers, Pama-Nyungan is for you. One of the reasons this family is so poorly-attested for speakers in this day and age - around twenty-four thousand people speaking them - was the mistreatment of Australian Aborigines, since the family had at one point spread to cover almost the whole of Australia. For the sake of time, I will omit extinct languages and only mention branches if they are extinct in entirety, but in total, Pama-Nyungan is said to have as many as 300 languages represented.

Some of the defining typological features of the family include dependent-marking (which also exists in a number of Eurasian languages, in contrast to head-marking, which is more common in the Americas and Africa), no grammatical gender (there are some exceptions), and a robust inventory of retroflex sounds, which also occurs in India with Indo-Aryan and Dravidian languages.

Of the 41 Paman languages of the Cape York Peninsula area (up around Cairns), the overwhelming majority are extinct. The most-spoken of them is Guugu Yimithirr, which is notorious for being the source of the word "kangaroo." It has 775 speakers as of 2016. The entire Dyirbalic branch is nearly extinct, with the most-spoken language, Dyirbal, having only eight L1 speakers. Dyirbal is famous amongst linguists for having a very unique gender classification system, not only flying in the face of the lack thereof in most Pama-Nyungan languages, but lending its name to a book: "Women, Fire, And Dangerous Things." :P Yes, one of their four genders is more or less that. (It also includes water.) Could be worse. It could be Maric, which is completely extinct, with the last native speakers having passed in the 1980s. Waka-Kabic has only a single survivor, with 24 speakers of the Batyala dialect of Gabi-Gabi remaining. Durubalic languages are also extinct, although their pre-contact geographical range wasn't significantly large to begin with. Still sad. :(

The Yugambeh-Bundjalung languages are mostly still alive, but for how long this will be is anyone's guess, considering that the most speakers any one language has is Gumbaynggirr's ninety, per the 2016 Aussie census. The same can be said of the Wiradhuric languages of inland New South Wales, although Gamilaraay, even though it has no L1 speakers left, does have a number of L2 speakers, and Wiradjuri has 457 L1 speakers as of the 2016 census.

Going farther southeast, Yuin-Kuric is more or less extinct although revival attempts are being undertaken; these were the languages of what are now the Sydney and Canberra metro areas. The Gippsland languages of Victoria are practically extinct, with only a few L2 speakers of Gunai existing. Yota-Yotic consists of two languages with no L1 speakers, although one, Yotayota (aka Yorta Yorta) has 62 L2 speakers per the last census. Kulinic is entirely extinct as there are no fluent speakers left, although there is an attempt at revitalisation underway.

Lower Murray has one language that has over a hundred speakers, Ngarrindjeri, which had 312 L1s in 2016. Thura-Yura's main survivor is Adnyamathanha, which had 140 speakers. Whatever a witchetty grub is, that name comes from this language. The two Mirning languages are on the brink of extinction, down to single digits of L1 speakers.

Nyungic, spoken on the southwesternmost tip of Australia (including the Perth area), has one surviving member, Nyungar, which is apparently under-attested by Australian census takers. The figure of 475 given in the 2016 census is actually thought to be lower than the actual number of speakers, and there is even a dialect of Australian English developing with a sizable Nyungar-origin vocabulary, called "Neo-Nyungar." Apparently the name "Kylie" is also Nyungar in origin, but I'm a little skeptical about that.

None of the Kartu languages is particularly widely spoken, but Wajarri still has 145 speakers. The Kanyara-Mantharta languages are down to their last few speakers. The fairly large Ngayarda group has just two languages with attested speakers over 100 as of 2016 - Yinjibarndi (377) and Panyjima (104). Marrngu also has one such language, with Nyangumarta having at least 200 and possibly as many as 530 speakers.

Now Ngumpin-Yapa is more vigorous, and contains one language that is in the thousands rather than the hundreds of speakers - Warlpiri is spoken by around 2300 people in northwestern Australia. Only two of the eight languages - Mudbura and Warlmanpa - are under 100 speakers per census, and none of them are extinct, which is quite exceptional. Compare them to the Kalkatungic and Mayabic languages, which are all extinct.

Next we have Wati, comprising two languages. Probably the most vigorous Australian language of any is the pluricentric Western Desert, whose closest relative Ngardi is nearly extinct, but which itself has over seven thousand speakers, including three thousand of the Pitjantjatjara dialect alone. There's a similar situation in Arandic, as Lower Arrernte is extinct, Kaytetye has 122 speakers, but Upper Arrernte has a whopping 4 537!

Almost all Karnic and Yardli languages are extinct or close to it. Yolŋu, spoken near the northernmost point of Australia, does have one decently vigorous language, in Dhuwal, which has over five thousand speakers.

So yeah, Pama-Nyungan has over 300 languages attested. But out of the roughly 23 500 speakers in the family, the bulk of them speak one of four languages: Upper Arrernte, Western Desert, Dhuwal, or Warlpiri. Even with rounded numbers, that's a ballpark of 80 percent of the language family!
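
If you want to check that 80 percent ballpark yourself, here's a quick sketch in Python. The speaker counts are just the figures quoted earlier in this post, with "over seven thousand" and "over five thousand" rounded down to the floor values (that rounding is my assumption), so treat the result as a rough lower bound rather than a census-grade number.

```python
# Speaker counts as quoted earlier in this post.
# "Over 7000" and "over 5000" are rounded down (an assumption for this sketch).
top_four = {
    "Upper Arrernte": 4537,
    "Western Desert": 7000,
    "Dhuwal": 5000,
    "Warlpiri": 2300,
}
family_total = 23500  # rough total for Pama-Nyungan quoted in this post

top_four_total = sum(top_four.values())
share = top_four_total / family_total

print(f"{top_four_total} of {family_total} speakers = {share:.0%}")
# prints "18837 of 23500 speakers = 80%"
```

Since two of the four counts are floors, the real share is probably a touch higher than this.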

As a Canadian, I have to admit that Canada hasn't done much better a job in being actual humans to our indigenous people. :'(

Now for the polar opposite of Pama-Nyungan as a family - few languages, tons of speakers! Japonic only has 12 languages, of which 11 are grouped together as the Ryukyuan languages. Not surprisingly, though, the bulk of the language family's speaker base is made up of Japanese-speakers. It is the de facto official language, and the recognised national language, of Japan. It also has official status in one specific district of Palau. It is the tenth most spoken mother tongue in the world with somewhere between 125-127 million speakers. It is also known for its topic-comment syntactic structure, although for typology's sake it is often also referred to as (and largely is) an SOV language.

The Ryukyuan languages are all under tremendous pressure from Japanese, and range in number of speakers from close to a million (in the case of Okinawan) to 400 (in the case of severely endangered Yonaguni). Part of the problem lies in the fact that Japanese politicians recognise these as dialects of Japanese rather than their own languages, which is linguistically untenable, since not only are they completely unintelligible with Japanese, but with each other as well! UNESCO considers the Ryukyuan languages endangered to various degrees.

Another such language family is Turkic, although there are more attested languages in this family, and the speakers aren't concentrated in one of the languages to quite the same degree. Rather, there are a few major languages, a number of healthy "mid-card" languages, and a few minority languages.

The most unique group within Turkic is the Oghur group, which used to be far more widespread, but is now limited to the Chuvash language of central European Russia. This did include the Bulgar language, which was spoken by the people group that founded what is now known as Bulgaria, although nowadays the people there (an ethnic mix of Turkic and Slavic genetically) speak a South Slavic language, having done so for over a millennium now. The Bulgar dialects of the Volga Basin eventually became Chuvash, which currently has just over a million speakers.

Within the core of Turkic (sometimes called Common Turkic), there are four proposed branches, plus an outlying internal isolate in Khalaj, which is spoken by less than fifty thousand people in Iran. But all of these languages are fairly close in a number of respects.

Siberian Turkic is pretty easy to spot on a language map - of course the languages are all spoken in Siberia! :P Further divided into North and South, the only languages with significant speaker numbers are Sakha (North, with 450 thousand speakers), and regionally famous Tuvan (South, with around 280 thousand speakers), whose ethnic group is known for their distinct throat-singing! Fuyu Kyrgyz is probably extinct (and not all that closely related to Kyrgyz, so it's a slight misnomer), Chulym and Tofa are nearly extinct, and only two other languages out of the ten identified have more than two thousand speakers (Altai and Khakas).

The Karluk languages share a common ancestor in the Chagatai language, but eventually diverged; the most divergent - and most spoken - language in the group is actually Uzbek, which with 28 million speakers is second behind Turkish within the family. It has national official status in Uzbekistan and official status at the provincial level in certain northern provinces of Afghanistan. Uyghur is fairly robust in its own right, with over ten million speakers, but its closer relatives, Ili Turki and the Äynu code language, have markedly fewer speakers, and the former is critically endangered.

The Kipchak languages cover a fairly large geographical territory, and are further split into five branches, of which one, the South branch, is extinct (represented by Ferghana Kipchak, which went extinct in the early 20th century). Kyrgyz stands alone as another branch of Kipchak, and is spoken by almost four and a half million people, primarily in the country that bears their name, Kyrgyzstan, but also in China and other surrounding countries. Tatar (with over five million speakers) and Bashqort (with about 1.2 million speakers) are each other's closest relative, with their designated autonomous republics in Russia bordering one another. Tatar is actually fairly widely dispersed outside of Tatarstan as well. Although the Nogai branch has a Nogai language within it, it is actually the least-spoken language in the branch! Kazakh is spoken by fifteen million, Karakalpak by almost six hundred thousand, and Siberian Tatar by a hundred thousand, compared to Nogai's eighty-seven thousand! Finally, the Kipchak-Cuman branch, which is said to include the original Kipchak language by linguists, is the farthest west of the branches of Kipchak. The most-spoken language is Crimean Tatar with 480 thousand speakers (per latest census data), but given that it is spoken by mostly older people, and at that, by only ten percent of the ethnic group, if even that, it is considered endangered. Kumyk (with 450 thousand speakers) is in no such trouble, with 90% of the ethnic group speaking the language, and its closest relative, Karachay-Balkar, has around three hundred thousand speakers. All three have official status at the state level in Russia, and if you consider Crimea part of Ukraine, don't worry - similar recognition exists for Crimean Tatar within their system as well. Urum, spoken by just under two hundred thousand people, has no such status.

The largest branch of Turkic, though, is Oghuz, which contains the #1 and #3 languages in the Turkic family by number of speakers. While most of these are spoken in West or Central Asia, Salar, an internal isolate, is spoken in China. This language aside, they are split into three groups, Western, Eastern, and Southern. Southern Oghuz has two languages, Qashqay (spoken by almost a million people in Iran) and Afshar (largely nomadic but spoken in the area of northern Iran, eastern Turkey, and Azerbaijan - apparently around six hundred thousand people speak this). Eastern Oghuz comprises Turkmen (with its seven and a half million speakers) and Khorasani Turkic (census figures are way outdated for this language, but when last checked 25 years ago, there were over a million speakers - WE NEED LANGUAGE SURVEY DONE HERE STAT! :lol: ).

Western Oghuz is not only where the bulk of Oghuz speakers lie, but the bulk of Turkic speakers in general. There are 71 million L1 speakers of Turkish proper and 26 million of Azeri (between Azerbaijan and northwestern Iran - actually, there are more Azeri speakers in the latter), ranking them first and third respectively in the family for L1 speakers; Gagauz, a minority language of Moldova, and Balkan "Gagauz" Turkish (quotations mine - it's a bit of a misnomer as the people themselves don't actually refer to themselves as Gagauz!) are also within the branch, and the latter also has outdated census figures. (WHAT. THE. CRAP. :wtf: )

Anyway, that's a couple more families for you. Next post will tackle the Austro-Asiatic language family of (primarily) Southeast Asia.

The Austro-Asiatic languages are spoken primarily in Southeast Asia and, with the exception of a couple of languages spoken on the Nicobar Islands, are exclusive to the continental mainland! The family is sometimes known as Mon-Khmer, and indeed, the Khmer language (an internal isolate according to the latest proposal on the family) is within the group as its second-most-spoken language, with around sixteen million speakers.

The Munda branch is situated mostly within east-central India, especially within Jharkhand, Odisha, and West Bengal, but gets over as far west as Maharashtra. The two most-spoken languages, Santali (which some estimates put at around six and a half million speakers) and Ho (just over a million speakers), are secondary official languages of Jharkhand. Mundari (from which the branch takes its name) and Juray have around three quarters of a million speakers each, and a few others, including the westernmost language of the entire family, Korku (which is classified as vulnerable), are in the lower six-digits range.

Khasi-Palaungic lumps together two clades into a single branch: the Khasic languages are spoken in Meghalaya, and the Palaungic languages in east-central Myanmar and southern China. Some linguists throw in a couple of other Austro-Asiatic branches into the mix, but I will keep them separate. Khasi proper, with just over a million speakers, is the most-spoken Khasic language and a secondary official language of Meghalaya. The only other language that cracks the six-digit mark in Khasic is Pnar, with around a quarter of a million speakers per census data from 2001. A bit outdated, I know.

Within Palaungic, two more languages go over the hundred-thousand mark, although one of them, Palaung, has outdated census data for some of its dialects, so the exact number of speakers, while at least two hundred thousand in 2000 (per one dialect having 272 thousand speakers), is unknown. The other, Wa, is spread out amongst Myanmar, Thailand, and China, and has as many as a million speakers per some estimates. Wikipedia lists nine hundred thousand. Most of the languages in this branch are in the four or five-digit range, and one in particular (Mok) is probably already extinct, as there were just 7 speakers in 1981... before I was even born.

Khmuic consists of a number of minority languages of primarily northern Laos, although they spread into several other adjacent countries. The only language in the group with a substantial population lends its name to the branch - Khmu is spoken by over seven hundred thousand people in northern Laos. Mangic is endangered in its entirety, with not even ten thousand speakers existing between the three languages in the branch.

Vietic, on the other hand, has a worldwide footprint, thanks to Vietnamese having spread from the coastlands of Vietnam by refugee diaspora in the mid-20th century. It has seventy-five million L1 speakers, and as the official national language of Vietnam, has millions more L2 speakers as well. Vietic languages, under heavy influence from Chinese due to being under Chinese rule, have developed tone, which many other Austro-Asiatic languages (including neighbouring Khmer) have not. While there are several other smaller languages, the most-spoken non-Vietnamese language in the branch is Muong, which has just over a million speakers in northern inland Vietnam.

Katuic primarily straddles the border between Vietnam and Laos, being spoken in southern Laos and central inland Vietnam... what little that actually is (Vietnam is pretty skinny in its centre)! While named after the Katu language, the most-spoken languages are actually Cuy (450 thousand), Bru (best estimate is around 300 thousand), and Ta'Oi (best estimate 220 thousand), all of which are quite polydialectal, and are spoken in multiple countries, making getting exact census figures a bit of a chore.

The Bahnaric languages have a very similar distribution to the Katuic languages from a nationality point of view, but are farther south. There are also a larger number of languages, some of which are polydialectal, but only three languages - Bahnar, for which the group is named (160 thousand), Mnong (130 thousand) and Koho (two hundred thousand) - crack six digits for speakers. (That said, Sedang and Stieng are pretty close, both having ninety thousand-plus speakers.)

The Pearic languages of the southern Cambodia-Thailand border area are all considered endangered, and Somray, with its 4100 speakers, has more per most recent census data than the other six put together - almost double that count, in fact! The Nicobarese languages are a little better-off, with most of them in the thousands, but could still be considered endangered, or at least threatened. The Car language has 37 thousand speakers, which is more than the rest of the group put together.

The Aslian languages are indigenous to peninsular Malaysia, and are also considered endangered, even though a couple of languages - Semai and Temiar - have over ten thousand speakers and even have a substantial number of monolinguals. They are spoken primarily in the inland areas. Shompen, an endangered language of the Nicobars, is thought by some linguists to be part of this branch (rather than Nicobarese), but others consider it an internal isolate.

Finally, there is the Monic branch, named after the Mon language, which shares this branch with Nyah Kur. Mon is fairly well-attested, with 851 thousand speakers as of the 2007 Burmese census, but Nyah Kur is very much endangered, with just 1500 speakers.

Trans-New Guinea is a language family whose existence is largely accepted, but whose membership varies quite dramatically depending on which linguist you ask, and there's good reason for this - parts of New Guinea are still very hard to access, due to the terrain, and in the west, due to the ongoing political troubles between the indigenous Melanesian population and the Indonesian government. The number of speakers of the whole language family is only in the millions - even if we take the largest of the three proposals (476 languages), the number of speakers is only around three and a half million. The most recent proposal, by Usher (from earlier this year, actually), has a number of these omitted. I will discuss them later, utilising the Usher proposal as my main means of structuring this post.

The Berau Gulf languages are primarily spoken on the very western tip of New Guinea, that is, Bird's Head Peninsula, but per Usher, some of the languages of Timor-Leste also fall into this group. (A 2014 report disagrees with this.) It's divided into three groups - the divergent Mor language, which is critically endangered, the South Bird's Head languages, which are poorly attested in census data and generally don't have many speakers, and the West Bomberai languages, which include the Timor-Alor-Pantar group spoken in Indonesia and Timor-Leste. The three most-spoken of the Berau Gulf languages fall within the Timor-Alor-Pantar group, actually, with Makasae having over a hundred thousand speakers in Timor-Leste, Bunak having around eighty thousand speakers between Timor-Leste and Indonesia, and Fataluku having thirty-seven thousand speakers on the eastern tip of the island of Timor (in Timor-Leste, naturally).

Although the Sumuri internal isolate doesn't have census figures from later than 1978, more recent estimates put its numbers at around a thousand speakers.

The West Papuan Highlands languages are spoken to the southeast of Berau Gulf and up the western part of the high geological backbone of New Guinea. Although most of these languages have little available census data, there are some languages, such as Western Dani, Grand Valley Dani, and Ekari, that are fairly vigorous and are either over, at, or close to the hundred-thousand mark for speakers.

Asmat-Marianne Strait has fewer languages than the above two, and fewer speakers as well, with the most speakers actually identified being Asmat's near twenty thousand from 1991; language survey is stifled again by the issues in West Papua.

Cook River-Kolopom is a grouping together of the Kayagar (Cook River) languages with the Kolopom languages; the most-spoken is the Kayagar language proper, but even its most recent census data only dates from 1993, and the number of speakers then was around ten thousand.

Central West New Guinea languages are spoken in the mountainous heart of New Guinea, with some languages straddling the border between Indonesia and Papua New Guinea. Mandobo and Ngalum are the most-spoken of these. Telefol, which has about 5400 speakers, is known for its unique base-27 counting system! The Ok branch of CWNG (which includes Telefol as well as Ngalum and a number of other languages) is also known for its dyadic kinship terminology, something which is fairly rare in languages on the whole - that is, there is one word to describe how two people relate to one another, so one word for "brother and sister," another for "brothers," another for "mother and son," another for "mother and daughter," etc. (if I'm understanding kinship-dyadism correctly, that is). The Oksapmin internal isolate is Trans-New Guinea and shares a lot of features with CWNG (especially Ok), but this has been put down to contact rather than immediate genetic relationship.

Papuan Plateau or Bosavi languages are spoken in Papua New Guinea, and most of them have better attestation than their relatives on the West Papua side. There are still some that lag behind. Beami, for example, hasn't had proper census or survey data since 1981; it had 4200 speakers back then, the highest of any language of this branch. None of the other languages has over 2500 speakers, but the rugged terrain and isolation of New Guinea balances that out, meaning that the languages aren't necessarily immediately endangered.

Duna-Bogaya is a two-language branch, with the titular languages being spoken in western PNG. Duna is fairly vigorous, with the most recent figures suggesting that twenty-five thousand people speak it. Bogaya, on the other hand, could be extinct for all we know - it had 300 speakers in 1981, and that's the most recent census data.

The Fly River languages are mostly spoken in PNG, although some get over into West Papua as well. Two languages with outdated census data in Yaqay and Marind supposedly have over ten thousand speakers, but the Boazi language of PNG has much more recent attestation, having 4500 speakers as of 2007.

The moribund Abom language of PNG is considered an internal isolate.

The Morobe-Eastern Highlands branch has a very large number of languages, and is split into three primary sub-branches: Eastern Highlands (aka Kainantu-Goroka), Finisterre-Huon, and Kratke Range (aka Angan). There are a number of Eastern Highlands languages with over twenty thousand speakers, with the most-spoken as an L1 being Kamano, which has over sixty thousand speakers. Kâte has twenty thousand L1 speakers but is also a widely-used lingua franca, making it the most-spoken of the Finisterre-Huon languages, while Hamtai takes the top spot for Angan with forty-five thousand.

Finally, the Papuan Peninsula languages from the "Bird's Tail" (southeastern PNG) are grouped together based on a couple common innovations, but the families contained within are supposedly not any more closely related to each other than to other TNG languages. For the purposes of simplicity, they'll be grouped together here.

The Dagan sub-branch is held by Usher to be the most divergent branch (more distinct within the branch than the others). Daga is the most-spoken of these, with around nine thousand speakers as of 2007. The Koiarian branch is about the same for speakers - maybe a couple thousand more - with Ese being the most-spoken of these with ten thousand speakers. Humene-Uare has only around two thousand speakers for the three attested languages, of which one is extinct. Manubaran (Mount Brown) is a little better off, with around 2500 speakers for two languages, neither being extinct. Cloudy Bay-Musa River (or Mailuan-Yereban) has more languages, but the only one with a substantial number of speakers is Mailu, with 8500 speakers in 2000.

So those were just the languages accepted in most proposals. Usher leaves out eight entire branches that were included by Malcolm Ross in 2005, plus Goilalan, a subbranch of Finisterre-Huon (which Usher later integrated into Morobe-Eastern Highlands). Usher has the Goilalan languages as part of a new family, Binanderean-Goilalan, along with the Binanderean branch, which Ross includes in TNG. Of this new family, the Orokaiva language, with forty-seven thousand speakers, is the most-spoken as L1. Gogodala-Suki is one branch that Ross includes in TNG, but others include in a proposed Papuan Gulf family instead. The most-spoken language of these by far is Gogodala proper, with twenty-two thousand speakers, more than the other three languages put together.

Now if the census data is to be trusted, some of the languages in the Chimbu-Wahgi family have a fairly stable population of speakers. Melpa had 130 thousand speakers at last count, but that was in 1991. Kuman had 120 thousand in 2000. Wahgi had eighty-six thousand in 1981. Kaugel had seventy-seven thousand in 2000. The list goes on. A number of these languages are known for having a large number of laterals in the phonemic inventory!

Another such branch is Engan, which is also proposed to be its own family by some. Enga (with 230 thousand) and Huli (with 150 thousand) are two of the most-spoken indigenous languages of all New Guinea. Kewa and Angal are also very well represented.

The Madang branch (family?) has a large number of languages in its own right, but only a few have more than two thousand speakers. Foremost among these are Waskia (twenty thousand), Kobon (ten thousand), and Kalam (attested at fifteen thousand in 1991). There are a few others between three and five thousand.

Wiru is an oddball. Ross has it as an internal isolate in Trans-New Guinea, but Usher classifies it within a larger Teberan-Pawaian family, which in turn has been included in the Papuan Gulf hypothesis. It has very poor census/survey data, with numbers from 1967 being repeated in 1981 (15K plus speakers). Dadibi, one of the two Teberan languages (the other being Folopa), is in a similar situation - it had ten thousand speakers in 1988, but data on it since has been nonexistent. These two are in TNG per Ross while Pawaia is not, but per Usher, Wiru, Pawaia, and the Teberan languages are in Teberan-Pawaian. :unsure:

The Kiwaian languages, in TNG per Ross, have decent census/survey data, and only one of the six attested languages has fewer than a thousand speakers, with the most-spoken as L1, Kiwai, having thirty thousand.

Finally, the East Strickland languages, another sextet, don't have all that many speakers, relatively speaking. Gobasi is the most-spoken language per available data, but the figure of 1100 speakers dates back to 1993.

Kra-Dai (formerly Tai-Kadai) is a language family with its origins in mainland southeast Asia, much like Austro-Asiatic. While Chinese linguists consider these languages to be Sino-Tibetan, most historical linguists do not, and thus very few of the world's historical linguistic genetic classification systems have them identified thus. Sure, they share certain typological features, such as tonogenesis, but this is due to contact rather than genetic relationship. A more fleshed-out view not based on haphazard typological lumping is that they are potentially related to Austronesian and Hmong-Mien (a smaller language family that I will cover later), but even this is not anywhere near consensus acceptance.

There are five definite clades within Kra-Dai. One is an internal isolate, Ong Be, an indigenous language of northern Hainan spoken by around six hundred thousand people. The Kra languages are all endangered to some degree, with even the most-spoken of the languages, Gelao, having just eight thousand speakers out of an ethnic population of five hundred thousand. Kra is also one of the two branches of Kra-Dai that hasn't adopted Sinitic numerals, rather employing the numerals descended from the proto-language.

Kam-Sui is quite a bit better off, with the two titular languages being particularly vigorous: Kam has one and a half million speakers, while Sui has around three hundred thousand. Sometimes the Lakkja and Biao languages are included here, while other times they are treated as separate branches within a "Northeastern Kra-Dai." Hlai, which is exclusive to Hainan, is in a similar situation, albeit without one language in seven digits! Some of the more vigorous languages in that group include Ha Em (Literary Hlai, with almost two hundred thousand speakers), Laohut (166K), Tongzha (125K), and Cun (60K).

But the most voluminous primary group within Kra-Dai is Tai. It further splits into three branches per the latest proposals - Northern Tai, which is more divergent, and Central Tai and Southwestern Tai, which are closer to each other than to Northern.

Interestingly enough, the Zhuang ethnicity (possibly macro-ethnicity) is split not just between several languages, but between Northern and Central Tai. The Northern Zhuang languages of Northern Tai have a combined speaker-base of around sixteen million as of 2007, but since the Chinese government lumps them together as one language in spite of lack of intelligibility, external survey is required to get an accurate number for each language - there could be as many as eleven mutually unintelligible languages. The most recent figures for each individual language are from the late 90s, which had one in particular, Hongshuihe (which could itself be three languages), with about 2.8 million speakers, just a little more than Bouyei, another Northern Tai language that had 2.7 million in 2000. Almost all the Zhuang languages could be considered vigorous, and at least half of them have over a million speakers at last available count.

Central Tai includes the Southern Zhuang languages - which while less vigorous than the Northern ones, are still doing pretty well for themselves - plus the Nùng and Tày languages of northern Vietnam. The most-spoken of these is Tày, which has around 1.63 million speakers at last count. (2009, so fairly recent.)

Southwestern Tai has the most languages, and because of Thai proper and Lao, the most speakers, of any Tai sub-branch, and indeed of any other subgrouping within the family. It is further divided into four groups. One is an internal isolate within SW Tai, that is, Southern Thai or Pak Thai, which is spoken by about four and a half million people closer to the Malaysian border.

The Chiang Saen languages are languages that are largely called "Thai," with Thai proper having 20 million L1 speakers and an additional 44 million L2 speakers in 2000. The number has probably increased since then! Northern Thai also has several million speakers, although the figure of six million plus is dated to 1983. It also could be called transitional between Chiang Saen and another subgrouping, Lao-Phutai, as it has a lot of Lao influence.

Speaking of Lao-Phutai, obviously Lao is the prestige language within the group, but the Isan (oddly called Northeastern Thai in spite of the fact that it's much closer to Lao) language has almost the same number of L1 speakers, roughly 20 million. This being said, historical linguists will refer to the Southwestern Tai languages as a dialect continuum, as there is a degree of mutual intelligibility in the spoken languages, between Lao and Thai for example.

Then there's Northwestern Tai, whose constituent languages are/were largely spoken in Myanmar and Assam state in India. Of these, perhaps only the Shan language of northern Myanmar is truly vigorous, with around four million speakers. Some of the languages in Assam are legitimately extinct.

Nilo-Saharan is probably the most controversial family that Wikipedia actually accepts as valid, largely based on the work of Roger Blench. That said, all the "branches" he uses are recognised as viable groups on their own. This post will follow Blench's 2015 classification structure. By this, it is the Berta languages of Ethiopia and Sudan that are the most distinct. The three together, which are sometimes lumped together as a single language, have around 370 thousand speakers.

The Komuz languages group together the accepted Koman languages with the Gumuz language and the endangered (and poorly studied) Shabo language, which a number of other linguists have called an isolate. A couple of the Koman languages have over ten thousand speakers, with the most spoken of these - Uduk - spoken primarily in an Ethiopian refugee camp, because the people group has had to flee its traditional homeland in South Sudan because of the civil war there. Gumuz is the most-spoken Komuz language, with at least two hundred thousand speakers... if you consider it one language! Some linguists have stated that Gumuz's dialects are mutually unintelligible, making them separate languages (as many as three).

A third node contains the bulk of the languages. (Gee, where have I seen this before? Oh, how about the vast majority of language families! :lul: ) The Kunama languages are similar to Berta, in that linguists consider them three separate languages, while they are traditionally considered the same language. The three of them are spoken in an area straddling the northern border of Ethiopia with western Eritrea. The group has a sum total of 190 thousand speakers, and - although lumped together - is considered one of the national languages of Eritrea.

The Saharan languages are largely spoken in the south-central Sahara and central Sahel area of Africa, primarily in Chad, Niger, and Nigeria. The best-known language in Saharan, and possibly the most-spoken in all of Nilo-Saharan, is Kanuri, which is a major language of northeastern Nigeria and southeastern Niger, also spoken in Chad and even Cameroon. Estimates vary wildly for the number of speakers of this language, with 1987 data suggesting at least four million speakers, but data since then having estimates as high as ten and a half million! Most of the other Saharan languages are in the hundreds of thousands, although Berti is extinct.

Blench relates these closely with another well-attested group (for speakers) in Songhay (alternate spellings substitute an I for the Y). These are primarily spoken in eastern Mali and western Niger, and are divided into "Northwestern" and "Eastern." The most-spoken of these languages by a huge margin is Zarma, which is one of the nationally recognised languages in Niger and is spoken by roughly 3.6 million people. Koyraboro Senni has the second-most with around four hundred thousand. The most-spoken Northwestern Songhay language is Koyra Chiini, which is the only non-tonal Songhay language and is spoken by roughly two hundred thousand people in Mali.

Blench has a Central African clade which is further subdivided into six of the other "viable groups." Kuliak, the most divergent of these, consists of three languages, but two are moribund, with seventy speakers between the two of them. Only the Ik language of northeastern Uganda is actually in the clear (for now) with 7500 speakers or so. Further down the line, Maban and Fur are grouped together. The most-spoken Maban languages are Masalit (~400K) and Maba (~300K), both spoken in Chad, while the Fur language of the embattled Fur people of western Sudan was said to have close to 750 thousand speakers as of 2004, but the numbers may have decreased dramatically due to the campaign of genocide carried out against them by Sudanese Arab tribes.

Central Sudanic is the next of the "viable groups" and is put by itself within the tentative "Nuclear Central African" (my name, not Blench's). It is further divided into three branches: Birri-Kresh, a group of small and rather poorly-surveyed languages of South Sudan and the Central African Republic, Bongo-Bagirmi, and Eastern. Complicating the Birri-Kresh situation is that most of the Kresh languages are lumped together, so we have an attestation of sixteen thousand for the Kresh languages besides Furu put together. The most major language in the Eastern branch is Lugbara, with around 1.7 million speakers as of 2004, making it one of the major languages of northern Uganda; Lendu is also major, spoken by three quarters of a million people (at least) in the northeastern Democratic Republic of Congo. Mangbetu had 660 thousand speakers in the same region as of 1993. As for Bongo-Bagirmi, which is spread out over the Sudans, Chad, and the CAR, its most-spoken language is Ngambay, another major language of Chad, with just under a million speakers, and it is one of between forty and sixty languages (depending on who you ask) within the branch.

I'm going to handle this last one a bit differently, since there are two branches grouped together, but one is both gigantic and relatively well-known. The other is humble Kadu. Only two languages have attestations of over ten thousand, and one of them is severely outdated. The other, Kadugli, is the most-spoken of the Kadu languages with seventy-five thousand (as of 2004).

The gigantic one is the Eastern Sudanic "branch," which in itself is controversial. It is split further into nine sub-branches, which have been arranged in certain ways by certain linguists. Nubian languages are spoken in southern Egypt and northern Sudan, basically historical Nubia, and Nobiin, which one can guess from the name is considered the direct descendant of the proto-language, is the most-spoken language, with around six hundred thousand speakers. Tentatively assigned to a North Eastern Sudanic branch alongside Nubian by some are the Nara language of Eritrea, the two Nyima languages of Sudan, and the four Tama languages of Sudan and Chad.

The tentative South Eastern Sudanic branch includes the other five sub-branches. There's Surmic, which isn't particularly widely-spoken within Ethiopia or South Sudan, but the Me'en language has around 150 thousand speakers on the books. The Eastern Jebel sub-branch isn't looking so good, with only Gaam - with figures ranging from forty to eighty thousand speakers - not being under immediate threat of extinction. Temein is another small group, with the most-spoken language of that lot, Temein proper, only having ten thousand speakers. Daju has a lot in the way of outdated info; although the attestation of seventy thousand speakers for Sila is relatively recent, Nyala (aka Dar Fur Daju) has a number of eighty thousand going back to my birthyear.

"Wait, didn't you say this was a big, well-spoken, and famous grouping?" We haven't covered Nilotic yet, silly! There are some very well-spoken languages in this group! It's split into three branches, Eastern, Southern, and Western. As for Eastern Nilotic, Teso is the most-spoken language, with almost two million speakers, although the most famous language is arguably Maasai, due to the romanticisation of this people group by some Western media. Turkana has just barely under a million speakers, while Bari and Karamojong have around three quarters of a million. While Southern Nilotic isn't as well-represented for speakers, the Kipsigis language (the most spoken of the Kalenjin languages) has roughly the same number of speakers as Teso. And then there's Western Nilotic, where the real big shooters of the family sit! Dinka - one of the two major ethnic languages of South Sudan - has a reported five million speakers, which would make it the most-spoken singular Nilo-Saharan language provided that the higher-end estimates for Kenya's Dholuo language (which I've actually studied myself!) aren't true. While the 2009 Kenyan census reports just over four million speakers of that language, and pockets of speakers exist in Tanzania and Uganda, some less conservative estimates put the number of speakers closer to six million. Alur, Acholi, and Lango also all top the million mark, and Nuer may as well, although speaker-attestation isn't as recent or as reliable. The latest numbers state just under nine hundred thousand.

Uralic is a language family I've put much study into. I did my discourse analysis paper on Finnish, an advanced phonology project on Estonian, and historical and comparative linguistics on the whole family. Uralic is almost universally accepted as legitimate, with not even half a dozen dissenting linguists questioning it, while showing that they have little knowledge of the actual literature on the languages. (I'M LOOKING AT YOU, ANGELA MARCANTONIO :headbrick: )

Juha Janhunen and Jaakko Häkkinen have described the progression of the family as a series of singular splits as the family proceeded (mostly) west, from the origin area in what is now central Russia, just on the Asian side of the Urals. The branches I use follow them rather than Wikipedia.

The exception to the western migration was the very first split, that is, the Samoyedic languages (this name has come under some controversy lately and may end up changed). The only language among these that isn't immediately endangered is Tundra Nenets, which has an estimated eighteen thousand speakers. Some of these languages, such as the divergent Mator language, are already extinct, while some of the others are considered moribund (Nganasan and Enets), and others are still quite endangered but not quite that far gone yet, including the highly divergent Selkup "language," which may actually be three languages at this point. These are all spoken by people groups who are traditionally nomadic reindeer herders - and some of them still maintain the practice.

The second split is a very interesting one. The Mansic languages are two of the three languages formerly subsumed under a "Ugric" branch, but comparative linguistic work by the likes of Häkkinen and others has since disproven the idea that Khanty (a language of western Siberia with just under ten thousand speakers) belongs with them; it is instead a later split with some influence from neighbouring Mansi. But that's not why this branch is interesting. Mansi's closest relative is Hungarian. If ever there were a hilarious pairing in linguistics, it would be these two! Mansi has 940 speakers per the last Russian census, while Hungarian has over thirteen million, making it the most spoken language in the entire Uralic family! Mansi is spoken in Western Siberia. Hungarian's core of speakers is in... well... Hungary! There are also decent numbers of speakers in immediately adjacent countries, and a Hungarian diaspora in more distant nations, especially the USA and Canada. Ironic that such vastly different people groups speak such similar languages, amirite? That said, Mansi has lost much of its case system (it only has six cases), while Hungarian has not, still having eighteen cases. Going back to Khanty for a second, some dialects have as few as three cases, which is really an anomaly for post-Samoyedic Uralic languages.

The next split is marked by a vowel shift that is thoroughly documented in Sammallahti (1988). Permic contains the three Komi languages, Komi proper, Perem Komi, and the divergent Komi-Yodz, which at one point was considered a dialect of Perem Komi, plus the Udmurt language. Udmurt is among the most-spoken Uralic languages without national official status, spoken by 340 thousand people per last census, while Komi proper has around 160 thousand and Perem around sixty-eight thousand. Yodz, being a relatively new language, has just two thousand speakers. These are particularly known for the richness of their case systems - the lowest number of cases of any of them is Udmurt's fifteen, which is the same number that Finnish has. Depending on who you ask, Perem Komi could have as many as thirty, which is the largest of any Uralic language if you don't count adverbial suffixes (if you do, Finnish and Hungarian would both have more!), while more conservative linguists would say it has eighteen (which is still quite a large number) and Komi proper has seventeen. Permic languages have also lost the vowel harmony that defines the vowel phonology of most Uralic languages.

Mari is the next split, and there is some debate whether it is one or two languages, since there is still some intelligibility between the two, but this is decreasing. The Meadow, or Eastern, variant of the language is the dominant one, with around half a million speakers, making it the most-spoken Uralic language that isn't nationally official. It does have official status at the secondary level of government, being co-official with the Hill, or Western, variant, which only has around thirty thousand speakers. Mari is also very vigorous; most people within the ethnic group speak one of the variants. The number of cases is relatively small for a post-Samoyedic branch, but its nine cases are still comparable to the more complex Indo-European language case systems such as Lithuanian.

Mordvinic includes two living languages, Moksha and Erzya. They were previously lumped together as a single language, and the attitude that they are one language persists, in spite of their being mutually unintelligible and (unlike, for example, the Chinese languages) NOT using exactly the same alphabet! This makes it problematic to determine just how many people speak each language, as around thirty-seven thousand identified as speaking Erzya in the last census, two thousand identified as speaking Moksha, and almost four hundred thousand identified as speaking just "Mordovskiy Yazyk" (Mordvin language)! That makes my life difficult as well, as I seek to study the Erzya language in-depth in linguistic work.

Moksha has lost its vowel harmony, but still has a consonant-vowel harmony system (often known as a palatalisation-velarisation contrast, which exists in some other Uralic languages, Russian, and Irish Gaelic); Erzya still has vowel harmony. Like English, Moksha also has total reduction of unstressed vowels in certain contexts. Both languages have twelve cases.

Sometimes Mordvinic is considered to be the final split before Finnic, which is held to be the final resting place of the main group, but others say it is actually Samic that holds this distinction, which I agree with due to the system of consonant and vowel gradation that developed in both Finnic and Samic. The Samic languages are similar to Samoyedic in their speech status. All but one of them are very endangered - the "one" in this case being Northern Sami, which has around twenty thousand speakers. One of the languages that is potentially down to its last speakers is Ter Sami, which only had two speakers remaining at last count, while most of the others are in three digits. Lule Sami has somewhere between one and two thousand speakers, but fewer and fewer youth are speaking it.

And now we arrive at Finnic, which is a large dialect continuum with a high degree of intelligibility, but the groupings are considered different languages, and rightly so due to morphosyntactic and phonological differences. Finnish and Estonian rank second and third in the family for speakers, with five and a half million and just over one million respectively. Most of the other languages are in three or four digits. They tend to have similar numbers of cases, with Finnish and Karelian having fifteen, Estonian having fourteen, and Votic and some eastern Finnish dialects having sixteen, but the dormant language Livonian (which has no native speakers left but several L2s in Latvia) has just eight, while Veps goes to the other extreme with twenty-three! Most of them have a robust and consistent vowel harmony, but Estonian is a major exception to this.

I love this language family! :wub:
