LAPSyD - Lyon-Albuquerque Phonological Systems Database

Introduction.

This database contains searchable basic information about the phonological systems of a substantial number of different languages around the world. It aims to include languages from different geographic areas and language families in approximate proportion to their density. In this and in a number of other respects, LAPSyD draws on the experience of compiling the World Atlas of Language Structures (WALS) (Haspelmath et al. 2005, Dryer & Haspelmath 2013). Unlike its predecessor UPSID, described in Maddieson (1984) and other works, LAPSyD is not designed to be a balanced sample: some very closely related languages are included, some of the better-studied families are over-represented, and languages from less well-known areas are necessarily under-represented. However, appropriately structured representative samples can be constructed by making a selection among the languages included, and tools for doing so form part of the LAPSyD framework. Languages in LAPSyD which occurred in UPSID or appear in WALS are identified in saved sample lists.

The basic unit of entry is a language. For each language in the database, a list of the contrastive vowel and consonant segments is given, together with some commentary on questions of interpretation in the ‘consonant notes’ and ‘vowel notes’ fields. In addition, a descriptive outline of the permitted syllable structures is given, together with brief comments on the role played by tone and stress (accent). Categorical labels for syllable structure, tone and accent are also provided; these are explained in more detail in ¶ 3.5, 3.6 and 3.7. A count of the number of consonants, of the total number of vowel nuclei, and of the number of basic vowel qualities is also provided. Each segment is assigned a description in terms of phonetic features, enabling the database to be searched for classes of segments. The features used are outlined in ¶ 4. Familiarity with this feature set is necessary to understand how contrasts are encoded in LAPSyD and the limits that the feature set imposes on their representation.

The languages are identified by a primary name and by an ISO-639 three-letter code wherever possible. In some cases alternate names are also given when these are in common use. The area where the language is spoken is described in a text field and the language is assigned a point location specified by latitude and longitude coordinates. An abbreviated classification of the language by family is provided and languages are also assigned to one of six major geographic/genetic groupings. Most of the data is drawn from published sources or publicly available documents, such as dissertations, and the sources on which the description is based are fully cited. This ‘metadata’ on the languages is described in more detail in ¶ 2.

It is emphasized that each language included is represented in LAPSyD by a ‘snapshot’ of how it was spoken at a particular time and place by particular individuals or groups. Any language is an ensemble of varieties in constant evolution, and a different choice of time or place of observation could yield a very different picture of the phonology of the language. Moreover, the data as presented in the sources is filtered by the compiler of this database in an effort to provide a uniform style of analysis, particularly as it relates to the inventories of consonants and vowels. This homogenization of the data is regarded as an important and valuable feature of the database. Matters of interpretation are described in more detail in ¶ 3.

LAPSyD is made available for general interest and as a research tool. It is planned to continue to expand it both by adding further languages and by increasing the richness of information about each individual language included.

Language selection and metadata

Criteria for language inclusion.

The primary criterion for inclusion of a language in LAPSyD is the availability of what appears to be a reliable description of its main phonological characteristics based on first-hand experience with the language and prepared by someone with an adequate level of linguistic sophistication. The most satisfactory sources are those that provide a phonetically-informed description of the pronunciation and provide explicit arguments for segmentation and contrast and an explicit discussion of syllable patterns and prosodic contrasts. Disappointingly few language descriptions meet these simple criteria. In order to include a wider range of languages, less satisfactory sources must often be used.

Languages no longer currently spoken may also be included if the documentation obtained before their extinction is adequate to make a reasonably reliable basic phonological analysis. Languages from areas affected by particularly severe language loss before the 20th century, such as the east coast of North America and much of the south and east of Australia, but for which adequate data exists for ‘salvage’ descriptions to be made, may be included with a lower threshold of reliability. Examples include Timucua (tjm) from Florida, USA, and Biri (bzr) from Queensland, Australia.

Both geographical and genetic factors play a role in the selection of languages. Virtually all the adequately-described languages spoken in large areas which are sparsely-populated or have little linguistic diversity (e.g. North Africa, Siberia) are likely to be selected in order to fill the space on maps. Where language density is greater, language selection is influenced by impressions of language diversity. For example, the Bantu zone of east, central and southern Africa is less densely sampled than New Guinea, as there is greater genetic diversity of languages in the latter area.

No two varieties that are considered to belong to the same language are included; however, this criterion is imprecise as no clear distinction can be drawn between a difference of language and a difference of dialect. For example, only one variety of English (eng) is included despite very considerable phonological differences, especially within the British Isles, but both Moroccan (ary) and Egyptian Arabic (arz) are included based on claims of mutual unintelligibility of colloquial speech styles.

Language names.

Each language is identified by a single primary name. When there is an established English name for the language this is used, e.g. German (deu), Basque (eus), Assamese (asm), Navajo (nav). In most other cases a conventionalized spelling of an indigenous name is preferred, following the general trend in most recent scholarly work. This may result in familiar names not being used as primary identifiers of a language: Lappish is now known as Saami (sma), Cambodian as Khmer (khm), Yurak as Nenets (yrk), Nootka as Nuuchahnulth (noo). When a language is newly endowed with an orthography, subsequent literature frequently employs the name and spelling sanctioned by the authority establishing the orthography. Hence !Xóõ becomes !Khoon (nmn) and Yeletnye becomes Yélî Dnye (yle). Choice of the name to use is thus a judgment weighing familiarity, ‘correctness’ and guesses about how the language is likely to be referred to in the future.

The primary name may be followed by a modifier, separated by a comma, specifying a particular dialect or variety. This is particularly the case when the familiar name covers varieties that are different enough that they are clearly separate languages (or might be considered so). For example, Nahuatl and Chinantec are both represented by more than one variety which are distinguished by a modifier following the primary name. Examples are Nahuatl, North Puebla (ncj) and Chinantec, Lealao (cle).

On the other hand, a modifier preceding a name is a full part of the language name. For example, West Makian (mqs) is not a western variety of a Makian language but a language that is spoken on the western half of the island of Makian (and some nearby islands) in the Maluku province of Indonesia.

A language name may also be followed by a disambiguator enclosed in parentheses. This is most often used to clarify which language is meant when more than one language can be referred to by the same spelling of a name. Such homographs are usually disambiguated by adding the name of the country. For example, Ika (Colombia) distinguishes the Chibchan language of that name spoken in Colombia (arh) from the Igboid language Ika spoken in Nigeria (ikk).
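The naming conventions described above (a trailing modifier after a comma, a trailing disambiguator in parentheses, and a preceding modifier that is simply part of the name) can be illustrated with a small sketch. The function and type names here are hypothetical and are not part of LAPSyD; the sketch assumes a name contains at most one comma-separated modifier and at most one final parenthesized disambiguator.

```python
from typing import NamedTuple, Optional


class LanguageName(NamedTuple):
    primary: str                  # e.g. "Nahuatl", "West Makian"
    modifier: Optional[str]       # variety given after a comma, if any
    disambiguator: Optional[str]  # e.g. a country name in parentheses, if any


def parse_language_name(name: str) -> LanguageName:
    """Split a name into primary name, modifier and disambiguator.

    Hypothetical helper: a modifier that precedes the name (as in
    "West Makian") is deliberately left inside the primary name.
    """
    rest = name.strip()
    disambiguator = None
    # A final "(...)" is treated as a disambiguator.
    if rest.endswith(")") and "(" in rest:
        rest, _, tail = rest.rpartition("(")
        disambiguator = tail[:-1].strip()
        rest = rest.strip()
    # Anything after the first comma is treated as a variety modifier.
    primary, sep, modifier = rest.partition(",")
    return LanguageName(primary.strip(),
                        modifier.strip() if sep else None,
                        disambiguator)
```

For instance, under these assumptions "Chinantec, Lealao" yields the primary name "Chinantec" with the modifier "Lealao", while "Ika (Colombia)" yields the primary name "Ika" with the disambiguator "Colombia".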

Alternative names are supplied for many languages when these are reasonably familiar, especially when they have been used in the linguistic literature. For example, earlier publications on the Waorani (auc) language of Ecuador referred to it as Auca. This is actually a derogatory exonym, but it is useful to be able to connect this name with the preferred alternative in view of its use in earlier literature. For the most part, however, variants of a name that are essentially just alternative spellings are not listed.

Language codes.

Each language, apart from a handful of exceptions, is also identified by its ISO-639 code. These codes are lower-case strings of three letters, often based on the name of the language (but sometimes arbitrary). These codes are a useful way to identify whether different sources and databases are referring to the same or different languages when nomenclature is ambiguous or imprecise. They grew out of the codes used by SIL International to identify languages in its catalogue of languages, the Ethnologue, starting with the 15th edition (Gordon 2005), but have since been adopted as a world-wide standard (http://www.ethnologue.com/codes), although SIL International continues to manage the code system (for example, changes pass through them). Because of their origin these codes are sometimes known as ‘Ethnologue codes’.

For most languages there is little difficulty in matching the language with a code. However, there are some problems, mainly having to do with how coarse or fine the underlying classification of language varieties is assumed to be. The codes are mainly aligned with the division into languages that is presented in the Ethnologue. In some cases this is extremely fine: Arabic is divided into 35 distinct languages, German into 18. But in other cases it is coarse: Mandarin Chinese (cmn) is treated as a single language even though the 15th edition of the Ethnologue itself used to report that “Mandarin varieties of Guilin and Kunming are inherently unintelligible to speakers of Putonghua [Standard Mandarin].” When fine distinctions are made it can be hard to determine which of a number of possible codes is appropriate. When distinctions are coarse, the same code may match two or more entries in the database. Thus, a few entries in LAPSyD have more than one code, or may share a code with another entry, as is the case of Standard Modern Greek and the Greek variety known as Grico spoken in the Salento region of Italy, both coded ell. The set of ISO codes is also regularly updated, so the codes for certain languages may need to be changed to keep them in conformity.

In a few instances, there is no code yet provided. Such languages are temporarily assigned the unused code xxx. Cuitlatec, formerly spoken in the Mexican state of Guerrero, is a current example of a language with no ISO-639 code.
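As a rough illustration of the code conventions just described, the following sketch checks only the shape of a code (three lower-case letters) and flags the temporary placeholder xxx. The function name and the returned labels are assumptions made for this example; the check does not consult the ISO registry, so a "well-formed" result does not mean the code is actually assigned to a language.

```python
import re

# Placeholder used in LAPSyD for languages without an ISO-639 code.
PLACEHOLDER_CODE = "xxx"

# Shape of a code as described in the text: exactly three lower-case letters.
_CODE_SHAPE = re.compile(r"[a-z]{3}")


def classify_code(code: str) -> str:
    """Return 'malformed', 'unassigned' or 'well-formed' (hypothetical labels)."""
    if not _CODE_SHAPE.fullmatch(code):
        return "malformed"
    if code == PLACEHOLDER_CODE:
        return "unassigned"
    return "well-formed"
```

A check of this kind would catch slips such as a digit typed in place of a letter, but deciding which of several fine-grained codes fits a given variety remains a matter of judgment.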

Languages can be searched for by name, including alternatives, or by ISO code by clicking on the “Access languages” tab in the main menu, then “Text search”. This opens a window in which the search term can be entered, and its type (language name, code) selected. This window also allows for search by source.

Language classification.

For each language, basic information on its language family membership is provided. In the majority of cases, the language is classified into a high-level family on the order of Indo-European, Niger-Congo, Austronesian, Sino-Tibetan or Arawakan, and then into a major sub-family, not necessarily the next lower node in a familiar tree. In families such as Indo-European and Afro-Asiatic, which split at a high level into multiple recognized branches, the major sub-family is the next level of the classification (e.g. Italic or Germanic; Berber or Chadic). But in families such as Austronesian and Niger-Congo, where branches are heavily nested, a lower level is usually used (e.g. Oceanic in Austronesian). In many cases one or more further layers of a classification are also included, but no attempt is made to provide a full classification. There is a marked lack of consensus on the structure of most language families at both higher and lower levels, so these classification labels are only intended to serve as a rough guide to relatedness and as a possible basis for constructing samples of languages appropriately reflecting genetic diversity. Languages can be searched by family by clicking on the “Access languages” tab in the main menu, then “By Classification”.

The language classification assumed is moderately conservative. Larger units that are reasonably probable are accepted even though they may be rejected by more conservative comparative linguists. These include families such as Gulf (Muskogean + Tunican), Dene-Yeniseian (Na-Dene + Yeniseian), as well as contested larger groups such as Australian and Khoisan. Proposed macro-groupings such as Austric (Austro-Asiatic + Austronesian), Niger-Saharan (Niger-Congo + Nilo-Saharan) and Ural-Altaic are not accepted, let alone hypothesized mega-groupings such as ‘Amerind’ or ‘Eurasiatic/Nostratic’. However, a user is always free to reshuffle the classification to suit their tastes. Suggested tools to compare classifications are the Multitree (http://new.multitree.org) and LL-Map projects hosted by the Linguist List (http://linguistlist.org/projects/).

A special treatment is given to the languages that have sometimes been labeled Papuan. These are the languages of island South-East Asia and the South-West Pacific that do not belong to the Austronesian or Australian families. Without implying a genetic unity these languages are all labeled “Papuan” (to be read with scare quotes!) at the top level, in part because the process of sorting of these languages into genetic groups seems to be in greater flux than is the case elsewhere. A second-level affiliation, such as Trans-New Guinea, is also provided for these languages, which may correspond more closely to independent families.

Many languages are of uncertain affiliation; in LAPSyD these may be variously included in a more established grouping or left unaffiliated. For example, Japanese and Korean are shown as Altaic, a grouping accepted by some and rejected by many. Languages with no known affiliation are of two types, at least in principle. Some are quite well documented and have been the subject of concerted efforts to decide their affiliation and yet these efforts have failed to show demonstrable relationship to any other language. These are labeled Isolates, with the name of the language following. For example the Zuni language spoken in New Mexico is labeled: Isolate: Zuni. Other languages are not documented in sufficient detail (e.g. only a small vocabulary may be available) or have not been evaluated in detail from the perspective of their classification. These are labeled Unclassified. Proposed but uncertain affiliations are occasionally noted in individual language files.

Language localization.

The location where each language is spoken is specified in two ways. There is a brief verbal description of the area where speakers of the language are primarily resident now — or in some cases were resident before major population displacements due to colonial intrusions, etc. These descriptions are either in terms of geographical features such as river systems, islands, mountain chains or in terms of administrative and political units such as towns, departments and provinces, or both. The country is always mentioned.

In addition there is a point location, indicated by co-ordinates of latitude and longitude in decimal format, intended to mark the center of the region where the language is spoken, or its most characteristic 'habitat'. The majority of the world's languages are spoken by reasonably sedentary populations settled within a relatively small area. Assigning a single point location is not a great distortion of reality on the ground in these cases. But languages spoken by highly mobile or dispersed populations (e.g. Romani, Fulani) or by large populations spread over a great area are of course poorly represented by a single point. In the first of these situations, the point location is based on the location where the data on the particular variety of the language described was obtained. Thus, Cherokee (chr) is placed in western North Carolina in the USA rather than in Oklahoma, as sources describing the North Carolina dialect provide the primary information. This location also better represents the historic home of the language. In the case of widespread languages a political center is chosen to represent the language's location. Thus, English is located in London, Spanish in Madrid, Russian in Moscow, Mandarin Chinese in Beijing and Hindi in New Delhi.

Assigning point locations has some advantages over attempting to map areas where languages are spoken. It avoids the difficulties of determining boundaries and dealing with overlapping ranges, and avoids giving a false impression of precision. People move around, voluntarily or under duress, and may change the language they use. The relatively abstract nature of point locations serves to remind the user of the somewhat fictional nature of all language mapping. LAPSyD’s visualization tools use the point locations when plotting locations for all or selected subsets of the languages included in the database.

Areal/genetic groups.

Each language is assigned to one of six major areal/genetic groupings. Such groups are often used to test the generality of typological observations. For example, if patterns are repeated in each group separately, they are more plausibly universal. The areal/genetic groupings are defined first on a geographic basis. All of the languages belonging to families wholly or primarily based in a given geographic area are attributed to the area. The six groups are as follows:

1. Europe, South and West Asia.

2. East and South-East Asia.

3. Africa.

4. North America.

5. Central and South America.

6. Oceania.

The representation of the different areal groups can be seen by clicking on the "Summary" tab in the menu and then clicking on "Language areas".

Area 1 includes all the countries of Europe, including all of Russia and the Central Asian republics of the former USSR, as well as Asia Minor and Anatolia, and the Indian subcontinent. Major language families rooted in this area are Indo-European, Uralic, Altaic, Dravidian and the three Caucasian groups. A number of isolates and small families, such as Basque, Burushaski and Chukchi-Kamchatkan, also fall into this group. Creoles that are lexically primarily Indo-European have been classified as Indo-European, and hence are included in group 1 no matter where they are spoken. Since Korean and Japanese have frequently been linked to Altaic they are also included in area 1 rather than in area 2.

Area 2 includes China, the countries of mainland South-East Asia and the islands north and west of Wallace's Line (i.e. most of Indonesia and the Philippines). The major language families in this area are Sino-Tibetan, Austro-Asiatic, Tai-Kadai, Hmong-Mien and Austronesian. Sino-Tibetan and Austro-Asiatic (Munda) languages spoken within the Indian subcontinent are included here, as are all the Austronesian languages spoken around the Pacific and the Malagasy language (plt) of Madagascar. Among smaller families in area 2 is Andamanese, since the Andaman islands though politically part of India are geographically offshore from South-East Asia. Moreover, it has recently been suggested that Andamanese might be related to Austronesian.

Area 3 is the African continent and its offshore islands and includes all languages of the four traditionally recognized major language families of Africa: Niger-Congo, Nilo-Saharan, Afro-Asiatic and Khoisan. There are increasing doubts among Africanists as to whether all the languages usually considered to belong to the first two of these are in fact related, but each contains a very large undisputed core membership. Afro-Asiatic includes all Semitic languages, some of which are spoken outside Africa, e.g. in Malta and the Middle East. Khoisan is a disputed grouping supported by some and rejected by most specialists working on these languages, but nonetheless remains a common frame of reference. There are also a small number of isolated or unclassified languages in Africa, including Hadza (hts).

Area 4 is the North American continent, which is defined as reaching to the Isthmus of Tehuantepec, so that most but not quite all of Mexico is included in the North. Many distinct language families are recognized in this area. Some of the larger families are Dene-Yeniseian (Na-Dene + Yeniseian), Algic, Iroquoian, Uto-Aztecan, Oto-Manguean, Hokan and Penutian. The latter two contain reasonably certain core membership, but many disputed extensions have been proposed. Quite numerous smaller families and isolates also exist in this area, such as Wakashan, Kiowa-Tanoan, and Kutenai. North American families generally have no members outside the strictly geographical boundaries of the area. An exception is Ket (ket), now recognized as linked to Na-Dene in the Dene-Yeniseian family. Although it is possible that the north Asian location of Ket represents the older homeland of this family, its members are dominantly spoken in North America, so Ket joins this area.

Area 5 includes the Americas south and east of the Isthmus of Tehuantepec as well as the islands of the Caribbean. This division places the Yucatan peninsula in the Central and South American area and unites all languages of the Mayan family in this group, together with families such as Chibchan, Arawakan, Cariban, Tupian, Pano-Tacanan, Tucanoan and many smaller families. There are also a considerable number of South American languages which appear to be isolated or are as yet unclassifiable.

Area 6 — "Oceania" — includes the islands east and south of Wallace's Line, most importantly New Guinea, and the island-continent of Australia, but also the smaller islands of the southern and eastern Pacific and a further significant part of Indonesia (Sulawesi, and the Maluku and Lesser Sunda islands such as Flores and Timor but excluding Bali which lies west of Wallace's Line). The languages assigned to this area belong to the Australian family or to one of the various groups that have been labeled 'Papuan'. Opinions on the internal classification of the indigenous languages of Australia seem to be becoming more and more stable, but the classification of 'Papuan' languages remains very unsettled. Since the 'Papuan' language groups are often discussed together, they have all been designated as 'Papuan' followed by a suggested family affiliation. The numerous Austronesian languages spoken in this geographic region are assigned to area 2 — East and South-East Asia — since languages in the Austronesian family are primarily spoken to the west and north of Wallace's Line.

Sources consulted.

The information on each language comes primarily from published or publicly-available technical linguistic literature, such as grammars, dissertations, journal articles and dictionaries. Detailed bibliographical references are given for the items relied on for each language. Occasionally an additional source that has been identified but not yet consulted is also listed together with the annotation “not seen”. In a few cases, data is based on or supplemented by personal fieldwork by the compiler, personal communications from others or resources that may be accessible on the web, including recorded speech samples. In some cases a web address is provided which will link to the item cited. As described in ¶ 3 the phonological analyses presented in the sources are not necessarily accepted as given, but are modified with the intention of achieving a uniform interpretation of the facts.

Interpretation and standardization of descriptions
1. General principles

The phonological description of each language is reviewed to standardize the analyses as far as possible. The basic idea is to remove differences that have to do with choice of theoretical model or transcriptional preferences and other issues that might either create apparent rather than real differences between the languages included, or might disguise real differences that actually exist. The goal is to represent the language in a ‘concrete’ fashion based on what is actually produced in a careful speech style.

The primary goal of this database is to represent the segmental and prosodic contrasts that form lexical distinctions in each language, together with basic information on phonotactics. Most languages yield fairly well to an analysis in terms of a set of contrastive segments that can be identified through the classic test of contrast in minimal pairs. These elements are generally known as the phonemes of the language. Each element identified in this way can then be characterized as possessing certain phonetic traits. These will normally be the properties that occur in the most common variant produced in a reference pronunciation. There are, of course, many decisions to be made about identifying the segments as produced in different environments with each other. In general, linguists tend to rely on phonetic similarity above all else. However, sometimes other arguments can support an identification, such as the identity of morphemes. For example, in a set of English (eng) words such as fate, fatal, fatality the assumption that they share the morpheme <fate> provides a basis for saying that the rather different sounds heard in utterance-final, medial post-stress and medial pre-stress positions in these words spoken in isolation are all realizations of the same element /t/. Phonetic attributes that can be reasonably derived from particular environments, such as the aspiration found in pre-stress onset position, are not considered part of the essential nature of the segment. So /t/ in English is considered a voiceless alveolar plosive (pulmonic stop) and is not marked for aspiration.

A different case can be illustrated with Bilua (blb), a 'Papuan' language of the Solomon Islands. In this language the voiced stop series is described as being prenasalized when intervocalic but plain word-initially (Obata 2003). Since prenasalization is not explicable as a reasonable consequence of simply being intervocalic, it is assumed that the voiced stops in this language are basically prenasalized. The older spelling of the language name as Mbilua suggests that in fact they probably are prenasalized in initial as well as in medial positions but prenasalization may be less salient in this position.

Following such principles, the most reasonable basic form to posit for each contrastive segment in the language is sought.

Of course, all of the classic problems of determining an inventory of segments must also be considered. These include evaluating ongoing sound changes that may be modifying the inventory, such as splits and mergers in progress and the effects of contact between languages leading to the possible introduction of new sounds. Decisions must also be made about whether phonetically complex entities should be treated as units or as a sequence of segments. All elements that are candidates to be considered as unitary affricates, diphthongs, consonants with secondary articulations, prenasalized stops, and many other types of segments traditionally accepted as potential unitary elements in the phonetic literature, are also candidates to be considered as a sequence.

In LAPSyD, unlike in UPSID, there is no marking of loan segments. For each case where a segment is known (or believed) to have entered a language because of borrowing, a judgment is made as to whether the borrowed segment is sufficiently integrated in the language. For example, English has borrowed the word genre from French (fra) and some speakers may pronounce this word with a nasalized vowel [ã], similar to that in the French pronunciation [ʒãʁ] (though never with a French-sounding uvular r-sound). But this pronunciation is not used by a majority, so /ã/ is not considered part of the established inventory of English. On the other hand, French has borrowed so many words from English with the ending <–ing> that this has become a morpheme that can be added to native French roots. Hence French now has a phoneme /ŋ/. Many languages are spoken in areas where a dominant language has had a major impact; for example, virtually all indigenous languages of the Americas are subject to influence from English, Spanish (spa), Portuguese (por) or French. Descriptions will often note that certain sounds or syllabic structures only occur in, say, loanwords from Spanish. Clues are sought as to whether such items should be considered as cited from the dominant language, or as having become an integral part of the indigenous language. In order to diminish the strong homogenizing effects of such dominance, a bias against accepting introduced elements is taken as the initial stance. This, of course, can be overcome when the evidence of nativization is persuasive.

All the data in LAPSyD is subject to any limitations on the information available in the sources consulted. It is quite common, for example, to read a description that mentions long vowels or nasalized vowels as contrastive but which fails to state how many such vowels exist, or to find no explicit statement on syllabic structure. By examining words cited as examples or studying a lexicon it may be possible to check, for example, how many long vowels occur or to construct an idea of the syllable canon, but not all such lacunae can be filled.

Vowel inventory

The inventory of vowel nuclei recognized for the language is given in IPA transcription with the symbols laid out in a basic grid showing vowel height (vertical) and front-back dimensions (horizontal with front at the left). Rounding is shown by choice of symbols and by labeling. Other properties will be most often indicated by diacritics. Note that the symbol /a/ is used for a low central vowel, not a front one. If the language has diphthongs, that is, dynamic nuclear vowels, these are listed beneath the vowel grid. Many linguists use the term "diphthong" for structures that consist of an approximant and a vowel (in either order). These are not necessarily syllable nuclei but may rather be CV or VC structures. The description is searched for indications as to the best analysis. For example, if a vowel + approximant rhyme precludes the occurrence of another coda consonant, as in Thai (tha), this indicates that treating the approximant as a coda itself may be the most appropriate analysis.

A commentary field titled "Vowel notes" provides space for remarks related to the vowel system and its interpretation. These may include notes on how the data reported in LAPSyD differs from the source(s) consulted, as well as on such issues as restrictions on the distribution of certain vowels, such as vowel harmony. Information missing from the source may also be noted here. The commentaries are not in any consistent format, but may sometimes contribute to understanding the way data has been interpreted.

The total number of distinct vowel nuclei is listed in the "Count Information" provided for each language. This is the sum of all vowels and diphthongs of all types. This total is often less certain than the number of basic vowel qualities.

Basic vowel count

A count of basic vowel qualities is also provided. This collapses distinctions among vowels that have the same values on the basic parameters of height, backness and rounding. Pairs of, say, oral and nasalized vowels, or long and short vowels that can be matched to each other on the basic parameters are counted just once. Diphthongs that can be considered as composed of more basic vowels do not add to this count. Distinctions of tongue root position are mapped to height differences, so they do add independently to the basic vowel count. Navajo (nav) presents a clear example of the difference between total and basic vowel counts. It has four short oral vowels, /i, e, a, o/, as well as nasalized counterparts of each of these and also long counterparts to each of the oral and nasalized short vowels. So this language has 16 total vowels, but only 4 basic vowels. The decisions on the basic vowel inventory are not always as straightforward as in the case of Navajo, but the most uniform interpretation is sought. Cross-linguistic comparisons of the vowel systems of languages are very often based on the basic rather than the total vowel inventory.
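The relation between the two counts can be sketched in a few lines of Python. The tuple representation and the function names below are assumptions made for illustration and do not reflect LAPSyD's internal format; only the Navajo figures come from the text above. Diphthongs and tongue root distinctions are not handled in this sketch.

```python
from itertools import product

def navajo_vowels():
    """Build the Navajo vowel set described in the text.

    Hypothetical representation: (quality, nasalized, long), where
    quality stands for one height/backness/rounding combination.
    """
    qualities = ["i", "e", "a", "o"]
    return list(product(qualities, [False, True], [False, True]))

def total_vowel_count(vowels):
    # Every distinct vowel nucleus counts once.
    return len(set(vowels))

def basic_vowel_count(vowels):
    # Nasalization and length are collapsed; only the basic quality counts.
    return len({quality for quality, _nasalized, _long in vowels})

vowels = navajo_vowels()
print(total_vowel_count(vowels), basic_vowel_count(vowels))
```

Collapsing to the first element of each tuple is what makes oral/nasalized and short/long pairs count only once.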

If a reliable source of information on segment frequency has been found, the most frequent of the vowels is reported at the right of the display of the counts.

Consonant inventory

The consonant inventory is presented in a chart labeled with axes for manner (vertical) and place (horizontal). Place is organized from front (at left) to back, and manner largely follows the order of degree of stricture from most closed to most open constriction. Sibilant and non-sibilant affricates and fricatives appear in separate rows. Languages with clicks have a second chart for the clicks. Subsidiary distinctions, such as secondary articulations, are nested within the main axes of the grid. The conventional description of /h/ as a voiceless glottal fricative is accepted for the purposes of this database.

A count of the number of consonants recognized for the language is provided under the "Count Information". In addition, the ratio of the number of consonants to both the number of total vowels and the number of basic vowels is reported. Navajo has 34 consonants, so for this language these ratios are 2.125 (34/16) and 8.5 (34/4) respectively. As for vowels, if a reliable source of information on segment frequency has been found, the most frequent of the consonants is reported at the right of the display of the counts.
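The two ratios are simple quotients, as the following minimal sketch shows. The function name is hypothetical; the Navajo counts are those given in the text.

```python
def consonant_vowel_ratios(n_consonants, n_total_vowels, n_basic_vowels):
    """Return (consonants / total vowels, consonants / basic vowels)."""
    return (n_consonants / n_total_vowels, n_consonants / n_basic_vowels)

# Navajo: 34 consonants, 16 total vowels, 4 basic vowels.
print(consonant_vowel_ratios(34, 16, 4))
```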

A commentary field titled "Consonant notes" provides space for remarks related to the consonant system and its interpretation. As for vowels, this field may include notes on how the data reported in LAPSyD differs from the source(s) consulted, and on restrictions on distribution. This field often includes notes on the description given to coronal consonants in a source, or on ambiguities in the labeling. These commentaries are not in a consistent format, but may add some details or contribute to understanding the way data has been interpreted and where remaining uncertainties lie.

Syllable structure

The basic patterns of syllable structure are reported in several ways. At the beginning of the field titled "Syllable notes" a notation of the canonical syllable structure assumed for the language is given, using the standard notation of C for consonant and V for vowel. Items in parentheses are optional. Thus (C)V(C) means that the language allows four types of syllables with V, CV, VC and CVC structures. Common patterns allowing restricted classes of consonants in certain syllable positions may be noted with the symbols G, L and N, for Glide, Liquid and Nasal respectively. Thus C(G)V would represent a syllable canon in which an onset is obligatory and it may have the structure of just a single consonant or a consonant followed by /w/ or /j/ or a similar approximant. The notation V(V) indicates cases where there seems to be relatively free combination of vowels which nonetheless do not create separate syllabic nuclei. The notation V(ː) indicates that long vowels occur. The presentation of the syllable canon may be followed by a commentary noting, for example, difficulties of interpretation or the existence of other phonotactic limits on syllables (e.g. if word-medial and word-final syllables differ in structure). The field may also provide examples of the different syllable patterns. Incompleteness of information may also be noted here.
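The way parentheses expand into a set of syllable types can be illustrated with a short Python sketch. The function name is hypothetical, and the sketch assumes that optional groups are not nested; it is not a description of how LAPSyD itself processes the notation.

```python
import re
from itertools import product

def expand_canon(canon):
    """Expand a canon such as '(C)V(C)' into the syllable shapes it allows.

    Assumes parenthesized (optional) groups are not nested.
    """
    # Each token is either a parenthesized optional group or a single symbol.
    tokens = re.findall(r"\(([^()]+)\)|([^()])", canon)
    choices = []
    for optional, required in tokens:
        # An optional group may be absent ("") or present; a plain symbol is fixed.
        choices.append(["", optional] if optional else [required])
    shapes = {"".join(parts) for parts in product(*choices)}
    # Shortest shapes first, alphabetical within the same length.
    return sorted(shapes, key=lambda s: (len(s), s))

print(expand_canon("(C)V(C)"))
print(expand_canon("C(G)V"))
```

Each optional element doubles the number of possible shapes, which is why (C)V(C) yields four syllable types.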

In order to make canonical syllable structure searchable two fields are provided, one (‘Canonical Form’) containing the most fully elaborated syllable canon in simplified form using only C and V as well as ː in case of long vowels, and the other (‘Syllabic Restriction’) noting where there are substantial restrictions on the consonants that occur in a given position (noted by *). Many languages have smaller sets of permitted coda consonants than onset ones but a coda restriction is only shown in this field when the number is decidedly limited.

Each language is also assigned to one of three categories for its syllabic complexity ("Syllcat") corresponding to those in the WALS database (http://wals.info/feature/12A). Languages that allow nothing more elaborate than a CV syllable are classed as having Simple syllable structure; those which allow either a common type of two-consonant onset, such as CG or CL, or allow a single consonant in coda position, or allow both of these are classed as Moderately Complex. Languages that have less common onset clusters, such as two obstruents or three or more consonants, or which have any clusters in the coda are classed as having Complex syllable structure.

Numeric values are also given for the maximal degree of elaboration of the Onset, Nucleus and Coda elements separately, and these are summed to give an overall Syllable Index. Onset values are 0 for maximal one-consonant onset (since the CV syllable type is taken to be universal), 1 for common CC onsets (such as CG, CL), 2 for less common 2-consonant onsets, and 3 for maximally 3 or more consonants in onset. Nucleus scores are 1 for single-mora nuclei as the maximum, and 2 for bimoraic (or potentially longer) nuclei. Coda scores are 0 if no coda consonant is permitted, 1 for maximally a single consonant, 2 for two consonants and 3 for 3 or more consonants permitted in coda. The summed Syllable Index thus ranges from 1 for a language such as Yoruba (yor) with maximal CV syllables to 8 for a language such as English (eng) which permits elaborate onset and coda clusters.
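The scoring can be sketched as follows. The function names are hypothetical. Two points are inferences rather than statements from the text: a coda score of 0 for languages without codas (implied by the index of 1 for Yoruba), and the mapping from Onset and Coda scores to the Syllcat labels, which assumes the two classifications use the same notion of "common" onset clusters and that the nucleus plays no role in Syllcat.

```python
def syllable_index(onset, nucleus, coda):
    """Sum the Onset (0-3), Nucleus (1-2) and Coda (0-3) scores."""
    if onset not in range(0, 4):
        raise ValueError("onset score must be between 0 and 3")
    if nucleus not in (1, 2):
        raise ValueError("nucleus score must be 1 or 2")
    if coda not in range(0, 4):
        raise ValueError("coda score must be between 0 and 3")
    return onset + nucleus + coda

def syllcat(onset, coda):
    """Assumed correspondence between component scores and Syllcat labels."""
    if onset == 0 and coda == 0:
        return "Simple"              # nothing more elaborate than CV
    if onset <= 1 and coda <= 1:
        return "Moderately Complex"  # common CC onset and/or single coda
    return "Complex"                 # less common onsets or coda clusters

# Yoruba: maximal CV syllables. English: maximal scores on every component.
print(syllable_index(0, 1, 0), syllcat(0, 0))
print(syllable_index(3, 2, 3), syllcat(3, 3))
```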

In addition, data is imported from the relevant WALS chapters by Goedemans & van der Hulst on whether the language has been interpreted by these authors as having Fixed stress location (http://wals.info/chapter/14) or stress placement that is affected by syllable weight parameters (http://wals.info/chapter/16). These interpretations may differ from those preferred in LAPSyD.

Tone system

The commentary field "Tone" briefly summarizes what is known about any system of lexical or grammatical tone contrasts that the language has. For languages known or presumed to have no tone system this fact is also noted. Common notations such as H, M and L for High, Mid and Low are often used in this field, but the numeric notation used by many linguists working on Asian languages (5 = High, 1 = Low) is also sometimes employed when describing these languages.

In the "Tonecat" field the complexity of the tone system is expressed by use of one of four main category labels: None, Simple, Moderately Complex, Complex. Languages with a basic two-way contrast (which may include a limited use of downstep) are classed as having a Simple tone system, those with three contrasts are Moderately Complex. Complex tone systems have four or more contrasting tones. In addition, a few languages are noted as having a Marginal tone system. These cases are of two types: languages where tone distinctions are said to be relevant to only a small part of the lexicon, and those where the tone distinctions might be subsumed under an accentual contrast.
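The thresholds for the Tonecat labels can be summarized in a small Python sketch. The function name and the `marginal` flag are hypothetical; treating any count below two as "None" is an assumption, since a single tone provides no contrast.

```python
def tone_category(n_tones, marginal=False):
    """Map a number of contrasting tones to a Tonecat-style label."""
    if marginal:
        # Tone is restricted to a small part of the lexicon, or could be
        # analyzed as an accentual contrast.
        return "Marginal"
    if n_tones < 2:
        return "None"
    if n_tones == 2:
        return "Simple"
    if n_tones == 3:
        return "Moderately Complex"
    return "Complex"

for n in (0, 2, 3, 5):
    print(n, tone_category(n))
```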

Stress (accent) system

The commentary field "Stress" contains summary information on the presence and role of any reported stress distinction in the language. The principal intent is to report whether stress plays a role in distinguishing lexical (or grammatical) forms. The "Stresscat" field categorizes languages into three groups according to the role of stress: None, Minimal, Lexical. Those languages reported or presumed to make no perceptible differences in stress level between syllables are labeled "None". Languages in which there are noticeable differences in stress level, but the placement of stress in lexical items is predictable (or very largely so) are classed as having a Minimal role for stress. Languages in which stress placement distinguishes (at least some) lexical forms or is otherwise unpredictable are labeled as having a Lexical role for stress.

It is not always straightforward to decide when to characterize a language as having a system of tonal contrasts or one of accentual contrasts as the two are not sharply distinguished in reality, and some languages clearly have both. In LAPSyD a forced choice is made as to whether any given language has tone, stress or both but the notes help to identify where different opinions might be justifiable.

The feature description of segments

All of the vowel and consonant segments referenced in the database are given a unique featural description. This featural description makes it possible to search for all occurrences of segments with individual features or sets of features, and to look for co-occurrences, patterns of complementary distribution and other properties of the inventories at the featural level.
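A feature-based search of this kind can be sketched with sets in Python. The two toy inventories, their labels and the function names are invented for illustration; they are not LAPSyD entries, and the sketch makes no claim about how the LAPSyD search tools are implemented.

```python
# Hypothetical mini-inventories: segment -> set of feature names.
INVENTORIES = {
    "lang_a": {
        "p": {"pulmonic", "stop", "bilabial", "voiceless"},
        "b": {"pulmonic", "stop", "bilabial", "voiced"},
        "m": {"pulmonic", "nasal", "bilabial", "voiced"},
    },
    "lang_b": {
        "p": {"pulmonic", "stop", "bilabial", "voiceless"},
        "k'": {"ejective", "stop", "velar", "voiceless"},
    },
}

def find_segments(inventories, required, excluded=()):
    """Per language, list segments having all required and no excluded features."""
    required, excluded = set(required), set(excluded)
    return {
        lang: sorted(
            seg for seg, feats in inventory.items()
            if required <= feats and not (excluded & feats)
        )
        for lang, inventory in inventories.items()
    }

def languages_with(inventories, required, excluded=()):
    """List languages in which at least one segment matches."""
    hits = find_segments(inventories, required, excluded)
    return sorted(lang for lang, segs in hits.items() if segs)

print(find_segments(INVENTORIES, {"stop", "voiceless"}))
print(languages_with(INVENTORIES, {"ejective"}))
```

Because each segment is a set of features, a class of segments is simply every segment whose set contains the requested features.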

General principles

The features used to define the segments that are catalogued in LAPSyD are based on traditional phonetic terminology, such as that embodied in the charts of the International Phonetic Alphabet, including the distinctions, often basic ones, that are made by diacritical marks. More elaborate and precise classifications, such as that presented in Sounds of the World's Languages (Ladefoged and Maddieson 1996), cannot be systematically used due to the lack of precision in a good number of the available descriptions. For example, the standard term 'retroflex' is used for at least three distinguishable articulatory postures, which could be described as sublaminal post-alveolar, apical post-alveolar, and laminal post-alveolar. In many publications where the label 'retroflex' is used, there is no indication as to which of these might be the articulation used in the language.

However, within the limitations allowed by the source descriptions, LAPSyD aims to represent all the within-language contrasts encountered in each language with as much fidelity to cross-language comparison as possible. Each segment in a language's inventory of consonants and vowels has a distinct feature representation. This representation is the same as that assigned to segments in other languages which are judged to share the same classificatory characteristics. In three cases, explained more fully in the feature enumeration below, there is such a frequent lack of clarity in source descriptions that special features encoding the ambiguity are used. These features are unspecified coronal place, unspecified rhotic and unspecified mid (vowel height). Segments are assigned one of these features if it is unclear which particular coronal place, rhotic type or vowel height they typically present. Segments bearing one of these features do not form a coherent class; rather, it is uncertain which class they should be allocated to. These ambiguities most often arise when just a list of symbols is given without specific definitions of their intended value. Sometimes an informed guess can be made based on knowledge of traditions of scholarship for a particular language area or family, but in many cases ambiguity remains. There are also instances where segments are described using inherently ambiguous phonetic terms, such as denti-alveolar or vibrant.

Establishing a consonant or vowel inventory presupposes an agreed segmentation. Decisions must often be made as to whether a particular consonantal pattern represents a single complex segment or a sequence of two or more segments. Similar questions arise with vowels where alternative analyses might posit unitary diphthongs, or sequences of two independent vowels, or combinations of a vocalic approximant and a vowel. In such cases it is often distributional patterns that provide the best support for the choice of analysis. For example, [ts] and [tʃ] have very different distributions in English. [tʃ] occurs in syllable onsets where other stop+fricative sequences do not occur. [ts] occurs in syllable codas, where other combinations of stop+fricative do occur, especially other non-homorganic voiceless stop + [s] combinations (as in lax and lapse), and moreover the [s] in this position often represents a separate morpheme (as in lacks and laps). [tʃ] thus has a unitary character, whereas [ts] is naturally interpreted as a sequence of two separate consonants. As for vowels, a language that seems to allow all (or almost all) possible combinations of its simple vowels, such as Lavukaleve (lvk), is more readily judged to have sequences of independent vowels than to have a large inventory of diphthongs. Since in this language either the first or the second vowel in a VV sequence might bear stress, the preferred analysis is that each vowel forms a syllable. The fields in the database for comments on the vowels and consonants provide for brief discussion of such issues in individual languages. It is not uncommon for a different interpretation to be preferred in LAPSyD to the one offered in the literature.

In the following sections (¶ 4.2-4.6) the full feature set available to characterize the segments in LAPSyD is presented.

Consonant Features (some also applicable to vowels)
Features that create separate rows

Airstreams

pulmonic (= pulmonic egressive)

ejective (= glottalic egressive)

implosive (= glottalic ingressive)

click (= velaric ingressive)

The default airstream for speech is provided by the lungs. Thus almost all segments have the feature pulmonic. Segments assigned this feature have only a pulmonic airstream. Pulmonic airflow can be briefly interrupted or modified by constriction and vertical displacement of the larynx as in implosive stops and ejective stops, affricates and fricatives, or by expansion of the oral cavity, as in clicks. Segments whose prototypical production is as an ejective or implosive may be produced without sufficient larynx movement to actually create outward or inward airflow, but they are still classified as ejective or implosive. Stop segments described in the sources as ‘glottalized’ and notated with an apostrophe (e.g. /k’/ or /k̓/) are usually interpreted as ejectives, and segments described as ‘pre-glottalized’ and notated with a voiced stop symbol and some mark of glottalization (e.g. /ʔb/, /ˀb/ or /ʼb/) are usually interpreted as implosives. In both cases, this is especially likely if related or neighboring languages are known to have ejectives or implosives. Implosives are most often produced with voicing, at least at their release, due to pulmonic airflow through the descending glottis, but in some languages it has been argued that there are implosives of two types, with and without voicing. So-called ‘voiceless’ implosives maintain a complete glottal closure. For this reason a voicing feature is always assigned to implosive segments. In this context voiceless must be understood as implying a fully closed glottis, rather than open vocal folds. In clicks the back closure release always involves the pulmonic airstream, and there may be voicing and/or nasal airflow due to pulmonic air during the hold of the front oral closure. However, clicks are not assigned the feature pulmonic even in these cases. Instead, these properties are indicated by voicing features and the feature nasalized, as appropriate.
Clicks can also be produced with an ejective release of the back closure. Thus ejective and click can co-occur on a single segment. This is the only possible combination of two features from this set. There is a further discussion of some of the special considerations that apply to clicks in section 4.5 below.
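The co-occurrence restriction on airstream features can be expressed as a simple check. The function name is hypothetical, and the requirement that every segment carries at least one airstream feature is an assumption made for this sketch.

```python
AIRSTREAMS = {"pulmonic", "ejective", "implosive", "click"}

def valid_airstream(features):
    """Check the airstream features of one segment.

    Exactly one airstream feature is expected, except that ejective and
    click may co-occur (a click with an ejective back release).
    """
    present = AIRSTREAMS & set(features)
    if len(present) == 1:
        return True
    return present == {"ejective", "click"}

print(valid_airstream({"click", "ejective", "dental", "velar"}))
print(valid_airstream({"pulmonic", "click", "dental", "velar"}))
```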

Manners

stop

affricate

fricative

nasal

trill

tap/flap

unspecified rhotic

approximant

All features in this set are mutually exclusive. The manner features primarily represent the degree of stricture required for a segment's production: stop and nasal for full closure, fricative for narrow approximation, approximant for open approximation. Affricates combine a stop phase with a fricative release. Note that the feature nasal is only assigned to purely nasal consonants such as /m, n, ŋ/ produced with complete oral closure. The features trill and tap/flap describe consonants with intermittent or very brief closures. Trilling is aerodynamically driven and occurs only within critically narrow limits, so trills regularly vary with productions in which something other than an actual trill occurs. If a segment is reported to be produced with multiple contacts in some instances it is usually assumed that the reference pronunciation is a trill. Taps and flaps can be distinguished as motions of the moving organ orthogonal to or parallel to the contacted surface respectively, but this distinction is not reliably adhered to in most descriptions. Hence these classes of segments are collapsed in LAPSyD. The remaining feature in this set, unspecified rhotic, is assigned to segments which are “some kind of r-sound” but for which the sources available on the language do not permit a more complete description to be made. Often no more than the symbol <r> may be provided, or an ambiguous term such as 'vibrant' may be used. These segments would be assigned to one of the trill, tap/flap or approximant categories if the correct assignment was known. They do not constitute a distinct category on their own. Rather, the unspecified rhotic feature is a device to avoid misrepresentation in the face of ignorance.

Secondary Sources

sibilant

whistled

The two secondary source features describe factors that shape the acoustic spectrum of fricatives (and affricates) downstream from the primary constriction. In sibilants air is channeled through a primary constriction so that it strikes a downstream obstacle (typically the back of the upper teeth) creating strong high-frequency turbulence. The feature whistled is added for fricatives and affricates that have a configuration of the lips that adds a whistle-like resonance to the sound, as well as filtering higher frequencies.

Escape/Release/Approach features

lateral

nasalized (also for vowels)

prenasalized

pre-stopped

trilled-release

This set of features specifies non-default escape paths for the airflow, the default being oral only and central, as well as modifications taking place in the supra-laryngeal vocal tract of the approach or release of obstruents. Nasalized vowels or consonants, except for nasalized clicks, have simultaneous airflow through the oral and nasal cavities. In nasalized clicks air flows out through the nasal passage while the click mechanism blocks airflow through the oral cavity.

Features that create separate columns:

Places

bilabial

labio-dental

linguo-labial

dental

alveolar

unspecified coronal place

palato-alveolar (= laminal post-alveolar)

retroflex (= apical post-alveolar)

palatal

velar

uvular

pharyngeal

epiglottal

glottal

(known combinations: placed in a new column following the column of the non-labial place concerned)

bilabial+velar

bilabial+alveolar

bilabial+palato-alveolar (or retroflex) (as in Yélî Dnye)

bilabial+palatal

Secondary articulations

labialized

palatalized

velarized

pharyngealized (also for vowels)

epiglottalized (for 'sphincteric' vowels)

Voicing Properties

On consonant charts these create separate rows

For vowels and consonants

voiceless

voiced

breathy voiced

laryngealized

For consonants (in combination with 'voiceless' or 'voiced')

aspirated

pre-aspirated

breathy-release

pre-voiced (transcribed with a symbol sequence, e.g. /dt/)

Duration Properties

On consonant charts these create separate rows

For vowels and consonants ('normal' length is unmarked)

long

overshort

Some special considerations for clicks

Clicks are among the most complex of human speech sounds and many issues remain to be clarified in their description. Clicks of a given language are often given very dissimilar descriptions by different authors. This section explains some of the decisions reached in trying to harmonize their characterization.

All clicks require closures to be made at two locations within the oral cavity; each of these closures may be released abruptly or gradually, in the latter case creating an affricated release of the front or back closure. The click mechanism itself takes place entirely within the oral cavity so the larynx is free to adopt any configuration from complete closure (glottal stop) to a fully open voiceless position, and can also operate as an airstream initiator creating an ejective release of the back closure. The velum may be raised or lowered to prevent or permit nasal airflow. The timing of actions of the larynx and velum can vary both with respect to the oral articulations of the click and with respect to each other.

Place features: The front click closure locations are specified with the same place features as other consonants. For now, the view that back closures can be contrastively velar or uvular is accepted, although the difference has perhaps more to do with the audibility of the back release than its actual location. (When the back release is delayed its place of articulation is readily perceived as further back than where the back closure is formed at the onset of a click. The closure is retracted during the closure phase.) Hence all clicks are assigned two places of articulation. Clicks with a simple glottal closure released after the click are also given the place feature glottal. This is different from an ejective release.

Release features: clicks are assigned the feature affricate if the back closure is released with frication. Clicks lacking this feature have a stop-like back release which may be followed by aspiration. Sustained voiceless nasal airflow after a click release is described by the combination of features nasalized and aspirated. In most languages with clicks, if the click is bilabial, dental or alveolar lateral, the front closure is released slowly, creating some frication. Front closures of post-alveolar and palatal clicks are generally released abruptly. However, some languages may have abrupt lateral click releases or slow palatal click releases. This property was indicated in UPSID (Maddieson 1984) with a feature affricated click for the slowly released cases. Since no language seems to make a distinction at the same place of articulation between these two kinds of front release, this feature is not retained in LAPSyD. Some languages, such as |Nuu (ngh), distinguish ejective click releases with and without affrication of the release.

Vowel Features

Major classes

vowel

diphthong

triphthong

Vowel height

suprahigh

high

higher mid

unspecified mid

lower mid

low

Vowel backness

front

central

back

Vowel lip position

unrounded

rounded

lip-compressed

Subdivision of vowel height space

raised

lowered

Subdivision of vowel backness space

fronted

retracted

Ordering features for diphthongs/triphthongs

rising

lowering

backing

fronting

rounding

unrounding

nasalizing

devoicing

Superordinate class features

The following superordinate features are also used. These are designed to facilitate searches for familiar classes of segments otherwise only specifiable by a complicated set of feature specifications:

obstruent (all stops, affricates and fricatives)

liquid (rhotics and voiced lateral approximants)
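How such superordinate classes relate to the individual features can be sketched as below. The function and constant names are hypothetical. The feature set has no single "rhotic" feature, so the sketch assumes, purely for illustration, that the manners trill, tap/flap and unspecified rhotic identify rhotics; rhotic approximants are therefore not recognized by this sketch.

```python
OBSTRUENT_MANNERS = {"stop", "affricate", "fricative"}
# Assumption for this sketch only: these manners are treated as rhotic.
ASSUMED_RHOTIC_MANNERS = {"trill", "tap/flap", "unspecified rhotic"}

def superordinate_classes(features):
    """Derive the superordinate class labels for one segment's feature set."""
    features = set(features)
    classes = set()
    if features & OBSTRUENT_MANNERS:
        classes.add("obstruent")
    is_voiced_lateral_approximant = {"voiced", "lateral", "approximant"} <= features
    if is_voiced_lateral_approximant or features & ASSUMED_RHOTIC_MANNERS:
        classes.add("liquid")
    return classes

print(superordinate_classes({"pulmonic", "fricative", "sibilant", "alveolar", "voiceless"}))
print(superordinate_classes({"pulmonic", "approximant", "lateral", "alveolar", "voiced"}))
```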

Transcription

The transcription provided for all consonants and vowels in LAPSyD follows the usage set out in the Handbook of the IPA in most regards, within the limits allowed by the desire to standardize across the sources used. The following deviations are the principal differences from conventional IPA transcription.

As noted in the discussion of Vowel Inventory (¶ 3.2), /a/ is systematically used for a low central unrounded vowel, not a front one. To indicate a low front vowel the fronting diacritic is added, i.e. /a̟/. Mid vowels that are not specified as being in the higher mid or lower mid range are represented with the higher mid symbols and single quotation marks, e.g. /'e', 'o'/.

Consonants articulated in the coronal region are often rather imprecisely described in the sources. Those specifically noted as dental are transcribed with the dental diacritic (e.g. /t̪, n̪, s̪/, etc.), except that /θ, ð/ are taken to explicitly represent dental articulations and require a retraction diacritic to indicate further back articulation. Symbols such as /t, n, z/ without any diacritic are used to represent alveolar articulations. In those cases where a source uses a symbol for a coronal articulation without further specifying place of articulation, the transcription in LAPSyD uses single quotation marks to indicate "unspecified coronal place" (abbreviated "un-cor" on consonant charts), e.g. /'t', 'l', 'n'/. Rhotic segments whose manner of articulation is not defined (e.g. as a trill or an approximant) are transcribed with doubled letter r (i.e. as /rr/). A rhotic unspecified for manner might also be unspecified for place, in which case the transcription is /'rr'/.
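These conventions for coronals and rhotics can be summarized in a short Python sketch. The function names are hypothetical, and the sketch covers only the dental, alveolar and unspecified cases. It assumes the standard Unicode combining characters for the IPA dental diacritic (U+032A) and retraction diacritic (U+0320), and it treats "further back" for /θ, ð/ as alveolar.

```python
DENTAL = "\u032a"     # combining bridge below (IPA dental diacritic)
RETRACTED = "\u0320"  # combining minus sign below (IPA retraction diacritic)
INHERENTLY_DENTAL = {"θ", "ð"}

def transcribe_coronal(symbol, place):
    """Apply the coronal conventions described above to a base symbol."""
    if place == "unspecified coronal place":
        return f"'{symbol}'"
    if place not in ("dental", "alveolar"):
        raise ValueError("this sketch only covers dental and alveolar places")
    if symbol in INHERENTLY_DENTAL:
        # θ and ð are dental by default; a retraction diacritic marks alveolar.
        return symbol if place == "dental" else symbol + RETRACTED
    return symbol + DENTAL if place == "dental" else symbol

def transcribe_unspecified_rhotic(place_known):
    """Rhotic with undefined manner: /rr/, or /'rr'/ if place is also unknown."""
    return "rr" if place_known else "'rr'"

print(transcribe_coronal("t", "dental"), transcribe_coronal("t", "alveolar"),
      transcribe_coronal("t", "unspecified coronal place"))
print(transcribe_unspecified_rhotic(True), transcribe_unspecified_rhotic(False))
```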

References

Haspelmath, Martin, Matthew S. Dryer, David Gil & Bernard Comrie (eds.). 2005. The World Atlas of Language Structures. Oxford University Press, Oxford.

Dryer, Matthew S. & Haspelmath, Martin (eds.). 2013. The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig. Online at http://www.wals.info/.

International Phonetic Association. 1999. Handbook of the IPA. Cambridge University Press, Cambridge.

Gordon, Raymond G., Jr. (ed.), 2005. Ethnologue: Languages of the World, Fifteenth edition. SIL International, Dallas. Current online version: http://www.ethnologue.com/.

Maddieson, Ian. 1984. Patterns of Sounds. Cambridge University Press, Cambridge. Paperback edition, 2009.
