Computing in Creole Languages
The Web stimulates growth and development of historically oral languages
MARILYN MASON & JEFF ALLEN
Pidgins are language varieties created out of the necessity for people who do not know each other's language to communicate. A Pidgin is categorized as a Creole language once it becomes the mother tongue of a new generation born into that linguistic community.
Contrary to popular opinion, Creoles are not mere dialects; they are structured languages possessing complex morphological, syntactic and grammatical features.
They are typically both like and unlike the "parent" languages from which they develop. Haitians, for example, as well as the inhabitants of Seychelles, Martinique, Guadeloupe, St. Lucia and other places, did with (or to) French what many students of French could only have wished had been done long ago: they simplified it.
Verb stems stand alone and do not change, no matter what the tense, person, mood or other factor. The French parler becomes pale in Haitian Creole (HC) and remains pale in all situations. Haitians, like Africans, denote time, person and mood with "add-ons" between the subject and the verb stem. The present? Pale. Ongoing present? Ap pale. Future tense? Va pale or pral pale, depending upon how distant. Past tense? Te pale. Subjunctive? Ta pale. And so on. Totally methodical. Totally predictable.
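The marker system just described is regular enough to be written as a lookup table. The short Python sketch below is only an illustration: the function name `verb_phrase`, the tense labels used as dictionary keys and the subject pronoun mwen ("I") are assumptions made for this example, and the markers are limited to those named in the paragraph above.

```python
# Pre-verbal markers taken from the paragraph above. The dictionary keys
# are labels invented for this sketch, not linguistic terminology.
MARKERS = {
    "present": [],
    "ongoing_present": ["ap"],
    "future_va": ["va"],
    "future_pral": ["pral"],
    "past": ["te"],
    "subjunctive": ["ta"],
}


def verb_phrase(subject, stem, tense="present"):
    """Join the subject, the marker (if any) and the unchanged verb stem."""
    return " ".join([subject, *MARKERS[tense], stem])


# The stem "pale" never changes; only the marker in front of it does.
for tense in MARKERS:
    print(tense, "->", verb_phrase("mwen", "pale", tense))
```

Because the stem is invariable, adding another verb requires no new table entries, which is the predictability the paragraph points to.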
Most of the world's Pidgins and Creole languages developed out of contacts between colonial nonstandard varieties of one or more European languages and several non-European languages around the Caribbean Sea and in the Atlantic, Indian and Pacific Oceans during the seventeenth through the nineteenth centuries. But many have been birthed also from non-European languages in Africa, Australasia, North America, Greenland and other Arctic regions. According to a census taken in 1977 by linguist Ian Hancock, there are 127 Creoles or Pidgins in the world.
Census of Creoles and Pidgins
Linguist Ian Hancock's census counts 127 Creoles and Pidgins worldwide:
35 English-based
15 French-based
14 Portuguese-based
7 Spanish-based
5 Dutch-based
3 Italian-based
6 German-based
1 Slavic-based
6 Amerindian-based
21 African-based
10 Non-Indo-European-based (Asiatic)
4 Others
Tens of millions of people around the globe speak Pidgins and Creoles. Nigerian Pidgin-Creole alone has some 30 million speakers. Haitian Creole has 10 million in Haiti and in the diaspora; Afrikaans, three million; Jamaican Creole 2.5 million; and on down the line.
Geographical Distribution of Pidgins and Creoles
The Caribbean & South America
Anguilla, Antigua, Bahamas, Barbados, Belize, Dominica, Grenada, Guadeloupe, Haiti, Jamaica, Martinique, Montserrat, Netherlands Antilles (Aruba, Bonaire, Curaçao), St. Lucia, St. Kitts & Nevis, St. Vincent & the Grenadines, Trinidad & Tobago, Virgin Islands, Brazil, French Guiana, Surinam
Africa & Indian Ocean
Cameroon, Central African Republic, Congo, Guinea-Bissau, Nigeria, Sierra Leone, South Africa, Cape Verde, Comoro Islands, Mauritius, Principe, Reunion, Rodrigues, São Tomé, Seychelles
Australasia & Oceania
Australia, Hawaii, New Caledonia, Papua New Guinea, Philippines, Pitcairn, Singapore, Vanuatu
Diaspora Within a Diaspora
To more fully understand the crossroads at which Creole languages stand today, one needs to address the dynamics contained within the word diaspora: a) the breaking up and scattering of a people; b) people settled far from their ancestral homelands; c) the place where these people live (Merriam-Webster).
Many of the world's Creole languages were born as a result of the uprooting of slaves from Africa and their dispersion to plantations in the southern United States and to numerous islands in the Caribbean, Indian Ocean, Oceania and so on. In other words, the African Diaspora (let's call this "Diaspora I").
Over generations, inhabitants of these different corners of the globe came to view these outposts as their homelands. They sank in their roots and developed their unique cultures, languages, societies and identities. A variety of factors, however, have spawned a second, more recent wave of dispersions (let's call this "Diaspora II"), each with its own name and national or regional identity.
For example, because of political and economic instability, the Haitian Diaspora can be found in large concentrations in North America, as well as in the Dominican Republic, Bahamas, French Antilles, South America, France, Switzerland and even Africa. Because of volcanic activity beginning in 1995, the Montserratian Diaspora (about half of the island's population) can be found in North America, England and throughout the British Commonwealth. The search for education and jobs has produced large West Indian diasporas in the United Kingdom and in France.
Those born into Diaspora II in many ways hold the key to propelling Creole languages forward into the twenty-first century. Why? For a number of reasons. Access to education and technology is more widespread. Distance from the painful events which produced the dispersion enables children to embrace the cultural positives of an existence the parents are trying so hard to forget. The pressures of globalization are producing more of a sense of cultural pride. Many are not only learning to read and write what is still mostly an oral language for their parents, but they are training computers to respect their mother tongue. And so on.
Young Haitian-Canadians Sandra and Alexander Prophète, for example, do much of the text processing and Web site maintenance (www.konbitayisyen.com) for their father, Joseph Prophète, an HC advocate, writer and publisher. Computer-literate Sandras and Alexanders are scattered across the globe, using the technologies that are part of their everyday Diaspora II lives to help prepare Creole-language documentation and curriculum materials that can serve as an on-line backbone for educational systems back home in Diaspora I.
Haitian-Canadian "cyberkids" Alexander and Sandra Prophète
Language Policies Emerge
Creole is an official language mandated by the Constitution in Haiti, Seychelles and Vanuatu (Melanesia). Papiamentu is being proposed as an official language in the Netherlands Antilles. As Creole languages shift to the realm of written as well as oral means of communication, dictionaries, grammars and a growing literature base are being produced.
A major step toward institutionalizing Creole as a language of instruction throughout France's departments and territories was taken in 2001 when the French Education Ministry extended its Certificat d'Aptitude au Professorat de l'Enseignement du Second degré program to include Creole as an official regional language of France.
As a result, government-funded centers have been set up in Martinique, Guadeloupe, French Guiana and Reunion to train teachers and to develop high-quality, systematized Creole-language curriculum materials.
Haiti's 1987 Constitution stated that Creole is the only language which cements all Haitians together, declared both Creole and French as official languages and established a Haitian Academy to assure the normalization and scientific development of Creole. Earlier, in 1979, two important language-related laws were passed. The first law authorized the use of Creole in schools as both a language and a subject of instruction; the second established Creole writing norms.
The Melanesian island nation of Vanuatu has three official languages: Bislama (a Creole language derived from nineteenth-century South Seas Pidgin-English), English and French. Bislama, however, has been designated by Vanuatu's 1983 Constitution as its national language. With regard to the use of Bislama in education, Article 6 states: "The Minister, acting on the advice of the Director-General, may by order determine that one or more specified subjects at a specified school or schools are to be taught to students in the local vernacular or Bislama." Such a national policy, which allows for only very restricted use of Bislama in schools, does not bode well for the survival of this Creole into the Digital Age.
Seychelles
In the vanguard of officialization and full integration of Creole into the national life is the tiny (86,000 inhabitants) nation of Seychelles in the Indian Ocean, where Creole is spoken in the legislature and throughout the court system as well as in commerce and all levels of social interaction. Since 1981, the three official languages (Creole, English and French) have been systematically introduced as languages of instruction in schools. For the first four years of primary school, Creole is the only language of instruction. From the fifth year, English is gradually introduced as a second language. Once both Creole and English have been mastered at primary school level, French is introduced as a third language at the secondary level.
Creole is first among Seychelles' three national languages
To encourage the standardization and normalization of spoken and written Seychelles Creole (seselwa), the Seychelles government created the Creole Institute (Enstiti Kreol) in 1986. Its mandate has been to trigger, promote and monitor a Seychelles Creole literature, to assist in the teaching of Creole in primary schools and adult literacy classes and to promote the use of the official orthography (codified and made official in 1981) throughout the school system, the media and all government ministries. Readers can satisfy their literary needs with historical novels, detective stories, science fiction and tales of the paranormal in addition to the Bible, research materials, reports and government leaflets. Local newspapers and magazines also print the majority of their articles in Creole, and Creole language television programming benefits from captioning in Creole.
Internet Spurs Development of Written Creole Languages
The Internet has introduced new growth dynamics to what have historically been oral languages. Creole speakers around the globe are learning how to read and write their mother tongues in order to be able to exchange their ideas via e-mail and in Creole language on-line forums.
An example of this is Fowòm Ayisyen, an HC on-line forum moderated by Guy Antoine of Windows on Haiti. At its inception in 1999, participants wrote in many hybrid orthographies. To put some order into the chaos, Antoine began to insist that the official spelling norms established in 1979 by the Haitian government be observed. Since then, participants have worked at conforming their writing to those standards, but another logistical problem persisted: lack of uniformity in the representation of accented vowels which frequently occur in HC. Some participants placed apostrophes in front of the vowels; some after; some omitted the accents altogether. Still others used capitalization to denote accented vowels. Once Antoine posted guidelines for the cross-platform production of such characters via computer keyboard, one poster after another adopted the new procedures. Over the past year, it has been amazing to witness how much more uniform and consistent the writing of HC has become, a development made possible on a more global scale because of the Internet.
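The kind of cleanup that such guidelines make unnecessary can be sketched in a few lines of Python. This is a hypothetical illustration, not Antoine's guidelines or any tool used on the forum: the function name `normalize_accents`, the three style labels and the assumption that every marked vowel takes a grave accent (à, è, ò) are all invented for the example. It also ignores the fact that apostrophes can mark contractions in HC, so a real tool would need more context.

```python
import re

# Assumption for this sketch: a marked a, e or o always means the grave form.
GRAVE = {"a": "à", "e": "è", "o": "ò"}


def normalize_accents(text, style):
    """Rewrite one ad hoc ASCII convention as accented vowels.

    style is a hypothetical label for the convention the writer used:
      "after"  - apostrophe typed after the vowel   (fo'wo'm)
      "before" - apostrophe typed before the vowel  (f'ow'om)
      "caps"   - capital letter inside a word       (fOwOm)
    """
    if style == "after":
        return re.sub(r"([aeo])'", lambda m: GRAVE[m.group(1)], text)
    if style == "before":
        return re.sub(r"'([aeo])", lambda m: GRAVE[m.group(1)], text)
    if style == "caps":
        # Only capitals that follow a lowercase letter are treated as marks,
        # so sentence-initial capitals are left alone.
        return re.sub(r"(?<=[a-z])([AEO])",
                      lambda m: GRAVE[m.group(1).lower()], text)
    raise ValueError(f"unknown style: {style}")


for sample, style in [("fo'wo'm", "after"), ("f'ow'om", "before"), ("fOwOm", "caps")]:
    print(sample, "->", normalize_accents(sample, style))
```

Each convention needs its own rule, and none of them can recover accents that were simply omitted, which is why agreeing on one way of typing the characters was the more practical fix.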
Guy S. Antoine, moderator of Fowòm Ayisyen
But the Internet has encouraged something even more significant than the written normalization of one particular Creole language.
A fellow from Guadeloupe gravitated to the Haitian Forum because of its standards for excellence in the writing of Creole and the relevance of its subject matter. At first, he did not even follow the writing standards for Guadeloupe Creole (GC), so it was extremely difficult for Haitians to read his posts. Then he got the bright idea to try to follow the orthographical norms for HC to write GC. Bingo! Increased communications have flowed back and forth, and there now exists an increased sense of solidarity between two different Creole societies in the Caribbean. This puts feet to something which had only been theory before: what if Creole speakers in specific regions and whose mother tongues shared the same lexical base were to harmonize the written representations for their languages? Would that not open doors more widely to the cross-fertilization of literature and curriculum materials? This is beginning to happen in the Indian Ocean region. Why not in the Caribbean?
Existing Creole Language Technologies
Speech and language technologies for Creole languages include translation systems, speech recognition and text-to-speech systems, optical character recognition applications, spell-checkers and on-line dictionaries. Some of these development efforts have been for academic research, others for military applications and others for the distribution of commercial tools and services.
Automatic translation systems. A few prototype and research-based automatic translation systems are known to have been developed for Creole languages. The earliest known project, conducted by Patrice Naze (1985) in the framework of his doctoral research at the Université de Provence (Aix-en-Provence, France), led to the creation of a basic English-to-Reunion Creole machine translation (MT) system, written in Prolog, called DIALOG-KREOL. Many Creole languages contain invariable verb forms, independent time-mood-aspect markers, a limited number of relative pronouns and a lack of connecting prepositions. These linguistic factors certainly favored the development of this transfer-based MT program.
Another translation system for Creoles was undertaken by the DIPLOMAT project at the Language Technologies Institute of Carnegie Mellon University (CMU). This work focused on rapid-deployment, wearable, bidirectional speech-to-speech translation systems. It developed techniques for quickly producing various speech- and text-based systems between English and several other languages: Croatian, HC, Korean, Spanish and French.
The systems created within DIPLOMAT are not comparable in quality to commercial translation systems on unrestricted tasks in well-documented languages, but they are more than adequate for limited-domain scenarios and rapid prototyping. The HC DIPLOMAT project lasted from November 1996 to November 1998 during which English-to-HC and HC-to-English example-based machine translation (EBMT) systems were completed. These systems and applications have been demonstrated at several conferences and on-site at the Language Technologies Institute/Center for Machine Translation. About 20 conference papers and journal articles have been completed describing the findings of the DIPLOMAT project research.
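For readers unfamiliar with EBMT, its retrieval step can be sketched as follows. This is not the DIPLOMAT implementation: the three-entry example base, the function name `translate_by_example` and the use of Python's `difflib` string similarity are assumptions chosen to keep the sketch short. A real EBMT system works from a large aligned corpus and also recombines matched fragments, which this sketch does not attempt.

```python
import difflib

# Toy example base of aligned English/HC pairs (illustrative only).
EXAMPLES = {
    "good morning": "bonjou",
    "thank you": "mèsi",
    "how are you": "kijan ou ye",
}


def translate_by_example(sentence, cutoff=0.6):
    """Return the translation of the closest stored example, or None."""
    key = sentence.lower().strip(" ?!.")
    match = difflib.get_close_matches(key, list(EXAMPLES), n=1, cutoff=cutoff)
    return EXAMPLES[match[0]] if match else None


print(translate_by_example("Thank you!"))
print(translate_by_example("How are you?"))
```

The `cutoff` parameter controls how far an input may drift from a stored example before the function gives up, which is one reason such systems do best in limited domains.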
SMART Communications, Inc., investigated the development of an English-HC MT system several years ago for Haitian diaspora communities in the United States in contexts where children go to school in English, but the parents are more or less Creole monolingual.
Due to the ongoing commitment of US military presence in several countries, including Haiti, a focus is placed upon the local languages and thus on HC. The US Army Research Labs (ARL) has developed a system called the Forward Area Language Converter (FALCON), consisting of a laptop computer and accompanying software, to enable a user with no foreign language training to translate foreign language documents. FALCON permits US armed forces to translate and determine the military significance of enemy documents by processing the text of high priority languages. The FALCON project also led to MT evaluation methods.
Mason Integrated Technologies investigated the feasibility for commercial development of an English/Haitian Creole MT system with the intention of porting the technology to other Creole languages; however, the lack of financing prevented such a program from materializing.
A promising, more cost-effective means for Pidgins and Creole languages to enter the twenty-first-century world of automated translation or computer-aided translation (CAT) solutions is for translators themselves to add Pidgin-specific or Creole-language-specific translation pairs to the highly-expandable user-dictionary-building function of the TRADOS 5.5 Translator's Workbench.
Electronic dictionaries. Electronic dictionaries are often created and used as translation support tools. A dictionary might also be used to produce word-for-word renditions, that is, sequences of individually translated words in the syntactic order of the original texts. Many such electronic dictionaries are offered on the Web and are often referred to as Web translators or on-line translators because they provide word-for-word translations.
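A word-for-word rendition of this kind amounts to a dictionary lookup applied to each word in turn. The Python sketch below illustrates the idea with a three-word toy glossary; the names `GLOSSARY` and `word_for_word` and the bracket notation for unknown words are assumptions made for this example and do not describe any of the on-line tools mentioned here.

```python
# Toy English-to-HC glossary, for illustration only.
GLOSSARY = {"i": "mwen", "speak": "pale", "creole": "kreyòl"}


def word_for_word(text, glossary):
    """Translate each word independently, keeping the source word order."""
    rendered = []
    for word in text.lower().split():
        # Words missing from the glossary are passed through in brackets.
        rendered.append(glossary.get(word, f"[{word}]"))
    return " ".join(rendered)


print(word_for_word("I speak Creole", GLOSSARY))
print(word_for_word("I speak slowly", GLOSSARY))
```

Because the output keeps the syntactic order of the source, the result is only as good as the structural similarity between the two languages for the sentence at hand.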
Verbalis' on-line dictionary for Cape Verdean Creole
One such electronic dictionary is Dictionnaire créole en ligne: Traducteur Créole Réunionnais/Francais; Français/Kréol rényoné, which has been advertised as "Le Premier Traducteur Créole/Français Français/créole en Ligne sur Internet!" (The first [Reunion] Creole/French and French/[Reunion] Creole Online translator!).
Papiamentu is a Creole language widely spoken on Bonaire and Curaçao in the Netherlands Antilles. Don Amaro's Project Papiamentu in The Netherlands has produced a Web translator to translate words or texts from English to Papiamentu and vice versa. This translating dictionary uses a database engine powered by www.g-art.nl. It also uses server-side scripts to manage the various exceptions between the two languages. In addition, Amaro has developed a Papiamentu rhyme finder, which is available on-line.
An example of a phrase translator is Haitian Creole: Simplified, Version 1.1. The on-line trial version translates words and short phrases from English to HC. The full version also translates from Creole to English. Several other on-line dictionaries are included in the resources lists with this article.
Corpus building projects. Albert Valdman of the Creole Institute at Indiana University has conducted Creole language corpus building projects with funding from the American National Endowment for the Humanities (NEH), the US Department of Education and the AUPELF-UREF French government agency for francophone universities. Some Creole Institute materials are available through the Institute's Web site.
In addition, The Corpus of Written British Creole has been compiled at Lancaster University (UK) with financial support from the British Academy. Most of the searching for texts, permission clearance and text input work was carried out in 1995; text tagging and validation were performed in 1998.
Also, Fundashon pa Planifikashon di Idioma (Papiamentu Language Planning Foundation) is building a corpus of Papiamentu (soon to be made available on CD-ROM). FPI serves as a model for the cross-fertilization of effort and data between "grassroots" cultural organizations and Polderland, a commercial company. FPI has been collaborating with governmental, non-governmental and commercial entities to stabilize and modernize Papiamentu orthography, develop spell-checking software and encourage the production of more texts in consistently spelled Papiamentu.
Language learning software. Transparent Language has developed an English/HC language learning software program that is included in its 101 Languages of the World product. The ARL FALCON project also led to research focused on repurposing sentence-aligned bilingual corpora, the core resources used to build MT systems, for computer-assisted language learning (CALL) projects and prototypes known as STARling and CONDOR.
Transparent Language's HC language learning software
Verb conjugation tool. Verbix provides two verb conjugation tools for a number of French-, Portuguese-, Dutch- and English-lexicon Creoles: a shareware product (Verbix 4.2 for Windows) and a free Web portal service, which contains a subset of the tool. The Verbix Creole Web site provides a list of the top ten verbs in each of the Creoles. We have analyzed the entries provided for HC and St. Lucian Creole. For HC, we compared each word of the Verbix list with the CMU electronic database of HC, which contains 1.2 million words from texts obtained from multiple sources and fields, along with their frequency of use. The first verb in the Verbix list appears near the bottom of the 15 most frequent words in the CMU database. All of the other entries in the Verbix verb list are either low-to-medium-frequency words or do not appear at all in the CMU data. The Verbix verb conjugation charts also contain several errors at the lexical level, including accent placement and contracted forms. Our analysis suggests that this may be an automatically generated database that has not been sufficiently checked by native speakers.
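A comparison of this kind reduces to looking up each listed verb's rank in a frequency table built from a corpus. The sketch below shows the idea in Python on a six-token toy corpus; the function names `frequency_ranks` and `check_list` and the sample data are hypothetical, and neither the CMU database nor the Verbix list is reproduced here.

```python
from collections import Counter


def frequency_ranks(tokens):
    """Map each distinct token to its rank (1 = most frequent)."""
    ordered = [word for word, _ in Counter(tokens).most_common()]
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def check_list(candidates, tokens):
    """Return each candidate's corpus rank, or None if it never occurs."""
    ranks = frequency_ranks(tokens)
    return {word: ranks.get(word) for word in candidates}


# Toy corpus and candidate list, for illustration only.
corpus = "te pale te ale te pale".split()
print(check_list(["pale", "manje"], corpus))
```

A candidate that comes back with a low rank or with `None` is one that a frequency-based "top ten" list would not be expected to contain.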
Spell-checking and orthography conversion. Educa Vision in Florida developed the first spell-checker program for HC, which is compatible with WordPerfect and Microsoft Word. In addition, they have developed EKED, an English<>HC computer-accessible dictionary for IBM compatibles.
In July 2002, Cambire Software, Inc., announced to a potential beta testing group the availability of another spell-checker for HC, embedded within an assortment of other Microsoft Word- and Microsoft Outlook-only compatible macro tools, to be marketed under the name Creole Toolkit, version 1.02. However, initial feedback from Creole writers who have tested the product indicates that the dictionary is quite limited (5,500 words) and, in order for these tools to work, a specific document must be opened up in either the Creole-specific or English-Creole-specific template. Documents dependent upon both these Cambire-supplied Word templates or upon any other normal or special Word templates cannot be open at the same time.
Polderland Language & Speech Technology in Nijmegen, The Netherlands, has developed a spell-checker for Papiamentu in conjunction with FPI. Even though Dutch is the official primary language, Papiamentu is widely used on the Leeward Islands of the Dutch Caribbean. SpèlChèk can be used within Microsoft Word and uses a dictionary of more than 35,000 words, ensuring that the software knows virtually all of the frequently used words in Papiamentu. In case a word is unknown or misspelled, the spell-checker provides suggestions, based on a set of well-chosen linguistic rules which use phonological knowledge. Polderland has also delivered a spell-checker and hyphenator for Afrikaans.
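The general shape of a suggestion-producing spell-checker can be sketched briefly. Note the substitution: SpèlChèk is described as using linguistic rules based on phonological knowledge, whereas the Python sketch below swaps in plain string similarity from the standard `difflib` module, and its four-word lexicon and function name `check_word` are assumptions made purely for illustration.

```python
import difflib

# Toy Papiamentu lexicon (illustrative; a real checker holds tens of
# thousands of entries).
LEXICON = ["bon", "dia", "danki", "kas"]


def check_word(word, lexicon=LEXICON, max_suggestions=3):
    """Return (is_known, suggestions) for a single word."""
    lowered = word.lower()
    if lowered in lexicon:
        return True, []
    suggestions = difflib.get_close_matches(
        lowered, lexicon, n=max_suggestions, cutoff=0.6
    )
    return False, suggestions


print(check_word("danki"))
print(check_word("danky"))
```

String similarity treats all letters alike; a phonologically informed checker can rank suggestions by how plausibly a speaker would confuse the sounds, which matters for a language whose spelling is still being stabilized.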
Orthography conversion tools are also a type of spelling application. The prototype of a flexible, semi-automated process for converting texts written in earlier HC orthographies to conform to the Institut Pédagogique National (IPN) orthography (the legal standard established by the Orthography Law of 1979) was initially completed in 1991 by Marilyn Mason. The resulting prototype software was the Mason Method of Haitian Creole Orthography Conversion (MMHCOC), which allowed for the conversion from one orthography to another, such as Pressoir-Faublas text to IPN text, IPN text to McConnell text and so on.
The process has matured from being semi-automated, taking two hours to convert a 250-page book, to fully automated, requiring less than a minute to convert that same 250-page book. After being tested on text sets that had not been used to train the system, MMHCOC was renamed CreoleConvert. It has been used to automatically convert the outdated orthographies of samples from periodicals such as Boukan, Jé Nou Louvri and Chanmòt la to the IPN orthography. It has more recently been used to update the spelling system of the transcript for the HC version of the Jesus Film and many other documents such as the Haitian Constitution and Haitian government literacy materials. CreoleConvert runs within standard software applications in Macintosh and Windows environments. Mason has demonstrated it in Haiti, the United States, England, France, Greece, Seychelles, Jamaica and elsewhere.
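The internals of CreoleConvert are not described in this article, but the general idea of rule-driven orthography conversion can be sketched as an ordered list of rewrite rules applied to a text. In the Python sketch below, the function name `convert` and the rule table are hypothetical; the single rule shown (replacing é with e) is a placeholder assumed for illustration and is in no way a complete or authoritative description of the differences between earlier orthographies and IPN.

```python
# Placeholder rule table: (older spelling, target spelling).
# A real converter would need many ordered, context-sensitive rules.
RULES_OLD_TO_TARGET = [
    ("é", "e"),
    ("É", "E"),
]


def convert(text, rules):
    """Apply each (old, new) rewrite rule in order to the whole text."""
    for old, new in rules:
        text = text.replace(old, new)
    return text


print(convert("Jé Nou Louvri", RULES_OLD_TO_TARGET))
```

Keeping the rules in a data table rather than in code is one way a converter could be pointed at a different source or target orthography without rewriting the engine.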
HC text orthographically converted to conform to official IPN standard by CreoleConvert
In Seychelles, Marilyn Mason demonstrates CreoleConvert to Creole speakers from Indian Ocean and Caribbean regions
Although the CreoleConvert engine was originally designed to convert orthographies within only HC texts, in September 2002 Mason developed a variant to create a preliminary prototype program to convert GC texts to standard HC. This was done in response to the real-life situation presented to the Haitian Forum, where a Guadeloupean was attempting to correspond in Creole with Haitians.
Image/text processing. An optical character recognition (OCR) software application for scanning existing texts of varying age and print quality was developed by Marilyn Mason in the early 1990s, originally to test the MMHCOC orthography conversion program. The prototype resulted in the current OCR tool called CreoleScan for HC. CreoleScan runs within standard software applications in Macintosh and Windows environments and has been beta-tested in Haiti (Jounal Libète) and Florida (West Palm Beach Multilingual Education Department). CreoleConvert and CreoleScan are currently available as a value-added service to clients of The Creole Clearinghouse.
Scanned image of printed text in older orthography (Jounal Boukan 1977)
CreoleScan OCR version of scanned image
The ARL has been developing a suite of tools for scanning paper documents in HC, running OCR on them and either translating the text into English or bypassing translation and identifying proper names in the source text. This customized multilingual document workflow process provides for the evaluation of captured documents, intercepted messages and a variety of other materials. The CMU DIPLOMAT project conducted some OCR research in collaboration with the ARL FALCON system. The DIPLOMAT OCR spell-checking application is a post-OCR error correction module.
Speech-based systems. The DIPLOMAT project benefited from speech technology techniques successfully used to build an HC text-to-speech system, an HC speech recognition system and some speech research by-products. A major speech data collection campaign led to a total of 149 HC speakers (89 male, 60 female) who were recorded from April 1997 to March 1998 in Port-au-Prince, Haiti; Paris, France; New York City; and Pittsburgh, Pennsylvania. Since rapid deployment was central to the project, read speech was used because this is usually much faster and less labor-intensive to develop and implement than spontaneous speech efforts. To support generation of high-quality prompts, corpus-based techniques were developed for selecting phonetically representative sentences as well as a process for acquiring pronunciation data from native speakers.
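One common way to pick a small, representative set of prompt sentences from a corpus is greedy coverage: repeatedly choose the sentence that adds the most sound units not yet covered. The Python sketch below illustrates that general technique and is not the DIPLOMAT procedure, whose details are not given here. As a loudly labeled simplification, it uses the letters of a sentence as a stand-in for phonemes; the function names `units` and `select_prompts` are invented for the example.

```python
def units(sentence):
    """Stand-in for a phoneme inventory: the set of letters in a sentence."""
    return {ch for ch in sentence.lower() if ch.isalpha()}


def select_prompts(sentences, k):
    """Greedily pick up to k sentences that maximize new unit coverage."""
    covered, chosen = set(), []
    remaining = list(sentences)
    for _ in range(k):
        best = max(remaining, key=lambda s: len(units(s) - covered), default=None)
        # Stop early when no remaining sentence adds anything new.
        if best is None or not (units(best) - covered):
            break
        chosen.append(best)
        covered |= units(best)
        remaining.remove(best)
    return chosen


# Toy "corpus" of letter strings, for illustration only.
print(select_prompts(["aaa", "ab", "cd", "abc"], k=2))
```

With a real pronunciation lexicon, `units` would return phonemes or phoneme pairs instead of letters, and the same loop would apply unchanged.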
One of the most important findings, described by Christopher Hogan and Jeff Allen, is the concept of phoneme conflation strategies that can be used for HC speech recognition systems. Human-computer interactivity has also been addressed for users of the DIPLOMAT speech-to-speech system interface.
More recently, using the FestVox tools from CMU, voices have been built for Festival, an open-source, free for any use, general multilingual speech-synthesis system for more than 40 languages including HC.
Bernard Filliatre and Robert Racca at Université Antilles Guyane have developed a voice synthesis system for GC using a neural network technique to associate phonemes with texts, producing sounds with a PC audio card.
One recent academic research effort for speech data and spoken language corpora collections for Creole languages is a seed grant project entitled "Speech Warehouse: A Repository of Linguistically Varied, Prosodically Transcribed Spoken Language Data." This project includes the identification and collection of data from three Caribbean English-lexicon Creoles (Sranan, rural Guyanese and Belizean).
Last, a CD-ROM, complete with sound samples and pictures for Ghanaian Pidgin English, has been produced, and information is available on-line.
The Next Challenges
Pidgins and Creole languages face a full set of new issues in approaching localization within the electronic age. However, the tools and applications are limited because of the economic realities of most Creole-speaking nations and because of some issues related to the normalization and standardization for each Creole. All image-, text- and speech-based systems will remain hindered in their efficiency for these languages unless the tools are conceptualized, developed, harnessed and used in ways that not only help to standardize the language but also serve effectively in the real-life contexts of native speakers. Jumping to the conclusion that the standard for one Creole can apply to all others can lead to significant problems.
It is possible, however, to implement low-cost techniques to significantly advance the design of natural language processing (NLP) tools for "minority" languages without requiring significant amounts of money to produce new minority language applications.
Despite these possibilities, however, the next generation of quality NLP tools for minority and vernacular languages will not be the automatic result. There is still a great need for "grassroots" involvement and more integrated cultural contextualization in the design and development of NLP tools and workflow processes for languages that are so related to cultural identity in an increasingly technical world.
Localization of applications and data resources remains an important task in extending the human language technology and language engineering fields to such languages. Creole languages are currently being adapted to meet the technological challenges of the twenty-first century. What steps is the localization industry taking to embrace the full potential of such an emerging global market? 
Marilyn Mason is founder and managing consultant of The Creole Clearinghouse, a founding director of the Professional Association for Localization and founder of Mason Integrated Technologies. She can be reached at MariLinc@aol.com
Jeff Allen leads the technical documentation department at Mycom International and is a member of the editorial board of MultiLingual Computing & Technology. He can be reached at jeff.allen@free.fr
Note: Montreal's Creole Language Month (October 2002) poster is used here by permission of KIPKAA (Committee to Support Literacy Initiatives in Haiti), the event sponsor, which commissioned the original Patrice Piard painting. The image from Jounal Boukan 1977 is also used by permission.
This article is reprinted from #53 Volume 14 Issue 1 of MultiLingual Computing & Technology published by MultiLingual Computing, Inc., 319 North First Ave., Sandpoint, Idaho, USA, 208-263-8178, Fax: 208-263-6310.
January/February, 2003