Fsf-friends March 2004

fsf-friends@gnu.org.in

28 participants
38 discussions

FLOSS Policy Support project launched (fwd)
by Frederick Noronha (FN) 27 Mar '04

---------- Forwarded message ----------
[from International Institute of Infonomics ICT Weekly of March 26, 2004]

FLOSS-POLS PROJECT LAUNCHED

The 'Free/Libre/Open Source Software: Policy Support' project, coordinated by MERIT, was launched on March 1. FLOSS-POLS will work on three specific tracks: government policy towards open source; gender issues in open source; and the efficiency of open source as a system for collaborative problem-solving.

See: http://flossproject.org/flosspols/

_______________________________________________
s-asia-it mailing list
s-asia-it(a)lists.apnic.net
http://mailman.apnic.net/mailman/listinfo/s-asia-it

a must read
by Mahesh T. Pai 26 Mar '04

Just published: `Free Culture', a book by Larry Lessig. Available for free, under a free-of-cost license, from http://www.free-culture.cc/freecontent/ or for a price from Penguin.

He admits that his ideas were propounded by RMS a few decades back.

Four hours after publication, I am at page 30 of the PDF.

--
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+
Mahesh T. Pai, LL.M.,
'NANDINI', S. R. M. Road,
Ernakulam, Cochin-682018,
Kerala, India.
http://paivakil.port5.com
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+

Wonder if we could start recirculating such a note... to CMs and MPs...
by Frederick Noronha (FN) 25 Mar '04

Sir,

The Free Software Foundation is a not-for-profit organisation involved in promoting freedom in software. We write this to request you to consider using Free Software for IT programmes being implemented by your government, thus saving money and building a strong indigenous resource base.

The main advantages of using Free Software, compared to proprietary software, are:

* Free Software allows the user to modify and customise it. Thus, the programs can be modified by the user, if he has the know-how, or by a programmer hired by the user.

* The software can be installed on any number of machines, and separate licences are not required.

* The 'source codes' (i.e. the original programs written in a human-readable language) are available, and can be used as very good examples to teach students. This is a tremendous help for budding programmers, who get to see programs that have been written and refined by some of the world's best programmers.

* The GNU/Linux operating system, which is the leading Free operating system today, is very stable and the natural choice for most network servers.

* The GNU/Linux operating system and applications running on it are as user-friendly as any other. It would be possible for anyone familiar with computers to start using applications in GNU/Linux with a few days' practice.

A detailed write-up explaining the advantages of Free Software over proprietary software is enclosed.

Hoping for a favourable response,

Yours sincerely,

for Free Software Foundation

[AsiaOSS-RT:00211] Awareness slides on AsiaOSS - open content for your own use. (fwd)
by Frederick Noronha (FN) 25 Mar '04

---------- Forwarded message ----------

Dear AsiaOSS RT members,

I made a couple of presentations open to the public here in Malaysia last week, on the subjects of:

1) Background Information on the AsiaOSS3 Symposium: organisers, aims, projects, future plans etc.

2) Open Source Trends in Asia (based on the economies' slides presented in Hanoi). Includes some discussion of business models.

I hope that these may be of use to other RT members to promote participation in the Symposium in your own economy. They are all under an open content license, so you can easily modify them for your own purpose (OpenOffice format).

The slides are downloadable from http://opensource.mimos.my/?main=tech_wshop/slides_2004/slides_2004

Regards,
Imran

--
Imran William Smith, Open Source R&D, MIMOS Berhad, Malaysia
http://opensource.mimos.my - http://www.asiaosc.org
Please avoid sending me Word or PowerPoint attachments.
See http://www.fsf.org/philosophy/no-word-attachments.html

GPL open source projects from INDIA
by Mujibur Rahman Shaik 25 Mar '04

Hello All,

Where can I find information about the projects, and the people working on them, that are developing GPL open source software in India? Is there any means of collaboration between these teams, so that they can help each other in bringing up the projects?

Regards
Rahman

Re: [India-egov] Microsoft to conduct five e-Gov seminars in India, 10 March 2004(fwd)
by Frederick Noronha (FN) 25 Mar '04

Hi Atul,

I certainly think this should be done. You have my full support. Copying this to FSF-FRIENDS.

We can locate the MPs' names (even if the Lok Sabha has been dissolved) from the parliament.nic.in site. At least in the Rajya Sabha for now.

In addition, we could also get friendly MPs to raise parliamentary questions. I know a couple, and FLOSS issues have already been raised in the Indian parliament, thanks to them.

FN

On Mon, 15 Mar 2004, Atul Asthana @ home wrote:

> Dear Frederick,
>
> Was planning to start a campaign to write to MLAs/MPs, Ministers,
> Bureaucrats stating advantages of OSS over Microsoft.
> Do you think that this campaign can be initiated through any of the
> mailing lists (you and I are on, or otherwise)? That is, we encourage
> people on the mailing lists to write to their representatives,
> bureaucrats known to them etc?
> It's not a difficult task to find names and addresses of
> MLAs/MPs/Bureaucrats.
>
> Regards.
> Atul Asthana
> atulasthana(a)gmx.net
> 2004-03-15 09:51:53
>
> First they ignore you, then they laugh at you, then they fight you, then you win.
> --Mahatma Gandhi
>
> ======= At 2004-03-10, 13:54:00 Frederick Noronha (FN) fred(a)sancharnet.in wrote: =======
>
> >---------- Forwarded message ----------
> >
> >10 March, 2004 [ www.i4donline.net ]
> >
> >Microsoft to conduct five e-Gov seminars in India
> >Microsoft Corporation India today announced a series of five
> >e-Governance seminars targeted at bureaucrats and Government
> >officials. The first seminar in the series will be held at Bhopal.
> >Targeted at bureaucrats and government officials, the seminars
> >will be conducted across Madhya Pradesh, Haryana, Rajasthan,
> >Orissa and West Bengal.
> >http://www.thehindubusinessline.com/2004/03/10/stories/2004031002281700.htm
> >
> ><*> To visit your group on the web, go to:
> > http://groups.yahoo.com/group/India-egov/

-------------------------------------------------------------------
     March 2004       | Frederick Noronha, Freelance Journalist
Su Mo Tu We Th Fr Sa  | Goa India 0091.832.2409490 or 2409783
    1  2  3  4  5  6  | ----------------------------------------
 7  8  9 10 11 12 13  | Email fred at bytesforall.org
14 15 16 17 18 19 20  | Writing with a difference
21 22 23 24 25 26 27  | ... on what makes *the* difference
28 29 30 31           | http://www.bytesforall.org
-------------------------------------------------------------------
CHECK OUT USENET http://www.algebra.com/~scig/approved/threads.html
-------------------------------------------------------------------

RE: [Fsf-friends] YaST to be set free
by Senthil_OR＠Dell.com 25 Mar '04

>>SUSE YaST to be released under the GNU GPL by Novell.
>>http://news.com.com/2100-7344-5175682.html

Novell is coming up with quite a few good things for Free Software/Open Source programmers. Did you come across http://forge.novell.com ? /. is carrying a lot of news about it. It seems they are working on integrating GNOME and KDE.

Hacker Survey
by Senthil_OR＠Dell.com 24 Mar '04

Just as various corporates have their surveys and analyses, Boston Consulting, in collaboration with OSDN, has come out with a Hacker Survey. It is available here: http://www.osdn.com/bcg/ (under the GNU FDL).

The thing I observed was that the prominent reasons people give for getting involved with Free Software/Open Source are "it is intellectually stimulating" and "it improves skills".

Other noticeable points were:

- 98% are males ( :( if in college & :) if in school )
- India has a very small proportion as compared to the US and Germany.

Feel free to comment.

--
Senthil

Language Info Needed for GNU Aspell
by Frederick Noronha (FN) 24 Mar '04

[Please distribute this document as widely as possible.]

GNU Aspell 0.60 should be able to support most of the world's languages. This includes languages written in Arabic and other scripts not well supported by an existing 8-bit character set. Eventually Aspell should be able to support any current language not based on the Chinese writing system.

GNU Aspell is a spell checker designed to eventually replace Ispell. Its main feature is that it does a much better job of coming up with possible suggestions than just about any other spell checker out there for the English language, including Ispell and Microsoft Word. However, starting with Aspell 0.60 it should also be the only Free (as in Freedom) spell checker that can support most languages not written in the Latin or Cyrillic scripts.

However I, the author of Aspell, know very little about foreign (i.e. non-English) languages and what it takes to correctly spell check them. Thus, I need other people to educate me. If you speak a foreign language I would appreciate it if you would take the time to look over the following material and email me with any additional information you may have.

The first part gives a thorough analysis of the languages which Aspell can and cannot support. If you find any of this information is incorrect please inform me at kevina(a)gnu.org.

When Aspell 0.60 is released I would like to have dictionaries available for as many languages as possible. Therefore, if you know of a Free word list available for a language that is not currently listed as having a dictionary available I would appreciate hearing from you.

I am especially interested in working with someone to add support for languages written in the Arabic script. The encoding of Arabic is quite complicated and I want to be sure that Aspell can correctly handle it.

I would also appreciate some help converting Ispell dictionaries to Aspell.
So, if you would like to help convert some of the dictionaries listed as being available for Ispell please contact me.

The second part lists language-related issues involved in correctly spell checking a document. If you can offer any additional insight on any of the issues discussed, or know of any additional complications when spell checking a given language, I would appreciate hearing from you.

The last part discusses, for your reading pleasure, why Aspell uses 8-bit characters internally.

All of this material is also included in the Aspell 0.60 manual which you can find at http://aspell.net/devel-doc/man.

Languages Which Aspell can Support
**********************************

Even though Aspell will remain 8-bit internally it should still be able to support any written language not based on a logographic script. The only logographic writing systems in current use are those based on hànzi, which include Chinese, Japanese, and sometimes Korean.

Supported
=========

Aspell 0.60 should be able to support the following languages as, to the best of my knowledge, they all contain 220 or fewer symbols and can thus fit within an 8-bit character set. If an existing character set does not exist then a new one can be invented. This is true even if the script is not yet supported by Unicode, as the private use area can be used.
Code Language Name            Script           Dictionary Gettext
                                               Available  Translation
aa   Afar                     Latin            -          -
ab   Abkhazian                Cyrillic         -          -
ae   Avestan                  Avestan          -          -
af   Afrikaans                Latin            Yes        -
ak   Akan                     Latin            -          -
an   Aragonese                Latin            -          -
ar   Arabic                   Arabic           -          -
as   Assamese                 Bengali          -          -
av   Avar                     Cyrillic         -          -
ay   Aymara                   Latin            -          -
az   Azerbaijani              Cyrillic         -          -
az                            Latin            -          -
ba   Bashkir                  Cyrillic         -          -
be   Belarusian               Cyrillic         Planned    Yes
bg   Bulgarian                Cyrillic         Yes        -
bh   Bihari                   Devanagari       -          -
bi   Bislama                  Latin            -          -
bm   Bambara                  Latin            -          -
bn   Bengali                  Bengali          Planned    -
bo   Tibetan                  Tibetan          -          -
br   Breton                   Latin            Yes        -
bs   Bosnian                  Latin            -          -
ca   Catalan/Valencian        Latin            Yes        -
ce   Chechen                  Cyrillic         -          -
ch   Chamorro                 Latin            -          -
co   Corsican                 Latin            -          -
cr   Cree                     Latin            -          -
cs   Czech                    Latin            Yes        Yes
cu   Old Slavonic             Cyrillic         -          -
cv   Chuvash                  Cyrillic         -          -
cy   Welsh                    Latin            Yes        -
da   Danish                   Latin            Yes        -
de   German                   Latin            Yes        Yes
dv   Divehi                   Thaana           -          -
dz   Dzongkha                 Tibetan          -          -
ee   Ewe                      Latin            -          -
el   Greek                    Greek            Yes        -
en   English                  Latin            Yes        Yes
eo   Esperanto                Latin            Yes        -
es   Spanish                  Latin            Yes        Incomplete
et   Estonian                 Latin            Planned    -
eu   Basque                   Latin            -          -
fa   Persian                  Arabic           -          -
ff   Fulah                    Latin            -          -
fi   Finnish                  Latin            Planned    -
fj   Fijian                   Latin            -          -
fo   Faroese                  Latin            Yes        -
fr   French                   Latin            Yes        Yes
fy   Frisian                  Latin            -          -
ga   Irish                    Latin            Yes        Yes
gd   Scottish Gaelic          Latin            Planned    -
gl   Gallegan                 Latin            Yes        -
gn   Guarani                  Latin            -          -
gu   Gujarati                 Gujarati         -          -
gv   Manx                     Latin            Planned    -
ha   Hausa                    Latin            -          -
he   Hebrew                   Hebrew           Planned    -
hi   Hindi                    Devanagari       -          -
ho   Hiri Motu                Latin            -          -
hr   Croatian                 Latin            Yes        -
ht   Haitian Creole           Latin            -          -
hu   Hungarian                Latin            Planned    -
hy   Armenian                 Armenian         -          -
hz   Herero                   Latin            -          -
ia   Interlingua (IALA)       Latin            Yes        -
id   Indonesian               Latin            Yes        -
ie   Interlingue              Latin            -          -
ig   Igbo                     Latin            -          -
ik   Inupiaq                  Latin            -          -
io   Ido                      Latin            -          -
is   Icelandic                Latin            Yes        -
it   Italian                  Latin            Yes        -
iu   Inuktitut                Latin            -          -
jv   Javanese                 Javanese         -          -
jv                            Latin            -          -
ka   Georgian                 Georgian         -          -
kg   Kongo                    Latin            -          -
ki   Kikuyu/Gikuyu            Latin            -          -
kj   Kwanyama                 Latin            -          -
kk   Kazakh                   Cyrillic         -          -
kl   Kalaallisut/Greenlandic  Latin            -          -
kn   Kannada                  Kannada          -          -
ko   Korean                   Hangeul          -          -
kr   Kanuri                   Latin            -          -
ks   Kashmiri                 Arabic           -          -
ks                            Devanagari       -          -
ku   Kurdish                  Arabic           -          -
ku                            Cyrillic         -          -
ku                            Latin            -          -
kv   Komi                     Cyrillic         -          -
kw   Cornish                  Latin            -          -
ky   Kirghiz                  Cyrillic         -          -
ky                            Latin            -          -
la   Latin                    Latin            -          -
lb   Luxembourgish            Latin            Planned    -
lg   Ganda                    Latin            -          -
li   Limburgan                Latin            -          -
ln   Lingala                  Latin            -          -
lo   Lao                      Lao              -          -
lt   Lithuanian               Latin            Planned    -
lu   Luba-Katanga             Latin            -          -
lv   Latvian                  Latin            -          -
mg   Malagasy                 Latin            -          -
mh   Marshallese              Latin            -          -
mi   Maori                    Latin            Yes        -
mk   Makasar                  Lontara/Makasar  -          -
ml   Malayalam                Latin            -          -
ml                            Malayalam        -          -
mn   Mongolian                Cyrillic         -          -
mn                            Mongolian        -          -
mo   Moldavian                Cyrillic         -          -
mr   Marathi                  Devanagari       -          -
ms   Malay                    Latin            Yes        -
mt   Maltese                  Latin            Planned    -
my   Burmese                  Myanmar          -          -
na   Nauruan                  Latin            -          -
nb   Norwegian Bokmal         Latin            Yes        -
nd   North Ndebele            Latin            -          -
ne   Nepali                   Devanagari       -          -
ng   Ndonga                   Latin            -          -
nl   Dutch                    Latin            Yes        Yes
nn   Norwegian Nynorsk        Latin            Yes        -
nr   South Ndebele            Latin            -          -
nv   Navajo                   Latin            -          -
ny   Nyanja                   Latin            -          -
oc   Occitan/Provencal        Latin            -          -
or   Oriya                    Oriya            -          -
os   Ossetic                  Cyrillic         -          -
pa   Punjabi                  Gurmukhi         -          -
pi   Pali                     Devanagari       -          -
pi                            Sinhala          -          -
pl   Polish                   Latin            Yes        -
ps   Pushto                   Arabic           -          -
pt   Portuguese               Latin            Yes        Yes
qu   Quechua                  Latin            -          -
rm   Raeto-Romance            Latin            -          -
rn   Rundi                    Latin            -          -
ro   Romanian                 Latin            Yes        Yes
ru   Russian                  Cyrillic         Yes        Yes
rw   Kinyarwanda              Latin            -          -
sa   Sanskrit                 Devanagari       -          -
sc   Sardinian                Latin            -          -
sd   Sindhi                   Arabic           -          -
se   Northern Sami            Latin            -          -
sg   Sango                    Latin            -          -
si   Sinhalese                Sinhala          -          -
sk   Slovak                   Latin            Yes        -
sl   Slovenian                Latin            Yes        -
sm   Samoan                   Latin            -          -
sn   Shona                    Latin            -          -
so   Somali                   Latin            -          -
sq   Albanian                 Latin            Planned    -
sr   Serbian                  Cyrillic         -          Yes
sr                            Latin            -          -
ss   Swati                    Latin            -          -
st   Southern Sotho           Latin            -          -
su   Sundanese                Latin            -          -
sv   Swedish                  Latin            Yes        -
sw   Swahili                  Latin            Planned    -
ta   Tamil                    Tamil            Planned    -
te   Telugu                   Telugu           -          -
tg   Tajik                    Latin            -          -
tk   Turkmen                  Latin            -          -
tl   Tagalog                  Latin            -          -
tl                            Tagalog          -          -
tn   Tswana                   Latin            -          -
to   Tonga                    Latin            -          -
tr   Turkish                  Latin            -          -
ts   Tsonga                   Latin            -          -
tt   Tatar                    Cyrillic         -          -
tw   Twi                      Latin            -          -
ty   Tahitian                 Latin            -          -
ug   Uighur                   Arabic           -          -
ug                            Cyrillic         -          -
ug                            Latin            -          -
uk   Ukrainian                Cyrillic         Yes        -
ur   Urdu                     Arabic           -          -
uz   Uzbek                    Cyrillic         -          -
uz                            Latin            -          -
ve   Venda                    Latin            -          -
vi   Vietnamese               Latin            -          -
vo   Volapuk                  Latin            -          -
wa   Walloon                  Latin            Planned    Incomplete
wo   Wolof                    Latin            -          -
xh   Xhosa                    Latin            -          -
yi   Yiddish                  Hebrew           -          -
yo   Yoruba                   Latin            -          -
za   Zhuang                   Latin            -          -
zu   Zulu                     Latin            Planned    -

Notes on Latin Languages
------------------------

Any word that can be written using one of the Latin ISO-8859 character sets (ISO-8859-1,2,3,4,9,10,13,14,15,16) can be written, in decomposed form, using the ASCII characters, the 23 additional letters:

     U+00C6 LATIN CAPITAL LETTER AE
     U+00D0 LATIN CAPITAL LETTER ETH
     U+00D8 LATIN CAPITAL LETTER O WITH STROKE
     U+00DE LATIN CAPITAL LETTER THORN
     U+00FE LATIN SMALL LETTER THORN
     U+00DF LATIN SMALL LETTER SHARP S
     U+00E6 LATIN SMALL LETTER AE
     U+00F0 LATIN SMALL LETTER ETH
     U+00F8 LATIN SMALL LETTER O WITH STROKE
     U+0110 LATIN CAPITAL LETTER D WITH STROKE
     U+0111 LATIN SMALL LETTER D WITH STROKE
     U+0126 LATIN CAPITAL LETTER H WITH STROKE
     U+0127 LATIN SMALL LETTER H WITH STROKE
     U+0131 LATIN SMALL LETTER DOTLESS I
     U+0138 LATIN SMALL LETTER KRA
     U+0141 LATIN CAPITAL LETTER L WITH STROKE
     U+0142 LATIN SMALL LETTER L WITH STROKE
     U+014A LATIN CAPITAL LETTER ENG
     U+014B LATIN SMALL LETTER ENG
     U+0152 LATIN CAPITAL LIGATURE OE
     U+0153 LATIN SMALL LIGATURE OE
     U+0166 LATIN CAPITAL LETTER T WITH STROKE
     U+0167 LATIN SMALL LETTER T WITH STROKE

and the 14 modifiers:

     U+0300 COMBINING GRAVE ACCENT
     U+0301 COMBINING ACUTE ACCENT
     U+0302 COMBINING CIRCUMFLEX ACCENT
     U+0303 COMBINING TILDE
     U+0304 COMBINING MACRON
     U+0306 COMBINING BREVE
     U+0307 COMBINING DOT ABOVE
     U+0308 COMBINING DIAERESIS
     U+030A COMBINING RING ABOVE
     U+030B COMBINING DOUBLE ACUTE ACCENT
     U+030C COMBINING CARON
     U+0326 COMBINING COMMA BELOW
     U+0327 COMBINING CEDILLA
     U+0328 COMBINING OGONEK

This is a total of 37 additional Unicode code points.
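As a rough illustration of the "decomposed form" mentioned above, here is a minimal Python sketch. It uses the standard `unicodedata` module's NFD normalization as a stand-in for whatever decomposition Aspell itself performs, which this note does not show; the helper names `decompose` and `only_listed_modifiers` are hypothetical and exist only for this example.

```python
import unicodedata

# The 14 combining marks listed above, as Unicode code points.
LISTED_MODIFIERS = frozenset({
    0x0300, 0x0301, 0x0302, 0x0303, 0x0304, 0x0306, 0x0307,
    0x0308, 0x030A, 0x030B, 0x030C, 0x0326, 0x0327, 0x0328,
})


def decompose(text):
    """Return text in canonical decomposed form (NFD)."""
    return unicodedata.normalize("NFD", text)


def only_listed_modifiers(text):
    """True if every combining mark in the decomposed text is one of the 14."""
    return all(
        ord(ch) in LISTED_MODIFIERS
        for ch in decompose(text)
        if unicodedata.combining(ch)  # non-zero only for combining marks
    )


if __name__ == "__main__":
    # "a with ring above" splits into a base letter plus U+030A.
    print([hex(ord(c)) for c in decompose("\u00e5")])
    # Sharp s has no canonical decomposition, so it stays one code point.
    print([hex(ord(c)) for c in decompose("\u00df")])
    print(only_listed_modifiers("na\u00efve caf\u00e9"))
```

Letters such as sharp s or o with stroke do not decompose at all, which is consistent with their appearing in the list of 23 additional letters rather than being built from a base letter and a modifier.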
All ISO-8859 character sets leave the characters 0x00 - 0x1F and 0x80 - 0x9F unmapped as they are generally used as control characters. Of those, 0x02 - 0x1F and 0x80 - 0x9F may be mapped to anything in Aspell. This is a total of 62 characters which can be remapped in any ISO-8859 character set. Thus, by remapping 37 of the 62 characters to the previously specified Unicode code points, any modified ISO-8859 character set can be used for any Latin language covered by ISO-8859. Of course decomposing every single accented character wastes a lot of space, so only characters that cannot be represented in the precomposed form should be broken up. By using this trick it is possible to store foreign words in the correctly accented form in the dictionary even if the precomposed character is not in the current character set.

Any letter in the Unicode ranges U+0000 - U+024F and U+1E00 - U+1EFF (Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B, and Latin Extended Additional) can be represented using around 175 basic letters and 25 modifiers, which is less than 220 and can thus fit in an Aspell 8-bit character set. Since this Unicode range covers any possible Latin language, this special character set can be used to represent any word written using the Latin script if so desired.

Hangeul
-------

Korean is generally written in hangeul or a mixture of hanja and hangeul. Aspell should be able to spell check the hangeul part of the writing. In hangeul, individual letters, known as jamo, are grouped together in syllable blocks. Unicode provides code points for both jamo and the combined syllable blocks. The syllable blocks will need to be decomposed into jamo in order for Aspell to spell check them.

Syllabic
========

Syllabic languages use a separate symbol for each syllable of the language. Since most of them have more than 240 distinct characters Aspell can not support them as is.
However, all hope is not lost as Aspell will most likely be able to support them in the future.

Code Language Name  Script
am   Amharic        Ethiopic
cr   Cree           Canadian Syllabics
ii   Sichuan Yi     Yi
iu   Inuktitut      Canadian Syllabics
oj   Ojibwa         Ojibwe
om   Oromo          Ethiopic
ti   Tigrinya       Ethiopic

The Ethiopic Syllabary
----------------------

Even though the Ethiopic script has more than 220 distinct characters, with a little work Aspell can still handle it. The idea is to split each character into two parts based on the matrix representation. The first 3 bits will be the first part and could be mapped to `10000???'. The next 6 bits will be the second part and could be mapped to `11??????'. The combined character will then be mapped with the upper bits coming first. Thus each Ethiopic syllable will have the form `11?????? 10000???'. By mapping the first and second parts to separate 8-bit characters it is easy to tell which part represents the consonant and which part represents the vowel of the syllable. This encoding of the syllabary is far more useful to Aspell than if it were stored in UTF-8 or UTF-16. In fact, the existing suggestion strategy of Aspell will work well with this encoding without any additional modifications. However, additional improvements may be possible by taking advantage of the consonant-vowel structure of this encoding.

In fact, the split consonant-vowel representation may prove to be so useful that it may be beneficial to encode other syllabaries in this fashion, even if they have fewer than 220 symbols.

The code to break up a syllable into its consonant-vowel parts does not exist as of Aspell 0.60. However, it will be fairly easy to add it as part of the Unicode normalization process once that is written.

The Yi Syllabary
----------------

A very large syllabary with 819 distinct symbols. However, like Ethiopic, it should be possible to support this script by breaking it up.
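The two-byte split proposed for Ethiopic above can be sketched in Python as follows. This is only one reading of the description, not code from Aspell (which, as noted, does not yet contain it). It assumes that the "matrix" is the Unicode Ethiopic block starting at U+1200 laid out in rows of eight, so that the low 3 bits of the offset select the vowel column and the next 6 bits select the consonant row. The function names are hypothetical.

```python
ETHIOPIC_BASE = 0x1200  # first code point of the Unicode Ethiopic block (assumed origin)
ETHIOPIC_LAST = 0x137F  # last code point of that block


def split_syllable(ch):
    """Split one Ethiopic character into (consonant_byte, vowel_byte).

    consonant_byte has the form 11?????? and vowel_byte the form 10000???.
    """
    cp = ord(ch)
    if not ETHIOPIC_BASE <= cp <= ETHIOPIC_LAST:
        raise ValueError("not in the Ethiopic block: U+%04X" % cp)
    offset = cp - ETHIOPIC_BASE
    vowel = offset & 0b111    # low 3 bits: column of the chart
    consonant = offset >> 3   # next 6 bits: row of the chart (at most 47 here)
    return 0xC0 | consonant, 0x80 | vowel


def join_syllable(consonant_byte, vowel_byte):
    """Inverse of split_syllable: rebuild the original character."""
    offset = ((consonant_byte & 0x3F) << 3) | (vowel_byte & 0x07)
    return chr(ETHIOPIC_BASE + offset)


if __name__ == "__main__":
    # Upper (consonant) part first, then the vowel part.
    pair = split_syllable("\u1209")
    print([format(b, "08b") for b in pair])
    print(join_syllable(*pair) == "\u1209")
```

Because the consonant byte always has its top two bits set and the vowel byte never does, the two parts can be told apart by inspecting a single byte, which is the property the text relies on.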
The Unified Canadian Aboriginal Syllabics
-----------------------------------------

Another very large syllabary.

The Ojibwe Syllabary
--------------------

With only 120 distinct symbols, Aspell can actually support this one as is. However, as previously mentioned, it may be beneficial to break it up into the consonant-vowel representation anyway.

Unsupported
===========

These languages, when written in the given script, are currently unsupported by Aspell for one reason or another.

Code Language Name  Script
ja   Japanese       Japanese
km   Khmer          Khmer
ko   Korean         Hanja + Hangeul
pi   Pali           Thai
th   Thai           Thai
zh   Chinese        Hànzi

The Thai and Khmer Scripts
--------------------------

The Thai and Khmer scripts present a different problem for Aspell. The problem is not that there are more than 220 unique symbols, but that there are no spaces between words. This means that there is no easy way to split a sentence into individual words. However, it is still possible to spell check these scripts; it is just a lot more difficult. I will be happy to work with someone who is interested in adding Thai or Khmer support to Aspell, but it is not likely something I will do in the foreseeable future.

Languages which use Hànzi Characters
------------------------------------

Hànzi characters are used to write Chinese, Japanese, and Korean, and were once used to write Vietnamese. Each hànzi character represents a syllable of a spoken word and also has a meaning. Since there are around 3,000 of them in common usage it is unlikely that Aspell will ever be able to support spell checking languages written using hànzi. However, I am not even sure if these languages need spell checking since hànzi characters are generally not entered in directly. Furthermore, even if Aspell could spell check hànzi, the existing suggestion strategy will not work well at all, and thus a completely new strategy will need to be developed.
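To make the Thai and Khmer word-splitting problem described above more concrete, here is a small Python sketch of greedy longest-match segmentation against a word list. This is a generic textbook technique chosen purely for illustration; the note does not say how Aspell would approach the problem, and the function name `segment` and the sample word lists are assumptions.

```python
def segment(text, dictionary):
    """Greedy longest-match split of unspaced text into known words.

    Characters that start no known word are emitted as one-character tokens,
    so a spell checker could flag them afterwards.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = 1  # fall back to a single unknown character
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in dictionary:
                match = size
                break
        tokens.append(text[i:i + match])
        i += match
    return tokens


if __name__ == "__main__":
    # An English sentence with the spaces removed shows the idea.
    print(segment("thedogwentup", {"the", "dog", "went", "up"}))
    # The same routine applied to a short Thai phrase ("black cat").
    print(segment("แมวดำ", {"แมว", "ดำ"}))
```

A greedy split like this can pick the wrong boundary when one word is a prefix of another, and a misspelled word derails everything after it, which hints at why the text calls the task "a lot more difficult" than checking space-separated words.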
Japanese
--------

Modern Japanese is written in a mixture of "hiragana", "katakana", "kanji", and sometimes "romaji". "Hiragana" and "katakana" are both syllabaries unique to Japan, "kanji" is a modified form of hànzi, and "romaji" uses the Latin alphabet. With some work, Aspell should be able to check the non-kanji part of Japanese text. However, based on my limited understanding of Japanese, hiragana is often used at the end of kanji. Thus if Aspell were to simply separate out the hiragana from the kanji it would end up with a lot of word endings which are not proper words and which would thus be flagged as misspellings. However, this can be fairly easily rectified as text is tokenized into words before it is converted into Aspell's internal encoding. In fact, some Japanese text is written entirely in one script. For example, books for children and foreigners are sometimes written entirely in hiragana. Thus, Aspell could prove at least somewhat useful for spell checking Japanese.

Languages Written in Multiple Scripts
=====================================

Aspell should be able to check text written in the same language, but in multiple scripts, with some work. If the number of unique symbols in both scripts is less than 220 then a special character set can be used to allow both scripts to be encoded in the same dictionary. However, this may not be the most efficient solution. An alternate solution is to store each script in its own dictionary and allow Aspell to choose the correct dictionary based on which script the given word is written in. Aspell currently does not support this mode of spell checking; however, it is something that I hope to eventually support.

Notes on Planned Dictionaries
=============================

be   Belarusian       Ispell dictionary available.
bn   Bengali          Unofficial Aspell dictionary available.
                      `http://www.bengalinux.org/downloads/'
et   Estonian         Ispell dictionary available.
fi   Finnish          Ispell dictionary available.
gd   Scottish Gaelic  Ispell dictionary available.
                      `http://packages.debian.org/unstable/text/igaelic'
gv   Manx             Ispell dictionary available.
                      `http://packages.debian.org/unstable/text/imanx'
he   Hebrew           Ispell dictionary available.
hu   Hungarian        MySpell dictionary expanded to over 500 MB. Will add
                      once affix support is worked into the dictionary
                      package system.
lb   Luxembourgish    MySpell dictionary planned.
lt   Lithuanian       MySpell dictionary expanded to over 500 MB. Will add
                      once affix support is worked into the dictionary
                      package system.
mt   Maltese          Unofficial Aspell dictionary available, but broken
                      link to source.
                      `http://linux.org.mt/article/spellcheck'
sq   Albanian         Ispell dictionary available.
sw   Swahili          Available at
                      `http://sourceforge.net/projects/translate'.
                      Official version coming soon.
ta   Tamil            Word list available at
                      `http://www.developer.thamizha.com/spellchecker/index.html'.
                      Working with them to create an Aspell dictionary.
wa   Walloon          Ispell dictionary available.
zu   Zulu             Available at
                      `http://sourceforge.net/projects/translate'.
                      Official version coming soon.

References
==========

The information in this chapter was gathered from numerous sources, including:

* ISO 639-2 Registration Authority, `http://www.loc.gov/standards/iso639-2/'
* Languages and Scripts (Official Unicode Site), `http://www.unicode.org/onlinedat/languages-scripts.html'
* Omniglot - a guide to written language, `http://www.omniglot.com/'
* Wikipedia - The Free Encyclopedia, `http://wikipedia.org/'
* Ethnologue - Languages of the World, `http://www.ethnologue.com/'
* World Languages - The Ultimate Language Store, `http://www.worldlanguage.com/'
* South African Languages Web, `http://www.languages.web.za/'
* The Languages and Writing Systems of Africa (Global Advisor Newsletter), `http://www.intersolinc.com/newsletters/africa.htm'

Special thanks go to Era Eriksson for helping me with the information in this chapter.

Language Related Issues
***********************

Here are some language related issues that a good spell checker needs to handle.
If you have any more information about any of these issues, or of a new issue not discussed here, please email me at <kevina(a)gnu.org>.

German Sharp S
==============

The German Sharp S or Eszett does not have an uppercase equivalent. Instead, `ß' is converted to `SS' when a word is uppercased. The conversion of `ß' to `SS' requires a special rule, and increases the length of a word, thus disallowing in-place case conversion. Furthermore, my general rule of converting all words to lowercase before looking them up in the dictionary won't work because the conversion of `SS' to lowercase is ambiguous; it can be `ss' or `ß'. I do plan on dealing with this eventually, however.

Compound Words
==============

In some languages, such as German, it is acceptable to string two words together, thus forming a compound word. However, there are rules as to when this can be done. Furthermore, it is not always sufficient to simply concatenate the two words. For example, sometimes a letter is inserted between the two words. I tried implementing support for compound words in Aspell but it was too limiting and no one used it. Before I try implementing it again I want to know all the issues involved.

Context Sensitive Spelling
==========================

In some languages, such as Luxembourgish, the spelling of a word depends on which words surround it. For example, the letter `n' at the end of a word will disappear if it is followed by another word starting with a certain letter such as an `s'. However, it can probably get more complicated than that. I would like to know how complicated before I attempt to implement support for context sensitive spelling.

Unicode Normalization
=====================

Because Unicode contains a large number of precomposed characters there are multiple ways a character can be represented.
For example the letter `å' can either be represented as

     U+00E5 LATIN SMALL LETTER A WITH RING ABOVE

or

     U+0061 LATIN SMALL LETTER A + U+030A COMBINING RING ABOVE

By performing normalization first Aspell will only see one of these representations. The exact form of normalization depends on the language. Given the choice of

  1. Precomposed character
  2. Base letter + combining character(s)
  3. Base letter only

if the precomposed character is in the target character set then (1) is used; if both the base letter and the combining character are present then (2); otherwise (3).

Words With Spaces or other Symbols in Them
==========================================

Many languages, including English, have words with non-letter symbols in them, for example the apostrophe. These symbols generally appear in the middle of a word, but they can also appear at the end, such as in an abbreviation. If a symbol can _only_ appear as part of a word then Aspell can treat it as if it were a letter. However, the problem is that most of these symbols have other uses. For example, the apostrophe is often used as a single quote and the abbreviation marker is also used as a period. Thus, Aspell can not blindly treat them as if they were letters.

Aspell currently handles the case where the symbol can only appear in the middle of the word fairly well. It simply assumes that if there is a letter both before and after the symbol then it is part of the word. This works most of the time but it is not foolproof. For example, suppose the user forgot to leave a space after the period:

     ... and the dog went up the tree.Then the cat ...

Aspell would think "tree.Then" is one word. A better solution might be to then try to check "tree" and "Then" separately. But what if one of them is not in the dictionary? Should Aspell assume "tree.Then" is one word?

The case where the symbol can appear at the beginning or end of the word is more difficult to deal with. The symbol may or may not actually be part of the word.
Aspell currently handles this case by first trying to spell check the word with the symbol and, if that fails, trying it without. The problem is, if the word is misspelled, should Aspell assume the symbol belongs with the word or not? Currently Aspell assumes it does, which is not always the correct thing to do.

Numbers in words present a different challenge to Aspell. If Aspell treats numbers as letters then every possible number a user might write in a document must be specified in the dictionary. This could easily be solved by having special code to assume all numbers are correctly spelled. But what about something like "4th"? Since the "th" suffix can appear after any number we are left with the same problem. The solution would be to have a special symbol for "any number".

Words with spaces in them, such as foreign phrases, are even more trouble to deal with. The basic problem is that when tokenizing a string there is no good way to keep phrases together. One solution is to use trial and error. If a word is not in the dictionary, try grouping it with the previous or next word and see if the combined word is in the dictionary. But what if the combined word is not? Should the misspelled word be grouped when looking for suggestions? One solution is to also store each part of the phrase in the dictionary, but tag it as part of a phrase and not an independent word.

To further complicate things, most applications that use spell checkers are accustomed to parsing the document themselves and sending it to the spell checker a word at a time. In order to support words with spaces in them a more complicated interface will be required.

Notes on 8-bit Characters
*************************

There is a very good reason I use 8-bit characters in Aspell: speed and simplicity. While many parts of my code can fairly easily be converted to some sort of wide character, as my code is clean, other parts can not be.
One of the reasons is that in many, many places I use a direct lookup to find out various information about characters. With 8-bit characters this is very feasible because there are only 256 of them. With 16-bit wide characters this will waste a LOT of space. With 32-bit characters this is just plain impossible. Converting the lookup tables to some other form, while certainly possible, will degrade performance significantly.

Furthermore, some of my algorithms rely on words consisting only of a small number of distinct characters (often around 30 when case and accents are not considered). When a word can contain any Unicode character this number becomes several thousand. In order for these algorithms to still be used, some sort of limit will need to be placed on the characters a word can contain. If I impose that limit, I might as well use some sort of 8-bit character set, which will automatically place the limit on what the characters can be.

There is also the issue of how I should store the word lists in memory. As strings of 32-bit wide characters? That uses 4 times more memory than 8-bit characters would, and for languages that fit within an 8-bit character set that is, in my view, a gross waste of memory. So maybe I should store them in some variable-width format such as UTF-8. Unfortunately, way, way too many of my algorithms will simply not work with variable-width characters without significant modification, which will very likely degrade performance. So the solution is to work with the characters as 32-bit wide characters and then convert them to a shorter representation when storing them in the lookup tables. Now that can lead to inefficiency. I could also use 16-bit wide characters; however, that may not be good enough to hold all future versions of Unicode, and it has the same problems.
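To make the "direct lookup" point concrete, here is a small Python sketch of a 256-entry table that maps every byte of an 8-bit character set to its lowercase, accent-free form. It is an illustration only, not Aspell's actual tables: the function names are hypothetical, and ISO-8859-1 is chosen simply because every byte value decodes in it.

```python
import unicodedata


def build_base_letter_table(encoding="iso-8859-1"):
    """Build a 256-byte table: index = input byte, value = the byte for
    the lowercase letter with accents removed (or the byte unchanged)."""
    table = bytearray(range(256))
    for code in range(256):
        ch = bytes([code]).decode(encoding)
        # Lowercase, decompose, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", ch.lower())
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        try:
            encoded = base.encode(encoding)
        except UnicodeEncodeError:
            continue  # result not representable in 8 bits: leave as is
        if len(encoded) == 1:
            table[code] = encoded[0]
    return bytes(table)


TABLE = build_base_letter_table()


def strip_case_and_accents(word):
    """Apply the table to an 8-bit encoded word: one lookup per byte."""
    return word.translate(TABLE)


print(len(TABLE))
print(strip_case_and_accents("Ångström".encode("iso-8859-1")))
```

The whole table costs 256 bytes. The same direct-indexing approach needs 65,536 entries per table for 16-bit characters and over a million for the full Unicode range, which is the space problem described above.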
In response to the space wasted by storing word lists in some sort of wide format, someone asked:

     Since hard drives are cheaper and cheaper, you could store the dictionary in a usable (uncompressed) form and use it directly with memory mapping. Then the efficiency would directly depend on the disk caching method, and only the used part of the dictionaries would really be loaded into memory. You would no longer have to load plain dictionaries into main memory, you'll just want to compute some indexes (or something like that) after mapping.

However, the fact of the matter is that most of the dictionary will be read into memory anyway if it is available. If it is not available then there would be a good deal of disk swaps. Making characters 32 bits wide will increase the chance that there are more disk swaps. So the bottom line is that it will be cheaper to convert the characters from something like UTF-8 into some sort of wide character. I could also use some sort of disk-based lookup table such as the Berkeley Database. However this will *definitely* degrade performance.

The bottom line is that keeping Aspell 8-bit internally is a very well thought out decision that is not likely to change any time soon. Feel free to challenge me on it, but don't expect me to change my mind unless you can bring up some point that I have not thought of before, and quite possibly a patch that cleanly converts Aspell to Unicode internally without a serious performance loss OR a serious increase in memory usage.

--
http://kevin.atkinson.dhs.org
_______________________________________________
GNU Announcement mailing list <info-gnu(a)gnu.org>
http://mail.gnu.org/mailman/listinfo/info-gnu

1 0

YaST to be set free
by Ramanraj K 24 Mar '04

24 Mar '04

SUSE YaST to be released under the GNU GPL by Novell. http://news.com.com/2100-7344-5175682.html

1 0
