Recently, I started using UTF-8 enabled applications to read and write in Tamil, the local official language here. It appears that Indic languages have been incorrectly represented in Unicode. India sent fewer than 128 characters per language to the Unicode Consortium in the 1990s, far fewer than the full complement of characters in each. For example, among Tamil characters, only 31 (12 vowels, 18 consonants, and 1 final, ஃ) have specific codes; the chart omits almost 12 x 18 (216) compound characters, which now have to be encoded as sequences taking three to nine bytes per character in UTF-8. To make things worse, the characters are not arranged in any natural order, so sorting is difficult. It appears to be difficult to amend the charts now, as a number of applications have already started using the Unicode code charts. Almost all Indic languages have the same problem.
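To make the byte counts concrete, here is a minimal Python sketch that measures how many code points and UTF-8 bytes a few Tamil letters and syllables take. The sample names and the helper utf8_cost are only illustrative and not part of any standard; the last case assumes the text is stored in canonically decomposed (NFD) form, which splits the vowel sign for "o" into two signs.

```python
import unicodedata

# Sample Tamil text, written with escapes so the code points are explicit.
samples = {
    "vowel a": "\u0b85",                 # அ
    "consonant ka": "\u0b95",            # க
    "pure consonant k": "\u0b95\u0bcd",  # க் = ka + pulli
    "syllable ki": "\u0b95\u0bbf",       # கி = ka + vowel sign i
    "syllable ko": "\u0b95\u0bca",       # கொ = ka + vowel sign o
}


def utf8_cost(text):
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


for name, text in samples.items():
    points, size = utf8_cost(text)
    print(f"{name}: {text} -> {points} code point(s), {size} byte(s)")

# In decomposed form (NFD), vowel sign o becomes two separate vowel signs,
# so the same syllable needs three code points.
decomposed_ko = unicodedata.normalize("NFD", samples["syllable ko"])
print("decomposed ko:", utf8_cost(decomposed_ko))
```

Each Tamil code point takes three bytes in UTF-8, so a compound character costs three bytes for every code point in its sequence.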
Some would now like to have a 16-bit encoded Tamil-New chart, with codes allocated for 250+ characters in the Private Use Area. I am not sure whether other Indic language groups are aware of these issues, or what their plans are to deal with them.
Padmakumar pointed out these issues on the fsf-friends mailing list in 2004:
http://mm.gnu.org.in/pipermail/fsf-friends/2004-December/002653.html along with a link to the article at: http://www.angelfire.com/empire/thamizh/2/ (sadly, there was no response to it)
A recent Tamil Virtual University (TVU) conference document on these issues is available at: http://tamilvu.org/coresite/html/cwwhatnw.htm
There are a number of things that need to be done:
[1] Add any missing characters and re-arrange the Tamil Unicode characters within the existing range of 128 so that sorting can be done.
[2] Examine the TVU document and offer suggestions to those concerned regarding the Tamil 16-bit encoding.
[3] Almost all Indic languages are in the same boat here, so the language groups ought to come up with workable plans to remove the problems.
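To illustrate the sorting problem behind item [1], here is a minimal Python sketch comparing plain code-point order with a traditional ordering of the 18 consonants. The TRADITIONAL sequence and the helper traditional_key are assumptions made for this example only; a usable collation table would also have to cover vowels, vowel signs and the pulli.

```python
# Traditional order of the 18 Tamil consonants, given as code points
# (an assumption for this sketch; Grantha letters are left out):
# க ங ச ஞ ட ண த ந ப ம ய ர ல வ ழ ள ற ன
TRADITIONAL = "".join(chr(cp) for cp in (
    0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3,
    0x0BA4, 0x0BA8, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0,
    0x0BB2, 0x0BB5, 0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9,
))

# Rank of each consonant in the traditional sequence.
RANK = {ch: i for i, ch in enumerate(TRADITIONAL)}


def traditional_key(word):
    """Sort key: traditional rank for consonants; other characters are
    placed after all consonants, in code-point order."""
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word]


letters = ["\u0ba9", "\u0baa", "\u0bb1", "\u0bb2"]  # ன ப ற ல

by_code_point = sorted(letters)                       # default string order
by_tradition = sorted(letters, key=traditional_key)   # hypothetical helper

print("code point order: ", " ".join(by_code_point))
print("traditional order:", " ".join(by_tradition))
```

With the current chart, ன sorts immediately after ந and ற sorts before ல, whereas the traditional sequence places ற and ன last.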
-Ramanraj K