Classics@2: Deborah Anderson, Preliminary Guidelines to Using Unicode for Greek

Preliminary Guidelines to Using Unicode for Greek 

Deborah Anderson

 

Article Contents

Summary
Introduction
What is the Unicode Standard and why is it important?
Core Concepts in Unicode
Organization of The Unicode Standard: Code Charts
Greek in The Unicode Standard
How to Get Unicode to Work for ancient Greek
Specific Questions for Users
Bibliography and Useful Reference Guides

Summary

This article offers a concise introduction to the Unicode standard and attendant technologies, aimed specifically at students and scholars of classical Greek. It describes the (intentional) limits of the Unicode standard, gives some guidelines for using Unicode characters, answers some frequently asked questions, and includes a bibliography of useful resources.

Introduction

Being able to easily type ancient Greek, get it to display correctly on a computer, and send Greek text electronically to others may still seem elusive to Hellenists. Fortunately, the international character encoding standard Unicode has laid the foundation for texts with ancient Greek and other scripts to be transmitted electronically without error; this standard has been adopted and implemented widely in computers today. The Unicode Standard (“Unicode” hereafter) already covers most of the letters and symbols used by Hellenists, and current fonts, software, and computer operating systems are increasingly able to support the Greek characters. However, working on ancient Greek electronically is not yet in as ideal a state as it could be: additional characters are in the pipeline for inclusion in Unicode but are not yet approved, old non-standardized fonts are still widely used and still cause problems in data exchange, and some difficulties in writing and sending Greek electronically on different platforms have not yet been resolved (on this last point see, for example, Donald Mastronarde’s FAQ and comments on GreekKeys on the Mac: http://ist-socrates.berkeley.edu/~pinax/greekkeys/GreekKeysFAQ.html).

Because more work continues on Unicode—both within the standards committees and in various products in the computer industry—an introduction to Unicode can be useful, for it can help scholars and students make their needs more effectively known to the standards committees and to industry (and guide scholars in knowing where to direct their questions and complaints). Understanding Unicode’s core concepts can also assist scholars in gaining a better understanding of the entire layered model of text representation.

What is the Unicode Standard and why is it important?

Unicode is the international character encoding standard and is fully synchronized with ISO 10646, its parallel International Standard maintained by the International Organization for Standardization. Character encoding refers to the assignment of a number to a letter or other symbol found in a text. For example, the Greek letter mu is assigned the number U+03BC. (Here, as elsewhere, the Unicode value is cited in the form “U+[hex number].”) This number (or “character code”), U+03BC, is how the computer stores the letter (or symbol, etc.), and it underlies how text is stored in your Word documents, on the Web, etc. Fonts also use this number to associate a character with the glyph that displays it.
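As a small illustration of the relation between a character and its number, the following sketch looks up the character code for mu and shows one way that code can be stored as bytes. Python and its standard `unicodedata` module are this example's own choice, not something the article prescribes, and UTF-8 is only one of several possible storage forms.

```python
import unicodedata

mu = "\u03bc"  # GREEK SMALL LETTER MU, written here by its character code

# ord() returns the character code as an integer; format it as U+XXXX.
code = f"U+{ord(mu):04X}"

# The formal character name recorded in the Unicode Character Database.
name = unicodedata.name(mu)

# The character code is abstract; UTF-8 is one way of serializing it as bytes.
utf8_bytes = mu.encode("utf-8")

print(code, name, utf8_bytes.hex())
```

The character code stays the same whatever font is later used to display the letter; only the glyph drawn for it changes.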

In the scheme of multi-layered text representation, character encoding is at the bottom. Above it is markup (HTML, XML, or TEI), which can convey the hierarchical structure of a document and the content it consists of, and metadata is at the top level. Metadata is structured data about data.

The need for an international character encoding standard was evident already in the 1980s, when there were a variety of competing standards: there were different encoding systems used for the Mac, for Windows, and a variety of governmental body standards. A large number of fonts were also created by individuals with ad hoc (i.e., non-standard) encodings. This multitude of standards and encodings explains the situation that most users encountered in the 1980s and 1990s: you open a document from a colleague only to find the text garbled. The cause was often that the author and the receiver were using two different character encodings. The situation was becoming chaotic for the interchange of data, particularly for the business world, and as a result a single international standard, Unicode, was created.

In Unicode, a unique number is assigned to every character, and this underlying number remains the same, no matter what platform, font, or program. For example, the Unicode character code for Greek final sigma is U+03C2, and this number for final sigma remains constant in Unicode-compliant products. With non-standard fonts, however, a given character may have been mapped to different character codes: final sigma in the non-Unicode font SPIonic has the character code U+006A and in the non-Unicode LaserGreek font it is U+007E. The discrepancy between non-standard vs. standardized (Unicode) fonts will cause problems when sending text.

The answer to the problem of reliably transmitting Greek text data is to use software and fonts that are based on the international character standard Unicode. As long as both the sender of a document and the recipient are using Unicode-compliant products, a Greek α should appear as an α in any electronic text document. As a result, Greek texts will be widely accessible to others on any platform and in any country, and the data will be more likely to remain usable over time. Unicode is also now the default standard for XML.

Core Concepts in Unicode

Understanding the basics of Unicode will help to explain why some characters are included and others are not. It may appear that Unicode is, at times, inconsistent in applying these core concepts. This is in part because Unicode absorbed older legacy character sets which may have contained characters that by current Unicode Technical Committee (UTC) policies are not considered eligible for inclusion (i.e., precomposed forms). However, had the UTC not included these characters, there would have been major interoperability issues between Unicode and all the preexisting data in those character encodings.

Unicode is plain text

Unicode is used for representing plain text. Plain text is a sequence of character codes:

The letters of the word “Greek” are represented in plain text by U+0047 U+0072 U+0065 U+0065 U+006B. Plain text is contrasted with “fancy” or “rich” text, which is plain text with additional information, such as formatting information (font size, styles [bold, italic], color, etc.). The advantage to using plain text is that it is standardized and universally readable, whereas fancy text may be proprietary or implementation-specific. Using plain text enables regular expressions or other search mechanisms to be used more easily: if one were to use the superscript “style” in place of an encoded superscript number, the styled superscript may be missed by searching processes.
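The idea of plain text as a sequence of character codes, and the point about searching, can be sketched in a few lines of Python (this example's choice of language). The sample strings are invented for illustration: one contains an encoded superscript two (U+00B2), the other is what remains of the same text if the superscript existed only as formatting.

```python
import re

word = "Greek"

# Plain text is just the sequence of character codes, one per letter.
codes = [f"U+{ord(ch):04X}" for ch in word]
print(" ".join(codes))

# An encoded superscript two (U+00B2) is part of the plain text itself.
encoded = "x\u00b2 + y\u00b2"

# If the superscript was only a formatting style, the plain text holds an
# ordinary digit 2 and nothing marks it as a superscript.
styled = "x2 + y2"

hits_encoded = re.findall("\u00b2", encoded)
hits_styled = re.findall("\u00b2", styled)
print(len(hits_encoded), len(hits_styled))
```

A search for the superscript character finds both occurrences in the first string and none in the second, which is the loss the paragraph above describes.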

Plain text is a key concept because Unicode only encodes plain text. If one were to strip away the additional formatting information, only plain text would remain; it conveys primary content. For example: a bold letter “b” is essentially a “b.” If one were to remove the bold formatting, the letter “b” would still convey the same basic meaning.

Characters, not glyphs

One of the most fundamental concepts in Unicode is that it encodes characters, not glyphs. Characters are defined as “the abstract representations of the smallest components of written language that have semantic value” (The Unicode Standard 4.0, p. 15), whereas glyphs are the physical manifestation of what appears on the printed page or on your monitor; they are the surface representations of abstract characters. A font is composed of glyphs, which are, in turn, mapped to the Unicode character codes. It is the glyphs that appear in the Unicode code charts, though these are only intended to be representative and not definitive.

Determining whether a particular letter or sign is a character—and eligible for Unicode—or a glyph can be difficult when working with historic texts, particularly if the corpus is limited or damaged.


Some key questions that are asked to help determine this are:

  1. Does the particular letter contrast with another in the same document, with a different meaning? If so, it is a character.
  2. Is its appearance predictable? If so, it may be a contextual variant, and not eligible for encoding. It is a glyph.
  3. Can the letter be interchanged with another letter/sign and still have the same meaning? In this context, it is a glyph.

No new precomposed forms

No new precomposed forms will be accepted by the Unicode Technical Committee unless a very convincing argument is made to the contrary. A precomposed form is a character that can be broken down (“decomposed”) into a series of characters. For example, a student may wish to write an alpha with both a breve and an acute accent and not find the needed combination in the Unicode code chart. This combination is not included as a single character because it can already be created by using alpha (U+03B1) followed by a combining breve (U+0306) and a combining acute accent (U+0301). Note that the set of precomposed Greek forms in the Greek Extended block was accepted as part of the merger between Unicode and ISO 10646, and should not be used as a rationale for requesting new precomposed forms.

No variants

Unicode covers characters, not variants. Variants should be chosen through markup or a font. Note: Unicode includes a provision for a “Variation Selector.” However, the assignment of Variation Selectors is overseen by the Unicode Technical Committee, and proposals, like those used for new characters, need to be submitted to the UTC demonstrating why the variation needs to be made in plain text.

No idiosyncratic characters

Unicode does not include “idiosyncratic, personal, novel, or private-use characters” (The Unicode Standard 4.0, p. 2). Hence creations of a single author or editor would not be eligible for encoding, except in rare circumstances.

Unify if possible

If a particular symbol/letter is identical to an already encoded character, it may be unified with it. In other words, only one character is included. This is often the case for punctuation, which may be shared in many writing systems across temporal and geographical boundaries. Duplication would cause confusion on the part of the user, who would not be able to tell the difference between the two. For example, the Attic acrophonic symbol for ten is a capital delta, which is already encoded (U+0394), hence there was no need to add a new capital delta (and indeed this symbol originally derives from the letter itself).

Unicode is not meant to cover fine palaeographic distinctions.

Since Unicode conveys the basic content of a character (the abstract letter beta, for example), it is not intended to capture fine palaeographic detail, which can be handled by providing a scanned image or via the font or markup (which can specify a particular glyph).

Organization of The Unicode Standard: Code Charts

Unicode is organized into blocks of characters, often grouped by scripts (Greek, Cyrillic, etc.) or by similar features (General Punctuation, Combining Diacritical Marks, Geometric Shapes, Currency Symbols, etc.). The code charts are all available in the “Code Charts” section on the Unicode Consortium website and in The Unicode Standard 4.0 book.

There are two parts to the code charts:

(a) The code chart proper with representative pictures—or glyphs—of the underlying characters, and (b) a names list. The character codes are given without the “U+” prefix.

The entries on the names list often include additional information for users:

  1. The bullet (•) precedes an informative note, which gives additional information.
  2. The equal sign (=) provides an alternative name.
  3. The arrow (→) indicates a cross-reference: either the glyphs are very close or identical but the characters are not the same, or other linguistic relationships are being indicated.
  4. The “identical to” sign (≡) is used to show canonical mapping; the characters that appear after the “identical to” sign can be interchanged unconditionally with the other character listed without any loss of information.

    Example:

    1F10 ἐ GREEK SMALL LETTER EPSILON WITH PSILI ≡ 03B5 ε 0313 ’

  5. The “almost equal to” sign (≈) is used to show compatibility mapping, usually to earlier standards; additional formatting information may be contained within angle brackets, such as <super> for superscripts.

    Example:

    03D0 ϐ GREEK BETA SYMBOL ≈ 03B2 β greek small letter beta

    The character U+03D0 should only be used in mathematical formulas and not in Greek text.
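The two kinds of mapping in the names list can also be inspected programmatically. The sketch below uses Python's standard `unicodedata` module (this example's choice of tool); the helper `mapping_kind` is a hypothetical name introduced here. It relies on the convention in the Unicode Character Database that a compatibility mapping begins with a tag in angle brackets, whereas a canonical mapping does not.

```python
import unicodedata

def mapping_kind(ch):
    """Classify a character's decomposition mapping.

    Returns a pair (kind, mapping), where kind is "canonical",
    "compatibility", or "none".
    """
    decomposition = unicodedata.decomposition(ch)
    if not decomposition:
        return "none", ""
    if decomposition.startswith("<"):
        # Compatibility mappings carry a formatting tag in angle brackets.
        return "compatibility", decomposition
    return "canonical", decomposition

# Epsilon with psili, the beta symbol, and plain epsilon for comparison.
for ch in ("\u1f10", "\u03d0", "\u03b5"):
    kind, mapping = mapping_kind(ch)
    print(f"{ord(ch):04X} {unicodedata.name(ch)}: {kind} {mapping}")
```

Here U+1F10 reports a canonical mapping to 03B5 0313, U+03D0 reports a compatibility mapping to 03B2, and plain epsilon has no mapping at all.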

Greek in The Unicode Standard

Greek is found primarily in two blocks: in the Greek block (U+0370 to U+03FF) and the Greek Extended block (U+1F00 to U+1FFF). Additional characters are currently being voted on and will likely be added to three new blocks: a new Supplemental Punctuation block, a block for ancient Greek numbers, and a third block devoted to ancient Greek musical notation. One can also use codepoints from other blocks. (Blocks are simply a convenience for documentation or sometimes historical; characters from any block can always be used together.)
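Since a block is simply a range of character codes, finding out which of the two Greek blocks a character belongs to is a matter of comparing numbers. A minimal Python sketch follows; the table `GREEK_BLOCKS` and the function `greek_block` are hypothetical names for this illustration, and only the two ranges given above are included.

```python
# The two block ranges cited in the text, as inclusive (first, last) pairs.
GREEK_BLOCKS = {
    "Greek": (0x0370, 0x03FF),
    "Greek Extended": (0x1F00, 0x1FFF),
}

def greek_block(ch):
    """Return the name of the Greek block containing ch, or None."""
    code = ord(ch)
    for block_name, (first, last) in GREEK_BLOCKS.items():
        if first <= code <= last:
            return block_name
    return None

print(greek_block("\u03bc"))  # small letter mu
print(greek_block("\u1f11"))  # epsilon with rough breathing
print(greek_block("\u0314"))  # combining rough breathing, from another block
```

The combining rough breathing returns None because it lives in the Combining Diacritical Marks block, a reminder that a polytonic Greek text may legitimately draw on characters outside the two Greek blocks.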

The Greek block (U+0370 to U+03FF) was originally based on the Greek monotonic standard, though it now includes Coptic and a few other Greek letters and symbols. The characters in this block can be used with the Combining Diacritical Marks to create the necessary accent and breathings combinations for polytonic Greek.

When Unicode merged with ISO 10646, a separate block of characters, which had been devised for use with polytonic Greek, was adopted and became the Greek Extended block. In the names list for this block, most characters are followed by an “identical to” sign (≡), which signifies that the Greek Extended character is identical to the base letter (from the Greek block) plus a combining diacritic (from the Combining Diacritical Marks block). For example, an epsilon with rough breathing is assigned the codepoint U+1F11 in the Greek Extended block but is identical to U+03B5 (epsilon) plus U+0314 (rough breathing).

Users can choose to use either the base letter from the Greek block plus a combining diacritic, or the “precomposed” form from the Greek Extended block. Search operations should be able to find both, for they are canonical equivalents of one another. However, at present some fonts and browsers handle the combining diacritics better than others. For a recent review of the situation, see http://www.tlg.uci.edu/help/UnicodeTest.html; http://www.tlg.uci.edu/help/Help.fonts.html; and http://www.tlg.uci.edu/~tlg/help/Help6.html.
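Canonical equivalence can be checked, and searching made insensitive to the choice between the two forms, by normalizing text before comparing it. The following is a minimal sketch using Python's standard `unicodedata.normalize` (this example's choice of tool); the helper `contains` and the sample word are introduced here for illustration only.

```python
import unicodedata

precomposed = "\u1f11"      # epsilon with rough breathing, Greek Extended block
combining = "\u03b5\u0314"  # epsilon followed by combining rough breathing

# As raw sequences of character codes the two spellings differ ...
raw_equal = precomposed == combining

# ... but they normalize to one another: NFC composes, NFD decomposes.
same_nfc = unicodedata.normalize("NFC", combining) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == combining

def contains(text, query):
    """Substring search that treats canonically equivalent spellings alike."""
    def decompose(s):
        return unicodedata.normalize("NFD", s)
    return decompose(query) in decompose(text)

# A word typed with the precomposed form, searched for with the combining form.
found = contains("\u1f11\u03c0\u03c4\u03b1", "\u03b5\u0314\u03c0")
print(raw_equal, same_nfc, same_nfd, found)
```

One limitation of so simple a search is that, after decomposition, a query for a bare letter will also match that letter when it carries diacritics; whether that is desirable depends on the project.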

For example, the following fonts reject combining diacritics: Aristarcoj, Palatino Linotype, Porson. On the other hand, Lucida Sans Unicode will not handle precomposed forms (although Lucida Grande will).

Not all characters needed for Greek are currently encoded; some are only now being voted on by the ISO standards committee. For tips on what to do in the meantime, see below.

There are cases where there may be several different options for handling the same character. In such cases, a “best practices” guide is advisable so that there is consistency across projects.

How to Get Unicode to Work for ancient Greek

In order for Greek to be sent seamlessly across computers, both sender and receiver should have Unicode-enabled products. For up-to-date details on which products are best for Greek and how to make any adjustments for the optimal viewing of Unicode on the Web, see the TLG webpages by Nick Nicholas, http://www.tlg.uci.edu/~tlg/help/UnicodeTest.html and http://www.tlg.uci.edu/~tlg/help/Help6.html. General information on Unicode-enabled products is found on Alan Wood’s website at http://www.alanwood.net/unicode/index.html, and at the discussion on the Unicode website: http://unicode.org/onlinedat/products.html.

Unicode-enabled products needed for typing, sending, viewing Greek on the Web include:

  1. A recent operating system (Mac OS 9.2, Mac OS X, Windows CE, NT, 2000, XP, GNU/Linux with glibc 2.2.2 or newer).
  2. A recent browser (Internet Explorer, Safari, OmniWeb, Mozilla/Netscape, Opera).
  3. A Unicode text editor (Word 2000, 2002, Unipad, Apple “TextEdit”, NisusWriter Express, Mellel, and for layout and design, Adobe InDesign).
  4. An input mechanism (a keyboard [downloadable from the Web or it can come bundled with a computer], the “insert symbol” mechanism, Keyman [configurable keyboard for the PC]).
  5. A Unicode-enabled font installed (see the list and detailed review of Greek fonts on the TLG website: http://www.tlg.uci.edu/~tlg/help/UnicodeTest.html)

Specific Questions for Users

a. Finding a character you need

See if the character is already present in Unicode:

  1. If you know the Beta Code number, check the TLG Quick Beta Guide for the Unicode value at: http://ptolemy.tlg.uci.edu/~tlg/quickbeta.pdf.
  2. Check the Greek blocks (etc.) on the Unicode Consortium website. Besides scanning the various blocks, the Character Names Index on the Unicode website can be consulted: http://www.unicode.org/charts/charindex.html. Another useful resource is the set of Collation Charts on the Unicode website, which graphically group similar characters (separating the differences between them with colors): http://www.unicode.org/charts/collation/. A separate Greek collation chart is linked to this page. In looking through Unicode charts and when using “insert Symbol” or font charts, be careful of “spoof buddies.” These are characters that might inadvertently be used in place of the proper Greek ones, causing problems for searching and display. An example is U+0413 Г, which is the Cyrillic Capital GHE, but which closely resembles a capital Greek gamma.
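A passage that is supposed to be entirely Greek can be checked for such look-alikes by examining the formal name of each letter. The sketch below is one possible approach in Python, not an established tool; the function `non_greek_letters` is a hypothetical helper, and the test of whether a character name begins with “GREEK” is an assumption made for this illustration.

```python
import unicodedata

def non_greek_letters(text):
    """List (position, character, name) for letters not named as Greek.

    Combining marks, punctuation, and spaces are ignored because
    str.isalpha() is false for them.
    """
    suspects = []
    for position, ch in enumerate(text):
        if ch.isalpha():
            name = unicodedata.name(ch, "UNNAMED")
            if not name.startswith("GREEK"):
                suspects.append((position, ch, name))
    return suspects

# The word for "earth" typed with Cyrillic GHE (U+0413) by mistake,
# followed by eta with perispomeni (U+1FC6).
sample = "\u0413\u1fc6"
for position, ch, name in non_greek_letters(sample):
    print(position, f"U+{ord(ch):04X}", name)
```

Because the check flags every letter whose name is not Greek, it will also report Latin letters and the Coptic letters that share the Greek block, so it is suited only to text expected to be wholly Greek.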

See if the character is in the process of being proposed:

  1. Check on Unicode’s Proposed New Characters page: http://unicode.org/alloc/Pipeline.html.
  2. Ask on the Unicode email list if it is being proposed (directions on how to subscribe are available at: http://unicode.org/consortium/distlist.html).
  3. Check TLG Unicode proposals (http://www.tlg.uci.edu/Uni.prop.html) and, if not found, ask Maria Pantelia at the TLG (mcpantel@uci.edu).

If you find a character that is missing, it is advisable to work with the TLG to get it proposed. The Script Encoding Initiative (http://www.linguistics.berkeley.edu/~dwanders) can also assist; contact Deborah Anderson (dwanders@socrates.berkeley.edu). General guidelines on producing a Unicode proposal are located at: http://www.unicode.org/pending/proposals.html.

b. Once I find the Unicode character, how can I find a font that includes it?

Many handy utilities can be used to get information on the fonts installed on your system and can show the range of Unicode characters that are covered in a particular font. For a description of these utilities, see:

http://www.alanwood.net/unicode/utilities.html.

Another way is to determine which range (block) you need, then check the listing of fonts (arranged by Unicode range and platform) on Alan Wood’s website http://www.alanwood.net/unicode/sitemap.html > “fonts.” Alan Wood’s website also includes test pages which can be used to check the coverage of the font selected as the default browser font.

c. How can I use a character that is not yet in Unicode?

Options on how to handle characters that are not yet in Unicode include:

  1. Use FontLab or work with a font foundry to create a font with the needed glyph(s) in the interim, using the Private Use Area (PUA), an area set aside for testing and for the private interchange of characters. This option may present problems for exchanging documents.
  2. Use markup. For practical tips on markup, see Martin Duerst’s “Missing Characters and Glyphs” page at http://www.w3.org/International/O-MissCharGlyph.
  3. A TEI working group on character sets was established in 2002 to revise the relevant sections of the TEI Guidelines, but nothing has yet been finalized. The latest drafts are available at: http://www.tei-c.org/Activities/CE/ (See in particular CE W 06: “Representation of non-standard characters and glyphs”).
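Because Private Use Area characters have no agreed meaning outside a private arrangement, it can be useful to locate them in a document before sending it to someone else. The following Python sketch relies on the general category “Co” (private use) reported by the standard `unicodedata` module; the function name `private_use_chars` and the sample string are hypothetical.

```python
import unicodedata

def private_use_chars(text):
    """Return (position, code) for every private-use character in text."""
    return [
        (position, f"U+{ord(ch):04X}")
        for position, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" = Other, private use
    ]

# Alpha, one private-use character (U+E000), then beta.
sample = "\u03b1\ue000\u03b2"
print(private_use_chars(sample))
```

A report of this kind makes it easier to document which private-use codes a project relies on and to replace them once the characters are formally encoded.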

d. What about my data that is in a non-Unicode font?

If possible, upgrade your documents to Unicode, converting to a Unicode font.

  1. You can use Sean Redmond’s font converter for converting from BetaCode and some Greek fonts to Unicode: http://www.jiffycomp.com/smr/unicode/
  2. If the font you are using is without a converter, create one and ask for it to be hosted on a publicly available website (Stoa, Unicode, TLG, etc.) More information on conversion is available from SIL at: http://scripts.sil.org/cms/scripts/ > “Computers and Writing Systems.”
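At its core, a converter for a legacy font is a table that maps each code used by the old font to the proper Unicode character. The Python sketch below shows the idea only. The single entry taken from this article is final sigma at U+006A (the letter “j”) in SPIonic; the other entries are assumptions added for illustration and must be checked against the actual font, and a real converter would also have to handle accents, breathings, and capitals.

```python
# Illustrative table only. "j" -> final sigma is the SPIonic value cited in
# this article; the remaining entries are ASSUMED for the sake of the example.
LEGACY_TO_UNICODE = {
    "j": "\u03c2",  # final sigma
    "a": "\u03b1",  # alpha (assumed)
    "g": "\u03b3",  # gamma (assumed)
    "l": "\u03bb",  # lambda (assumed)
    "o": "\u03bf",  # omicron (assumed)
}

def convert(text, table=LEGACY_TO_UNICODE):
    """Convert legacy-font text; also return the characters left unmapped."""
    converted = []
    unmapped = set()
    for ch in text:
        if ch in table:
            converted.append(table[ch])
        else:
            converted.append(ch)  # pass through unchanged
            if not ch.isspace():
                unmapped.add(ch)  # report it so the table can be extended
    return "".join(converted), unmapped

result, leftovers = convert("logoj")
print(result, leftovers)
```

Reporting the unmapped characters, rather than silently passing them through, makes gaps in the table visible before converted documents are circulated.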

e. What if the font I use is missing the symbols (etc.) I need or the font is defective?

If specific glyphs are missing, you can use Scalable Vector Graphics (SVG) or Web fonts. See Martin Duerst’s “Missing Characters and Glyphs” page at http://www.w3.org/International/O-MissCharGlyph. If the Unicode font does not have all the needed symbols, ligatures, or has errors in the glyphs, contact the company that makes the font and let them know. Posting a description of the problems on a publicly hosted website for Classicists would also be helpful (or sending a comment to Nick Nicholas, opoudjis@opoudjis.net, so he can post comments on his TLG fonts page).

f. How should I handle variants?

Options on handling variants include:

  1. Use a font with the shapes needed;
  2. Use markup (and see Martin Duerst’s “Missing Characters and Glyphs” page at http://www.w3.org/International/O-MissCharGlyph for further information). Also, refer to any available guidelines (e.g., TLG Quick Beta) and be sure to document how you handled the variants.

g. How can I help in the effort to improve the situation for Greek and Unicode?

Understanding (and having faith) that Unicode will eventually make it possible for Hellenists to easily type, view, and transmit Greek data electronically without error is essential. Much can be gained if scholars promote the use of Unicode-compliant products within campus departments, in their professional societies, and in scholarly publications (including PDFs), and discourage colleagues from the continued use of non-Unicode fonts. Also, since the Unicode Consortium has heavy industry membership and support (which explains the relative success of Unicode), those groups who are interested in the “lesser-known” scripts such as ancient Greek need to join together, become members of the Unicode Consortium, and lobby more actively on behalf of their needs amongst computer industry members.

Bibliography and Useful Reference Guides

 

N. Nicholas, Unicode Resources ( http://ptolemy.tlg.uci.edu/~opoudjis/unicode/ )

N. Nicholas, Greek Unicode Issues ( http://ptolemy.tlg.uci.edu/~opoudjis/unicode/unicode.html )

D. Perry, Word Processing in Classical Languages: Latin, Germanic, Greek ( http://scholarsfonts.net/ )

P. Rourke, Unicode Polytonic Greek for the World Wide Web ( http://www.stoa.org/unicode )

Unicode Consortium, The Unicode Standard 4.0 (Reading, MA, 2003) 

To refer to this please cite it in this way:

Deborah Anderson, “Preliminary Guidelines to Using Unicode for Greek,” C. Blackwell, R. Scaife, edd., Classics@ volume 2: C. Dué & M. Ebbott, executive editors, The Center for Hellenic Studies of Harvard University, edition of April 3, 2004.

This work is licensed under a Creative Commons License.