1. Abstract
2. Quick View to Unicode
3. Overview of CJKV
4. What is XSLT?
5. When CJKV Meets 4XSLT...
6. Feedback
This document describes how to handle CJKV text within XML/XSL documents by using Python. We'll briefly discuss Unicode, CJKV, and XSLT, and then delve into the details of processing CJKV text with 4XSLT.
The Unicode Standard is published by the Unicode Consortium. Unicode is a universal character set intended to encompass the major scripts of the earth in an easy and unified way. Every Unicode character is assigned a character number 16 bits (two bytes) wide. For example, the character a is represented by the two-byte value 0x0061, not the single byte 0x61 as in ASCII. With a single 16-bit character set, the Unicode Standard can represent more than 65,000 characters without activating its extension mechanism. (The upper limit is 65,536 characters, since 2^16 = 65,536.)
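If you have Python 1.6 or later at hand, you can observe these code values directly; a minimal sketch (Python 2.x syntax, matching the listings later in this document):

print hex(ord(u'a'))        # 0x61, the same value as in ASCII
print repr(u'a')            # u'a', a Unicode string object
print hex(ord(u'\u6e2c'))   # 0x6e2c, a CJK ideograph occupying one 16-bit code value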
Where the World Wide Web (WWW) is concerned, Unicode is involved everywhere: HTML 4.0 and XML 1.0 both claim Unicode support. For this reason, there is no avoiding Unicode nowadays, as it already underlies real-world applications through HTML and XML. Furthermore, major desktop operating systems, like Windows and the Macintosh, are aware of Unicode, and even Linux is working hard toward the Unicode Standard. In addition, current major programming languages support Unicode to varying degrees, among them Java, C/C++, Python, ECMAScript, Tcl, and Perl. From version 1.6, Python supports Unicode, too.
Before we can move on to the next section, we need to know about Unicode encodings and transcoding issues.
The default encoding form initially adopted by the Unicode Standard is the one that maps every Unicode character to a unique 16-bit code value. As previously mentioned, the character a is represented as 0x0061 in the Unicode Standard, not 0x61 as in ASCII. This is exactly the default encoding form the Unicode Standard initially uses. In addition to it, however, two more Unicode encoding forms are widely applied around the world, namely UTF-16 and UTF-8.
UTF-16 is short for "UCS Transformation Format for Planes of Group 00." UCS is shorthand coined by ISO/IEC 10646, and stands for "Universal Multiple-Octet Coded Character Set." ISO/IEC 10646 defines a four-octet coded character set, known as UCS-4. Besides UCS-4, it also defines a two-octet coded character set called UCS-2, which is equivalent to the Unicode coded character set. The term octet is likewise defined by ISO to refer to the 8-bit unit we usually think of as one byte. UCS-4 can contain 2,147,483,648 characters, from the equation 128x256x256x256 = 2,147,483,648; the 128 arises because the most significant bit of the first byte in UCS-4 is not used.
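As a quick sanity check of these capacities, in plain Python arithmetic:

print 128 * 256 * 256 * 256   # 2147483648 code points in UCS-4
print 256 * 256               # 65536 code points in UCS-2 / the BMP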
Conceptually, UCS-4 consists of 128 groups; each group contains 256 planes, each plane owns 256 rows, and each row is divided into 256 cells. Imagine 128 buildings out there, each building with 256 stories, and each story with 256x256 chairs. With this picture in mind, Plane 00 of Group 00 (both numbers hexadecimal) is defined as the Basic Multilingual Plane (BMP). The chairs in the BMP total 256x256 = 65,536, equivalent to UCS-2 (256x256) and to the Unicode Standard (2^16). In the Unicode world, the chairs are called code points.
From this discussion, you can see that the Unicode coded character set and UCS-2 use identical 16-bit units to represent characters and denote the same BMP. Within the BMP there is an allocation area named Surrogate, or S-zone in ISO/IEC 10646 terms. Without the Surrogate area, UCS-2 is just UCS-2; when it is used with the Surrogate area, UCS-2 turns into UTF-16, which can represent characters in Planes 01 through 10 (hexadecimal) of Group 00 in addition to those in the BMP. This is the major difference between UCS-2 and UTF-16: while UCS-2 can only address characters in the BMP, UTF-16 can represent not only characters in the BMP but also characters in the next 16 planes, by activating the Surrogate area.
In other words, if your application conforms to UCS-2, it must disallow the use of any code point in the Surrogate area; conversely, if it conforms to UTF-16, it must interpret Surrogate code points in pairs. As you may realize, UTF-16 comes from ISO/IEC 10646, too. The Unicode Standard initially named the mechanism that addresses characters outside the BMP Surrogates; Unicode 2.0 used the term UTF-16 only to describe the relationship between the Unicode Standard and ISO/IEC 10646, but with the release of Unicode 3.0, the standard states that Unicode text is encoded as UTF-16. Before that, one could only say that Unicode characters were encoded as ordered 16-bit code values or the like. Although UTF-16 is gradually becoming the standard Unicode encoding scheme, much software still cannot properly deal with 16-bit characters, and this is why UTF-8 came about.
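The surrogate mechanism itself is simple arithmetic. Before moving on to UTF-8, here is a minimal sketch of how a code point beyond the BMP maps onto a high/low surrogate pair in UTF-16 (the function name is ours, made up for illustration; the algorithm follows the Unicode Standard):

def surrogate_pair(codePoint):
    # valid only for characters outside the BMP (Planes 01-10 of Group 00)
    assert 0x10000 <= codePoint <= 0x10FFFF
    v = codePoint - 0x10000          # 20 bits remain
    high = 0xD800 + (v >> 10)        # top 10 bits -> high surrogate (0xD800-0xDBFF)
    low  = 0xDC00 + (v & 0x3FF)      # low 10 bits -> low surrogate (0xDC00-0xDFFF)
    return high, low

high, low = surrogate_pair(0x10300)
print hex(high), hex(low)            # prints: 0xd800 0xdf00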
UTF-8 stands for "UCS/Unicode Transformation Format, 8-bit form." Each Unicode character encoded in UTF-8 becomes a sequence with a variable length ranging from one byte to four bytes. The following table shows the ranges of the different UTF-8 encoded lengths in the BMP.
From | To | UTF-8 Length | Allocation Areas |
0x0000 | 0x007F | one byte | General Scripts |
0x0080 | 0x07FF | two bytes | General Scripts |
0x0800 | 0xD7FF | three bytes | General Scripts, Symbols, CJK Phonetics and Symbols, CJK Ideographs, Yi Syllables, and Hangul |
0xD800 | 0xDFFF | four bytes (as surrogate pairs) | Surrogate |
0xE000 | 0xF8FF | three bytes | Private Use |
0xF900 | 0xFFFF | three bytes | Compatibility |
You can see that the leading range from 0x0000 to 0x007F corresponds exactly to the ASCII table (0-127 in decimal) when encoded as one-byte UTF-8 sequences. This is why UTF-8 works well with 8-bit systems: its first 128 characters are identical to ASCII. It also means that when your text contains mostly Latin characters, UTF-8 is a more space-efficient encoding scheme than the Unicode Standard's default, cutting the uniform 16-bit units down to 8 bits.
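You can verify these lengths with Python's built-in UTF-8 codec; a small sketch (Python 2.x):

print len(u'a'.encode('utf-8'))        # 1 byte:  U+0061 falls in 0x0000-0x007F
print len(u'\u00e9'.encode('utf-8'))   # 2 bytes: U+00E9 falls in 0x0080-0x07FF
print len(u'\u6e2c'.encode('utf-8'))   # 3 bytes: U+6E2C falls in 0x0800-0xFFFF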
If you are a CJK speaker, however, adopting UTF-8 will enlarge your Unicode text, because each 16-bit CJK character requires three bytes when encoded as a UTF-8 sequence. When processing CJK text, the more suitable encoding scheme would be UTF-16 if you care about text size (which also affects performance). However, UTF-8 is the default, or the only, encoding that most XML processors support; thus, for now, the practical way to make CJK characters work with XML is probably UTF-8. It is therefore necessary to convert CJK encodings into UTF-8 before feeding them to XML processors.
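To make the size difference concrete, here is a rough comparison using the six-ideograph test string that appears later in List 2-1 (note, as an aside, that Python's 'utf-16' codec prepends a two-byte byte-order mark):

uniObj = u'\u9019\u662F\u4E00\u500B\u6E2C\u8A66'   # "This is a test" in Chinese
print len(uniObj.encode('utf-8'))    # 18 bytes: three per ideograph
print len(uniObj.encode('utf-16'))   # 14 bytes: two per ideograph, plus the BOM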
Converting one encoding into another is referred to as transcoding. In a Unicode context, we should treat Unicode as the central hub: when moving from one encoding to another, first transcode the source encoding to Unicode, and then transcode Unicode to the target encoding; the reverse direction works the same way. If you need to deal with multiple encodings, using Unicode as an intermediate encoding between source and target is much simpler and more efficient than converting each source directly into each target. The reason is the same as one of the benefits of XML as an information-exchange markup language: it greatly reduces the number of converters needed from one format to another.
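Expressed as code, the pivot idea amounts to a two-line helper; a minimal sketch (the function name is ours, and the codec names must be ones your installed codecs actually register):

def transcode(data, sourceEncoding, targetEncoding):
    uniObj = unicode(data, sourceEncoding)   # source encoding -> Unicode
    return uniObj.encode(targetEncoding)     # Unicode -> target encoding

With n encodings, you only need n decoders into Unicode and n encoders out of it, instead of n x (n-1) direct converters.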
The Python i18n-sig group develops transcoding tools for East Asian languages, including Chinese, Japanese, and Korean. To download these tools, go to ftp://python-codecs.sourceforge.net/pub/python-codecs/. If you need to cope with XML in different East Asian encodings using Python, you are not alone. At the end of the next section, we'll examine these tools with some illustrations. First, however, we should gain a basic understanding of CJKV, so that you won't be confused about why things are the way they are when the tools are demonstrated. Finally, for more precise information about the Unicode Standard, visit the Unicode web site at http://www.unicode.org/ or purchase the official book written by the Unicode Consortium.
CJKV is shorthand for Chinese, Japanese, Korean, and Vietnamese. History tells us that the writing systems used by Japanese, Korean, and Vietnamese are partly derived from that of Chinese, and on modern computer systems the common encoding schemes used for them are to some degree similar to one another. This is the major reason these four can be mentioned together when talking about information processing.
As we all know, a character is represented as a numeric value in computer systems. In the ASCII/IBM character set, one character equals one byte. In CJKV, however, one so-called character is represented by more than one byte, usually two. The processing of CJKV text is therefore quite distinct from that of Latin characters. Since a two-byte unit equals 16 bits, it provides 65,536 (256x256) unique numeric values; it helps to picture 256x256 (rows x cells) as a two-dimensional array. For this reason, the upper limit for Chinese, Japanese, Korean, or Vietnamese characters is 256x256 code points each. Although the ultimate code space runs to the 256x256 scale, the practical code space is normally a subset of it. For example, the code space of Big Five (BIG5) is a 94x157 array containing 14,758 code points, while ISO-2022-JP is built on a 94x94 code space with a capacity of 8,836 code points.

It is also feasible to include one-byte Latin characters within a two-byte CJKV encoding system. Many CJKV locale encodings mix one- and two-byte characters together in the same text stream, which is a very efficient way to represent CJKV text alongside Latin characters. We call such a scheme a variable-length encoding; examples include Big Five (Taiwan), GBK (China), Johab (Korea), Shift-JIS (Japan), the EUC variants, UTF-8 (Unicode), and so forth. To determine when to switch back and forth between one- and two-byte characters, variable-length encodings use the status (0/1) of the eighth bit of a byte to decide whether it starts a two-byte character. In Big Five, for example, the first byte ranges from 161 to 254, and the second byte from 64 to 126 and from 161 to 254. Similarly, in Shift-JIS (JIS X 0208-1997), the first byte ranges from 129 to 159 and from 224 to 239, and the second byte from 64 to 126 and from 128 to 252. For this reason, if you are writing a program to manipulate Big Five characters, your decision logic can simply test whether a byte's numeric value falls inside 161-254 (see the sketch below). If so, it begins a two-byte Big Five character and the subsequent byte should be treated as the second half of this Chinese character; if not, it is something else, perhaps a Latin character. You can also see that, without a higher-level protocol such as HTML or XML, it is almost impossible to tell which part of a document is Big Five and which part is Shift-JIS when characters from both appear together, because the first-byte ranges of Big Five and Shift-JIS overlap.
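The decision logic just described can be written down directly; a minimal sketch (our own illustrative function, Python 2.x) that splits a Big Five byte string into one- and two-byte characters:

def splitBig5(s):
    chars = []
    i = 0
    while i < len(s):
        if 161 <= ord(s[i]) <= 254 and i + 1 < len(s):
            chars.append(s[i:i+2])   # lead byte in 161-254: a two-byte Big Five character
            i = i + 2
        else:
            chars.append(s[i])       # otherwise a one-byte (e.g. Latin) character
            i = i + 1
    return chars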
In HTML, we can use the lang attribute to indicate the language of content in an HTML document, and in XML the xml:lang attribute plays the same role; the character encoding itself is declared separately, for example in the XML declaration's encoding attribute or in an HTTP/META charset parameter. The official names for available encodings come from the Internet Assigned Numbers Authority (IANA). All IANA-registered encoding names are case-insensitive, so when you match these names in XML/HTML you need not, in theory, care about upper or lower case. For more information about IANA, visit http://www.iana.org/.
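For example, a document mixing languages might be tagged like this hypothetical fragment (the element name para is our own illustration; the values are language tags, not encodings, per the note above):

<para xml:lang="en">A single-log bridge</para>
<para xml:lang="zh-TW">獨木橋</para>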
Now, let's briefly examine the code conversion tools developed by the Python i18n-sig group. It goes without saying that you must install the encoding conversion tools before you can use them. List 2-1 is a simple example of Big Five character stream conversion, named big5test.py.
List 2-1 Big5test.py for Traditional Chinese Demonstration
from encodings import big5_tw

big5Str = "這是一個測試"                         # Big Five source encoding string
print big5Str
uniObj = unicode(big5Str, "big5_tw")             # intermediate encoding
utf8 = uniObj.encode("utf-8")                    # target encoding
# we can do something with utf8 here
print repr(uniObj)
print repr(utf8)
back2UniObj = unicode(utf8, "utf-8")             # back to the Unicode Standard
back2Big5Str = back2UniObj.encode("big5_tw")     # back to Big Five
print repr(back2UniObj)
print back2Big5Str
After execution, big5test.py outputs:
這是一個測試
u'\u9019\u662F\u4E00\u500B\u6E2C\u8A66'
'\351\200\231\346\230\257\344\270\200\345\200\213\346\270\254\350\251\246'
u'\u9019\u662F\u4E00\u500B\u6E2C\u8A66'
這是一個測試
This clearly demonstrates code conversion amongst Big Five, Unicode, and UTF-8. In this case, the source encoding is Big Five, the target encoding is UTF-8, and the intermediate encoding is the Unicode Standard. What we actually want to work with is the UTF-8 encoded string. After doing something with it, we can transcode it back to a fixed-length Unicode stream, and then to a Big Five string. "這是一個測試" means "This is a test" in English.
As a further example, the usage of the Shift-JIS conversion tool is basically the same. List 2-2 illustrates Japanese Shift-JIS encoding conversion, named sjistest.py.
List 2-2 Sjistest.py for Japanese Demonstration
from encodings import shift_jis

sjisStr = "安全に使える"                          # Shift-JIS encoded source string
print sjisStr
uniObj = unicode(sjisStr, "shift_jis")
utf8 = uniObj.encode("utf-8")
print repr(uniObj)
print repr(utf8)
back2UniObj = unicode(utf8, "utf-8")
back2SjisStr = back2UniObj.encode("shift_jis")
print repr(back2UniObj)
print back2SjisStr
After running sjistest.py, it outputs:
安全に使える
u'\u5B89\u5168\u306B\u4F7F\u3048\u308B'
'\345\256\211\345\205\250\343\201\253\344\275\277\343\201\210\343\202\213'
u'\u5B89\u5168\u306B\u4F7F\u3048\u308B'
安全に使える
Comparing big5test.py with sjistest.py, you can see that they are almost identical in structure except for the chosen encodings. "安全に使える" means "can be used safely" in English.
For more information about CJKV processing, see CJKV Information Processing, written by Ken Lunde, or the Unicode Standard.
XSLT stands for Extensible Stylesheet Language Transformations, a transformation language capable of transforming the structure of XML documents by providing a set of template rules and working together with other XML technologies, such as XPath and XLink. XSLT belongs to XSL; XSL stands for Extensible Stylesheet Language and is composed of a formatting language built around formatting objects (FO) as well as a transformation language, which is XSLT itself.
Although XML is simple enough for humans to read and write, it is almost impossible for people, for common software at the presentation layer, or for other consumers to use XML directly in its primitive structure. XML is usually transformed into something else for further processing, such as display or input to other programs. For now, the most common use of XSLT in Web information systems is probably transforming XML into HTML for display in Web browsers.
Building on this recognition, you can see that even though XSL includes both the XSLT and FO languages, this doesn't mean you must first use XSLT to transform XML into something else and then use FO to render the output. A rendering engine is embedded in the Web browser: when you output HTML with XSLT from XML, the browser takes charge of interpreting the HTML elements and displaying them in its window. In short, you can use XSLT alone.
As we all know, XML insulates data from presentation, serving as the standard way to exchange information on the Internet for communication amongst software. The capability of XSLT to move XML representations from one structure to another makes it a powerful component of XML-based applications, because the information exchanged can be produced by XSL transformations from the XML source. And it is not surprising that information for exchange is primarily generated in XML.
To transform an XML document, you write a related XSL document and put a special processing instruction (PI) into the XML document to associate it with that XSL document. For example, here is a simple XML document named demo.xml:
List 3-1 demo.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <?xml-stylesheet type="text/xsl" href="demo.xsl"?> <hi>Hello! XML!</hi>
When an XSLT processor reads this XML document, the appearance of the PI <?xml-stylesheet ...?> tells it that it must transform this XML document according to the template rules defined in demo.xsl, the stylesheet specified by the href attribute. The XSL document, demo.xsl, is shown in List 3-2.
List 3-2 demo.xsl
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="/"> <HTML> <HEAD><TITLE><xsl:value-of select="hi"/></TITLE></HEAD> <BODY> <H1><xsl:value-of select="hi"/></H1> </BODY> </HTML> </xsl:template> </xsl:stylesheet>
After being processed by the XSLT processor, the generated HTML output should look like List 3-3.
List 3-3 The HTML Output Generated by an XSLT Processor
<HTML>
  <HEAD>
    <META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=UTF-8'>
    <TITLE>Hello! XML!</TITLE>
  </HEAD>
  <BODY>
    <H1>Hello! XML!</H1>
  </BODY>
</HTML>
Yes, this was produced from the command line by running the Python XSLT processor named 4XSLT, released by Fourthought, Inc. For more information about 4XSLT, visit http://4Suite.org/. We won't say much more about the general use of 4XSLT.
You can see that it uses the <META> element to specify UTF-8 as the output encoding. In both demo.xml and demo.xsl we explicitly set the encoding to UTF-8. In fact, if we omitted the encoding attribute in the XML declaration line, the encoding would still default to UTF-8 (or UTF-16) according to the XML specification.
In an XSL transformation, three trees are built in different processing stages: the first from the source XML document, the second from the XSL document driving the transformation, and the last from the output document, which in this case is the HTML document. Hence, as you can imagine, XSLT processing is a resource-intensive operation, because each tree is represented as a group of structural nodes and occupies a lot of memory while it is alive. In other words, if you want to do server-side XSLT processing on your Web server, the documents used for XSL transformations should be kept as small as possible. One workable approach for heavy-traffic sites is to generate the HTML files in advance from the XML source at the command line. Additionally, since Web browsers are getting more and more powerful, XSL transformation has been implemented in major browsers such as MSIE 5.0 and NN 6.0; that is, NN or MSIE can download the XML/XSL documents and perform the transformation needed to render the final HTML output. But this consumes bandwidth and puts the load onto the clients. In sum, the best solution depends upon your policy.
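For instance, pre-generating static pages could be as simple as looping over XML/XSL pairs with the same 4XSLT calls used throughout this HOWTO; a minimal sketch (the file names below are illustrative assumptions):

from xml.xslt.Processor import Processor

pairs = [("demo.xml", "demo.xsl", "demo.html")]
for xmlName, xslName, htmlName in pairs:
    processor = Processor()
    processor.appendStylesheetString(open(xslName, "r").read())
    result = processor.runString(open(xmlName, "r").read())
    open(htmlName, "w").write(result)    # the Web server now serves a static file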
Building on the previous discussions, you can now imagine what must be done in a program when CJKV text encounters XML. Before we look into CJKV processing with 4XSLT, let us first transform the previous demo.xml/demo.xsl pair again, this time from Python code rather than from the command line. List 4-1 is the Python code for the transformation, named demo.py.
List 4-1 demo.py

from xml.sax.drivers import drv_xmlproc
from xml.xslt.Processor import Processor

sheetfile = open("demo.xsl", "r")
sourcefile = open("demo.xml", "r")

SAXparser = drv_xmlproc.SAX_XPParser()
SAXparser.parseFile(sourcefile)
sourcefile.seek(0)                       # move the cursor back to the head

sheet = sheetfile.read()
source = sourcefile.read()

processor = Processor()                  # 4XSLT API
processor.appendStylesheetString(sheet)  # 4XSLT API
result = processor.runString(source)     # 4XSLT API
print result
After running demo.py, you can see it outputs the following HTML code:
<HTML>
  <HEAD>
    <META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=UTF-8'>
    <TITLE>Hello! XML!</TITLE>
  </HEAD>
  <BODY>
    <H1>Hello! XML!</H1>
  </BODY>
</HTML>
Before we run the XSL transformation, we use the xmlproc driver to parse demo.xml and check its well-formedness. Normally, for performance, you shouldn't repeat XML parsing in real-world applications: the XML documents and their related XSL stylesheets should be validated before any use. In other words, any XML/XSL documents you feed to 4XSLT are assumed to be well-formed and valid. Of course, if your XML application needs input from other XML applications for further XSL transformations, you may need to take steps to make sure those inputs are well-formed and valid; in such cases your Python code would involve the xmlproc/pyexpat drivers. Note that when xmlproc parses demo.xml, it reports that the UTF-8 encoding is unsupported.
In general, you would save the HTML output into a persistent file for display in Web browsers, make it another application's input, or direct it to the Web server if demo.py is a CGI script.
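In the CGI case, a minimal sketch would emit an HTTP header before the transformed result (the header line is standard CGI, not a 4XSLT feature):

print "Content-Type: text/html; charset=UTF-8"
print                # a blank line ends the HTTP headers
print result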
With such an understanding of programming 4XSLT, it's time to examine how to program 4XSLT when CJKV text is involved. List 4-2 is an XML document containing Big Five characters, named book.xml.
List 4-2 Book.xml in Big5 Encoding
<?xml version="1.0"?> <?xml-stylesheet href="book.html.xsl" type="text/xsl"?> <!DOCTYPE Book [ <!ELEMENT Book (Title, Chapter+)> <!ATTLIST Book Author CDATA #REQUIRED> <!ELEMENT Title (#PCDATA)> <!ELEMENT Chapter (Name, Contents)*> <!ATTLIST Chapter id CDATA #REQUIRED> <!ELEMENT Name (#PCDATA)> <!ELEMENT Contents (#PCDATA)> ]> <Book Author="³¯«Ø¾±"> <Title>¤p¾ô¬y¤ô¤H®a</Title> <Chapter id="1"> <Name>¤p¾ô</Name> <Contents> ¾Ú»¡¥H«eªº¿W¤ì¾ô¬Oµ¹¦Ï¨«ªº, «á¨Ó¦]¬°¶Â¦Ï©M¥Õ¦Ï¤&pount;¦A§n¬[¤§«á, ´N±`±`§Q¥Î¿W¤ì¾ô¦X§@¨«¨p¦Ï¥¤, ¤HÃþ¤~¨M©w§â¿W¤ì¾ô¦¬¦^¨Ó¦Û¤v¨«. </Contents> </Chapter> <Chapter id="2"> <Name>¬y¤ô</Name> <Contents> ¾Ú»¡¥H«eªº¤ô¬O¤&pount;·|¬y°Êªº, ¦j¤÷°l¤é´N¬O¦]¬°³Ü¤&pount;¨ì¬y°Êªº¤ô, ¤~·|´÷¦ºªº. ½L¥j¬°¤F¬ö©À¦j¤÷, ´N¦b¥L¨¤W¼»¤@ªw§¿, ©³¤Uªº¤ô¨ü¤&pount;¤F¯ä¨ý¤~¶}©l¥|³B°k«, µ²ªG¤ô´NÅܬy¤ô¤F. </Contents> </Chapter> <Chapter id="3"> <Name>¤H®a</Name> <Contents> ¾Ú»¡®J¤Îªº¶H§Î¤å¦r¦³¤H®a³o¨âÓ¦r, ð©ú¬Ó¹C¤ë®c®É, ¹ß®Z§i¶D¥L»¡, ¨º¬O¦o¶¢µÛµL²á, ¤U¤Z¥h±Ð®J¤Î¤H¼gªº. ¦Ó¥BÁÙ¦³¶Ç»D, ¥j¤Òª÷¦r¶ð©³¤U, ¦³¤@°Æ´Ã§÷¥i¥Hª½±µ³q©¹¤ë²yªº¥~¬P¤H°ò¦a. ¥un½ö¶i¥h, °á©G»y: ªÛ³Â¶}ªù, ³o¼Ë´N¦æ¤F. </Contents> </Chapter> </Book>
List 4-3 is the XSL document used for book.xml.
List 4-3 Book.html.xsl
<?xml version="1.0"?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" > <xsl:template match="Book"> <HTML> <HEAD> <TITLE><xsl:value-of select="Title"/></TITLE> </HEAD> <BODY> <H1>Author:<xsl:value-of select="@Author"/></H1> <H1>Book Name:<xsl:value-of select="Title"/></H1> <UL> <xsl:apply-templates select="Chapter"/> </UL> </BODY> </HTML> </xsl:template> <xsl:template match="Title"> <xsl:value-of select="Title"/> </xsl:template> <xsl:template match="Chapter"> <LI><H2>Chapter <xsl:value-of select="@id"/>:<xsl:value-of select="Name"/></H2></LI> <H3><xsl:value-of select="Contents"/></H3> </xsl:template> </xsl:stylesheet>
What we care about here is the code conversion before and after processing with the 4XSLT processor. The Python code used for this task is shown in List 4-4, named big5HTMLGenerator.py.
List 4-4 Big5HTMLGenerator.py
from xml.sax.drivers import drv_xmlproc_val
from xml.xslt.Processor import Processor
from encodings import big5_tw
import string

oldMeta = "<META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=UTF-8'>"
newMeta = "<META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=BIG5'>"

sheetFile = open("book.html.xsl", "r")
sourceFile = open("book.xml", "r")

SAXparser = drv_xmlproc_val.SAX_XPValParser()
SAXparser.parseFile(sourceFile)
sourceFile.seek(0)

sheet = sheetFile.read()
source = sourceFile.read()

# transcode to UTF-8 before feeding in the 4XSLT processor
uniObj = unicode(source, "big5_tw")
utf8Source = uniObj.encode("utf-8")

processor = Processor()
processor.appendStylesheetString(sheet)
result = processor.runString(utf8Source)

# convert the result back to Big Five
uniSource = unicode(result, "utf-8")
big5Source = uniSource.encode("big5_tw")

# we don't want the UTF-8 charset; we want the Big5 charset
big5Source = string.replace(big5Source, oldMeta, newMeta)
print big5Source
When executed, big5HTMLGenerator.py outputs:
<HTML>
  <HEAD>
    <META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=BIG5'>
    <TITLE>小橋流水人家</TITLE>
  </HEAD>
  <BODY>
    <H1>Author:陳建勳</H1>
    <H1>Book Name:小橋流水人家</H1>
    <UL>
      <LI><H2>Chapter 1:小橋</H2></LI>
      <H3>
        據說以前的獨木橋是給羊走的, 後來因為黑羊和白羊不再吵架之後,
        就常常利用獨木橋合作走私羊奶, 人類才決定把獨木橋收回來自己走.
      </H3>
      <LI><H2>Chapter 2:流水</H2></LI>
      <H3>
        據說以前的水是不會流動的, 夸父追日就是因為喝不到流動的水, 才會渴死的.
        盤古為了紀念夸父, 就在他身上撒一泡尿, 底下的水受不了臭味才開始四處逃竄,
        結果水就變流水了.
      </H3>
      <LI><H2>Chapter 3:人家</H2></LI>
      <H3>
        據說埃及的象形文字有人家這兩個字, 唐明皇遊月宮時, 嫦娥告訴他說,
        那是她閒著無聊, 下凡去教埃及人寫的. 而且還有傳聞, 古夫金字塔底下,
        有一副棺材可以直接通往月球的外星人基地. 只要躺進去, 唸咒語: 芝麻開門,
        這樣就行了.
      </H3>
    </UL>
  </BODY>
</HTML>
The HTML output is exactly what we want here. Note, however, that the following line carries a potential risk:
big5Source = string.replace(big5Source, oldMeta, newMeta)
If oldMeta occurs more than once in big5Source, every occurrence in the HTML document will be replaced with newMeta. Since we know that book.xml produces no other <META> elements, we can safely get by with this crude but simple approach for the time being.
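A slightly safer variant is to limit the substitution to the first occurrence, which string.replace supports through its optional maxreplace argument:

big5Source = string.replace(big5Source, oldMeta, newMeta, 1)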
When you view the HTML output in your Web browser, it should render the book title, author, and chapters in Traditional Chinese.
Finally, let's use Shift-JIS once more for an illustration to wrap up this HOWTO. Assume that we want to show the fifty letters of Japanese Hiragana in an HTML table generated from the XML document named table.xml, shown in List 4-5.
List 4-5 Table.xml Including the Fifty Hiragana Letters
<?xml version="1.0"?> <?xml-stylesheet href="table.html.xsl" type="text/xsl"?> <table> <line> <ch>‚ </ch> <ch>‚¢</ch> <ch>‚¤</ch> <ch>‚¦</ch> <ch>‚¨</ch> </line> <line> <ch>‚©</ch> <ch>‚«</ch> <ch>‚</ch> <ch>‚¯</ch> <ch>‚±</ch> </line> <line> <ch>‚³</ch> <ch>‚µ</ch> <ch>‚·</ch> <ch>‚¹</ch> <ch>‚»</ch> </line> <line> <ch>‚½</ch> <ch>‚¿</ch> <ch>‚Â</ch> <ch>‚Ä</ch> <ch>‚Æ</ch> </line> <line> <ch>‚È</ch> <ch>‚É</ch> <ch>‚Ê</ch> <ch>‚Ë</ch> <ch>‚Ì</ch> </line> <line> <ch>‚Í</ch> <ch>‚Ð</ch> <ch>‚Ó</ch> <ch>‚Ö</ch> <ch>‚Ù</ch> </line> <line> <ch>‚Ü</ch> <ch>‚Ý</ch> <ch>‚Þ</ch> <ch>‚ß</ch> <ch>‚à</ch> </line> <line> <ch>‚â</ch> <ch>‚ä</ch> <ch>‚æ</ch> <ch></ch> <ch></ch> </line> <line> <ch>‚ç</ch> <ch>‚è</ch> <ch>‚é</ch> <ch>‚ê</ch> <ch>‚ë</ch> </line> <line> <ch>‚í</ch> <ch>‚ð</ch> <ch>‚ñ</ch> <ch></ch> <ch></ch> </line> </table>
List 4-6 shows the XSL document used, which we call table.html.xsl.
List 4-6 Table.html.xsl for Table.xml
<?xml version="1.0"?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" > <xsl:template match="table"> <HTML> <HEAD> <TITLE>50 Letters of Japanese Hiragana</TITLE> </HEAD> <BODY> <TABLE BORDER="1"> <xsl:apply-templates select="line"/> </TABLE> </BODY> </HTML> </xsl:template> <xsl:template match="line"> <TR><xsl:apply-templates select="ch"/></TR> </xsl:template> <xsl:template match="ch"> <TD><xsl:value-of select="."/></TD> </xsl:template> </xsl:stylesheet>
Similarly, we use a Python script to perform the XSL transformation, listed in List 4-7 and named sjisHTMLGenerator.py.
List 4-7 SjisHTMLGenerator.py
from xml.sax.drivers import drv_xmlproc
from xml.xslt.Processor import Processor
from encodings import shift_jis
import string

oldMeta = "<META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=UTF-8'>"
newMeta = "<META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=X-SJIS'>"

sheetFile = open("table.html.xsl", "r")
sourceFile = open("table.xml", "r")

SAXparser = drv_xmlproc.SAX_XPParser()
SAXparser.parseFile(sourceFile)
sourceFile.seek(0)

sheet = sheetFile.read()
source = sourceFile.read()

# transcode to UTF-8 before feeding in the 4XSLT processor
uniObj = unicode(source, "shift_jis")
utf8Source = uniObj.encode("utf-8")

processor = Processor()
processor.appendStylesheetString(sheet)
result = processor.runString(utf8Source)

# convert the result back to Shift-JIS
uniSource = unicode(result, "utf-8")
sjisSource = uniSource.encode("shift_jis")

# we don't want the UTF-8 charset; we want the Shift_JIS charset
sjisSource = string.replace(sjisSource, oldMeta, newMeta)
print sjisSource
Comparing sjisHTMLGenerator.py with big5HTMLGenerator.py, you can see that the key distinction between them lies in the native encoding. After running, sjisHTMLGenerator.py outputs:
<HTML>
  <HEAD>
    <META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=X-SJIS'>
    <TITLE>50 Letters of Japanese Hiragana</TITLE>
  </HEAD>
  <BODY>
    <TABLE BORDER='1'>
      <TR><TD>あ</TD><TD>い</TD><TD>う</TD><TD>え</TD><TD>お</TD></TR>
      <TR><TD>か</TD><TD>き</TD><TD>く</TD><TD>け</TD><TD>こ</TD></TR>
      <TR><TD>さ</TD><TD>し</TD><TD>す</TD><TD>せ</TD><TD>そ</TD></TR>
      <TR><TD>た</TD><TD>ち</TD><TD>つ</TD><TD>て</TD><TD>と</TD></TR>
      <TR><TD>な</TD><TD>に</TD><TD>ぬ</TD><TD>ね</TD><TD>の</TD></TR>
      <TR><TD>は</TD><TD>ひ</TD><TD>ふ</TD><TD>へ</TD><TD>ほ</TD></TR>
      <TR><TD>ま</TD><TD>み</TD><TD>む</TD><TD>め</TD><TD>も</TD></TR>
      <TR><TD>や</TD><TD>ゆ</TD><TD>よ</TD><TD></TD><TD></TD></TR>
      <TR><TD>ら</TD><TD>り</TD><TD>る</TD><TD>れ</TD><TD>ろ</TD></TR>
      <TR><TD>わ</TD><TD>を</TD><TD>ん</TD><TD></TD><TD></TD></TR>
    </TABLE>
  </BODY>
</HTML>
When you view the HTML output in your Web browser, it should render as a bordered ten-row table of Hiragana.
This draft HOWTO has roughly discussed issues related to CJKV information processing for those who need to deal with XML/XSL in Python. The newest version of this document can be found in the latest ChineseCodecs package. If you have any suggestions about this HOWTO, please email the author, Chen Chien-Hsun.