November 12, 2010

Better expat living through technology

I don't discuss expat life in Thailand much on this blog, probably because there are others who discuss it more insightfully and entertainingly, like my pal Greg (of Bangkok Podcast and Greg to Differ fame).

Lately, though, I've been thinking about the subject. Even when you speak the language, as I do, and get used to (most of) the quirks of your adoptive culture, life in a foreign country can still be difficult.

I've always liked technology, but I've never been an early adopter, nor had the budget for many gadgets. I bought my first iPod in 2007, and my first smart phone, an iPhone 4, not even two months ago. I also bought a Kindle 3 around the same time.

These new additions to my growing gadget collection made me stop and consider how technology has really improved my quality of life as an expat in small but remarkable ways.

Here are a few of the ways that technology improves my life in the big, foreign city:

The daily grind: I have a 30-40 minute commute in the morning, and often longer at night. The ease of the iPod for loading podcasts and audiobooks really makes that time fly. Before I owned an iPod I used to download mp3 files for podcasts and burn them to disc to play in my car stereo. I know, right?

Living outside the US also means, if the entertainment cartels have their way at least, missing out on excellent services like Pandora, Rdio, and Last.fm Radio (that last one you can get here, but it's not free like in the US). Thanks to VPNs, I can tunnel through to use these services when I really want to. I even use a VPN service on my iPhone, so I can use Pandora or Rdio on there. But it's still kind of a hassle, and the VPN connection cuts out sometimes. [Dear Thai government: I was just joking about using VPNs. Everybody knows those are illegal here and of course I would never, ever actually use one. *twitch*]

Literally two days ago, though, I found another service that solves my longstanding problem of having a huge music library but no desire to take the time to divide it into playlists or swap music in and out of my iPod/iPhone. The service is Audiogalaxy, and what it does is simple: it streams your music from your computer to your iOS (or Android) device. As of yesterday, I now have my full music library (100GB+) at my fingertips anywhere in Bangkok. This blows my mind. It's a game changer for me, and pretty much the definition of a killer app.

Social life: Sometimes I feel like I'm simultaneously a misanthrope and a social butterfly. Despite being married with two kids, I enjoy my alone time. But I do miss hanging out with friends from back home. Bangkok is my wife's hometown, so her high school, college, work friends -- they're mostly all still in Bangkok. It's an enviable situation that even people in the US don't really enjoy. One of the perks of a one-big-city country. Bangkok is the center of the Thai universe, and a sort of black hole that sucks everyone to it, to boot.

How does this relate to technology? One word: Twitter. Before Twitter I was much more of a solitary expat. I had friends but I didn't see or really even communicate with them that much. I've never lived in downtown Bangkok, and I've never frequented the Bangkok social scene. Never really been my style. This has started to change ever-so-slightly, though. Thanks to Twitter I've met quite a lot of great people who I don't think I would've met otherwise, and some have become my good friends. (There are still a few positions available--just fill out this form and submit an 8x10 glossy headshot if you want to apply.)

Keeping in touch: Obviously email, blogging, Facebook and all of that help me keep in touch with my family and friends from back home. But those are old hat. It's Skype that has really changed the way I communicate. Nowadays I use Skype-In to rent a local US number in my hometown, and use a little USB connector to hook my Skype up to an actual phone in my house in Bangkok. Now I have a number that anyone in the US can use to call me at home, and when I'm not there Skype forwards the call to my cell phone. It's pretty incredible, and it costs me single-digit dollars per month.

Not only Skype, though, but just having a smart phone makes it easier to keep in regular touch. A few weeks ago my daughter was singing to herself and made up a cute song about her younger brother. I busted out my iPhone, opened the voice recorder app, and recorded it on the spot. From there, I edited out a few seconds on either side, and with another click or two I had emailed the clip to Grandma back in the US. It's not that I couldn't do any of this stuff before, but I just didn't because I'm lazy and it wasn't simple enough.

I'm also surprised by how much I dig video chat on my shiny new iPhone. My bestest friend since I was a dork-tastic 8-year-old recently got an iPhone 4, too, and while he and I would occasionally chat or call each other before, it's a totally different beast to be able to have a face-to-face conversation with him anywhere I go (that has wifi). It's been pretty great, and helps to quell the occasional uprising of mild homesickness. I hope Apple opens up the protocol to other devices. (In the meantime Tango offers cross-platform video chat, though.)

To sum up, none of this stuff I can do now is bleeding edge tech. It's just the convergence of many cool technologies that make life better, and make Bangkok seem a lot less far away from home.

November 6, 2010

Project Gutenberg Thailand: Some nitty gritty (and a sample Thai ebook)

[Read the PGT 2014 Update!]

In my previous post on Project Gutenberg Thailand, I avoided getting too much into the technical details. Here are more of my thoughts on the steps of the book digitization process.

Scanning:

I have many public domain Thai books in my personal collection, including works by the authors I mentioned in my previous post, as well as numerous literary works by Rama VI, books by Prince Damrong, and so forth.

Many older books have had recent printings, and so they are neither rare nor expensive. In such cases my preferred method for scanning is to deconstruct the binding of the book and scan each page using a flatbed scanner. This produces the best quality image, with no shadow or distortion.

This is not always possible, of course, if the book is old or rare, or doesn't belong to me. For such cases I use an OpticBook scanner, which is a special book scanner that allows you to scan one page at a time without destroying the book or breaking its spine. It looks like this, by the way:


In my free time I've scanned several public domain Thai books, and have many more ready and waiting to be scanned.

Photocopying books, say from the library, also works, but scanning is preferable because OCR software works best with grayscale images of 300 dpi resolution or higher. A photocopier is black and white, and ultimately you have to scan the photocopies into a computer anyway. The main benefit of photocopying is that Thai libraries offer cheap photocopy services, so you can have a pro do the heavy lifting for you, so to speak.

In addition, there are excellent resources like the Thammasat Electronic Rare Books site, which has black and white PDF files of many old books. Not all of these are actually public domain, but sites like this provide another avenue for public domain source material that can be digitized as text.

OCR software:

To date the best OCR software for Thai is ABBYY FineReader Professional, either version 9 or 10. In fact, ABBYY is the only one that's any good. I have used it extensively.

A program called ArnThai was released a few years ago by Thailand's National Electronics and Computer Technology Center (NECTEC), but unfortunately ArnThai is rather terrible. Not only is the quality of the OCR poor, but it also has no batch processing of any kind, supports limited input and output filetypes, and has no mechanism for training characters at all. ABBYY FineReader, while not perfect, has sophisticated tools for all of these things.

OCR accuracy for Thai is well above 90%, but it still has problems. ABBYY has trouble with older typefaces, for instance. And even on newer books it still has some trouble accurately detecting all superscript and subscript characters, or differentiating very similar characters. This is par for the course. Since it has a training mechanism, though, a human can teach the software how to properly recognize difficult typefaces when needed, which greatly improves the quality.

Proofreading:

The basic mechanism for crowdsourced proofreading is to show the user the original page image side-by-side with the text output from the OCR software. (Or, if the text was manually typed, with that.) Here's an example taken from pgdp.net:


The user makes corrections and submits the page when it is completed. The process that pgdp.net uses is sometimes overly complicated. Every page is checked about a dozen times, each time focusing on different things (basic text accuracy, formatting, etc.). They have the luxury of plentiful volunteers.

To start with, at least, Project Gutenberg Thailand will not have this luxury. The method I propose is like this:

Each page is proofread two times, once each by two different human proofreaders. They make their corrections and submit the page. Ideally, the two versions would be identical, but there will of course be some errors. To identify the errors, the two versions are then compared programmatically, to find discrepancies between them. Any place where they differ can be assumed to be an error made by one proofreader or the other. The points of discrepancy are highlighted and shown to a third person, who can quickly correct only the highlighted points, thereby fixing mistakes made by the original two proofreaders. At that point the error rate on the page will be extremely low.
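
The programmatic comparison step might look something like the rough sketch below, which uses Python's standard difflib module. This is only an illustration under my own assumptions: the function names find_discrepancies and highlight, the bracket notation, and the sample strings are all made up for the example, and a real site would still need to decide how to treat whitespace and Thai combining characters.

```python
import difflib


def _opcodes(version_a, version_b):
    # autojunk=False so that frequently repeated characters are never
    # skipped as "junk" on long pages.
    matcher = difflib.SequenceMatcher(None, version_a, version_b, autojunk=False)
    return matcher.get_opcodes()


def find_discrepancies(version_a, version_b):
    """Return (a_start, a_end, b_start, b_end) character spans where the
    two proofread versions of a page differ."""
    return [
        (a1, a2, b1, b2)
        for tag, a1, a2, b1, b2 in _opcodes(version_a, version_b)
        if tag != "equal"
    ]


def highlight(version_a, version_b):
    """Build the text shown to the third person: agreed text is left
    alone, and each point of disagreement appears as [A's text|B's text]."""
    parts = []
    for tag, a1, a2, b1, b2 in _opcodes(version_a, version_b):
        if tag == "equal":
            parts.append(version_a[a1:a2])
        else:
            parts.append("[" + version_a[a1:a2] + "|" + version_b[b1:b2] + "]")
    return "".join(parts)


if __name__ == "__main__":
    # Hypothetical sample: proofreader B mistyped one letter.
    print(highlight("the cat sat", "the cot sat"))
```

If both proofreaders happen to make the identical mistake in the same spot, the comparison cannot see it, which is why the error rate ends up extremely low rather than zero.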

Distribution:

Once books are turned into digital text, they can be formatted into the various ebook formats for reading on computer, cell phone, or ebook reader, as well as being available on the website as text and HTML files. I believe public domain works should be distributed far and wide, of course, so I would be pleased to see any output of Project Gutenberg Thailand also posted to Thai Wikisource, as well as any other website that wished to host it.

As a test, I took the electronic text of the short book Letters from Jangwangram จดหมายจางวางหร่ำ, a 1905 epistolary novel, and created an EPUB file. I then transferred it onto my iPhone and loaded it into Stanza, a free ebook reading app. I'm happy to say it is quite readable, even though the lack of word breaks in Thai creates some odd spacing. I also loaded it onto my Kindle 3, and though readable, the default font was not ideal, and there were similar spacing issues.

You can download this EPUB on your own devices if you'd like to test it out:

จดหมายจางวางหร่ำ
โดย น.ม.ส.
Letters from Jangwangram
by N.M.S. (pen name of Prince Bidyalankarana)
jotmai-jangwangram.epub

(Note: This ebook is not in Unicode. It uses HTML entities because some ebook formats are not Unicode compatible yet. I'm still learning the particulars of the several popular ebook formats.)
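
For the curious, here is a minimal sketch of one way to turn Thai text into HTML numeric entities using only the Python standard library. It is not necessarily how the sample file above was produced; the function names are invented for this illustration.

```python
import html


def to_numeric_entities(text):
    """Replace every non-ASCII character with a decimal numeric
    character reference, leaving plain ASCII untouched."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def from_numeric_entities(text):
    """Reverse the conversion. Note that html.unescape also decodes
    named entities such as &amp;."""
    return html.unescape(text)


if __name__ == "__main__":
    title = "จดหมายจางวางหร่ำ"
    encoded = to_numeric_entities(title)
    print(encoded)
    print(from_numeric_entities(encoded) == title)
```

The encoded output is pure ASCII, so it survives in readers that mishandle Unicode, at the cost of a much larger file.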

Project Gutenberg Thailand: Liberating public domain Thai literature

[Read the PGT 2014 Update!]

For a few years now it's been one of my goals--it's been so long I should probably say 'dreams'--to start Project Gutenberg Thailand, a repository for public domain literature in Thai and about Thailand. The founder of the original Project Gutenberg, Michael Hart, encourages such spin-off sites, and was enthusiastic when I contacted him about the idea back in 2007.

The closest thing that currently exists is Thai Wikisource, but it has little in the way of modern literature, instead having mostly selections of classical Thai verse and public domain government documents. As far as I know, there is no existing movement to identify and disseminate more recent public domain Thai works.

For a number of reasons, however, not the least of which is my own lack of sufficient free time, nothing has ever gotten off the ground.

Last year, frustrated with my inability to make any headway on this project, I began compiling a list of Thai authors whose works are in the public domain. To put it simply, under Thai law a book is copyrighted until 50 years after the author's death.

With such a relatively short copyright term, the works of many well-known 20th century authors have entered the public domain. Unfortunately it too often seems to be those authors who died young. Yakob ยาขอบ, author of the immortal Conqueror of the Ten Directions ผู้ชนะสิบทิศ, died in 1956 at age 48. And two early novelists born in 1905 failed to reach middle age -- Prince Akartdamkoeng มจ.เจ้าอากาศดำเกิง penned such well-remembered tomes as The Circus of Life ละครแห่งชีวิต before killing himself at 25; while Mai Mueangdoem ไม้ เมืองเดิม, of The Old Wound แผลเก่า fame, met his fate at 37. Such authors are no less significant for modern Thai culture than the Fitzgeralds and Steinbecks of American culture, and while they can still be found in print, it is a shame that such works aren't yet available as free ebooks.
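
The arithmetic behind the list of public domain authors is simple enough to sketch in a few lines of Python. This is the simplified rule as I stated it above (50 years after the author's death), not legal advice: it ignores the exact day the term ends and any special cases, and the function names are just for illustration.

```python
COPYRIGHT_TERM_YEARS = 50  # simplified term: 50 years after the author's death


def copyright_expiry_year(death_year, term=COPYRIGHT_TERM_YEARS):
    """Year in which the simplified copyright term runs out."""
    return death_year + term


def is_public_domain(death_year, current_year, term=COPYRIGHT_TERM_YEARS):
    """Conservative check: treat a work as public domain only once the
    expiry year has fully passed."""
    return current_year > copyright_expiry_year(death_year, term)


if __name__ == "__main__":
    # Yakob died in 1956, so under this rule the term ran out in 2006.
    print(copyright_expiry_year(1956))
    print(is_public_domain(1956, 2010))
```

Waiting until the expiry year has fully passed is a deliberately cautious assumption, since this sketch does not model exactly when within the year the term ends.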

I hear you asking, "So what needs to be done to set these works free?" (I have excellent hearing.) "And how can I help?"

Well, we need to put in place the process for taking the paper books and turning them into electronic text. The basic steps are as follows:

1. Get the book and scan each page, to create images.
2a. Use OCR [optical character recognition] software to turn the images into digital text.
OR
2b. Have humans type out the text contained in the images.
3. Have humans proofread the digital text by comparing it side-by-side with the original image.

This process is well-established for English. Project Gutenberg has a sister website called Distributed Proofreaders (pgdp.net), which crowdsources this work for books in Latin-alphabet languages. OCR for English is extremely accurate. Until 2009, OCR for Thai was miserably poor, but nowadays it's rather good. In other words, the time is finally right to start digitizing books for Project Gutenberg Thailand.

With the original Project Gutenberg, anyone can sign up to help via the Distributed Proofreaders site. For Project Gutenberg Thailand, we must build a similar community of people willing to contribute a little bit of time here and there to help liberate public domain Thai books from their paper prisons.

So here I am, writing this blog post, hoping to drum up help to get the ball rolling.

The most immediate need is a sympathetic soul with web programming chops to help create the website for crowdsourced Thai proofreading and/or typing. I've put a fair amount of thought into the core features needed, but I lack the programming and design skills needed to make it a reality.

If you are interested in helping with this effort, come join the Google Group, let me know on Twitter at @thai101, or email me at rdockum [at] gmail [dot] com.