Wednesday 13 April 2011

To ASCII or not to ASCII

So, you've got your locations all beautifully geoparsed, but how is anyone going to search for Liévin, let alone Łódź or Uherské Hradiště?

In the old days of the web, this wasn't a problem. They couldn't. At least, not without a huge amount of learning and expense on everyone's part. In the English-speaking world, text on the web meant ASCII. This restricted you, basically, to the characters on a standard US keyboard: 128 characters, roughly a quarter of which were control codes like 'tab' and 'line feed'. If you wanted your readers to read 'é', you sent them something called an entity (&eacute; in this case). If you wanted your readers to read 'Ł', you might even have to send them a little image of the letter.

This was stupid, obviously. And, thank god, we now have browsers, databases, and programming languages that speak a different text 'language': Unicode, usually encoded as UTF-8. This gives us room for over a million characters, and means that if you want people to read 'š', or even 'אַ', you can send them plain text and not have to worry too much about operating systems, browser support, or reproducing 2000 .gif files whenever you change your site design (been there, done that, please don't judge if you haven't supported Internet Explorer 4).

But this only solves half of the problem: reading, but not writing. If you have to type 'Ł', and you're using a standard US/UK keyboard, we haven't come very far. The existing methods are clumsy, slow, or demand a lot of memorisation (or all three). It's all very well expecting admins to take a bit of extra time to enter the correct name - although a little bit of AJAX and GeoNames means that they don't have to. But, as this rather forthright blog points out, is it realistic to expect this from our users?

To give you an example of how this applies to the Pleiades data, take the case of Mérida, in Spain. Mérida (Roman Emerita Augusta, Pleiades 256155, GeoNames 2513917) is quite an important location for the APGRD (Archive of Performances of Greek and Roman Drama), as its wonderfully preserved Roman theatre is the venue for many new versions of classical plays - last year's Lysistrata, for example (warning, very slightly NSFW!).

Because the APGRD has no time frame - we're interested in every single performance, from antiquity onwards - many of our users won't know that Mérida used to be Emerita Augusta, so I can't rely on them searching for that. However, a search on the Pleiades dataset for Merida (no accent) returns no results because for the computer, 'Merida' and 'Mérida' are two different values.
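
To make the mismatch concrete, here is a minimal Python sketch (my own illustration, not anything Pleiades itself runs). It uses only the standard `unicodedata` module, and the variable names are made up for the example. It also shows a related trap: the same accented name can be stored in two different Unicode forms that don't compare equal either.

```python
import unicodedata

plain = "Merida"
accented = "M\u00e9rida"  # 'é' as a single precomposed character (U+00E9)

# The same visible text, stored as 'e' followed by a combining acute accent (U+0301)
decomposed = unicodedata.normalize("NFD", accented)

# To the computer these are three different strings
print(plain == accented)       # False
print(accented == decomposed)  # False

# Normalising both sides to one form (NFC here) makes the accented pair match again
print(unicodedata.normalize("NFC", decomposed) == accented)  # True
```

So even before thinking about ASCII, it is probably worth normalising both the stored names and the search string to the same Unicode form.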

I imagine those of a non-English-speaking background will be tempted to say "so what, we've been typing non-ASCII characters all our lives - you do know our keyboards look different, don't you?". Likewise, those who've made the effort to learn Beta Code or Unicode combining characters to get the best out of TLG and Perseus may be less than forgiving.

However, I must admit that my own sympathies lean somewhat toward the author of that post. My own experience, and what I've learned from geoparsing our database's locations (entered over 10 years by a variety of people) is that:
  1. people don't necessarily know that a location has non-ASCII characters, or what the right ones are (pop quiz, no googling: how do you (properly) spell Liege?). They'll get no results and not realise that they've entered the wrong search string.
  2. a substantial proportion of people, even academics, have no idea how to enter non-ASCII characters - and even those who can are generally only good at doing it in languages or character sets with which they are familiar.
  3. of these, how many are going to cut-and-paste text from elsewhere (which does not always give you the characters you want)?
So, I'm sure of the problem, but not necessarily of the solution. It seems to me that the best way to do this might be to convert all search strings to ASCII before matching them against ASCII-ised locations (at least, for user search). But does this lead to a horrible loss of precision? Are there better ways?
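
For what it's worth, the 'ASCII-ise both sides' idea can be sketched in a few lines of Python. This is only a rough sketch under my own assumptions: `ascii_fold`, `build_index`, `search`, `EXTRA_FOLDS` and the `PLACES` list are all names invented for the example, and the folding table is hand-picked and far from complete. It decomposes each character, drops the combining accents, and handles letters like 'Ł' (which have no decomposition) with an explicit mapping. The index keeps the original Unicode names, so results can still be displayed properly.

```python
import unicodedata

# Letters that have no Unicode decomposition, so they need an explicit mapping.
# Hand-picked for illustration only; a real table would be much longer.
EXTRA_FOLDS = str.maketrans({
    "Ł": "L", "ł": "l", "Ø": "O", "ø": "o",
    "Đ": "D", "đ": "d", "Æ": "AE", "æ": "ae", "ß": "ss",
})


def ascii_fold(text):
    """Return a lower-cased, accent-stripped search key for text."""
    text = text.translate(EXTRA_FOLDS)
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (accents) left behind by the decomposition
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def build_index(names):
    """Map each folded key to every original name that folds to it."""
    index = {}
    for name in names:
        index.setdefault(ascii_fold(name), []).append(name)
    return index


def search(index, query):
    """Fold the query the same way as the index, then look it up."""
    return index.get(ascii_fold(query), [])


PLACES = ["Mérida", "Liévin", "Łódź", "Uherské Hradiště", "Liège"]
INDEX = build_index(PLACES)

print(search(INDEX, "merida"))
print(search(INDEX, "Lodz"))
print(search(INDEX, "Uherske Hradiste"))
```

Because each folded key maps to a list, two distinct names that collapse to the same key would both come back, which is exactly where the loss of precision shows up. Characters outside the table that don't decompose are passed through unchanged, so the key is not guaranteed to be pure ASCII.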

And I have it easy - all my locations are at least in the Latin alphabet!

1 comment:

  1. You could have some symbol buttons at the side of the input search box, or a button to display symbols to select.
    Something like http://www.greywyvern.com/code/javascript/keyboard (click the keyboard symbol) or a smaller/simpler http://keith-wood.name/keypad.html

    If you manage to sort something out I think it would still be good to do an ASCII-ised locations search, but still display the Unicode result (those results could be a 'did you mean...' for users who can only manage ASCII letters). Although this might make searching more computationally expensive.
