Game Software Localization

2004-01-20

I have localized game software from both sides: as the person writing the original code and planning for localization, and as the person taking a finished product from another team or company and trying to localize it.

It is surprising to me just how poorly some code is written in this area. I suppose it's normal that many people just don't think it through or don't have any incentive to try. But in this day and age the major markets, America, Japan and Europe, generally account for 1/3 each of all your sales, and now that Korea, China and Taiwan are coming on board it's even more important to take localization seriously. That includes preparing in advance and trying to think through all the issues to make it as painless as possible.

For example, storing strings directly in the code is BAD! How the heck are you going to find them in order to update them for different languages? Another issue can be the way you code to present things. Read this article for the nightmare that languages can be. It will help you realize some of the issues of hard coding the wrong kinds of things and the types of messages to avoid.

Design your data to make it HARD to mess up. I've seen code like this:

struct GameDialog
{
    int type;
    int numLines;
    const char* msg;   // numLines strings packed together, each ending in '\0'
};

GameDialog theDialog[] =
{
    { MSG_REGULAR, 3,
      "Welcome to \"Megaland\"\0"
      "I will be your host,\0"
      "Joe Bob.\0"
    }
};

To localize the code, the company that wrote this sent the source code out to the translators and asked them to fill in the strings for about 30000 lines of dialog for an RPG. Let's count the ways this is bad:

  1. The strings are in the code.

    In code, punctuation is important. Translators are not coders. They have no idea that all those quotes, brackets, commas, etc. are important.

  2. The number of lines per message is hard coded.

    Again, the translators are not going to know that they need to change that "3" to a "4" if they add a 4th line. Even if you tell them they are bound to make mistakes.

  3. The lines are hand word wrapped.

    You may tell the translator your game can only display 20 characters per line but asking them to hand count characters is asking for errors.

  4. The lines are separated with '\0' (NUL).

    The code that used this data looked at the number of lines, 3, and then searched for the strings by finding the NUL characters inserted by '\0', and of course the compiler adds an extra one at the end of all of it. If just one '\0' is missing the code will most likely crash since the last string will be some random data. If the translator accidentally deletes a '\', the '\0' becomes just a 0 like the zeros in "Beverly Hills 90210" and again the code will either crash or mess up.

  5. The quotes around "Megaland" have to be escaped.

    If they aren't, the code will not compile. Are your translators going to remember this? This can be compounded by the fact that the translators might use true “quotes”, which with a \ in front of them are not going to become what you expect.

  6. If you are going to translate to any multi-byte character set (e.g., Japanese, Korean, Chinese, ...) most compilers will have trouble with the strings. The reasons are varied, but a simple example is that the compiler, reading the strings one byte at a time, may find a byte that looks like a \ or a " inside a multi-byte character and interpret it as an escape or a quote when it should not. In Shift-JIS, for instance, the second byte of some characters is 0x5C, the backslash.

There are many solutions. For example, for the number of lines problem and the word wrap problem you could do word wrapping on the fly at runtime. Maybe you think that will be too slow. Fine, write a tool that takes the data from the translators and word wraps it at compile time. Don't force the translators to word wrap.
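
Here is a minimal sketch of that kind of word wrap, usable at runtime or inside a build tool. The function name wrapText is made up for illustration, and it assumes a fixed-width font with one byte per character; a real game would measure glyph widths and decode multi-byte text first.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Greedy word wrap: pack as many space-separated words as fit on each line.
// Assumption: every character is one byte and one column wide.
std::vector<std::string> wrapText(const std::string& text, std::size_t maxWidth)
{
    assert(maxWidth > 0);
    std::vector<std::string> lines;
    std::istringstream words(text);
    std::string word;
    std::string line;
    while (words >> word)
    {
        if (line.empty())
        {
            line = word;  // the first word always goes in, even if it is too long
        }
        else if (line.size() + 1 + word.size() <= maxWidth)
        {
            line += ' ';
            line += word;
        }
        else
        {
            lines.push_back(line);  // current line is full, start a new one
            line = word;
        }
    }
    if (!line.empty())
    {
        lines.push_back(line);
    }
    return lines;
}
```

With this in place the line count comes from lines.size(), so nobody has to keep a hand-typed "3" in sync with the text.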

As for the others, if you take the tool route your tool can do more than word wrap and count lines. If your tool writes the data out in some kind of binary format, the quotes and punctuation in the data are unlikely to be an issue. If you write to an intermediate format like C source, your tool could escape the quotes and print the correct punctuation. If the characters are multi-byte it could escape the strings to "\x??\x??" format; that way you don't have to worry about the compiler choking on the multi-byte characters.
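
As a rough sketch of the "write C source" route, the function below turns raw bytes into a quoted string literal that contains only plain ASCII. The name toCStringLiteral is hypothetical. One detail worth knowing: a \x escape swallows every hex digit that follows it, so the sketch closes and reopens the literal whenever an escaped byte is followed by a printable hex digit (adjacent string literals get joined by the compiler).

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Turn raw bytes (for example Shift-JIS or UTF-8 text) into a quoted C string
// literal made of plain ASCII only.
std::string toCStringLiteral(const std::string& bytes)
{
    std::string out = "\"";
    bool lastWasHexEscape = false;
    for (unsigned char c : bytes)
    {
        if (c == '"' || c == '\\')
        {
            out += '\\';  // escape quotes and backslashes
            out += static_cast<char>(c);
            lastWasHexEscape = false;
        }
        else if (c >= 0x20 && c < 0x7F)
        {
            // Split the literal so the previous \x escape cannot absorb
            // this character as an extra hex digit.
            if (lastWasHexEscape && std::isxdigit(c))
            {
                out += "\"\"";
            }
            out += static_cast<char>(c);
            lastWasHexEscape = false;
        }
        else
        {
            char buf[5];  // backslash, 'x', two hex digits, terminator
            std::snprintf(buf, sizeof buf, "\\x%02X", static_cast<unsigned>(c));
            out += buf;
            lastWasHexEscape = true;
        }
    }
    out += '"';
    return out;
}
```

The translators never see any of this. They type normal text and normal quotes, and the tool produces something the compiler cannot misread.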

Your tools should even check if the text fits. Why make people manually check this? Especially if you are making an RPG or other game with tons of text, if you have some kind of display and you need to make sure that the text is a certain size or less, then put that check in your tools where it can check all 40000 paragraphs of dialog instead of asking your testers or localizers to check it all by hand. If parts of the text are variable, like a name, then have your tool find the longest name and test with that.
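
A sketch of that worst-case check might look like the following. The function names and the <player> tag are placeholders for illustration, and the limit is counted in bytes here, which only matches characters for single-byte text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Replace every occurrence of tag in text with value.
std::string replaceAll(std::string text, const std::string& tag,
                       const std::string& value)
{
    assert(!tag.empty());
    for (std::size_t pos = text.find(tag); pos != std::string::npos;
         pos = text.find(tag, pos + value.size()))
    {
        text.replace(pos, tag.size(), value);
    }
    return text;
}

// Worst-case check: substitute the longest possible name, then compare the
// result against the display limit.
bool fitsWithLongestName(const std::string& message, const std::string& tag,
                         const std::vector<std::string>& names,
                         std::size_t maxChars)
{
    std::string longest;
    for (const std::string& name : names)
    {
        if (name.size() > longest.size())
        {
            longest = name;
        }
    }
    return replaceAll(message, tag, longest).size() <= maxChars;
}
```

Run over every message in every language, a check like this reports the handful of strings that overflow instead of leaving testers to stumble on them.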

On one project I gave text files to the translators and I wrote a program to take those text files, parse them and spit out a string table, one for each language. The text files looked something like this:

#
#
# Greeting:
# This is where the main character says hello to the player
#
# original English
#
# Hello <player> Are you having fun yet?
#
Greeting:
Hello <player> Are you having fun yet?

#
# Goodbye:
# This is where the main character says goodbye to the player
#
# original English
#
# Well, I really gotta go now. C ya!
#
Goodbye:
Well, I really gotta go now. C ya!

I made this file for English and then saved it out 7 more times, once each for Spanish, French, German, Dutch, Italian, Portuguese and Japanese. I then sent the files out with instructions to edit the appropriate lines and I wrote tools to take these text files and generate a string table from them, one for each language.
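
To give an idea of what such a tool does, here is a small parser sketch for a file in that style. The exact rules are assumptions made for this example rather than the rules of the original tool: lines starting with # are comments, a comment or blank line ends an entry, the first other line of an entry is "Label:", and every following line is one line of the message.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Parse "Label:" entries into a label -> text table.
std::map<std::string, std::string> parseStringFile(std::istream& in)
{
    std::map<std::string, std::string> table;
    std::string line;
    std::string label;  // empty while we are between entries
    while (std::getline(in, line))
    {
        if (!line.empty() && line.back() == '\r')
        {
            line.pop_back();  // tolerate Windows line endings
        }
        if (line.empty() || line[0] == '#')
        {
            label.clear();  // a comment or blank line ends the current entry
            continue;
        }
        if (label.empty())
        {
            if (line.size() < 2 || line.back() != ':')
            {
                throw std::runtime_error("expected 'Label:' but found: " + line);
            }
            label = line.substr(0, line.size() - 1);
            if (!table.emplace(label, "").second)
            {
                throw std::runtime_error("duplicate label: " + label);
            }
        }
        else
        {
            std::string& text = table[label];
            if (!text.empty())
            {
                text += '\n';  // keep the translator's own line breaks
            }
            text += line;
        }
    }
    return table;
}
```

Failing loudly on a duplicate or missing label is the point. A copy and paste slip in one of eight files is much cheaper to catch in the build than in the game.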

In the build process, at compile time, I would find the size of the largest string table and that size would get built into the code so that the code could allocate one buffer large enough for the largest string table. Then, at runtime, the user could pause the game at any time, switch languages and I could load the different string table into memory and know there would be enough memory for it. No need to free and/or allocate memory. This made it great for testing as well, since the testers could pop up the debug menu and right in the middle of a dialog/text message they could switch languages and double check things.
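
The build step for that can be tiny. The sketch below is one way to do it, with a made-up macro name (STRING_TABLE_MAX_BYTES) and made-up function names: find the biggest table and write its size into a generated header.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Build step: the buffer has to hold the biggest table of any language.
std::size_t largestTableSize(const std::vector<std::size_t>& tableSizes)
{
    assert(!tableSizes.empty());
    return *std::max_element(tableSizes.begin(), tableSizes.end());
}

// Build step: text of a generated header that bakes the size into the game.
std::string makeSizeHeader(std::size_t maxBytes)
{
    return "#define STRING_TABLE_MAX_BYTES " + std::to_string(maxBytes) + "\n";
}
```

The game would then include the generated header, declare a single array of STRING_TABLE_MAX_BYTES bytes, and read whichever language's table into that same array on every language switch.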

It went relatively smoothly. The only major drawback to this method was that text files can get messed up pretty easily. Some translators translated both the line they were supposed to change AND the original English which was there for reference. Not a big deal for them since it was just a copy and paste, but the whole point was to keep a reference to the original for anybody that needed to look at the translation. For example, a programmer that has no idea what "私は馬鹿なアメリカ人です" means needs the English reference to know where it goes in the game.

The other problem was that a few translators used a text editor that word wrapped lines automatically. This messed up any long messages since my tool assumed that if the translator split the line they wanted that line split in the game.

The way I've gotten around that since then is to use Excel. Excel stores all of its data in Unicode so handling languages like Korean, Chinese and Japanese is not a problem. It also fixes the issue of the original lines because you can just protect those cells so they can't be edited. On top of that the word wrapping issue disappears since you just allow one cell per message.

The only problem with Excel was figuring out how to get the data out. I've seen many teams that use Excel and write complicated Excel macros to extract the data. That doesn't seem like a good idea to me. One reason is you are probably going to send out 6-10 Excel files, one for each language, to different translators. You'd most likely have to put the macros in each of those files. If you changed something later you'd have to edit the macros once in each file. Another issue is that in this age of macro viruses many translators' machines will warn them about Excel macros and, freaking out, they will delete the macros. Plus, even if they don't delete the macros, you can never be 100% sure they didn't somehow go in and change something by accident, giving you one more thing to debug later.

The solution is to write a tool to extract the data from Excel. Fortunately Excel has extensive OLE support, which you can use from many languages. My preferred language for this sort of thing is Perl. As of version 5.8 Perl has great Unicode support and it just mostly works, no futzing around. Perl can also convert to other encodings with a single line of code, so if that's important for you, you can do that.

Here are the main lines to pull out a single cell.

#!/usr/bin/perl
use strict;
use warnings;

use Win32::OLE;
use Win32::OLE::Variant;

# set perl's OLE module to return Unicode
# if you don't do this perl will convert to the current locale
Win32::OLE->Option(CP => Win32::OLE::CP_UTF8());

# get the Excel Application
my $g_xl = Win32::OLE->new('Excel.Application', sub {$_[0]->Quit;});

# open the file. Note that excel requires a FULL path and it requires
# backslashes
my $xlFileHandle = $g_xl->Workbooks->Open("d:\\folder\\myexcelfile.xls");

# look up sheet1 by name.
my $xlSheetHandle = $xlFileHandle->WorkSheets("Sheet1");

# look up an individual cell
my $xlCellHandle = $xlSheetHandle->Cells(12, 3); # cell C12

# get its value
my $uni = $xlCellHandle->{Value};

# there is some voodoo here. I'm sure some perl guru can explain
# if I just use $uni directly things don't work!?!?
my $string = $uni;

Of course the code above doesn't check for errors, it's just a sample to get you started.

One other thing about Excel and game messages. Sometimes you are just writing paragraphs that can be word wrapped automatically and sometimes you need specific line breaks. For example:

You found the prize!

Congratulations!

You can insert line breaks in a single cell in Excel using ALT+ENTER. You might need to send that little piece of info to your translators.

Here are a few tools I wrote that might come in handy. One is a tool for extracting a single column out of Excel and saving it to a file in the encoding of your choice. The other is a tool that extracts individual cells based on a template you specify. It may be suitable for generating data for some other tool to compile into a string table. Even if they are not directly useful for you they may be good examples, although they have not been used extensively and may be incomplete in areas 😖. (Note: right click the links to save them)

Both are command line tools. If you have perl installed, which you can get here, you should be able to run them from the command line just by saving them with a .pl extension and then typing their name while in the same folder you saved them in. If you add a --man as an argument as in

xlextractcolumn.pl --man

they should print out some simple documentation.

Another problem I've seen is putting more than the strings in the file for the translators. For example, I worked on a product where they had set up an extremely complicated Excel spreadsheet with a name and about 57 stats for each monster. They had an elaborate macro to write out all of the names and stats into a C file which got compiled into the game. To translate, they sent the Excel file to the translators. When they got it back they just ran their macro to spit out a new C file. Well, guess what, they have NO clue if the translators accidentally changed one of those stats, and that could take months to debug. Imagine your testers finding out the user can never kill the big dragon in level 12 because the translators accidentally changed its hitpoints from 1000 to 10000. Don't give the translators more than they need or you're just asking for trouble.

In a similar way, I saw a project where they had special editors for everything in the game. For example, this particular game had lots of robots. They had a robot stat editor written in C++ using MFC and Windows dialogs with little boxes and checkmarks for all the stats for all the robots. Included in the tool was a box to type in a paragraph of text to describe the robot. All of the data was then saved out into their private binary formats.

There were two issues. One, there were no docs on how to run the tools to regenerate the data. I'm sure there probably were never any docs even on the original team. Instead the programmer that made the tool just explained person to person how to use the tool and it was never documented. But, worse, we're back to the same problem. If we sent that tool off and asked the translators to translate the descriptions we have no guarantee they won't accidentally edit some stats. We also now have to explain to 6-10 translators in different countries how to use these tools, which have menus and buttons and labels written in only one language, not the native language of the translator, and which most likely run on only one type of computer (e.g., the translator might not be running Windows).

Finally, if you have any special codes that need to appear in the text, my current thinking is you should do this the HTML way. For example, let's say you had the string "<player> takes damage from <creature>". I think it would be better specified like that, using <player> rather than %p or %c or something more cryptic. The more cryptic way seems more error prone for non-coders, and your tools can always convert <player> and <creature> to something more compact like %p etc. That also means if the text says "<player> does <percentdamage>% damage to <creature>" you can properly escape the actual % that the translator wants printed, as opposed to a more traditional way which would require the translator to type "%p does %d%% damage to %c", where the translator would most likely not know to type the double %. On top of which, it should be easier to check for errors. If you are using single character % codes it would be easy for the translator to mess up without anyone noticing, but if you use something like <player> and the translator types <plyer> it will be easy for your tools to notice the problem and inform you about it.
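
Here is a sketch of what that conversion step could look like. The function name compileTags and the tag table are invented for illustration, and the sketch assumes a literal < never appears in the text except to start a tag.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Convert translator-friendly tags into compact printf-style codes,
// escaping literal percent signs along the way.
std::string compileTags(const std::string& text,
                        const std::map<std::string, std::string>& codes)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (text[i] == '%')
        {
            out += "%%";  // a percent sign the translator wants printed
        }
        else if (text[i] == '<')
        {
            std::size_t end = text.find('>', i);
            if (end == std::string::npos)
            {
                throw std::runtime_error("unterminated tag in: " + text);
            }
            std::string tag = text.substr(i + 1, end - i - 1);
            auto it = codes.find(tag);
            if (it == codes.end())
            {
                throw std::runtime_error("unknown tag <" + tag + "> in: " + text);
            }
            out += it->second;
            i = end;  // continue after the closing '>'
        }
        else
        {
            out += text[i];
        }
    }
    return out;
}
```

Given a table mapping player to %p, creature to %c and percentdamage to %d, the damage message above compiles to "%p does %d%% damage to %c", while a misspelled tag stops the build with a message naming the bad tag and the string it came from.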

The same might go for formatting as in <center>Hooray!</center> or <red>blood</red> etc... Those kinds of things might argue for using something other than Excel like an HTML editor but an HTML editor would have many of the same problems as a text file being so free form.

This article has really only covered text and there are a host of other issues involving voice and graphics. Voice should be relatively obvious with a little thought. Graphics are something the artists making them generally don't think about. If you go putting up billboards or signs or graffiti throughout your game, all those textures and graphics may have to be translated. You can make a conscious decision to avoid words in your graphics. If that is not an option, you at least know that all those textures that have words will need to be translated, so they need to be tracked, marked and put aside. You'll want to keep the original working files, the Photoshop files before they've been flattened and exported, the ones with all the effects and layers still around, so it will be easy for the localizers to try to match the original style. You'll also need processes for getting your data rebuilt with the new translated textures, and they are going to have to be documented for the people doing the localization. If the original team is doing it, you'll need to design processes to build the various versions of the data that use these different image files. Maybe you really should avoid putting words in your textures.

I hope you find some of these ideas useful. I would like nothing better than to never have to worry about these kinds of localization issues again because everyone started taking stuff like this into consideration early in development. It's good for you on many levels. You'll save money because localization will go more smoothly, you may only have to pay for translation instead of also paying for localization since you already thought of the issues, you'll stop more grey market sales since the faster you can localize the faster local versions will ship, and you'll be less frustrated with all the small back and forth issues that come up since you'll have created something that by design has fewer chances for mistakes. 😊


PC vs. Console programming

I should mention, since this will probably come up, that on Windows and the Mac the default way of handling localization for applications is resource files. You put the graphics and strings for each language into your resource file for the application and the operating system handles the rest. It will load the correct strings for the user's current language automatically on demand and hand them to you.

While that may be fine on a PC that uses a hard drive, virtual memory, and a RAM cache for reading CDs/DVDs, a console doesn't generally have all of that. There is no cache, no virtual memory, there may or may not be a hard drive, and when you want to display a string you want it immediately, not 0.1 to 5 seconds later. No cache means reading the DVD/CD is slow. No virtual memory means that managing memory is important, so having some OS-like service go off and randomly allocate some memory is generally something you need to avoid. Also, often you are using the DVD/CD for music or video, and if all of a sudden that has to be interrupted to get a string, something's going to break.

On top of which, most resource editors are really not designed for non-technical people to use. They are not going to word wrap for you, and they are not going to pre-compile to %p or some smaller, more memory efficient code. So, at least for console games, I would recommend rolling something yourself rather than relying on trying to emulate PC/Mac-like resource files.


Update: Based on a suggestion from my friend John Alvarado at Inxile Entertainment, an arguably better way to use data from Excel is to just export the data as XML and then use any of the myriad ways to parse XML to get to the data. Nearly every computer language supports XML and by default Excel exports in Unicode so all languages should be covered. Here is some example code so you can automate the exporting part.

#!/usr/bin/perl
use strict;
use warnings;

use Win32::OLE;
use Win32::OLE::Variant;
use Win32::OLE::Const;
use Data::Dumper;
use File::Spec;

# set perl's OLE module to return Unicode
# if you don't do this perl will convert to the current locale
Win32::OLE->Option(CP => Win32::OLE::CP_UTF8());

# get the Excel Application
my $g_xl = Win32::OLE->new('Excel.Application', sub {$_[0]->Quit;});

# get Excel's constants
my $xl = Win32::OLE::Const->Load($g_xl);
#print Dumper ($xl);

{
   my $srcFilename = $ARGV[0];
   my $dstFilename = $ARGV[1];

   print "converting $srcFilename to $dstFilename\n";
   # open the file. Note that excel requires a FULL path and it requires
   # backslashes
   my $xlFileHandle = $g_xl->Workbooks->Open($srcFilename);

   # remove any old copy of the destination file before saving
   if (-e $dstFilename)
   {
      unlink ($dstFilename);
   }
   $xlFileHandle->SaveAs(
      $dstFilename,
      $xl->{'xlXMLSpreadsheet'}
      );

   $xlFileHandle->Close;

   undef $xlFileHandle;
}

undef $xl;

$g_xl->Quit;

undef $g_xl;

I used this for Tanjun'ka and instead of using an actual XML parser I just used some simple regular expressions to pull out the data. At the time I didn't know much about parsing the data with the XML parser in C# and I figured that with schemas etc. I'd probably run into exceptions and other things that I wasn't ready to deal with, so I opted for a simple solution. If it bites me at some point I'll switch to the full parser but for now this seems to be working. The format of my Excel files is as follows: one file per language, with three columns, that looks like this

Label | Text | Notes
LanguageLabel | 日本語 | Name of this Language
ToolStripNew | 新規作成 | "New"
ToolStripOpen | 開く | "Open"
ToolStripSave | 保存 | "Save"
ToolStripCut | カット | "Cut"
ToolStripCopy | コピー | "Copy"
ToolStripPaste | 貼り付け | "Paste"
PhotoRevert | リセット | "Reset" : Photo Bottom Buttons
PhotoUndo | 取り消し | "Undo" : Photo Bottom Buttons
PhotoRedo | やり直す | "Redo" : Photo Bottom Buttons
PhotoTabCrop | 回転/切り抜き | "Rotate/Crop" : Photo Tab
PhotoTabBright | 明るさ | "Brightness" : Photo Tab
PhotoTabHue | Hue/Saturation | "Hue/Saturation" : Photo Tab
PhotoContrast | コントラスト | "Contrast"

Then, to pull out the data I just parse the file and put all the values into a hash/dictionary based on the labels something like this.

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace Tanjunka
{
  class Util
  {
    private Dictionary<string, string> GetLangDict(string filename)
    {
      Dictionary<string, string> lang = new Dictionary<string,string>();

      string str = readFileToString(filename);

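      // Capture the first two cells (label, text) of each row.
      // Note: '.' does not match a newline unless RegexOptions.Singleline
      // is also set, so add that option if your exported XML puts <Row>
      // and <Cell> on separate lines.
      Regex r = new Regex("<Row.*?>.*?<Cell.*?>.*?<Data.*?>(?<1>.*?)" +
            "</Data>.*?</Cell>.*?<Cell.*?>.*?<Data.*?>(?<2>.*?)"+
            "</Data>.*?</Cell>",
            RegexOptions.IgnoreCase|RegexOptions.Compiled);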
      MatchCollection mc = r.Matches(str);
      for (int ii = 0; ii < mc.Count; ii++)
      {
        Match m = mc[ii];

        if (lang.ContainsKey(m.Groups[1].ToString()))
        {
          Console.WriteLine("** duplicate entry for label(" +
           m.Groups[1].ToString() +
           ") in language file " + filename);
        }
        else
        {
          lang[m.Groups[1].ToString()] =
            UnEntify(m.Groups[2].ToString());
        }
      }
      return lang;
    }

    private string readFileToString(string filename)
    {
      StreamReader sr = new StreamReader(filename);
      string s = sr.ReadToEnd();
      sr.Close();
      return s;
    }

    private string UnEntify(string str)
    {
      // replace common entities;
      str = Regex.Replace( str,  "&lt;"  , "<"  );
      str = Regex.Replace( str,  "&gt;"  , ">"  );
      str = Regex.Replace( str,  "&quot;", "\"" );
      str = Regex.Replace( str,  "&nbsp;", " "  );
      str = Regex.Replace( str,  "&amp;" , "&"  );
      str = Regex.Replace( str,  "…"  , "...");

      return str;
    }
  }
}

C#'s StreamReader defaults to reading that file as UTF-8 so, except for things that Excel itself escaped, it just seems to work.


Also see this page for some really simple perl.
