How to Build a Multilingual App: A Demo With PHP and Gettext
Making your website or web app available to a wider audience often requires it to be available in multiple languages. For non-English projects, you can increase your audience by releasing it in English as well as your native language. Internationalizing and localizing your project, however, becomes a much easier process if you start during its infancy.
In this article, Toptal Software Engineer Igor Gomes dos Santos shows us how to leverage simple tools, like Gettext and Poedit, to internationalize and localize a PHP project.
With a big eye on UI/UX, Igor is a developer with strong PHP background (10+ years), moving into JS-land for more interactive experiences.
Whether you are building a website or a full-fledged web application, making it accessible to a wider audience often requires it to be available in different languages and locales.
Fundamental differences between most human languages make this anything but easy. The differences in grammar rules, language nuances, date formats, and more combine to make localization a unique and formidable challenge.
Consider this simple example.
Rules of pluralization in English are pretty straightforward: you can have a singular form of a word or a plural form of a word.
In other languages, though – such as Slavic languages – there are two plural forms in addition to the singular one. You may even find languages with a total of four, five, or six plural forms, such as in Slovenian, Irish, or Arabic.
The way your code is organized, and how your components and interface are designed, plays an important role in determining how easily you can localize your application.
Internationalization (i18n) of your codebase helps ensure that it can be adapted to different languages or regions with relative ease. Internationalization is usually done once, preferably at the beginning of the project, to avoid needing huge changes in the source code down the road.
Once your codebase has been internationalized, localization (l10n) becomes a matter of translating the contents of your application to a specific language/locale.
Localization needs to be performed every time a new language or region needs to be supported. Also, whenever a part of the interface (containing text) is updated, new content becomes available - which then needs to be localized (i.e., translated) to all supported locales.
In this article, we will learn how to internationalize and localize software written in PHP. We will go through the various implementation options and the different tools that are available at our disposal to ease the process.
Tools for Internationalization
The easiest way to internationalize PHP software is by using array files. Arrays will be populated with translated strings, which can then be looked up from within templates:
<h1><?=$TRANS['title_about_page']?></h1>
This is, however, hardly a recommended way for serious projects, as it will definitely pose maintenance issues down the road. Some issues might even appear in the very beginning, such as the lack of support for variable interpolation or pluralization of nouns and so on.
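To make those limitations concrete, here is a minimal sketch of the array-file approach. The contents of $TRANS and the tr() helper are assumptions made up for illustration; they are not part of PHP or of any library mentioned in this article.

```php
<?php
// Hypothetical translation table, as it might be returned by a
// per-language file (the keys and strings are illustrative only).
$TRANS = [
    'title_about_page' => 'Sobre nós',
    'unread_messages'  => '%d mensagens não lidas',
];

// Hypothetical lookup helper: falls back to the key itself when a
// translation is missing, so gaps stay visible on screen.
function tr(array $table, string $key): string
{
    return $table[$key] ?? $key;
}

echo tr($TRANS, 'title_about_page'), PHP_EOL;

// Interpolation has to be bolted on by hand with sprintf(), and nothing
// here chooses between singular and plural forms: with a count of 1
// this still prints the plural sentence.
echo sprintf(tr($TRANS, 'unread_messages'), 1), PHP_EOL;
```

The second echo shows the problem: a plain array has no notion of plural rules, so every call site would need its own branching logic.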
One of the most classic tools (often taken as reference for i18n and l10n) is a Unix tool called Gettext.
Though it dates back to 1995, it is still a comprehensive tool for translating software. It is easy to get started with, yet it comes with powerful supporting tools.
Gettext is what we’ll be using in this post. We will be presenting a great GUI application that can be used to easily update your l10n source files, thereby avoiding the need to deal with the command line.
Libraries To Make Things Easy
There are major PHP web frameworks and libraries that support Gettext and other implementations of i18n. Some are easier to install than others, sport additional features, or support different i18n file formats. Although this article focuses on the tools provided with the PHP core, here’s a list of some others worth mentioning:
- oscarotero/Gettext: Gettext support with an object-oriented interface; includes improved helper functions and powerful extractors for several file formats (some of them not supported natively by the gettext command). It can also export to formats beyond just .mo/.po files, which can be useful if you need to integrate your translation files into other parts of the system, like a JavaScript interface.
- symfony/translation: Supports a lot of different formats, but recommends using verbose XLIFF files. Doesn’t include helper functions or a built-in extractor, but supports placeholders using strtr() internally.
- zend/i18n: Supports array and INI files, or Gettext formats. Implements a caching layer to avoid needing to read the file system every time. Also includes view helpers, and locale-aware input filters and validators. However, it has no message extractor.
Other frameworks also include i18n modules, but those are not available outside of their codebases:
- Laravel: Supports basic array files; has no automatic extractor but includes a @lang helper for template files.
- Yii: Supports array, Gettext, and database-based translation, and includes a messages extractor. Backed by the Intl extension, available since PHP 5.3, and based on the ICU project. This enables Yii to run powerful replacements, like spelling out numbers, formatting dates, times, intervals, currency, and ordinals.
If you decide to go for one of the libraries that provide no extractors, you may want to use the Gettext formats, so you can use the original Gettext toolchain (including Poedit) as described in the rest of this article.
Installing Gettext
You might need to install Gettext and the related PHP library by using your package manager, like apt-get or yum. After it’s installed, enable it by adding extension=gettext.so (Linux/Unix) or extension=php_gettext.dll (Windows) to your php.ini file.
Here we will also be using Poedit to create translation files. You will probably find it in your system’s package manager; it’s available for Unix, Mac, and Windows and can be downloaded for free on its website as well.
Types of Gettext Files
There are three file types you usually deal with while working with Gettext.
The main ones are PO (Portable Object) and MO (Machine Object) files, the first being a list of readable “translated objects” and the second being the corresponding binary (to be interpreted by Gettext when doing localization). There’s also a POT (PO Template) file, that simply contains all existing keys from your source files, and can be used as a guide to generate and update all PO files.
The template files are not mandatory; depending on the tool you’re using to do l10n, you’ll be just fine with only PO/MO files. You’ll have one pair of PO/MO files per language and region, but only one POT per domain.
Separating Domains
There are some cases, in big projects, where you might need to separate translations when the same words convey different meaning in different contexts.
In those cases, you’ll need to split them into different “domains,” which are basically named groups of POT/PO/MO files, where the filename is the said translation domain.
Small and medium-sized projects usually, for simplicity, use only one domain; its name is arbitrary, but we will be using “main” for our code samples.
In Symfony projects, for example, domains are used to separate the translation for validation messages.
Locale Code
A locale is simply a code that identifies one version of a language. It’s defined following the ISO 639-1 and ISO 3166-1 alpha-2 specs: two lower-case letters for the language, optionally followed by an underscore and two upper-case letters identifying the country or regional code.
For rare languages, three letters are used.
For some speakers, the country part may seem redundant. In fact, some languages have dialects in different countries, such as Austrian German (de_AT) or Brazilian Portuguese (pt_BR). The second part is used to distinguish between those dialects - when it’s not present, it’s taken as a “generic” or “hybrid” version of the language.
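As a rough illustration of that format, the sketch below splits a locale code into its language and region parts. The parse_locale() function is a hypothetical helper written for this article, not something provided by PHP or Gettext, and it only covers the language[_REGION] shape described above.

```php
<?php
/**
 * Splits a locale code such as "pt_BR" into language and region parts.
 * Hypothetical helper for illustration; returns null for codes that do
 * not match the language[_REGION] shape.
 */
function parse_locale(string $locale): ?array
{
    // 2-3 lower-case letters, optionally "_" plus 2 upper-case letters.
    if (!preg_match('/^([a-z]{2,3})(?:_([A-Z]{2}))?$/', $locale, $m)) {
        return null;
    }
    return ['language' => $m[1], 'region' => $m[2] ?? null];
}

var_dump(parse_locale('pt_BR')); // language "pt", region "BR"
var_dump(parse_locale('de'));    // language "de", no region
var_dump(parse_locale('PT-br')); // null: wrong case and separator
```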
Directory Structure
To use Gettext, we will need to adhere to a specific structure of folders.
First, you’ll need to select an arbitrary root for your l10n files in your source repository. Inside it, you’ll have a folder for each needed locale, and a fixed “LC_MESSAGES” folder that will contain all your PO/MO pairs.
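The layout can be summarized as root/locale/LC_MESSAGES/domain.mo. The sketch below builds that path; catalog_path() is a hypothetical helper, and the "../locales" root and "main" domain are simply the values used in this article's samples.

```php
<?php
/**
 * Builds the path where Gettext expects the compiled catalog for a
 * given locale and domain. Hypothetical helper for illustration only.
 */
function catalog_path(string $root, string $locale, string $domain): string
{
    return sprintf('%s/%s/LC_MESSAGES/%s.mo', rtrim($root, '/'), $locale, $domain);
}

echo catalog_path('../locales', 'pt_BR', 'main'), PHP_EOL;
// ../locales/pt_BR/LC_MESSAGES/main.mo
```

The matching .po file usually sits next to the .mo file, in the same LC_MESSAGES folder.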
Plural Forms
As we said in the introduction, different languages might sport different pluralization rules. However, Gettext saves us this trouble.
When creating a new .po file, you’ll have to declare the pluralization rules for that language, and translated pieces that are plural-sensitive will have a different form for each of those rules.
When calling Gettext in code, you’ll have to specify a number related to the sentence (e.g. for the phrase “You have n messages.”, you will need to specify the value of n), and it will work out the correct form to use - even using string substitution if needed.
Plural rules are composed of the number of rules necessary with a boolean test for each rule (test for at most one rule may be omitted). For example:
- Japanese: nplurals=1; plural=0; - one rule: there are no plural forms.
- English: nplurals=2; plural=(n != 1); - two rules: use the plural form only when n is not 1, otherwise use the singular form.
- Brazilian Portuguese: nplurals=2; plural=(n > 1); - two rules: use the plural form only when n is greater than 1, otherwise use the singular form.
For a deeper explanation, there’s an informative LingoHub tutorial available online.
Gettext will determine which rule to use based on the number provided and will use the correct localized version of the string. For strings where pluralization needs to be handled, you will need to include in the .po file a different sentence for each plural rule defined.
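To see how the three rules listed above behave, the sketch below rewrites each "plural=" expression as a PHP closure that returns the index of the form to use. This is only an illustration of the arithmetic (it assumes PHP 7.4+ for arrow functions); Gettext itself evaluates the expression stored in the .po header, and the $pluralIndex array is a made-up name.

```php
<?php
// Each closure mirrors one "plural=" expression and returns the index
// of the msgstr[] entry that would be selected for the number $n.
$pluralIndex = [
    'ja'    => fn (int $n): int => 0,               // plural=0
    'en'    => fn (int $n): int => (int) ($n != 1), // plural=(n != 1)
    'pt_BR' => fn (int $n): int => (int) ($n > 1),  // plural=(n > 1)
];

foreach ([0, 1, 2] as $n) {
    printf(
        "n=%d  ja:%d  en:%d  pt_BR:%d\n",
        $n,
        $pluralIndex['ja']($n),
        $pluralIndex['en']($n),
        $pluralIndex['pt_BR']($n)
    );
}
```

Note the difference at n=0: the English rule selects the plural form, while the Brazilian Portuguese rule selects the singular one.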
Sample Implementation
After all that theory, let’s get a little practical. Here’s an excerpt of a .po file (don’t worry yet too much about the syntax, but instead just get a sense of the overall content):
msgid ""
msgstr ""
"Language: pt_BR\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Plural-Forms: nplurals=2; plural=(n > 1);\n"
msgid "We're now translating some strings"
msgstr "Nós estamos traduzindo algumas strings agora"
msgid "Hello %1$s! Your last visit was on %2$s"
msgstr "Olá %1$s! Sua última visita foi em %2$s"
msgid "Only one unread message"
msgid_plural "%d unread messages"
msgstr[0] "Só uma mensagem não lida"
msgstr[1] "%d mensagens não lidas"
The first section works like a header, with the msgid and msgstr left empty.
It describes the file encoding, plural forms, and a few other things. The second section translates a simple string from English to Brazilian Portuguese, and the third does the same, but leverages string replacement from sprintf, enabling the translation to contain the username and visit date.
The last section is a sample of pluralization forms, displaying the singular and plural version as msgid in English and their corresponding translations as msgstr 0 and 1 (following the number given by the plural rule).
There, string replacement is used as well, so the number can be seen directly in the sentence, by using %d. The plural forms always have two msgid entries (singular and plural), so it’s advised to not use a complex language as the source of translation.
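The numbered placeholders (%1$s, %2$s) matter because a translation may need the arguments in a different order than the source sentence. Here is a small sketch of that behavior using plain sprintf(); the name, the date, and the reordered sentence are invented for illustration and do not come from the .po excerpt.

```php
<?php
// Single quotes keep PHP from treating "$s" as a variable.
$source     = 'Hello %1$s! Your last visit was on %2$s';
$translated = 'Olá %1$s! Sua última visita foi em %2$s';
// Hypothetical translation that mentions the date before the name.
$reordered  = 'Última visita: %2$s. Olá, %1$s!';

// The calling code passes the arguments in the same order every time;
// the numbered placeholders decide where each one lands.
foreach ([$source, $translated, $reordered] as $format) {
    echo sprintf($format, 'Ana', '2024-05-10'), PHP_EOL;
}
```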
Localization Keys
As you may have noticed, we’re using the actual English sentence as the source ID. That msgid is the same used throughout all your .po files, meaning other languages will have the same format and the same msgid fields but translated msgstr lines.
Speaking of translation keys, there are two standard “philosophical” approaches here:
1. msgid as a real sentence
The main advantages of this approach are:
- If there are parts of the software untranslated in any given language, the key displayed will still maintain some meaning. For example, if you know how to translate from English to Spanish but need help translating to French, you might publish the new page with missing French sentences, and parts of the website would be displayed in English instead.
- It’s much easier for the translator to understand what’s going on and make a proper translation based on the msgid.
- It gives you “free” l10n for one language - the source one.
On the other hand, the primary disadvantage is that, if you need to change the actual text, you need to replace the same msgid across several language files.
2. msgid as a unique, structured key
This would describe the sentence role in the application in a structured way, including the template or part where the string is located instead of its content.
This is a great way to have the code organized, separating the text content from the template logic. However, that could present problems to the translator who would miss the context.
A source language file would be needed as a basis for other translations. For example, the developer would ideally have an “en.po” file that translators would read to understand what to write in “fr.po”.
Missing translations would display meaningless keys on screen (“top_menu.welcome” instead of “Hello there, User!” on the said untranslated French page).
That’s good as it would force translation to be complete before publishing - but bad as translation issues would be really awful in the interface. Some libraries, though, include an option to specify a given language as “fallback,” having a similar behavior as the other approach.
The Gettext manual favors the first approach as, in general, it’s easier for translators and users in case of trouble. That’s the approach we’ll be using here as well.
It should be noted, though, that the Symfony documentation favors keyword-based translation, to allow for independent changes of all translations without affecting templates as well.
Everyday Usage
In a common application, you would use some Gettext functions while writing static text in your pages.
Those sentences would then appear in .po files, get translated, compiled into .mo files, and then used by Gettext when rendering the actual interface. Given that, let’s tie together what we have discussed so far in a step-by-step example:
1. A sample template file, including some different gettext calls
<?php include 'i18n_setup.php' ?>
<div id="header">
<h1><?=sprintf(gettext('Welcome, %s!'), $name)?></h1>
<!-- code indented this way only for legibility -->
<?php if ($unread): ?>
<h2>
<?=sprintf(
ngettext('Only one unread message', '%d unread messages', $unread),
$unread
)?>
</h2>
<?php endif ?>
</div>
<h1><?=gettext('Introduction')?></h1>
<p><?=gettext('We\'re now translating some strings')?></p>
- gettext() simply translates a msgid into its corresponding msgstr for a given language. There’s also the shorthand function _() that works the same way.
- ngettext() does the same but with plural rules.
- There are also dgettext() and dngettext(), which allow you to override the domain for a single call (more on domain configuration in the next example).
2. A sample setup file (i18n_setup.php as used above), selecting the correct locale and configuring Gettext
Using Gettext involves a bit of boilerplate code, but it is mostly about configuring the locales directory and choosing appropriate parameters (a locale and a domain).
<?php
/**
* Verifies if the given $locale is supported in the project
* @param string $locale
* @return bool
*/
function valid($locale) {
return in_array($locale, ['en_US', 'en', 'pt_BR', 'pt', 'es_ES', 'es'], true);
}
//setting the source/default locale, for informational purposes
$lang = 'en_US';
if (isset($_GET['lang']) && valid($_GET['lang'])) {
// the locale can be changed through the query-string
$lang = $_GET['lang']; //you should sanitize this!
setcookie('lang', $lang); //it's stored in a cookie so it can be reused
} elseif (isset($_COOKIE['lang']) && valid($_COOKIE['lang'])) {
// if the cookie is present instead, let's just keep it
$lang = $_COOKIE['lang']; //you should sanitize this!
} elseif (isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
// default: look for the languages the browser says the user accepts
$langs = explode(',', $_SERVER['HTTP_ACCEPT_LANGUAGE']);
array_walk($langs, function (&$lang) { $lang = strtr(strtok($lang, ';'), ['-' => '_']); });
foreach ($langs as $browser_lang) {
if (valid($browser_lang)) {
$lang = $browser_lang;
break;
}
}
}
// here we define the global system locale given the found language
putenv("LANG=$lang");
// this might be useful for date functions (LC_TIME) or money formatting (LC_MONETARY), for instance
setlocale(LC_ALL, $lang);
// this will make Gettext look for ../locales/<lang>/LC_MESSAGES/main.mo
bindtextdomain('main', '../locales');
// indicates in what encoding the file should be read
bind_textdomain_codeset('main', 'UTF-8');
// if your application has additional domains, as cited before, you should bind them here as well
bindtextdomain('forum', '../locales');
bind_textdomain_codeset('forum', 'UTF-8');
// here we indicate the default domain the gettext() calls will respond to
textdomain('main');
// this would look for the string in forum.mo instead of main.mo
// echo dgettext('forum', 'Welcome back!');
?>
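The setup file above tries the browser's languages in the order they appear in the Accept-Language header. Browsers can also attach a weight (q-value) to each language, so a possible refinement is to sort by that weight first. The sketch below is one way to do it under that assumption; pick_locale() is a hypothetical helper, not part of Gettext or PHP, and it requires PHP 7.4+ for the arrow function.

```php
<?php
/**
 * Picks the first supported locale from an Accept-Language header,
 * honoring q-values. Hypothetical helper for illustration only.
 */
function pick_locale(string $header, array $supported, string $default): string
{
    $candidates = [];
    foreach (explode(',', $header) as $position => $part) {
        $pieces = explode(';', trim($part));
        // "pt-BR" becomes "pt_BR", matching the folder naming used above.
        $tag = str_replace('-', '_', trim($pieces[0]));
        $q = 1.0; // a missing q-value means full preference
        foreach (array_slice($pieces, 1) as $param) {
            if (preg_match('/^\s*q=([0-9.]+)\s*$/', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        $candidates[] = [$tag, $q, $position];
    }
    // Highest q first; ties keep the order sent by the browser.
    usort($candidates, fn ($a, $b) => [$b[1], $a[2]] <=> [$a[1], $b[2]]);

    foreach ($candidates as [$tag]) {
        if (in_array($tag, $supported, true)) {
            return $tag;
        }
    }
    return $default;
}

$supported = ['en_US', 'en', 'pt_BR', 'pt', 'es_ES', 'es'];
echo pick_locale('en-GB;q=0.5, pt-BR, es;q=0.8', $supported, 'en_US'), PHP_EOL;
```

In this call, en_GB is not in the supported list and pt_BR carries the highest weight, so pt_BR is chosen. The comparison is case-sensitive, so a header that sends "pt-br" in lower case would need normalizing first.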
3. Preparing translation for the first run
One of the great advantages Gettext has over custom framework i18n packages is its extensive and powerful file format.
Perhaps you’re thinking “Oh man, that’s quite hard to understand and edit by hand, a simple array would be easier!” Make no mistake, applications like Poedit are here to help - a lot. You can get the program from its website; it’s free and available for all platforms. It’s a pretty easy tool to get used to, and a very powerful one at the same time - using all features Gettext has available. We’ll be working here with the latest version, Poedit 1.8.
In the first run, you should select “File > New…” from the menu. You’ll be asked for the language; select/filter the language you want to translate to, or use the format we mentioned before, such as en_US or pt_BR.
Now, save the file - using that directory structure we mentioned as well. Then you should click “Extract from sources”, and here you’ll configure various settings for the extraction and translation tasks. You’ll be able to find all those later through “Catalog > Properties”:
- Source paths: Include all folders from the project where gettext() (and siblings) are called - this is usually your templates/views folder(s). This is the only mandatory setting.
- Translation properties:
  - Project name and version, Team and Team’s email address: Useful information that goes in the .po file header.
  - Plural forms: These are the rules we mentioned before. You can leave it with the default option most of the time, as Poedit already includes a handy database of plural rules for many languages.
  - Charsets: UTF-8, preferably.
  - Source code charset: The charset used by your codebase - probably UTF-8 as well, right?
- Source keywords: The underlying software knows how gettext() and similar function calls look in several programming languages, but you might as well create your own translation functions. This is where you’ll add those other methods. This will be discussed later in the “Tips” section.
After setting those properties, Poedit will run a scan through your source files to find all the localization calls. After every scan, Poedit will display a summary of what was found and what was removed from the source files. New entries will appear empty in the translation table, allowing you to enter the localized versions of those strings. Save it and a .mo file will be (re)compiled into the same folder and, presto!, your project is internationalized!
Poedit can also suggest common translations from the web and from previous files. It’s handy so you only have to check if they make sense, and accept them. If you are unsure about a translation, you can mark it as Fuzzy, and it will be displayed in yellow. Blue entries are those that have no translation.
4. Translating strings
As you may have noticed, there are two main types of localized strings: simple ones and those with plural forms.
Simple ones have only two boxes: source and localized string. The source string can’t be modified, since Gettext/Poedit do not include the ability to alter your source files; rather, you will need to change the source itself and rescan the files. (Tip: If you right-click a translation line, it will display a hint with the source files and lines where that string is being used.)
Plural form strings include two boxes to show the two source strings, and tabs so you can configure the different final forms.
Example of a string with a plural form on Poedit, showing a translation tab for each one.
Whenever you change your source code files and need to update the translations, just hit Refresh and Poedit will rescan the code, removing non-existent entries, merging the ones that changed and adding new ones.
Poedit may also try to guess some translations, based on other ones you did. Those guesses and the changed entries will receive a “Fuzzy” marker, indicating that they need review, displayed in yellow in the list.
It’s also useful if you have a translation team and someone tries to write something they’re not sure about: just mark it Fuzzy and someone else will review it later.
Finally, it’s advised to leave “View > Untranslated entries first” marked, as it will help you avoid forgetting any entries. From that menu, you can also open parts of the UI that allow you to leave contextual information for translators if needed.
Tips & Tricks
Web servers may end up caching your .mo files.
If you’re running PHP as a module on Apache (mod_php), you might face issues with the .mo file being cached. It happens the first time it’s read, and then, to update it, you might need to restart the server.
On Nginx and PHP5 it usually takes only a couple of page refreshes to refresh the translation cache, and on PHP7 it is rarely needed.
Libraries provide helper functions to keep localization code short.
As preferred by many people, it’s easier to use _() instead of gettext(). Many custom i18n libraries from frameworks use something similar to t() as well, to make translated code shorter. However, gettext() is the only Gettext function that sports a built-in shortcut.
You might want to add some others to your project, such as __() or _n() for ngettext(), or maybe a fancy _r() that would join gettext() and sprintf() calls. Other libraries, such as oscarotero’s Gettext, also provide helper functions like these.
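As a rough sketch, such helpers could look like the following. The function bodies are assumptions made for illustration (only the names __(), _n(), and _r() come from the text above), and the code expects the gettext extension to be loaded.

```php
<?php
// Guard clause: the helpers below only make sense with the extension.
if (!extension_loaded('gettext')) {
    exit("The gettext extension is required for this sketch.\n");
}

/** Shortcut for gettext(). */
function __(string $msgid): string
{
    return gettext($msgid);
}

/** Shortcut for ngettext(): picks the singular or plural form for $n. */
function _n(string $singular, string $plural, int $n): string
{
    return ngettext($singular, $plural, $n);
}

/** Translates $msgid, then fills its placeholders as sprintf() would. */
function _r(string $msgid, ...$args): string
{
    return vsprintf(gettext($msgid), $args);
}

// With no catalog loaded, Gettext returns the source strings unchanged.
echo _r('Welcome, %s!', 'Ana'), PHP_EOL;
echo sprintf(_n('Only one unread message', '%d unread messages', 3), 3), PHP_EOL;
```

Keeping these wrappers thin means they behave exactly like the native functions, which makes them straightforward to describe to the string extractor.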
In those cases, you’ll need to instruct the Gettext utility on how to extract the strings from those new functions. Don’t be afraid, it’s very easy. It’s just a field in the .po file or a Settings screen in Poedit (in the editor, that option is inside “Catalog > Properties > Sources keywords”).
Remember: Gettext already knows the default functions for many languages, so don’t be concerned if that list seems empty. You need to include in that list the specifications of the new functions, following this specific format:
- If you create something like t(), that simply returns the translation for a string, you can specify it as t. Gettext will know the only function argument is the string to be translated;
- If the function has more than one argument, you can specify in which one the first string is and, if needed, the plural form as well. For instance, if our function signature is __('one user', '%d users', $number), the specification would be __:1,2, meaning the first form is the first argument, and the second form is the second argument. If your number comes as the first argument instead, the spec would be __:2,3, indicating the first form is the second argument, and so on.
After including those new rules in the .po file, a new scan will bring in your new strings just as easily as before.
Make Your PHP App Multilingual With Gettext
Gettext is a very powerful tool for internationalizing your PHP project. Beyond its flexibility that allows support for a large number of human languages, its support for more than 20 programming languages allows you to easily transfer your knowledge of using it with PHP to other languages like Python, Java, or C#.
Furthermore, Poedit can help smooth the path between code and translated strings, making the process more straightforward and easier to follow. It can also streamline shared translation efforts with its Crowdin integration.
Whenever possible, consider other languages your users might speak. This is especially important for non-English projects: you can widen your audience if you release your project in English as well as your native language.
Of course, not all projects have a need for internationalization, but it’s much easier to start i18n during a project’s infancy, even if not initially needed, than it is to do it later down the road should it subsequently become a requirement. And, with tools like Gettext and Poedit it is easier than ever.
Rio de Janeiro - State of Rio de Janeiro, Brazil
Member since February 15, 2016