How Languages Handle Unicode
I often assume that processing strings of characters in an application is simple. After all, the string APIs provided by programming languages are usually very simple. However, processing characters is only straightforward when edge cases are avoided, and with the growth of Unicode and the use of special characters (for localization, emojis, etc.), edge cases are increasingly common. This post explores the complexities of Unicode and how programming languages handle its edge cases.
What is Unicode?
Unicode is a standard for creating character encodings and handling characters in a consistent manner¹. Unicode was built upon older character encodings such as ISO 8859 and its subset ASCII. These older encodings have limited scope (ASCII has just 128 characters), making them ill-suited for international applications. While ASCII often works fine for applications used by English speakers in the United States (ASCII contains the characters A-Z, the numbers 0-9, punctuation, and common symbols), anyone else is out of luck. Unicode fixes these limitations, encoding characters from languages across the world and even adding fun symbols and emojis. As of Unicode 11.0, over 137,000 characters exist.
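To make the contrast concrete, here is a minimal sketch (in Python, chosen only for illustration) that prints the Unicode code point of a few characters and checks whether each one fits in ASCII's 128-character range.

for ch in ["A", "é", "你", "🎉"]:
    code_point = ord(ch)            # the character's Unicode code point
    in_ascii = code_point < 128     # ASCII only covers code points 0-127
    print(f"{ch!r}: U+{code_point:04X}, fits in ASCII: {in_ascii}")

# Output:
# 'A': U+0041, fits in ASCII: True
# 'é': U+00E9, fits in ASCII: False
# '你': U+4F60, fits in ASCII: False
# '🎉': U+1F389, fits in ASCII: False

Only the plain Latin letter fits in ASCII; the accented letter, the Chinese character, and the emoji all require code points that only Unicode provides.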