Unicode and Encoding - Understanding Text at the Byte Level
Vaibhav • September 10, 2025
Text in C# isn’t just about characters; it’s about how those characters are represented, stored, and transmitted. Whether you're building a console app, reading files, or sending data over the web, understanding Unicode and encoding is essential. This article walks you through the fundamentals of Unicode, how C# handles strings internally, and how to work with different encodings safely and effectively.
What Unicode Actually Is
Unicode is a global standard that assigns a unique number (called a code point) to every character in every language. For example, the letter 'A' is U+0041, and the emoji 😊 is U+1F60A. This system allows consistent representation of text across platforms and languages.
Unicode is a character set, not an encoding. Encodings like UTF-8, UTF-16, and UTF-32 define how Unicode code points are stored as bytes.
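To make the distinction between a code point and its encoded bytes concrete, here is a minimal sketch. It assumes a .NET version that supports top-level statements; the variable names are only illustrative.

```csharp
using System;
using System.Text;

// A code point is just a number; an encoding decides which bytes store it.
string smile = "😊";
int codePoint = char.ConvertToUtf32(smile, 0);   // reads the surrogate pair at index 0
Console.WriteLine($"U+{codePoint:X4}");          // U+1F60A

// The same code point serialized by three different encodings.
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(smile)));    // F0-9F-98-8A
Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(smile))); // 3D-D8-0A-DE
Console.WriteLine(BitConverter.ToString(Encoding.UTF32.GetBytes(smile)));   // 0A-F6-01-00
```

The code point never changes; only the byte sequence does.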
How C# Internally Represents Strings
In C#, strings are sequences of char values, and each char is a UTF-16 code unit. This means C# uses UTF-16 internally. Most common characters (like English letters) fit into a single char, but some characters (like emojis or rare scripts) require two char values; this is called a surrogate pair.
string s = "😊";
Console.WriteLine(s.Length); // Output: 2
Even though the string contains one visible character, its length is 2 because UTF-16 uses two code units for characters outside the Basic Multilingual Plane (BMP).
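If you need a count that matches what a reader sees rather than the number of code units, one possible approach is sketched below. CountCodePoints is a hypothetical helper written for this example, not a framework method; StringInfo and the char methods are part of .NET.

```csharp
using System;
using System.Globalization;

string s = "Hello 🌍";

// UTF-16 code units: what string.Length reports.
Console.WriteLine(s.Length);                               // 8

// Unicode code points: a surrogate pair counts once.
Console.WriteLine(CountCodePoints(s));                     // 7

// User-perceived characters (text elements).
Console.WriteLine(new StringInfo(s).LengthInTextElements); // 7

// Hypothetical helper: walks the string and treats each valid surrogate pair as one code point.
static int CountCodePoints(string text)
{
    int count = 0;
    for (int i = 0; i < text.Length; i++)
    {
        // Skip the low half of a valid pair so the pair is counted once.
        if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            i++;
        count++;
    }
    return count;
}
```

Indexing with s[i] or calling Substring works on code units, so slicing in the middle of a surrogate pair produces an invalid string.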
Encodings: UTF-8, UTF-16, UTF-32
Encodings determine how Unicode code points are stored as bytes. C# provides built-in support for several encodings via the System.Text.Encoding class.
- UTF-8: Variable-length encoding (1–4 bytes). Efficient for ASCII-heavy text. Common in web and file formats.
- UTF-16: Used internally by .NET strings. Two bytes for BMP characters, four bytes (a surrogate pair) for everything else.
- UTF-32: Fixed-length (4 bytes per code point). Rarely used due to size overhead.
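using System.Text;

string text = "Hello 🌍";
byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);     // 10 bytes: 6 ASCII + 4 for the emoji
byte[] utf16Bytes = Encoding.Unicode.GetBytes(text); // 16 bytes: Encoding.Unicode is UTF-16 little-endian
byte[] utf32Bytes = Encoding.UTF32.GetBytes(text);   // 28 bytes: 7 code points x 4 bytes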
Each encoding produces a different byte array. UTF-8 is compact for ASCII but expands for emojis and non-Latin scripts. UTF-16 balances size and compatibility. UTF-32 is predictable but large.
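The size trade-off is easier to see with numbers. The sketch below compares byte counts for an ASCII string and a Japanese string; the sample strings are arbitrary choices made for this illustration.

```csharp
using System;
using System.Text;

string[] samples = { "hello", "こんにちは" };
foreach (string sample in samples)
{
    // GetByteCount reports the encoded size without allocating the byte array.
    Console.WriteLine(
        $"{sample}: UTF-8={Encoding.UTF8.GetByteCount(sample)}, " +
        $"UTF-16={Encoding.Unicode.GetByteCount(sample)}, " +
        $"UTF-32={Encoding.UTF32.GetByteCount(sample)}");
}
// hello: UTF-8=5, UTF-16=10, UTF-32=20
// こんにちは: UTF-8=15, UTF-16=10, UTF-32=20
```

For the ASCII sample UTF-8 is half the size of UTF-16, while for the Japanese sample it is 50% larger, since each of those characters takes three bytes in UTF-8.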
Encoding and Decoding Strings
You can convert strings to bytes and back using encoding classes. This is essential when reading or writing files, sending data over networks, or interfacing with external systems.
string original = "Café";
byte[] bytes = Encoding.UTF8.GetBytes(original);
string decoded = Encoding.UTF8.GetString(bytes);
Console.WriteLine(decoded); // Output: Café
Always use the same encoding for both encoding and decoding. Mismatched encodings can corrupt data or produce unreadable characters.
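Here is a small illustration of what a mismatch looks like in practice. It encodes with UTF-8 and then deliberately decodes with the wrong encoding; the choice of Latin-1 and ASCII as the "wrong" encodings is just for demonstration.

```csharp
using System;
using System.Text;

byte[] utf8Bytes = Encoding.UTF8.GetBytes("Café");

// Wrong decoder: each byte of the two-byte "é" (C3 A9) becomes its own character.
Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
Console.WriteLine(latin1.GetString(utf8Bytes));         // CafÃ©

// ASCII cannot represent bytes above 0x7F, so each one is replaced with '?'.
Console.WriteLine(Encoding.ASCII.GetString(utf8Bytes)); // Caf??
```

Note that neither call throws. The data is silently corrupted, which is why these bugs often surface far from their cause.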
Common Pitfalls and How to Avoid Them
Encoding issues often arise when reading files or handling user input. Here are some common mistakes:
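- Assuming default encoding: StreamReader and File.ReadAllText default to UTF-8, but Encoding.Default on .NET Framework is the system's ANSI code page, which varies by OS and locale, and files produced by other tools may use a different encoding entirely.
- Ignoring BOM (Byte Order Mark): UTF-8 and UTF-16 files may begin with a BOM; in UTF-16 it indicates byte order. StreamReader and File.ReadAllText detect it automatically, but manual byte-level parsing can fail or leave a stray U+FEFF at the start of the text.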
- Mixing encodings: Reading UTF-8 data as ASCII or Latin-1 can produce garbled text.
Always specify encoding explicitly when reading or writing files. For example, use Encoding.UTF8 with StreamReader or File.ReadAllText.
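Specifying the encoding does not by itself detect bad input: Encoding.UTF8 replaces invalid bytes with U+FFFD instead of failing. If you would rather have mismatches fail loudly, one option is a strict UTF8Encoding instance, sketched below. The sample bytes are an assumption made for the demonstration.

```csharp
using System;
using System.Text;

// A strict UTF-8 encoding: no BOM on output, exception on invalid input.
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

byte[] latin1Bytes = { 0x43, 0x61, 0x66, 0xE9 }; // "Café" encoded as Latin-1, not UTF-8

bool rejected = false;
try
{
    Console.WriteLine(strictUtf8.GetString(latin1Bytes));
}
catch (DecoderFallbackException ex)
{
    rejected = true;
    Console.WriteLine($"Not valid UTF-8: {ex.Message}");
}

// The shared Encoding.UTF8 instance substitutes U+FFFD for the invalid byte instead.
string lenient = Encoding.UTF8.GetString(latin1Bytes);
Console.WriteLine(lenient.IndexOf('\uFFFD') >= 0); // True
```

The same strict instance can be passed to StreamReader or File.ReadAllText in place of Encoding.UTF8.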
Working with Files and Encoding
When reading or writing text files, encoding matters. Here’s how to handle it correctly:
// Writing a UTF-8 file
File.WriteAllText("data.txt", "Résumé", Encoding.UTF8);
// Reading the same file
string content = File.ReadAllText("data.txt", Encoding.UTF8);
This ensures consistent encoding and avoids surprises when sharing files across systems.
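One detail worth knowing is that Encoding.UTF8 writes a byte order mark at the start of the file, which some tools on other systems do not expect. The sketch below shows the difference; the temporary file name is hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

string path = Path.Combine(Path.GetTempPath(), "bom-demo.txt"); // hypothetical file name

// Encoding.UTF8 emits a 3-byte BOM (EF BB BF) before the text.
File.WriteAllText(path, "Résumé", Encoding.UTF8);
int withBom = File.ReadAllBytes(path).Length;
Console.WriteLine(withBom);    // 11 = 3 (BOM) + 8 (text)

// A UTF8Encoding constructed with 'false' writes the text without a BOM.
File.WriteAllText(path, "Résumé", new UTF8Encoding(false));
int withoutBom = File.ReadAllBytes(path).Length;
Console.WriteLine(withoutBom); // 8

// Reading back works either way: a BOM is detected if present.
string content = File.ReadAllText(path, Encoding.UTF8);
Console.WriteLine(content);    // Résumé

File.Delete(path);
```

Whether to write a BOM depends on who consumes the file; the point is to make the choice deliberately rather than by accident.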
Normalization and Canonical Forms
Unicode allows multiple representations of the same character. For example, 'é' can be a single code point (U+00E9) or a combination of 'e' and an accent (U+0065 + U+0301). These are visually identical but byte-wise different.
string composed = "\u00E9"; // é
string decomposed = "\u0065\u0301"; // e + combining acute accent (U+0301)
Console.WriteLine(composed == decomposed); // False
To compare or store text consistently, normalize it with string.Normalize, passing a System.Text.NormalizationForm value.
string norm1 = composed.Normalize(NormalizationForm.FormC);
string norm2 = decomposed.Normalize(NormalizationForm.FormC);
Console.WriteLine(norm1 == norm2); // True
Use FormC (composed) for storage and comparison. FormD (decomposed) is useful for linguistic analysis.
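As one example of what FormD makes possible, the sketch below strips accents by decomposing the text and dropping the combining marks. RemoveDiacritics is a hypothetical helper written for this illustration, and treating every non-spacing mark as removable is a simplification that will not suit every language.

```csharp
using System;
using System.Globalization;
using System.Text;

Console.WriteLine(RemoveDiacritics("Résumé")); // Resume

// Hypothetical helper: decompose, drop combining marks, recompose.
static string RemoveDiacritics(string text)
{
    string decomposed = text.Normalize(NormalizationForm.FormD);
    var builder = new StringBuilder(decomposed.Length);
    foreach (char c in decomposed)
    {
        // Combining accents are classified as NonSpacingMark; skip them.
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            builder.Append(c);
    }
    return builder.ToString().Normalize(NormalizationForm.FormC);
}
```

This kind of folding is handy for building search keys, but the original text should still be stored unchanged.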
Encoding in Web and APIs
Web applications often deal with encoding in HTTP headers, HTML meta tags, and JSON payloads. UTF-8 is the standard for web content. When sending or receiving data, ensure encoding is declared and respected.
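// Setting encoding in an HTTP response
// (response is assumed to be a System.Web HttpResponse; ASP.NET Core's HttpResponse has no ContentEncoding property)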
response.ContentEncoding = Encoding.UTF8;
response.ContentType = "text/html; charset=utf-8";
For APIs, JSON libraries in .NET use UTF-8 by default. Avoid manual encoding unless necessary.
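For instance, System.Text.Json can serialize straight to UTF-8 bytes and read them back without any manual encoding step. The following is a minimal sketch that assumes System.Text.Json is available (it ships with .NET Core 3.0 and later).

```csharp
using System;
using System.Text.Json;

// Serialize directly to UTF-8 bytes; no intermediate UTF-16 string is needed.
byte[] utf8Json = JsonSerializer.SerializeToUtf8Bytes("Café");

// Deserialize directly from the UTF-8 bytes.
var roundTripped = JsonSerializer.Deserialize<string>(utf8Json);
Console.WriteLine(roundTripped == "Café"); // True
```

The byte array can be written to a response body or a file as-is, provided the receiver is told the content is UTF-8.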
Encoding and Console Applications
Console output may depend on system settings. To ensure correct display of Unicode characters, set the console encoding:
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine("Unicode test: 🌟");
Without this, some characters may appear as question marks or boxes.
Encoding and Security
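Improper encoding can lead to security issues like injection attacks or broken authentication. For example, two usernames that look identical but differ in normalization form can slip past a uniqueness check, and user input echoed into a page without escaping can be interpreted as markup. Always validate and sanitize input, especially when converting between encodings or displaying user-generated content.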
Use encoding-aware APIs and avoid manual string manipulation for security-sensitive operations.
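As an illustration of using an encoding-aware API instead of manual string replacement, the sketch below escapes untrusted text before it is placed into HTML. The input string is a made-up example, and WebUtility.HtmlEncode is only one of several APIs that could be used for this.

```csharp
using System;
using System.Net;

string userInput = "<script>alert(1)</script>"; // hypothetical untrusted input

// HtmlEncode escapes markup characters so a browser renders them as text.
string safe = WebUtility.HtmlEncode(userInput);
Console.WriteLine(safe); // &lt;script&gt;alert(1)&lt;/script&gt;
```

Escaping is specific to the output context: text that is safe inside HTML is not automatically safe inside a URL, a SQL statement, or a JavaScript string.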
Summary
Unicode and encoding are foundational to modern software. In C#, strings use UTF-16 internally, but you’ll often work with UTF-8 for files, web, and APIs. Understanding how to encode and decode text, normalize characters, and choose the right encoding ensures your applications handle text correctly across languages and platforms. Always specify encoding explicitly, normalize when comparing, and be mindful of system defaults. With these practices, your code will be robust, international-ready, and free from subtle text bugs.