Normalize Unicode Text
Fix broken or inconsistent Unicode characters
What Is the Normalize Unicode Tool?
The Normalize Unicode tool takes text with broken, inconsistent, or oddly encoded characters and converts it into a clean, standardized form. Unicode allows the same character to be represented in multiple ways, and when text gets copied across systems, exported from databases, or scraped from websites, those different representations can cause display issues, search mismatches, and processing errors that are not always obvious just by looking at the text. This tool applies Unicode normalization to collapse all of that into one consistent form.
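The "same character, multiple representations" problem is easy to demonstrate. This is a minimal sketch using Python's standard unicodedata module: two strings that render identically compare as unequal until one is normalized.

```python
import unicodedata

# Two visually identical strings with different underlying code points.
precomposed = "caf\u00e9"   # é as a single code point, U+00E9
decomposed = "cafe\u0301"   # e followed by combining acute accent, U+0301

print(precomposed == decomposed)   # False: different code-point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```

Both strings display as "café", which is exactly why the mismatch is hard to spot by eye.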
Developers and data people tend to use it most, though it also comes up for anyone who regularly works with text from mixed sources and keeps running into characters that look fine but behave strangely downstream.
How to Use This Tool
- Paste or type your text into the input box above.
- Review your text if needed. The problem characters may not be visually obvious, which is part of what makes Unicode inconsistency awkward to spot without a tool.
- Click Normalize Unicode and the tool processes the text, converting all characters to a consistent Unicode representation.
- Copy the result using the Copy button, or select and copy manually. Use Clear to reset the input.
When Would You Use This?
- Cleaning up text copied from PDFs, web pages, or documents, where characters like accented letters, quotation marks, or special symbols may have come through as multiple code points or in an unexpected encoding form.
- Preparing text before feeding it into a database, search index, or API that is strict about character encoding, where inconsistent Unicode representations can cause failed lookups or storage errors.
- Fixing text that displays correctly in one application but breaks in another, which often happens because the source and destination handle Unicode normalization forms differently and the text was never standardized on the way in.
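The database and search-index case usually comes down to normalizing once at the boundary. This is a hedged sketch, not a specific database API: `canonical_key` and the dict-based `index` are hypothetical stand-ins for whatever keying scheme your store uses.

```python
import unicodedata

def canonical_key(text: str) -> str:
    """Normalize to NFC so equivalent spellings map to the same key."""
    return unicodedata.normalize("NFC", text)

# Hypothetical lookup table, keyed by normalized strings.
index = {canonical_key("caf\u00e9"): 42}

# A decomposed spelling arriving from another source still finds the entry.
print(index[canonical_key("cafe\u0301")])   # 42
```

The same idea applies whether the "index" is a dict, a SQL unique column, or a search engine: normalize on write and on read, and equivalent spellings collide the way you want.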
Examples
Accented character as two code points normalized to one
Input: cafe\u0301 (the e followed by a combining acute accent, U+0301, as two separate code points)
Output (NFC): café (the é stored as a single precomposed code point U+00E9)
Compatibility normalization of a ligature
Input: ﬁle (using the fi ligature character U+FB01)
Output (NFKC): file (decomposed to standard f and i characters)
Full-width characters normalized to standard width
Input: Ａｐｐｌｅ (full-width Latin letters, U+FF21 onward)
Output (NFKC): Apple (standard ASCII Latin letters)
Smart quotes and special punctuation
Input: “hello” (left and right double quotation marks, U+201C and U+201D)
Output (NFC): “hello” (unchanged; curly quotes are already single code points with no canonical decomposition, so NFC leaves them as-is)
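The examples above can be reproduced directly with Python's unicodedata module; this sketch checks each transformation, including the cases a given form leaves alone.

```python
import unicodedata

# Combining accent composed into one code point (NFC).
assert unicodedata.normalize("NFC", "cafe\u0301") == "caf\u00e9"

# fi ligature U+FB01 becomes plain "fi" under NFKC only; NFC keeps it.
assert unicodedata.normalize("NFKC", "\ufb01le") == "file"
assert unicodedata.normalize("NFC", "\ufb01le") == "\ufb01le"

# Full-width letters (U+FF21...) map to ASCII under NFKC.
assert unicodedata.normalize("NFKC", "\uff21\uff50\uff50\uff4c\uff45") == "Apple"

# Curly quotes have no decomposition, so every form leaves them unchanged.
assert unicodedata.normalize("NFKC", "\u201chello\u201d") == "\u201chello\u201d"

print("all examples verified")
```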
Frequently Asked Questions
What is Unicode normalization?
Unicode normalization is the process of converting text into a standard form so that equivalent characters are represented the same way. The same visual character can exist as a single code point or as a combination of code points, and normalization collapses those different representations into one consistent form.
What are the Unicode normalization forms?
There are four: NFC, NFD, NFKC, and NFKD. NFC composes characters into the shortest form using precomposed characters. NFD decomposes them into base characters plus combining marks. NFKC and NFKD do the same but also apply compatibility mappings that convert things like ligatures and width variants into their standard equivalents. NFC is the most commonly used for web and general text work.
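One way to see how the four forms differ is to run the same string through each of them; this sketch uses a string containing both a ligature and an accented letter.

```python
import unicodedata

s = "\ufb01ne caf\u00e9"   # fi-ligature + precomposed é

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    out = unicodedata.normalize(form, s)
    print(form, [hex(ord(c)) for c in out])

# NFC and NFD keep the ligature; NFKC and NFKD replace it with "fi".
# NFC and NFKC keep é as one code point; NFD and NFKD split it into
# e plus the combining accent U+0301.
```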
Why does my text look correct but cause errors in my database or search?
This is often a Unicode normalization mismatch. Two strings that look identical on screen can be stored as different byte sequences if one uses a precomposed character and the other uses a base character plus a combining mark. Most search and comparison operations will treat them as different strings unless both are normalized to the same form first.
What is NFC normalization?
NFC is Normalization Form C, defined as canonical decomposition followed by canonical composition. It is the standard normalization form recommended for most text on the web. It converts characters to their precomposed form, so an accented letter like é is stored as one code point rather than two.
How do I fix Unicode characters in Python?
Use the unicodedata module: unicodedata.normalize('NFC', your_string). Replace 'NFC' with 'NFD', 'NFKC', or 'NFKD' depending on which form you need. For a quick fix outside of code, paste the text into this tool instead.
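Wrapped as a small helper, the one-liner above looks like this; the function name `fix_unicode` is just illustrative.

```python
import unicodedata

def fix_unicode(text: str, form: str = "NFC") -> str:
    """Normalize text; form is one of NFC, NFD, NFKC, NFKD."""
    return unicodedata.normalize(form, text)

decomposed = "cafe\u0301"            # 5 code points: e + combining accent
print(fix_unicode(decomposed))       # renders as café
print(len(decomposed), len(fix_unicode(decomposed)))   # 5 4
```

The length change is a handy sanity check: the visible text is identical, but NFC has merged the accent into a single code point.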
Why does text copied from a PDF have weird characters?
PDF text extraction often produces inconsistent Unicode because PDFs store characters using internal encodings that do not always map cleanly to standard Unicode. Ligatures, special quotes, and accented characters are common offenders. Normalizing the text after extraction helps fix a lot of those issues.
What is the difference between Unicode normalization and encoding?
Encoding is how text is stored as bytes, such as UTF-8 or UTF-16. Normalization is about how characters within Unicode are represented, since Unicode itself allows multiple representations for the same glyph. They are related but separate issues.
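The distinction is easy to see in code: encoding decides the bytes, normalization decides the code points, and the two choices are independent.

```python
import unicodedata

text = "caf\u00e9"   # NFC-normalized string

# Encoding: the same code points as different byte sequences.
print(text.encode("utf-8"))      # b'caf\xc3\xa9'
print(text.encode("utf-16-le"))  # different bytes, same code points

# Normalization: different code points for the same visible text,
# regardless of which byte encoding you later pick.
print(unicodedata.normalize("NFD", text).encode("utf-8"))   # b'cafe\xcc\x81'
```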
How do I remove invisible Unicode characters from text?
Normalization does not always remove invisible characters on its own. Characters like zero-width spaces, non-breaking spaces, and other control characters may need to be stripped separately. Normalization handles the representation of visible characters rather than removing hidden ones.
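When you do need to strip invisible characters, one possible policy is to drop Unicode format characters (general category Cf, which includes zero-width spaces and joiners such as U+200B and U+200D) and replace non-breaking spaces with plain ones. This is a sketch of that policy, not the only reasonable one:

```python
import unicodedata

def strip_invisible(text: str) -> str:
    """Remove invisible characters that normalization leaves behind."""
    cleaned = text.replace("\u00a0", " ")   # NBSP -> regular space
    # Drop format characters (category Cf): zero-width spaces, joiners, etc.
    return "".join(c for c in cleaned if unicodedata.category(c) != "Cf")

print(strip_invisible("hel\u200blo\u00a0world"))   # hello world
```

Depending on your data, you may also want to handle other whitespace variants (category Zs) or control characters (category Cc) explicitly.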
Does normalizing Unicode change how the text looks?
Usually not visibly. The point is to standardize the underlying representation without changing what you actually see. In rare cases involving compatibility normalization forms like NFKC, some ligatures or special glyphs may be converted to their plain equivalents, which can look slightly different.