
Code Unit

A code unit is the basic component used by a character encoding system (such as UTF-8 or UTF-16). A character encoding system uses one or more code units to encode a Unicode code point.


In UTF-16 (the encoding system used for JavaScript strings), code units are 16-bit values. This means that operations such as indexing into a string or getting the length of a string operate on these 16-bit units. These units do not always map one-to-one onto what we might consider characters.

For example, characters with diacritics such as accents can sometimes be represented using two Unicode code points:

js
const myString = "\u006E\u0303";
console.log(myString); // ñ
console.log(myString.length); // 2
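As a rough illustration of the point above, the following sketch compares the two-code-point form of "ñ" with its single-code-point (precomposed) form, U+00F1. The variable names are chosen here for illustration only, and the use of `normalize("NFC")` is one possible way to bring the two forms together, not something the original example relies on.

```javascript
// Decomposed form: "n" (U+006E) followed by a combining tilde (U+0303).
const decomposed = "\u006E\u0303";
// Precomposed form: the single code point U+00F1.
const composed = "\u00F1";

console.log(decomposed.length); // 2
console.log(composed.length); // 1
console.log(decomposed === composed); // false

// Indexing returns individual code units, not whole characters.
console.log(decomposed[0]); // n
console.log(decomposed.charCodeAt(1).toString(16)); // 303

// NFC normalization composes the sequence into the single code point.
const normalized = decomposed.normalize("NFC");
console.log(normalized.length); // 1
console.log(normalized === composed); // true
```

Both strings render the same way, but they compare as different until they are normalized to the same form.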

Also, since not all of the code points defined by Unicode fit into 16 bits, many Unicode code points are encoded as a pair of UTF-16 code units, which is called a surrogate pair:

js
const face = "🥵";
console.log(face.length); // 2
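To make the pairing more concrete, here is a minimal sketch that inspects the two code units of the same emoji (U+1F975) and rebuilds them with the standard UTF-16 arithmetic. The helper `toSurrogatePair` is a hypothetical name introduced for this example, and it assumes its input is a code point above U+FFFF.

```javascript
const face = "🥵"; // U+1F975

// Each index holds one 16-bit code unit of the surrogate pair.
const high = face.charCodeAt(0);
const low = face.charCodeAt(1);
console.log(high.toString(16)); // d83e
console.log(low.toString(16)); // dd75

// Hypothetical helper: splits a code point above U+FFFF into its
// high and low surrogate code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000; // 20-bit value
  const highSurrogate = 0xd800 + (offset >> 10); // top 10 bits
  const lowSurrogate = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [highSurrogate, lowSurrogate];
}

const [h, l] = toSurrogatePair(0x1f975);
console.log(h === high && l === low); // true
console.log(String.fromCharCode(h, l) === face); // true
```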

The codePointAt() method of the JavaScript String object enables you to retrieve the Unicode code point from its encoded form:

js
const face = "🥵";
console.log(face.codePointAt(0)); // 129397
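Building on `codePointAt()`, the sketch below shows one way to work with a string by code point rather than by code unit. It relies on string iteration (`for...of`, `Array.from`), which steps through code points. The sample string `"a🥵b"` and the variable names are assumptions made for this example.

```javascript
const text = "a🥵b";

// length counts code units: 1 + 2 (surrogate pair) + 1.
console.log(text.length); // 4

// Iterating a string yields one entry per code point.
const codePoints = Array.from(text, (ch) => ch.codePointAt(0));
console.log(codePoints.length); // 3
console.log(codePoints); // [ 97, 129397, 98 ]

// codePointAt() takes a code unit index. Index 2 lands on the low
// surrogate, so only that lone code unit is returned.
console.log(text.codePointAt(1)); // 129397
console.log(text.codePointAt(2)); // 56693

// String.fromCodePoint() performs the reverse conversion.
console.log(String.fromCodePoint(129397) === "🥵"); // true
```

Note that the index passed to `codePointAt()` is still a code unit position, so iterating the string is usually the simpler way to visit each code point exactly once.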


Updated on April 20, 2024 by Datarist.