• sarmale@lemmy.zip
    8 upvotes · 1 year ago

    How many unicode characters could you add to the standard until it becomes unreliable?

    • Kerb@discuss.tchncs.de
      28 upvotes · edited · 1 year ago

      Apparently Unicode supports about 1.1 million characters, and we only used 96,382 of them as of version 4.0.

      EDIT: I just read that Unicode 4.0 is very outdated; the current version is Unicode 15.1, with 149,878 characters.

    • A Unicode character can be up to 4 bytes, so 2^32 or 4,294,967,296 potential unique characters. And it’d be easy enough to adjust the standard to allow for an extra byte(s) if necessary – it’s been done before.

      • Turun@feddit.de
        4 upvotes · edited · 1 year ago

        This is incorrect. While a character (actually a code point) takes 4 bytes in UTF-32 and up to 4 bytes in UTF-8, the Unicode standard is limited to 17*2^16 = 1,114,112 code points. (edit: apparently because that is the limit of UTF-16. Four-byte UTF-8 has room for 2^21 code points, and the original UTF-8 design allowed sequences of up to six bytes, enough for 2^31 code points; RFC 3629 later restricted UTF-8 to four bytes and to the same range as UTF-16.)
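
The numbers in this comment can be checked with a little arithmetic. The following is a minimal Python sketch, not anything taken from the Unicode standard itself; the names `unicode_code_space` and `utf8_payload_bits` are made up for illustration, and the bit counts assume the usual UTF-8 byte layout (lead byte `110xxxxx`, `1110xxxx`, and so on, followed by `10xxxxxx` continuation bytes).

```python
# Size of the Unicode code space: 17 planes of 2^16 code points each.
PLANES = 17
CODE_POINTS_PER_PLANE = 2 ** 16
unicode_code_space = PLANES * CODE_POINTS_PER_PLANE  # 1,114,112
max_code_point = unicode_code_space - 1              # 0x10FFFF

# UTF-16 reach: 2^16 single-unit values plus 2^10 * 2^10 surrogate pairs.
utf16_reach = 2 ** 16 + (2 ** 10) * (2 ** 10)


def utf8_payload_bits(n_bytes: int) -> int:
    """Payload bits available in a UTF-8 style sequence of n_bytes bytes.

    Hypothetical helper: covers the original 1..6 byte design,
    although modern UTF-8 (RFC 3629) stops at 4 bytes.
    """
    if not 1 <= n_bytes <= 6:
        raise ValueError("n_bytes must be between 1 and 6")
    if n_bytes == 1:
        return 7  # 0xxxxxxx
    # The lead byte carries (7 - n_bytes) bits, each continuation byte 6 bits.
    return (7 - n_bytes) + 6 * (n_bytes - 1)


if __name__ == "__main__":
    print(f"Unicode code space: {unicode_code_space:,} (max U+{max_code_point:X})")
    print(f"UTF-16 reach:       {utf16_reach:,}")
    print(f"4-byte UTF-8 room:  2^{utf8_payload_bits(4)}")
    print(f"6-byte UTF-8 room:  2^{utf8_payload_bits(6)}")
```

Python enforces the same ceiling: `chr(0x10FFFF)` works, while `chr(0x110000)` raises `ValueError`.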

        Unicode is the standard that says “the thing we call capital A is number 65”, literally defining a mapping from numbers to concepts.
        UTF-8 and UTF-32 are ways to encode a list of those numbers in a more (UTF-8) or less (UTF-32) space-efficient way.
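
To make the split between code points and encodings concrete, here is a small Python sketch. It assumes nothing beyond the standard library; `encoded_sizes` is a hypothetical helper name, and the big-endian codec variants are used only so that no byte order mark is counted.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character takes in each encoding.

    Hypothetical helper for illustration. The "-be" codecs are used
    so that Python does not prepend a byte order mark.
    """
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


if __name__ == "__main__":
    # The same abstract number (code point) for each character,
    # stored in a different number of bytes depending on the encoding.
    for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
        print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The code point of "A" is 65 (`ord("A")`) in every case; only the byte representation changes. UTF-32 always spends 4 bytes, while UTF-8 spends 1 byte on "A" and 4 bytes on the emoji.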