About character sets

I looped through 512 characters and I realized that my character set has 256 characters, so it is not 7-bit ASCII. Instead, it is some variant.

My question is: Is there any way to know what character set is the default for your machine?

Also, I remember programming (before 2013), when I wrote cout << char(1); it gave me a smiley character and char(2) gave an angry character (I was doing some console games; those were the hero and the enemy respectively), but not anymore. Now I get a box with a question mark inside. Do you have any idea why this happened? Where is my hero?

Note: back then I used XP; now I am using Windows 10.
An unsigned char is 8 bits. That's 0 - 255. Not sure why you would loop through 0-511 (9 bits). You can find the 8-bit extended ASCII table here:
https://www.asciitable.com/

In order to represent characters used in foreign (non-English) languages, the concept of code pages was introduced.
https://en.wikipedia.org/wiki/Windows_code_page

The smiley face with an 8-bit value of 1 was a member of code page 437.
https://en.wikipedia.org/wiki/Code_page_437
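For what it's worth, the CP437 smiley at value 1 corresponds to the Unicode character U+263A (WHITE SMILING FACE), so on a console configured for UTF-8 you can usually get it back by printing that code point instead of char(1). A minimal sketch, assuming the compiler uses a UTF-8 execution character set and the console font has the glyph:

#include <iostream>

int main()
{
    // "\u263A" produces the UTF-8 bytes for U+263A (the old CP437 smiley).
    // On Windows you may first need the console in UTF-8 mode,
    // e.g. SetConsoleOutputCP(CP_UTF8) from <windows.h>.
    std::cout << "\u263A\n";
}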


My Cygwin compiler still produces smiley faces for those. If it's just a console program, maybe another compiler will run the code without hacking on it. Otherwise you may have to use Unicode or change the settings to get it back.
An unsigned char is 8 bits. That's 0 - 255. Not sure why you would loop through 0-511 (9 bits)
I thought since char is at least 1 byte, it may hold more than 255 values, don't you think so?
ninja01 wrote:
Is there any way to know what character set is default for your machine?

I don't think there is a standard way to find out. Outside of Windows it's typically UTF-8 nowadays. Technically UTF-8 is the character encoding while Unicode is the character set.

ninja01 wrote:
I thought since char is at least 1 byte, it may hold more than 255 values, don't you think so?

A char is typically 8 bits and 8 bits can store at most 256 different values (0-255).

It's true that the standard allows char to be larger than 8 bits but good luck finding a computer architecture/compiler where that is the case.

ninja01 wrote:
I looped through 512 characters and I realized that my character set has 256 characters, so it is not 7-bit ASCII. Instead, it is some variant.

Note that some character encodings (such as UTF-8) use multiple bytes to encode a single character.

a uses 1 byte, ä uses 2 bytes and 👍 uses 4 bytes
https://onlinetoolz.net/unicode#c=a%C3%A4%F0%9F%91%8D&e=utf8&b=16
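A quick way to see those byte counts for yourself, assuming both the source file and the execution character set are UTF-8:

#include <iostream>
#include <string>

int main()
{
	// std::string::size() counts bytes, not characters.
	std::cout << std::string("a").size() << '\n';   // 1
	std::cout << std::string("ä").size() << '\n';   // 2
	std::cout << std::string("👍").size() << '\n';  // 4
}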
This UTF-8 is really a pain, I still cannot figure it out.

Outside of Windows it's typically UTF-8 nowadays.
Also, on Windows you can check that using GetConsoleOutputCP(). Mine gives me 65001, which is the identifier for UTF-8.
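Something along these lines (Windows only; GetConsoleOutputCP comes from <windows.h>):

#include <windows.h>
#include <iostream>

int main()
{
    // Returns the code page used by the console for output,
    // e.g. 437, 1252, or 65001 for UTF-8.
    UINT cp = GetConsoleOutputCP();
    std::cout << "Console output code page: " << cp << '\n';
}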

ninja01 wrote:
I thought since char is at least 1 byte, it may hold more than 255 values, don't you think so?

1 byte is typically 8 bits and 8 bits can store at most 256 different values (0-255).

I thought we could not know what char's size is.

ninja01 wrote:
This UTF-8 is really a pain, I still cannot figure it out.

I think it's quite nice actually.

On Linux the following code just works* without me having to do anything special.

 
std::cout << "aä👍\n";

* Assuming the font used supports these characters.

UTF-8 has become the most widely used encoding in recent years. Even this cplusplus.com HTML page is encoded in UTF-8.

On Windows I can imagine it being a bit painful, at least for "console applications". I have had no problems porting games that used SDL from Linux to Windows but that was because I read all text from UTF-8 encoded text files and used SDL functions with UTF-8 support to draw the text.

Note that I'm not necessarily talking about using char8_t/std::u8string. Those might become useful in the future but for the moment I'm ignoring their existence.

ninja01 wrote:
I thought we could not know what char's size is.

You can use CHAR_BIT.

#include <iostream>
#include <climits>

int main()
{
	std::cout << "A char is " << CHAR_BIT << " bits.\n";
}

char is 8 bits on all mainstream computers. You'll need to make an effort to find a computer architecture where that is not the case. If you find something it will not be your ordinary mobile/laptop/desktop computer. It will either be something very old or something very specialized.

Note that char is the smallest possible size that an object can be. The size of every object is a multiple of the size of char. If char were larger than 8 bits it would mean there are no 8-bit integers available. This would make it incompatible with a lot of existing things, or at least less efficient when it comes to handling 8-bit formats, so it would probably be hard to advertise such a product.

This might seem strange if you just think of char as a type to store characters, but despite its name it's more than that. Historically it has essentially been the "byte" type in C and C++. There are even special rules that allow you to use char pointers to inspect the underlying data of other data types, something that is normally not allowed, and this obviously has very little to do with characters/text.
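Purely as an illustration of that rule, here is a small sketch that reads the bytes of an int through an unsigned char pointer (the exact byte values you see depend on the machine's endianness):

#include <cstddef>
#include <iomanip>
#include <iostream>

int main()
{
	int value = 0x12345678;
	// The special rule: char/unsigned char pointers may be used to
	// inspect the underlying bytes of any object.
	const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&value);
	for (std::size_t i = 0; i < sizeof value; ++i)
		std::cout << std::hex << std::setw(2) << std::setfill('0')
		          << static_cast<int>(bytes[i]) << ' ';
	std::cout << '\n'; // e.g. "78 56 34 12" on a little-endian machine
}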


So I don't think you should feel bad if you assume char is 8 bits when writing a program. A lot of software does that in one way or another.

If you really feel like making it explicit you could use static_assert to test and document your assumption.

#include <climits>
static_assert(CHAR_BIT == 8);

This will give you a compilation error if char is not 8 bits.

Note that all people that use std::int8_t or std::uint8_t implicitly make this assumption.
I think I understand everything you just said, simply because you simplified it so much, and I am thankful for that.

Here is why UTF-8 gives me so much pain:

I think of Unicode as "all the characters used by humans".
From this link https://en.wikipedia.org/wiki/Unicode, I quote:
The standard, which is maintained by the Unicode Consortium, defines as of the current version (15.0) 149,186 characters

So it is clear that Unicode has a total of 149,186 characters.

I think of UTF-8 as a way to represent those 149,186 characters numerically, so that they can be stored in memory or put to other uses.
From this link https://en.wikipedia.org/wiki/UTF-8, I quote:
UTF-8 is capable of encoding all 1,112,064 valid character code points


From 149,186 characters to 1,112,064!
Wasn't UTF-8 supposed to be a character encoding for those 149,186 characters, or is it a character encoding for something other than Unicode? Or does Unicode not have 149,186 characters, but more?

========

In the link you just gave me https://onlinetoolz.net/unicode#c=a%C3%A4%F0%9F%91%8D&e=utf8&b=16 , I quote:
The characters are encoded as a series of 1-4 bytes

If Unicode even had 1,112,064 characters, we would need only 3 bytes to encode not just 1,112,064 characters but 16,777,216 characters, so why the need for 4 bytes?
ninja01 wrote:
From 149,186 characters to 1,112,064!

There are 1 112 064 valid character code points (i.e. "Unicode numbers").
Only 149 186 of those have been assigned characters.
137 468 are reserved for "private use".
Etc.

In total there are 825 279 reserved code points that have not yet been assigned a meaning but that might change in the future (Unicode 15.0 that was released earlier this month added 4489 characters).

https://www.unicode.org/versions/stats/charcountv15_0.html

ninja01 wrote:
If Unicode even had 1,112,064 characters, we would need only 3 bytes to encode not just 1,112,064 characters but 16,777,216 characters, so why the need for 4 bytes?

UTF-8 wastes some bits so that you can look at a byte and decide if it's the first byte of a code point, and if so how many bytes are used to represent this code point.

If you are interested in the details you should look at the Encoding section on Wikipedia.
https://en.wikipedia.org/wiki/UTF-8#Encoding
The table is very good. Each x means one bit that is used to represent the code point. The 1s and 0s you see are overhead. If you count the number of x on the third row you see that 3 bytes only give you 16 bits to encode the code point.
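If you want to see this in code, here is a small sketch, assuming the string literal ends up encoded as UTF-8. The helper function is just made up for this example; it is not a standard one:

#include <iostream>
#include <string>

// How many bytes does a UTF-8 sequence that starts with this byte use?
// Returns 0 for a continuation byte of the form 10xxxxxx.
int utf8_sequence_length(unsigned char first)
{
	if (first < 0x80)           return 1; // 0xxxxxxx :  7 payload bits
	if ((first & 0xE0) == 0xC0) return 2; // 110xxxxx : 11 payload bits
	if ((first & 0xF0) == 0xE0) return 3; // 1110xxxx : 16 payload bits
	if ((first & 0xF8) == 0xF0) return 4; // 11110xxx : 21 payload bits
	return 0;                             // 10xxxxxx : continuation byte
}

int main()
{
	const std::string s = "aä👍"; // 1 + 2 + 4 = 7 bytes in UTF-8
	for (unsigned char c : s)
		std::cout << utf8_sequence_length(c) << ' ';
	std::cout << '\n'; // prints: 1 2 0 4 0 0 0
}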
Thank you @Peter, great explanation. After a couple of sources here and there, I fully understand the damn UTF-8. I've got to say, what a genius way to encode the Unicode set, but that 8 and 16 were misleading.

Note that some character encodings (such as UTF-8) use multiple bytes to encode a single character.


I guess UTF-8 will be used only if I use char8_t, UTF-16 when I use char16_t, and UTF-32 when I use char32_t, so as long as I use char ch, I do not need to worry about UTF-x. Am I right?

With this line
std::cout << "aä👍\n";
Did you copy-paste the hand, or write it using your keyboard?

About that smiley character, I found out a couple of things:
The version that prints those control codes was present on the original IBM PC, based on characters used in Wang word processing machines. This extended ASCII is also referred to as ALT codes because in Windows, if you hold down the ALT key and type the character's ASCII code on the numpad, the character is output.

Now the crazy part is that when I press ALT+1 I see the smiley character, but when I run cout << char(1); I do not see it, lol.
> so as long as I use char ch, I do not need to worry about UTF-x. Am I right?

The text may be encoded as UTF-8 (for instance it may have been read from an HTML page which has <meta charset="UTF-8">).

#include <iostream>
#include <iomanip>
#include <string>

int main()
{
    //                       Ø     §     µ
    const char cstr[] = "\u00D8\u00A7\u00B5" ; // three utf-8 characters plus a null character
    std::cout << sizeof(cstr) << '\n' ; // 7 ( 3*2 + 1 : three two-byte characters plus the null character)

    const std::string str = cstr ;
    std::cout << str.size() << '\n' ; // 6 (three two-byte characters)

    std::cout << std::quoted(cstr) << '\n'   // "Ø§µ"
              << std::quoted(str) << '\n' ;  // "Ø§µ"
}

https://coliru.stacked-crooked.com/a/fc31823eb8e16de6
ninja01 wrote:
Did you copy-past the hand, or write it using your keyboard?

I copy-pasted it and the file was saved as UTF-8 (i.e. the default encoding).