Summary: | assignments to nsSMState in nsCodingStateMachine result in unspecified behavior | ||
---|---|---|---|
Product: | uchardet | Reporter: | James Cowgill <jcowgill+freedesktop> |
Component: | general | Assignee: | Jehan <jehan> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | normal | ||
Priority: | medium | CC: | uchardet |
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: |
Description
James Cowgill
2017-05-13 15:56:22 UTC
> This may be a little pedantic No problem about being pedantic when code is involved. ;-) > However, the C++11 standard Actually I don't know if the original authors were targeting a specific version of C++. If we were to do so, would C++11 be the recommended version? > In the case of nsSMState, the valid values are 0, 1, 2, and 3. Aren't the valid values even only 0, 1 and 2? > Assigning any other value results in unspecified behavior. Makes complete sense. I'll have a look at this. Ok I've had a quick look. I guess the right solution is that we will have to update the enum of states with other possible states returned by the state machine. At the very least, it feels obvious that there should be some kind of "eProcessing" state (or some other name), for cases when we are not sure yet. And maybe there would be other intermediary states possible. But for that I will have to take some time some day to read more closely this piece of code to make sure I understand how it works. To this day, I now understand quite well the whole part for detection of single byte character sets, but I have never read yet the rest. I'll have to do so some day anyway if I want to improve all the CJK detection for instance. (In reply to Jehan from comment #1) > > However, the C++11 standard > > Actually I don't know if the original authors were targeting a specific > version of C++. If we were to do so, would C++11 be the recommended version? I haven't checked C++98 to see if the same rules apply (I expect they do), but since gcc-6 now uses C++11 by default, most people are going to be compiling this in C++11 mode anyway. > > In the case of nsSMState, the valid values are 0, 1, 2, and 3. > > Aren't the valid values even only 0, 1 and 2? This is what I was meaning when I said there was a complicated definition :) For enums with only positive values, the valid range is: [0, 2^m) where m is the smallest integer such that 2^m > lagest enum value. This allows some compiler optimizations like only having to compare the lower 8 bits if there are < 256 enum values. (In reply to James Cowgill from comment #3) > (In reply to Jehan from comment #1) > > > However, the C++11 standard > > > > Actually I don't know if the original authors were targeting a specific > > version of C++. If we were to do so, would C++11 be the recommended version? > > I haven't checked C++98 to see if the same rules apply (I expect they do), > but since gcc-6 now uses C++11 by default, most people are going to be > compiling this in C++11 mode anyway. I see. Let's go for C++11 then. I'll check if there is not a cmake rule to enforce it. If anyone has an alternative proposition, go ahead and tell me. I am not a C++ expert and am open to use other versions of C++ if someone who knows better tells me it makes sense. > > > In the case of nsSMState, the valid values are 0, 1, 2, and 3. > > > > Aren't the valid values even only 0, 1 and 2? > > This is what I was meaning when I said there was a complicated definition :) > For enums with only positive values, the valid range is: [0, 2^m) where m is > the smallest integer such that 2^m > lagest enum value. This allows some > compiler optimizations like only having to compare the lower 8 bits if there > are < 256 enum values. Oh I see. Well for the compiler, this will be the valid values. In a semantic point of view though, I will stick to the defined enum constants only. For this, I will need to read and understand the code first. Ok so I finally read the code. In the end, we can't really add new states to the enum values because it can have an indefinite number of state depending on the charset we are detecting. All states apart from the 3 basic (start, error and found) are specific to a prober. As a conclusion, let's just replace nsSMState by an unsigned int. As a side task, I have also defined C++11 as the standard uchardet will follow. This is not set in stone, and if it makes sense to follow another version, this can be discussed. commit 53f7ad0e0b0894f11c948ee75f74595ab1b61405 Author: Jehan <jehan@girinstud.io> Date: Sun May 28 19:12:22 2017 +0200 Bug 101032 - assignments to nsSMState in nsCodingStateMachine result... ... in unspecified behavior. When compiling with UBSan (-fsanitize=undefined), execution complains: > runtime error: load of value 5, which is not a valid value for type 'nsSMState' Since the machine states depend on every different charset's state machine, it is not possible to simply extend the enum with more generic values. Instead let's just make the state as an unsigned int value and define the 3 generic states as constants. src/nsBig5Prober.cpp | 2 +- src/nsCodingStateMachine.h | 19 ++++++++++--------- src/nsEUCJPProber.cpp | 2 +- src/nsEUCKRProber.cpp | 2 +- src/nsEUCTWProber.cpp | 2 +- src/nsEscCharsetProber.cpp | 2 +- src/nsGB2312Prober.cpp | 2 +- src/nsSJISProber.cpp | 2 +- src/nsUTF8Prober.cpp | 2 +- 9 files changed, 18 insertions(+), 17 deletions(-) |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.