Bug 101032

Summary: assignments to nsSMState in nsCodingStateMachine result in unspecified behavior
Product: uchardet Reporter: James Cowgill <jcowgill+freedesktop>
Component: generalAssignee: Jehan <jehan>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium CC: uchardet
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:

Description James Cowgill 2017-05-13 15:56:22 UTC
This may be a little pedantic but I thought it would be worth mentioning anyway.

nsCodingStateMachine.h:42
> typedef enum {
>    eStart = 0,
>    eError = 1,
>    eItsMe = 2 
> } nsSMState;

nsCodingStateMachine.h:71
>     //from byte's class and stateTable, we get its next state
>     mCurrentState=(nsSMState)GETFROMPCK(mCurrentState*(mModel->classFactor)+byteCls,
>                                        mModel->stateTable);

As far as I can tell, nCurrentState is designed to contain values outside the original enumeration above. However, the C++11 standard does not permit casting integers to an enum type which are "out of range" (which has a slightly complicated definition). See: https://www.securecoding.cert.org/confluence/display/cplusplus/INT50-CPP.+Do+not+cast+to+an+out-of-range+enumeration+value

In the case of nsSMState, the valid values are 0, 1, 2, and 3. Assigning any other value results in unspecified behavior.

UBSan complains loudly about this:
>  20/116 Test  #20: es:iso-8859-1 ....................***Failed    0.01 sec
> /home/jcowgill/deb-pkg/uchardet/upstream/src/nsCodingStateMachine.h:75:12: runtime error: load of value 5, which is not a valid value for type 'nsSMState'
> /home/jcowgill/deb-pkg/uchardet/upstream/src/nsCodingStateMachine.h:66:9: runtime error: load of value 5, which is not a valid value for type 'nsSMState'
> /home/jcowgill/deb-pkg/uchardet/upstream/src/nsCodingStateMachine.h:72:30: runtime error: load of value 5, which is not a valid value for type 'nsSMState'
> /home/jcowgill/deb-pkg/uchardet/upstream/src/nsCodingStateMachine.h:72:30: runtime error: load of value 5, which is not a valid value for type 'nsSMState'
Comment 1 Jehan 2017-05-14 12:00:43 UTC
> This may be a little pedantic

No problem about being pedantic when code is involved. ;-)

> However, the C++11 standard

Actually I don't know if the original authors were targeting a specific version of C++. If we were to do so, would C++11 be the recommended version?

> In the case of nsSMState, the valid values are 0, 1, 2, and 3.

Aren't the valid values even only 0, 1 and 2?

> Assigning any other value results in unspecified behavior.

Makes complete sense. I'll have a look at this.
Comment 2 Jehan 2017-05-14 16:03:20 UTC
Ok I've had a quick look. I guess the right solution is that we will have to update the enum of states with other possible states returned by the state machine. At the very least, it feels obvious that there should be some kind of "eProcessing" state (or some other name), for cases when we are not sure yet.

And maybe there would be other intermediary states possible. But for that I will have to take some time some day to read more closely this piece of code to make sure I understand how it works. To this day, I now understand quite well the whole part for detection of single byte character sets, but I have never read yet the rest. I'll have to do so some day anyway if I want to improve all the CJK detection for instance.
Comment 3 James Cowgill 2017-05-15 10:42:01 UTC
(In reply to Jehan from comment #1)
> > However, the C++11 standard
> 
> Actually I don't know if the original authors were targeting a specific
> version of C++. If we were to do so, would C++11 be the recommended version?

I haven't checked C++98 to see if the same rules apply (I expect they do), but since gcc-6 now uses C++11 by default, most people are going to be compiling this in C++11 mode anyway.

> > In the case of nsSMState, the valid values are 0, 1, 2, and 3.
> 
> Aren't the valid values even only 0, 1 and 2?

This is what I was meaning when I said there was a complicated definition :) For enums with only positive values, the valid range is: [0, 2^m) where m is the smallest integer such that 2^m > lagest enum value. This allows some compiler optimizations like only having to compare the lower 8 bits if there are < 256 enum values.
Comment 4 Jehan 2017-05-15 10:48:16 UTC
(In reply to James Cowgill from comment #3)
> (In reply to Jehan from comment #1)
> > > However, the C++11 standard
> > 
> > Actually I don't know if the original authors were targeting a specific
> > version of C++. If we were to do so, would C++11 be the recommended version?
> 
> I haven't checked C++98 to see if the same rules apply (I expect they do),
> but since gcc-6 now uses C++11 by default, most people are going to be
> compiling this in C++11 mode anyway.

I see. Let's go for C++11 then. I'll check if there is not a cmake rule to enforce it.

If anyone has an alternative proposition, go ahead and tell me. I am not a C++ expert and am open to use other versions of C++ if someone who knows better tells me it makes sense.

> > > In the case of nsSMState, the valid values are 0, 1, 2, and 3.
> > 
> > Aren't the valid values even only 0, 1 and 2?
> 
> This is what I was meaning when I said there was a complicated definition :)
> For enums with only positive values, the valid range is: [0, 2^m) where m is
> the smallest integer such that 2^m > lagest enum value. This allows some
> compiler optimizations like only having to compare the lower 8 bits if there
> are < 256 enum values.

Oh I see.
Well for the compiler, this will be the valid values. In a semantic point of view though, I will stick to the defined enum constants only. For this, I will need to read and understand the code first.
Comment 5 Jehan 2017-05-28 18:07:32 UTC
Ok so I finally read the code. In the end, we can't really add new states to the enum values because it can have an indefinite number of state depending on the charset we are detecting. All states apart from the 3 basic (start, error and found) are specific to a prober.

As a conclusion, let's just replace nsSMState by an unsigned int.
As a side task, I have also defined C++11 as the standard uchardet will follow. This is not set in stone, and if it makes sense to follow another version, this can be discussed.

commit 53f7ad0e0b0894f11c948ee75f74595ab1b61405
Author: Jehan <jehan@girinstud.io>
Date:   Sun May 28 19:12:22 2017 +0200

    Bug 101032 - assignments to nsSMState in nsCodingStateMachine result...
    
    ... in unspecified behavior.
    When compiling with UBSan (-fsanitize=undefined), execution complains:
    > runtime error: load of value 5, which is not a valid value for type 'nsSMState'
    Since the machine states depend on every different charset's state
    machine, it is not possible to simply extend the enum with more generic
    values. Instead let's just make the state as an unsigned int value and
    define the 3 generic states as constants.

 src/nsBig5Prober.cpp       |  2 +-
 src/nsCodingStateMachine.h | 19 ++++++++++---------
 src/nsEUCJPProber.cpp      |  2 +-
 src/nsEUCKRProber.cpp      |  2 +-
 src/nsEUCTWProber.cpp      |  2 +-
 src/nsEscCharsetProber.cpp |  2 +-
 src/nsGB2312Prober.cpp     |  2 +-
 src/nsSJISProber.cpp       |  2 +-
 src/nsUTF8Prober.cpp       |  2 +-
 9 files changed, 18 insertions(+), 17 deletions(-)

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.