Character Sets

The character set chset matches a set of characters over a finite range bounded by the limits of its template parameter CharT. The class is an optimized parser for sets of single characters. It is parameterized by the character type CharT and works efficiently with 8-, 16-, 32- and even 64-bit characters.

    template <typename CharT = char>
    class chset;

A chset is constructed from literals (e.g. 'x'), ch_p or chlit<>, range_p or range<>, anychar_p and nothing_p (see primitives), or copy-constructed from another chset. The class uses a copy-on-write scheme, so instances can be passed around cheaply by value.

Sparse bit vectors

To accommodate 16-, 32- and 64-bit characters, the chset class selects its implementation statically: a std::bitset when the character type is no wider than 8 bits, and otherwise a sparse bit/boolean set backed by a sorted vector of disjoint ranges (range_run). The set is built from ranges, and adjacent or overlapping ranges are coalesced.

range_runs are very space-economical when a set consists of many ranges and a few individual disjoint values. Searching is O(log n), where n is the number of ranges.
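
The idea of a sorted vector of disjoint ranges can be sketched in a few lines of standard C++. This is only an illustration: range_set and its members set, test and size are hypothetical names, and the code does not claim to match Spirit's actual range_run. It assumes inclusive bounds and ignores overflow at the extreme ends of a 64-bit signed character type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a range_run-like container (illustrative only).
// Invariant: runs_ is sorted, and no two stored ranges overlap or touch.
template <typename CharT>
class range_set {
public:
    struct range { CharT first, last; };  // inclusive bounds, first <= last

    // Add [lo, hi], coalescing with overlapping or adjacent ranges.
    void set(CharT lo, CharT hi) {
        assert(lo <= hi);
        // Skip every range that ends more than one position before lo.
        auto first = std::lower_bound(runs_.begin(), runs_.end(), lo,
            [](const range& r, CharT v) { return r.last < v && v - r.last > 1; });
        // Absorb every following range that starts at or before hi + 1.
        auto it = first;
        while (it != runs_.end() && (it->first <= hi || it->first - hi == 1)) {
            lo = std::min(lo, it->first);
            hi = std::max(hi, it->last);
            ++it;
        }
        it = runs_.erase(first, it);
        runs_.insert(it, range{lo, hi});
    }

    // Membership test by binary search: O(log n) in the number of ranges.
    bool test(CharT ch) const {
        auto it = std::lower_bound(runs_.begin(), runs_.end(), ch,
            [](const range& r, CharT v) { return r.last < v; });
        return it != runs_.end() && it->first <= ch;
    }

    std::size_t size() const { return runs_.size(); }

private:
    std::vector<range> runs_;
};
```

With this sketch, adding L'a' to L'f' and then L'g' to L'z' leaves a single stored range, because the two are adjacent; the storage cost depends on the number of ranges, not on the number of characters they cover.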

Examples:

    chset<> s1('x');
    chset<> s2(anychar_p - s1);

Optionally, character sets may also be constructed from a definition string whose syntax resembles POSIX-style regular expression character sets, except that double quotes delimit the set elements instead of square brackets and there is no special negation character (^).

    range = anychar_p >> '-' >> anychar_p;
    set = *(range | anychar_p);

Since the set is defined using a C string, the usual C/C++ string literal syntax rules apply. Examples:

    chset<> s1("a-zA-Z");       // alphabetic characters
    chset<> s2("0-9a-fA-F");    // hexadecimal characters
    chset<> s3("actgACTG");     // DNA identifiers
    chset<> s4("\x7f\x7e");     // Hexadecimal 0x7F and 0x7E
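
To make the definition-string syntax concrete, the following sketch expands such a string into a std::bitset<256>, the kind of representation described above for 8-bit characters. The helper make_char_set is hypothetical and is not part of Spirit. It assumes that a '-' with no character after it is taken literally and that the bounds of a range are given in ascending order; the library's own handling of these edge cases may differ.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Spirit API): expand a definition string such as
// "0-9a-fA-F" into a bitset with one bit per 8-bit character value.
inline std::bitset<256> make_char_set(const std::string& def) {
    std::bitset<256> bits;
    std::size_t i = 0;
    while (i < def.size()) {
        const unsigned char lo = static_cast<unsigned char>(def[i]);
        // "x-y" is a range only when '-' is followed by another character.
        if (i + 2 < def.size() && def[i + 1] == '-') {
            const unsigned char hi = static_cast<unsigned char>(def[i + 2]);
            assert(lo <= hi);  // assumption: ranges are written in ascending order
            for (unsigned c = lo; c <= hi; ++c)
                bits.set(c);
            i += 3;
        } else {
            bits.set(lo);      // single character, including a trailing '-'
            i += 1;
        }
    }
    return bits;
}
```

In this sketch, make_char_set("0-9a-fA-F") sets the 22 bits for the hexadecimal digits, and make_char_set("actgACTG") sets 8 bits, one per listed character.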

The standard Spirit set operators apply (see operators), plus one additional operator specific to character sets, the inverse (negation) operator ~:

Character set operators
~a