pub enum Utf8ErrorKind {
    TooFewBytes,
    NonUtf8Byte,
    UnexpectedContinuationByte,
    InterruptedSequence,
    OverlongEncoding,
    Utf16ReservedCodepoint,
    TooHighCodepoint,
}
The types of errors that can occur when decoding a UTF-8 codepoint.
The variants are more technical than what an end user is likely interested in, but might be useful for deciding how to handle the error.
They can be grouped into three categories:
- Will happen regularly if decoding chunked or buffered text: TooFewBytes.
- Input might be binary, a different encoding or corrupted: UnexpectedContinuationByte and InterruptedSequence (broken UTF-8 sequence).
- Less likely to happen accidentally and might be malicious: OverlongEncoding, Utf16ReservedCodepoint and TooHighCodepoint. Note that these can still be caused by certain valid Latin-1 strings such as "Á©" (b"\xC1\xA9").
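The grouping above can drive error handling directly. The following is a minimal sketch, not part of the crate: the enum is redeclared locally so the example compiles on its own, and the Action type and action_for function are hypothetical names. NonUtf8Byte is not listed in any category above; placing it in the second group is an assumption based on its own description.

```rust
/// Local mirror of the documented variants, so the sketch compiles on its own.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Utf8ErrorKind {
    TooFewBytes,
    NonUtf8Byte,
    UnexpectedContinuationByte,
    InterruptedSequence,
    OverlongEncoding,
    Utf16ReservedCodepoint,
    TooHighCodepoint,
}

/// Hypothetical handling policy with one action per category.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    /// Keep the bytes and wait for the next chunk.
    WaitForMoreInput,
    /// Substitute a replacement character or try another encoding.
    ReplaceOrTryOtherEncoding,
    /// Treat the input as suspicious and refuse it.
    RejectInput,
}

/// Maps an error kind to an action according to the three categories.
pub fn action_for(kind: Utf8ErrorKind) -> Action {
    use Utf8ErrorKind::*;
    match kind {
        TooFewBytes => Action::WaitForMoreInput,
        // Assumption: NonUtf8Byte is grouped with the "broken sequence" errors.
        NonUtf8Byte | UnexpectedContinuationByte | InterruptedSequence => {
            Action::ReplaceOrTryOtherEncoding
        }
        OverlongEncoding | Utf16ReservedCodepoint | TooHighCodepoint => Action::RejectInput,
    }
}
```

Whether rejecting is the right response to the third category depends on the application; a lenient text viewer might prefer to replace those sequences as well.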
Variants
TooFewBytes
There are too few bytes to decode the codepoint.
This can happen when a slice is empty or too short, or an iterator returned None while in the middle of a codepoint.
This error is never produced by functions accepting fixed-size [u8; 4] arrays.
If decoding text that arrives in chunks (such as in buffers passed to Read), the remaining bytes, including the byte this error was produced for, should be carried over into the next chunk or buffer.
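The carry-over strategy can be sketched with the standard library alone. This example does not use this crate's decoding functions; it relies on std::str::from_utf8, whose Utf8Error reports an incomplete trailing sequence by returning None from error_len(), which plays the role of TooFewBytes here. The function name decode_chunk is hypothetical.

```rust
use std::str;

/// Appends `chunk` to the bytes carried over from the previous call,
/// pushes all complete text onto `out`, and leaves an unfinished trailing
/// sequence in `carry` for the next call.
pub fn decode_chunk(
    carry: &mut Vec<u8>,
    chunk: &[u8],
    out: &mut String,
) -> Result<(), str::Utf8Error> {
    carry.extend_from_slice(chunk);
    let valid_len = match str::from_utf8(&carry[..]) {
        Ok(_) => carry.len(),
        // `error_len() == None` means the input ended in the middle of a
        // codepoint: the equivalent of TooFewBytes.
        Err(e) if e.error_len().is_none() => e.valid_up_to(),
        // Any other error is a genuinely malformed sequence.
        Err(e) => return Err(e),
    };
    let text = str::from_utf8(&carry[..valid_len]).expect("prefix was already validated");
    out.push_str(text);
    // Keep only the unfinished bytes, including the first byte of the
    // interrupted codepoint.
    carry.drain(..valid_len);
    Ok(())
}
```

Feeding b"caf\xC3" and then b"\xA9!" through this function yields "caf" after the first call, with 0xC3 held back, and "café!" after the second.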
NonUtf8Byte
A byte which is never used by well-formed UTF-8 was encountered.
This means that the input is using a different encoding, is corrupted or binary.
This error is returned when a byte in the following ranges is encountered anywhere in a UTF-8 sequence:
- 192 and 193 (0b1100_000x): Indicates an overlong encoding of a single-byte (ASCII) character, and should therefore never occur.
- 248.. (0b1111_1xxx): Sequences cannot be longer than 4 bytes.
- 245..=247 (0b1111_0101 | 0b1111_0110): Indicates a too high codepoint (above \u10ffff).
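The byte ranges listed above collapse into a single check. This is a minimal sketch with a hypothetical function name; it mirrors the documented ranges and is not the crate's internal implementation.

```rust
/// Returns true for bytes that never appear in well-formed UTF-8:
/// 192 and 193 (overlong two-byte starts), and 245 and above
/// (too high codepoints or sequences longer than four bytes).
pub fn is_never_in_utf8(byte: u8) -> bool {
    matches!(byte, 192 | 193 | 245..=255)
}
```

The neighbouring values 194 (0xC2) and 244 (0xF4) are valid start bytes, so they are not matched.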
UnexpectedContinuationByte
The first byte is not a valid start of a codepoint.
This might happen as a result of slicing into the middle of a codepoint, the input not being UTF-8 encoded or being corrupted. Errors of this type coming right after another error should probably be ignored, unless returned more than three times in a row.
This error is returned when the first byte has a value in the range 128..=191 (0b1000_0000..=0b1011_1111).
InterruptedSequence
The byte at index 1..=3 should be a continuation byte, but doesn’t fit the pattern 0b10xx_xxxx.
When the input slice or iterator has too few bytes, TooFewBytes is returned instead.
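The continuation-byte pattern 0b10xx_xxxx (values 128..=191) can be tested with a mask. The sketch below uses hypothetical helper names and assumes the slice passed in holds exactly one complete sequence; it only illustrates the pattern check, not the crate's decoder.

```rust
/// True if `byte` matches the continuation pattern 0b10xx_xxxx (128..=191).
pub fn is_continuation(byte: u8) -> bool {
    (byte & 0b1100_0000) == 0b1000_0000
}

/// Returns the index of the first byte after the start byte that is not a
/// continuation byte, or None if all trailing bytes fit the pattern.
pub fn first_interruption(seq: &[u8]) -> Option<usize> {
    seq.iter()
        .skip(1)
        .position(|&b| !is_continuation(b))
        .map(|i| i + 1) // account for the skipped start byte
}
```

A start byte for which is_continuation is true corresponds to UnexpectedContinuationByte, while a Some result from first_interruption corresponds to InterruptedSequence.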
OverlongEncoding
The encoding of the codepoint has so many leading zeroes that it could be a byte shorter.
Successfully decoding this can present a security issue: doing so could allow an attacker to circumvent input validation that only checks for ASCII characters, and to input characters or strings that would otherwise be rejected, such as /../.
This error is only returned for 3 and 4-byte encodings; NonUtf8Byte is returned for bytes that start longer or shorter overlong encodings.
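To make the security concern concrete, the sketch below extracts the payload bits of a three-byte sequence without validating them and checks whether the value would have fit in fewer bytes. The function names are hypothetical. The bytes E0 80 AF are an overlong encoding of '/', which a filter that only scans for the ASCII byte 0x2F would miss.

```rust
/// Extracts the 16 payload bits of a three-byte sequence (1110xxxx 10xxxxxx
/// 10xxxxxx) without checking that the bytes are well-formed.
pub fn raw_three_byte_value(b: [u8; 3]) -> u32 {
    ((b[0] as u32 & 0x0F) << 12) | ((b[1] as u32 & 0x3F) << 6) | (b[2] as u32 & 0x3F)
}

/// A three-byte sequence is overlong if its value fits in two bytes,
/// that is, if it is below 0x800.
pub fn is_overlong_three_byte(b: [u8; 3]) -> bool {
    raw_three_byte_value(b) < 0x800
}
```

The standard library's std::str::from_utf8 also rejects E0 80 AF, so a strict decoder never lets the hidden '/' through.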
Utf16ReservedCodepoint
The codepoint is reserved for UTF-16 surrogate pairs. (Utf8Char cannot be used to work with the WTF-8 encoding for UCS-2 strings.)
This error is returned for codepoints in the range \ud800..=\udfff (which are three bytes long as UTF-8).
TooHighCodepoint
The codepoint is higher than \u10ffff, which is the highest codepoint Unicode permits.
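The range checks behind Utf16ReservedCodepoint and TooHighCodepoint are simple comparisons on the decoded value. A minimal sketch with hypothetical function names follows; the standard library's char::from_u32 applies the same two restrictions.

```rust
/// True for codepoints reserved for UTF-16 surrogate pairs.
pub fn is_utf16_reserved(codepoint: u32) -> bool {
    (0xD800..=0xDFFF).contains(&codepoint)
}

/// True for values above the highest codepoint Unicode permits.
pub fn is_too_high(codepoint: u32) -> bool {
    codepoint > 0x10FFFF
}
```

As byte sequences, ED A0 80 would decode to 0xD800 and F4 90 80 80 would decode to 0x110000; both are rejected by std::str::from_utf8.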
Trait Implementations
impl Clone for Utf8ErrorKind
fn clone(&self) -> Utf8ErrorKind
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.