Heap buffer overflow in _PyTokenizer_ensure_utf8 #144872

@AdamKorcz

Description

Bug report

Bug description:

OSS-Fuzz has found a heap buffer overflow in _PyTokenizer_ensure_utf8. Link to OSS-Fuzz bug report.

The root cause is that valid_utf8() in Parser/tokenizer/helpers.c checks continuation bytes in reverse order, and therefore reads s[expected] before s[1], on these lines:

for (; expected; expected--)
    if (s[expected] < 0x80 || s[expected] >= 0xC0)
        return 0;

When a multi-byte UTF-8 sequence is truncated, such as a 3-byte lead \xEA followed immediately by the null terminator, the backward loop dereferences bytes past the end of the valid data before it ever reaches the null byte that a forward scan would have stopped at.
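For comparison, a forward, bounds-aware walk over the continuation bytes rejects a truncated sequence before it dereferences past the terminator. Below is a minimal standalone sketch of that idea; the helper name, its signature, and the explicit end pointer are illustrative assumptions, not the actual CPython code or fix.

#include <stdio.h>
#include <string.h>

/* Forward, bounds-aware check of the continuation bytes of one UTF-8
 * sequence. `s` points at the lead byte, `expected` is the number of
 * continuation bytes implied by that lead, and `end` is one past the
 * last readable byte. Returns 1 if the sequence is complete and
 * well-formed, 0 otherwise. (Illustrative helper only.) */
static int
check_continuation_bytes(const unsigned char *s, int expected,
                         const unsigned char *end)
{
    for (int i = 1; i <= expected; i++) {
        if (s + i >= end) {
            /* Truncated sequence: stop before reading past the buffer. */
            return 0;
        }
        if (s[i] < 0x80 || s[i] >= 0xC0) {
            /* Continuation bytes must lie in 0x80..0xBF. */
            return 0;
        }
    }
    return 1;
}

int
main(void)
{
    /* The scenario from the report: a 3-byte lead \xEA (which implies
     * two continuation bytes) followed immediately by the terminator. */
    const unsigned char truncated[] = "\xEA";
    const unsigned char *end = truncated + strlen((const char *)truncated);
    printf("truncated sequence valid: %d\n",
           check_continuation_bytes(truncated, 2, end));   /* prints 0 */
    return 0;
}

Walking forward means the length check happens before each dereference, so a lead byte that promises more continuation bytes than the buffer contains is rejected rather than read out of bounds.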

This is not a security-critical issue.

CPython versions tested on:

CPython main branch

Operating systems tested on:

No response

Linked PRs

Metadata

    Labels

    interpreter-core (Objects, Python, Grammar, and Parser dirs), type-bug (An unexpected behavior, bug, or error)
