Heap buffer overflow in _PyTokenizer_ensure_utf8 #144872

@AdamKorcz

Description

Bug report

Bug description:

OSS-Fuzz has found a heap buffer overflow in _PyTokenizer_ensure_utf8. Link to OSS-Fuzz bug report.

The root cause is that valid_utf8() in Parser/tokenizer/helpers.c checks continuation bytes in reverse order, and therefore reads s[expected] before s[1], on these lines:

for (; expected; expected--)
    if (s[expected] < 0x80 || s[expected] >= 0xC0)
        return 0;

When a multi-byte UTF-8 sequence is truncated, such as a 3-byte lead \xEA followed immediately by the null terminator, the backward loop dereferences bytes past the end of the valid data before it ever reaches the null byte that a forward scan would have stopped at.
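For comparison, a forward, bounds-aware walk over the continuation bytes rejects a truncated sequence before it dereferences past the terminator. Below is a minimal standalone sketch of that idea; the helper name, its signature, and the explicit end pointer are illustrative assumptions, not the actual CPython code or fix.

#include <stdio.h>
#include <string.h>

/* Forward, bounds-aware check of the continuation bytes of one UTF-8
 * sequence. `s` points at the lead byte, `expected` is the number of
 * continuation bytes implied by that lead, and `end` is one past the
 * last readable byte. Returns 1 if the sequence is complete and
 * well-formed, 0 otherwise. (Illustrative helper only.) */
static int
check_continuation_bytes(const unsigned char *s, int expected,
                         const unsigned char *end)
{
    for (int i = 1; i <= expected; i++) {
        if (s + i >= end) {
            /* Truncated sequence: stop before reading past the buffer. */
            return 0;
        }
        if (s[i] < 0x80 || s[i] >= 0xC0) {
            /* Continuation bytes must lie in 0x80..0xBF. */
            return 0;
        }
    }
    return 1;
}

int
main(void)
{
    /* The scenario from the report: a 3-byte lead \xEA (which implies
     * two continuation bytes) followed immediately by the terminator. */
    const unsigned char truncated[] = "\xEA";
    const unsigned char *end = truncated + strlen((const char *)truncated);
    printf("truncated sequence valid: %d\n",
           check_continuation_bytes(truncated, 2, end));   /* prints 0 */
    return 0;
}

Walking forward means the length check happens before each dereference, so a lead byte that promises more continuation bytes than the buffer contains is rejected rather than read out of bounds.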

This is not a security-critical issue.

CPython versions tested on:

CPython main branch

Operating systems tested on:

No response

Linked PRs

Metadata

    Labels

    interpreter-core (Objects, Python, Grammar, and Parser dirs), type-bug (An unexpected behavior, bug, or error)
