Skip to content

Fix GH-21223: ext/mbstring replace alloca with safe_emalloc in mb_guess_encoding_for_strings#21224

Closed
jordikroon wants to merge 2 commits intophp:masterfrom
jordikroon:fix/gh-21223
Closed

Fix GH-21223: ext/mbstring replace alloca with safe_emalloc in mb_guess_encoding_for_strings#21224
jordikroon wants to merge 2 commits intophp:masterfrom
jordikroon:fix/gh-21223

Conversation

@jordikroon
Copy link
Contributor

Fixes #21223

@alexdowad
Copy link
Contributor

@jordikroon, thanks for the PR. Some questions:

  • Where did you first encounter this crash? Did you find it in actual, production code, or by fuzzing PHP to look for crashes? Does the crash only occur when running under ASAN, or also when just running PHP "normally" (i.e. not compiling with ASAN)?
  • On what platform did you observe the crash? Do you know if it can be reproduced on other platforms?
  • The reproducer uses an array with 50,000 strings. What is the actual "threshold" size at which crashes occur?
  • What is the performance impact?

If there is no performance impact, then I would definitely agree with moving to emalloc. If there is non-negligible performance impact, then I would be more interested in the possibility of de-duplicating the $encodings array.

Speaking of which... is there performance impact to running mb_guess_encoding with duplicate values in the $encodings array (apart from the possibility of getting a crash from ASAN)?

@jordikroon
Copy link
Contributor Author

@alexdowad I hope this answers your questions.

  • I encountered this problem by fuzzing the mbstring specifically under ASAN. But running normally the given code example will give a Segmentation fault (core dumped).
  • Confirmed to be an issue on MacOS and Linux. So I assume this will hit on all platforms.
  • The threshold seems vary slightly. On my MacOS machine it seems to crash with a list of 208_629. On Linux 209_192. Not sure where the difference comes from.

Lastly as for the performance impact. It seems non-negligible and both before and after seem be even in performance. Sometimes one is faster than the other. But the longer the input, the slower it will become.

Execution time: 0.009391069~ (before)
Execution time: 0.009022951~ (after)

That said. Duplicate values do seem to affect performance. As in a single UTF-8 is notably faster than a list of 100.000.

@alexdowad
Copy link
Contributor

@jordikroon Hmm. You said for macOS you found a threshold of 208,629? But your test code uses an array with 50,000 copies of the string "UTF-8".

@jordikroon
Copy link
Contributor Author

Where did you see 50.000 copies? The test file and example show 500.000 copies.

@alexdowad
Copy link
Contributor

Where did you see 50.000 copies? The test file and example show 500.000 copies.

Ah 😅 That explains it.
Sorry for the mistake.

When I have some time, I intend to benchmark this change myself as an additional point of reference to assess the impact.

@ndossche
Copy link
Member

ndossche commented Feb 16, 2026

You can use do_alloca which will either use alloca or a heap allocation, automatic trade-off wrt performance.

EDIT: sorry for accidental closure, github...

@ndossche ndossche closed this Feb 16, 2026
@ndossche ndossche reopened this Feb 16, 2026
@alexdowad
Copy link
Contributor

Thanks so much to @ndossche!
Pulled @jordikroon's changes, squashed commits, corrected one comment. Just building and running tests...

@alexdowad
Copy link
Contributor

Tests are good. I wondered if the new test should be marked as a 'slow test', but it seems that's not necessary.
We need to update NEWS, etc.
Also just checking what PHP versions should be patched.

@alexdowad
Copy link
Contributor

The code where the crash was found is present in PHP 8.3, which is the oldest version in active support... I think I should merge down from there.

@alexdowad
Copy link
Contributor

Hmm, OK.
GitHub branch protection rule saved me from wrongly merging to PHP-8.3. That version now receives "critical fixes" only.
Let me merge into PHP-8.4 instead.

@alexdowad
Copy link
Contributor

OK.
Merged into PHP-8.4, merged down from there to PHP-8.5, merged down from there to master.
We are all good now.
Thanks very much, @jordikroon!

@alexdowad alexdowad closed this Feb 16, 2026
@alexdowad
Copy link
Contributor

PS to @jordikroon... your fix was merged as commit 37c5a13 (if you want to look at it in the commit log).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ext/mbstring: stack overflow in mb_guess_encoding called via mb_detect_encoding

3 participants