Plastic Man Benchmark Fail
Tom Spencer · Category: stories_and_anecdotes
When tested on a BrowseComp benchmark prompt about a 1960s fictional character, several advanced models, including Claude 4.5 and GPT-4.5, failed before the browser extension succeeded.
© 2025 The Build. All rights reserved.