- Benchmark Progress Update: I've completed ANOTHER round to ensure accuracy - yes, I have now run ALL the benchmarks TWICE! While still compiling the results for a blog post, here's a sneak peek featuring detailed metrics and Top 10 rankings. Stay tuned for the complete analysis.
- Almost done benchmarking, write-up coming tomorrow, but I wanted to share an important finding right away: I tested QwQ at EXL2 quantizations from 3 to 8 bpw in MMLU-Pro, and raising max_tokens from the default 2K to 8K gave the smaller quants MUCH better scores. They need room to think! (A rough sketch of the setting is below.)
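
  For anyone wanting to try this, here's a minimal sketch of how the output limit might be raised when querying a locally served EXL2 quant through an OpenAI-compatible endpoint. The server URL, model id, and the 8192 limit are placeholders for illustration, not my exact harness setup:

  ```python
  from openai import OpenAI

  client = OpenAI(
      base_url="http://localhost:5000/v1",  # hypothetical local server (e.g. TabbyAPI or text-generation-webui)
      api_key="none",                       # most local servers ignore the key
  )

  def ask(question: str, max_tokens: int = 8192) -> str:
      """Send one MMLU-Pro-style question, allowing up to max_tokens of output."""
      response = client.chat.completions.create(
          model="QwQ-32B-Preview-exl2-8.0bpw",            # placeholder model id
          messages=[{"role": "user", "content": question}],
          max_tokens=max_tokens,                          # the default 2K cutoff can truncate the chain of thought
          temperature=0.0,
      )
      return response.choices[0].message.content
  ```

  The point is simply that a 2K cutoff can stop QwQ's long reasoning before it ever reaches an answer, which hits the smaller quants hardest.
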
- I'll probably go with average scores to determine rank, which means QwQ-32B-Preview (8bpw, 16K max new tokens) at 324.5 edges out Athene-V2-Chat at 321.5 and takes the lead as the best local model (in my rankings)! A rough sketch of the averaging is below.
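
  For the curious, the ranking logic is just this: average each model's per-benchmark scores and sort, highest first. The numbers here are made-up placeholders, not my actual results:

  ```python
  from statistics import mean

  # Made-up placeholder scores per benchmark, just to show the mechanics.
  scores = {
      "QwQ-32B-Preview (8bpw, 16K max new tokens)": [85.0, 78.5, 82.0],
      "Athene-V2-Chat": [84.0, 77.5, 81.0],
  }

  # Rank models by their average score across benchmarks, highest first.
  ranking = sorted(scores.items(), key=lambda kv: mean(kv[1]), reverse=True)

  for rank, (model, results) in enumerate(ranking, start=1):
      print(f"{rank}. {model}: average {mean(results):.1f}")
  ```
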