Big Data Science (@bdscience)

😎💡Benchmark for comprehensive assessment of LLM logical thinking

ZebraLogic is a benchmark built on logic grid puzzles: a set of 1,000 programmatically generated tasks of varying difficulty, with grid sizes from 2x2 to 6x6.

Each puzzle consists of N houses (numbered from left to right) and M features for each house. The task is to determine, from the provided clues, the unique assignment of feature values to houses.
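
To make the puzzle structure concrete, here is a minimal brute-force sketch in Python. The 3-house puzzle, its feature names, and its clues are invented for illustration and are not taken from the ZebraLogic dataset; the benchmark's own generator and solver may work quite differently.

```python
from itertools import permutations, product

# Hypothetical puzzle: N = 3 houses, M = 2 features (all values invented).
FEATURES = {
    "Name": ["Alice", "Bob", "Carol"],
    "Pet": ["cat", "dog", "fish"],
}

def clues_hold(grid):
    """grid[i] maps feature -> value for house i + 1, counted from the left."""
    pos = {value: i for i, house in enumerate(grid) for value in house.values()}
    return (
        pos["Alice"] == 0                  # Clue 1: Alice lives in house 1.
        and pos["dog"] == pos["Alice"] + 1  # Clue 2: the dog is directly right of Alice.
        and pos["Carol"] == pos["fish"]     # Clue 3: Carol owns the fish.
    )

def solve(features, check):
    """Try every way of distributing each feature's values over the houses."""
    names = list(features)
    solutions = []
    for combo in product(*(permutations(features[f]) for f in names)):
        # zip(*combo) regroups the per-feature orderings into per-house tuples.
        grid = [dict(zip(names, values)) for values in zip(*combo)]
        if check(grid):
            solutions.append(grid)
    return solutions

solutions = solve(FEATURES, clues_hold)
for number, house in enumerate(solutions[0], start=1):
    print(f"House {number}: {house}")
```

A well-formed puzzle has exactly one grid that satisfies all clues, so `solutions` should contain a single entry. Exhaustive search is only practical for small grids; the number of candidates grows as (N!)^M.
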
Language models are given one example of a solved puzzle, with a detailed explanation of the reasoning process and the answer in JSON format. They must then solve a new puzzle, providing both the reasoning process and the final solution in the given format.
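
As a rough illustration of handling such an answer, the sketch below pulls a JSON object out of a free-text model response. The field names `reasoning` and `solution` and the `House N` keys are assumptions made for this example; the post does not spell out the exact schema, so check the dataset before relying on them.

```python
import json

def extract_answer(model_output: str):
    """Return the JSON object embedded in a model response, or None if absent."""
    start = model_output.find("{")
    end = model_output.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(model_output[start:end + 1])
    except json.JSONDecodeError:
        return None
    # Only a JSON object (dict) is a usable answer in this sketch.
    return parsed if isinstance(parsed, dict) else None

# Hypothetical model response with text around the JSON payload.
response = """Here is my answer:
{"reasoning": "Alice is in house 1, so the dog must be in house 2.",
 "solution": {"House 1": {"Name": "Alice", "Pet": "cat"},
              "House 2": {"Name": "Bob", "Pet": "dog"}}}
"""

answer = extract_answer(response)
print(answer["solution"]["House 2"])
```

Returning `None` for unparseable output lets the caller count that puzzle as unsolved instead of crashing the evaluation run.
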

Evaluation Metrics:
1. Puzzle-level accuracy (percentage of completely correctly solved puzzles).
2. Cell-level accuracy (percentage of correctly completed cells in the solution matrix).
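
The two metrics can be sketched as follows. This is an illustrative implementation, not the ZeroEval scoring code: the grid layout (house -> feature -> value) is an assumption, and so is the choice to pool cells across all puzzles instead of averaging per puzzle.

```python
def score(predictions, references):
    """Compute puzzle-level and cell-level accuracy as percentages.

    Each grid maps a house label to a {feature: value} dict. A prediction of
    None (for example, unparseable model output) counts as all cells wrong.
    """
    solved = 0
    correct_cells = 0
    total_cells = 0
    for pred, ref in zip(predictions, references):
        cells = [(house, feat) for house, feats in ref.items() for feat in feats]
        hits = sum(
            1
            for house, feat in cells
            if pred is not None and pred.get(house, {}).get(feat) == ref[house][feat]
        )
        total_cells += len(cells)
        correct_cells += hits
        solved += hits == len(cells)  # A puzzle counts only if every cell matches.
    return {
        "puzzle_acc": 100.0 * solved / len(references),
        "cell_acc": 100.0 * correct_cells / total_cells,
    }

# Two hypothetical 2x2 puzzles sharing one reference grid.
reference = {
    "House 1": {"Name": "Alice", "Pet": "cat"},
    "House 2": {"Name": "Bob", "Pet": "dog"},
}
swapped_pets = {
    "House 1": {"Name": "Alice", "Pet": "dog"},
    "House 2": {"Name": "Bob", "Pet": "cat"},
}

result = score([reference, swapped_pets], [reference, reference])
print(result)
```

In this example one of two puzzles is fully solved and six of eight cells match, which shows why cell-level accuracy is usually the more forgiving of the two numbers.
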

🟡 Project Page
🟡 Dataset

Running ZebraLogic locally as part of the ZeroEval framework:

# Install via conda
conda create -n zeroeval python=3.10
conda activate zeroeval

# pip install vllm -U # pip install -e vllm
pip install vllm==0.5.1
pip install -r requirements.txt
# export HF_HOME=/path/to/your/custom/cache_dir/

# Run Meta-Llama-3-8B-Instruct locally, with greedy decoding, on `zebra-grid`
bash zero_eval_local.sh -d zebra-grid -m meta-llama/Meta-Llama-3-8B-Instruct -p Meta-Llama-3-8B-Instruct -s 4
