eval_tools/model_comparison/README_per_case_comparison.md

# Per-Case 2D Metrics Comparison Tool

This tool compares `per_case_2d` metrics between two model evaluation reports and identifies cases with significant metric differences.

## Files

- `compare_per_case_2d.py` - Main Python script for comparing per-case metrics
- `compare_per_case_2d.sh` - Shell script with pre-configured paths for mono3d vs yolov5s-300w-newdata comparison

## Usage

### Quick Start (Using Shell Script)

```bash
cd /deeplearning_team/ydong/dongying/projects/yolov5-3d
./eval_tools/model_comparison/compare_per_case_2d.sh
```

This compares the two pre-configured models (mono3d and yolov5s-300w-newdata) and saves the results to `evaluation_results/per_case_2d_comparison.json`.

### Custom Comparison (Using Python Script)

```bash
python eval_tools/model_comparison/compare_per_case_2d.py \
    --model1 path/to/model1/evaluation_report.json \
    --model2 path/to/model2/evaluation_report.json \
    --model1-name "Model-A" \
    --model2-name "Model-B" \
    --threshold 0.1 \
    --output comparison_results.json \
    --top-n 30
```

### Arguments

- `--model1`: Path to first model's evaluation_report.json (required)
- `--model2`: Path to second model's evaluation_report.json (required)
- `--model1-name`: Display name for model 1 (default: "Model-1")
- `--model2-name`: Display name for model 2 (default: "Model-2")
- `--threshold`: Minimum absolute metric difference for a case to count as significant, e.g., 0.1 = 10 percentage points on a 0-1 metric scale (default: 0.1)
- `--output`: Output JSON file path (default: "per_case_comparison.json")
- `--top-n`: Number of top different cases to display (default: 20)
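For reference, the flags and defaults listed above could be declared with `argparse` roughly as follows. This is a minimal sketch that mirrors the documented interface, not the actual contents of `compare_per_case_2d.py`; the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        description="Compare per_case_2d metrics between two evaluation reports"
    )
    parser.add_argument("--model1", required=True,
                        help="Path to first model's evaluation_report.json")
    parser.add_argument("--model2", required=True,
                        help="Path to second model's evaluation_report.json")
    parser.add_argument("--model1-name", default="Model-1",
                        help="Display name for model 1")
    parser.add_argument("--model2-name", default="Model-2",
                        help="Display name for model 2")
    parser.add_argument("--threshold", type=float, default=0.1,
                        help="Threshold for significant difference")
    parser.add_argument("--output", default="per_case_comparison.json",
                        help="Output JSON file path")
    parser.add_argument("--top-n", type=int, default=20,
                        help="Number of top different cases to display")
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--model1", "a.json", "--model2", "b.json"])
print(args.model1_name, args.threshold, args.output, args.top_n)
```

Only the two report paths are required; every other flag falls back to the default listed above.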

## Output

The script generates:

1. **Console Output**:
   - Summary of total cases and common cases
   - Top N cases with significant differences
   - Summary statistics (mean, std, median, range) for each class and metric

2. **JSON File**: Contains detailed comparison data including:
   - `summary`: Overview statistics
   - `significant_differences`: List of cases exceeding the threshold
   - `all_case_comparisons`: Complete per-case comparison data
   - `summary_statistics`: Statistical analysis by class and metric
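The JSON file can be post-processed with a few lines of Python. The sketch below relies only on the top-level key `significant_differences` documented above. The per-entry field names (`case`, `class`, `metric`, `difference`) and the helper `load_significant` are assumptions for illustration, so check them against a real output file before reusing this.

```python
import json
import tempfile
from pathlib import Path


def load_significant(path, min_abs_diff=0.0):
    """Return entries from `significant_differences` whose absolute
    difference is at least `min_abs_diff`.

    ASSUMPTION: each entry is a dict with a numeric "difference" field.
    """
    with open(path, encoding="utf-8") as f:
        report = json.load(f)
    entries = report.get("significant_differences", [])
    return [e for e in entries if abs(e.get("difference", 0.0)) >= min_abs_diff]


# Hypothetical stand-in for a real comparison file; the entry layout is assumed.
sample = {
    "summary": {},
    "significant_differences": [
        {"case": "20251118/seq-53", "class": "pedestrian",
         "metric": "ap", "difference": -1.0},
        {"case": "hypothetical/seq-0", "class": "vehicle",
         "metric": "recall", "difference": 0.2},
    ],
    "all_case_comparisons": {},
    "summary_statistics": {},
}

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "comparison_results.json"
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    large = load_significant(sample_path, min_abs_diff=0.5)

print([e["case"] for e in large])
```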

## Example Output

```
Top 30 Cases with Significant Differences
================================================================================

1. Case: 20251118/seq-53
   Class: pedestrian, Metric: ap
   mono3d: 1.0000
   yolov5s-300w-newdata: 0.0000
   Difference: -1.0000 (abs: 1.0000)

2. Case: 20251121/seq-30
   Class: roadblock, Metric: ap
   mono3d: 1.0000
   yolov5s-300w-newdata: 0.0000
   Difference: -1.0000 (abs: 1.0000)
...

Summary Statistics
================================================================================

VEHICLE:
  ap        : mean=-0.0776, std=0.1439, median=-0.0243, range=[-0.7935, +0.0994]
  precision : mean=+0.1279, std=0.2248, median=+0.0934, range=[-0.9442, +0.6074]
  recall    : mean=-0.1210, std=0.1579, median=-0.0635, range=[-0.8975, +0.0000]
```
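A summary line like the ones in the example above could be computed along these lines. This is a sketch only: the helper names are hypothetical, and whether the script reports a population or sample standard deviation is not stated, so the population form (`statistics.pstdev`) is an assumption.

```python
import statistics


def summarize(diffs):
    """Compute mean, std, median, and range for a list of signed differences.

    ASSUMPTION: population standard deviation; the real script may differ.
    """
    return {
        "mean": statistics.fmean(diffs),
        "std": statistics.pstdev(diffs),
        "median": statistics.median(diffs),
        "min": min(diffs),
        "max": max(diffs),
    }


def format_line(metric, stats):
    """Format one metric's statistics in the style of the console output."""
    return (
        f"  {metric:<10}: mean={stats['mean']:+.4f}, std={stats['std']:.4f}, "
        f"median={stats['median']:+.4f}, "
        f"range=[{stats['min']:+.4f}, {stats['max']:+.4f}]"
    )


# Hypothetical per-case differences for one class and metric.
example_diffs = [-0.2, 0.0, 0.1, 0.3]
example_stats = summarize(example_diffs)
example_line = format_line("ap", example_stats)
print(example_line)
```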

## Interpretation

- **Positive difference**: Model 2 performs better than Model 1 (the difference is computed as Model 2 minus Model 1, as in the example output above)
- **Negative difference**: Model 1 performs better than Model 2
- Cases are sorted by absolute difference (largest differences first)
- Summary statistics show overall trends across all cases
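The rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script's actual code: the nested `{case: {class: {metric: value}}}` layout, the function name `compare_cases`, and the strict `>` threshold comparison are all hypothetical.

```python
def compare_cases(model1, model2, threshold=0.1):
    """Return flagged differences, largest absolute difference first.

    ASSUMPTION: both inputs map case -> class -> metric -> value, and a
    difference is flagged when its absolute value exceeds `threshold`.
    """
    rows = []
    # Only cases, classes, and metrics present in both reports are compared.
    for case in sorted(model1.keys() & model2.keys()):
        for cls in sorted(model1[case].keys() & model2[case].keys()):
            m1, m2 = model1[case][cls], model2[case][cls]
            for metric in sorted(m1.keys() & m2.keys()):
                diff = m2[metric] - m1[metric]  # positive: Model 2 is better
                if abs(diff) > threshold:
                    rows.append({
                        "case": case,
                        "class": cls,
                        "metric": metric,
                        "difference": diff,
                        "abs_difference": abs(diff),
                    })
    rows.sort(key=lambda r: r["abs_difference"], reverse=True)
    return rows


# Hypothetical inputs; only the first case and its values come from the example.
report1 = {
    "20251118/seq-53": {"pedestrian": {"ap": 1.0}},
    "hypothetical/seq-1": {"vehicle": {"ap": 0.50, "recall": 0.80}},
}
report2 = {
    "20251118/seq-53": {"pedestrian": {"ap": 0.0}},
    "hypothetical/seq-1": {"vehicle": {"ap": 0.55, "recall": 0.50}},
    "hypothetical/seq-2": {"vehicle": {"ap": 0.90}},  # not in report1, skipped
}

flagged = compare_cases(report1, report2, threshold=0.1)
for row in flagged:
    print(row["case"], row["class"], row["metric"], f"{row['difference']:+.4f}")
```

Here the pedestrian `ap` drop of 1.0 sorts ahead of the vehicle `recall` drop of 0.3, while the 0.05 change in vehicle `ap` stays below the threshold and is not flagged.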