feat: smart batch processing with skip logic

- Change --batch to accept directory instead of glob pattern
- Automatically skip already-processed scan dates
- Add --force flag to reprocess all files
- Fix date extraction regex to parse from client info line
- Display helpful tips about skipping/forcing
- Better user feedback with skip counts and suggestions

Usage:
  python dexa_extract.py --batch data/pdfs --height-in 74 --outdir data/results

This will process only new scans, skipping any dates already in the output.
Mac DeCourcy 2025-10-06 15:33:05 -07:00
parent d6793e2572
commit b046af5d25
3 changed files with 342 additions and 38 deletions

@@ -66,7 +66,16 @@ python dexa_extract.py <PDF_PATH> --height-in <HEIGHT> [--weight-lb <WEIGHT>] [-
python dexa_extract.py data/pdfs/2025-10-06-scan.pdf --height-in 74 --weight-lb 212 --outdir data/results
```
**Process multiple scans** (appends to existing files):
**Batch process multiple scans:**
```bash
# Process all PDFs in a directory (automatically skips already-processed dates)
python dexa_extract.py --batch data/pdfs --height-in 74 --outdir data/results
# Force reprocessing all files
python dexa_extract.py --batch data/pdfs --height-in 74 --outdir data/results --force
```
**Individual scans** (appends to existing files):
```bash
python dexa_extract.py data/pdfs/scan-2025-01.pdf --height-in 74 --outdir data/results
python dexa_extract.py data/pdfs/scan-2025-04.pdf --height-in 74 --outdir data/results
@@ -247,10 +256,35 @@ Higher trunk percentage may indicate good core development, while higher leg per
The script appends data to existing CSV files, making it easy to track changes over time:
1. Place all your DEXA PDFs in `data/pdfs/`
2. Process each one with the same output directory
3. Open `overall.csv` in Excel/Google Sheets to visualize trends
4. Compare `muscle_balance.csv` to track left/right symmetry improvements
### Option 1: Batch Processing (Recommended)
```bash
# Place all your PDFs in one directory
data/pdfs/
├── scan-2025-01-15.pdf
├── scan-2025-04-20.pdf
└── scan-2025-10-06.pdf
# Process all at once (automatically skips already-processed dates)
python dexa_extract.py --batch data/pdfs --height-in 74 --outdir data/results
# Add new scans later - only new ones will be processed
cp ~/Downloads/scan-2025-12-15.pdf data/pdfs/
python dexa_extract.py --batch data/pdfs --height-in 74 --outdir data/results
```
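The skip behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual code in `dexa_extract.py`: the helper names `processed_dates` and `select_new_scans` and the `scan_date` column name are hypothetical, and the scan date of each PDF is assumed to have been extracted already.

```python
import csv
from pathlib import Path


def processed_dates(overall_csv: Path, date_column: str = "scan_date") -> set[str]:
    """Return the scan dates already present in an output CSV.

    `date_column` is a hypothetical column name; adjust it to the real header.
    A missing file simply means nothing has been processed yet.
    """
    if not overall_csv.exists():
        return set()
    with overall_csv.open(newline="") as fh:
        return {row[date_column] for row in csv.DictReader(fh) if row.get(date_column)}


def select_new_scans(
    scans: dict[str, str], seen: set[str], force: bool = False
) -> tuple[list[str], list[str]]:
    """Split {pdf_name: scan_date} into (to_process, skipped).

    With force=True every file is processed, mirroring the --force flag.
    """
    to_process: list[str] = []
    skipped: list[str] = []
    for name, date in sorted(scans.items()):
        if force or date not in seen:
            to_process.append(name)
        else:
            skipped.append(name)
    return to_process, skipped
```

Keying the check on the scan date rather than the file name means a renamed or re-downloaded PDF of the same scan would still be skipped, which appears to be the intent of the date-based skip.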
### Option 2: Individual Processing
```bash
# Process scans as you get them
python dexa_extract.py data/pdfs/scan-2025-01.pdf --height-in 74 --outdir data/results
python dexa_extract.py data/pdfs/scan-2025-04.pdf --height-in 74 --outdir data/results
python dexa_extract.py data/pdfs/scan-2025-10.pdf --height-in 74 --outdir data/results
```
### Analyzing Results
1. Open `overall.csv` in Excel/Google Sheets to visualize trends
2. Compare `muscle_balance.csv` to track left/right symmetry improvements
3. Review `summary.md` for readable reports of each scan
4. Use `overall.json` for programmatic analysis
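For programmatic trend analysis, a small sketch like the one below could compute the change in one metric between consecutive scans. It reads CSV text rather than `overall.json`, whose structure is not shown here. The function name `metric_changes` and the column names passed to it are hypothetical, and dates are assumed to be ISO formatted (`YYYY-MM-DD`) so that sorting them as strings gives chronological order.

```python
import csv
import io


def metric_changes(
    csv_text: str, date_column: str, metric_column: str
) -> list[tuple[str, float]]:
    """Return (date, change since the previous scan) for one numeric column.

    Column names are supplied by the caller because the real headers in
    overall.csv are not assumed here.
    """
    rows = sorted(csv.DictReader(io.StringIO(csv_text)), key=lambda r: r[date_column])
    values = [(r[date_column], float(r[metric_column])) for r in rows]
    # Pair each scan with the one before it and take the difference.
    return [
        (date, round(value - prev, 2))
        for (_, prev), (date, value) in zip(values, values[1:])
    ]
```

To use it, pass in the text of `data/results/overall.csv` (for example via `Path(...).read_text()`) along with the real header names from that file.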
## Privacy & Security
@@ -281,12 +315,12 @@ The script appends data to existing CSV files, making it easy to track changes o
Contributions welcome! Areas for improvement:
- [ ] Enhanced error handling and validation
- [ ] Automatic height detection from PDF
- [ ] Data visualization/plotting features
- [ ] GUI interface for non-technical users
- [ ] Batch processing multiple PDFs at once
- [ ] Export to additional formats (Excel, SQLite, etc.)
- [ ] Support for older BodySpec PDF formats
- [ ] Progress bar for batch processing
## License