feat: smart batch processing with skip logic

- Change --batch to accept directory instead of glob pattern
- Automatically skip already-processed scan dates
- Add --force flag to reprocess all files
- Fix date extraction regex to parse from client info line
- Display helpful tips about skipping/forcing
- Better user feedback with skip counts and suggestions

Usage:
  python dexa_extract.py --batch data/pdfs --height-in 74 --outdir data/results

This will process only new scans, skipping any dates already in the output.
2025-10-06 15:33:05 -07:00 · b046af5d25
commit b046af5d25
parent d6793e2572
3 changed files with 342 additions and 38 deletions
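
The skip behaviour described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `dexa_extract.py`: the helper names `load_processed_dates` and `plan_batch`, and the `scan_date` column name, are assumptions made for the example. It also assumes each PDF's scan date has already been extracted and is compared as a plain string.

```python
import csv
from pathlib import Path


def load_processed_dates(csv_path, date_column="scan_date"):
    """Return the set of scan dates already present in an output CSV.

    The column name is hypothetical; adjust it to the real CSV header.
    """
    path = Path(csv_path)
    if not path.exists():
        return set()  # no output yet, so nothing has been processed
    with path.open(newline="") as fh:
        return {
            row[date_column]
            for row in csv.DictReader(fh)
            if row.get(date_column)
        }


def plan_batch(scan_dates, processed_dates, force=False):
    """Split PDFs into (to_process, skipped).

    scan_dates maps each PDF path to the scan date extracted from it.
    With force=True nothing is skipped, mirroring the --force flag.
    """
    to_process, skipped = [], []
    for pdf, date in sorted(scan_dates.items()):
        if not force and date in processed_dates:
            skipped.append(pdf)
        else:
            to_process.append(pdf)
    return to_process, skipped
```

Keeping the planning step separate from the extraction step makes it straightforward to report skip counts before any PDF is parsed, which is the kind of feedback the commit message mentions.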
--- a/README.md
+++ b/README.md
@@ -66,7 +66,16 @@ python dexa_extract.py <PDF_PATH> --height-in <HEIGHT> [--weight-lb <WEIGHT>] [-
 python dexa_extract.py data/pdfs/2025-10-06-scan.pdf --height-in 74 --weight-lb 212 --outdir data/results
 ```

-**Process multiple scans** (appends to existing files):
+**Batch process multiple scans:**
+```bash
+# Process all PDFs in a directory (automatically skips already-processed dates)
+python dexa_extract.py --batch data/pdfs --height-in 74 --outdir data/results
+
+# Force reprocessing all files
+python dexa_extract.py --batch data/pdfs --height-in 74 --outdir data/results --force
+```
+
+**Individual scans** (appends to existing files):
 ```bash
 python dexa_extract.py data/pdfs/scan-2025-01.pdf --height-in 74 --outdir data/results
 python dexa_extract.py data/pdfs/scan-2025-04.pdf --height-in 74 --outdir data/results
@@ -247,10 +256,35 @@ Higher trunk percentage may indicate good core development, while higher leg per

 The script appends data to existing CSV files, making it easy to track changes over time:

-1. Place all your DEXA PDFs in `data/pdfs/`
-2. Process each one with the same output directory
-3. Open `overall.csv` in Excel/Google Sheets to visualize trends
-4. Compare `muscle_balance.csv` to track left/right symmetry improvements
+### Option 1: Batch Processing (Recommended)
+```bash
+# Place all your PDFs in one directory
+data/pdfs/
+├── scan-2025-01-15.pdf
+├── scan-2025-04-20.pdf
+└── scan-2025-10-06.pdf
+
+# Process all at once (automatically skips already-processed dates)
+python dexa_extract.py --batch data/pdfs --height-in 74 --outdir data/results
+
+# Add new scans later - only new ones will be processed
+cp ~/Downloads/scan-2025-12-15.pdf data/pdfs/
+python dexa_extract.py --batch data/pdfs --height-in 74 --outdir data/results
+```
+
+### Option 2: Individual Processing
+```bash
+# Process scans as you get them
+python dexa_extract.py data/pdfs/scan-2025-01.pdf --height-in 74 --outdir data/results
+python dexa_extract.py data/pdfs/scan-2025-04.pdf --height-in 74 --outdir data/results
+python dexa_extract.py data/pdfs/scan-2025-10.pdf --height-in 74 --outdir data/results
+```
+
+### Analyzing Results
+1. Open `overall.csv` in Excel/Google Sheets to visualize trends
+2. Compare `muscle_balance.csv` to track left/right symmetry improvements
+3. Review `summary.md` for readable reports of each scan
+4. Use `overall.json` for programmatic analysis

 ## Privacy & Security

@@ -281,12 +315,12 @@ The script appends data to existing CSV files, making it easy to track changes o

 Contributions welcome! Areas for improvement:

-- [ ] Enhanced error handling and validation
 - [ ] Automatic height detection from PDF
 - [ ] Data visualization/plotting features
 - [ ] GUI interface for non-technical users
-- [ ] Batch processing multiple PDFs at once
 - [ ] Export to additional formats (Excel, SQLite, etc.)
+- [ ] Support for older BodySpec PDF formats
+- [ ] Progress bar for batch processing

 ## License