feat: smart batch processing with skip logic

- Change --batch to accept directory instead of glob pattern
- Automatically skip already-processed scan dates
- Add --force flag to reprocess all files
- Fix date extraction regex to parse from client info line
- Display helpful tips about skipping/forcing
- Better user feedback with skip counts and suggestions

Usage:
  python dexa_extract.py --batch data/pdfs --height-in 74 --outdir data/results

This will process only new scans, skipping any dates already in the output.
Mac DeCourcy 2025-10-06 15:33:05 -07:00
parent d6793e2572
commit b046af5d25
3 changed files with 342 additions and 38 deletions


@@ -66,7 +66,16 @@ python dexa_extract.py <PDF_PATH> --height-in <HEIGHT> [--weight-lb <WEIGHT>] [-
python dexa_extract.py data/pdfs/2025-10-06-scan.pdf --height-in 74 --weight-lb 212 --outdir data/results
```
**Process multiple scans** (appends to existing files):
**Batch process multiple scans:**
```bash
# Process all PDFs in a directory (automatically skips already-processed dates)
python dexa_extract.py --batch data/pdfs --height-in 74 --outdir data/results
# Force reprocessing all files
python dexa_extract.py --batch data/pdfs --height-in 74 --outdir data/results --force
```
**Individual scans** (appends to existing files):
```bash
python dexa_extract.py data/pdfs/scan-2025-01.pdf --height-in 74 --outdir data/results
python dexa_extract.py data/pdfs/scan-2025-04.pdf --height-in 74 --outdir data/results
@@ -247,10 +256,35 @@ Higher trunk percentage may indicate good core development, while higher leg per
The script appends data to existing CSV files, making it easy to track changes over time:
1. Place all your DEXA PDFs in `data/pdfs/`
2. Process each one with the same output directory
3. Open `overall.csv` in Excel/Google Sheets to visualize trends
4. Compare `muscle_balance.csv` to track left/right symmetry improvements
### Option 1: Batch Processing (Recommended)
```bash
# Place all your PDFs in one directory
data/pdfs/
├── scan-2025-01-15.pdf
├── scan-2025-04-20.pdf
└── scan-2025-10-06.pdf
# Process all at once (automatically skips already-processed dates)
python dexa_extract.py --batch data/pdfs --height-in 74 --outdir data/results
# Add new scans later - only new ones will be processed
cp ~/Downloads/scan-2025-12-15.pdf data/pdfs/
python dexa_extract.py --batch data/pdfs --height-in 74 --outdir data/results
```
### Option 2: Individual Processing
```bash
# Process scans as you get them
python dexa_extract.py data/pdfs/scan-2025-01.pdf --height-in 74 --outdir data/results
python dexa_extract.py data/pdfs/scan-2025-04.pdf --height-in 74 --outdir data/results
python dexa_extract.py data/pdfs/scan-2025-10.pdf --height-in 74 --outdir data/results
```
### Analyzing Results
1. Open `overall.csv` in Excel/Google Sheets to visualize trends
2. Compare `muscle_balance.csv` to track left/right symmetry improvements
3. Review `summary.md` for readable reports of each scan
4. Use `overall.json` for programmatic analysis
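For programmatic analysis, the time series in `overall.csv` loads straight into pandas. A minimal sketch — the rows below are made-up placeholders; for real data you would replace them with `pd.read_csv("data/results/overall.csv")`:

```python
import pandas as pd

# Placeholder rows mimicking the overall.csv layout (values are illustrative)
rows = [
    {"MeasuredDate": "2025-01-15", "BodyFat_percent": 29.4, "FFMI": 22.1},
    {"MeasuredDate": "2025-04-20", "BodyFat_percent": 28.6, "FFMI": 22.4},
    {"MeasuredDate": "2025-10-06", "BodyFat_percent": 27.8, "FFMI": 22.9},
]
df = pd.DataFrame(rows)
df["MeasuredDate"] = pd.to_datetime(df["MeasuredDate"])
df = df.sort_values("MeasuredDate")

# Net change between the first and most recent scan
delta = df[["BodyFat_percent", "FFMI"]].iloc[-1] - df[["BodyFat_percent", "FFMI"]].iloc[0]
print(delta.to_dict())
```

From there, `df.plot(x="MeasuredDate", y="BodyFat_percent")` gives a quick trend chart.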
## Privacy & Security
@@ -281,12 +315,12 @@ The script appends data to existing CSV files, making it easy to track changes o
Contributions welcome! Areas for improvement:
- [ ] Enhanced error handling and validation
- [ ] Automatic height detection from PDF
- [ ] Data visualization/plotting features
- [ ] GUI interface for non-technical users
- [ ] Batch processing multiple PDFs at once
- [ ] Export to additional formats (Excel, SQLite, etc.)
- [ ] Support for older BodySpec PDF formats
- [ ] Progress bar for batch processing
## License


@@ -1,18 +0,0 @@
# Results Directory
Your extracted DEXA data will be saved here by default.
## Output Files
When you run the extraction script with `--outdir data/results`, you'll get:
- `overall.csv` - Time-series data (one row per scan)
- `regional.csv` - Regional body composition
- `muscle_balance.csv` - Left/right limb comparison
- `overall.json` - Structured JSON format
- `summary.md` - Human-readable summary
## Note
⚠️ **Result files are gitignored** - They contain your personal health data and won't be committed to version control.


@@ -22,7 +22,6 @@ import re
import sys
from datetime import datetime
from pathlib import Path
import pdfplumber
import pandas as pd
@@ -30,6 +29,21 @@ class ValidationError(Exception):
"""Custom exception for validation errors"""
pass
def get_processed_dates(outdir):
"""Get list of already-processed scan dates from existing CSV"""
overall_csv = Path(outdir) / "overall.csv"
if not overall_csv.exists():
return set()
try:
df = pd.read_csv(overall_csv)
if 'MeasuredDate' in df.columns:
return set(df['MeasuredDate'].dropna().unique())
except Exception:
pass
return set()
def read_pdf_text(pdf_path):
with pdfplumber.open(pdf_path) as pdf:
pages_text = [page.extract_text() or "" for page in pdf.pages]
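The skip logic in batch mode hinges on `get_processed_dates`. A self-contained sketch of that round trip — the temp directory and dates here are illustrative, and the helper is reproduced inline so the snippet runs on its own:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Inline copy of the helper above, for a standalone demo
def get_processed_dates(outdir):
    overall_csv = Path(outdir) / "overall.csv"
    if not overall_csv.exists():
        return set()
    try:
        df = pd.read_csv(overall_csv)
        if "MeasuredDate" in df.columns:
            return set(df["MeasuredDate"].dropna().unique())
    except Exception:
        pass
    return set()

with tempfile.TemporaryDirectory() as outdir:
    # No CSV yet: nothing has been processed
    assert get_processed_dates(outdir) == set()

    # Simulate a previous run having written overall.csv
    pd.DataFrame({"MeasuredDate": ["2025-01-15", "2025-04-20"]}).to_csv(
        Path(outdir) / "overall.csv", index=False
    )
    processed = get_processed_dates(outdir)
    assert "2025-01-15" in processed      # already seen: batch mode would skip it
    assert "2025-10-06" not in processed  # new scan: batch mode would process it
```

Returning an empty set on a missing or unreadable CSV means a fresh output directory simply processes everything.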
@@ -109,7 +123,13 @@ def parse_dexa_pdf(pdf_path):
text = read_pdf_text(pdf_path)
data = {}
data["measured_date"] = find_one(r"Measured Date\s+([\d/]+)", text, cast=str)
# Try to extract date from client info line: "Name Male 9/26/1995 74.0 in. 213.0 lbs. 10/6/2025"
# The last date on the line is the measured date
date_match = re.search(r"(\d{1,2}/\d{1,2}/\d{4})\s*$", text.split('\n')[0])
if not date_match:
# Try finding it in the full text - look for pattern at end of client info lines
date_match = re.search(r"lbs\.\s+(\d{1,2}/\d{1,2}/\d{4})", text)
data["measured_date"] = date_match.group(1) if date_match else None
# First try to extract from SUMMARY RESULTS table (more reliable)
# Pattern: 10/6/2025 27.8% 211.6 58.8 145.4 7.4
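The fallback pattern above anchors on the weight field of the client-info line. A quick check against a made-up example line:

```python
import re

# Illustrative client-info line (name and values are made up); the measured
# date is the last date on the line, directly after the weight in lbs.
line = "Jane Doe Female 9/26/1995 74.0 in. 213.0 lbs. 10/6/2025"
m = re.search(r"lbs\.\s+(\d{1,2}/\d{1,2}/\d{4})", line)
print(m.group(1))  # -> 10/6/2025
```

Anchoring on `lbs.` avoids accidentally grabbing the birth date earlier in the line.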
@@ -300,6 +320,196 @@ def append_markdown(path, md_text):
with open(path, mode) as f:
f.write(md_text.strip() + "\n\n")
def process_single_pdf(pdf_path, height_in, weight_lb, outdir):
"""Process a single PDF file and return success status"""
try:
# Validate PDF file
pdf_file = Path(pdf_path)
if not pdf_file.exists():
print(f" ❌ Skipping {pdf_path}: File not found", file=sys.stderr)
return False
if not pdf_file.is_file():
print(f" ❌ Skipping {pdf_path}: Not a file", file=sys.stderr)
return False
if pdf_file.suffix.lower() != '.pdf':
print(f" ❌ Skipping {pdf_path}: Not a PDF", file=sys.stderr)
return False
print(f"\n📄 Processing: {pdf_file.name}")
# Parse PDF
d = parse_dexa_pdf(pdf_path)
# Check if critical data was extracted
if d.get("body_fat_percent") is None or d.get("total_mass_lb") is None:
print(f" ⚠️ Warning: Missing critical data from {pdf_file.name}", file=sys.stderr)
if d.get("body_fat_percent") is None:
print(" - Body Fat % not found", file=sys.stderr)
if d.get("total_mass_lb") is None:
print(" - Total Mass not found", file=sys.stderr)
# Process data
measured_date_raw = d.get("measured_date") or datetime.now().strftime("%m/%d/%Y")
measured_date = convert_date_to_iso(measured_date_raw)
total_mass, derived = compute_derived(d, height_in=height_in, weight_lb=weight_lb)
# Write output files (same as before)
overall_cols = [
"MeasuredDate","Height_in","Height_ft_in","Weight_lb_Input","DEXA_TotalMass_lb","BodyFat_percent",
"LeanMass_percent","FatMass_lb","LeanSoftTissue_lb","BoneMineralContent_lb","FatFreeMass_lb",
"BMI","FFMI","FMI","LST_Index","ALM_lb","SMI","VAT_Mass_lb","VAT_Volume_in3","VAT_Index",
"BMDI","Android_percent","Gynoid_percent","AG_Ratio","Trunk_to_Limb_Fat_Ratio",
"Arms_Lean_pct","Legs_Lean_pct","Trunk_Lean_pct","Arm_Symmetry_Index","Leg_Symmetry_Index",
"Adjusted_Body_Weight_lb","RMR_cal_per_day"
]
overall_row = {
"MeasuredDate": measured_date,
"Height_in": derived["height_in"],
"Height_ft_in": derived["height_ft_in"],
"Weight_lb_Input": derived["weight_input_lb"],
"DEXA_TotalMass_lb": round(total_mass, 1),
"BodyFat_percent": d.get("body_fat_percent"),
"LeanMass_percent": derived.get("lean_mass_percent"),
"FatMass_lb": d.get("fat_mass_lb"),
"LeanSoftTissue_lb": d.get("lean_soft_tissue_lb"),
"BoneMineralContent_lb": d.get("bmc_lb"),
"FatFreeMass_lb": derived.get("fat_free_mass_lb"),
"BMI": derived["bmi"],
"FFMI": derived.get("ffmi"),
"FMI": derived.get("fmi"),
"LST_Index": derived.get("lsti"),
"ALM_lb": derived.get("alm_lb"),
"SMI": derived.get("smi"),
"VAT_Mass_lb": d.get("vat_mass_lb"),
"VAT_Volume_in3": d.get("vat_volume_in3"),
"VAT_Index": derived.get("vat_index"),
"BMDI": derived.get("bmdi"),
"Android_percent": d.get("android_percent"),
"Gynoid_percent": d.get("gynoid_percent"),
"AG_Ratio": d.get("ag_ratio"),
"Trunk_to_Limb_Fat_Ratio": derived.get("trunk_to_limb_fat_ratio"),
"Arms_Lean_pct": derived.get("arms_lean_pct"),
"Legs_Lean_pct": derived.get("legs_lean_pct"),
"Trunk_Lean_pct": derived.get("trunk_lean_pct"),
"Arm_Symmetry_Index": derived.get("arm_symmetry_index"),
"Leg_Symmetry_Index": derived.get("leg_symmetry_index"),
"Adjusted_Body_Weight_lb": derived.get("adjusted_body_weight_lb"),
"RMR_cal_per_day": d.get("rmr_cal_per_day"),
}
write_or_append_csv(os.path.join(outdir, "overall.csv"), overall_row, overall_cols)
# Regional table
regional_cols = ["Region","FatPercent","TotalMass_lb","FatTissue_lb","LeanTissue_lb","BMC_lb"]
reg_rows = []
for name, r in d.get("regional", {}).items():
reg_rows.append({
"Region": name,
"FatPercent": r["fat_percent"],
"TotalMass_lb": r["total_mass_lb"],
"FatTissue_lb": r["fat_tissue_lb"],
"LeanTissue_lb": r["lean_tissue_lb"],
"BMC_lb": r["bmc_lb"],
})
regional_path = os.path.join(outdir, "regional.csv")
if os.path.exists(regional_path):
pd.DataFrame(reg_rows).to_csv(regional_path, mode="a", header=False, index=False)
else:
pd.DataFrame(reg_rows).to_csv(regional_path, index=False)
# Muscle balance
mb_cols = ["Region","FatPercent","TotalMass_lb","FatMass_lb","LeanMass_lb","BMC_lb"]
mb_rows = []
for name, r in d.get("muscle_balance", {}).items():
mb_rows.append({
"Region": name,
"FatPercent": r["fat_percent"],
"TotalMass_lb": r["total_mass_lb"],
"FatMass_lb": r["fat_mass_lb"],
"LeanMass_lb": r["lean_mass_lb"],
"BMC_lb": r["bmc_lb"],
})
mb_path = os.path.join(outdir, "muscle_balance.csv")
if os.path.exists(mb_path):
pd.DataFrame(mb_rows).to_csv(mb_path, mode="a", header=False, index=False)
else:
pd.DataFrame(mb_rows).to_csv(mb_path, index=False)
# JSON
regional_array = [
{"region": name, **data}
for name, data in d.get("regional", {}).items()
]
muscle_balance_array = [
{"region": name, **data}
for name, data in d.get("muscle_balance", {}).items()
]
overall_json = {
"measured_date": measured_date,
"anthropometrics": {
"height_in": derived["height_in"],
"height_ft_in": derived["height_ft_in"],
"weight_input_lb": derived["weight_input_lb"],
"dexa_total_mass_lb": round(total_mass, 1),
"adjusted_body_weight_lb": derived.get("adjusted_body_weight_lb"),
"bmi": derived["bmi"]
},
"composition": {
"body_fat_percent": d.get("body_fat_percent"),
"lean_mass_percent": derived.get("lean_mass_percent"),
"fat_mass_lb": d.get("fat_mass_lb"),
"lean_soft_tissue_lb": d.get("lean_soft_tissue_lb"),
"bone_mineral_content_lb": d.get("bmc_lb"),
"fat_free_mass_lb": derived.get("fat_free_mass_lb"),
"derived_indices": {
"ffmi": derived.get("ffmi"),
"fmi": derived.get("fmi"),
"lsti": derived.get("lsti"),
"alm_lb": derived.get("alm_lb"),
"smi": derived.get("smi"),
"bmdi": derived.get("bmdi")
}
},
"regional": regional_array,
"regional_analysis": {
"trunk_to_limb_fat_ratio": derived.get("trunk_to_limb_fat_ratio"),
"lean_mass_distribution": {
"arms_percent": derived.get("arms_lean_pct"),
"legs_percent": derived.get("legs_lean_pct"),
"trunk_percent": derived.get("trunk_lean_pct")
}
},
"muscle_balance": muscle_balance_array,
"symmetry_indices": {
"arm_symmetry_index": derived.get("arm_symmetry_index"),
"leg_symmetry_index": derived.get("leg_symmetry_index")
},
"supplemental": {
"android_percent": d.get("android_percent"),
"gynoid_percent": d.get("gynoid_percent"),
"ag_ratio": d.get("ag_ratio"),
"vat": {
"mass_lb": d.get("vat_mass_lb"),
"volume_in3": d.get("vat_volume_in3"),
"vat_index": derived.get("vat_index")
},
"rmr_cal_per_day": d.get("rmr_cal_per_day")
},
"bone_density": d.get("bone_density", {})
}
write_or_append_json(os.path.join(outdir, "overall.json"), overall_json)
# Markdown summary
md_text = make_markdown(measured_date, d, derived, total_mass)
append_markdown(os.path.join(outdir, "summary.md"), md_text)
print(f"{pdf_file.name}: Body fat {d.get('body_fat_percent')}%, FFMI {derived.get('ffmi')}")
return True
except Exception as e:
print(f" ❌ Error processing {pdf_path}: {e}", file=sys.stderr)
return False
def make_markdown(measured_date, d, derived, total_mass):
lines = []
lines.append(f"# DEXA Summary — {measured_date}")
@@ -332,24 +542,26 @@ def make_markdown(measured_date, d, derived, total_mass):
def main():
ap = argparse.ArgumentParser(
description="BodySpec Insights - Extract and analyze body composition data from BodySpec DEXA scan PDFs",
epilog="Example: python dexa_extract.py scan.pdf --height-in 74 --weight-lb 212 --outdir ./data/results"
epilog="Examples:\n"
" Single: python dexa_extract.py scan.pdf --height-in 74 --outdir ./data/results\n"
" Batch: python dexa_extract.py --batch data/pdfs --height-in 74 --outdir ./data/results",
formatter_class=argparse.RawDescriptionHelpFormatter
)
ap.add_argument("pdf", help="Path to BodySpec DEXA report PDF")
ap.add_argument("pdf", nargs="?", help="Path to BodySpec DEXA report PDF (not used with --batch)")
ap.add_argument("--batch", metavar="DIR", help="Process all PDFs in directory (skips already-processed dates)")
ap.add_argument("--height-in", type=float, required=True, help="Height in inches (e.g., 6'2\" = 74)")
ap.add_argument("--weight-lb", type=float, help="Body weight in lbs (optional; used if DEXA total mass missing)")
ap.add_argument("--outdir", default="dexa_out", help="Output directory (default: dexa_out)")
ap.add_argument("--force", action="store_true", help="Reprocess all files, even if already in output")
args = ap.parse_args()
# Validate PDF file exists
pdf_file = Path(args.pdf)
if not pdf_file.exists():
print(f"❌ Error: PDF file not found: {args.pdf}", file=sys.stderr)
# Check that either pdf or --batch is provided
if not args.pdf and not args.batch:
print("❌ Error: Must provide either a PDF file or --batch directory", file=sys.stderr)
ap.print_help()
sys.exit(1)
if not pdf_file.is_file():
print(f"❌ Error: Path is not a file: {args.pdf}", file=sys.stderr)
sys.exit(1)
if pdf_file.suffix.lower() != '.pdf':
print(f"❌ Error: File is not a PDF: {args.pdf}", file=sys.stderr)
if args.pdf and args.batch:
print("❌ Error: Cannot use both PDF file and --batch. Choose one.", file=sys.stderr)
sys.exit(1)
# Validate height
@@ -362,12 +574,88 @@ def main():
print(f"❌ Error: Weight seems unrealistic: {args.weight_lb} lbs (expected 50-500 lbs)", file=sys.stderr)
sys.exit(1)
# Create output directory
try:
ensure_outdir(args.outdir)
except PermissionError:
print(f"❌ Error: Cannot create output directory: {args.outdir} (permission denied)", file=sys.stderr)
sys.exit(1)
# Batch mode
if args.batch:
batch_dir = Path(args.batch)
if not batch_dir.exists():
print(f"❌ Error: Directory not found: {args.batch}", file=sys.stderr)
sys.exit(1)
if not batch_dir.is_dir():
print(f"❌ Error: Not a directory: {args.batch}", file=sys.stderr)
sys.exit(1)
# Find all PDF files in directory
pdf_files = sorted(batch_dir.glob("*.pdf"))
if not pdf_files:
print(f"❌ Error: No PDF files found in: {args.batch}", file=sys.stderr)
sys.exit(1)
# Get already-processed dates
processed_dates = set()
if not args.force:
processed_dates = get_processed_dates(args.outdir)
if processed_dates:
print(f"📋 Found {len(processed_dates)} already-processed scan(s) in {args.outdir}")
print(f"📦 Batch mode: Found {len(pdf_files)} PDF file(s) in {args.batch}")
print(f"📂 Output directory: {args.outdir}\n")
success_count = 0
fail_count = 0
skip_count = 0
for pdf_file in pdf_files:
# Quick check: try to extract date and see if already processed
if not args.force and processed_dates:
try:
d_temp = parse_dexa_pdf(str(pdf_file))
measured_date_raw = d_temp.get("measured_date")
if measured_date_raw:
measured_date = convert_date_to_iso(measured_date_raw)
if measured_date in processed_dates:
print(f"\n⏭️ Skipping: {pdf_file.name} (date {measured_date} already processed)")
skip_count += 1
continue
except Exception:
pass # If we can't extract date, try to process anyway
if process_single_pdf(str(pdf_file), args.height_in, args.weight_lb, args.outdir):
success_count += 1
else:
fail_count += 1
print(f"\n{'='*60}")
print(f"✅ Batch complete: {success_count} succeeded, {skip_count} skipped, {fail_count} failed")
print(f"📁 Results saved to: {args.outdir}")
if skip_count > 0:
print("  💡 Tip: Use --force to reprocess skipped scans")
if fail_count > 0:
sys.exit(1)
return
# Single file mode
pdf_file = Path(args.pdf)
if not pdf_file.exists():
print(f"❌ Error: PDF file not found: {args.pdf}", file=sys.stderr)
sys.exit(1)
if not pdf_file.is_file():
print(f"❌ Error: Path is not a file: {args.pdf}", file=sys.stderr)
sys.exit(1)
if pdf_file.suffix.lower() != '.pdf':
print(f"❌ Error: File is not a PDF: {args.pdf}", file=sys.stderr)
sys.exit(1)
print(f"📄 Reading PDF: {args.pdf}")
try: