How It Works

pipguard's design has one rule: code must never execute during scanning.

Architecture

pipguard install X
       │
       ▼
pip download --prefer-binary X    ← downloads wheel/sdist, no code execution
       │
       ▼
Detect sdist fallback             ← exit 2 if sdist detected (unless --allow-sdist)
       │
       ▼
Extract archive (zipfile/tarfile) ← never executes code
       │
       ▼
AST scan all .py files            ← parallel, ThreadPoolExecutor
  setup.py, pyproject.toml, *.pth ← CRITICAL/HIGH scope
  all other .py                   ← MEDIUM/LOW scope
       │
       ▼
Risk scoring:
  CRITICAL → block (exit 1)
  HIGH     → block (exit 1)
  MEDIUM   → warn + confirm
  LOW      → warn + confirm
  CLEAN    → install silently
       │
       ▼
pip install --no-index            ← installs FROM SCANNED FILES (TOCTOU-safe)
    --find-links /tmp/pipguard-XX
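
The risk-scoring gate in the diagram can be sketched in a few lines. This is an illustrative sketch, not pipguard's actual code: the `Severity` enum and the `decide` function are hypothetical names, and the only behavior taken from the diagram is that CRITICAL and HIGH block (exit 1), MEDIUM and LOW warn and ask for confirmation, and a clean scan installs silently.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale, ordered so the worst finding is the max."""
    CLEAN = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def decide(severities):
    """Return the action implied by the worst finding in a scan."""
    worst = max(severities, default=Severity.CLEAN)
    if worst >= Severity.HIGH:
        return "block"      # CRITICAL or HIGH: refuse to install (exit 1)
    if worst >= Severity.LOW:
        return "confirm"    # MEDIUM or LOW: warn, then ask the user
    return "install"        # no findings: install silently


if __name__ == "__main__":
    print(decide([Severity.LOW, Severity.HIGH]))  # block
```

Taking the maximum over all findings means a single HIGH finding blocks the install no matter how many lower-severity findings accompany it.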

Why Pre-Install?

Classical security tools (pip-audit, Safety, GuardDog) work post-hoc — they check installed packages against known-bad signature databases. This means:

Zero-day blind spot — a new attack not yet in the database walks straight through
Race condition — the malicious code has already run by the time the tool checks

pipguard reverses the order: it scans before anything is installed and asks whether the code does something that any pip install should be allowed to do.

Regardless of whether the package is on any watchlist, the answer to "reads ~/.ssh/id_rsa and sends it over a network" is always no.

TOCTOU Safety

A subtle attack vector is the time-of-check to time-of-use (TOCTOU) swap: the scanner inspects a clean file, and an attacker replaces it with a malicious one before it is installed.

pipguard counters this by:

1. Downloading the archive to a temp directory
2. Scanning the files in place in that temp directory
3. Running pip install --no-index --find-links /tmp/pipguard-XX, which installs the exact files that were scanned

The archive is never re-downloaded or re-extracted after scanning.
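
The two pip invocations can be sketched as follows. This is a minimal illustration under stated assumptions: `download_cmd` and `install_cmd` are hypothetical helper names, and calling pip through `sys.executable -m pip` is a choice made for this sketch. The flags themselves (`--prefer-binary`, `--dest`, `--no-index`, `--find-links`) are standard pip options. The point is that both commands refer to the same directory, so the install step can only see files that were already downloaded and scanned.

```python
import sys
from pathlib import Path
from typing import List


def download_cmd(package: str, dest: Path) -> List[str]:
    """Build the command that fetches the package into dest."""
    return [sys.executable, "-m", "pip", "download",
            "--prefer-binary", "--dest", str(dest), package]


def install_cmd(package: str, dest: Path) -> List[str]:
    """Build the command that installs only from the scanned directory."""
    return [sys.executable, "-m", "pip", "install",
            "--no-index",                 # never contact a package index
            "--find-links", str(dest),    # resolve from local files only
            package]


if __name__ == "__main__":
    scan_dir = Path("/tmp/pipguard-XX")   # placeholder path from the text
    print(" ".join(download_cmd("requests", scan_dir)))
    print(" ".join(install_cmd("requests", scan_dir)))
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Keeping the commands as argument lists avoids shell interpolation of the package name.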

AST Scanning

pipguard uses Python's built-in ast module — no third-party dependencies — to parse .py files into abstract syntax trees and walk the nodes looking for dangerous patterns.
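
A minimal sketch of this kind of AST walk is shown below. It is not pipguard's scanner: the `CallScanner` class, the `dotted_name` helper, and the finding labels are hypothetical, and the sketch only recognizes calls written with their full dotted name (it would miss `from os import system`). It does show the key property: `ast.parse` builds a syntax tree from text and never runs the code being inspected.

```python
import ast


def dotted_name(node):
    """Return e.g. 'os.system' for a Name/Attribute chain, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


class CallScanner(ast.NodeVisitor):
    """Collect (line, label) pairs for a few suspicious call patterns."""

    def __init__(self):
        self.findings = []

    def visit_Call(self, node):
        name = dotted_name(node.func)
        if name in {"eval", "exec"} and node.args:
            inner = node.args[0]
            if (isinstance(inner, ast.Call)
                    and dotted_name(inner.func) == "base64.b64decode"):
                self.findings.append((node.lineno, "obfuscated eval"))
        elif name == "os.system":
            self.findings.append((node.lineno, "shell execution"))
        elif name == "subprocess.run":
            shell = any(
                kw.arg == "shell"
                and isinstance(kw.value, ast.Constant)
                and kw.value.value is True
                for kw in node.keywords
            )
            label = "subprocess with shell=True" if shell else "subprocess execution"
            self.findings.append((node.lineno, label))
        self.generic_visit(node)  # keep walking into nested calls


def scan_source(source, filename="<unknown>"):
    """Parse source text and return findings; the source is never executed."""
    scanner = CallScanner()
    scanner.visit(ast.parse(source, filename=filename))
    return scanner.findings
```

Calling `scan_source` on the text of a `setup.py` returns a list of line numbers and labels that a later stage could map to severities.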

What gets flagged

CRITICAL

Pattern	Example
`.pth` file with executable Python	`import os; os.system(...)` in `.pth`
Obfuscated eval	`eval(base64.b64decode(...))`
Network in `setup.py` / install hooks	`urllib.request.urlopen(...)` in `setup.py`
Shell/subprocess execution in install hooks	`os.system(...)`, `subprocess.run(..., shell=True)`

HIGH

Pattern	Example
Non-ASCII character in package name	`bоto3` with Cyrillic `о` (possible homoglyph/typosquatting attack)
Credential path read in install hooks	`open('~/.ssh/id_rsa')` in `setup.py`
Subprocess execution in install hooks	`subprocess.run([...])`

MEDIUM

Pattern	Example
Binary-only wheel (no Python source)	Wheel with only `.so` / `.pyd` / `.dylib` files
Network in runtime code	`urllib.request.urlopen(...)` in `utils.py`
Sensitive env var access	`os.environ.get('AWS_SECRET_ACCESS_KEY')`
Large source file over 1 MB	scanner emits confidence-reduction warning
Binary IOC string hit	`.so` contains `https://...` or `/bin/sh`

LOW

Pattern	Example
Compiled binary extension in mixed wheel	`.so` / `.pyd` / `.dylib` alongside `.py` source
Dynamic imports	`importlib.import_module(name)`
`__import__()`	`__import__(variable)`

Binary files are also scanned with lightweight indicator-of-compromise (IOC) string matching over their first 2 MB to surface obvious credential-path or exfiltration indicators.
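
The binary string matching can be sketched as a substring search over a bounded prefix of the file. The indicator list below is a hypothetical sample built from the examples in the table (`https://`, `/bin/sh`) plus one credential path. The real list is not given here, and treating "2MB" as 2 × 1024 × 1024 bytes is also an assumption.

```python
MAX_SCAN_BYTES = 2 * 1024 * 1024  # assumed meaning of "first 2MB"

# Hypothetical sample indicators; the actual list is not documented here.
INDICATORS = (b"https://", b"/bin/sh", b".ssh/id_rsa")


def scan_binary_bytes(data):
    """Return the indicators found in the first MAX_SCAN_BYTES of data."""
    window = data[:MAX_SCAN_BYTES]
    return [ioc for ioc in INDICATORS if ioc in window]


def scan_binary_file(path):
    """Read at most MAX_SCAN_BYTES from path and scan them."""
    with open(path, "rb") as fh:
        return scan_binary_bytes(fh.read(MAX_SCAN_BYTES))
```

Bounding the read keeps the scan cheap on large binaries, at the cost of missing strings that appear later in the file.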

Homoglyph / Typosquatting Detection

Before scanning archive contents, pipguard checks the package name itself for non-ASCII characters (e.g. Cyrillic о substituted for Latin o). Any such character produces a HIGH finding regardless of what the package contains:

bоto3   ← Cyrillic 'о' (U+043E) in position 1
↑ visually identical to boto3, but a different string

Package names are also NFKC-normalized before allowlist comparison, so a homoglyph name cannot bypass the allowlist by mimicking a trusted package.
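
Both checks can be illustrated with the standard `unicodedata` module. The function names here are hypothetical, and lowercasing after normalization is an assumption of this sketch. Note that NFKC folds compatibility forms such as fullwidth letters into their ASCII equivalents but does not map Cyrillic `о` to Latin `o`, so in this sketch it is the non-ASCII check that catches the example above.

```python
import unicodedata


def non_ascii_chars(name):
    """List (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(name)
        if ord(ch) > 127
    ]


def normalize_for_allowlist(name):
    """NFKC-normalize and lowercase a package name before comparison."""
    return unicodedata.normalize("NFKC", name).lower()


if __name__ == "__main__":
    suspicious = "b\u043eto3"  # Cyrillic small letter o at index 1
    print(non_ascii_chars(suspicious))
    print(normalize_for_allowlist(suspicious) == "boto3")  # False
```

A name that still differs from `boto3` after normalization does not match an allowlist entry for `boto3`, even though it looks identical on screen.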

Binary Extension Scanning

pipguard detects compiled binary extension files (.so, .pyd, .dylib) in extracted wheels. Static AST scanning cannot inspect these files, so pipguard flags them explicitly:

Mixed wheel (.py source + binary extensions): each extension file generates a LOW finding — the scanner covered the Python parts but is blind to any payload in compiled code.
Binary-only wheel (no .py source at all): a single MEDIUM finding is emitted, and the confirmation gate fires. pipguard's core scan promise cannot be fulfilled for packages with no Python source.
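
The two cases can be sketched as a function over the list of file paths in an extracted wheel. The function name, the tuple shape of a finding, and the message strings are hypothetical. The severities and the one-finding-per-extension versus single-finding behavior follow the description above.

```python
from pathlib import PurePosixPath

BINARY_SUFFIXES = {".so", ".pyd", ".dylib"}


def binary_findings(members):
    """Classify a wheel by its member paths and return (severity, message) pairs."""
    binaries = [m for m in members if PurePosixPath(m).suffix in BINARY_SUFFIXES]
    has_python = any(PurePosixPath(m).suffix == ".py" for m in members)
    if not binaries:
        return []                       # nothing opaque to the AST scan
    if not has_python:
        # Binary-only wheel: one finding for the whole package
        return [("MEDIUM", "binary-only wheel (no Python source)")]
    # Mixed wheel: one finding per compiled extension
    return [("LOW", f"compiled extension: {m}") for m in binaries]
```

For a mixed wheel the AST findings from the `.py` files would be reported alongside these LOW findings.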

Seed Allowlist

Some packages legitimately access credentials as part of their core purpose. pipguard ships with a seed allowlist that downgrades findings for these packages from HIGH to MEDIUM (CRITICAL findings are never downgraded):

keyring, keyrings.alt, boto3, botocore, awscli, paramiko, google-auth, google-cloud-storage, google-cloud-bigquery, google-cloud-core, azure-identity
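
The downgrade rule can be sketched as a small lookup. The `adjust_severity` function is a hypothetical name, severities are plain strings for brevity, and applying the downgrade to every HIGH finding (not only credential-related ones) is a simplifying assumption of this sketch.

```python
SEED_ALLOWLIST = frozenset({
    "keyring", "keyrings.alt", "boto3", "botocore", "awscli", "paramiko",
    "google-auth", "google-cloud-storage", "google-cloud-bigquery",
    "google-cloud-core", "azure-identity",
})


def adjust_severity(package, severity):
    """Downgrade HIGH to MEDIUM for allowlisted packages; leave the rest alone."""
    if severity == "HIGH" and package.lower() in SEED_ALLOWLIST:
        return "MEDIUM"
    return severity  # CRITICAL is never reduced, even for allowlisted packages


if __name__ == "__main__":
    print(adjust_severity("boto3", "HIGH"))      # MEDIUM
    print(adjust_severity("boto3", "CRITICAL"))  # CRITICAL
```

Because MEDIUM still triggers the confirmation gate, an allowlisted package is not installed silently. It moves from blocked to warn-and-confirm.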


Limitations

Phase 1 scope

These are known limitations of the current static-analysis approach.

Obfuscation — multi-layer obfuscation (e.g. exec(compile(...)) wrapped multiple times) may evade detection
C extensions — .so / .pyd binaries are opaque to AST scanning; flagged as LOW (mixed) or MEDIUM (binary-only) to surface the blind spot
Python/pip only — no npm, cargo, or go module support
Phase 2 (in design) — seccomp/eBPF sandbox for capability-level interception at runtime