How It Works
pipguard's design has one rule: code must never execute during scanning.
Architecture
pipguard install X
│
▼
pip download --prefer-binary X ← downloads wheel/sdist, no code execution
│
▼
Detect sdist fallback ← exit 2 if sdist detected (unless --allow-sdist)
│
▼
Extract archive (zipfile/tarfile) ← never executes code
│
▼
AST scan all .py files ← parallel, ThreadPoolExecutor
setup.py, pyproject.toml, *.pth ← CRITICAL/HIGH scope
all other .py ← MEDIUM/LOW scope
│
▼
Risk scoring:
CRITICAL → block (exit 1)
HIGH → block (exit 1)
MEDIUM → warn + confirm
LOW → warn + confirm
CLEAN → install silently
│
▼
pip install --no-index ← installs FROM SCANNED FILES (TOCTOU-safe)
--find-links /tmp/pipguard-XX
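The two decision points in the diagram (the sdist check and the risk gate) can be sketched as follows. This is a minimal illustration, not pipguard's actual code: the `Risk` enum, `is_sdist`, and `decide` are hypothetical names, and treating every non-wheel archive as an sdist is an assumption.

```python
from enum import IntEnum
from pathlib import Path


class Risk(IntEnum):
    """Severity levels from the diagram, ordered so they can be compared."""
    CLEAN = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def is_sdist(archive: Path) -> bool:
    """Assumption: any downloaded archive that is not a wheel is an sdist."""
    return archive.suffix != ".whl"


def decide(risks: list[Risk]) -> str:
    """Map the worst finding to an action, following the diagram."""
    worst = max(risks, default=Risk.CLEAN)
    if worst >= Risk.HIGH:
        return "block"    # CRITICAL or HIGH: exit 1
    if worst >= Risk.LOW:
        return "confirm"  # MEDIUM or LOW: warn, then ask the user
    return "install"      # nothing found: install silently
```

Using an ordered enum keeps the gate to a single `max()` over all findings, so one CRITICAL finding among many LOW ones still blocks the install.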
Why Pre-Install?
Classical security tools (pip-audit, Safety, GuardDog) work post-hoc — they check installed packages against known-bad signature databases. This means:
- Zero-day blind spot — a new attack not yet in the database walks straight through
- Race condition — the malicious code has already run by the time the tool checks
pipguard reverses the order. It asks: does this code do something that no pip install should ever be allowed to do?
Whether or not the package is on any watchlist, reading ~/.ssh/id_rsa and sending it over a network is never legitimate install-time behavior.
TOCTOU Safety
A subtle attack vector: scan a clean file, then swap it for a malicious one before install.
pipguard counters this by:
- Downloading the archive to a temp directory
- Scanning the files in place in that temp directory
- Running pip install --no-index --find-links /tmp/pipguard-XX, which installs the exact files that were scanned
The archive is never re-downloaded or re-extracted after scanning.
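The three steps above might look roughly like this. It is a sketch under stated assumptions: `download_cmd`, `install_cmd`, and `guarded_install` are hypothetical names, and `scan` stands in for whatever scanner is used (a callable that takes the download directory and returns `True` when it is safe). The pip flags themselves (`--prefer-binary`, `--dest`, `--no-index`, `--find-links`) are real.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def download_cmd(requirement: str, dest: Path) -> list[str]:
    """Build the `pip download` command that fetches archives into dest."""
    return [sys.executable, "-m", "pip", "download",
            "--prefer-binary", "--dest", str(dest), requirement]


def install_cmd(requirement: str, dest: Path) -> list[str]:
    """Build the `pip install` command that only sees the scanned directory."""
    # --no-index forbids any index lookup; --find-links points pip at the
    # directory holding the archives that were just scanned.
    return [sys.executable, "-m", "pip", "install",
            "--no-index", "--find-links", str(dest), requirement]


def guarded_install(requirement: str, scan) -> int:
    """Download, scan, and install from one temp directory.

    `scan` is a hypothetical callable: Path -> bool (True means safe).
    """
    with tempfile.TemporaryDirectory(prefix="pipguard-") as tmp:
        dest = Path(tmp)
        subprocess.run(download_cmd(requirement, dest), check=True)
        if not scan(dest):
            return 1  # blocked: nothing is installed
        return subprocess.run(install_cmd(requirement, dest)).returncode
```

Because the install step is restricted to the same directory that was scanned, pip cannot fetch a different archive between the scan and the install.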
AST Scanning
pipguard uses Python's built-in ast module — no third-party dependencies — to parse .py files into abstract syntax trees and walk the nodes looking for dangerous patterns.
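As an illustration of the approach, the sketch below walks a syntax tree with `ast.NodeVisitor` and records a few risky calls. The class and function names are hypothetical, and the patterns are a small subset chosen for the example; pipguard's real rule set is not shown here.

```python
import ast


class DangerVisitor(ast.NodeVisitor):
    """Collect (line, description) pairs for a few risky call patterns."""

    def __init__(self) -> None:
        self.findings: list[tuple[int, str]] = []

    def visit_Call(self, node: ast.Call) -> None:
        func = node.func
        if isinstance(func, ast.Name) and func.id in {"eval", "exec"}:
            self.findings.append((node.lineno, f"{func.id}() call"))
        elif (isinstance(func, ast.Attribute) and func.attr == "system"
              and isinstance(func.value, ast.Name) and func.value.id == "os"):
            self.findings.append((node.lineno, "os.system() call"))
        self.generic_visit(node)  # keep walking into nested calls


def scan_source(source: str) -> list[tuple[int, str]]:
    """Parse source into an AST and report findings; nothing is executed."""
    visitor = DangerVisitor()
    visitor.visit(ast.parse(source))
    return visitor.findings
```

`ast.parse` only builds a tree from the text, so the scanned file is never imported or run.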
What gets flagged
CRITICAL

| Pattern | Example |
|---|---|
| .pth file with executable Python | import os; os.system(...) in .pth |
| Obfuscated eval | eval(base64.b64decode(...)) |
| Network in setup.py / install hooks | urllib.request.urlopen(...) in setup.py |
| Shell/subprocess execution in install hooks | os.system(...), subprocess.run(..., shell=True) |

HIGH

| Pattern | Example |
|---|---|
| Non-ASCII character in package name | bоto3 with Cyrillic о (possible homoglyph/typosquatting attack) |
| Credential path read in install hooks | open('~/.ssh/id_rsa') in setup.py |
| Subprocess execution in install hooks | subprocess.run([...]) |

MEDIUM

| Pattern | Example |
|---|---|
| Binary-only wheel (no Python source) | Wheel with only .so / .pyd / .dylib files |
| Network in runtime code | urllib.request.urlopen(...) in utils.py |
| Sensitive env var access | os.environ.get('AWS_SECRET_ACCESS_KEY') |
| Large source file over 1MB | scanner emits confidence-reduction warning |
| Binary IOC string hit | .so contains https://... or /bin/sh |

LOW

| Pattern | Example |
|---|---|
| Compiled binary extension in mixed wheel | .so / .pyd / .dylib alongside .py source |
| Dynamic imports | importlib.import_module(name) |
| __import__() | __import__(variable) |
Binary files are also scanned with lightweight IOC string matching (first 2MB) to surface obvious credential-path or exfiltration indicators.
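A lightweight string match of this kind could be sketched as below. The pattern list and the names `find_iocs` and `scan_binary` are illustrative assumptions; only the 2MB window and the example indicators (`/bin/sh`, `https://`, credential paths) come from the text above.

```python
from pathlib import Path

SCAN_LIMIT = 2 * 1024 * 1024  # only the first 2MB of each binary is inspected

# Illustrative indicators only; the real list is not documented here.
IOC_PATTERNS = (b"/bin/sh", b"https://", b".ssh/id_rsa")


def find_iocs(data: bytes) -> list[str]:
    """Return the indicator strings found in the first SCAN_LIMIT bytes."""
    head = data[:SCAN_LIMIT]
    return [p.decode("ascii") for p in IOC_PATTERNS if p in head]


def scan_binary(path: Path) -> list[str]:
    """Read at most SCAN_LIMIT bytes from a binary file and match indicators."""
    with path.open("rb") as fh:
        return find_iocs(fh.read(SCAN_LIMIT))
```

Plain substring matching is cheap and never loads or executes the binary, at the cost of missing anything encoded, encrypted, or located past the scan window.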
Homoglyph / Typosquatting Detection
Before scanning archive contents, pipguard checks the package name itself for
non-ASCII characters (e.g. Cyrillic о substituted for Latin o). Any such
character produces a HIGH finding regardless of what the package contains.
Package names are also NFKC-normalized before allowlist comparison, so a homoglyph name cannot bypass the allowlist by mimicking a trusted package.
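Both checks can be sketched with the standard library's `unicodedata` module. The function names are hypothetical, and the lowercasing in `allowlist_key` is an added assumption. Note that NFKC folds compatibility forms such as fullwidth letters into ASCII but does not map Cyrillic letters to Latin ones, so in this sketch it is the non-ASCII check that catches the Cyrillic case.

```python
import unicodedata


def non_ascii_chars(name: str) -> list[str]:
    """Describe every non-ASCII character in a package name."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
        for ch in name
        if ord(ch) > 127
    ]


def allowlist_key(name: str) -> str:
    """Normalize a name before comparing it with allowlist entries."""
    # Lowercasing is an assumption of this sketch, not taken from the docs.
    return unicodedata.normalize("NFKC", name).lower()


# "b\u043eto3" is boto3 with a Cyrillic small letter o in second position.
print(non_ascii_chars("b\u043eto3"))
```

An empty list from `non_ascii_chars` means the name is pure ASCII and produces no homoglyph finding.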
Binary Extension Scanning
pipguard detects compiled binary extension files (.so, .pyd, .dylib) in
extracted wheels. Static AST scanning cannot inspect these files, so pipguard
flags them explicitly:
- Mixed wheel (.py source + binary extensions): each extension file generates a LOW finding. The scanner covered the Python parts but is blind to any payload in compiled code.
- Binary-only wheel (no .py source at all): a single MEDIUM finding is emitted, and the confirmation gate fires. pipguard's core scan promise cannot be fulfilled for packages with no Python source.
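The mixed versus binary-only distinction can be sketched as a function over the file names in an extracted wheel. `binary_findings` and the message texts are hypothetical; the suffixes and severities follow the two rules above.

```python
from pathlib import PurePosixPath

BINARY_SUFFIXES = {".so", ".pyd", ".dylib"}


def binary_findings(names: list[str]) -> list[tuple[str, str]]:
    """Return (severity, message) pairs for compiled extensions in a wheel."""
    binaries = [n for n in names if PurePosixPath(n).suffix in BINARY_SUFFIXES]
    has_python = any(n.endswith(".py") for n in names)
    if not binaries:
        return []
    if not has_python:
        # Binary-only wheel: one MEDIUM finding for the whole package.
        return [("MEDIUM", "binary-only wheel: no Python source to scan")]
    # Mixed wheel: one LOW finding per extension file.
    return [("LOW", f"compiled extension not scanned: {n}") for n in binaries]
```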
Seed Allowlist
Some packages legitimately access credentials as part of their core purpose. pipguard ships with a seed allowlist that reduces findings for these packages from HIGH to MEDIUM; CRITICAL findings are never reduced:
keyring, keyrings.alt, boto3, botocore, awscli, paramiko,
google-auth, google-cloud-storage, google-cloud-bigquery,
google-cloud-core, azure-identity
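The downgrade rule can be sketched as follows, assuming severities are plain strings and the package name has already been normalized. `effective_severity` is a hypothetical name, and the lowercase comparison is an assumption.

```python
SEED_ALLOWLIST = frozenset({
    "keyring", "keyrings.alt", "boto3", "botocore", "awscli", "paramiko",
    "google-auth", "google-cloud-storage", "google-cloud-bigquery",
    "google-cloud-core", "azure-identity",
})


def effective_severity(package: str, severity: str) -> str:
    """Downgrade HIGH to MEDIUM for allowlisted packages; never touch CRITICAL."""
    if severity == "HIGH" and package.lower() in SEED_ALLOWLIST:
        return "MEDIUM"
    return severity
```

An allowlisted package therefore still reaches the confirmation gate instead of installing silently, and a CRITICAL finding blocks it like any other package.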
Limitations
Phase 1 scope
These are known limitations of the current static-analysis approach.
- Obfuscation: multi-layer obfuscation (e.g. exec(compile(...)) wrapped multiple times) may evade detection
- C extensions: .so/.pyd binaries are opaque to AST scanning; flagged as LOW (mixed) or MEDIUM (binary-only) to surface the blind spot
- Python/pip only: no npm, cargo, or go module support
- Phase 2 (in design): seccomp/eBPF sandbox for capability-level interception at runtime