How It Works

pipguard's design has one rule: code must never execute during scanning.

Architecture

pipguard install X
pip download --prefer-binary X    ← downloads wheel/sdist, no code execution
Detect sdist fallback             ← exit 2 if sdist detected (unless --allow-sdist)
Extract archive (zipfile/tarfile) ← never executes code
AST scan all .py files            ← parallel, ThreadPoolExecutor
  setup.py, pyproject.toml, *.pth ← CRITICAL/HIGH scope
  all other .py                   ← MEDIUM/LOW scope
Risk scoring:
  CRITICAL → block (exit 1)
  HIGH     → block (exit 1)
  MEDIUM   → warn + confirm
  LOW      → warn + confirm
  CLEAN    → install silently
pip install --no-index            ← installs FROM SCANNED FILES (TOCTOU-safe)
    --find-links /tmp/pipguard-XX
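
The risk-scoring step in the flow above can be sketched as a small decision function. This is an illustration only, not pipguard's actual code: the names Severity, decide, and BLOCK_EXIT_CODE are hypothetical, and the sketch assumes the worst finding alone determines the outcome.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels, ordered so that max() picks the worst finding."""
    CLEAN = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


BLOCK_EXIT_CODE = 1  # exit status shown in the flow above for blocked installs


def decide(severities):
    """Return 'block', 'confirm', or 'install' for a list of finding severities."""
    # Assumption: only the single worst finding matters for the decision.
    worst = max(severities, default=Severity.CLEAN)
    if worst >= Severity.HIGH:
        return "block"      # CRITICAL or HIGH: refuse to install
    if worst >= Severity.LOW:
        return "confirm"    # MEDIUM or LOW: warn and ask the user
    return "install"        # no findings: install silently
```

With this ordering, a package with one LOW and one HIGH finding is blocked, while a package with no findings goes straight to the install step.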

Why Pre-Install?

Classical security tools (pip-audit, Safety, GuardDog) work post-hoc — they check installed packages against known-bad signature databases. This means:

  1. Zero-day blind spot — a new attack not yet in the database walks straight through
  2. Race condition — the malicious code has already run by the time the tool checks

pipguard reverses the order. It scans before anything is installed and asks: is what this code does something any pip install should be allowed to do?

Regardless of whether the package is on any watchlist, the answer to "reads ~/.ssh/id_rsa and sends it over a network" is always no.

TOCTOU Safety

A subtle attack vector: scan a clean file, then swap it for a malicious one before install.

pipguard counters this by:

  1. Downloading the archive to a temp directory
  2. Scanning the files in place in that temp directory
  3. Running pip install --no-index --find-links /tmp/pipguard-XX — installing the exact files that were scanned

The archive is never re-downloaded or re-extracted after scanning.
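
A minimal sketch of that sequence, assuming pip is invoked as a subprocess. The helper names (download_command, install_command, guarded_install) and the scan callback are hypothetical. The pip flags themselves (--prefer-binary, --dest, --no-index, --find-links) are standard pip options.

```python
import subprocess
import sys
import tempfile


def download_command(package, workdir):
    """Build the pip command that fetches archives into workdir."""
    return [sys.executable, "-m", "pip", "download",
            "--prefer-binary", "--dest", str(workdir), package]


def install_command(package, workdir):
    """Build the pip command that installs only from workdir, never the index."""
    return [sys.executable, "-m", "pip", "install",
            "--no-index", "--find-links", str(workdir), package]


def guarded_install(package, scan):
    """Download, scan, then install from the same directory.

    `scan` is a hypothetical callback that returns True when the
    downloaded files are acceptable.
    """
    with tempfile.TemporaryDirectory(prefix="pipguard-") as workdir:
        subprocess.run(download_command(package, workdir), check=True)
        if not scan(workdir):
            return False
        # --no-index means pip cannot fetch a different copy at this point.
        subprocess.run(install_command(package, workdir), check=True)
        return True
```

The point of the design is that both commands receive the same directory: the install step resolves the package from the files that were scanned rather than from the network.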

AST Scanning

pipguard uses Python's built-in ast module — no third-party dependencies — to parse .py files into abstract syntax trees and walk the nodes looking for dangerous patterns.
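
As a rough illustration of this kind of walk (not pipguard's real rule set), the sketch below parses source text with ast.parse, which builds a tree without running the code, and reports calls whose dotted name appears in a small hypothetical watchlist. The names SUSPICIOUS_CALLS, dotted_name, and scan_source are assumptions made for the example.

```python
import ast

# Hypothetical watchlist; a real scanner would have many more rules.
SUSPICIOUS_CALLS = {
    "eval", "exec", "__import__",
    "os.system", "subprocess.run",
    "urllib.request.urlopen", "importlib.import_module",
}


def dotted_name(node):
    """Return e.g. 'os.system' for a Name/Attribute chain, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def scan_source(source, filename="<unknown>"):
    """Return sorted (line, call_name) pairs for watchlisted calls."""
    tree = ast.parse(source, filename=filename)  # parses only, never executes
    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in SUSPICIOUS_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)
```

A name-based check like this is easy to evade with aliases (for example `from os import system`), which is one reason a practical scanner needs more than a single pass over call names.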

What gets flagged

CRITICAL

Pattern                                       Example
.pth file with executable Python              import os; os.system(...) in .pth
Obfuscated eval                               eval(base64.b64decode(...))
Network in setup.py / install hooks           urllib.request.urlopen(...) in setup.py
Shell/subprocess execution in install hooks   os.system(...), subprocess.run(..., shell=True)

HIGH

Pattern                                       Example
Non-ASCII character in package name           bоto3 with Cyrillic о (possible homoglyph/typosquatting attack)
Credential path read in install hooks         open('~/.ssh/id_rsa') in setup.py
Subprocess execution in install hooks         subprocess.run([...])

MEDIUM

Pattern                                       Example
Binary-only wheel (no Python source)          Wheel with only .so / .pyd / .dylib files
Network in runtime code                       urllib.request.urlopen(...) in utils.py
Sensitive env var access                      os.environ.get('AWS_SECRET_ACCESS_KEY')
Large source file over 1MB                    scanner emits confidence-reduction warning
Binary IOC string hit                         .so contains https://... or /bin/sh

LOW

Pattern                                       Example
Compiled binary extension in mixed wheel      .so / .pyd / .dylib alongside .py source
Dynamic imports                               importlib.import_module(name)
__import__()                                  __import__(variable)

Binary files are also scanned with lightweight IOC string matching (first 2MB) to surface obvious credential-path or exfiltration indicators.
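
A minimal sketch of that string matching, assuming a plain substring search over the first 2MB of the file. The pattern list is illustrative: `https://` and `/bin/sh` come from the table of flagged patterns, while `.ssh/id_rsa` is an assumed credential-path indicator. IOC_SCAN_LIMIT, IOC_PATTERNS, and scan_binary are hypothetical names.

```python
IOC_SCAN_LIMIT = 2 * 1024 * 1024  # only the first 2MB is inspected

# Illustrative indicators; the real list is not documented here.
IOC_PATTERNS = (b"https://", b"/bin/sh", b".ssh/id_rsa")


def scan_binary(path):
    """Return the indicator strings found in the first 2MB of a binary file."""
    with open(path, "rb") as fh:
        head = fh.read(IOC_SCAN_LIMIT)  # reads bytes only, never loads the library
    return [pattern.decode() for pattern in IOC_PATTERNS if pattern in head]
```

Capping the read keeps the check cheap, at the cost of missing any string that sits beyond the 2MB boundary.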

Homoglyph / Typosquatting Detection

Before scanning archive contents, pipguard checks the package name itself for non-ASCII characters (e.g. Cyrillic о substituted for Latin o). Any such character produces a HIGH finding regardless of what the package contains:

bоto3   ← Cyrillic 'о' (U+043E) in position 1
↑ visually identical to boto3, but a different string

Package names are also NFKC-normalized before allowlist comparison, so a homoglyph name cannot bypass the allowlist by mimicking a trusted package.
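
Both checks can be sketched with the standard library. The function names are hypothetical. One detail worth seeing in code: NFKC folds compatibility forms such as fullwidth letters into their ASCII equivalents, but it leaves Cyrillic letters untouched, so a Cyrillic look-alike never compares equal to the trusted name and is caught by the non-ASCII check instead.

```python
import unicodedata


def non_ascii_chars(name):
    """Return (index, char, code point) for every non-ASCII character in name."""
    return [(i, ch, f"U+{ord(ch):04X}")
            for i, ch in enumerate(name) if ord(ch) > 127]


def normalize_for_allowlist(name):
    """NFKC-normalize a package name before comparing it to the allowlist."""
    return unicodedata.normalize("NFKC", name)


cyrillic = "b\u043eto3"    # Cyrillic 'о' (U+043E) in place of Latin 'o'
fullwidth = "\uff42oto3"   # fullwidth 'ｂ' (U+FF42) in place of Latin 'b'

print(non_ascii_chars(cyrillic))                      # [(1, 'о', 'U+043E')]
print(normalize_for_allowlist(fullwidth) == "boto3")  # True
print(normalize_for_allowlist(cyrillic) == "boto3")   # False
```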

Binary Extension Scanning

pipguard detects compiled binary extension files (.so, .pyd, .dylib) in extracted wheels. Static AST scanning cannot inspect these files, so pipguard flags them explicitly:

  • Mixed wheel (.py source + binary extensions): each extension file generates a LOW finding — the scanner covered the Python parts but is blind to any payload in compiled code.
  • Binary-only wheel (no .py source at all): a single MEDIUM finding is emitted, and the confirmation gate fires. pipguard's core scan promise cannot be fulfilled for packages with no Python source.
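
The two cases can be expressed as a short classification over the extracted file list. This is a sketch under the assumption that file suffixes alone decide the category; binary_findings and BINARY_SUFFIXES are hypothetical names.

```python
from pathlib import PurePosixPath

BINARY_SUFFIXES = {".so", ".pyd", ".dylib"}


def binary_findings(filenames):
    """Return (severity, subject) findings for binary extensions in a wheel."""
    binaries = [f for f in filenames
                if PurePosixPath(f).suffix in BINARY_SUFFIXES]
    has_python = any(PurePosixPath(f).suffix == ".py" for f in filenames)
    if not binaries:
        return []                                  # nothing opaque to report
    if not has_python:
        return [("MEDIUM", "binary-only wheel")]   # one finding for the whole wheel
    return [("LOW", f) for f in binaries]          # one finding per extension file
```

For example, a wheel containing `pkg/__init__.py` and `pkg/_speedups.cpython-311-x86_64-linux-gnu.so` yields one LOW finding for the `.so` file.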

Seed Allowlist

Some packages legitimately access credentials as part of their core purpose. pipguard ships with a seed allowlist that reduces their finding from HIGH to MEDIUM (CRITICAL is never reduced):

keyring, keyrings.alt, boto3, botocore, awscli, paramiko, google-auth, google-cloud-storage, google-cloud-bigquery, google-cloud-core, azure-identity
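
The reduction rule can be sketched as follows. SEED_ALLOWLIST mirrors the list above. effective_severity is a hypothetical helper, and the sketch assumes the reduction applies to any HIGH finding for an allowlisted package, which the text above does not spell out.

```python
SEED_ALLOWLIST = frozenset({
    "keyring", "keyrings.alt", "boto3", "botocore", "awscli", "paramiko",
    "google-auth", "google-cloud-storage", "google-cloud-bigquery",
    "google-cloud-core", "azure-identity",
})


def effective_severity(package, severity):
    """Downgrade HIGH to MEDIUM for allowlisted packages; leave the rest as-is."""
    if package in SEED_ALLOWLIST and severity == "HIGH":
        return "MEDIUM"
    return severity  # CRITICAL is never reduced, even for allowlisted packages


print(effective_severity("boto3", "HIGH"))      # MEDIUM
print(effective_severity("boto3", "CRITICAL"))  # CRITICAL
print(effective_severity("requests", "HIGH"))   # HIGH
```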

Full allowlist reference →

Limitations

Phase 1 scope

These are known limitations of the current static-analysis approach.

  • Obfuscation — multi-layer obfuscation (e.g. exec(compile(...)) wrapped multiple times) may evade detection
  • C extensions: .so / .pyd binaries are opaque to AST scanning; they are flagged as LOW (mixed) or MEDIUM (binary-only) to surface the blind spot
  • Python/pip only — no npm, cargo, or go module support
  • Phase 2 (in design) — seccomp/eBPF sandbox for capability-level interception at runtime