Skip to content

TRYLOCK is a layered security system that protects AI chatbots from manipulation attacks by combining multiple defense mechanisms that work together like locks on different doors. It blocks 88% more attacks than unprotected models while staying helpful for legitimate questions.

License

Notifications You must be signed in to change notification settings

scthornton/trylock

Repository files navigation

Project TRYLOCK v2.0

Adversarial Enterprise Guard for Intrinsic Security

An open-source research project to create a dataset and training pipeline that improves open LLMs' resistance to prompt-based attacks while minimizing over-refusal.

License: Apache 2.0 Python 3.10+ Models Dataset

The Problem

Current LLM defenses leave a critical gap:

Defense Layer Protection Issue
Base model ~0% Will do anything
Instruct/RLHF ~60% Basic safety training
Flagship (Claude/GPT) ~75% Must stay usable for everyone
Third-party guardrails ~95% 20%+ false positive rate

Enterprises need 85-90% protection without the false positive explosion.

The Solution

TRYLOCK provides a three-layer defense stack:

┌─────────────────────────────────────────────────────────────────────┐
│                    TRYLOCK v2 DEFENSE STACK                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Layer 1: KNOWLEDGE (LoRA + DPO)                                    │
│  └── Teaches model what attacks look like through preference        │
│      learning on multi-turn trajectories                            │
│                                                                      │
│  Layer 2: INSTINCT (Representation Engineering)                     │
│  └── Dampens "attack compliance" direction with tunable α           │
│      coefficient (0.0 = research, 1.0 = balanced, 2.5 = lockdown)  │
│                                                                      │
│  Layer 3: OVERSIGHT (Security Sidecar)                              │
│  └── Parallel 8B classifier scores conversation state               │
│      (SAFE | WARN | ATTACK) invisible to attacker                   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

🎯 Trained Models & Research

Published Models

The TRYLOCK defense system is fully trained and available on HuggingFace:

Dataset

  • Public Sample: scthornton/trylock-demo-dataset (48 diverse examples)
  • Full Training Set: Private (2,939 preference pairs - available upon request for academic research)

Research Paper

See paper/TRYLOCK_Canonical.md for the complete research paper documenting methodology, experiments, and results.

Performance: TRYLOCK achieves 82.8% reduction in attack success rate (from 100% to 17.2%) while maintaining low over-refusal (12.6%) on benign queries.

Quick Start

Use Pre-Trained Models

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load DPO-trained model
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
model = PeftModel.from_pretrained(base, "scthornton/trylock-mistral-7b-dpo")

# See scripts/eval_cpu_only.py for full three-layer implementation

Installation (For Development)

git clone https://github.com/scthornton/trylock.git
cd trylock
pip install -e ".[dev]"

Generate Sample Data

# Generate 100 attack trajectories (mock mode - no API)
python scripts/generate_data.py generate --count 100 --mock

# Generate with real API calls
export ANTHROPIC_API_KEY="your-key"
python scripts/generate_data.py generate --count 100

# Generate benign hard negatives
python scripts/generate_data.py benign --count 50

# Validate generated data
python scripts/generate_data.py validate data/tier1_open/attacks/generated.jsonl

Python API

from generation import TRYLOCKPipeline, PipelineConfig

# Configure pipeline
config = PipelineConfig(
    red_bot_model="claude-sonnet-4-20250514",
    judge_model="claude-sonnet-4-20250514",
    mutations_per_trajectory=3,
)

# Generate trajectories
pipeline = TRYLOCKPipeline(config)
trajectories = pipeline.run(count=100)

# Save output
pipeline.save(trajectories, "data/output.jsonl")

Dataset Structure

Each trajectory contains:

{
  "id": "trylock-mt-00421",
  "version": "2.0",
  "attack_metadata": {
    "family": "multi_turn_manipulation",
    "type": "crescendo",
    "wrappers": ["adversarial_poetry"],
    "manipulation_tactics": ["context_anchoring"],
    "difficulty": "hard"
  },
  "conversation": [
    {
      "turn": 1,
      "role": "user",
      "content": "I'm researching security...",
      "turn_labels": {
        "attack_stage": "setup",
        "attack_intent_score": 0,
        "expected_action": "continue"
      }
    }
  ],
  "pivot_turn_index": 3,
  "responses": {
    "chosen": {"content": "...", "response_type": "soft_decline"},
    "rejected_unsafe": {"content": "...", "failure_mode": "..."},
    "rejected_overblock": {"content": "...", "failure_mode": "..."}
  }
}

Attack Taxonomy

TRYLOCK covers five attack families:

Family Description Priority
Multi-turn Manipulation Crescendo, context anchoring, boundary softening HIGH
Indirect Injection RAG poisoning, tool output injection HIGH
Obfuscation Wrappers Poetry, roleplay, encoding, translation MEDIUM
Direct Injection Classic jailbreaks, system prompt extraction MEDIUM
Tool/Agent Abuse Instruction hierarchy attacks, hidden goals EMERGING

See taxonomy/v2.0/attack_families.yaml for the full taxonomy.

Project Structure

trylock/
├── taxonomy/v2.0/          # Attack classification system
│   ├── attack_families.yaml
│   ├── manipulation_tactics.yaml
│   ├── attack_stages.yaml
│   └── response_types.yaml
│
├── data/
│   ├── schema/             # JSON schema + validator
│   ├── tier1_open/         # Public dataset (Apache 2.0)
│   ├── tier2_gated/        # Research agreement required
│   └── tier3_private/      # Internal only
│
├── generation/             # Data generation pipeline
│   ├── red_bot.py          # Attack generator
│   ├── victim_bot.py       # Target model simulator
│   ├── judge_bot.py        # Labeler + response generator
│   ├── mutation_engine.py  # Create attack variants
│   ├── activation_capture.py  # RepE training data
│   └── pipeline.py         # Orchestration
│
├── training/               # Training pipeline (coming soon)
│   ├── sft_warmup.py
│   ├── dpo_preference.py
│   ├── repe_training.py
│   └── sidecar_classifier.py
│
├── eval/                   # Evaluation framework (coming soon)
│   ├── harness.py
│   ├── metrics.py
│   └── benchmarks/
│
└── scripts/                # CLI tools
    └── generate_data.py

Target Metrics

Metric Baseline Target
Single-turn ASR ~25% ≤10%
Multi-turn ASR ~35% ≤15%
Indirect/RAG ASR ~40% ≤20%
Novel wrapper ASR ~60% ≤30%
Over-refusal rate - ≤+2-4%
Capability preservation 100% ≥95%

Academic References

  • SecAlign: arXiv:2410.05451
  • MTJ-Bench: arXiv:2508.06755
  • PoisonedRAG: USENIX Security 2025
  • Adversarial Poetry: arXiv:2511.15304
  • LLMail-Inject: arXiv:2506.09956

Contributing

We welcome contributions! Areas of interest:

  1. New attack patterns: Especially novel multi-turn and indirect injection
  2. Benign hard negatives: Cases that look like attacks but aren't
  3. Evaluation benchmarks: Integration with existing security benchmarks
  4. Training improvements: Better DPO/RepE configurations

Please see CONTRIBUTING.md for guidelines.

License

Apache 2.0 with a Responsible Use Addendum. See LICENSE.

The dataset is intended for defensive security research only. Do not use this data to:

  • Train models intended to generate attacks
  • Bypass security measures on systems you don't own
  • Cause harm to individuals or organizations

Citation

@software{trylock2025,
  title = {TRYLOCK: Adversarial Enterprise Guard for Intrinsic Security},
  author = {Thornton, Scott},
  year = {2025},
  url = {https://github.com/scthornton/trylock}
}

Contact

trylock

About

TRYLOCK is a layered security system that protects AI chatbots from manipulation attacks by combining multiple defense mechanisms that work together like locks on different doors. It blocks 88% more attacks than unprotected models while staying helpful for legitimate questions.

Topics

Resources

License

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages