Meet my MusicLab — a powerful end-to-end GAN- and GenAI-powered, production-ready music experimentation platform, free for commercial and private use. Upload two of your favorite music patterns, pick the parts you like most, and let MusicLab combine them into a jingle. MusicLab comes with an interactive multilingual chatbot assistant, “Rita”, and supports both on-prem and cloud model inference.
Inside MusicLab (the backend) are a creative agent NN that masks out the important parts of each track, a Meta EnCodec encoder followed by a cascade of transformer blocks (12 layers in total, similar to modern LLMs) that do the fusion magic, and a Meta EnCodec decoder. How do you use it yourself? Anytime: just follow this link and enjoy.
Want to run it locally or deploy it on AWS for your own needs? Just download the code, ask me for a checkpoint, and you are ready to go. MusicLab ships with Docker scripts and fits on a t2.micro AWS Free Tier instance. See the deployment instructions.
Neural audio continuation system that learns to creatively blend input audio (rhythm) with target characteristics (harmony/melody). The model uses learnable complementary masks to separate and intelligently mix different musical components.
Input Audio (16s, 24kHz)
↓
EnCodec Encoder (frozen)
↓
Stage 0 Transformer (256→1024 dim, 6 layers)
↓
Stage 1 Transformer (384→1536 dim, 6 layers)
↓
Creative Agent (generates complementary masks via attention)
├─ Input Mask (extracts rhythm)
└─ Target Mask (extracts harmony/melody)
↓
Masked combination → EnCodec Decoder (frozen)
↓
Output Audio
Total Parameters: 24.9M (16.7M transformer + 700K creative agent + 3.8M discriminator)
- 2-Stage Cascade: Progressive refinement with spectral normalization
- Creative Agent: Cross-attention based mask generator with complementarity loss
- Audio Discriminator: Adversarial training for realistic audio generation
- EnCodec Integration: 24 kHz, 6.0 kbps bandwidth (frozen encoder/decoder)
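
The diagram above can be condensed into a few lines of PyTorch. The sketch below is only illustrative: the module and class names (`make_stage`, `CascadeSketch`) are hypothetical, the widths are read off the diagram (256→1024 and 384→1536 as model/feed-forward dimensions), and strict mask complementarity is hard-wired here for brevity, whereas the real model learns both masks and penalizes their overlap (see `model_simple_transformer.py` and `creative_agent.py` for the actual implementation).

```python
# Illustrative sketch only -- hypothetical names and simplified masking,
# not the code from model_simple_transformer.py / creative_agent.py.
import torch
import torch.nn as nn

def make_stage(d_model: int, d_ff: int, n_layers: int = 6) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=8, dim_feedforward=d_ff, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=n_layers)

class CascadeSketch(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.proj0 = nn.Linear(latent_dim, 256)
        self.stage0 = make_stage(256, 1024)        # Stage 0: 256-dim, 1024 FFN, 6 layers
        self.proj1 = nn.Linear(256, 384)
        self.stage1 = make_stage(384, 1536)        # Stage 1: 384-dim, 1536 FFN, 6 layers
        # Creative agent: cross-attention from input features onto target features
        self.cross_attn = nn.MultiheadAttention(384, num_heads=8, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(384, 384), nn.Sigmoid())
        self.out_proj = nn.Linear(384, latent_dim)

    def refine(self, z: torch.Tensor) -> torch.Tensor:
        return self.stage1(self.proj1(self.stage0(self.proj0(z))))

    def forward(self, z_input: torch.Tensor, z_target: torch.Tensor) -> torch.Tensor:
        # z_input / z_target: frozen EnCodec latents, shape (B, T, latent_dim)
        f_in, f_tgt = self.refine(z_input), self.refine(z_target)
        attended, _ = self.cross_attn(f_in, f_tgt, f_tgt)
        mask_in = self.mask_head(attended)         # "input mask" (rhythm)
        mask_tgt = 1.0 - mask_in                   # complementary "target mask" (harmony/melody)
        mixed = mask_in * f_in + mask_tgt * f_tgt  # masked combination
        return self.out_proj(mixed)                # would be fed to the frozen EnCodec decoder

z_in, z_tgt = torch.randn(1, 1200, 128), torch.randn(1, 1200, 128)
print(CascadeSketch()(z_in, z_tgt).shape)          # torch.Size([1, 1200, 128])
```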
# Python 3.10+
torch>=2.0.0
torchaudio>=2.0.0
encodec
mlflow
tqdm
numpy

# Single-node multi-GPU training (DDP)
python train_simple_ddp.py \
--train_dir dataset_pairs_wav/train \
--val_dir dataset_pairs_wav/val \
--checkpoint_dir checkpoints \
--batch_size 8 \
--epochs 200
# Or use launch scripts
bash run_train_creative_agent_fixed.sh
bash run_train_creative_agent_resume.sh

# Generate audio from trained model
python inference_cascade.py \
--checkpoint checkpoints/best_model.pt \
--input_audio input.wav \
--target_audio target.wav \
--output output.wav \
--shuffle_targets  # Random target pairing for creativity

| Parameter | Value | Description |
|---|---|---|
| Batch size | 32 (8×4 GPUs) | Per-device: 8 |
| Learning rate | 1e-4 | AdamW optimizer |
| Weight decay | 0.01 | L2 regularization |
| Epochs | 200 | Full training |
| Input loss weight | 0.3 | RMS reconstruction |
| Target loss weight | 0.3 | RMS continuation |
| Mask reg weight | 0.1 | Complementarity penalty |
| Balance loss weight | 15.0 | 50/50 mask balance |
| GAN weight | 0.15 | Adversarial loss |
| Correlation penalty | 0.5 | Anti-modulation |
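
How these weights enter the training objective is defined in `training/losses.py`; the snippet below is only a hedged sketch of a plausible weighted sum using the values from the table, and the `balance_penalty` helper is a hypothetical illustration of the 50/50 mask-balance term.

```python
import torch

# Weights copied from the table above; how they are actually combined is
# defined in training/losses.py -- this is only a plausible sketch.
WEIGHTS = {
    "input_recon": 0.3,   # RMS reconstruction of the input (rhythm) branch
    "target_recon": 0.3,  # RMS continuation toward the target (harmony/melody)
    "mask_reg": 0.1,      # complementarity penalty between the two masks
    "balance": 15.0,      # pushes the masks toward a 50/50 split
    "gan": 0.15,          # adversarial loss from the audio discriminator
    "correlation": 0.5,   # anti-modulation correlation penalty
}

def total_loss(terms: dict[str, torch.Tensor]) -> torch.Tensor:
    """Weighted sum of individual loss terms (illustrative only)."""
    return sum(WEIGHTS[name] * value for name, value in terms.items())

def balance_penalty(mask_input: torch.Tensor) -> torch.Tensor:
    """Hypothetical 50/50 balance term: squared deviation of the mask mean from 0.5."""
    return (mask_input.mean() - 0.5).pow(2)
```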
PowrMuse/
├── model_simple_transformer.py # 2-stage cascade architecture
├── creative_agent.py # Attention-based mask generator
├── audio_discriminator.py # GAN discriminator
├── correlation_penalty.py # Anti-modulation loss
│
├── train_simple_worker.py # DDP training worker
├── train_simple_ddp.py # Multi-GPU launcher
├── training/losses.py # Loss functions
│
├── inference_cascade.py # Audio generation
├── inference_sample.py # Single sample inference
│
├── dataset_wav_pairs.py # Dataset loader
├── create_dataset_pairs_wav.py # Dataset creation
│
├── run_train_creative_agent.sh # Main training script
├── view_mlflow_data.py # View training metrics
│
└── README.md # This file
Paired WAV files organized as:
dataset_pairs_wav/
├── train/
│ ├── pair_0000_input.wav # 16 seconds, 24kHz, mono
│ ├── pair_0000_target.wav
│ ├── pair_0001_input.wav
│ └── ...
└── val/
├── pair_0000_input.wav
└── ...
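
Loading this layout only requires globbing the `*_input.wav` files and swapping the suffix to find the matching target. The class below is a minimal sketch under that assumption (the name `WavPairSketch` is hypothetical); the loader actually used for training is `dataset_wav_pairs.py`.

```python
# Minimal sketch of a paired-WAV dataset for the layout above; the loader
# actually used for training is dataset_wav_pairs.py.
from pathlib import Path

import torch
import torchaudio
from torch.utils.data import Dataset

class WavPairSketch(Dataset):
    def __init__(self, root: str, sample_rate: int = 24000):
        self.sample_rate = sample_rate
        self.inputs = sorted(Path(root).glob("pair_*_input.wav"))

    def __len__(self) -> int:
        return len(self.inputs)

    def _load(self, path: Path) -> torch.Tensor:
        wav, sr = torchaudio.load(str(path))          # (channels, samples)
        if sr != self.sample_rate:
            wav = torchaudio.functional.resample(wav, sr, self.sample_rate)
        return wav.mean(dim=0, keepdim=True)          # force mono

    def __getitem__(self, idx: int):
        input_path = self.inputs[idx]
        target_path = input_path.with_name(input_path.name.replace("_input", "_target"))
        return self._load(input_path), self._load(target_path)

# train_set = WavPairSketch("dataset_pairs_wav/train")
```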
Create dataset using:
python create_dataset_pairs_wav.py \
--input_dir path/to/audio \
--output_dir dataset_pairs_wav \
--duration 16 \
--sample_rate 24000

- Training: 4× NVIDIA A100-80GB (or similar)
- Inference: 1× GPU with 16GB+ VRAM (or CPU)
- Storage: ~50GB for dataset, ~5GB for checkpoints
@software{jingle_d_2025,
title={Jingle_D: Creative Audio Mixing with Cascade Transformers},
note={Training conducted on the DKRZ Levante supercomputer},
year={2025},
url={https://github.com/YOUR_USERNAME/Jingle_D}
}

Research project - refer to institution policies for usage terms.
