DJ Reinforcement Learning System
A wireless ESP32-S2 sensor mesh feeds crowd-engagement data to a 2M-parameter GRU pretrained on ~3,000 DJ sets, with an RL head that adapts the next-song recommendation to live audience reaction.
Premise
A DJ is faced with an impossible amount of information to sift through. With roughly 475 million songs across Spotify and SoundCloud combined, and a curated DJ library that easily exceeds 1,000 tracks, it is hard to read the crowd's reaction and pick the next song at the same time.
This project uses deep learning and reinforcement learning to traverse that maze on the DJ’s behalf. A wireless sensor mesh measures crowd engagement; a sequence model trained on real DJ sets proposes the next song; an RL head nudges that proposal based on what the crowd just did. The algorithm exploits, the musician explores.
What it actually looks like
Sensor mesh (Chris)
Two peripheral packs, each a UM FeatherS2 (ESP32-S2) with a sensor and a LiPo battery, deployed wirelessly across the venue:
- Environmental (BME680): ambient temperature, humidity, pressure, VOC. Temperature, humidity, and VOC all move with crowd exertion. I²C to the FeatherS2.
- Audio (MAX9814): a microphone with automatic gain control. The ESP's 12-bit ADC samples the bias-shifted output over a 50 ms window; the min/max swing within that window is converted to a calibrated decibel value (sketched below). Required ADC attenuation tuning to avoid clipping.
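A minimal sketch of that window-to-decibels step, assuming a 12-bit ADC and a made-up quiet-room calibration point; the constants and the `read_adc` callable are illustrative, not the shipped firmware:

```python
import math
import time

ADC_MAX = 4095            # 12-bit ADC full scale
CAL_REF_SWING = 40        # peak-to-peak counts from a quiet-room calibration (assumed)
CAL_REF_DB = 35.0         # decibel value assigned to that reference swing (assumed)

def read_db(read_adc, window_ms=50):
    """Track min/max of the bias-shifted mic signal for one window, return dB."""
    lo, hi = ADC_MAX, 0
    deadline = time.monotonic() + window_ms / 1000
    while time.monotonic() < deadline:
        v = read_adc()                      # raw 0..4095 ADC count
        lo, hi = min(lo, v), max(hi, v)
    swing = max(hi - lo, 1)                 # peak-to-peak amplitude in counts
    # 20*log10 of the ratio against the calibrated reference swing
    return CAL_REF_DB + 20 * math.log10(swing / CAL_REF_SWING)
```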
Both packs broadcast over ESP-NOW to a single controller FeatherS2. The controller's callback is deliberately tiny, flag-set only, because we previously bricked three FeatherS2 boards by stacking heavier work in the callback alongside I²C; the idiom is sketched below.
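The discipline itself is language-agnostic. Here is a minimal Python stand-in for the flag-set-only pattern (not the actual ESP-NOW firmware; `poll` and `handle` are placeholder hooks):

```python
# The receive callback only stashes the packet and raises a flag.
# All real work, including anything that touches I2C, happens in the
# main loop, never in callback context.
latest_packet = None
packet_ready = False

def on_receive(packet):            # runs in interrupt/callback context
    global latest_packet, packet_ready
    latest_packet = packet         # stash the bytes...
    packet_ready = True            # ...raise the flag, and nothing else

def main_loop(poll, handle):
    global packet_ready
    while True:
        poll()                     # lets the radio stack deliver callbacks
        if packet_ready:
            packet_ready = False
            handle(latest_packet)  # heavy work (parsing, I2C) lives here
```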
Raspberry Pi (mine)
The Pi is the I²C master and the WiFi bridge. It pulls data from the controller ESP and from a thermal camera (calculating per-frame motion as the Frobenius norm of the consecutive-frame difference matrix), packages everything into JSON, and sends it to the Mac. It also pushes selected fields over I²C to the ATMega328PB, which drives the LCD — the bare-metal-C complexity requirement for the course.
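The motion metric is compact enough to show. A minimal sketch, assuming frames arrive as 2-D NumPy arrays (the function name is illustrative):

```python
import numpy as np

def motion_score(prev_frame: np.ndarray, frame: np.ndarray) -> float:
    """Per-frame motion: ||F_t - F_{t-1}||_F, the Frobenius norm of the
    consecutive-frame difference. Large when many pixels change at once."""
    diff = frame.astype(float) - prev_frame.astype(float)
    return float(np.linalg.norm(diff, ord="fro"))
```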
RTRLOS — the orchestrator (mine)
RTRLOS stands for "Real-Time Reinforcement-Learning Operating System." It is, charitably, none of those things: not real-time, not pre-emptive, and not running a scheduler we wrote in this class. It is built on Python's asyncio and threading primitives, which give us an event-driven loop and context switching.
The architectural inspiration is Go's concurrency idiom: share memory by communicating, not the other way around. We model channels with asyncio.Queue, and the Pi-to-Mac pipe is a long-lived WebSocket connection rather than shared mutable state.
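A minimal sketch of that channel idiom, assuming the third-party `websockets` package; the task names, the `read_sample` coroutine, and the URI are illustrative:

```python
import asyncio
import json
import websockets

async def sensor_reader(queue: asyncio.Queue, read_sample):
    while True:
        # Producers only ever touch the queue, never the socket.
        queue.put_nowait(await read_sample())

async def ws_sender(queue: asyncio.Queue, uri: str):
    async with websockets.connect(uri) as ws:   # one long-lived connection
        while True:
            sample = await queue.get()          # the consumer owns the socket
            await ws.send(json.dumps(sample))

async def main(read_sample, uri="ws://mac.local:8765"):
    q: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(sensor_reader(q, read_sample), ws_sender(q, uri))

# run with: asyncio.run(main(read_sample))
```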
The model
Two stages, trained separately and chained:
- Pretraining (offline). A custom web-scraper collected ~3,000 DJ sets. After preprocessing (track-name normalisation, audio-metadata extraction, embedding lookup), the corpus became a sequence-prediction dataset: given the last K songs, predict the next. We trained a small GRU of roughly 2 M parameters (1,799,552 exactly) on next-song-in-sequence loss. Done before the semester started.
- Reinforcement-learning head (online). A small MLP sitting on top of the GRU's latent. It takes [of_dx, of_dy, tof_mm, thermal_avgC, audio_db, …] plus the current sequence state and outputs a nudge vector. The RL signal is crowd movement after a song change. The frozen GRU contributes structure; the RL head contributes adaptation. Both pieces are sketched below.
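A minimal PyTorch sketch of the frozen-GRU-plus-RL-head arrangement; the dimensions, layer sizes, and names are assumptions for illustration, not the shipped 1,799,552-parameter checkpoint:

```python
import torch
import torch.nn as nn

EMB, HID, SENSORS = 64, 256, 5   # illustrative sizes, not the real ones

class NextSongModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(EMB, HID, batch_first=True)    # pretrained, frozen
        self.to_song = nn.Linear(HID, EMB)               # latent -> song embedding
        for p in list(self.gru.parameters()) + list(self.to_song.parameters()):
            p.requires_grad_(False)
        self.rl_head = nn.Sequential(                    # online-trained nudge
            nn.Linear(HID + SENSORS, 128), nn.ReLU(), nn.Linear(128, EMB)
        )

    def forward(self, song_seq, sensors):
        # song_seq: (B, K, EMB) last-K song embeddings; sensors: (B, SENSORS)
        _, h = self.gru(song_seq)
        latent = h[-1]                                   # (B, HID) sequence state
        base = self.to_song(latent)                      # pretrained proposal
        nudge = self.rl_head(torch.cat([latent, sensors], dim=-1))
        return base + nudge                              # adapted target embedding
```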
A simple in-memory vector database holds song embeddings; the inference step finds nearest-neighbour candidates to the predicted vector and ranks them. For an MVP-stage prototype with ~1,000 songs this is sufficient; a production version would swap in a real vector store.
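For concreteness, a minimal NumPy stand-in for that lookup, with a dense matrix as the "vector database" and cosine similarity as the assumed ranking metric:

```python
import numpy as np

def rank_candidates(library: np.ndarray, predicted: np.ndarray, k: int = 5):
    """library: (N, EMB) song embeddings; predicted: (EMB,) target vector.
    Returns indices of the k nearest songs by cosine similarity."""
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    q = predicted / np.linalg.norm(predicted)
    sims = lib @ q                       # cosine similarity to every song
    return np.argsort(-sims)[:k]         # best-first candidate indices
```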
What changed during the build
Almost all of the original hardware plan changed. The original block diagram had a single ATMega328PB at the centre, talking UART up to the Mac. The MVP demo inverted that: the Pi became the centre, the ATMega moved to the LCD edge, and UART became WiFi/JSON. The reasons were boring (the sensor stack on the ATMega was prohibitively register-heavy for the parts we had), but the lesson was useful: simplify the data streams. Most of our debugging time went to protocol crossings (I²C ↔ ESP-NOW ↔ WiFi), not to application logic.
What I’d build next
- Better playback. The current system recommends a single next track. The interesting version overlaps stems — mix acapellas onto instrumentals, beat-match in software — so the “recommendation” is a transition, not a track.
- Bigger pretraining corpus & richer features. 3K sets and metadata-only embeddings were a starting point; raw audio embeddings (CLAP, Audio MAE) would let the model reason about timbre and energy directly.
- Off-board the thermal camera. Today it’s I²C to the Pi, which means the Pi sits where the camera sits. Moving the camera onto its own ESP peripheral pack gets it overhead, where it wants to be.
Built with Chris Spletzer for ESE 519 at UPenn (Fall 2025). Full report and demo video linked above.