An Intelligent Question-Answering AI Consumer Technology Product
Artificial Intelligence For Science and Technology - A.Y. 2024/2025
Università degli Studi di Milano-Bicocca
Authors: Andrea Yachaya (913721) & Mirko Morello (920601)
- Overview
- System Architecture
- Hardware Components
- Software Components
- Key Features
- Technical Implementation
- Installation & Setup
- Usage
- Performance Metrics
- Future Work
- License
Marvin is an AI-based virtual assistant designed as a consumer technology product, built from scratch using state-of-the-art machine learning models. The project demonstrates a complete end-to-end conversational AI system with custom hardware integration, featuring:
- Privacy-First Design: Wake word detection runs entirely on-device
- Multi-Speaker Support: Real-time speaker identification and diarization
- Natural Conversations: Context-aware responses using LLama 8B
- Physical Device: Custom 3D-printed enclosure with Raspberry Pi hardware
- Low Latency: Optimized pipeline for responsive interactions
- Design a working AI-based virtual assistant
- Enable the client to listen, communicate with server, and play audio responses
- Implement server capabilities for:
- Speaker diarization and identification using embeddings
- Speech-to-text transcription
- LLM-based intelligent responses
- Text-to-speech synthesis
- Support real-time multiple client connections with minimal latency
┌─────────────────────────────────────────────────┐
│                  CLIENT DEVICE                  │
│  1. Detects when user is speaking (VAD)         │
│  2. Responds to wake word ("Marvin")            │
│  3. Sends audio data to server                  │
│  4. Plays response from server                  │
│  5. Visual feedback through 12 RGB LEDs         │
└────────────────┬───────────────┬────────────────┘
                 ▲               │
           Audio │               │ Audio
                 │               │
              TCP Socket (Port 8080)
                 │               ▼
┌────────────────┴───────────────┴────────────────┐
│                     SERVER                      │
│  1. Audio Reception & Pre-processing            │
│  2. Speaker Diarization (pyannote 3.1)          │
│  3. Speaker Identification (embedding-based)    │
│  4. Speech-to-Text (Whisper Large v3 Turbo)     │
│  5. LLM Processing (LLama 8B Instruct)          │
│  6. Text-to-Speech (Kokoro TTS)                 │
│  7. Response Transmission                       │
└─────────────────────────────────────────────────┘
The system uses a custom binary protocol over TCP for minimal overhead:
- Built on top of TCP (Layer 4)
- Custom binary format at Layer 7
- Supports audio streaming and state synchronization
- Server listens on 0.0.0.0:8080
Message Format:

| Size (4 bytes) | Data (N bytes) |
|---|---|
| int32, big-endian | float32 audio or JSON |
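As a rough illustration, this framing can be handled with a few lines of Python's standard library; the helper names below are ours, not the project's actual client/server code.

```python
# Minimal sketch of the length-prefixed framing described above (illustrative,
# not the project's exact implementation).
import socket
import struct

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes or raise if the peer closes the connection."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf += chunk
    return buf

def send_message(sock: socket.socket, payload: bytes) -> None:
    # 4-byte big-endian int32 size header, followed by the raw payload
    sock.sendall(struct.pack(">i", len(payload)) + payload)

def recv_message(sock: socket.socket) -> bytes:
    (size,) = struct.unpack(">i", recv_exact(sock, 4))
    return recv_exact(sock, size)

# Example: stream a float32 audio buffer to the server
# sock = socket.create_connection(("192.168.1.10", 8080))
# send_message(sock, audio_float32.tobytes())
```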
The client is a Raspberry Pi 3 Model B+ enclosed in a custom 3D-printed hexagonal case with integrated components:
- Microcontroller: Raspberry Pi 3 Model B+
- Microphone Array: ReSpeaker 7-mic array with beamforming
- LEDs: 12 addressable RGB LEDs (WS2812B)
- Speakers: 2× 5W speakers (stereo output)
- Amplifier: Integrated speaker amplifier
- Battery: 3300mAh Lithium battery
- Power Management: Battery manager with 5V voltage regulator
- Enclosure: Custom 3D-printed hexagonal case
- Dimensions: 115mm × 105mm × 96mm
- Weight: ~400g with battery
Communication Protocols:
- SPI (Serial Peripheral Interface): LED control
  - MOSI (GPIO 10): Data line
  - SCLK (GPIO 11): Clock line
- I²S (Inter-IC Sound): Microphone array
  - SCK (GPIO 18): Bit clock
  - WS (GPIO 19): Word select (channel selection)
  - SD (GPIO 20, 21, 26, 16): 4 data lines for 7 microphones
Pinout Configuration:
| Signal | GPIO Pin | Physical Pin | Notes |
|---|---|---|---|
| LED_SCLK | GPIO 11 | 23 | LED clock (SPI) |
| LED_MOSI | GPIO 10 | 19 | LED data (SPI) |
| MIC_D0 | GPIO 21 | 40 | Microphone data line 0 |
| MIC_D1 | GPIO 20 | 38 | Microphone data line 1 |
| MIC_D2 | GPIO 26 | 37 | Microphone data line 2 |
| MIC_D3 | GPIO 16 | 36 | Microphone data line 3 |
| MIC_WS | GPIO 19 | 35 | Word select/LRCK |
| MIC_CK | GPIO 18 | 12 | PCM clock |
The client software is responsible for audio capture, wake word detection, and user interaction feedback.
1. Main Client (Final_Project/client/main.py)
- Initializes audio parameters and model
- Creates client instance and manages main loop
2. Client Logic (Final_Project/client/client.py)
- Manages socket connection to server
- Implements state machine for conversation flow
- Handles audio streaming and response playback
3. Wake Word Detection Model (Final_Project/client/model.py)
- Custom MatchboxNet-inspired architecture
- 77,987 trainable parameters (7.27 MB model size)
- Processes audio through MFCC preprocessing
- Uses Jasper blocks with depthwise separable convolutions
4. Voice Activity Detection (Final_Project/client/utils/vad.py)
- Energy-based VAD with exponential smoothing (see the sketch after this list)
- Detects when user stops speaking
- Adaptive threshold for different environments
5. LED Animation System (Final_Project/client/led_sequences/)
- Multiple animation patterns for different states:
- Waiting: Rainbow pulse
- Listening: Rotating orange/cyan
- Processing: Red circular loading
- Thinking: Blue pulse
- Speaking: White breathing
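The energy-based VAD referenced in component 4 above can be sketched as follows; the threshold, smoothing factor, and silence window are illustrative assumptions rather than the values used in client/utils/vad.py.

```python
# Illustrative energy-based VAD with exponential smoothing; parameter values
# are assumptions, not the project's actual configuration.
import numpy as np

class EnergyVAD:
    def __init__(self, threshold: float = 0.01, alpha: float = 0.9, silence_frames: int = 30):
        self.threshold = threshold            # smoothed energy considered "speech"
        self.alpha = alpha                    # exponential smoothing factor
        self.silence_frames = silence_frames  # consecutive quiet frames = "stopped speaking"
        self.energy = 0.0
        self.silent = 0

    def update(self, frame: np.ndarray) -> bool:
        """Feed one float32 audio frame; return True once the user stops speaking."""
        rms = float(np.sqrt(np.mean(frame ** 2)))
        self.energy = self.alpha * self.energy + (1 - self.alpha) * rms
        self.silent = 0 if self.energy > self.threshold else self.silent + 1
        return self.silent >= self.silence_frames
```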
- WAKEWORD: Waiting for wake word detection
- VAD: Listening to user conversation
- VOICE_RECEIVED: Grace period before sending
- STT_DIARIZATION: Server processing audio
- LLM: Server generating response
- TTS: Server converting to speech
- PLAYING: Playing server response
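For illustration, these states can be modelled as a simple enum on the client; the names follow the list above, while the transition order shown is a simplified happy path, not the project's actual state machine.

```python
# Rough sketch of the client-side conversation states listed above.
from enum import Enum, auto

class ClientState(Enum):
    WAKEWORD = auto()         # waiting for "Marvin"
    VAD = auto()              # listening to the user
    VOICE_RECEIVED = auto()   # grace period before sending
    STT_DIARIZATION = auto()  # server processing audio
    LLM = auto()              # server generating response
    TTS = auto()              # server converting to speech
    PLAYING = auto()          # playing the server response

# Simplified happy-path progression through a single exchange
HAPPY_PATH = [
    ClientState.WAKEWORD, ClientState.VAD, ClientState.VOICE_RECEIVED,
    ClientState.STT_DIARIZATION, ClientState.LLM, ClientState.TTS,
    ClientState.PLAYING, ClientState.WAKEWORD,
]
```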
The server implements a sophisticated multi-stage pipeline for audio processing and response generation.
Audio Reception
      ↓
Pre-Processing (WAV conversion)
      ↓
Diarization (Speaker segmentation)
      ↓
Embedding Extraction (Per segment)
      ↓
Speaker Identification
      ↓
Speech-to-Text (Per segment)
      ↓
Aggregation & Context Formation
      ↓
LLM Processing
      ↓
Text-to-Speech
      ↓
Response Transmission
1. Server Core (Final_Project/server/server.py)
- Manages TCP socket connections
- Handles multiple clients sequentially
- Coordinates pipeline execution
- Sends responses with state information
2. Audio Processing (Final_Project/server/audio_processing.py)
- Integrates diarization and STT pipelines
- Manages temporary audio files
- Coordinates speaker identification
3. Speaker Identification (Final_Project/server/utils/speaker_id.py)
- Loads enrolled speaker database
- Extracts embeddings for segments
- Compares with database using cosine similarity
- Threshold-based speaker matching
4. LLM Handler (Final_Project/server/utils/llm_handler.py)
- Loads LLama 8B Instruct model
- Custom prompt engineering for home assistant context
- Generates concise, context-aware responses
5. TTS Handler (Final_Project/server/utils/tts_handler.py)
- Kokoro TTS engine integration
- Phoneme-based speech synthesis
- GPU-accelerated generation
6. Speaker Enrollment (Final_Project/server/enroll_speaker.py)
- Records 20× 3-second utterances per speaker
- Computes averaged embedding centroids
- Stores in JSON database with L2 normalization
- On-device processing: No audio sent to server until wake word detected
- Custom lightweight model: Only 77k parameters
- High accuracy: 94.02% on test set
- Fast inference: ~5ms per inference on Raspberry Pi
- Speaker diarization: Automatic segmentation by speaker
- Speaker identification: Embedding-based recognition
- Contextual understanding: LLM receives full conversation context with speaker labels
- LLama 8B model: 8 billion parameters with quantization
- Custom system prompt: Optimized for home assistant behavior
- Context-aware: Understands multi-turn conversations
- 12 RGB LEDs: Visual indication of system state
- Smooth animations: Professional-looking transitions
- Multiple patterns: Different animations for each state
- Custom protocol: Minimal overhead compared to WebSockets
- Optimized pipeline: Parallel processing where possible
- Efficient serialization: Binary format for audio data
Based on MatchboxNet principles with custom adaptations:
Input Audio (16kHz, 1 second)
      ↓
AudioToMFCCPreprocessor
  - 64 MFCC features
  - n_fft: 512
  - hop_length: 160
      ↓
ConvASREncoder (6 Jasper Blocks)
  Block 1: 64 → 128 channels, k=11
  Block 2: 128 → 64 channels, k=13 (residual)
  Block 3: 64 → 64 channels, k=15 (residual)
  Block 4: 64 → 64 channels, k=17 (residual)
  Block 5: 64 → 128 channels, k=29, dilation=2
  Block 6: 128 → 128 channels, k=1
      ↓
ConvASRDecoderClassification
  - AdaptiveAvgPool1d
  - Linear(128 → 35 classes)
      ↓
Output: Class predictions
Jasper Block Structure:
- Depthwise separable convolution
- Pointwise (1ร1) convolution
- Batch normalization
- ReLU activation
- Optional residual connections
- Dropout regularization
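A hedged PyTorch sketch of one such block is shown below; the class and argument names are illustrative and do not mirror the project's model.py exactly.

```python
# Sketch of a depthwise-separable block with optional residual connection,
# in the spirit of the structure listed above (names are illustrative).
import torch
import torch.nn as nn

class DepthwiseSeparableBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1, dropout=0.1, residual=True):
        super().__init__()
        padding = (kernel_size - 1) // 2 * dilation  # keep the time dimension unchanged
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size, padding=padding,
                                   dilation=dilation, groups=in_ch, bias=False)
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(dropout)
        self.residual = residual
        # 1x1 projection so the residual can be added even when channel counts differ
        self.proj = nn.Conv1d(in_ch, out_ch, kernel_size=1, bias=False) if residual else None

    def forward(self, x):                    # x: (batch, channels, time)
        y = self.bn(self.pointwise(self.depthwise(x)))
        if self.residual:
            y = y + self.proj(x)
        return self.drop(self.act(y))

# Example: Block 5 from the diagram above (64 -> 128 channels, k=29, dilation=2)
block = DepthwiseSeparableBlock(64, 128, kernel_size=29, dilation=2)
out = block(torch.randn(1, 64, 101))         # -> torch.Size([1, 128, 101])
```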
- Dataset: Google Speech Commands V2
- 35 classes
- 105,829 training samples
- Optimizer: Adam
- Loss: CrossEntropyLoss
- Scheduler: ReduceLROnPlateau
- Batch size: 64
- Initial LR: 0.001
- Epochs: 20
- Final accuracy: 94.02%
| Metric | Value |
|---|---|
| Total Parameters | 77,987 |
| Model Size | 7.27 MB |
| Single Sample Inference | ~5ms |
| Avg Batch Inference (64) | 137ms |
| Input Size | 0.06 MB |
| Forward/Backward Pass Size | 1.66 MB |
# For each speaker:
1. Record 20 audio clips (3 seconds each)
2. Extract embedding for each clip using pyannote/embedding
3. Average all embeddings → centroid
4. L2-normalize centroid
5. Store in JSON database with speaker name

Embedding Model: pyannote/embedding
- Pre-trained on VoxCeleb
- 512-dimensional embeddings
- Sliding window: 1.5s with 0.75s step
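The enrollment computation can be sketched as follows, assuming pyannote.audio's Inference API for pyannote/embedding; the clip paths and database filename are placeholders, not the project's actual files.

```python
# Hedged sketch of speaker enrollment: average 20 clip embeddings into an
# L2-normalized centroid and store it in a JSON database (paths are placeholders).
import json
import os
import numpy as np
from pyannote.audio import Inference, Model

model = Model.from_pretrained("pyannote/embedding",
                              use_auth_token=os.environ["HF_AUTH_TOKEN"])
inference = Inference(model, window="whole")   # one embedding per whole clip

clips = [f"enroll/alice_{i:02d}.wav" for i in range(20)]   # 20 x 3 s recordings
embeddings = np.stack([np.asarray(inference(path)).reshape(-1) for path in clips])

centroid = embeddings.mean(axis=0)
centroid /= np.linalg.norm(centroid)           # L2-normalize the averaged embedding

with open("speakers.json", "w") as f:          # placeholder database filename
    json.dump({"alice": centroid.tolist()}, f)
```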
# For each diarized segment:
1. Extract embedding using the same model
2. L2-normalize embedding
3. Compute cosine similarity with all enrolled speakers:
   similarity = dot_product(segment_emb, speaker_centroid)
4. If max_similarity > threshold (e.g., 0.75):
     → Assign speaker name
   else:
     → Label as "Unknown"

Advantages:
- Fast: O(n) for n enrolled speakers
- Robust: Averaged centroids reduce noise
- Scalable: Can handle many speakers
- Privacy-preserving: Embeddings stored locally
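The matching step itself is plain vector arithmetic; the sketch below assumes the speakers.json layout from the enrollment sketch above, with 0.75 as the example threshold.

```python
# Illustrative cosine-similarity matching against the enrolled-speaker database.
import json
import numpy as np

with open("speakers.json") as f:
    db = {name: np.asarray(vec) for name, vec in json.load(f).items()}

def identify(segment_emb: np.ndarray, threshold: float = 0.75) -> str:
    """Return the enrolled speaker whose centroid is most similar, or 'Unknown'."""
    emb = segment_emb / np.linalg.norm(segment_emb)
    # Centroids are already L2-normalized, so a dot product is the cosine similarity
    scores = {name: float(np.dot(emb, centroid)) for name, centroid in db.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else "Unknown"
```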
Model: OpenAI Whisper Large v3 Turbo
- Parameters: 809M
- VRAM requirement: ~6GB
- Features:
- Handles varying audio quality
- Robust to background noise
- Multilingual support (though English used)
- Punctuation and formatting
Processing Strategy:
- Process each diarized segment individually
- Maintains speaker context
- Aggregates transcripts with timestamps
- Improves accuracy with shorter, focused segments
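One plausible way to run this model on a single diarized segment is through the Hugging Face transformers pipeline; the options below are assumptions about the setup, not the project's exact configuration.

```python
# Hedged sketch: transcribe one diarized segment with Whisper Large v3 Turbo.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device="cuda:0",
)

# Shorter, speaker-focused clips as described above (path is a placeholder)
result = asr("segment_speaker1_0003.wav")
print(result["text"])
```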
Model: Meta-Llama-3-8B-Instruct
- Parameters: 8 billion
- Quantization: Hybrid quantization for memory efficiency
- VRAM usage: 6-8GB
- Context window: Handles full conversation history
Custom System Prompt:
You are Marvin, a helpful home speaker assistant.
You are having a conversation with multiple people in a room.
Provide concise, friendly responses that are natural when spoken aloud.
Keep responses brief (1-3 sentences) unless more detail is explicitly requested.
Context Format:
[Speaker1]: Hello Marvin, what's the weather today?
[Speaker2]: Also, can you remind me about my meeting?
Response Generation:
- Streaming not used (full response generated)
- Temperature tuning for natural speech
- Max tokens limited for brevity
- Context includes speaker labels
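To illustrate how the system prompt and speaker-labelled context come together, here is a hedged transformers sketch; the loading options, sampling parameters, and example turns are assumptions, not the project's exact configuration.

```python
# Sketch of response generation with Meta-Llama-3-8B-Instruct via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                             device_map="auto")

messages = [
    {"role": "system", "content": "You are Marvin, a helpful home speaker assistant. "
                                  "Keep responses brief (1-3 sentences)."},
    # Speaker labels from diarization/identification are embedded in the user turn
    {"role": "user", "content": "[Alice]: Marvin, what's the weather like?\n"
                                "[Bob]: Can you tell us a joke?"},
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                          return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=96, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```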
Model: Kokoro TTS
- Parameters: 82M
- Voice: "af" (configurable via environment)
- Processing: GPU-accelerated
- Output: 16kHz WAV format
Pipeline:
Text Input
    ↓
Phoneme Sequence Generation
    ↓
Neural Vocoder
    ↓
Waveform Synthesis
    ↓
Audio Output (16kHz, mono)
Advantages:
- Natural-sounding speech
- Fast generation (~100-250ms for typical response)
- Consistent voice quality
- Low latency for real-time use
- Python 3.8+
- CUDA-capable GPU (16GB+ VRAM recommended)
- Ubuntu 20.04+ or similar Linux distribution
- Clone the repository:
git clone https://github.com/MirkoMorello/MSc_ICT.git
cd MSc_ICT/Final_Project
- Create virtual environment:
python3 -m venv venv
source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
- Set up environment variables:
# Create .env file in Final_Project directory
cat > .env << EOF
HF_AUTH_TOKEN=your_huggingface_token_here
KOKORO_LANG_CODE=a
EOF
- Enroll speakers (optional but recommended):
cd server
python enroll_speaker.py
# Follow prompts to record 20 samples per speaker
- Start the server:
cd server
python main.py

The server will start listening on 0.0.0.0:8080.
- Raspberry Pi 3 Model B+ or newer
- ReSpeaker 7-mic array
- Raspbian OS (Bullseye or newer)
- Speakers connected to audio jack
- Internet connection
- Clone repository on Pi:
git clone https://github.com/MirkoMorello/MSc_ICT.git
cd MSc_ICT/Final_Project
- Install system dependencies:
sudo apt-get update
sudo apt-get install python3-pip python3-pyaudio portaudio19-dev
sudo apt-get install libatlas-base-dev  # For NumPy
- Install Python dependencies:
pip3 install -r requirements.txt
- Configure server address:
# Edit client/utils/config.py
nano client/utils/config.py
# Set SERVER_ADDRESS to your server's IP
- Run client:
cd client
python3 main.py
- Wake the Assistant:
  - Say "Marvin" clearly
  - LEDs will show rainbow pattern when listening
  - Wait for acknowledgment sound
- Ask Your Question:
  - Speak naturally after the wake word
  - LEDs show listening state (orange/cyan rotation)
  - System detects when you stop speaking
- Processing:
  - LEDs show red loading animation
  - Server processes audio
  - Blue pulse indicates LLM thinking
- Response:
  - White breathing pattern during speech
  - Audio plays through speakers
  - Returns to wake word listening after response
Single User:
User: "Marvin, what's the capital of France?"
Marvin: "The capital of France is Paris, known for the Eiffel
Tower and rich cultural history."
Multiple Users:
Alice: "Marvin, what's the weather like?"
Marvin: "I don't have real-time weather data, but I can help
with other questions."
Bob: "Can you tell us a joke?"
Marvin: "Sure! Why don't scientists trust atoms? Because they
make up everything!"
Say goodbye phrases to end:
- "Goodbye Marvin"
- "That's all"
- "Thank you, bye"
- "See you later"
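Since termination is keyword-based, a trivial check along these lines is all that is involved (the phrase list mirrors the examples above; the function name is illustrative):

```python
# Illustrative goodbye-phrase check on the final transcript.
GOODBYE_PHRASES = ("goodbye marvin", "that's all", "thank you, bye", "see you later")

def is_goodbye(transcript: str) -> bool:
    """True if the user's last utterance signals the end of the conversation."""
    text = transcript.lower()
    return any(phrase in text for phrase in GOODBYE_PHRASES)
```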
| Metric | Value |
|---|---|
| Test Accuracy | 94.02% |
| Model Size | 7.27 MB |
| Inference Time (RPi 3) | ~5ms |
| False Positive Rate | ~6% |
| False Negative Rate | ~6% |
Average Processing Times (per conversation):
| Component | Time (ms) | Notes |
|---|---|---|
| Diarization | 2000-6000 | Varies with audio length |
| STT (per segment) | 500-1500 | Depends on segment length |
| Speaker ID | 50-200 | Fast embedding comparison |
| LLM Generation | 2000-8000 | Depends on response length |
| TTS | 100-250 | Fast synthesis |
| Total Pipeline | 5000-17000 | ~5-17 seconds total |
Throughput:
- Single conversation: ~5-17 seconds end-to-end
- Can handle 1 client at a time (sequential processing)
- Low network latency: <100ms for audio transfer
Server (GPU):
- VRAM: ~14-16GB (all models loaded)
- CPU: 4-8 cores recommended
- RAM: 16GB+ recommended
- Disk: ~20GB for models
Client (Raspberry Pi 3):
- CPU: ~15-25% during wake word detection
- RAM: ~200MB
- Storage: ~100MB (including model)
- Context-Aware Conversation Termination
  - Replace keyword-based termination with sentiment analysis
  - Use BERT-based models (e.g., RoBERTa from Meta)
  - Infer natural conversation endpoints
- Smart Home Integration
  - Agent-based framework for device control
  - API integration for:
    - Lights (Philips Hue, etc.)
    - Thermostats
    - Door locks
    - Entertainment systems
  - Time and weather information
  - Calendar and reminder management
- Performance Optimization
  - Streaming TTS for lower latency
  - Model quantization (INT8) for faster inference
  - Parallel processing for multiple clients
  - Caching for common queries
- Enhanced Privacy
  - On-device STT for sensitive queries
  - Encrypted audio transmission
  - Local LLM option for privacy-conscious users
  - User data deletion policies
- Improved Hardware
  - Upgrade to Raspberry Pi 4/5 for better performance
  - Add hardware button for wake-up
  - Battery level indicator
  - Better speaker quality
- Multilingual Support
  - Multiple wake words for different languages
  - Language detection in STT
  - Multilingual TTS voices
Problem: Commercial models (85M parameters) too large for Raspberry Pi
Solution: Custom MatchboxNet-inspired architecture (77k parameters, 7.27 MB)
Result: 94% accuracy with 5ms inference time
Problem: Need to distinguish between different users
Solution: Enrollment system + embedding-based identification
Result: Robust speaker identification with <200ms overhead
Problem: Multiple heavy models causing delays
Solution:
- GPU acceleration for all models
- Optimized pipeline flow
- Custom binary protocol
Result: 5-17 second total latency (acceptable for conversational AI)
Problem: Complex electrical connections and protocols
Solution:
- Custom 3D-printed enclosure
- Proper SPI and I²S protocol implementation
- Comprehensive testing suite
Result: Reliable hardware operation with visual feedback
Problem: Socket disconnections and timeouts
Solution:
- Automatic reconnection logic
- Timeout handling
- Error recovery mechanisms
Result: Stable client-server communication
- Whisper Large v3 Turbo: OpenAI Whisper
- LLama 3 8B: Meta LLama
- Pyannote Diarization: pyannote.audio
- Kokoro TTS: Speech synthesis engine
- MatchboxNet: NVIDIA NeMo
- PyTorch: Deep learning framework
- Transformers: Hugging Face library
- PyAudio: Audio I/O
- NumPy: Numerical computing
- SciPy: Scientific computing
- Raspberry Pi: Single-board computer
- ReSpeaker: Microphone array
- WS2812B: Addressable RGB LEDs
This project is licensed under the MIT License - see the LICENSE file for details.
- Mirko Morello (920601)
- Hardware design & integration
- Client software development
- Wake word model training
- LED animation system
- Andrea Yachaya (913721)
- Server pipeline development
- Model integration
- Speaker identification system
- System architecture
- Università degli Studi di Milano-Bicocca for academic support
- Course: Artificial Intelligence For Science and Technology (A.Y. 2024/2025)
- Open Source Community for excellent tools and libraries
For questions, issues, or contributions, please open an issue on the GitHub repository.
Repository: https://github.com/MirkoMorello/MSc_ICT
Made with ❤️ for AI Education
Marvin - Your Friendly AI Assistant