A Python-based semantic search system that finds relevant transcript chunks for user queries. It supports TF-IDF and Hugging Face LLM (llm2) search methods, with a Streamlit web interface and a CLI for interactive querying. Results are printed in the format `[timestamp], <chunk>` and logged to `output/output.txt`.
- Two Search Methods:
  - TF-IDF: Keyword-based search using scikit-learn's `TfidfVectorizer`.
  - Hugging Face LLM (llm2): Semantic search using `sentence-transformers` (`all-MiniLM-L6-v2`).
- Web Interface: Streamlit app (`app.py`) for uploading transcripts and querying.
- CLI: Interactive command-line interface (`transcript.py`) for querying.
- Output Logging: Saves queries and results to `output/output.txt` in the format:

  ```
  Question: What is artificial intelligence?
  Method: tfidf
  Output: [00:41 - 00:44], So let's start with artificial intelligence.
  ```

- Transcript Format: Supports the `[MM:SS - MM:SS] text` format.
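An illustrative snippet of a transcript file in this format (the lines mirror the sample results shown later in this README; the actual sample file is `data/transcript.txt`):

```
[00:41 - 00:44] So let's start with artificial intelligence.
[00:44 - 00:48] The term artificial intelligence was first coined in the year 1956.
```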
Gemini API: This approach was not used in the final system. I do have a Gemini API key, but the code would not run with it, so I removed the `search_llm1.py` module; the implementation is preserved below for reference.
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from utils_transcript import TranscriptLoader


class GeminiSearcher:
    def __init__(self, chunks, timestamps, model_name="models/embedding-001"):
        """
        Initialize Gemini searcher with transcript chunks.

        Args:
            chunks (list): List of text chunks.
            timestamps (list): List of corresponding timestamps.
            model_name (str): Gemini embedding model name.
        """
        self.chunks = chunks
        self.timestamps = timestamps
        self.model_name = model_name
        self.embeddings = None
        self.filtered_chunks = []  # Initialized so search() is safe if loading fails
        self.MIN_WORDS = 3
        self.loader = TranscriptLoader("")  # For clean_text method only
        self._load_model_and_embed()

    def _load_model_and_embed(self):
        """Load the Gemini embedding model and create embeddings."""
        print(f"Loading Gemini model: {self.model_name}")
        try:
            # Initialize Gemini embeddings
            self.model = GoogleGenerativeAIEmbeddings(model=self.model_name)
            # Filter chunks by minimum word count
            self.filtered_indices = [
                i for i, ch in enumerate(self.chunks)
                if len(self.loader.clean_text(ch).split()) >= self.MIN_WORDS
            ]
            self.filtered_chunks = [
                (self.timestamps[i], self.chunks[i]) for i in self.filtered_indices
            ]
            if not self.filtered_chunks:
                print("Warning: No chunks meet minimum word count.")
                return
            # Create embeddings for filtered chunks
            print("Creating embeddings for transcript chunks...")
            chunk_texts = [self.loader.clean_text(ch) for _, ch in self.filtered_chunks]
            self.embeddings = np.array(self.model.embed_documents(chunk_texts))
            print(f"Created embeddings with shape: {self.embeddings.shape}")
        except Exception as e:
            print(f"Error loading Gemini model or creating embeddings: {e}")
            self.embeddings = None

    def search(self, query, top_k=1, similarity_threshold=0.3):
        """
        Search for the most relevant chunk using Gemini embeddings.

        Args:
            query (str): User question.
            top_k (int): Number of top results to return.
            similarity_threshold (float): Minimum similarity score.

        Returns:
            tuple: (timestamp, chunk) of the most relevant chunk, or None.
        """
        if not query.strip():
            return None
        if self.embeddings is None or not self.filtered_chunks:
            print("Error: Embeddings not initialized or no valid chunks.")
            return None
        try:
            # Preprocess and embed the query
            query = self.loader.clean_text(query)
            query_embedding = np.array(self.model.embed_query(query)).reshape(1, -1)
            # Calculate cosine similarity against all chunk embeddings
            similarities = cosine_similarity(query_embedding, self.embeddings).flatten()
            # Debug: print per-chunk similarities
            print("Gemini Similarities:", [
                (ts, f"{score:.4f}")
                for (ts, _), score in zip(self.filtered_chunks, similarities)
            ])
            # Get the top match
            top_indices = np.argsort(similarities)[::-1][:top_k]
            best_idx = top_indices[0]
            best_score = similarities[best_idx]
            if best_score < similarity_threshold:
                print(f"No relevant chunk found (score: {best_score:.2f})")
                return None
            print(f"Best match similarity: {best_score:.4f}")
            return self.filtered_chunks[best_idx]
        except Exception as e:
            print(f"Error during Gemini search: {e}")
            return None
```
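If the Gemini route is revisited, usage would look roughly like the following sketch (the values are illustrative; it assumes a valid `GOOGLE_API_KEY` in the environment, which `langchain_google_genai` reads by default):

```python
# Illustrative usage sketch; assumes GOOGLE_API_KEY is set in the environment.
chunks = [
    "So let's start with artificial intelligence.",
    "The term artificial intelligence was first coined in the year 1956.",
]
timestamps = ["00:41 - 00:44", "00:44 - 00:48"]

searcher = GeminiSearcher(chunks, timestamps)
result = searcher.search("What is artificial intelligence?")
if result:
    timestamp, chunk = result
    print(f"[{timestamp}], {chunk}")  # e.g. [00:41 - 00:44], So let's start...
```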
The system processes transcript files by:

- Loading and Chunking: `utils.py` parses transcripts into timestamped chunks using regex, cleaning the text for consistency (see the parsing sketch after this list).
- Search Methods:
  - TF-IDF: Builds a vector index of the chunks and uses cosine similarity to match queries (see the TF-IDF sketch after this list).
  - Hugging Face LLM: Encodes chunks into embeddings and uses semantic similarity for matching (see the embedding sketch after this list).
- Interfaces:
  - CLI: Interactive loop in `transcript.py` for querying and logging.
  - Web: Streamlit interface in `app.py` for file uploads and querying.
- Output: Returns the most relevant chunk as `[timestamp], <chunk>`, logged to `output/output.txt`.
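To make the parsing step concrete, here is a minimal sketch of regex-based chunking; the function name `load_chunks` is illustrative, and the actual `utils.py` implementation may differ:

```python
import re

# Matches lines like "[00:41 - 00:44] So let's start with artificial intelligence."
LINE_RE = re.compile(r"\[(\d{2}:\d{2}\s*-\s*\d{2}:\d{2})\]\s*(.+)")

def load_chunks(path):
    """Parse '[MM:SS - MM:SS] text' lines into parallel timestamp/chunk lists."""
    timestamps, chunks = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            match = LINE_RE.match(line.strip())
            if match:
                timestamps.append(match.group(1))
                chunks.append(match.group(2).strip())
    return timestamps, chunks
```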
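The TF-IDF method boils down to vectorizing the chunks once and comparing the query vector against that index; a minimal sketch (the real `search_tfidf.py` may be structured differently, e.g. with filtering and thresholds):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_search(query, chunks, timestamps):
    """Return the (timestamp, chunk) pair whose TF-IDF vector best matches the query."""
    vectorizer = TfidfVectorizer()
    chunk_matrix = vectorizer.fit_transform(chunks)  # index all chunks once
    query_vector = vectorizer.transform([query])     # project the query into the same space
    scores = cosine_similarity(query_vector, chunk_matrix).flatten()
    best = scores.argmax()
    return timestamps[best], chunks[best]
```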
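The Hugging Face method works the same way but with dense sentence embeddings; a minimal sketch using `all-MiniLM-L6-v2` (again, the real `search_hf.py` may differ in structure):

```python
from sentence_transformers import SentenceTransformer, util

def llm2_search(query, chunks, timestamps):
    """Return the (timestamp, chunk) pair with the highest semantic similarity to the query."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunk_embeddings = model.encode(chunks, convert_to_tensor=True)
    query_embedding = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
    best = int(scores.argmax())
    return timestamps[best], chunks[best]
```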
Requirements:

- Python 3.8+
- Required packages (listed in `requirements.txt`):

  ```
  scikit-learn==1.5.2
  sentence-transformers==3.1.1
  streamlit==1.39.0
  torch==2.4.1
  numpy==1.26.4
  ```
Installation:

- Clone the repository:

  ```bash
  git clone https://github.com/your-repo/semantic-search-transcript.git
  cd semantic-search-transcript
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Project structure:

```
semantic-search-transcript/
├── src/
│   ├── transcript.py        # CLI entry point
│   ├── utils.py             # Transcript loading and preprocessing
│   ├── search_tfidf.py      # TF-IDF search implementation
│   └── search_hf.py         # Hugging Face LLM search implementation
├── data/
│   └── transcript.txt       # Sample transcript file
├── output/
│   ├── output.txt           # Search results log
│   └── screenshots/         # Screenshots (optional)
├── app.py                   # Streamlit web interface
├── requirements.txt         # Dependencies
└── README.md                # This file
```
Run the CLI to query a transcript file interactively.

Command:

```bash
python src/transcript.py data/transcript.txt <method>
```

Replace `<method>` with `tfidf` or `llm2`.

Example:

```bash
python src/transcript.py data/transcript.txt tfidf
```

Steps:

- At the prompt (`Transcript loaded, please ask your question (press 8 for exit):`), enter a question, e.g., `What is artificial intelligence?`.
- View the result, e.g., `[00:41 - 00:44], So let's start with artificial intelligence.`
- Press `8` to exit.
- Check `output/output.txt` for logged results.
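The logging step amounts to appending three labeled lines per query in the format shown earlier; a minimal sketch of such a helper (the name `log_result` is illustrative, and the actual code in `transcript.py` may differ):

```python
def log_result(question, method, timestamp, chunk, path="output/output.txt"):
    """Append a query and its result to the output log in the documented format."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"Question: {question}\n")
        f.write(f"Method: {method}\n")
        f.write(f"Output: [{timestamp}], {chunk}\n\n")
```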
Run the Streamlit app for a browser-based interface. Hosted app: https://semanticsearchtranscript.streamlit.app

Download `transcript.txt` from the repo's `data` folder, drag and drop it into the app at the link above, and ask questions using the methods described earlier.

To run the app locally:

Command:

```bash
streamlit run app.py
```

Steps:

- Open http://localhost:8501 in your browser.
- Upload `data/transcript.txt`.
- Select a search method (`TF-IDF` or `Hugging Face LLM (llm2)`).
- Enter a question, e.g., `What is artificial intelligence?`.
- Click `Search` to view results.
- Check `output/output.txt` for logged results.
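For orientation, the upload/select/query flow above maps onto a handful of Streamlit widgets; a stripped-down sketch (illustrative only; the real `app.py` wires in the search classes described earlier):

```python
import streamlit as st

st.title("Semantic Search for Transcripts")

uploaded = st.file_uploader("Upload a transcript", type="txt")
method = st.selectbox("Search method", ["TF-IDF", "Hugging Face LLM (llm2)"])
question = st.text_input("Enter a question")

if st.button("Search") and uploaded and question:
    text = uploaded.read().decode("utf-8")
    # ...parse the transcript and run the selected searcher here...
    st.write("Result appears here as [timestamp], <chunk>")
```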
Test the following questions to verify functionality:
- What is the definition of artificial intelligence as explained in the transcript?
- When was the term artificial intelligence first introduced?
- How does machine learning contribute to artificial intelligence?
- What are some practical applications of deep learning mentioned in the transcript?
- Can you explain the difference between artificial intelligence and traditional programming?
- For `What is artificial intelligence?`: `[00:41 - 00:44], So let's start with artificial intelligence.`
- For `When was AI first coined?`: `[00:44 - 00:48], The term artificial intelligence was first coined in the year 1956.`
- CLI: Run with both methods; check the console output and `output/output.txt`.
- Web: Test both methods, capture screenshots, and verify `output/output.txt`.
- Logs: Ensure `output/output.txt` contains entries like:

  ```
  Question: What is artificial intelligence?
  Method: tfidf
  Output: [00:41 - 00:44], So let's start with artificial intelligence.
  ```
Copyright © Aditya Arya.
For questions or issues, please open a GitHub issue or contact arya050411@gmail.com.
Developed by Aditya Arya.




