EdgeGPT is a chat application that runs inference against local LLMs. It combines a React frontend with a FastAPI Python backend to serve local AI models.
- Chat with local Mistral-7B and CodeLlama-7B models
- Switch between text and code generation models seamlessly
- Enhanced sequential thinking for step-by-step problem solving
- Responsive UI built with React, TypeScript, and Tailwind CSS
- Streaming responses for real-time interaction
- Conversation history management and persistence
The project is structured as follows:
```
EdgeGPT/
├── src/                          # Frontend React code
│   ├── components/               # UI components
│   ├── utils/                    # Utility functions
│   ├── hooks/                    # React hooks
│   ├── lib/                      # Library code
│   ├── providers/                # React context providers
│   ├── pages/                    # App pages
│   ├── App.tsx                   # Main React application component
│   ├── ChatInterface.tsx         # Chat interface component
│   ├── main.tsx                  # Main entry point for React
│   ├── sequentialThinkingServer.ts  # Sequential thinking server implementation
│   └── index.css                 # Global CSS styles
├── public/                       # Static assets
├── models/                       # Local model files (not included in repo)
├── chat_history/                 # Saved chat conversations
├── mcp/                          # Model Context Protocol related resources
│   └── thinking/                 # Sequential thinking implementation resources
├── run.py                        # FastAPI backend server for model inference
├── server.js                     # Node.js server for additional features
├── modelService.ts               # TypeScript service for model interaction
├── start-servers.ps1             # PowerShell script to start all servers on Windows
├── start-servers.sh              # Shell script to start all servers on macOS/Linux
├── package.json                  # Node.js dependencies
├── tailwind.config.ts            # Tailwind CSS configuration
├── vite.config.ts                # Vite build configuration
└── README.md                     # Project documentation
```
The codebase has been cleaned up to:
- Remove duplicate files (ChatInterface.tsx, modelService.ts)
- Remove Python cache files (`__pycache__/`)
- Remove temporary installation files (get-pip.py)
- Update .gitignore to exclude appropriate files
- Organize the codebase structure for better maintainability
- Node.js 16+
- npm or yarn
- TypeScript
- Vite
- Python 3.9+
- FastAPI
- uvicorn
- llama-cpp-python
- Install dependencies: `npm install`
- Start the development server: `npm run dev`
- Navigate to the models directory: `cd models`
- Create a virtual environment (optional but recommended): `python -m venv venv`
- Activate the virtual environment:
  - Windows: `venv\Scripts\activate`
  - macOS/Linux: `source venv/bin/activate`
- Install required packages: `pip install fastapi uvicorn pydantic psutil`
- Install llama-cpp-python:
  - Windows (with prebuilt wheel): `pip install llama_cpp_python-0.2.90-cp312-cp312-win_amd64.whl`
  - macOS/Linux: `pip install llama-cpp-python`
- Download the required model files:
  - Create a `models` directory in the project root if it doesn't exist
  - Download the following models from Hugging Face or other trusted sources:
    - Mistral-7B-Instruct: `mistral-7b-instruct-v0.1.Q4_K_M.gguf`
    - CodeLlama-7B-Instruct: `codellama-7b-instruct.Q4_K_M.gguf`
  - Place the downloaded model files in the `models` directory

  Example download commands (replace with actual URLs):

  ```bash
  # For Mistral-7B
  wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf -O models/mistral-7b-instruct-v0.1.Q4_K_M.gguf

  # For CodeLlama-7B
  wget https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGUF/resolve/main/codellama-7b-instruct.Q4_K_M.gguf -O models/codellama-7b-instruct.Q4_K_M.gguf
  ```
- Verify model paths in `run.py`:
  - Open `run.py` in an editor
  - Ensure the model paths match your downloaded model filenames:

  ```python
  # Example paths in run.py
  TEXT_MODEL_PATH = "models/mistral-7b-instruct-v0.1.Q4_K_M.gguf"
  CODE_MODEL_PATH = "models/codellama-7b-instruct.Q4_K_M.gguf"
  ```
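Before starting the full backend, you can confirm that a downloaded model file actually loads by using `llama-cpp-python` directly. The snippet below is a hedged sketch, not part of the project: the model path is taken from the example paths above, and the context size and prompt wording are illustrative assumptions independent of whatever `run.py` does.

```python
# smoke_test_model.py - minimal sketch: load a GGUF model with llama-cpp-python
# and generate a few tokens. Assumes the Mistral file listed above exists under
# models/; adjust MODEL_PATH if your filename differs.
from pathlib import Path

from llama_cpp import Llama

MODEL_PATH = Path("models/mistral-7b-instruct-v0.1.Q4_K_M.gguf")  # assumed filename

if not MODEL_PATH.exists():
    raise SystemExit(f"Model file not found: {MODEL_PATH}")

# A small context window keeps the check quick; run.py may use other settings.
llm = Llama(model_path=str(MODEL_PATH), n_ctx=512, verbose=False)

out = llm("[INST] Say hello in one short sentence. [/INST]", max_tokens=32)
print(out["choices"][0]["text"].strip())
```

If this prints a short greeting, the model file and `llama-cpp-python` installation are working.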
You can start all servers at once using the provided scripts:
- Windows: `./start-servers.ps1`
- macOS/Linux: `./start-servers.sh`
If you prefer to start each component individually:
- Start the Local Models API Server: `python run.py`
  The server will run on http://localhost:8000
- Start the Sequential Thinking Server: `npm run thinking-server`
  The thinking server will run on http://localhost:8001
- Start the Frontend: `npm run dev`
  The frontend will be available at http://localhost:5173
- Access the Application: open your browser and navigate to http://localhost:5173
Common commands at a glance:

```bash
npm run dev
cd models; python run.py
python -m uvicorn run:app --host 0.0.0.0 --port 8000 --reload
npm run thinking-server
.\start-AgenTick-server.ps1
```
By default, the application connects to the following endpoints:
- Model API: http://localhost:8000
- Thinking Server: http://localhost:8001
If you need to change these, you can modify the connection settings in:
- `src/lib/api.ts` for API endpoints
- `src/sequentialThinkingServer.ts` for thinking server configuration
- Model Loading Errors:
  - Verify model files exist in the `models/` directory
  - Check that file paths in `run.py` match your actual model filenames
  - Ensure you have enough RAM (at least 8GB recommended for 7B models)
- Connection Errors:
  - Verify all servers are running
  - Check console outputs for error messages
  - Ensure ports 8000, 8001, and 5173 are not in use by other applications (see the port-check sketch after this list)
- Performance Issues:
  - Close unnecessary applications to free up memory
  - Consider using smaller model quantizations if responses are too slow
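If you suspect a port conflict, one quick check is to probe the three default ports mentioned above. This helper is a hedged sketch and not part of the repository; the port numbers come from the defaults documented in this README.

```python
# check_ports.py - minimal sketch: report whether the default EdgeGPT ports
# (8000: model API, 8001: thinking server, 5173: Vite dev server) are in use.
import socket

PORTS = {8000: "model API", 8001: "thinking server", 5173: "frontend (Vite)"}

for port, name in PORTS.items():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # connect_ex returns 0 if something is already listening on the port
        in_use = sock.connect_ex(("127.0.0.1", port)) == 0
    print(f"Port {port} ({name}): {'in use' if in_use else 'free'}")
```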
Once you have all components running, perform these checks to verify everything is working correctly:
- API Server Test: `curl http://localhost:8000/status`
  You should receive a JSON response with server status information.
- Thinking Server Test: `curl http://localhost:8001/health`
  You should receive a response indicating the thinking server is healthy.
- End-to-End Test:
  - Open the web interface at http://localhost:5173
  - Start a new conversation
  - Type a simple query like "Hello, how are you?"
  - Verify you receive a response from the model
  - Try enabling sequential thinking for a more complex query
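The first two checks can also be scripted. The sketch below is a convenience, not part of the repository: it relies only on the `/status` and `/health` endpoints documented here and prints whatever each server returns.

```python
# check_servers.py - minimal sketch: ping the model API and thinking server
# endpoints documented in this README and print their responses.
import requests  # pip install requests

CHECKS = [
    ("Model API", "http://localhost:8000/status"),
    ("Thinking server", "http://localhost:8001/health"),
]

for name, url in CHECKS:
    try:
        resp = requests.get(url, timeout=5)
        print(f"{name}: HTTP {resp.status_code} -> {resp.text[:200]}")
    except requests.RequestException as exc:
        print(f"{name}: not reachable ({exc})")
```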
If all components are working correctly, you should be able to:
- Switch between text and code models
- Send messages and receive responses
- Use sequential thinking for complex queries
- Save and load conversations
The application uses two quantized models:
- Mistral-7B-Instruct (Text generation)
  - Used for general text responses
  - Optimized for instruction following
  - File: `mistral-7b-instruct-v0.1.Q4_K_M.gguf`
- CodeLlama-7B-Instruct (Code generation)
  - Specialized for generating code samples
  - Responds to programming-related queries
  - File: `codellama-7b-instruct.Q4_K_M.gguf`
You can switch between the text and code models using the dropdown in the chat header. The application will automatically use the appropriate model based on your selection.
The FastAPI server provides the following endpoints:
- `GET /` - Check server status and current model
- `GET /status` - Get server status information including memory usage and uptime
- `POST /generate` - Generate text from the current model
- `POST /generate_stream` - Stream text generation token by token
- `GET /conversations` - Get all saved conversations
- `GET /conversation/{conversation_id}` - Get a specific conversation
- `POST /conversation` - Create a new conversation
- `DELETE /conversation/{conversation_id}` - Delete a conversation
- `POST /switch-model` - Switch between text and code models
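As an illustration of how a script might call these endpoints, here is a hedged Python sketch. The endpoint paths come from the list above, but this README does not document the request or response schemas, so the JSON field names used below (`prompt`, `max_tokens`, `model_type`) are assumptions; check `run.py` and `src/lib/api.ts` for the actual payloads.

```python
# api_example.py - hedged sketch of calling the EdgeGPT backend from Python.
# Endpoint paths are from the README; the JSON field names are guesses and may
# not match the actual schemas in run.py.
import requests  # pip install requests

BASE_URL = "http://localhost:8000"

# Check server status and current model
print(requests.get(f"{BASE_URL}/status", timeout=5).json())

# Switch to the code model (field name "model_type" is an assumption)
requests.post(f"{BASE_URL}/switch-model", json={"model_type": "code"}, timeout=30)

# Generate a completion (field names "prompt" / "max_tokens" are assumptions)
resp = requests.post(
    f"{BASE_URL}/generate",
    json={"prompt": "Write a Python function that reverses a string.", "max_tokens": 256},
    timeout=120,
)
print(resp.json())

# Stream a completion token by token; the exact wire format is not documented
# here, so this simply prints raw chunks as they arrive.
with requests.post(
    f"{BASE_URL}/generate_stream",
    json={"prompt": "Explain recursion in one paragraph.", "max_tokens": 128},
    stream=True,
    timeout=120,
) as stream:
    for chunk in stream.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
print()
```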
This project includes an optimized implementation of the Model Context Protocol (MCP) sequential thinking server, allowing for step-by-step reasoning without requiring internet access.
- High-performance implementation of sequential thinking capability
- Works with existing local models
- Provides a visual representation of the thinking process
- Supports branching and revisions in the thinking process
- Automatic timeout handling and performance optimization
The sequential thinking feature allows the AI to break down complex problems into discrete steps, showing its reasoning process before providing a final answer. This is especially useful for:
- Mathematical or logical problems
- Algorithm design and analysis
- Multi-step reasoning tasks
- Complex decision-making scenarios
- Start a new conversation
- Click the sparkle (✨) icon next to the send button
- Check the "Sequential thinking" checkbox
- Type your question and send it
For optimal results when using sequential thinking:
- Ask clear, well-defined questions
- For complex problems, provide all necessary information in one message
- Be patient as the system works through the thinking steps
- Limit to 3-5 thoughts for best performance
- Use shorter prompts for faster processing
If sequential thinking is not working properly:
- Ensure all servers are running (thinking server, model server, and web server)
- Check the server console output for any errors
- Try resetting the thinking server using: `curl -X POST http://localhost:8001/reset`
- If thoughts seem repetitive, you can force completion with: `curl -X POST http://localhost:8001/force-complete`
- Verify server status with: `curl http://localhost:8001/health`
The sequential thinking feature is optimized for performance but may still require more processing time than standard responses. Each thought typically takes 2-5 seconds to generate, so a complete sequence might take 10-30 seconds.
The sequential thinking feature allows the AI to break down complex problems into steps. When enabled:
- The AI first processes the prompt through a series of thought steps
- Each thought builds on previous ones
- The AI can revise or branch from previous thoughts
- After completing the thinking process, it provides a final answer
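To make this flow concrete, here is a deliberately simplified Python sketch of the general "think in steps, then answer" pattern. It is not the project's implementation (which lives in `src/sequentialThinkingServer.ts` and the `mcp/thinking/` resources) and omits branching and revision; the prompt wording, step count, and direct use of `llama_cpp` are illustrative assumptions.

```python
# sequential_thinking_sketch.py - illustrative only: a generic step-by-step
# reasoning loop against a local GGUF model. Not the EdgeGPT implementation.
from llama_cpp import Llama

llm = Llama(model_path="models/mistral-7b-instruct-v0.1.Q4_K_M.gguf", n_ctx=2048, verbose=False)

def ask(prompt: str, max_tokens: int = 160) -> str:
    out = llm(f"[INST] {prompt} [/INST]", max_tokens=max_tokens)
    return out["choices"][0]["text"].strip()

def solve(question: str, max_thoughts: int = 4) -> str:
    thoughts: list[str] = []
    for step in range(1, max_thoughts + 1):
        context = "\n".join(f"Thought {i + 1}: {t}" for i, t in enumerate(thoughts))
        # Each new thought is generated with all previous thoughts in context,
        # so it builds on the earlier reasoning.
        thought = ask(
            f"Problem: {question}\n{context}\n"
            f"Write thought {step}: one short reasoning step that builds on the previous thoughts."
        )
        thoughts.append(thought)
    # After the thinking phase, ask for a final answer grounded in the thoughts.
    return ask(
        f"Problem: {question}\n"
        + "\n".join(f"Thought {i + 1}: {t}" for i, t in enumerate(thoughts))
        + "\nUsing these thoughts, give the final answer."
    )

if __name__ == "__main__":
    print(solve("A train travels 120 km in 1.5 hours. What is its average speed?"))
```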
MIT