This project trains word-to-vector embeddings in Golang using the skip-gram model.
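For readers new to skip-gram: the model learns by predicting the words that surround each center word. The sketch below shows one way (center, context) training pairs could be generated from a token sequence. The names `Pair` and `SkipGramPairs` and the window size are illustrative assumptions, not this project's actual code.

```go
package main

import "fmt"

// Pair is one (center, context) training example.
// Hypothetical type, used only for this illustration.
type Pair struct {
	Center  string
	Context string
}

// SkipGramPairs pairs every token with each neighbor that lies
// at most `window` positions to its left or right.
func SkipGramPairs(tokens []string, window int) []Pair {
	var pairs []Pair
	for i, center := range tokens {
		// Clamp the window to the bounds of the token slice.
		lo := i - window
		if lo < 0 {
			lo = 0
		}
		hi := i + window
		if hi > len(tokens)-1 {
			hi = len(tokens) - 1
		}
		for j := lo; j <= hi; j++ {
			if j == i {
				continue // a word is not its own context
			}
			pairs = append(pairs, Pair{Center: center, Context: tokens[j]})
		}
	}
	return pairs
}

func main() {
	tokens := []string{"the", "quick", "brown", "fox"}
	for _, p := range SkipGramPairs(tokens, 1) {
		fmt.Printf("%s -> %s\n", p.Center, p.Context)
	}
}
```

A larger window yields more pairs per word and tends to capture broader topical context, at the cost of more training work.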
By default, the training is set up for English. To change the language, modify the following:
- Replace GetEnglishDictionary() with a function that retrieves a dictionary for the target language.
- Modify the DownloadBook() function to download books in the desired language by changing the language English to the name of the target language, written in English.
Currently, DownloadBook() uses Project Gutenberg, a public domain library. Ensure that the chosen language has sufficient resources available in this library.
This project relies on the following external dependencies:
- github.com/joho/godotenv – For loading environment variables from a .env file.
- github.com/lib/pq – PostgreSQL driver for Golang.
To install these dependencies, run:
go get github.com/joho/godotenv github.com/lib/pq

The program stores word vectors in a PostgreSQL database due to size limitations in Golang's gob binary encoding. Create the table with:
CREATE TABLE embeddings (
word TEXT PRIMARY KEY,
vector DOUBLE PRECISION[]
);
CREATE UNIQUE INDEX embeddings_word_idx ON embeddings(word);

Create a .env file to store database credentials:
DATABASE_PORT=
DATABASE_USER=
DATABASE_PASSWORD=
DATABASE_HOST=
DATABASE_NAME=

To compile the program on Windows and run it on Linux, set the following environment variables before building:
$env:GOOS="linux"
$env:GOARCH="amd64"

Then, run the build command as usual.
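As a rough illustration of how the DATABASE_* settings from the .env file might be combined, the sketch below builds a key/value connection string of the kind lib/pq accepts. The function name `BuildDSN`, the injected lookup function, and `sslmode=disable` are assumptions made for this example; the project's own connection code may differ.

```go
package main

import "fmt"

// BuildDSN assembles a key/value PostgreSQL connection string from the
// DATABASE_* variables. The lookup function is injected so that the
// example runs without a real environment; in a program, os.Getenv
// could be passed after godotenv.Load() has read the .env file.
// Assumption: sslmode=disable, which suits a local database only.
func BuildDSN(getenv func(string) string) string {
	return fmt.Sprintf(
		"host=%s port=%s user=%s password=%s dbname=%s sslmode=disable",
		getenv("DATABASE_HOST"),
		getenv("DATABASE_PORT"),
		getenv("DATABASE_USER"),
		getenv("DATABASE_PASSWORD"),
		getenv("DATABASE_NAME"),
	)
}

func main() {
	// Placeholder values standing in for the contents of a .env file.
	env := map[string]string{
		"DATABASE_HOST":     "localhost",
		"DATABASE_PORT":     "5432",
		"DATABASE_USER":     "example_user",
		"DATABASE_PASSWORD": "example_password",
		"DATABASE_NAME":     "example_db",
	}
	dsn := BuildDSN(func(key string) string { return env[key] })
	fmt.Println(dsn)
}
```

The resulting string could then be passed to sql.Open("postgres", dsn) with github.com/lib/pq imported for its driver registration. Values containing spaces or quotes would need escaping, which this sketch does not handle.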