This repository holds the code for the paper:
- Xin Du and Kumiko Tanaka-Ishii. "Semantic Field of Words Represented as Non-Linear Functions", NeurIPS 2022.
We proposed a new word representation in a functional space, rather than a vector
space, called FIeld REpresentation (FIRE). Each word $w$ is represented as a pair
of a function $f$ and a measure $\mu$, i.e., $w \mapsto (f, \mu)$, where $f$
describes the word's semantic field and $\mu$ specifies the word's locations,
at which the fields of other words are evaluated.
Compared with previous word representation methods, FIRE represents nonlinear
word polysemy while preserving a linear structure for additive semantic compositionality.
Polysemy is represented by the multimodality of a word's semantic field:
each mode corresponds to one sense of the word.
The similarity between two sentences $\Gamma_1$ and $\Gamma_2$ is computed as

$$\mathrm{sim}(\Gamma_1, \Gamma_2) = \int f_{\Gamma_1}\,\mathrm{d}\mu_{\Gamma_2} + \int f_{\Gamma_2}\,\mathrm{d}\mu_{\Gamma_1},$$

where $f_{\Gamma} = \sum_{w \in \Gamma} f_w$ and $\mu_{\Gamma} = \sum_{w \in \Gamma} \mu_w$ are the sums of the word functions and of the word measures in sentence $\Gamma$.
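Because the integral is linear in both the function and the measure, this sentence similarity decomposes into a sum of word-level similarities (a one-step derivation under the additive composition above, with $\mathrm{sim}(w_1, w_2) = \int f_{w_1}\,\mathrm{d}\mu_{w_2} + \int f_{w_2}\,\mathrm{d}\mu_{w_1}$, the quantity computed by the code examples later in this README):

$$\mathrm{sim}(\Gamma_1, \Gamma_2) = \sum_{w_1 \in \Gamma_1} \sum_{w_2 \in \Gamma_2} \left( \int f_{w_1}\,\mathrm{d}\mu_{w_2} + \int f_{w_2}\,\mathrm{d}\mu_{w_1} \right) = \sum_{w_1 \in \Gamma_1} \sum_{w_2 \in \Gamma_2} \mathrm{sim}(w_1, w_2).$$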
Figure (left): words that are frequent and similar to the word *bank*,
visualized in the semantic field of *bank*; the two sense clusters of *bank*
are naturally separated with FIRE. Figure (right): the overlapped semantic fields of
*river* and *financial*, which together resemble the field of *bank* in the figure above, indicating FIRE's property of compositionality.
Requirements:

- Python >= 3.7
- Packages:

  ```bash
  # With CUDA 12 or later
  $ pip install -r requirements.txt
  # With CUDA 11
  $ pip install -r requirements-cu11.txt
  # With CUDA 10
  $ pip install -r requirements-cu10.txt
  ```

- NLTK: you may need to download the NLTK `punkt` package. To do that, run python and execute the following:

  ```python
  >>> import nltk
  >>> nltk.download('punkt')
  ```

- If you are using Windows or macOS, you need to install the Rust toolchain (`cargo`) in advance, which is used to compile the `corpusit` package from source.
We provide the following pre-trained FIRE models:
- $D=2, L=4, K=1$ (23 parameters per word)
- $D=2, L=4, K=10$ (50 parameters per word)
- $D=2, L=8, K=20$ (100 parameters per word)

These settings correspond to the `d*_l*_k*` tags in the checkpoint names (e.g., `wacky_mlplanardiv_d2_l4_k10`).
The saved models can be reloaded as follows:

```python
import firelang

model = firelang.FireWord.from_pretrained('checkpoints/v1.1/wacky_mlplanardiv_d2_l4_k10')
```
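Once reloaded, the model can be queried directly; for example, the following minimal sketch compares *bank* against two disambiguating context words (the slicing and `@` similarity interface is detailed in the parallelization discussion below; the word choices here are just examples):

```python
# model[...] slices the stacked word representation (see below);
# `@` computes the cross-similarity matrix between the two word lists.
sim = model[["bank"]] @ model[["river", "financial"]]  # shape: (1, 2)
```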
Execute the benchmarking script as follows:

```bash
$ bash scripts/benchmark/2_run_benchmark.sh
```

We integrated WandB functionality into the training program for experiment management and visualization. By default, you therefore need the following three steps to enable it:
- Create `wandb_config.py` from the template `wandb_config.template.py`.
- Register a WandB account.
- Fill in your username (`WANDB_ENTITY`) and token (`WANDB_API_KEY`) in `wandb_config.py`.
If you do not plan to use WandB, delete the argument `--use_wandb`
in `scripts/text8/4_train.sh` and `scripts/wacky/4_train.sh`.
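The resulting `wandb_config.py` presumably looks like the following (the variable names come from the steps above; the values shown are placeholders):

```python
# wandb_config.py -- created from wandb_config.template.py
WANDB_ENTITY = "your-wandb-username"   # your WandB account name
WANDB_API_KEY = "xxxxxxxxxxxxxxxx"     # your WandB API token
```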
We provide scripts in `/scripts/text8/` to train a FIRE model on the text8 dataset.
Text8 is smaller (~100MB) than the WaCky corpus used in the paper and is publicly available.
```bash
# download the text8 corpus
$ bash scripts/text8/1_download_text8.sh

# tokenize the corpus with the NLTK tokenizer
$ bash scripts/text8/2_tokenize.sh

# build a vocabulary from the tokenized corpus
$ bash scripts/text8/3_build_vocab.sh

# training from scratch
$ bash scripts/text8/4_train.sh
# This takes 2-3 hours, so it is recommended to run the process in the background.
# For example:
# $ CUDA_VISIBLE_DEVICES=0 nohup bash scripts/text8/4_train.sh > log.train.text8.log 2>&1 &
```

The training process is carried out with the SkipGram method.
For fast sampling from the tokenized corpus in the SkipGram manner, we use
another Python package, `corpusit`,
which is written in Rust (with Python bindings via PyO3).
On Windows or macOS, installing `corpusit` with pip compiles the
package from source; in this case, you need to have
`cargo` installed in advance.
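For illustration only, the following plain-Python sketch shows the kind of (center, context, label) samples that SkipGram training consumes; it is a conceptual stand-in, not `corpusit`'s actual API:

```python
import random

def skipgram_samples(tokens, window=5, n_negatives=5):
    """Yield (center, context, label) triples: label 1 for words observed
    within the window around the center word, 0 for random negatives."""
    vocab = list(set(tokens))
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            yield center, tokens[j], 1                     # positive pair
            for _ in range(n_negatives):
                yield center, random.choice(vocab), 0      # negative pair
```

Real implementations typically draw negatives from a smoothed unigram distribution rather than uniformly; `corpusit` implements the sampling in Rust for speed.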
The WaCKy corpus is a concatenation of two corpora ukWaC and WaCkypedia_EN.
Both are provided at https://wacky.sslmit.unibo.it/doku.php?id=download via request.
After you get the two corpora, put the concatenated file at
`/data/corpus/wacky/wacky.txt`. Then, you can run the scripts under `/scripts/wacky/`
(for tokenization, vocabulary construction, and training) to train a
FIRE model on the concatenated corpus. The process takes 10-20 hours depending on the hardware.
A challenge in implementing FIRE is to parallelize
the evaluation of the (neural-network) functions of different words.
The usual way of using a neural network $\text{NN}$ is to process one data batch at a time,
that is, parallelization over data: the same $\text{NN}$ is applied to $x_1, x_2, \ldots$
In FIRE-based language models, we instead require parallelization over both the neural networks and the data. The desired behaviors include the two modes below (a toy PyTorch sketch of both modes follows the list):
- paired mode: outputs a vector.

  $\text{NN}_1(x_1), \text{NN}_2(x_2), \text{NN}_3(x_3), \ldots$

  In analogy to element-wise multiplication $\text{NN} * x$, where $\text{NN} = [\text{NN}_1, \text{NN}_2, \cdots]^\text{T}$ and $x = [x_1, x_2, \cdots]^\text{T}$.

- cross mode: outputs a matrix.

  $\text{NN}_1(x_1), \text{NN}_1(x_2), \text{NN}_1(x_3), \ldots$
  $\text{NN}_2(x_1), \text{NN}_2(x_2), \text{NN}_2(x_3), \ldots$
  $\text{NN}_3(x_1), \text{NN}_3(x_2), \text{NN}_3(x_3), \ldots$
  $\cdots$

  In analogy to matrix multiplication: $\text{NN}$ @ $x$.
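As promised above, here is a minimal PyTorch sketch of the two modes for a toy stack of linear functions (illustrative only; FIRE's actual stacked functions are deeper nonlinear networks, and all names here are made up):

```python
import torch

n, D = 3, 2
# A toy stack of n linear functions f_i(x) = W_i @ x + b_i, stored batched.
W = torch.randn(n, D, D)
b = torch.randn(n, D)
x = torch.randn(n, D)  # one input point per function

# paired mode: evaluate f_i on x_i only -> one output per function, (n, D)
paired = torch.einsum('nij,nj->ni', W, x) + b

# cross mode: evaluate every f_i on every x_j -> (n, n, D)
cross = torch.einsum('nij,mj->nmi', W, x) + b[:, None, :]
```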
We call such a stack of parallelized functions a stacked function.
The parameters of all words are stored together in one stacked function, so slicing is required to extract the parameters of a subset of
words and recombine them into a new, smaller stacked function.
For word-vector representations, slicing is natively supported
for the embedding matrix $V$, whose rows are the word vectors:
- Slicing:

  ```python
  vecs1 = V[[id_apple, id_pear, ...]]   # (n1, D)
  vecs2 = V[[id_iphone, id_fruit, ...]] # (n2, D)
  ```

- Computation of paired similarity (where n1 == n2 must hold):

  ```python
  sim = (vecs1 * vecs2).sum(-1)  # (n1,)
  ```

- Computation of cross similarity (n1 and n2 can be different):

  ```python
  sim = vecs1 @ vecs2.T  # (n1, n2)
  ```
In this repository, we provide an analogous implementation for parallelizing neural networks.
In FIRE, each neural network (a function or a measure) is treated like a vector.
Multiple neural networks are stacked in the way vectors are stacked into a matrix.
In FIRE, slicing and similarity computation are done analogously to the vectorial case:
- Slicing:

  ```python
  x1 = model[["apple", "pear", ...]]   # FIRETensor: (n1, D)
  x2 = model[["iphone", "fruit", ...]] # FIRETensor: (n2, D)
  ```

- Computation of paired similarity (where n1 == n2 must hold):

  ```python
  sim = x2.measures.integral(x1.funcs)        # (n1,)
  sim = sim + x1.measures.integral(x2.funcs)  # (n2,)
  ```

- Computation of cross similarity (n1 and n2 can be different):

  ```python
  sim = x2.measures.integral(x1.funcs, cross=True)          # (n1, n2)
  sim = sim + x1.measures.integral(x2.funcs, cross=True).T  # (n2, n1) -> transposed -> (n1, n2)
  ```
In addition to the way above, where `integral` must be explicitly invoked,
a friendlier operator interface is also provided:

```python
# paired similarity: (n1,)
sim = x1.funcs * x2.measures + x1.measures * x2.funcs

# cross similarity: (n1, n2)
sim = x1.funcs @ x2.measures + x1.measures @ x2.funcs
```

Furthermore, the two steps (slicing and similarity computation) can be done in one line:
```python
sim = model[["apple", "pear", "melon"]] @ model[["iphone", "fruit"]]  # (3, 2)
```

For the functions in a FIRE, we implemented arithmetic operators to make the (stacked) functions look more like a vector.
For example, the regularized similarity score defined by

$$\mathrm{sim}_{\mathrm{reg}}(w_1, w_2) = \int f_1\,\mathrm{d}\mu_2 + \int f_2\,\mathrm{d}\mu_1 - \int f_1\,\mathrm{d}\mu_1 - \int f_2\,\mathrm{d}\mu_2,$$

where $(f_i, \mu_i)$ is the (function, measure) pair of word $w_i$, is computed by:
```python
x1 = model[["apple", "pear"]]
x2 = model[["fruit", "iphone"]]
sim_reg = x2.measures.integral(x1.funcs) + x1.measures.integral(x2.funcs) \
        - x1.measures.integral(x1.funcs) - x2.measures.integral(x2.funcs)
```

or equivalently:

```python
sim_reg = (x2.funcs - x1.funcs) * x1.measures + (x1.funcs - x2.funcs) * x2.measures
```

where `x2.funcs - x1.funcs` produces a new stacked functional.
Please cite the following paper:

```bibtex
@inproceedings{du2022semantic,
  title={Semantic Field of Words Represented as Non-Linear Functions},
  author={Xin Du and Kumiko Tanaka-Ishii},
  booktitle={Thirty-Sixth Conference on Neural Information Processing Systems},
  year={2022},
}
```
