This project accompanies the paper "HEAT: Heterogeneous Entity-level Attention for Entity Typing".
This project provides an attention-based neural network model for the entity-level typing task. The model processes heterogeneous data about each entity, including its name, relevant paragraphs, and relevant image features, and learns representations from them.
While Mention-level Entity Typing infers the types of a single mention that are supported by its textual context, Entity-level Typing infers the types of an entity by considering all of the data associated with it.
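The sketch below is a minimal illustration, in plain PyTorch, of the two attention ideas described above: token-level attention that pools a paragraph's token embeddings into a single vector, and cross-modal attention that fuses the name, paragraph, and image representations into one entity representation for typing. It is not the architecture from the paper; the module names, dimensions, and simple additive scoring are illustrative assumptions.

```python
# Minimal sketch (not the authors' exact architecture) of token-level attention
# and cross-modal attention for entity-level typing. All shapes are assumptions.
import torch
import torch.nn as nn


class TokenAttentionPool(nn.Module):
    """Pools a sequence of token embeddings into one vector with learned attention."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, hidden_dim)
        weights = torch.softmax(self.score(tokens), dim=1)  # (batch, seq_len, 1)
        return (weights * tokens).sum(dim=1)                # (batch, hidden_dim)


class CrossModalAttention(nn.Module):
    """Weighs the name, paragraph, and image vectors and fuses them into one entity vector."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, name_vec, para_vec, img_vec) -> torch.Tensor:
        modalities = torch.stack([name_vec, para_vec, img_vec], dim=1)  # (batch, 3, hidden_dim)
        weights = torch.softmax(self.score(modalities), dim=1)          # (batch, 3, 1)
        return (weights * modalities).sum(dim=1)                        # (batch, hidden_dim)


class EntityTypingSketch(nn.Module):
    """Toy entity-level typing head over pre-computed modality features."""

    def __init__(self, hidden_dim: int, num_types: int):
        super().__init__()
        self.token_attention = TokenAttentionPool(hidden_dim)
        self.cross_modal_attention = CrossModalAttention(hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_types)

    def forward(self, name_vec, para_tokens, img_vec):
        para_vec = self.token_attention(para_tokens)
        entity_vec = self.cross_modal_attention(name_vec, para_vec, img_vec)
        return self.classifier(entity_vec)  # one logit per candidate type


if __name__ == "__main__":
    model = EntityTypingSketch(hidden_dim=256, num_types=100)
    name = torch.randn(4, 256)             # encoded entity names
    paragraphs = torch.randn(4, 64, 256)   # encoded paragraph tokens
    images = torch.randn(4, 256)           # image features
    print(model(name, paragraphs, images).shape)  # torch.Size([4, 100])
```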
- PyTorch version >= 1.7
- Python version >= 3.7
- transformers >= 3.3
- a GPU with 11 GB of memory when running with the BiLSTM encoder
- a GPU with 32 GB of memory when running with the BERT encoder
Run `python init.py` to download the datasets. Run `bash run.sh` to train the model on each dataset.
Command-line options:
- `-task` specifies the task id
- `-dataset` specifies the dataset name
- `-text_encoder` specifies the text encoder, one of {'bert', 'bert_freeze', 'lstm'}
- `-remove_name`, `-remove_para`, and `-remove_img` make the model not use the corresponding modules
- `-without_token_attention` and `-without_cross_modal_attention` make the model not use the corresponding attention layers
- `-seed` specifies the random seed id
- `-consistency` makes the model additionally perform consistency training
- `-labeled_num` limits the number of labeled samples
- `-cpu` runs on the CPU
- Other hyper-parameters are set in `config.py`
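As a rough usage sketch, the options above might be combined as follows; the entry-point script name (`main.py`) and the option values shown here are assumptions, so check `run.sh` for the commands actually used:

```bash
# Hypothetical invocations; see run.sh for the real entry point and option values.
python main.py -task 0 -dataset typenet -text_encoder bert -seed 0
# Ablation run: drop the image module and cross-modal attention, and run on the CPU.
python main.py -task 0 -dataset typenet -text_encoder lstm \
    -remove_img -without_cross_modal_attention -cpu
```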
Four public datasets have been processed for the task of entity-level typing with heterogeneous data:
- TypeNet: aligns Freebase types to noun synsets from the WordNet hierarchy and eliminates types that are either very specific or very rare. The original dataset consists of more than two million mentions with names and contexts. Our processed TypeNet consists of more than half a million entities with names and relevant paragraphs. download
- MedMentions: contains annotations for a wide diversity of entities in the biomedical domain. The original dataset consists of nearly 250 thousand mentions. Our processed MedMentions consists of more than 50 thousand entities with names and relevant paragraphs. download
- Flowers: (Oxford Flowers-102) provides text descriptions for each flower image. The objective is to classify the fine-grained flower name of the sample. We keep the original splits and use the cross-entropy loss since each sample has only one positive type. download
- Birds: (Caltech-UCSD Birds) provides text descriptions for each bird image. The objective is to classify the fine-grained bird name of the sample. We also keep the original splits and use the cross-entropy loss since each sample has only one positive type. download
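Since Flowers and Birds have exactly one positive type per sample while entity typing datasets generally allow several types per entity, the loss differs by dataset. The snippet below is an illustrative sketch of that distinction, not this repository's code; the use of a binary cross-entropy multi-label loss for TypeNet and MedMentions is an assumption here.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 10)  # (batch, num_types) scores from the model

# Flowers / Birds: exactly one positive type per sample -> cross-entropy loss.
single_label_targets = torch.tensor([3, 7, 0, 9])
single_label_loss = nn.CrossEntropyLoss()(logits, single_label_targets)

# TypeNet / MedMentions: an entity may have several types -> a multi-label loss
# such as binary cross-entropy over per-type logits (an assumption in this sketch).
multi_label_targets = torch.zeros(4, 10)
multi_label_targets[0, [3, 5]] = 1.0  # e.g. entity 0 has types 3 and 5
multi_label_loss = nn.BCEWithLogitsLoss()(logits, multi_label_targets)

print(single_label_loss.item(), multi_label_loss.item())
```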
Files in each dataset:
- `data.pkl` contains the name, raw paragraphs, and each typing task's labels for an entity
- `data_txt.pkl` is generated by `data_loader.py` and contains the tokenized indexes of the name and paragraphs
- `data_img.pkl` (optional) contains each entity's relevant image features
- `split.pkl` (optional) contains the split information of the train/valid/test sets by entity name
- `types.json` contains the name of each type
- `hierarchies.json` contains the taxonomy between types
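For reference, the files above can be inspected with standard `pickle` and `json` calls, as in the sketch below; the dataset directory path is hypothetical, and the internal structure of each pickle (keys, field names) is not assumed beyond what the list above states.

```python
import json
import pickle

dataset_dir = "data/typenet"  # hypothetical path; use the directory created by init.py

with open(f"{dataset_dir}/data.pkl", "rb") as f:
    data = pickle.load(f)  # names, raw paragraphs, and labels per entity

with open(f"{dataset_dir}/types.json") as f:
    types = json.load(f)  # name of each type

with open(f"{dataset_dir}/hierarchies.json") as f:
    hierarchies = json.load(f)  # taxonomy between types

# Optional files: image features and the train/valid/test split by entity name.
try:
    with open(f"{dataset_dir}/data_img.pkl", "rb") as f:
        image_features = pickle.load(f)
    with open(f"{dataset_dir}/split.pkl", "rb") as f:
        split = pickle.load(f)
except FileNotFoundError:
    image_features, split = None, None

print(type(data), len(types))
```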