Skip to content

Conversation

@fyang3
Copy link
Contributor

@fyang3 fyang3 commented Apr 15, 2021

Functionalities:

  • automatic character detection with NER
  • auto gender detection (currently binary) using SVM classifier
  • simple named-entity disambiguation pipeline
  • auto character attributes analysis

kenalba and others added 23 commits October 23, 2020 02:10
Added a functional gender_pos.py whose functions are modeled after (and, honestly, supplant) gender_adjective.py. Also added four lists to common.py, of the Treebank tags for each part of speech. We may eventually want to allow the user to define their own part of speech list, like we do with gender.
Added functionality to support four parts of speech -adjectives, adverbs, proper nouns, and verbs - isntead of just adjectives.
Fixing some linting.
Importing some elements of common directly.
Adding documentation. I also added a few lines to handle the situation where metadata points to a text file that doesn't exist. In that case, it returns an empty string.
added character.py and started the core functions as
_init_
get_overall_popularity
get_stage_popularity
helper functions for stage_popularity
[started]: get char adjectives
get_char_list will return a sorted list of char names and popularity for a given document
[to-do]: more robust, updatable by users later
…cter

- refined get_char_list through adding creating object functionality
- integrated get_char_gender with a self-trained classifier in Character class
… ML classifier in _init_

modified init so auto-detect gender if not provided
similar to gender_pos, character_pos gets the pos words for a given character
Trying to isolate the three bugs we're working on!
Making this function, doing some minor linting.
Warning: unstable push, just sharing with Funing!

Added nickname list checking, similarity_index.py as a framework for other name disambiguation.
- refined disambiguation to filter duplicates
- create char objects filter duplicates on another level for char clusters
- gender detection module could detect based on honorifics
@fyang3 fyang3 requested a review from MBJean April 15, 2021 16:04
Copy link
Contributor

@MBJean MBJean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Funing! There's a lot of great work here, but it's a significant amount of lines of code change to move through line-by-line. I'd like to offer a couple of suggestions here for how we can bring this into the main branch.

Regarding your Character class

There's three things I'd be interested in seeing regarding this class.

  1. My initial reaction on seeing document as an instance attribute of Character is that that feels like it's the wrong direction for that relationship and that it's tangling too much functionality up into the Character instance. I'm imagining the tests you'd have to write, and you'd have to initialize a new Document for each test, which feels like a good sign that the Character class does not separate its concerns enough. I think an approach that better separates our concerns might be to remove the document from Character and move the instance methods that depend on it (get_overall_popularity et al.) into a character analysis module alongside the functions defined in character_pos. That is: the Character class deals only with storing names, nicknames, etc., and the associated Gender and leaves any Document-based analysis to the appropriate analysis module. A secondary effect of this change would be to allow us to evaluate the same Character across multiple Document instances (as you gesture toward with your comment about making .document an array).
  2. Following on that line of thought: should the Document class have a .characters attribute? Did you consider having .get_char_list create new Character instances?
  3. If the above seems like a reasonable approach to you, I'd like to see the Character class alone pulled out into a separate PR. That'll help us focus on just one piece of self-contained code at a time.

Do those suggestions seem reasonable?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants