New Feature Enhancement for Character Class #155
base: main
Added a functional gender_pos.py whose functions are modeled after (and, honestly, supplant) gender_adjective.py. Also added four lists to common.py, of the Treebank tags for each part of speech. We may eventually want to allow the user to define their own part of speech list, like we do with gender.
Added functionality to support four parts of speech (adjectives, adverbs, proper nouns, and verbs) instead of just adjectives.
Fixing some linting.
Importing some elements of common directly.
Adding documentation. I also added a few lines to handle the situation where metadata points to a text file that doesn't exist. In that case, it returns an empty string.
added character.py and started the core functions: _init_, get_overall_popularity, get_stage_popularity, and helper functions for stage_popularity. [started]: get char adjectives
get_char_list will return a sorted list of char names and popularity for a given document [to-do]: more robust, updatable by users later
…cter - refined get_char_list through adding creating object functionality - integrated get_char_gender with a self-trained classifier in Character class
… ML classifier in _init_ modified init so auto-detect gender if not provided
similar to gender_pos, character_pos gets the pos words for a given character
Making this function, doing some minor linting.
…lysis into character_class
Warning: unstable push, just sharing with Funing! Added nickname list checking, similarity_index.py as a framework for other name disambiguation.
- refined disambiguation to filter duplicates - create char objects filter duplicates on another level for char clusters
- gender detection module could detect based on honorifics
refactored the code
Hi Funing! There's a lot of great work here, but it's a significant number of changed lines to move through line by line. I'd like to offer a couple of suggestions here for how we can bring this into the main branch.
Regarding your Character class
There are three things I'd be interested in seeing regarding this class.
- My initial reaction on seeing `document` as an instance attribute of `Character` is that it feels like the wrong direction for that relationship, and that it's tangling too much functionality up into the `Character` instance. I'm imagining the tests you'd have to write, and you'd have to initialize a new `Document` for each test, which feels like a good sign that the `Character` class does not separate its concerns enough. I think an approach that better separates our concerns might be to remove the `document` from `Character` and move the instance methods that depend on it (`get_overall_popularity` et al.) into a `character` `analysis` module alongside the functions defined in `character_pos`. That is: the `Character` class deals only with storing names, nicknames, etc., and the associated `Gender`, and leaves any `Document`-based analysis to the appropriate `analysis` module. A secondary effect of this change would be to allow us to evaluate the same `Character` across multiple `Document` instances (as you gesture toward with your comment about making `.document` an array).
- Following on that line of thought: should the `Document` class have a `.characters` attribute? Did you consider having `.get_char_list` create new `Character` instances?
- If the above seems like a reasonable approach to you, I'd like to see the `Character` class alone pulled out into a separate PR. That'll help us focus on just one piece of self-contained code at a time.
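To make the first suggestion a bit more concrete, here is a rough sketch of the separation I have in mind. It is only an illustration under assumptions: the `Document` below is a minimal stand-in holding raw text rather than the project's real class, `gender` is a plain string standing in for the `Gender` object, and the signature and counting logic of `get_overall_popularity` are hypothetical, not what is currently in `character.py`.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Character:
    """Stores identity only: names, nicknames, and an associated gender."""
    name: str
    nicknames: Tuple[str, ...] = ()
    gender: Optional[str] = None  # assumption: stand-in for the project's Gender

    def all_names(self) -> Tuple[str, ...]:
        return (self.name, *self.nicknames)


@dataclass
class Document:
    """Assumption: minimal stand-in for the real Document; holds raw text only."""
    text: str


def get_overall_popularity(character: Character, document: Document) -> int:
    """Hypothetical analysis function: count whole-word mentions of any of the
    character's names in the document. Naive on purpose; names that contain
    one another would be double counted."""
    total = 0
    for name in character.all_names():
        pattern = r"\b" + re.escape(name) + r"\b"
        total += len(re.findall(pattern, document.text))
    return total


# The same Character can be evaluated against several Documents.
elizabeth = Character("Elizabeth", nicknames=("Lizzy",))
chapter_one = Document("Elizabeth smiled. Lizzy, said her father, come here.")
chapter_two = Document("Mr. Darcy wrote to Elizabeth.")
print(get_overall_popularity(elizabeth, chapter_one))
print(get_overall_popularity(elizabeth, chapter_two))
```

With this shape, a test for `Character` needs no `Document` at all, and a test for the analysis function can pass in whatever small document it likes.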
Do those suggestions seem reasonable?
Functionalities: