New Feature Enhancement for Character Class #155

fyang3 · 2021-04-15T16:04:29Z

Functionalities:

automatic character detection with NER
auto gender detection (currently binary) using SVM classifier
simple named-entity disambiguation pipeline
auto character attributes analysis

Added a functional gender_pos.py whose functions are modeled after (and, honestly, supplant) gender_adjective.py. Also added four lists to common.py, of the Treebank tags for each part of speech. We may eventually want to allow the user to define their own part of speech list, like we do with gender.

Added functionality to support four parts of speech -adjectives, adverbs, proper nouns, and verbs - isntead of just adjectives.

Fixing some linting.

Importing some elements of common directly.

Adding documentation. I also added a few lines to handle the situation where metadata points to a text file that doesn't exist. In that case, it returns an empty string.

added character.py and started the core functions as _init_ get_overall_popularity get_stage_popularity helper functions for stage_popularity [started]: get char adjectives

get_char_list will return a sorted list of char names and popularity for a given document [to-do]: more robust, updatable by users later

…cter - refined get_char_list through adding creating object functionality - integrated get_char_gender with a self-trained classifier in Character class

… ML classifier in _init_ modified init so auto-detect gender if not provided

similar to gender_pos, character_pos gets the pos words for a given character

Trying to isolate the three bugs we're working on!

Making this function, doing some minor linting.

…lysis into character_class

Warning: unstable push, just sharing with Funing! Added nickname list checking, similarity_index.py as a framework for other name disambiguation.

- refined disambiguation to filter duplicates - create char objects filter duplicates on another level for char clusters

- gender detection module could detect based on honorifics

refactored the code

MBJean

Hi Funing! There's a lot of great work here, but it's a significant amount of lines of code change to move through line-by-line. I'd like to offer a couple of suggestions here for how we can bring this into the main branch.

Regarding your `Character` class

There's three things I'd be interested in seeing regarding this class.

My initial reaction on seeing document as an instance attribute of Character is that that feels like it's the wrong direction for that relationship and that it's tangling too much functionality up into the Character instance. I'm imagining the tests you'd have to write, and you'd have to initialize a new Document for each test, which feels like a good sign that the Character class does not separate its concerns enough. I think an approach that better separates our concerns might be to remove the document from Character and move the instance methods that depend on it (get_overall_popularity et al.) into a character analysis module alongside the functions defined in character_pos. That is: the Character class deals only with storing names, nicknames, etc., and the associated Gender and leaves any Document-based analysis to the appropriate analysis module. A secondary effect of this change would be to allow us to evaluate the same Character across multiple Document instances (as you gesture toward with your comment about making .document an array).
Following on that line of thought: should the Document class have a .characters attribute? Did you consider having .get_char_list create new Character instances?
If the above seems like a reasonable approach to you, I'd like to see the Character class alone pulled out into a separate PR. That'll help us focus on just one piece of self-contained code at a time.

Do those suggestions seem reasonable?

kenalba and others added 23 commits October 23, 2020 02:10

Merge branch 'master' into gender_pos

7b1abee

Adding gender_pos, refactoring gender_adjective.

a9b2fc5

Added functionality to support four parts of speech -adjectives, adverbs, proper nouns, and verbs - isntead of just adjectives.

Linting upgrades!

0b36711

Fixing some linting.

More, better linting.

fdee52b

Changing how importing stuff works

262cb8b

Importing some elements of common directly.

Added difference_pos_freq, pos_analysis, and display functions.

042e64f

Adding documentation, handling absent text files.

15d20eb

Adding documentation. I also added a few lines to handle the situation where metadata points to a text file that doesn't exist. In that case, it returns an empty string.

initialized character.py

441017c

added character.py and started the core functions as _init_ get_overall_popularity get_stage_popularity helper functions for stage_popularity [started]: get char adjectives

added get_char_list function in document.py

719cf2e

get_char_list will return a sorted list of char names and popularity for a given document [to-do]: more robust, updatable by users later

refined get_char_list and integrated gender identification into Chara…

eb4119a

…cter - refined get_char_list through adding creating object functionality - integrated get_char_gender with a self-trained classifier in Character class

refactored for seamless integration of gender identification based on…

93fe95b

… ML classifier in _init_ modified init so auto-detect gender if not provided

created first pass for character_pos

b308a9c

similar to gender_pos, character_pos gets the pos words for a given character

tuned character.py to make it include gender obj instead of strings

d9e76bf

Bugbash 1

e046ef1

Trying to isolate the three bugs we're working on!

Bug bashing!

00c6ce2

Making this function, doing some minor linting.

fixed unable to load model error and get char list

1656655

Merge branch 'character_class' of https://github.com/dhmit/gender_ana…

9469dcf

…lysis into character_class

Adding similarity index stuff

e94481c

Warning: unstable push, just sharing with Funing! Added nickname list checking, similarity_index.py as a framework for other name disambiguation.

revised char_disambiguation and create_char_objects

9f7a6f2

- refined disambiguation to filter duplicates - create char objects filter duplicates on another level for char clusters

fixed make_char_name bugs and refactor

1fbf6ec

expanded honorifics list and robust gender detection functionality

949b1bb

- gender detection module could detect based on honorifics

added honorifics for a more comprehensive coverage

f049986

refactored the code

fyang3 requested a review from MBJean April 15, 2021 16:04

MBJean reviewed May 6, 2021

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

New Feature Enhancement for Character Class #155

New Feature Enhancement for Character Class #155

Uh oh!

fyang3 commented Apr 15, 2021

Uh oh!

MBJean left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

New Feature Enhancement for Character Class #155

Are you sure you want to change the base?

New Feature Enhancement for Character Class #155

Uh oh!

Conversation

fyang3 commented Apr 15, 2021

Uh oh!

MBJean left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Regarding your Character class

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

MBJean left a comment •

edited

Loading

Regarding your `Character` class