Some decisions made today:
- The "primary key" will be the timestamp. Even though it is not a fully unique ID, it best matches the most likely query (i.e. people are more likely to want to sort by date than by page of the PDF).
- Type A duplicate column: if a row's timestamp/sender combo appears in another textfile with a 100% match of timestamps/senders (a type A duplicate), this cell will hold the textfile name of the "preferred" iteration of the duplicate (see this issue for the process of generating preferred versions).
- Type B duplicate column: if a row's timestamp/sender combo appears in any other textfile (the match does not need to be total), this cell will hold the textfile name of the "preferred" iteration of the duplicate.
- Queries will be able to pull only the preferred version by suppressing any row in which the textfile name in the duplicate column (A or B) does not match the row's own textfile name.
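
A rough sketch of how the two duplicate columns and the suppression query could fit together, in plain Python. Everything named here is hypothetical: the field names (`timestamp`, `sender`, `textfile`, `dup_a`, `dup_b`), the helper functions, and the sample data. Three assumptions are made that the notes above do not settle: "100% match" is read as two textfiles having exactly the same set of timestamp/sender combos, a row with no duplicate gets an empty (`None`) cell and is kept by the query, and the "preferred" iteration is stood in for by `min` (alphabetically first textfile name), which is only a placeholder for the real process in the linked issue.

```python
from collections import defaultdict


def annotate_duplicates(rows, pick_preferred=min):
    """Return copies of rows with hypothetical 'dup_a' and 'dup_b' columns added.

    pick_preferred is a placeholder for the real "preferred version" process.
    """
    combos_by_file = defaultdict(set)  # textfile -> its (timestamp, sender) combos
    files_by_combo = defaultdict(set)  # combo -> textfiles containing it
    for r in rows:
        combo = (r["timestamp"], r["sender"])
        combos_by_file[r["textfile"]].add(combo)
        files_by_combo[combo].add(r["textfile"])

    # Type A (assumed reading): textfiles whose full combo sets are identical.
    files_by_signature = defaultdict(set)
    for textfile, combos in combos_by_file.items():
        files_by_signature[frozenset(combos)].add(textfile)

    annotated = []
    for r in rows:
        combo = (r["timestamp"], r["sender"])
        a_group = files_by_signature[frozenset(combos_by_file[r["textfile"]])]
        b_group = files_by_combo[combo]  # Type B: any textfile sharing this combo
        annotated.append({
            **r,
            "dup_a": pick_preferred(a_group) if len(a_group) > 1 else None,
            "dup_b": pick_preferred(b_group) if len(b_group) > 1 else None,
        })
    return annotated


def preferred_only(rows, column):
    """Suppress rows whose duplicate cell names a different textfile.

    Rows with an empty cell are kept (an assumption, not stated in the notes).
    """
    kept = [r for r in rows if r[column] is None or r[column] == r["textfile"]]
    return sorted(kept, key=lambda r: r["timestamp"])  # timestamp as sort key


# Made-up sample data: b.txt fully matches a.txt, c.txt overlaps on one combo.
rows = [
    {"timestamp": "2020-01-01T09:00", "sender": "alice", "textfile": "a.txt"},
    {"timestamp": "2020-01-01T09:05", "sender": "bob", "textfile": "a.txt"},
    {"timestamp": "2020-01-01T09:00", "sender": "alice", "textfile": "b.txt"},
    {"timestamp": "2020-01-01T09:05", "sender": "bob", "textfile": "b.txt"},
    {"timestamp": "2020-01-01T09:05", "sender": "bob", "textfile": "c.txt"},
    {"timestamp": "2020-01-01T09:10", "sender": "carol", "textfile": "c.txt"},
]
annotated = annotate_duplicates(rows)
type_a_view = preferred_only(annotated, "dup_a")
type_b_view = preferred_only(annotated, "dup_b")
```

With this sample, filtering on the type A column drops only the rows from `b.txt`, while filtering on the type B column also drops the overlapping row in `c.txt`, so the type B view is the stricter of the two.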