Hello,
While using this module, I noticed that for one revision the current_tokens list returned by the state's update method is missing some tokens.
Here is example code that reproduces the problem:
import requests
from pprint import pprint

import mwpersistence
import deltas
from mwreverts.defaults import RADIUS
from deltas.tokenizers.wikitext_split import wikitext_split

page_id = 2161298
rev_id = 480327915  # for testing purposes, process only this revision id

# wikitext_split is used; the default is text_split.
state = mwpersistence.DiffState(deltas.SegmentMatcher(tokenizer=wikitext_split),
                                revert_radius=RADIUS)

# get the text of the given revision
params = {'pageids': page_id, 'action': 'query', 'prop': 'revisions',
          'rvprop': 'content|ids|timestamp|sha1|comment|flags|user|userid',
          'rvlimit': 1, 'format': 'json', 'rvstartid': rev_id}
result = requests.get(url='https://en.wikipedia.org/w/api.php', params=params).json()
_, page = result['query']['pages'].popitem()

for rev in page.get('revisions', []):
    text = rev.get('*', '')
    text = text.lower()
    # process the revision
    current_tokens, tokens_added, tokens_removed = state.update(text, revision=rev_id)
    # tokenize the revision text directly to compare with the returned current_tokens
    tokens = wikitext_split.tokenize(text)
    print(len(current_tokens), len(tokens))
    # pprint(current_tokens)

When you run this code, you will see that the two token counts differ (3822 and 5563), and the last 5 tokens in 'current_tokens' are:
- Token('has', type='word', revisions=[480327915]),
- Token(' ', type='whitespace', revisions=[480327915]),
- Token('higher', type='word', revisions=[480327915]),
- Token(' ', type='whitespace', revisions=[480327915]),
- Token('mechanical', type='word', revisions=[480327915])
which are not the last tokens in that revision.
First, am I using these modules correctly? If so, why are those tokens missing?
I am working with Python 3.5.3 and installed all modules with pip.
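To narrow down where the tokens go missing, it may help to find the first position at which the two lists differ. Below is a minimal sketch of that check. The helper name first_divergence is hypothetical, the sample strings are made-up stand-ins rather than real revision content, and it assumes each token has already been reduced to its plain text (for example with str(token), which I have not verified against the Token class in mwpersistence).

```python
from itertools import zip_longest


def first_divergence(left, right):
    """Return the index of the first position where the two sequences
    differ, or None if they are identical (hypothetical helper)."""
    sentinel = object()  # marks "one sequence ended before the other"
    pairs = zip_longest(left, right, fillvalue=sentinel)
    for index, (a, b) in enumerate(pairs):
        if a is sentinel or b is sentinel or a != b:
            return index
    return None


# Made-up stand-in data, NOT the real revision text.
expected = ["has", " ", "higher", " ", "mechanical", " ", "strength"]
returned = ["has", " ", "higher", " ", "mechanical"]

index = first_divergence(returned, expected)
if index is None:
    print("token lists are identical")
else:
    # Show a few tokens of context around the divergence point.
    print("first difference at index", index)
    print("returned:", returned[max(0, index - 3):index + 3])
    print("expected:", expected[max(0, index - 3):index + 3])
```

In the script above, this would be called with the text of each entry in current_tokens and tokens. If the first difference is at the very end of current_tokens, the list is truncated; if it is somewhere in the middle, tokens are being dropped inside the text.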