missing tokens after state update #4

@bitnik

Description

Hello,

While using this module, I noticed that for one revision the current_tokens list returned by the state update() method is missing some tokens.

Here is example code that reproduces the problem:

import requests
from pprint import pprint
import mwpersistence
import deltas
from mwreverts.defaults import RADIUS
from deltas.tokenizers.wikitext_split import wikitext_split


page_id = 2161298
rev_id = 480327915  # for testing purposes, process only this revision id
# wikitext_split is used; the default tokenizer is text_split.
state = mwpersistence.DiffState(deltas.SegmentMatcher(tokenizer=wikitext_split), 
                                revert_radius=RADIUS)

# get text of given revision
params = {'pageids': page_id, 'action': 'query', 'prop': 'revisions',
          'rvprop': 'content|ids|timestamp|sha1|comment|flags|user|userid',
          'rvlimit': 1, 'format': 'json', 'rvstartid': rev_id}
result = requests.get(url='https://en.wikipedia.org/w/api.php', params=params).json()
_, page = result['query']['pages'].popitem()
for rev in page.get('revisions', []):
    text = rev.get('*', '')
    text = text.lower()
    # process revision
    current_tokens, tokens_added, tokens_removed = state.update(text, revision=rev_id)

    # split rev text to compare with returned current_tokens
    tokens = wikitext_split.tokenize(text)

    print(len(current_tokens), len(tokens))
    # pprint(current_tokens)

When you run this code, you will see that the numbers of tokens differ (3822 vs. 5563) and that the last 5 tokens in current_tokens are:

  • Token('has', type='word', revisions=[480327915]),
  • Token(' ', type='whitespace', revisions=[480327915]),
  • Token('higher', type='word', revisions=[480327915]),
  • Token(' ', type='whitespace', revisions=[480327915]),
  • Token('mechanical', type='word', revisions=[480327915])
These are not the last tokens in that revision.
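
To make the discrepancy easier to inspect, the tails of both lists can be compared directly. This is a minimal sketch continuing from the script above; it assumes each persisted Token exposes its text through a .text attribute, as the repr output suggests, so adjust the attribute name if your mwpersistence version differs.

# Compare the last tokens reported by the persistence state with the last
# tokens produced by running the tokenizer directly on the same text.
tail_persisted = [str(t.text) for t in current_tokens[-5:]]  # .text is an assumed attribute
tail_tokenized = [str(t) for t in tokens[-5:]]
print('persisted tail:', tail_persisted)
print('tokenized tail:', tail_tokenized)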

First, am I using these modules correctly? If so, why are those tokens missing?

I am working with Python 3.5.3 and installed all modules with pip.
