-
Notifications
You must be signed in to change notification settings - Fork 14
fix data harvesting and fuzzy matching #24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Updated Python version requirement to allow newer versions.
| match_data["match_similarity"] = similarity | ||
| match_data["match_variant"] = fuzzy_matched_variant | ||
| match_data["matching_string"] = cand | ||
| lookup_name = match_data.get("name", m) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes!
|
Hi @rfdougherty! Thanks so much for this pull request and I really appreciate the time you have put into it and your willingness to contribute. Please forgive my late reply. I just have a quick request, there are a lot of files changed (17 files), so it's a bit hard for me to review as this is the majority of the files in the project. I can see at a glance that some things have been removed, such as the call to curl if the user is on Windows - I am not sure if this is intentional or part of the PR. Would it be possible please to split it up into atomic PRs - if you are fixing multiple issues can you send them as separate PRs, ideally each one modifying only one or two files, and also remove things from the PR that don't need to be in there? Then I can review more easily. If not, I will take the time to review and try to merge as soon as I get some time, perhaps I will merge the files individually. I would like to get it merged as the changes look really valuable, especially if you have improved the data ingestion! We could always connect on a quick video call to go through the changes if that works? I'm free in the week beginning 29 December. |
|
Hi Thomas,
Thanks for the response! I did put more into the PR than I had intended, as
it included some changes I made that were specific to my use-case. I
noticed this after submitting and didn't know if the repo was being
maintained so hadn't bothered to fix it. I'll redo the PR with just the
generally useful changes and will break it apart if necessary. It may take
me a week or so to get time to do this.
cheers,
bob
…On Thu, Dec 18, 2025 at 4:10 AM Thomas Wood ***@***.***> wrote:
*woodthom2* left a comment
(fastdatascience/drug_named_entity_recognition#24)
<#24 (comment)>
Hi @rfdougherty <https://github.com/rfdougherty>! Thanks so much for this
pull request and I really appreciate the time you have put into it and your
willingness to contribute. Please forgive my late reply.
I just have a quick request, there are a lot of files changed (17 files),
so it's a bit hard for me to review as this is the majority of the files in
the project. I can see at a glance that some things have been removed, such
as the call to curl if the user is on Windows
<https://github.com/fastdatascience/drug_named_entity_recognition/pull/24/files#diff-e4d5f442dd795f7b17b0b0e962854b1a9ee54aade46f513337f5b4dc4f916eaf>
- I am not sure if this is intentional or part of the PR.
Would it be possible please to split it up into atomic PRs - if you are
fixing multiple issues can you send them as separate PRs, ideally each one
modifying only one or two files, and also remove things from the PR that
don't need to be in there? Then I can review more easily. If not, I will
take the time to review and try to merge as soon as I get some time,
perhaps I will merge the files individually.
I would like to get it merged as the changes look really valuable,
especially if you have improved the data ingestion!
We could always connect on a quick video call to go through the changes if
that works? I'm free in the week beginning 29 December.
—
Reply to this email directly, view it on GitHub
<#24 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGXPQSZZ3USEWMXQCPDH5T4CKKTLAVCNFSM6AAAAACL2LPE5SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTMNRZHE4DEMBSHE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***
com>
|
|
Thanks Bob. Yes I'd appreciate it. It's still being maintained and used (I
think we have a few thousand users judging by the Pypi stats) so anything
you can submit would be useful. If possible atomic PRs that change one or
two files each are easiest for me to review. But no rush at all!
…On Tue, 23 Dec 2025, 21:33 Bob Dougherty, ***@***.***> wrote:
*rfdougherty* left a comment
(fastdatascience/drug_named_entity_recognition#24)
<#24 (comment)>
Hi Thomas,
Thanks for the response! I did put more into the PR than I had intended,
as
it included some changes I made that were specific to my use-case. I
noticed this after submitting and didn't know if the repo was being
maintained so hadn't bothered to fix it. I'll redo the PR with just the
generally useful changes and will break it apart if necessary. It may take
me a week or so to get time to do this.
cheers,
bob
On Thu, Dec 18, 2025 at 4:10 AM Thomas Wood ***@***.***>
wrote:
> *woodthom2* left a comment
> (fastdatascience/drug_named_entity_recognition#24)
> <
#24 (comment)>
>
> Hi @rfdougherty <https://github.com/rfdougherty>! Thanks so much for
this
> pull request and I really appreciate the time you have put into it and
your
> willingness to contribute. Please forgive my late reply.
>
> I just have a quick request, there are a lot of files changed (17
files),
> so it's a bit hard for me to review as this is the majority of the files
in
> the project. I can see at a glance that some things have been removed,
such
> as the call to curl if the user is on Windows
> <
https://github.com/fastdatascience/drug_named_entity_recognition/pull/24/files#diff-e4d5f442dd795f7b17b0b0e962854b1a9ee54aade46f513337f5b4dc4f916eaf>
> - I am not sure if this is intentional or part of the PR.
>
> Would it be possible please to split it up into atomic PRs - if you are
> fixing multiple issues can you send them as separate PRs, ideally each
one
> modifying only one or two files, and also remove things from the PR that
> don't need to be in there? Then I can review more easily. If not, I will
> take the time to review and try to merge as soon as I get some time,
> perhaps I will merge the files individually.
>
> I would like to get it merged as the changes look really valuable,
> especially if you have improved the data ingestion!
>
> We could always connect on a quick video call to go through the changes
if
> that works? I'm free in the week beginning 29 December.
>
> —
> Reply to this email directly, view it on GitHub
> <
#24 (comment)>,
> or unsubscribe
> <
https://github.com/notifications/unsubscribe-auth/AAGXPQSZZ3USEWMXQCPDH5T4CKKTLAVCNFSM6AAAAACL2LPE5SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTMNRZHE4DEMBSHE>
> .
> You are receiving this because you were mentioned.Message ID:
> ***@***.***
> com>
>
—
Reply to this email directly, view it on GitHub
<#24 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADUBTVKGPANUGVBXRALPFCL4DGYLRAVCNFSM6AAAAACL2LPE5SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTMOBYGAZDEOBUGQ>
.
You are receiving this because you commented.Message ID:
***@***.***
com>
|
|
So it might be easier if, instead of deleting files from the existing PR,
to make a new PR and copy just the necessary files into it one by one. I
can merge a small one- or two-file PR very quickly
…On Wed, 24 Dec 2025, 10:23 Thomas Wood, ***@***.***> wrote:
Thanks Bob. Yes I'd appreciate it. It's still being maintained and used (I
think we have a few thousand users judging by the Pypi stats) so anything
you can submit would be useful. If possible atomic PRs that change one or
two files each are easiest for me to review. But no rush at all!
On Tue, 23 Dec 2025, 21:33 Bob Dougherty, ***@***.***>
wrote:
> *rfdougherty* left a comment
> (fastdatascience/drug_named_entity_recognition#24)
> <#24 (comment)>
> Hi Thomas,
>
> Thanks for the response! I did put more into the PR than I had intended,
> as
> it included some changes I made that were specific to my use-case. I
> noticed this after submitting and didn't know if the repo was being
> maintained so hadn't bothered to fix it. I'll redo the PR with just the
> generally useful changes and will break it apart if necessary. It may
> take
> me a week or so to get time to do this.
>
> cheers,
> bob
>
> On Thu, Dec 18, 2025 at 4:10 AM Thomas Wood ***@***.***>
> wrote:
>
> > *woodthom2* left a comment
> > (fastdatascience/drug_named_entity_recognition#24)
> > <
> #24 (comment)>
>
> >
> > Hi @rfdougherty <https://github.com/rfdougherty>! Thanks so much for
> this
> > pull request and I really appreciate the time you have put into it and
> your
> > willingness to contribute. Please forgive my late reply.
> >
> > I just have a quick request, there are a lot of files changed (17
> files),
> > so it's a bit hard for me to review as this is the majority of the
> files in
> > the project. I can see at a glance that some things have been removed,
> such
> > as the call to curl if the user is on Windows
> > <
> https://github.com/fastdatascience/drug_named_entity_recognition/pull/24/files#diff-e4d5f442dd795f7b17b0b0e962854b1a9ee54aade46f513337f5b4dc4f916eaf>
>
> > - I am not sure if this is intentional or part of the PR.
> >
> > Would it be possible please to split it up into atomic PRs - if you are
> > fixing multiple issues can you send them as separate PRs, ideally each
> one
> > modifying only one or two files, and also remove things from the PR
> that
> > don't need to be in there? Then I can review more easily. If not, I
> will
> > take the time to review and try to merge as soon as I get some time,
> > perhaps I will merge the files individually.
> >
> > I would like to get it merged as the changes look really valuable,
> > especially if you have improved the data ingestion!
> >
> > We could always connect on a quick video call to go through the changes
> if
> > that works? I'm free in the week beginning 29 December.
> >
> > —
> > Reply to this email directly, view it on GitHub
> > <
> #24 (comment)>,
>
> > or unsubscribe
> > <
> https://github.com/notifications/unsubscribe-auth/AAGXPQSZZ3USEWMXQCPDH5T4CKKTLAVCNFSM6AAAAACL2LPE5SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTMNRZHE4DEMBSHE>
>
> > .
> > You are receiving this because you were mentioned.Message ID:
> > ***@***.***
> > com>
> >
>
> —
> Reply to this email directly, view it on GitHub
> <#24 (comment)>,
> or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/ADUBTVKGPANUGVBXRALPFCL4DGYLRAVCNFSM6AAAAACL2LPE5SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTMOBYGAZDEOBUGQ>
> .
> You are receiving this because you commented.Message ID:
> ***@***.***
> .com>
>
|
|
agree-- I'll create a new PR!
On Wed, Dec 24, 2025 at 2:58 AM Thomas Wood ***@***.***>
wrote:
… *woodthom2* left a comment
(fastdatascience/drug_named_entity_recognition#24)
<#24 (comment)>
So it might be easier if, instead of deleting files from the existing PR,
to make a new PR and copy just the necessary files into it one by one. I
can merge a small one- or two-file PR very quickly
On Wed, 24 Dec 2025, 10:23 Thomas Wood, ***@***.***> wrote:
> Thanks Bob. Yes I'd appreciate it. It's still being maintained and used
(I
> think we have a few thousand users judging by the Pypi stats) so
anything
> you can submit would be useful. If possible atomic PRs that change one
or
> two files each are easiest for me to review. But no rush at all!
>
> On Tue, 23 Dec 2025, 21:33 Bob Dougherty, ***@***.***>
> wrote:
>
>> *rfdougherty* left a comment
>> (fastdatascience/drug_named_entity_recognition#24)
>> <
#24 (comment)>
>> Hi Thomas,
>>
>> Thanks for the response! I did put more into the PR than I had
intended,
>> as
>> it included some changes I made that were specific to my use-case. I
>> noticed this after submitting and didn't know if the repo was being
>> maintained so hadn't bothered to fix it. I'll redo the PR with just the
>> generally useful changes and will break it apart if necessary. It may
>> take
>> me a week or so to get time to do this.
>>
>> cheers,
>> bob
>>
>> On Thu, Dec 18, 2025 at 4:10 AM Thomas Wood ***@***.***>
>> wrote:
>>
>> > *woodthom2* left a comment
>> > (fastdatascience/drug_named_entity_recognition#24)
>> > <
>>
#24 (comment)>
>>
>> >
>> > Hi @rfdougherty <https://github.com/rfdougherty>! Thanks so much for
>> this
>> > pull request and I really appreciate the time you have put into it
and
>> your
>> > willingness to contribute. Please forgive my late reply.
>> >
>> > I just have a quick request, there are a lot of files changed (17
>> files),
>> > so it's a bit hard for me to review as this is the majority of the
>> files in
>> > the project. I can see at a glance that some things have been
removed,
>> such
>> > as the call to curl if the user is on Windows
>> > <
>>
https://github.com/fastdatascience/drug_named_entity_recognition/pull/24/files#diff-e4d5f442dd795f7b17b0b0e962854b1a9ee54aade46f513337f5b4dc4f916eaf>
>>
>> > - I am not sure if this is intentional or part of the PR.
>> >
>> > Would it be possible please to split it up into atomic PRs - if you
are
>> > fixing multiple issues can you send them as separate PRs, ideally
each
>> one
>> > modifying only one or two files, and also remove things from the PR
>> that
>> > don't need to be in there? Then I can review more easily. If not, I
>> will
>> > take the time to review and try to merge as soon as I get some time,
>> > perhaps I will merge the files individually.
>> >
>> > I would like to get it merged as the changes look really valuable,
>> > especially if you have improved the data ingestion!
>> >
>> > We could always connect on a quick video call to go through the
changes
>> if
>> > that works? I'm free in the week beginning 29 December.
>> >
>> > —
>> > Reply to this email directly, view it on GitHub
>> > <
>>
#24 (comment)>,
>>
>> > or unsubscribe
>> > <
>>
https://github.com/notifications/unsubscribe-auth/AAGXPQSZZ3USEWMXQCPDH5T4CKKTLAVCNFSM6AAAAACL2LPE5SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTMNRZHE4DEMBSHE>
>>
>> > .
>> > You are receiving this because you were mentioned.Message ID:
>> > ***@***.***
>> > com>
>> >
>>
>> —
>> Reply to this email directly, view it on GitHub
>> <
#24 (comment)>,
>> or unsubscribe
>> <
https://github.com/notifications/unsubscribe-auth/ADUBTVKGPANUGVBXRALPFCL4DGYLRAVCNFSM6AAAAACL2LPE5SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTMOBYGAZDEOBUGQ>
>> .
>> You are receiving this because you commented.Message ID:
>> ***@***.***
>> .com>
>>
>
—
Reply to this email directly, view it on GitHub
<#24 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGXPQQDLMCEXEOAINFXY6D4DJWWXAVCNFSM6AAAAACL2LPE5SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTMOBZGQ3TGNJYGM>
.
You are receiving this because you were mentioned.Message ID:
***@***.***
com>
|
|
I'm splitting this into two PRs. The first is ready for review: #25. I'll close this PR now. |
Fixed the data harvesting scripts to allow drug data update. Also refactored the fuzzy-matching feature to use fuzzyset2 and resolved several bugs in the code.