Skip to content

Conversation

@kata198
Copy link

@kata198 kata198 commented Apr 1, 2017

Hey bud,

This patch adds .pop(idx) to both linked and double-linked list implementation.

This follows the same pattern as the standard python list, where an index can be provided (positive or negative) to the pop function to pop a specific element.

The index is optional, and if not provided defaults to popright (so backwards-compatible, and compatible with base list).

Please consider merging and re-releasing with the addition of this patch.

The primary advantage of a linked-list or double-linked-list versus an array is the ability to work on the middle without rebuilding the entire list. You already have support for inserts in the middle of the list.
This adds that benefit with removal (via "pop") to these implementations, and thus allows them to out-perform the base python list for various scenarios.

Thanks,
Tim

@kata198
Copy link
Author

kata198 commented Apr 1, 2017

In my test, this does perform well wiith smaller-sized lists (as in, faster than the python base list)

But still python base list is faster for larger sizes (like around 1 million elements), because the walk time takes so long.

I'm currently looking into implementing a compound linked list. That is a blocked series of linked lists in an array, so for example if you had 500 elements and block size 100, you'd have 5 linked lists in an array. If you requested element 220, you'd start at idx=2 and only have to walk 20 items instead of 220.

I haven't seen such before, I may submit it soon if it solves my needs (beating python list at random-access popping from large sizes)

kata198 added 2 commits April 2, 2017 00:16
…ight, and call those functions. This allows us to always skip forward one walk iteration, and remove any logic for first or last.

Also, on double-linked list pop, walk from the back if we are popping past middle.

These changes bring the linked list random-pop benchmark much closer to
the python list performance even further out (like 400 list elements,
200 pops they perform the same. Below that linked list impl's now
perform better, above that list performs better (because of walk time)
@kata198
Copy link
Author

kata198 commented Apr 2, 2017

After my updates today, it performs even better. On a 400 element list with 200 random-index pops, single-linked-list and python base list perform the same, double-linked-list is faster. Below that, the linked list impls are faster, above that python list gets faster (because walk time takes up so much of the time)

src/dllist.c Outdated
self->last = (PyObject*)prev_node;
}

if (self->last_accessed_node == (PyObject*)del_node)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a problem with this if statement. last_accessed_node and last_accessed_idx should always point at the same node. When you remove an item from a list, all nodes behind it are shifted to a lower index, so these two fields can become unsynchronized.
The simplest solution for this issue would be to unconditionally invalidate the cache like dllist_remove. Alternatively you could decrement last_accessed_idx if it's larger than the removed index (dllist_popleft does something similar).

index = ((DLListObject*)self)->size + index;

/* Either a negative greater than index size, or a positive greater than size */
if ( index < 0 || index >= ((DLListObject*)self)->size )
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might reuse dllist_get_node_internal here to locate the node to remove. This function can walk the list from the last accessed element, so the following code could be faster:

item = list[idx]    # remembers last acessed item
list.pop(idx)

@ajakubek
Copy link
Owner

ajakubek commented Apr 2, 2017

Hey,

Thanks for the pull request. The code looks good and a generalized pop method can indeed be useful. I'll merge your branch once the problem with last accessed item cache is resolved. Would you like to provide a patch for this issue?

Also, could you update the documentation of the pop method? It's in docs/index.rst, you can build it with make docs.

Thanks,
Adam

kata198 added 4 commits April 7, 2017 15:34
…o now DLList can extend directly with SLList and SLList can extend directly with a DLList, at a much much higher performance rate than before
Copy link
Owner

@ajakubek ajakubek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last_accessed node and index fields are meant to improve access time for nearby items when they are fetched using subscript operator (list[index]).
This feature makes dllist extremely fast (80% speed of Python's builtin list) as long as index is changed by a small amount between accesses.

Your commit implements midpoint caching, which makes item lookup to 50% faster in general case, but at the same time completely removes the former optimization.

Compare results of the following benchmark with and without this commit to see the difference:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from collections import deque
from llist import dllist, sllist
import time

num = 100000

def measure_iteration(container):
    for idx in range(len(container)):
        container[idx]

for container in [deque, dllist, sllist]:
    c = container(range(num))
    start = time.time()
    measure_iteration(c)
    elapsed = time.time() - start
    print "Completed %s iteration in \t\t%.8f seconds:\t %.1f ops/sec" % (
        container.__name__,
        elapsed,
        num / elapsed)

while ( (PyObject*)node != Py_None )
{
if( node->value == value )
return PyLong_FromSsize_t(idx);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyLong_FromSsize_t is unavailable in Python 2.5, so we need to drop support for it in README and documentation. Shouldn't be an issue (2.5 was EOLed years ago).

/* NOTE - THIS FUNCTION DOES NOT WORK!!
*
* dllist([1, 5, 9]) has the SAME hash as dllist([5, 1, 9])
* and thus it is NOT a hash function
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that we should have chosen a better hashing function. I'll change it to something that detects simple reordering of list items.

@ajakubek
Copy link
Owner

ajakubek commented Apr 9, 2017

Ok, to sum up the current state of this merge request. There are several things that I like. The pop, index and rindex functions are definitely useful and I'd be happy to merge them. I can also pull the minor fixes and the refactoring into generic LListObject.
However, the part which replaces caching of recently accessed item with the middle one is problematic. It causes significant regression in performance for use cases which had been actually documented to be fast. I'm a bit wary to merge such change.
I can cherry-pick the other commits into my repository and close this PR next weekend. Or you can resubmit the merge request without the removal of index caching, and then I'll just pull your branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants