Data loss when using PersistentDirectoryClient #18

@utdrmac

Description

In the example below, a Deque is created using PersistentDirectoryClient with "mydir" as the storage directory. Three items are added to the deque, and then all of them are printed. Finally, one item ('Zero') is removed using popleft(). The deque should then hold two items ('First' and 'Second'), and "mydir" should contain two files, one for each remaining item.

Before running this test, change the mode at iterables/clients.py#L82 to "w+b" to fix #16.

$ ls -l mydir/
total 0

$ cat disk_issue.py
from functools import partial
from diskcollections.serializers import PickleSerializer
from diskcollections.iterables import Deque, PersistentDirectoryClient

pdc = partial(PersistentDirectoryClient,"mydir")

mydeque = partial(
    Deque,
    client_class=pdc,
    serializer_class=PickleSerializer
)

queue = mydeque()

# Add strings
queue.append("Zero")
queue.append("First")
queue.append("Second")

# Inspect the contents
print("Contents of the deque:")
for item in queue:
    print(f"- {item}")

print("POPPING LEFT")
popped = queue.popleft()
print(f"POPPED: {popped}")

Result:

Contents of the deque:
- Zero
- First
- Second
POPPING LEFT
Traceback (most recent call last):
  File "/Users/utdrmac/pdctest/disk.py", line 27, in <module>
    popped = queue.popleft()
  File "/Users/utdrmac/pdctest/python-disk-collections/src/diskcollections/iterables/iterables.py", line 228, in popleft
    del self[0]
        ~~~~^^^
  File "/Users/utdrmac/pdctest/python-disk-collections/src/diskcollections/iterables/iterables.py", line 184, in __delitem__
    del self.__client[idx]
        ~~~~~~~~~~~~~^^^^^
  File "/Users/utdrmac/pdctest/python-disk-collections/src/diskcollections/iterables/clients.py", line 134, in __delitem__
    file = open(file_path, mode="r+")
FileNotFoundError: [Errno 2] No such file or directory: 'mydir/1'

$ ls -l mydir/
total 8
-rw-r--r--  1 utdrmac  staff  21 May 29 08:03 0
$ cat mydir/0
��
�Second�.

This is incorrect. Only one file remains, and it contains "Second". The file for the second entry, "First", is gone, so data has been lost. I believe the issue is in the __delitem__ method of PersistentDirectoryClient in iterables/clients.py:

        for i in range(len(self.__files))[::-1]:
            if i < index:
                continue

            self.__files[i].close()
            old_file_path = self.get_file_path(i + 1)
            new_file_path = self.get_file_path(i)
            os.rename(old_file_path, new_file_path)

This loop processes the remaining files in reverse order, which causes the data loss. After the three entries are added, the directory holds mydir/0, mydir/1, and mydir/2. When popleft() is called, mydir/0 is removed and the length of __files drops to 2, so range(2)[::-1] yields 1, 0. On the first iteration (i = 1), old_file_path is mydir/2 and new_file_path is mydir/1, so mydir/2 is renamed to mydir/1. That rename overwrites the contents of mydir/1 ("First"), which is the data loss. The next iteration (i = 0) would then rename mydir/1 to mydir/0, which is consistent with the single remaining file, mydir/0, containing "Second".
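The effect of the rename order can be reproduced without the library. The following is a minimal sketch, not the library's code: make_files, shift_down, read_all, and simulate are hypothetical helpers written for this illustration, and os.replace is used instead of os.rename so that the overwrite behaves the same on every platform (os.rename overwrites an existing destination on POSIX). It assumes the intended behaviour after a delete is to shift each later file down by one slot.

```python
import os
import tempfile


def make_files(directory, items):
    """Write one file per item, named by its index (0, 1, 2, ...)."""
    for i, item in enumerate(items):
        with open(os.path.join(directory, str(i)), "w") as f:
            f.write(item)


def shift_down(directory, index, remaining, descending):
    """Rename file i+1 to i for every i in [index, remaining).

    `descending=True` mirrors the reversed loop quoted in the issue;
    `descending=False` walks the same indexes in ascending order.
    """
    order = range(index, remaining)
    if descending:
        order = reversed(order)
    for i in order:
        # os.replace overwrites an existing destination on all platforms.
        os.replace(
            os.path.join(directory, str(i + 1)),
            os.path.join(directory, str(i)),
        )


def read_all(directory):
    """Return {file name: contents} for every file in the directory."""
    result = {}
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as f:
            result[name] = f.read()
    return result


def simulate(descending):
    """Create three files, delete slot 0, then shift the rest down."""
    with tempfile.TemporaryDirectory() as d:
        make_files(d, ["Zero", "First", "Second"])
        os.remove(os.path.join(d, "0"))  # stands in for removing mydir/0
        shift_down(d, index=0, remaining=2, descending=descending)
        return read_all(d)


print(simulate(descending=True))   # {'0': 'Second'}
print(simulate(descending=False))  # {'0': 'First', '1': 'Second'}
```

The descending run ends with a single file holding "Second", matching the ls output above. The ascending run keeps both remaining items, which suggests that iterating in ascending order is one possible fix.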
