
mapping a zipped iterator to produce tuples #40

@ExpHP


I have a function which looks vaguely like this:

struct Rect { real: f64, imag: f64 }
struct KetRef<'a> { real: &'a [f64], imag: &'a [f64] }

impl<'a> KetRef<'a> {
    // Complex dot product <self|other> (conjugating self), with each
    // complex array stored as separate real and imaginary slices.
    // `zip!` stands in for e.g. itertools' `izip!`; `Rect::zero()` and
    // `impl Add for Rect` are elided.
    pub fn dot(self, other: KetRef) -> Rect {
        assert_eq!(self.real.len(), other.real.len());
        assert_eq!(self.real.len(), other.imag.len());
        assert_eq!(self.real.len(), self.imag.len());
        zip!(self.real, self.imag, other.real, other.imag)
            .map(|(ar, ai, br, bi)| {
                let real = ar * br + ai * bi;
                let imag = ar * bi - ai * br;
                Rect { real, imag }
            })
            .fold(Rect::zero(), |a, b| a + b)
    }
}
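
For reference, the arithmetic inside the closure is the conjugated complex dot product expanded into real and imaginary parts (writing a for self and b for other):

$$\langle a \mid b \rangle = \sum_k \overline{a_k}\, b_k = \sum_k \left( a_k^{r} b_k^{r} + a_k^{i} b_k^{i} \right) + i \sum_k \left( a_k^{r} b_k^{i} - a_k^{i} b_k^{r} \right)$$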

Converting it to use faster requires two passes over the arrays; I am unable to produce both real and imag in one pass because simd_map requires the function output to be a single vector:

pub fn dot<K: AsKetRef>(self, other: K) -> Rect {
    use ::faster::prelude::*;

    let other = other.as_ket_ref();
    assert_eq!(self.real.len(), other.real.len());
    assert_eq!(self.real.len(), other.imag.len());
    assert_eq!(self.real.len(), self.imag.len());

    // First pass over all four slices: accumulate the real part.
    let real = (
        self.real.simd_iter(f64s(0.0)),
        self.imag.simd_iter(f64s(0.0)),
        other.real.simd_iter(f64s(0.0)),
        other.imag.simd_iter(f64s(0.0)),
    ).zip().simd_map(|(ar, ai, br, bi)| {
        ar * br + ai * bi
    }).simd_reduce(f64s(0.0), |acc, v| acc + v).sum();

    // Second pass over the same four slices: accumulate the imaginary part.
    let imag = (
        self.real.simd_iter(f64s(0.0)),
        self.imag.simd_iter(f64s(0.0)),
        other.real.simd_iter(f64s(0.0)),
        other.imag.simd_iter(f64s(0.0)),
    ).zip().simd_map(|(ar, ai, br, bi)| {
        ar * bi - ai * br
    }).simd_reduce(f64s(0.0), |acc, v| acc + v).sum();

    Rect { real, imag }
}

So is it faster? Well, actually, yes! It is plenty faster... up to a point:

Change in run-time for different ket lengths
dot/16              change: -33.973%
dot/64              change: -29.575%
dot/256             change: -26.762%
dot/1024            change: -34.054%
dot/4096            change: -36.297%
dot/16384           change: -7.3379%

Yikes! Once we hit 16384 elements there is almost no speedup!

I suspect this is because memory has become the bottleneck at this size, and most of what was gained by using SIMD was lost by making two passes over the arrays. (At 16384 elements, the four f64 slices together occupy 4 × 16384 × 8 B = 512 KiB, which is beyond typical L2 cache sizes, so the second pass likely has to re-fetch everything from further down the memory hierarchy.) It would be nice to have an API that allowed this to be done in one pass, by letting a mapping function return a tuple (producing a new PackedZippedIterator or similar).
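
To make the request concrete, a call site under that API might look something like the sketch below. Nothing here exists in faster today: a tuple-returning closure for simd_map and a tuple accumulator for simd_reduce are exactly the additions being proposed, so this does not compile against the current API.

// HYPOTHETICAL API SKETCH -- simd_map returning a tuple of vectors and
// simd_reduce taking a tuple accumulator are the proposal, not current API.
let (real, imag) = (
    self.real.simd_iter(f64s(0.0)),
    self.imag.simd_iter(f64s(0.0)),
    other.real.simd_iter(f64s(0.0)),
    other.imag.simd_iter(f64s(0.0)),
).zip().simd_map(|(ar, ai, br, bi)| {
    // Both outputs computed from a single load of each input vector.
    (ar * br + ai * bi, ar * bi - ai * br)
}).simd_reduce((f64s(0.0), f64s(0.0)), |(r_acc, i_acc), (r, i)| {
    (r_acc + r, i_acc + i)
});
let real = real.sum();
let imag = imag.sum();

With that shape, each element of the four slices is loaded exactly once per dot product, which is what matters once the working set no longer fits in cache.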
