
mapping a zipped iterator to produce tuples #40

@ExpHP


I have a function which looks vaguely like this:

struct Rect { real: f64, imag: f64 }
struct KetRef<'a> { real: &'a [f64], imag: &'a [f64] }

impl<'a> KetRef<'a> {
    // Complex dot product <self|other> (conjugating self), with each
    // complex array stored as separate real and imaginary slices.
    // `zip!` stands in for e.g. itertools' `izip!`; `Rect::zero()` and
    // `impl Add for Rect` are elided.
    pub fn dot(self, other: KetRef) -> Rect {
        assert_eq!(self.real.len(), other.real.len());
        assert_eq!(self.real.len(), other.imag.len());
        assert_eq!(self.real.len(), self.imag.len());
        zip!(self.real, self.imag, other.real, other.imag)
            .map(|(ar, ai, br, bi)| {
                let real = ar * br + ai * bi;
                let imag = ar * bi - ai * br;
                Rect { real, imag }
            })
            .fold(Rect::zero(), |a, b| a + b)
    }
}
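
For reference, the arithmetic inside the closure is the conjugated complex dot product expanded into real and imaginary parts (writing a for self and b for other):

$$\langle a \mid b \rangle = \sum_k \overline{a_k}\, b_k = \sum_k \left( a_k^{r} b_k^{r} + a_k^{i} b_k^{i} \right) + i \sum_k \left( a_k^{r} b_k^{i} - a_k^{i} b_k^{r} \right)$$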

Converting it to use faster requires two passes over the arrays; I am unable to produce both real and imag in one pass because simd_map requires the function output to be a single vector:

pub fn dot<K: AsKetRef>(self, other: K) -> Rect {
    use ::faster::prelude::*;

    let other = other.as_ket_ref();
    assert_eq!(self.real.len(), other.real.len());
    assert_eq!(self.real.len(), other.imag.len());
    assert_eq!(self.real.len(), self.imag.len());

    // First pass over all four slices: accumulate the real part.
    let real = (
        self.real.simd_iter(f64s(0.0)),
        self.imag.simd_iter(f64s(0.0)),
        other.real.simd_iter(f64s(0.0)),
        other.imag.simd_iter(f64s(0.0)),
    ).zip().simd_map(|(ar, ai, br, bi)| {
        ar * br + ai * bi
    }).simd_reduce(f64s(0.0), |acc, v| acc + v).sum();

    // Second pass over the same four slices: accumulate the imaginary part.
    let imag = (
        self.real.simd_iter(f64s(0.0)),
        self.imag.simd_iter(f64s(0.0)),
        other.real.simd_iter(f64s(0.0)),
        other.imag.simd_iter(f64s(0.0)),
    ).zip().simd_map(|(ar, ai, br, bi)| {
        ar * bi - ai * br
    }).simd_reduce(f64s(0.0), |acc, v| acc + v).sum();

    Rect { real, imag }
}

So is it faster? Well, actually, yes! It is plenty faster... up to a point:

Change in run-time for different ket lengths
dot/16              change: -33.973%
dot/64              change: -29.575%
dot/256             change: -26.762%
dot/1024            change: -34.054%
dot/4096            change: -36.297%
dot/16384           change: -7.3379%

Yikes! Once we hit 16384 elements there is almost no speedup!

I suspect this is because memory has become the bottleneck at this size, and most of what was gained by using SIMD was lost by making two passes over the arrays. (At 16384 elements, the four f64 slices together occupy 4 × 16384 × 8 B = 512 KiB, which is beyond typical L2 cache sizes, so the second pass likely has to re-fetch everything from further down the memory hierarchy.) It would be nice to have an API that allowed this to be done in one pass, by letting a mapping function return a tuple (producing a new PackedZippedIterator or similar).
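
To make the request concrete, a call site under that API might look something like the sketch below. Nothing here exists in faster today: a tuple-returning closure for simd_map and a tuple accumulator for simd_reduce are exactly the additions being proposed, so this does not compile against the current API.

// HYPOTHETICAL API SKETCH -- simd_map returning a tuple of vectors and
// simd_reduce taking a tuple accumulator are the proposal, not current API.
let (real, imag) = (
    self.real.simd_iter(f64s(0.0)),
    self.imag.simd_iter(f64s(0.0)),
    other.real.simd_iter(f64s(0.0)),
    other.imag.simd_iter(f64s(0.0)),
).zip().simd_map(|(ar, ai, br, bi)| {
    // Both outputs computed from a single load of each input vector.
    (ar * br + ai * bi, ar * bi - ai * br)
}).simd_reduce((f64s(0.0), f64s(0.0)), |(r_acc, i_acc), (r, i)| {
    (r_acc + r, i_acc + i)
});
let real = real.sum();
let imag = imag.sum();

With that shape, each element of the four slices is loaded exactly once per dot product, which is what matters once the working set no longer fits in cache.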
