Implement DuckDB-based converter #198
You can set KV_METADATA in DuckDB? That's awesome and resolves the primary issue we had on our list! Will test later, thanks! Maybe it makes sense to jump on a call for the metadata discussion.
Yes. I first tried the Python API, but that doesn't expose the metadata option. It would be great to have a call on the metadata details; you wrote the vecorel-parquet logic and seem well informed...
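For reference, a minimal sketch of how the KV_METADATA option could be used from Python by sending a `COPY ... TO` statement as plain SQL. The helper names (`sql_literal`, `build_copy_sql`, `write_parquet_with_metadata`) and the example key are hypothetical and not taken from this PR. The sketch assumes that non-string values should be serialized as JSON strings.

```python
import json


def sql_literal(text: str) -> str:
    """Quote a Python string as a SQL string literal (single quotes doubled)."""
    return "'" + text.replace("'", "''") + "'"


def build_copy_sql(select_sql: str, out_path: str, metadata: dict) -> str:
    """Build a COPY ... TO statement that embeds key-value metadata.

    Non-string values are serialized to JSON (an assumption of this sketch).
    """
    entries = ", ".join(
        f"{sql_literal(key)}: "
        f"{sql_literal(value if isinstance(value, str) else json.dumps(value))}"
        for key, value in metadata.items()
    )
    return (
        f"COPY ({select_sql}) TO {sql_literal(out_path)} "
        f"(FORMAT PARQUET, KV_METADATA {{{entries}}})"
    )


def write_parquet_with_metadata(select_sql: str, out_path: str, metadata: dict):
    """Run the COPY statement and read the metadata back for a quick check."""
    import duckdb  # third-party package, imported lazily

    duckdb.sql(build_copy_sql(select_sql, out_path, metadata))
    # parquet_kv_metadata returns keys and values as BLOBs
    query = f"SELECT key, value FROM parquet_kv_metadata({sql_literal(out_path)})"
    return duckdb.sql(query).fetchall()
```

Calling `write_parquet_with_metadata("SELECT 1 AS id", "example.parquet", {"example_key": {"a": 1}})` should write the file and return the stored key-value pairs.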
I'm still stuck on fiboa validation: As I understand it, the validator takes the Parquet schema and checks it against the fiboa schema (see implementation). The fiboa schema is built from the extensions (and is correct in this case), but the Parquet schema is implicitly created by DuckDB (with …). Can I force this to a 'non-null' result? I've tried casting with …. Maybe related to duckdb/duckdb#13949
I think the DuckDB Parquet writer doesn't support setting the nullability derived from the result set. Same for …
To me that sounds like a bug in DuckDB. Is there an open issue for it? If not, maybe open one?
Is "non-nullability" a property of a query result column? Maybe this information is lost. But at least it's a feature request.
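To make the nullability question concrete, here is a rough sketch of how one might list the columns that the written file marks as nullable even though they are expected to be required. It reads DuckDB's `parquet_schema` table function, where `repetition_type` is `OPTIONAL` for nullable columns and `REQUIRED` otherwise. The helper names and the column names in the usage note are hypothetical, and this is not the validator's actual logic.

```python
def nullable_required_columns(schema_rows, required_columns):
    """Return the required columns that the Parquet file marks as OPTIONAL.

    schema_rows: iterable of (name, repetition_type) tuples.
    required_columns: column names the target schema expects to be non-null.
    """
    repetition = {name: rep for name, rep in schema_rows}
    return [col for col in required_columns if repetition.get(col) == "OPTIONAL"]


def read_schema_rows(path: str):
    """Fetch (name, repetition_type) for every schema element of a Parquet file."""
    import duckdb  # third-party package, imported lazily

    quoted = "'" + path.replace("'", "''") + "'"
    query = f"SELECT name, repetition_type FROM parquet_schema({quoted})"
    return duckdb.sql(query).fetchall()
```

For example, `nullable_required_columns(read_schema_rows("out.parquet"), ["id", "geometry"])` would list which of those two columns were written as nullable.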
The downside of our geopandas-based FiboaConverters is that the dataframe needs to be in memory, and it can be huge. A completely different approach would be to use DuckDB for the conversion. To test its viability, I took the dataset with the largest dataframe (Japan). I no longer need a 128GB+ machine: the conversion runs on a laptop using 20GB of memory and finishes much faster (<30 minutes).
I need some help with the parquet metadata...