The Global Database of Events, Language and Tone (GDELT) project claims to be the largest, most comprehensive, and highest-resolution open database of human society ever created. GDELT data goes back decades and is updated every 15 minutes. The link provided is well worth exploring if you are not familiar with this fascinating dataset.
nu-gdelt fetches the new data every 15 minutes from the raw compressed CSV files, casts each attribute to the correct type, and saves the result as parquet files in monthly partitioned directories. The raw CSV files are stored in a Bronze partitioned directory and the parquet files in a Silver partitioned directory, following the data lake medallion architecture.
nu-gdelt uses nushell scripts while harnessing duckdb for the data transformations.
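
To make the flow above concrete, here is a minimal Python sketch of how a timestamp could map to a 15-minute slot and to monthly Bronze and Silver partitions. It is not nu-gdelt's code, which is written in nushell. The function names, the file names, and the `bronze/YYYY/MM` layout are assumptions for illustration; only the `silver/YYYY/MM/*.parquet` pattern appears in the queries in this README.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def floor_to_slot(ts: datetime, minutes: int = 15) -> datetime:
    """Round a timestamp down to the start of its 15-minute slot."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def partition_paths(ts: datetime) -> tuple[PurePosixPath, PurePosixPath]:
    """Return hypothetical (bronze, silver) file paths for the slot containing ts."""
    slot = floor_to_slot(ts)
    stamp = slot.strftime("%Y%m%d%H%M%S")  # assumed file stem, one per slot
    month_dir = slot.strftime("%Y/%m")     # monthly partition, e.g. 2024/09
    bronze = PurePosixPath("bronze") / month_dir / f"{stamp}.csv"
    silver = PurePosixPath("silver") / month_dir / f"{stamp}.parquet"
    return bronze, silver


if __name__ == "__main__":
    bronze, silver = partition_paths(
        datetime(2024, 9, 30, 23, 52, 10, tzinfo=timezone.utc)
    )
    print(bronze)
    print(silver)
```

Keying both layers on the same slot timestamp means a Silver file can always be traced back to the Bronze file it was built from.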
nu-gdelt is intended to be run by a cron job every 15 minutes, for example:

    */15 * * * * /full/file/path/filename.nu

Once the data builds up in your own personal data lake, you can query it using duckdb rather simply. For example, to view the schema and details of the data, you could run:

    duckdb
    .mode line
    DESCRIBE SELECT * FROM read_parquet('silver/2024/09/*.parquet');

Or, to see some aggregates of all of your data, you could run:

    SUMMARIZE SELECT * FROM read_parquet('silver/2024/10/*.parquet');

As your data grows, it stays organised and in a form that is efficient and easy to query.
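
Queries that span several months need one glob per monthly partition. The following Python sketch builds such a query string. The helper names are hypothetical, and it assumes the `silver/YYYY/MM/*.parquet` layout used in the examples above; it relies on DuckDB's `read_parquet` accepting a list of paths or globs.

```python
from datetime import date


def month_globs(start: date, end: date, root: str = "silver") -> list[str]:
    """List one glob per month partition from start to end, inclusive."""
    globs = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        globs.append(f"{root}/{year:04d}/{month:02d}/*.parquet")
        # Step to the next month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return globs


def summarize_sql(start: date, end: date) -> str:
    """Build a DuckDB SUMMARIZE statement over the chosen month partitions."""
    files = ", ".join(f"'{g}'" for g in month_globs(start, end))
    return f"SUMMARIZE SELECT * FROM read_parquet([{files}]);"


if __name__ == "__main__":
    print(summarize_sql(date(2024, 9, 1), date(2024, 10, 31)))
```

The printed statement can be pasted into the duckdb shell as is.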
The final parquet files have the following schema. This schema closely matches that of the GDELT source, as per the documentation.
| column_name | column_type | nullable |
|---|---|---|
| GlobalEventID | INTEGER | YES |
| Day | INTEGER | YES |
| MonthYear | INTEGER | YES |
| Year | INTEGER | YES |
| FractionDate | FLOAT | YES |
| Actor1Code | VARCHAR | YES |
| Actor1Name | VARCHAR | YES |
| Actor1CountryCode | VARCHAR | YES |
| Actor1KnownGroupCode | VARCHAR | YES |
| Actor1EthnicCode | VARCHAR | YES |
| Actor1Religion1Code | VARCHAR | YES |
| Actor1Religion2Code | VARCHAR | YES |
| Actor1Type1Code | VARCHAR | YES |
| Actor1Type2Code | VARCHAR | YES |
| Actor1Type3Code | VARCHAR | YES |
| Actor2Code | VARCHAR | YES |
| Actor2Name | VARCHAR | YES |
| Actor2CountryCode | VARCHAR | YES |
| Actor2KnownGroupCode | VARCHAR | YES |
| Actor2EthnicCode | VARCHAR | YES |
| Actor2Religion1Code | VARCHAR | YES |
| Actor2Religion2Code | VARCHAR | YES |
| Actor2Type1Code | VARCHAR | YES |
| Actor2Type2Code | VARCHAR | YES |
| Actor2Type3Code | VARCHAR | YES |
| IsRootEvent | INTEGER | YES |
| EventCode | VARCHAR | YES |
| EventBaseCode | VARCHAR | YES |
| EventRootCode | VARCHAR | YES |
| QuadClass | INTEGER | YES |
| GoldsteinScale | FLOAT | YES |
| NumMentions | INTEGER | YES |
| NumSources | INTEGER | YES |
| NumArticles | INTEGER | YES |
| AvgTone | FLOAT | YES |
| Actor1Geo_Type | INTEGER | YES |
| Actor1Geo_FullName | VARCHAR | YES |
| Actor1Geo_CountryCode | VARCHAR | YES |
| Actor1Geo_ADM1Code | VARCHAR | YES |
| Actor1Geo_ADM2Code | VARCHAR | YES |
| Actor1Geo_Lat | FLOAT | YES |
| Actor1Geo_Long | FLOAT | YES |
| Actor1Geo_FeatureID | VARCHAR | YES |
| Actor2Geo_Type | INTEGER | YES |
| Actor2Geo_FullName | VARCHAR | YES |
| Actor2Geo_CountryCode | VARCHAR | YES |
| Actor2Geo_ADM1Code | VARCHAR | YES |
| Actor2Geo_ADM2Code | VARCHAR | YES |
| Actor2Geo_Lat | FLOAT | YES |
| Actor2Geo_Long | FLOAT | YES |
| Actor2Geo_FeatureID | VARCHAR | YES |
| ActionGeo_Type | INTEGER | YES |
| ActionGeo_FullName | VARCHAR | YES |
| ActionGeo_CountryCode | VARCHAR | YES |
| ActionGeo_ADM1Code | VARCHAR | YES |
| ActionGeo_ADM2Code | VARCHAR | YES |
| ActionGeo_Lat | FLOAT | YES |
| ActionGeo_Long | FLOAT | YES |
| ActionGeo_FeatureID | VARCHAR | YES |
| DATEADDED | BIGINT | YES |
| SOURCEURL | VARCHAR | YES |
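
Several of the date columns above are stored as plain integers. As a hedged illustration, the Python sketch below converts them to real date types, assuming the GDELT 2.0 conventions of `Day` as `YYYYMMDD` and `DATEADDED` as `YYYYMMDDHHMMSS` in UTC. Check the GDELT documentation before relying on this; the function names are hypothetical.

```python
from datetime import date, datetime, timezone


def parse_day(day: int) -> date:
    """Convert a Day value such as 20240930 into a date (assumes YYYYMMDD)."""
    return datetime.strptime(str(day), "%Y%m%d").date()


def parse_dateadded(value: int) -> datetime:
    """Convert a DATEADDED value into a UTC datetime (assumes YYYYMMDDHHMMSS)."""
    naive = datetime.strptime(str(value), "%Y%m%d%H%M%S")
    return naive.replace(tzinfo=timezone.utc)


if __name__ == "__main__":
    print(parse_day(20240930))
    print(parse_dateadded(20240930234500))
```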
nu-gdelt also writes logs to a custom log file called gdelt.log. Each log entry consists of a Datetime, a Severity and a Message. The log file is formatted to be easy to read. Additionally, because we are using nushell, we can easily navigate and filter the logs using the following command:

    open gdelt.log | lines | split column " - " | rename "Datetime" "Severity" "Message" | into value

This command produces a table of log data that can be filtered and sorted, including by time, because the Datetime column is read in as an actual date type.
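
For readers who want to process the same log outside nushell, here is a rough Python equivalent. It assumes each line has the form `Datetime - Severity - Message` and that the datetime is ISO 8601; the exact timestamp format and severity names used in gdelt.log are not documented here, so the sample lines are invented for illustration.

```python
from datetime import datetime
from typing import Iterable, NamedTuple


class LogEntry(NamedTuple):
    when: datetime
    severity: str
    message: str


def parse_line(line: str) -> LogEntry:
    """Split one 'Datetime - Severity - Message' line into typed fields."""
    # maxsplit=2 keeps any " - " inside the message intact.
    when, severity, message = line.rstrip("\n").split(" - ", 2)
    return LogEntry(datetime.fromisoformat(when), severity, message)


def filter_logs(lines: Iterable[str], severity: str) -> list[LogEntry]:
    """Return the parsed entries with the given severity, sorted by time."""
    entries = (parse_line(line) for line in lines if line.strip())
    return sorted((e for e in entries if e.severity == severity), key=lambda e: e.when)


# Invented sample lines; real gdelt.log content may differ.
SAMPLE = [
    "2024-09-30 23:45:02 - INFO - Download started",
    "2024-09-30 23:45:09 - ERROR - Fetch failed - retrying",
    "2024-09-30 23:45:15 - INFO - Saved parquet file",
]

if __name__ == "__main__":
    for entry in filter_logs(SAMPLE, "ERROR"):
        print(entry.when, entry.message)
```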
If you want to observe the logs as the program is running, you can run the following:

    tail -f gdelt.log

Press CTRL+C to stop tailing.
Future iterations of the logging functionality will include automatic log rotation, archiving, and compression.
--
This repo is under active development.