
Conversation

@Shellmia0

feat: adding hive support

@github-actions


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a pull request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Shellmia0, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the system's data source capabilities by adding native support for Apache Hive. The integration is designed for high performance, utilizing the Arrow Flight SQL protocol for data exchange. To facilitate development, testing, and user understanding, a self-contained tutorial environment has been provided, which includes a simulated Hive backend and all necessary scripts to demonstrate federated queries with Hive.
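
To make the data path concrete, here is a minimal sketch of a client querying an Arrow Flight SQL endpoint such as the tutorial's simulated Hive backend, using the ADBC Flight SQL driver. The URI, port, and table name are illustrative assumptions, not values taken from this PR:

    # Minimal sketch: query an Arrow Flight SQL endpoint from Python.
    # The endpoint URI and table name are assumptions for illustration.
    import adbc_driver_flightsql.dbapi as flight_sql

    with flight_sql.connect("grpc://localhost:12929") as conn:
        cur = conn.cursor()
        # The server executes whatever SQL it receives over Flight SQL.
        cur.execute("SELECT * FROM user_stats LIMIT 5")
        for row in cur.fetchall():
            print(row)
        cur.close()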

Highlights

  • Hive Data Source Integration: Introduced a new HIVE data source kind, enabling the system to connect to Hive databases.
  • Arrow Flight SQL Protocol: Integrated Hive support using the Arrow Flight SQL protocol, leveraging the existing ArrowSqlAdaptorFactory for efficient data transfer and native columnar support.
  • Hive SQL Dialect: Implemented a HiveDialect for SQL parsing and rewriting, including specific function name mappings (e.g., ifnull to nvl) and type conversions (e.g., VARCHAR to STRING, LONG to BIGINT) to ensure compatibility with Hive's SQL syntax; a rough sketch of these mappings follows this list.
  • Comprehensive Tutorial and Testing Environment: Added a complete tutorial and testing infrastructure for Hive, featuring a Python-based Arrow Flight SQL server to simulate Hive backends, dedicated configuration files, and example SQL/Python scripts for setup, federated queries, and service management.
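
The dialect rewrites mentioned above are, at their core, lookup-table substitutions. A rough illustrative sketch in Python (the names below are assumptions for illustration, not the actual HiveDialect implementation):

    # Illustrative only: the kind of mapping a Hive dialect applies.
    FUNCTION_MAP = {"ifnull": "nvl"}
    TYPE_MAP = {"VARCHAR": "STRING", "LONG": "BIGINT"}

    def rewrite_function(name: str) -> str:
        """Map a generic SQL function name to its Hive equivalent."""
        return FUNCTION_MAP.get(name.lower(), name)

    def rewrite_type(type_name: str) -> str:
        """Map a generic SQL column type to its Hive equivalent."""
        return TYPE_MAP.get(type_name.upper(), type_name)

    assert rewrite_function("ifnull") == "nvl"
    assert rewrite_type("varchar") == "STRING"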


@Shellmia0
Author

I have read the CLA Document and I hereby sign the CLA

@gemini-code-assist
Contributor

Code Review

This pull request introduces support for Hive as a data source, primarily by reusing the Arrow Flight SQL adapter. The changes include backend modifications to recognize the Hive dialect and a comprehensive set of new files for an example tutorial, complete with configuration, setup scripts, and a mock server. While the core backend changes appear solid, my review has identified several critical and high-severity issues within the new example scripts. These include significant security risks from hardcoded credentials and disabled TLS verification, a critical correctness bug in a database initialization script, and maintainability problems stemming from hardcoded paths and brittle manual protobuf parsing in the mock server. Addressing these issues is crucial to ensure the example is secure, functional, and promotes best practices.

Comment on lines +24 to +90
def parse_flight_sql_command(data: bytes) -> str:
    """
    Parse an Arrow Flight SQL CommandStatementQuery protobuf message.
    The protobuf definition of CommandStatementQuery is roughly:
        message CommandStatementQuery {
            string query = 1;
            string transaction_id = 2;
        }
    In the wire format:
    - Field 1 (query): tag = 0x0a (field 1, wire type 2 = length-delimited)
    - followed by a varint length
    - followed by the UTF-8 encoded string
    """
    if not data:
        return ""

    try:
        # Check whether the message is wrapped in a google.protobuf.Any.
        # The Any layout is: field 1 = type_url, field 2 = value.
        # The type_url usually starts with "type.googleapis.com/".
        if b"type.googleapis.com" in data:
            # Skip the Any wrapper and look for the inner CommandStatementQuery.
            # Find the start of field 2 (value).
            idx = 0
            while idx < len(data):
                if data[idx] == 0x12:  # field 2, wire type 2
                    idx += 1
                    # Read the varint length
                    length, varint_size = _read_varint(data, idx)
                    idx += varint_size
                    # Extract the inner message
                    inner_data = data[idx:idx+length]
                    # Recursively parse the inner message
                    return parse_flight_sql_command(inner_data)
                idx += 1

        # Try to parse the bytes directly as a CommandStatementQuery
        idx = 0
        while idx < len(data):
            tag = data[idx]
            idx += 1

            if tag == 0x0a:  # field 1 (query), wire type 2 (length-delimited)
                length, varint_size = _read_varint(data, idx)
                idx += varint_size
                query_bytes = data[idx:idx+length]
                return query_bytes.decode("utf-8")
            elif (tag & 0x07) == 2:  # other length-delimited field, skip it
                length, varint_size = _read_varint(data, idx)
                idx += varint_size + length
            elif (tag & 0x07) == 0:  # varint field, skip it
                _, varint_size = _read_varint(data, idx)
                idx += varint_size
            else:
                # Unknown wire type, give up
                break

        # If parsing fails, fall back to decoding the bytes as a string
        return data.decode("utf-8", errors="replace")

    except Exception as e:
        print(f"[warning] failed to parse protobuf: {e}")
        # Fall back to direct decoding
        return data.decode("utf-8", errors="replace")


critical

The function parse_flight_sql_command and its helper _read_varint manually parse raw protobuf bytes. This approach is extremely brittle, difficult to maintain, and prone to errors. It can easily break if the underlying protobuf definitions change. It is strongly recommended to use a proper protobuf library (e.g., the protobuf package for Python) to handle serialization and deserialization safely and reliably. You can generate Python code from the .proto files for this purpose.
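
A minimal sketch of that suggestion, assuming a FlightSql_pb2 module has been generated from Arrow's FlightSql.proto with protoc (the generated module name is an assumption):

    # Unpack CommandStatementQuery with generated protobuf classes instead
    # of hand-parsing wire bytes. FlightSql_pb2 is assumed to be generated
    # from Arrow's FlightSql.proto (e.g. via `protoc --python_out=.`).
    from google.protobuf import any_pb2

    import FlightSql_pb2  # assumed generated module

    def parse_flight_sql_command(data: bytes) -> str:
        """Extract the SQL text from an Any-wrapped CommandStatementQuery."""
        any_msg = any_pb2.Any()
        any_msg.ParseFromString(data)
        cmd = FlightSql_pb2.CommandStatementQuery()
        if not any_msg.Unpack(cmd):
            raise ValueError(f"unexpected command type: {any_msg.type_url}")
        return cmd.query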

Comment on lines +8 to +34
CREATE TABLE user_stats (
    ID STRING,
    credit_rank INT,
    income INT,
    age INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

INSERT INTO user_stats VALUES
    ('id0001', 6, 100000, 20),
    ('id0002', 5, 90000, 19),
    ('id0003', 6, 89700, 32),
    ('id0005', 6, 607000, 30),
    ('id0006', 5, 30070, 25),
    ('id0007', 6, 12070, 28),
    ('id0008', 6, 200800, 50),
    ('id0009', 6, 607000, 30),
    ('id0010', 5, 30070, 25),
    ('id0011', 5, 12070, 28),
    ('id0012', 6, 200800, 50),
    ('id0013', 5, 30070, 25),
    ('id0014', 5, 12070, 28),
    ('id0015', 6, 200800, 18),
    ('id0016', 5, 30070, 26),
    ('id0017', 5, 12070, 27),
    ('id0018', 6, 200800, 16),
    ('id0019', 6, 30070, 25),
    ('id0020', 5, 12070, 28);

critical

The schema and data for the user_stats table appear to be incorrect, likely due to a copy-paste error from alice_init.hql. The CREATE TABLE statement and INSERT values define columns (credit_rank, income, age) that do not match the expected schema for Bob's data (ID, order_amount, is_active), as used in other parts of the tutorial like arrow_flight_server.py. This inconsistency will cause the tutorial to fail.

CREATE TABLE user_stats (
    ID STRING,
    order_amount INT,
    is_active INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

INSERT INTO user_stats VALUES
    ('id0001', 5000, 1),
    ('id0002', 3000, 1),
    ('id0003', 8000, 0),
    ('id0005', 12000, 1),
    ('id0006', 1500, 1),
    ('id0007', 2500, 0),
    ('id0008', 9500, 1),
    ('id0009', 7000, 1),
    ('id0010', 500, 0),
    ('id0011', 3500, 1),
    ('id0012', 15000, 1),
    ('id0013', 2000, 0),
    ('id0014', 4500, 1),
    ('id0015', 6500, 1),
    ('id0016', 1000, 0),
    ('id0017', 8500, 1),
    ('id0018', 11000, 1),
    ('id0019', 3200, 1),
    ('id0020', 7500, 0);

Comment on lines +4 to +5
PROJECT_ROOT="/root/autodl-tmp/scql"
TUTORIAL_DIR="/root/autodl-tmp/scql/examples/scdb-tutorial"

critical

The script uses hardcoded absolute paths for PROJECT_ROOT and TUTORIAL_DIR, which makes it non-portable and will cause it to fail on any machine where the project is not located at /root/autodl-tmp/scql. These paths should be determined dynamically based on the script's own location.

Suggested change
PROJECT_ROOT="/root/autodl-tmp/scql"
TUTORIAL_DIR="/root/autodl-tmp/scql/examples/scdb-tutorial"
PROJECT_ROOT=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
TUTORIAL_DIR="$PROJECT_ROOT/examples/scdb-tutorial"

Comment on lines +5 to +6
PROJECT_ROOT="/root/autodl-tmp/scql"
TUTORIAL_DIR="/root/autodl-tmp/scql/examples/scdb-tutorial"

critical

The script uses hardcoded absolute paths for PROJECT_ROOT and TUTORIAL_DIR, which makes it non-portable and will cause it to fail on any machine where the project is not located at /root/autodl-tmp/scql. These paths should be determined dynamically based on the script's own location.

Suggested change
PROJECT_ROOT="/root/autodl-tmp/scql"
TUTORIAL_DIR="/root/autodl-tmp/scql/examples/scdb-tutorial"
PROJECT_ROOT=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
TUTORIAL_DIR="$PROJECT_ROOT/examples/scdb-tutorial"

Comment on lines +1 to +14
{
    "alice": {
        "UserName": "alice",
        "Password": "some_password"
    },
    "bob": {
        "UserName": "bob",
        "Password": "another_password"
    },
    "root": {
        "UserName": "root",
        "Password": "root"
    }
}

security-high

Hardcoding default credentials, even for examples, poses a security risk as this pattern might be copied into production environments. It's recommended to use placeholder values that are clearly not real passwords and instruct the user to replace them, or to read them from a configuration source that is not checked into version control.
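
For example, the users map could be assembled from environment variables at startup rather than checked in; a minimal sketch (the variable names are assumptions):

    # Build the users map from environment variables instead of committing
    # real passwords; ALICE_PASSWORD / BOB_PASSWORD are assumed names.
    import json
    import os

    def load_credentials() -> dict:
        return {
            "alice": {"UserName": "alice",
                      "Password": os.environ.get("ALICE_PASSWORD", "<set-me>")},
            "bob": {"UserName": "bob",
                    "Password": os.environ.get("BOB_PASSWORD", "<set-me>")},
        }

    if __name__ == "__main__":
        print(json.dumps(load_credentials(), indent=2))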

Comment on lines +7 to +10
CMD="mysql -u root"
if [ ! -z "$MYSQL_PASS" ]; then
CMD="mysql -u root -p$MYSQL_PASS"
fi

security-high

Passing the MySQL password directly on the command line via -p$MYSQL_PASS is insecure, as the password becomes visible in the system's process list (ps). A more secure alternative is to use the MYSQL_PWD environment variable, which the mysql client will automatically use without exposing it in the process arguments.

Suggested change
CMD="mysql -u root"
if [ ! -z "$MYSQL_PASS" ]; then
    CMD="mysql -u root -p$MYSQL_PASS"
fi
if [ ! -z "$MYSQL_PASS" ]; then
    export MYSQL_PWD="$MYSQL_PASS"
fi
CMD="mysql -u root"


# Start the SCDB Server (set the root password to "root")
echo "Starting SCDB Server (port 8080)..."
export SCQL_ROOT_PASSWORD="root"

security-high

Exporting a password to an environment variable (SCQL_ROOT_PASSWORD) can be a security risk, as other processes running under the same user account may be able to access it. For a tutorial, this might be a pragmatic choice, but it's a practice that should be avoided in production. Consider documenting this risk or using a more secure method for credential management.

import time

SCDB_URL = "http://localhost:8080"
ROOT_PASSWORD = "p6>14%h:u2&79k83" # taken from the server logs

security-high

The root password is hardcoded directly in the script. This is a significant security risk. It is best practice to load sensitive information like passwords from environment variables or a configuration file that is not committed to version control. You can use os.environ.get() for this purpose (remember to import os at the top of the file).

Suggested change
ROOT_PASSWORD = "p6>14%h:u2&79k83" # taken from the server logs
ROOT_PASSWORD = os.environ.get("SCQL_ROOT_PASSWORD", "p6>14%h:u2&79k83") # taken from the server logs

alice_conf = "engine/alice/conf/gflags.conf"
bob_conf = "engine/bob/conf/gflags.conf"
scdb_conf = "scdb/conf/config.yml"
scdb_host = "scdb/conf/config.yml"

medium

The variable scdb_host is assigned but never used. Unused code should be removed to improve code clarity and maintainability.

Comment on lines +23 to +32
def update_file(filepath, replacements):
    with open(filepath, 'r') as f:
        content = f.read()

    for old, new in replacements.items():
        content = content.replace(old, new)

    with open(filepath, 'w') as f:
        f.write(content)
    print(f"Updated {filepath}")

medium

The function update_file is defined but is never called. The script uses process_template instead. This unused function should be removed to keep the codebase clean.

{DataSourceKind::ARROWSQL, std::make_shared<ArrowSqlAdaptorFactory>()});
auto arrow_sql_adaptor_factory = std::make_shared<ArrowSqlAdaptorFactory>();
factory_maps_.insert({DataSourceKind::ARROWSQL, arrow_sql_adaptor_factory});
// Hive uses Arrow Flight SQL protocol for better performance and native columnar support
Collaborator

No change is needed here; it also runs if you set kind to ARROWSQL in the configuration file.
