
Conversation

@Shellmia0

feat: adding hive support

@github-actions


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a pull request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Shellmia0, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the system's data source capabilities by adding native support for Apache Hive. The integration is designed for high performance, utilizing the Arrow Flight SQL protocol for data exchange. To facilitate development, testing, and user understanding, a self-contained tutorial environment has been provided, which includes a simulated Hive backend and all necessary scripts to demonstrate federated queries with Hive.
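
To make the data path concrete, here is a minimal sketch of a client querying an Arrow Flight SQL endpoint such as the tutorial's simulated Hive backend, using the ADBC Flight SQL driver. The URI, port, and table name are illustrative assumptions, not values taken from this PR:

    # Minimal sketch: query an Arrow Flight SQL endpoint from Python.
    # The endpoint URI and table name are assumptions for illustration.
    import adbc_driver_flightsql.dbapi as flight_sql

    with flight_sql.connect("grpc://localhost:12929") as conn:
        cur = conn.cursor()
        # The server executes whatever SQL it receives over Flight SQL.
        cur.execute("SELECT * FROM user_stats LIMIT 5")
        for row in cur.fetchall():
            print(row)
        cur.close()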

Highlights

  • Hive Data Source Integration: Introduced a new HIVE data source kind, enabling the system to connect to Hive databases.
  • Arrow Flight SQL Protocol: Integrated Hive support using the Arrow Flight SQL protocol, leveraging the existing ArrowSqlAdaptorFactory for efficient data transfer and native columnar support.
  • Hive SQL Dialect: Implemented a HiveDialect for SQL parsing and rewriting, including specific function name mappings (e.g., ifnull to nvl) and type conversions (e.g., VARCHAR to STRING, LONG to BIGINT) to ensure compatibility with Hive's SQL syntax; a rough sketch of these mappings follows this list.
  • Comprehensive Tutorial and Testing Environment: Added a complete tutorial and testing infrastructure for Hive, featuring a Python-based Arrow Flight SQL server to simulate Hive backends, dedicated configuration files, and example SQL/Python scripts for setup, federated queries, and service management.
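
The dialect rewrites mentioned above are, at their core, lookup-table substitutions. A rough illustrative sketch in Python (the names below are assumptions for illustration, not the actual HiveDialect implementation):

    # Illustrative only: the kind of mapping a Hive dialect applies.
    FUNCTION_MAP = {"ifnull": "nvl"}
    TYPE_MAP = {"VARCHAR": "STRING", "LONG": "BIGINT"}

    def rewrite_function(name: str) -> str:
        """Map a generic SQL function name to its Hive equivalent."""
        return FUNCTION_MAP.get(name.lower(), name)

    def rewrite_type(type_name: str) -> str:
        """Map a generic SQL column type to its Hive equivalent."""
        return TYPE_MAP.get(type_name.upper(), type_name)

    assert rewrite_function("ifnull") == "nvl"
    assert rewrite_type("varchar") == "STRING"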


@Shellmia0
Author

I have read the CLA Document and I hereby sign the CLA

@gemini-code-assist
Contributor

Code Review

This pull request introduces support for Hive as a data source, primarily by reusing the Arrow Flight SQL adapter. The changes include backend modifications to recognize the Hive dialect and a comprehensive set of new files for an example tutorial, complete with configuration, setup scripts, and a mock server. While the core backend changes appear solid, my review has identified several critical and high-severity issues within the new example scripts. These include significant security risks from hardcoded credentials and disabled TLS verification, a critical correctness bug in a database initialization script, and maintainability problems stemming from hardcoded paths and brittle manual protobuf parsing in the mock server. Addressing these issues is crucial to ensure the example is secure, functional, and promotes best practices.

Comment on lines +24 to +90
def parse_flight_sql_command(data: bytes) -> str:
    """
    Parse an Arrow Flight SQL CommandStatementQuery protobuf message.
    The protobuf definition of CommandStatementQuery is roughly:
        message CommandStatementQuery {
            string query = 1;
            string transaction_id = 2;
        }
    In the wire format:
    - Field 1 (query): tag = 0x0a (field 1, wire type 2 = length-delimited)
    - followed by a varint length
    - followed by the UTF-8 encoded string
    """
    if not data:
        return ""

    try:
        # Check whether the message is wrapped in a google.protobuf.Any.
        # The Any layout is: field 1 = type_url, field 2 = value.
        # The type_url usually starts with "type.googleapis.com/".
        if b"type.googleapis.com" in data:
            # Skip the Any wrapper and look for the inner CommandStatementQuery.
            # Find the start of field 2 (value).
            idx = 0
            while idx < len(data):
                if data[idx] == 0x12:  # field 2, wire type 2
                    idx += 1
                    # Read the varint length
                    length, varint_size = _read_varint(data, idx)
                    idx += varint_size
                    # Extract the inner message
                    inner_data = data[idx:idx+length]
                    # Recursively parse the inner message
                    return parse_flight_sql_command(inner_data)
                idx += 1

        # Try to parse the bytes directly as a CommandStatementQuery
        idx = 0
        while idx < len(data):
            tag = data[idx]
            idx += 1

            if tag == 0x0a:  # field 1 (query), wire type 2 (length-delimited)
                length, varint_size = _read_varint(data, idx)
                idx += varint_size
                query_bytes = data[idx:idx+length]
                return query_bytes.decode("utf-8")
            elif (tag & 0x07) == 2:  # other length-delimited field, skip it
                length, varint_size = _read_varint(data, idx)
                idx += varint_size + length
            elif (tag & 0x07) == 0:  # varint field, skip it
                _, varint_size = _read_varint(data, idx)
                idx += varint_size
            else:
                # Unknown wire type, give up
                break

        # If parsing fails, fall back to decoding the bytes as a string
        return data.decode("utf-8", errors="replace")

    except Exception as e:
        print(f"[warning] failed to parse protobuf: {e}")
        # Fall back to direct decoding
        return data.decode("utf-8", errors="replace")


critical

The function parse_flight_sql_command and its helper _read_varint manually parse raw protobuf bytes. This approach is extremely brittle, difficult to maintain, and prone to errors. It can easily break if the underlying protobuf definitions change. It is strongly recommended to use a proper protobuf library (e.g., the protobuf package for Python) to handle serialization and deserialization safely and reliably. You can generate Python code from the .proto files for this purpose.
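
A minimal sketch of that suggestion, assuming a FlightSql_pb2 module has been generated from Arrow's FlightSql.proto with protoc (the generated module name is an assumption):

    # Unpack CommandStatementQuery with generated protobuf classes instead
    # of hand-parsing wire bytes. FlightSql_pb2 is assumed to be generated
    # from Arrow's FlightSql.proto (e.g. via `protoc --python_out=.`).
    from google.protobuf import any_pb2

    import FlightSql_pb2  # assumed generated module

    def parse_flight_sql_command(data: bytes) -> str:
        """Extract the SQL text from an Any-wrapped CommandStatementQuery."""
        any_msg = any_pb2.Any()
        any_msg.ParseFromString(data)
        cmd = FlightSql_pb2.CommandStatementQuery()
        if not any_msg.Unpack(cmd):
            raise ValueError(f"unexpected command type: {any_msg.type_url}")
        return cmd.query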

Comment on lines +8 to +34
CREATE TABLE user_stats (
    ID STRING,
    credit_rank INT,
    income INT,
    age INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

INSERT INTO user_stats VALUES
    ('id0001', 6, 100000, 20),
    ('id0002', 5, 90000, 19),
    ('id0003', 6, 89700, 32),
    ('id0005', 6, 607000, 30),
    ('id0006', 5, 30070, 25),
    ('id0007', 6, 12070, 28),
    ('id0008', 6, 200800, 50),
    ('id0009', 6, 607000, 30),
    ('id0010', 5, 30070, 25),
    ('id0011', 5, 12070, 28),
    ('id0012', 6, 200800, 50),
    ('id0013', 5, 30070, 25),
    ('id0014', 5, 12070, 28),
    ('id0015', 6, 200800, 18),
    ('id0016', 5, 30070, 26),
    ('id0017', 5, 12070, 27),
    ('id0018', 6, 200800, 16),
    ('id0019', 6, 30070, 25),
    ('id0020', 5, 12070, 28);

critical

The schema and data for the user_stats table appear to be incorrect, likely due to a copy-paste error from alice_init.hql. The CREATE TABLE statement and INSERT values define columns (credit_rank, income, age) that do not match the expected schema for Bob's data (ID, order_amount, is_active), as used in other parts of the tutorial like arrow_flight_server.py. This inconsistency will cause the tutorial to fail.

CREATE TABLE user_stats (
    ID STRING,
    order_amount INT,
    is_active INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

INSERT INTO user_stats VALUES
    ('id0001', 5000, 1),
    ('id0002', 3000, 1),
    ('id0003', 8000, 0),
    ('id0005', 12000, 1),
    ('id0006', 1500, 1),
    ('id0007', 2500, 0),
    ('id0008', 9500, 1),
    ('id0009', 7000, 1),
    ('id0010', 500, 0),
    ('id0011', 3500, 1),
    ('id0012', 15000, 1),
    ('id0013', 2000, 0),
    ('id0014', 4500, 1),
    ('id0015', 6500, 1),
    ('id0016', 1000, 0),
    ('id0017', 8500, 1),
    ('id0018', 11000, 1),
    ('id0019', 3200, 1),
    ('id0020', 7500, 0);

Comment on lines +4 to +5
PROJECT_ROOT="/root/autodl-tmp/scql"
TUTORIAL_DIR="/root/autodl-tmp/scql/examples/scdb-tutorial"

critical

The script uses hardcoded absolute paths for PROJECT_ROOT and TUTORIAL_DIR, which makes it non-portable and will cause it to fail on any machine where the project is not located at /root/autodl-tmp/scql. These paths should be determined dynamically based on the script's own location.

Suggested change
PROJECT_ROOT="/root/autodl-tmp/scql"
TUTORIAL_DIR="/root/autodl-tmp/scql/examples/scdb-tutorial"
PROJECT_ROOT=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
TUTORIAL_DIR="$PROJECT_ROOT/examples/scdb-tutorial"

Comment on lines +5 to +6
PROJECT_ROOT="/root/autodl-tmp/scql"
TUTORIAL_DIR="/root/autodl-tmp/scql/examples/scdb-tutorial"

critical

The script uses hardcoded absolute paths for PROJECT_ROOT and TUTORIAL_DIR, which makes it non-portable and will cause it to fail on any machine where the project is not located at /root/autodl-tmp/scql. These paths should be determined dynamically based on the script's own location.

Suggested change
PROJECT_ROOT="/root/autodl-tmp/scql"
TUTORIAL_DIR="/root/autodl-tmp/scql/examples/scdb-tutorial"
PROJECT_ROOT=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
TUTORIAL_DIR="$PROJECT_ROOT/examples/scdb-tutorial"

Comment on lines +1 to +14
{
    "alice": {
        "UserName": "alice",
        "Password": "some_password"
    },
    "bob": {
        "UserName": "bob",
        "Password": "another_password"
    },
    "root": {
        "UserName": "root",
        "Password": "root"
    }
}

security-high

Hardcoding default credentials, even for examples, poses a security risk as this pattern might be copied into production environments. It's recommended to use placeholder values that are clearly not real passwords and instruct the user to replace them, or to read them from a configuration source that is not checked into version control.
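
For example, the users map could be assembled from environment variables at startup rather than checked in; a minimal sketch (the variable names are assumptions):

    # Build the users map from environment variables instead of committing
    # real passwords; ALICE_PASSWORD / BOB_PASSWORD are assumed names.
    import json
    import os

    def load_credentials() -> dict:
        return {
            "alice": {"UserName": "alice",
                      "Password": os.environ.get("ALICE_PASSWORD", "<set-me>")},
            "bob": {"UserName": "bob",
                    "Password": os.environ.get("BOB_PASSWORD", "<set-me>")},
        }

    if __name__ == "__main__":
        print(json.dumps(load_credentials(), indent=2))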

Comment on lines +7 to +10
CMD="mysql -u root"
if [ ! -z "$MYSQL_PASS" ]; then
CMD="mysql -u root -p$MYSQL_PASS"
fi

security-high

Passing the MySQL password directly on the command line via -p$MYSQL_PASS is insecure, as the password becomes visible in the system's process list (ps). A more secure alternative is to use the MYSQL_PWD environment variable, which the mysql client will automatically use without exposing it in the process arguments.

Suggested change
CMD="mysql -u root"
if [ ! -z "$MYSQL_PASS" ]; then
    CMD="mysql -u root -p$MYSQL_PASS"
fi
if [ ! -z "$MYSQL_PASS" ]; then
    export MYSQL_PWD="$MYSQL_PASS"
fi
CMD="mysql -u root"


# Start the SCDB Server (set the root password to "root")
echo "Starting SCDB Server (port 8080)..."
export SCQL_ROOT_PASSWORD="root"

security-high

Exporting a password to an environment variable (SCQL_ROOT_PASSWORD) can be a security risk, as other processes running under the same user account may be able to access it. For a tutorial, this might be a pragmatic choice, but it's a practice that should be avoided in production. Consider documenting this risk or using a more secure method for credential management.

import time

SCDB_URL = "http://localhost:8080"
ROOT_PASSWORD = "p6>14%h:u2&79k83" # taken from the server logs

security-high

The root password is hardcoded directly in the script. This is a significant security risk. It is best practice to load sensitive information like passwords from environment variables or a configuration file that is not committed to version control. You can use os.environ.get() for this purpose (remember to import os at the top of the file).

Suggested change
ROOT_PASSWORD = "p6>14%h:u2&79k83" # taken from the server logs
ROOT_PASSWORD = os.environ.get("SCQL_ROOT_PASSWORD", "p6>14%h:u2&79k83") # taken from the server logs

alice_conf = "engine/alice/conf/gflags.conf"
bob_conf = "engine/bob/conf/gflags.conf"
scdb_conf = "scdb/conf/config.yml"
scdb_host = "scdb/conf/config.yml"

medium

The variable scdb_host is assigned but never used. Unused code should be removed to improve code clarity and maintainability.

Comment on lines +23 to +32
def update_file(filepath, replacements):
    with open(filepath, 'r') as f:
        content = f.read()

    for old, new in replacements.items():
        content = content.replace(old, new)

    with open(filepath, 'w') as f:
        f.write(content)
    print(f"Updated {filepath}")

medium

The function update_file is defined but is never called. The script uses process_template instead. This unused function should be removed to keep the codebase clean.

{DataSourceKind::ARROWSQL, std::make_shared<ArrowSqlAdaptorFactory>()});
auto arrow_sql_adaptor_factory = std::make_shared<ArrowSqlAdaptorFactory>();
factory_maps_.insert({DataSourceKind::ARROWSQL, arrow_sql_adaptor_factory});
// Hive uses Arrow Flight SQL protocol for better performance and native columnar support
Collaborator

No change is needed here; it also runs if you set kind to ARROWSQL in the configuration file.
