feat: adding hive support #776
Conversation
I have read the CLA Document and I hereby sign the CLA. You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.
Summary of Changes

Hello @Shellmia0, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the system's data source capabilities by adding native support for Apache Hive. The integration is designed for high performance, using the Arrow Flight SQL protocol for data exchange. To facilitate development, testing, and user understanding, a self-contained tutorial environment is provided, which includes a simulated Hive backend and all the scripts needed to demonstrate federated queries with Hive.
Code Review
This pull request introduces support for Hive as a data source, primarily by reusing the Arrow Flight SQL adapter. The changes include backend modifications to recognize the Hive dialect and a comprehensive set of new files for an example tutorial, complete with configuration, setup scripts, and a mock server. While the core backend changes appear solid, my review has identified several critical and high-severity issues within the new example scripts. These include significant security risks from hardcoded credentials and disabled TLS verification, a critical correctness bug in a database initialization script, and maintainability problems stemming from hardcoded paths and brittle manual protobuf parsing in the mock server. Addressing these issues is crucial to ensure the example is secure, functional, and promotes best practices.
def parse_flight_sql_command(data: bytes) -> str:
    """
    Parse an Arrow Flight SQL CommandStatementQuery protobuf message.

    The protobuf definition of CommandStatementQuery is roughly:
        message CommandStatementQuery {
            string query = 1;
            string transaction_id = 2;
        }

    In the wire format:
    - Field 1 (query): tag = 0x0a (field 1, wire type 2 = length-delimited)
    - followed by a varint length
    - followed by the UTF-8 encoded string
    """
    if not data:
        return ""

    try:
        # Check whether the payload is wrapped in a google.protobuf.Any.
        # An Any message is: field 1 = type_url, field 2 = value,
        # where type_url usually starts with "type.googleapis.com/".
        if b"type.googleapis.com" in data:
            # Skip the Any wrapper and look for the inner CommandStatementQuery:
            # scan for the start of field 2 (value).
            idx = 0
            while idx < len(data):
                if data[idx] == 0x12:  # field 2, wire type 2
                    idx += 1
                    # Read the varint length
                    length, varint_size = _read_varint(data, idx)
                    idx += varint_size
                    # Extract the inner message
                    inner_data = data[idx:idx+length]
                    # Recursively parse the inner message
                    return parse_flight_sql_command(inner_data)
                idx += 1

        # Try to parse CommandStatementQuery directly
        idx = 0
        while idx < len(data):
            tag = data[idx]
            idx += 1

            if tag == 0x0a:  # field 1 (query), wire type 2 (length-delimited)
                length, varint_size = _read_varint(data, idx)
                idx += varint_size
                query_bytes = data[idx:idx+length]
                return query_bytes.decode("utf-8")
            elif (tag & 0x07) == 2:  # other length-delimited field: skip it
                length, varint_size = _read_varint(data, idx)
                idx += varint_size + length
            elif (tag & 0x07) == 0:  # varint field: skip it
                _, varint_size = _read_varint(data, idx)
                idx += varint_size
            else:
                # Unknown wire type: give up
                break

        # If parsing failed, fall back to decoding the bytes as a string
        return data.decode("utf-8", errors="replace")

    except Exception as e:
        print(f"[warning] failed to parse protobuf: {e}")
        # Fall back to direct decoding
        return data.decode("utf-8", errors="replace")
The function parse_flight_sql_command and its helper _read_varint manually parse raw protobuf bytes. This approach is extremely brittle, difficult to maintain, and prone to errors: it can easily break if the underlying protobuf definitions change. It is strongly recommended to use a proper protobuf library (e.g., the protobuf package for Python) to handle serialization and deserialization safely and reliably. You can generate Python code from the .proto files for this purpose.
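To make the suggestion concrete, here is a minimal sketch of library-based parsing. The generated CommandStatementQuery class would normally come from compiling the Flight SQL .proto files with protoc (not shown in this PR), so google.protobuf.StringValue stands in for it here: it has the same wire shape for field 1 (a length-delimited string), and the parsing pattern carries over unchanged.

```python
# Sketch only: StringValue stands in for the generated CommandStatementQuery
# class, which would normally be produced by protoc from the Flight SQL protos.
from google.protobuf import any_pb2
from google.protobuf.wrappers_pb2 import StringValue


def parse_query(data: bytes) -> str:
    """Parse a (possibly Any-wrapped) message and return its string field 1."""
    # First try to interpret the payload as a google.protobuf.Any wrapper.
    any_msg = any_pb2.Any()
    try:
        any_msg.ParseFromString(data)
        inner = StringValue()
        # Unpack() verifies the type_url and returns False on a mismatch.
        if any_msg.type_url and any_msg.Unpack(inner):
            return inner.value
    except Exception:
        pass
    # Otherwise parse the bytes as the bare message.
    msg = StringValue()
    msg.ParseFromString(data)
    return msg.value
```

With generated code, the byte-scanning loops and the `_read_varint` helper disappear entirely; the library handles varints, unknown fields, and nested messages.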
CREATE TABLE user_stats (
    ID STRING,
    credit_rank INT,
    income INT,
    age INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

INSERT INTO user_stats VALUES
    ('id0001', 6, 100000, 20),
    ('id0002', 5, 90000, 19),
    ('id0003', 6, 89700, 32),
    ('id0005', 6, 607000, 30),
    ('id0006', 5, 30070, 25),
    ('id0007', 6, 12070, 28),
    ('id0008', 6, 200800, 50),
    ('id0009', 6, 607000, 30),
    ('id0010', 5, 30070, 25),
    ('id0011', 5, 12070, 28),
    ('id0012', 6, 200800, 50),
    ('id0013', 5, 30070, 25),
    ('id0014', 5, 12070, 28),
    ('id0015', 6, 200800, 18),
    ('id0016', 5, 30070, 26),
    ('id0017', 5, 12070, 27),
    ('id0018', 6, 200800, 16),
    ('id0019', 6, 30070, 25),
    ('id0020', 5, 12070, 28);
The schema and data for the user_stats table appear to be incorrect, likely due to a copy-paste error from alice_init.hql. The CREATE TABLE statement and INSERT values define columns (credit_rank, income, age) that do not match the expected schema for Bob's data (ID, order_amount, is_active), as used in other parts of the tutorial like arrow_flight_server.py. This inconsistency will cause the tutorial to fail.
CREATE TABLE user_stats (
ID STRING,
order_amount INT,
is_active INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
INSERT INTO user_stats VALUES
('id0001', 5000, 1),
('id0002', 3000, 1),
('id0003', 8000, 0),
('id0005', 12000, 1),
('id0006', 1500, 1),
('id0007', 2500, 0),
('id0008', 9500, 1),
('id0009', 7000, 1),
('id0010', 500, 0),
('id0011', 3500, 1),
('id0012', 15000, 1),
('id0013', 2000, 0),
('id0014', 4500, 1),
('id0015', 6500, 1),
('id0016', 1000, 0),
('id0017', 8500, 1),
('id0018', 11000, 1),
('id0019', 3200, 1),
('id0020', 7500, 0);
PROJECT_ROOT="/root/autodl-tmp/scql"
TUTORIAL_DIR="/root/autodl-tmp/scql/examples/scdb-tutorial"
The script uses hardcoded absolute paths for PROJECT_ROOT and TUTORIAL_DIR, which makes it non-portable and will cause it to fail on any machine where the project is not located at /root/autodl-tmp/scql. These paths should be determined dynamically based on the script's own location.
Suggested change:
-PROJECT_ROOT="/root/autodl-tmp/scql"
-TUTORIAL_DIR="/root/autodl-tmp/scql/examples/scdb-tutorial"
+PROJECT_ROOT=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+TUTORIAL_DIR="$PROJECT_ROOT/examples/scdb-tutorial"
PROJECT_ROOT="/root/autodl-tmp/scql"
TUTORIAL_DIR="/root/autodl-tmp/scql/examples/scdb-tutorial"
The script uses hardcoded absolute paths for PROJECT_ROOT and TUTORIAL_DIR, which makes it non-portable and will cause it to fail on any machine where the project is not located at /root/autodl-tmp/scql. These paths should be determined dynamically based on the script's own location.
Suggested change:
-PROJECT_ROOT="/root/autodl-tmp/scql"
-TUTORIAL_DIR="/root/autodl-tmp/scql/examples/scdb-tutorial"
+PROJECT_ROOT=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+TUTORIAL_DIR="$PROJECT_ROOT/examples/scdb-tutorial"
{
    "alice": {
        "UserName": "alice",
        "Password": "some_password"
    },
    "bob": {
        "UserName": "bob",
        "Password": "another_password"
    },
    "root": {
        "UserName": "root",
        "Password": "root"
    }
}
Hardcoding default credentials, even for examples, poses a security risk as this pattern might be copied into production environments. It's recommended to use placeholder values that are clearly not real passwords and instruct the user to replace them, or to read them from a configuration source that is not checked into version control.
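One way to follow that recommendation, sketched below under the assumption that the tutorial keeps a users.json with obvious placeholder passwords: load the file, then let per-party environment variables (hypothetical names like SCQL_PASS_ALICE) override the placeholders at runtime, so real secrets never land in version control.

```python
import json
import os


def load_party_credentials(path: str, env_prefix: str = "SCQL_PASS_") -> dict:
    """Load party credentials from a JSON file, overriding placeholder
    passwords with environment variables (e.g. SCQL_PASS_ALICE for "alice")."""
    with open(path) as f:
        users = json.load(f)
    for party, creds in users.items():
        env_key = env_prefix + party.upper()
        if env_key in os.environ:
            creds["Password"] = os.environ[env_key]
    return users
```

The checked-in file then only ever contains values like "<REPLACE_ME>", and the README can instruct users to export the real passwords before running the tutorial.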
CMD="mysql -u root"
if [ ! -z "$MYSQL_PASS" ]; then
    CMD="mysql -u root -p$MYSQL_PASS"
fi
Passing the MySQL password directly on the command line via -p$MYSQL_PASS is insecure, as the password becomes visible in the system's process list (ps). A more secure alternative is to use the MYSQL_PWD environment variable, which the mysql client will automatically use without exposing it in the process arguments.
Suggested change:
-CMD="mysql -u root"
-if [ ! -z "$MYSQL_PASS" ]; then
-    CMD="mysql -u root -p$MYSQL_PASS"
-fi
+if [ ! -z "$MYSQL_PASS" ]; then
+    export MYSQL_PWD="$MYSQL_PASS"
+fi
+CMD="mysql -u root"
# Start the SCDB Server (root password set to "root")
echo "Starting SCDB Server (port 8080)..."
export SCQL_ROOT_PASSWORD="root"
Exporting a password to an environment variable (SCQL_ROOT_PASSWORD) can be a security risk, as other processes running under the same user account may be able to access it. For a tutorial, this might be a pragmatic choice, but it's a practice that should be avoided in production. Consider documenting this risk or using a more secure method for credential management.
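As an alternative to exporting the password, the tutorial scripts could read it from a file with owner-only permissions, which keeps the secret out of both the environment and the process list. A minimal sketch (the helper name and permission policy are illustrative, not part of this PR):

```python
import os
import stat


def read_secret_file(path: str) -> str:
    """Read a secret from a file, refusing files readable by group or world."""
    mode = os.stat(path).st_mode
    if mode & (stat.S_IRGRP | stat.S_IROTH):
        raise PermissionError(f"{path} must not be group/world readable")
    with open(path) as f:
        return f.read().strip()
```

A setup script would create the file with chmod 600, and every consumer fails fast if the permissions are ever loosened.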
import time

SCDB_URL = "http://localhost:8080"
ROOT_PASSWORD = "p6>14%h:u2&79k83"  # taken from the server logs
The root password is hardcoded directly in the script. This is a significant security risk. It is best practice to load sensitive information like passwords from environment variables or a configuration file that is not committed to version control. You can use os.environ.get() for this purpose (remember to import os at the top of the file).
Suggested change:
-ROOT_PASSWORD = "p6>14%h:u2&79k83"  # taken from the server logs
+ROOT_PASSWORD = os.environ.get("SCQL_ROOT_PASSWORD", "p6>14%h:u2&79k83")  # taken from the server logs
alice_conf = "engine/alice/conf/gflags.conf"
bob_conf = "engine/bob/conf/gflags.conf"
scdb_conf = "scdb/conf/config.yml"
scdb_host = "scdb/conf/config.yml"
def update_file(filepath, replacements):
    with open(filepath, 'r') as f:
        content = f.read()

    for old, new in replacements.items():
        content = content.replace(old, new)

    with open(filepath, 'w') as f:
        f.write(content)
    print(f"Updated {filepath}")
{DataSourceKind::ARROWSQL, std::make_shared<ArrowSqlAdaptorFactory>()});
auto arrow_sql_adaptor_factory = std::make_shared<ArrowSqlAdaptorFactory>();
factory_maps_.insert({DataSourceKind::ARROWSQL, arrow_sql_adaptor_factory});
// Hive uses Arrow Flight SQL protocol for better performance and native columnar support
No change needed here; it also runs if you set kind to ARROWSQL in the configuration file.