Skip to content

AnkitKumarSingh11/hadoop-scripts

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HBase Table Size Reporter

This Python script reports the HDFS disk usage of HBase tables based on a given table name prefix. It calculates the table size with and without the replication factor and generates a human-readable report in both text and CSV formats.


🔧 Features

  • Accepts one or more table name prefixes as input.
  • Lists all HBase tables in HDFS that start with each prefix.
  • Calculates:
    • Raw HDFS size
    • Size with replication factor (default is 3)
  • Outputs:
    • Formatted console output
    • Consolidated report.txt
    • Consolidated report.csv
  • Sizes shown with up to 4 decimal places

🧩 Prerequisites

  • Python 3.x

  • Hadoop command line tools configured (hadoop fs should work)

  • humanize Python module:

    Install it via pip:

        pip install humanize

🚀 Usage bash

python3 script.py <table_prefix1> <table_prefix2> ...

Example:

python3 script.py meta person

📂 Output report.txt: Pretty-formatted human-readable report for all table prefixes.

report.csv: CSV file suitable for spreadsheets or further analysis.

Both files include:

Table name

Size without replication

Size with replication

Per-prefix totals

Final total at the bottom

🛠️ Configuration You can change the HDFS base directory or replication factor by modifying these lines in the script:

HDFS_BASE_DIR = "/hbase/data/hbase"
REPLICATION_FACTOR = 3

📌 Notes Tables with 0 bytes (likely empty or non-existent) are still shown in the output for completeness.

Final totals are computed separately for each prefix and also overall at the end.

About

HDFS disk usage of HBase tables based on a given table name prefix

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages