TPC-H

Create Your Own Oracle TPC-H Playground on ARM64 Linux


The TPC (Transaction Processing Performance Council), the body that maintains the TPC-H benchmark, provides a standardized data generation tool for each of its benchmarks.
The TPC-H bundle contains a set of C source files that are compiled into an executable called dbgen. Copy the zip file into a working directory and make sure your Linux environment has the toolchain needed to compile C code (gcc, make, etc.)

Download TPC-H Data Generator (dbgen)

Register and download the TPC-H suite zip file from [[http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp|TPC org]]

Build DBGEN

Next, copy the makefile.suite file in the dbgen directory to Makefile and change the following parameters in the copy:

CC
DATABASE
MACHINE
WORKLOAD
hduser@tpch_2_17_0/dbgen$ cp makefile.suite Makefile
hduser@tpch_2_17_0/dbgen$ vim Makefile
...
################
## CHANGE NAME OF ANSI COMPILER HERE
################
CC       = gcc
# Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata)
#                                  SQLSERVER, SYBASE, ORACLE
# Current values for MACHINE are:  ATT, DOS, HP, IBM, ICL, MVS,
#                                  SGI, SUN, U2200, VMS, LINUX, WIN32
# Current values for WORKLOAD are: TPCH
DATABASE = ORACLE
MACHINE  = LINUX
WORKLOAD = TPCH
...


Now run make command:

hduser@tpch_2_17_0/dbgen$ make


After a successful compilation you should have the dbgen executable in your current working directory:

hduser@tpch_2_17_0/dbgen$ ./dbgen -h

TPC-H Population Generator (Version 2.17.0 build 0)
Copyright Transaction Processing Performance Council 1994 - 2010
USAGE:
dbgen [-{vf}][-T {pcsoPSOL}]
[-s <scale>][-C <procs>][-S <step>]
dbgen [-v] [-O m] [-s <scale>] [-U <updates>]
Basic Options
===========================
-C <n> -- separate data set into <n> chunks (requires -S, default: 1)
-f -- force. Overwrite existing files
-h -- display this message
-q -- enable QUIET mode
-s <n> -- set Scale Factor (SF) to <n> (default: 1)
-S <n> -- build the <n>th step of the data/update set (used with -C or -U)
-U <n> -- generate <n> update sets
-v -- enable VERBOSE mode
Advanced Options
===========================
-b <s> -- load distributions for <s> (default: dists.dss)
-d <n> -- split deletes between <n> files (requires -U)
-i <n> -- split inserts between <n> files (requires -U)
-T c -- generate customers ONLY
-T l -- generate nation/region ONLY
-T L -- generate lineitem ONLY
-T n -- generate nation ONLY
-T o -- generate orders/lineitem ONLY
-T O -- generate orders ONLY
-T p -- generate parts/partsupp ONLY
-T P -- generate parts ONLY
-T r -- generate region ONLY
-T s -- generate suppliers ONLY
-T S -- generate partsupp ONLY
To generate the SF=1 (1GB), validation database population, use:
dbgen -vf -s 1
To generate updates for a SF=1 (1GB), use:
dbgen -v -U 1 -s 1

hduser@tpch_2_17_0/dbgen$ 


Use dbgen to generate roughly 4 GB of TPC-H benchmark data. To be able to load the generated files in parallel via Oracle external tables, we will use dbgen's file-split feature. (Note that this step can take some time; the individual steps are independent, so they can be run in parallel depending on your CPU and I/O capacity.)

hduser@tpch_2_17_0/dbgen$ ./dbgen -s 4 -S 1 -C 8 -v

TPC-H Population Generator (Version 2.17.0)
Copyright Transaction Processing Performance Council 1994 - 2010
Starting to load stage 1 of 8 for suppliers table.../
Preloading text ... 100%
done.
Starting to load stage 1 of 8 for customers table...done.
Starting to load stage 1 of 8 for orders/lineitem tables...done.
Starting to load stage 1 of 8 for part/partsupplier tables...done.
Generating data for nation tabledone.
Generating data for region tabledone.
hduser@tpch_2_17_0/dbgen$ ./dbgen -s 4 -S 2 -C 8 -v
TPC-H Population Generator (Version 2.17.0)
Copyright Transaction Processing Performance Council 1994 - 2010
Starting to load stage 2 of 8 for suppliers table.../
Preloading text ... 100%
done.
Starting to load stage 2 of 8 for customers table...done.
Starting to load stage 2 of 8 for orders/lineitem tables...done.
Starting to load stage 2 of 8 for part/partsupplier tables...done.
Generating data for nation tableDo you want to overwrite ./nation.tbl ? [Y/N]: Y
done.
Generating data for region tableDo you want to overwrite ./region.tbl ? [Y/N]: Y
done.

hduser@tpch_2_17_0/dbgen$ ./dbgen -s 4 -S 8 -C 8 -v

TPC-H Population Generator (Version 2.17.0)
Copyright Transaction Processing Performance Council 1994 - 2010
Starting to load stage 8 of 8 for suppliers table.../
Preloading text ... 100%
done.
Starting to load stage 8 of 8 for customers table...done.
Starting to load stage 8 of 8 for orders/lineitem tables...done.
Starting to load stage 8 of 8 for part/partsupplier tables...done.
Generating data for nation tableDo you want to overwrite ./nation.tbl ? [Y/N]: Y
done.
Generating data for region tableDo you want to overwrite ./region.tbl ? [Y/N]: Y
done.


When you are done with all 8 executions you will have *.tbl* files in your current working directory. These are pipe-separated (|) files that you will load into your database.

DBGEN Options

-s 4 sets the scale factor to 4, meaning we generate approximately 4 GB of benchmark data. -C 8 sets the total number of chunk files for each large table (nation and region are excluded from the split). -S 1 instructs dbgen to generate the first of those 8 chunks. -v enables verbose output.

DBGEN Output

In total, the *.tbl* files will come to approximately 4 GB:

hduser@tpch_2_17_0/dbgen$ du -ch *.tbl* | tail -1
1.6G total


It is a good idea to compress all of these files with gzip: they will consume far less disk space, and if CPU power is plentiful the reduced read I/O can even speed up loading.

hduser@tpch_2_17_0/dbgen$ gzip -4 -v *.tbl*

customer.tbl.1: 61.4% -- replaced with customer.tbl.1.gz
customer.tbl.2: 61.6% -- replaced with customer.tbl.2.gz
customer.tbl.8: 61.6% -- replaced with customer.tbl.8.gz
lineitem.tbl.1: 68.1% -- replaced with lineitem.tbl.1.gz
lineitem.tbl.2: 68.1% -- replaced with lineitem.tbl.2.gz
lineitem.tbl.8: 68.3% -- replaced with lineitem.tbl.8.gz
nation.tbl: 56.8% -- replaced with nation.tbl.gz
orders.tbl.1: 69.6% -- replaced with orders.tbl.1.gz
orders.tbl.2: 69.7% -- replaced with orders.tbl.2.gz
orders.tbl.8: 69.9% -- replaced with orders.tbl.8.gz
partsupp.tbl.1: 74.0% -- replaced with partsupp.tbl.1.gz
partsupp.tbl.2: 74.1% -- replaced with partsupp.tbl.2.gz
partsupp.tbl.8: 74.0% -- replaced with partsupp.tbl.8.gz
part.tbl.1: 73.8% -- replaced with part.tbl.1.gz
part.tbl.2: 74.0% -- replaced with part.tbl.2.gz
part.tbl.8: 74.0% -- replaced with part.tbl.8.gz
region.tbl: 38.8% -- replaced with region.tbl.gz
supplier.tbl.1: 58.4% -- replaced with supplier.tbl.1.gz
supplier.tbl.2: 58.5% -- replaced with supplier.tbl.2.gz
supplier.tbl.8: 58.7% -- replaced with supplier.tbl.8.gz

hduser@tpch_2_17_0/dbgen$ du -hc *.tbl.* | tail -1
494M total


As you can see above, we achieved a compression ratio of better than 3:1 on average across the external files (1.6 GB down to 494 MB).
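The ratio is easy to verify from the two du totals above:

```shell
# Raw size vs. compressed size from the du outputs:
# 1.6 GiB ≈ 1638 MiB raw, 494 MiB compressed.
awk 'BEGIN { printf "%.1f:1\n", 1.6 * 1024 / 494 }'   # prints 3.3:1
```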