Quick Start Guide
This guide gives a quick look at Sharp ETL's capabilities. For a video walkthrough, please check:
Setup
Sharp ETL works with Spark 2.3+ and Spark 3.2+. You can follow the instructions here for your supported Spark version.
- bash
Build from source for your spark version:
# for spark 3.1 with scala 2.12
./gradlew buildJars -PscalaVersion=2.12 -PsparkVersion=3.1.2 -PscalaCompt=2.12.15
# for spark 2.4 with scala 2.12
./gradlew buildJars -PscalaVersion=2.12 -PsparkVersion=2.4.8 -PscalaCompt=2.12.15
# for spark 2.4 with scala 2.11
./gradlew buildJars -PscalaVersion=2.11 -PsparkVersion=2.4.8 -PscalaCompt=2.11.12
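If the build succeeds, the jar used later in this guide should show up under spark/build/libs. A quick optional sanity check (the exact file name may differ for other Spark/Scala versions; the Run the job step below assumes spark-1.0.0-SNAPSHOT.jar):
- bash
# Confirm the Spark jar was produced by the build above.
ls -lh spark/build/libs/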
- Spark only supports JDK 1.8 before Spark 3.x
- Starting from Spark 3.x, JDK 11 is supported, but JDK 17 is not yet supported
- For support of different Hive versions, please follow Interacting with Different Versions of Hive Metastore
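If your Hive metastore version differs from the one your Spark build expects, the usual approach from that documentation page is to pin the metastore client via spark-submit configuration. This is a minimal sketch only; 2.3.9 and the maven jar source are placeholder values, not Sharp ETL defaults, so substitute whatever your metastore actually runs:
- bash
# Example: pin the Hive metastore client version when submitting the job shown
# later in "Run the job". Replace 2.3.9 with your metastore's real version.
spark-submit \
  --conf spark.sql.hive.metastore.version=2.3.9 \
  --conf spark.sql.hive.metastore.jars=maven \
  --class com.github.sharpdata.sharpetl.spark.Entrypoint \
  spark/build/libs/spark-1.0.0-SNAPSHOT.jar \
  single-job --name=sales.online_order --period=1440 \
  --default-start-time="2021-09-30 00:00:00" --local --once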
Start a postgres instance
docker run --name postgres -e POSTGRES_PASSWORD=postgres -d -p 5432:5432 postgres:12.0-alpine
Start an ETL db instance
docker run --name mysql8 -d -p 3306:3306 -e MYSQL_ROOT_PASSWORD=root -e MYSQL_DATABASE=sharp_etl mysql:8.0
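Optionally, confirm both containers accept connections before moving on (this assumes the docker CLI and the container names used above):
- bash
# Postgres should report "accepting connections"; MySQL should list its databases.
docker exec postgres pg_isready -U postgres
docker exec mysql8 mysql -uroot -proot -e "SHOW DATABASES;"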
Suppose we have a table named online_order in postgres under the schema sales:
- SQL
-- This extension provides a function to generate a version 4 UUID, so we must enable it first.
CREATE EXTENSION IF NOT EXISTS "pgcrypto";
create schema sales;
create table sales.online_order
(
order_no varchar(64) default gen_random_uuid() not null
primary key,
user_id varchar(32) not null,
user_name varchar(32) not null,
order_total_amount numeric,
actual_amount numeric,
post_amount numeric,
order_pay_amount numeric,
total_discount numeric,
pay_type varchar(32),
source_type varchar(32),
order_status varchar(32),
note varchar(32),
confirm_status varchar(32),
payment_time timestamp,
delivery_time timestamp,
receive_time timestamp,
comment_time timestamp,
delivery_company varchar(32),
delivery_code varchar(32),
business_date date default CURRENT_DATE,
return_flag varchar(32),
created_at timestamp default CURRENT_TIMESTAMP,
updated_at timestamp default CURRENT_TIMESTAMP,
deleted_at timestamp
);
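If you do not have a SQL client at hand, one way to apply the DDL above is to pipe it into psql inside the container; the file name create_online_order.sql is just an example for wherever you saved the statements:
- bash
# Run the schema and table DDL above against the dockerized Postgres.
docker exec -i postgres psql -U postgres -d postgres < ~/Desktop/create_online_order.sql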
Download this excel to your ~/Desktop for the quick start guide. You can also use the existing excel template for your own new cases.
Generate SQL files from the excel config
- bash
./gradlew :spark:run --args="generate-ods-sql -f ~/Desktop/sharp-etl-Quick-Start-Guide.xlsx --output ~/Desktop/"
You should then see a new file generated at ~/Desktop/sales.online_order.sql:
-- workflow=ods__t_fact_online_order
-- period=1440
-- loadType=incremental
-- logDrivenType=timewindow
-- step=1
-- source=postgres
-- dbName=postgres
-- tableName=sales.online_order
-- target=hive
-- dbName=ods
-- tableName=t_fact_online_order
-- writeMode=append
SELECT "order_no" AS "order_no",
"user_id" AS "user_id",
"user_name" AS "user_name",
"order_total_amount" AS "order_total_amount",
"actual_amount" AS "actual_amount",
"post_amount" AS "post_amount",
"order_pay_amount" AS "order_pay_amount",
"total_discount" AS "total_discount",
"pay_type" AS "pay_type",
"source_type" AS "source_type",
"order_status" AS "order_status",
"note" AS "note",
"confirm_status" AS "confirm_status",
"payment_time" AS "payment_time",
"delivery_time" AS "delivery_time",
"receive_time" AS "receive_time",
"comment_time" AS "comment_time",
"delivery_company" AS "delivery_company",
"delivery_code" AS "delivery_code",
"business_date" AS "business_date",
"return_flag" AS "return_flag",
"created_at" AS "created_at",
"updated_at" AS "updated_at",
"deleted_at" AS "deleted_at",
${JOB_ID} AS "job_id",
to_char("business_date", 'yyyy') as "year",
to_char("business_date", 'MM') as "month",
to_char("business_date", 'DD') as "day"
FROM "postgres"."sales"."online_order"
WHERE "business_date" >= '${DATA_RANGE_START}' AND "business_date" < '${DATA_RANGE_END}';
Create ODS table
- SQL
create schema ods;
create table ods.t_fact_online_order
(
order_no varchar(64) not null,
user_id varchar(32) not null,
user_name varchar(32) not null,
order_total_amount numeric,
actual_amount numeric,
post_amount numeric,
order_pay_amount numeric,
total_discount numeric,
pay_type varchar(32),
source_type varchar(32),
order_status varchar(32),
note varchar(32),
confirm_status varchar(32),
payment_time timestamp,
delivery_time timestamp,
receive_time timestamp,
comment_time timestamp,
delivery_company varchar(32),
delivery_code varchar(32),
business_date date default CURRENT_DATE,
return_flag varchar(32),
created_at timestamp default CURRENT_TIMESTAMP,
updated_at timestamp default CURRENT_TIMESTAMP,
deleted_at timestamp,
job_id varchar(16)
);
Insert data
- SQL
insert into sales.online_order(order_no, user_id, user_name, order_total_amount, actual_amount, post_amount,
order_pay_amount,
total_discount, pay_type, source_type, order_status, payment_time, business_date,
created_at, updated_at, deleted_at)
VALUES ('2021093000001', 1, '张三', 200.0, 100, 0, 99, 101.0, 'wechat', 'mini-program', 'paid', '2021-09-30 09:00:35',
'2021-09-30',
'2021-09-30 09:00:00', '2021-09-30 09:00:35', null),
('2021093000002', 2, '李四o(╥﹏╥)o', 399.0, 200, 0, 200, 199.0, 'wechat', 'official-website', 'paid',
'2021-09-30 19:00:35',
'2021-09-30',
'2021-09-30 19:00:00', '2021-09-30 19:00:35', null);
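With --default-start-time '2021-09-30 00:00:00' and --period=1440 (minutes, i.e. one day), the first run below should cover exactly these two rows. You can preview what falls into that window straight from Postgres; this is only a manual check, not how Sharp ETL renders its extraction query:
- bash
# Preview the rows the first one-day time window should pick up.
docker exec -i postgres psql -U postgres -d postgres -c \
  "SELECT order_no, user_name, business_date FROM sales.online_order
   WHERE business_date >= '2021-09-30' AND business_date < '2021-10-01';"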
Before running the job
The db connection information is specified in application.properties. Make sure your db connections are included in the file. For this quick start, we need to add the following:
postgres.postgres.url=jdbc:postgresql://localhost:5432/postgres?stringtype=unspecified
postgres.postgres.user=postgres
postgres.postgres.password=postgres
postgres.postgres.driver=org.postgresql.Driver
postgres.postgres.fetchsize=10
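You can double-check that these connection details work from your machine; this assumes psql is installed locally, otherwise reuse the client inside the container as shown earlier:
- bash
# Same host, port, database and credentials as the postgres.* entries above.
PGPASSWORD=postgres psql -h localhost -p 5432 -U postgres -d postgres -c "SELECT 1;"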
Run the job
Then we will run a sample job which reads data from the sales.online_order table and writes it into ods.t_fact_online_order.
- bash
# run single job by `spark-submit`
spark-submit --class com.github.sharpdata.sharpetl.spark.Entrypoint spark/build/libs/spark-1.0.0-SNAPSHOT.jar single-job --name=sales.online_order --period=1440 --default-start-time="2021-09-30 00:00:00" --local --once
# run single job locally
./gradlew :spark:run --args="single-job --name=sales.online_order --period=1440 --default-start-time='2021-09-30 00:00:00' --local --once"
You need to put the SQL file under spark/src/main/resources/tasks, or into HDFS/DBFS, to run single-job. If you want to configure the SQL file folder, please set etl.workflow.path in application.properties. For this guide you can simply copy the generated file there, as shown below.
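A minimal example, assuming you run the Gradle command from the repository root:
- bash
# Copy the generated workflow SQL to the default location single-job reads from.
mkdir -p spark/src/main/resources/tasks
cp ~/Desktop/sales.online_order.sql spark/src/main/resources/tasks/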
Query job result
- SQL
SELECT * FROM ods.t_fact_online_order;
order_no | user_id | user_name | order_total_amount | actual_amount | post_amount | order_pay_amount | total_discount | pay_type | source_type | order_status | note | confirm_status | payment_time | delivery_time | receive_time | comment_time | delivery_company | delivery_code | business_date | return_flag | created_at | updated_at | deleted_at | job_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2021093000002 | 2 | 李四o(╥﹏╥)o | 399 | 200 | 0 | 200 | 199 | wechat | official-website | paid | | | 2021-09-30 19:00:35.000000 | | | | | | 2021-09-30 | | 2021-09-30 19:00:00.000000 | 2021-09-30 19:00:35.000000 | | 2 |
2021093000001 | 1 | 张三 | 200 | 100 | 0 | 99 | 101 | wechat | mini-program | paid | | | 2021-09-30 09:00:35.000000 | | | | | | 2021-09-30 | | 2021-09-30 09:00:00.000000 | 2021-09-30 09:00:35.000000 | | 2 |
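The MySQL instance started earlier holds the sharp_etl database, which presumably stores Sharp ETL's log-driven job bookkeeping. If you are curious what was recorded for this run, you can list its tables; the exact table names depend on the Sharp ETL version, so this is only a peek:
- bash
# List Sharp ETL's bookkeeping tables in the ETL db.
docker exec mysql8 mysql -uroot -proot sharp_etl -e "SHOW TABLES;"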