Posts

Shell encrypt and decrypt

   #!/bin/sh
   # path to virtualenv
   cd /home/srvc_nextgen_hitg/cp360
   source cp360venv/bin/activate
   echo "Virtualenv started"
   cd /home/cp360/python_scripts

   # Read python code
   export AES_SECRET_KEY="K{;5%A5yHL&^efe-"
   export apiacc="yen.why.saw-99"

   encrypt_pswd=`python - <<END
   import os
   from AES import AES_ENCRYPT
   aes_obj = AES_ENCRYPT()
   encpt = aes_obj.encrypt(os.environ['apiacc'], os.environ['AES_SECRET_KEY'])
   print(encpt)
   END`

   echo "------encrypt_pswd------------"
   echo $encrypt_pswd
   echo "-------------------------------"

   export encrypt_pswd=$encrypt_pswd

   decrypt_pswd=`python - <<END
   import os
   from AES import AES_ENCRYPT
   aes_obj = AES_ENCRYPT()
   dencpt = aes_obj.decrypt(os.environ['encrypt_pswd'], os.environ['AES_SECRET_KEY'])
   print(dencpt)
   END`

   echo "------decrypt_pswd------------"
   echo $decrypt_pswd
   echo "-------------------------------"
   #############################...
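The AES module used above is a custom class that isn't shown. As a minimal runnable sketch of the same pattern — secrets handed to Python through environment variables and round-tripped through an encrypt/decrypt object — the class below substitutes base64 encoding for the real AES_ENCRYPT implementation; base64 is not encryption and is only used to keep the sketch self-contained.

```python
import base64
import os

# Stand-in for the custom AES_ENCRYPT class from the script above.
# base64 is NOT encryption; a real implementation would use the key with AES.
class FakeCipher:
    def encrypt(self, plaintext, key):
        # `key` is accepted to mirror the real interface, but unused here.
        return base64.b64encode(plaintext.encode()).decode()

    def decrypt(self, ciphertext, key):
        return base64.b64decode(ciphertext.encode()).decode()

# Secrets are passed via environment variables, as in the shell script.
os.environ["apiacc"] = "yen.why.saw-99"
os.environ["AES_SECRET_KEY"] = "K{;5%A5yHL&^efe-"

cipher = FakeCipher()
token = cipher.encrypt(os.environ["apiacc"], os.environ["AES_SECRET_KEY"])
restored = cipher.decrypt(token, os.environ["AES_SECRET_KEY"])
print(restored)  # round-trips back to the original value
```

The point of the pattern is that the secret never appears on a command line (visible in `ps`); only the environment carries it into the inline Python block.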

spark union issue

 While doing a union we will face an issue with mixed / mismatched record columns.

 var colList = propObject.getProperty("colList").split(",").map(_.trim)

 or

 1. Create a DataFrame from CSV:
 val df = spark.read.option("header", "true").option("delimiter", "|").option("inferSchema", "true").csv("*")

 2. Get the column list:
 val collist = df.columns
 collist: Array[String] = Array(BU, LEVEL, ranking)

 3. Select with the head-and-tail method:
 val fin = df.select(collist.head, collist.tail:_*).distinct
 fin: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [BU: string, LEVEL: string

 4. Using head and tail against the property-file list:
 val fin = df.select(colList.head, colList.tail:_*).distinct

 ##########################################

 val df2 = spark.read.option("header", "true").option("delimiter", "|").option("inferSchema", "true").csv("*")
 val collist = df2.columns
 val fin = df.select(collist.head, collist.tail:_*...
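The root cause is that a DataFrame union matches columns by position, not by name, so both sides must be projected through one fixed column list (the `select(cols.head, cols.tail:_*)` trick above) before the union. A small plain-Python illustration of the same idea, with hypothetical data and no Spark required:

```python
# Two "DataFrames" as lists of dicts with the same columns in different order.
df1_rows = [{"BU": "b1", "LEVEL": "l1", "ranking": 1}]
df2_rows = [{"ranking": 2, "BU": "b2", "LEVEL": "l2"}]

# Fixed reference column list, like collist = df.columns in the Spark code.
collist = ["BU", "LEVEL", "ranking"]

def to_tuples(rows, cols):
    # Equivalent of df.select(cols.head, cols.tail:_*): project every row
    # into the same column order before a positional union.
    return [tuple(row[c] for c in cols) for row in rows]

unioned = to_tuples(df1_rows, collist) + to_tuples(df2_rows, collist)
print(unioned)
```

Without the projection, the second row's values would land under the wrong columns — the "mix-match records" issue the note describes.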

capacity-scheduler-in-yarn-hadoop

Capacity Scheduler in YARN

In the post YARN in Hadoop we have already seen that it is the scheduler component of the ResourceManager that is responsible for allocating resources to the running jobs. The scheduler component is pluggable in Hadoop, and there are two options: the capacity scheduler and the fair scheduler. This post talks about the capacity scheduler in YARN, its benefits, and how the capacity scheduler can be configured in a Hadoop cluster.

Capacity scheduler

The capacity scheduler in YARN allows multi-tenancy of the Hadoop cluster, where multiple users can share one large cluster. Every organization having its own private cluster leads to poor resource utilization: an organization may provision enough resources to meet its peak demand, but that peak demand may not occur very frequently, leaving resources underutilized the rest of the time. Sharing a cluster among organizations is therefore a more cost-effective idea. However, organizations are concerned ...
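As a sketch of what that configuration looks like, queues and their capacities are declared in capacity-scheduler.xml; the queue names (prod, dev) and percentages below are illustrative only, not taken from the post:

```xml
<!-- Illustrative capacity-scheduler.xml fragment: two queues under root. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value>
</property>
```

The capacities of the queues under a parent must sum to 100; each organization (queue) is guaranteed its share while idle capacity can be borrowed by busier queues.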

cdc

 select
   CONCAT(nvl(TRIM(asset.asset_number),'0'), asset.data_sou) AS cdc_key
   ,CONCAT(nvl(TRIM(asset.asset_number),'0'), asset.data_sou) AS dat_sou
   ,md5(concat(nvl(asset.asset_number,'0'), nvl(asset.dat_sou,'0'), nvl(asset.coln,'0'))) AS cdc_hash
   ,asset.*
 from (
   select distinct
     CONCAT(trim(cus_nu),'-'),
     cast(NULL as string) as vendor_id,
     ACUR.*,
     'N' ISDELETED
   from INERMIDTE.ACUR
 ) ASSET

 =====================================

 WITH ADDTION AS (
   SELECT A.*
     ,current_timestamp AS Date_created1
     ,current_timestamp AS Date_updated1
     ,'N' AS ISDELTED1
   FROM precdc.asset A
   left outer join processed.asset B ON (A.cdc_key = B.cdc_key)
   WHERE B.data_sou = '1002'
     AND (B.cdc_hash IS Null)
 )
 ,DELETION AS (
   SELECT B.*
     ,current_timestamp AS Date_created1
     ,current_timestamp AS Date_updated1
     ,'Y' AS ISDELTED1
   FROM precdc.asset A
   right outer join processed.asset B ON (A.cdc_key = B.cdc_key)
   WHERE B.data_sou = '1002'
     AND (B.cdc_hash IS ...
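The CDC pattern above keys change detection off an md5 hash of the concatenated columns, with NULLs defaulted via nvl. A minimal Python sketch of the same idea, with hypothetical column values:

```python
import hashlib

def cdc_hash(*cols):
    # NULLs (None) are defaulted to '0' before hashing, mirroring nvl(col, '0').
    joined = "".join("0" if c is None else str(c) for c in cols)
    return hashlib.md5(joined.encode()).hexdigest()

old_row = ("A-100", "1002", "x")
new_row = ("A-100", "1002", "y")  # one column changed

# Equal hashes mean the row is unchanged; a differing hash flags an update.
changed = cdc_hash(*old_row) != cdc_hash(*new_row)
print(changed)
```

Comparing one hash per row is cheaper than comparing every column pair-wise, which is why the query materializes cdc_hash alongside cdc_key.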

hive sqoop increment

create database ccdm_mstr;
use ccdm_mstr;

create table clm_cvs_fact(
  clm_cvs_fact_ket varchar(30),
  dw_cret_aud_key int(25),
  dw_updt_aud_key int(30));

insert into clm_cvs_fact (clm_cvs_fact_ket, dw_cret_aud_key, dw_updt_aud_key)
values ('123456700','12347771', '12347771');
insert into clm_cvs_fact (clm_cvs_fact_ket, dw_cret_aud_key, dw_updt_aud_key)
values ('123456701','12347772', '12347772');

select * from clm_cvs_fact;

output:
+------------------+-----------------+-----------------+
| clm_cvs_fact_ket | dw_cret_aud_key | dw_updt_aud_key |
+------------------+-----------------+-----------------+
| 123456700        |        12347771 | 12347771        |
| 123456701        |        12347772 | 12347772        |
+------------------+-----------------+-----------------+

[cloudera@quickstart ~]$ sqoop import --connect jdbc:mysql://localhost/ccdm_mstr --usern...
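The sqoop command above is cut off. A typical incremental append import against a table like this might look like the following sketch — the username, target directory, --check-column, and --last-value values are illustrative, not recovered from the original note:

```shell
# Illustrative sqoop incremental import (values are assumptions, not from the post):
# pull only rows whose dw_updt_aud_key is greater than the last imported value.
sqoop import \
  --connect jdbc:mysql://localhost/ccdm_mstr \
  --username cloudera -P \
  --table clm_cvs_fact \
  --target-dir /user/cloudera/clm_cvs_fact \
  --incremental append \
  --check-column dw_updt_aud_key \
  --last-value 12347771 \
  -m 1
```

On completion sqoop prints the new --last-value to use for the next run; saving it (or using a sqoop saved job) is what makes the import incremental.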

incrementally update

Incrementally update an imported table In  CDP Private Cloud Base , updating imported tables involves importing incremental changes made to the original table using Sqoop and then merging changes with the tables imported into Hive. After ingesting data from an operational database to Hive, you usually need to set up a process for periodically synchronizing the imported table with the operational database table. The base table is a Hive-managed table that was created during the first data ingestion. Incrementally updating Hive tables from operational database systems involves merging the base table and change records to reflect the latest record set. You create the incremental table as a Hive external table, typically from CSV data in HDFS, to store the change records. This external table contains the changes (INSERTs and UPDATEs) from the operational database since the last data ingestion. Generally, the table is partitioned and only the latest partition is updated, making this pro...
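The reconcile step described above — base table plus change records, with the latest record per key winning — can be sketched in plain Python, with hypothetical record shapes:

```python
# Base table rows and incremental change records, keyed by id;
# each row carries a modification timestamp (higher = newer).
base = {1: {"id": 1, "val": "a", "ts": 10},
        2: {"id": 2, "val": "b", "ts": 10}}
changes = [{"id": 2, "val": "b2", "ts": 20},   # UPDATE to an existing row
           {"id": 3, "val": "c",  "ts": 20}]   # INSERT of a new row

merged = dict(base)
for row in changes:
    cur = merged.get(row["id"])
    # Keep the newer record per key, mirroring the reconciliation view.
    if cur is None or row["ts"] > cur["ts"]:
        merged[row["id"]] = row

print(sorted(r["val"] for r in merged.values()))
```

In Hive this same logic is usually expressed as a view over the union of the base and incremental tables, ranking rows per key by the modification timestamp and keeping rank 1.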

sed, awk, grep

1. Print lines 1-6 with p:
cat demo.txt | sed -n '1,6p'
2. Print everything except lines 6-10 with d (no -n here, or nothing is printed):
cat demo.txt | sed '6,10d'
3. Print multiple ranges with -e:
cat demo.txt | sed -n -e '6,10p' -e '10,13p'
4. Replace a string globally:
cat demo.txt | sed "s/oldword/newword/g"
Ignore character case (GNU sed):
cat demo.txt | sed "s/oldword/newword/gI"
5. Squeeze runs of blanks to a single space:
cat demo.txt | sed 's/  */ /g'
6. Replace only within lines 10-15:
cat demo.txt | sed '10,15 s/oldwrd/new/g'
7. Delete lines 6-10 with d:
cat demo.txt | sed '6,10d'
8. Keep only lines 6-10 (delete all the others) with !d:
cat demo.txt | sed '6,10!d'

Example 1) Displaying partial text of a file
With sed, we can view only some part of a file rather than seeing the whole file. To see some lines of the file, use the following command:
[linuxtechi@localhost ~]$ sed -n 22,29p testfile.txt
Here, option 'n' suppresses printing of the whole file & option 'p' wi...
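A quick runnable check of the range commands above, using a generated demo file (the file name and contents are just for the demonstration):

```shell
# Build a 12-line demo file, one number per line.
seq 1 12 > demo.txt

# Print only lines 1-3 (-n suppresses default output, p prints the range).
sed -n '1,3p' demo.txt

# Print everything EXCEPT lines 4-9 (d deletes the range; note: no -n).
sed '4,9d' demo.txt
```

Combining -n with d is the usual mistake: -n suppresses the auto-print and d produces none of its own, so the pipeline emits nothing.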