Skip to content

Hive Query Language

Zsolt Fekete edited this page Jun 9, 2017 · 2 revisions

This is a cheat sheet for basic Hive SQL syntax. For a detailed documentation check this page: https://cwiki.apache.org/confluence/display/Hive/LanguageManual

Create/delete/use database

show databases;
create database mydb;
drop database mydb cascade;
drop database mydb cascade;
use mydb;
show tables;

Table creation

create table books(id int, title string, genre string);

-- delmited format with specified delimiter
create table books(id int, title string, genre string)
  row format delimited fields terminated by '|';

-- stored in orc format
create table books(id int, title string, genre string) stored as orc;

-- partitionen table
create table books(id int, title string) partitioned by (genre string);

-- bucketed table
create table books(id int, title string, genre string) clustered by (id) into 10 buckets;  

-- create skewed table
create table cskew(c_customer_sk bigint, c_first_name string)
  skewed by (c_first_name)
  on ('', 'James', 'John', 'Robert', 'Michael', 'William',
      'David', 'Richard', 'Charles', 'Joseph')
  stored as orc;

The command above creates only the scheme but the table will be mpty. With a CTAS (create table as select) command one can create the table and load data into that:

create table t3 as select x, y, z from t4;

Drop table:

drop table t1;

Load data into table

If the table is already created:

load data inpath '/user/hive/data/b.csv' OVERWRITE INTO TABLE b1;

insert into t2 values (1, 'aaa', 2.3), (2, 'bbb', 1000000000);

insert into table t2 select col1, col2, col3 from t1;

insert into overwrite table t2 select col1, col2, col3 from t1;

Note that the above mentioned CTAS is a data load command as well.

Inserting into a partitioned table:

insert into table c2 partition (c_birth_month=1) values (111), (222), (333);

insert into table c2 partition (c_birth_month=2)
  select c_customer_sk from c1 where c_birth_month=2;

With dynamic partitioning, first make sure that

set hive.exec.dynamic.partition.mode=nonstrict;

With these setting we can omit specifying the partition value:

insert overwrite table c2 partition (c_birth_month)
  select c_customer_sk,c_birth_month from tpcds.customer;

Query metadata

show databases;
show tables;
describe customer;
describe extended tpcds.customer;
describe formatted tpcds.customer;
show create table b1;

Querying

Select, where, group by, order by, join (left/right/full outer, inner) are supported.

Exectution plan

-- default explain
explain select 1;

-- a more detailed/verbose execution plan
explain extended select 1;

-- runs the query and shows the estimated rowcounts and the real rowcounts
explain analyze select 1;

-- this setting makes the explain outputs even more verbose (the default is true)
set hive.explain.user=false;

Analytical functions

select * from (
    select c_first_name, c_birth_month, c_customer_sk , rank()
    over (partition by c_birth_month order by c_customer_sk desc) as rank from customer
  ) ranked
  where ranked.rank =1

Json, xml

JSONPath selection is partly supported by the get_json_object function:

select get_json_object('{"a":123, "b": [1,"x", null]}', '$.b[1]');

And with xpath function we can have xpath selection from an xml:

select xpath ('<a><b id="1"><c/></b><b id="2"><c/></b></a>',
  '/descendant::c/ancestor::b/@id');

Lateral view

create table tt1 (id int, arr array<int>);
insert into table tt1 select 20, array(5,6,7,8,9);
insert into table tt1 select 10, array(1,2,3);
select * from tt1;
select id, arr[0], arr[2] from tt1;

select id, a from tt1 lateral view explode (arr) aaa as a;

create table tt2 (id int, s string);
insert into tt2 values (1, "a,b,c"), (2, "x,y,z,w");
select id, split(s,",") from tt2;
select id, a from tt2 lateral view explode (split(s,",")) bbb as a;

Complex structures

Hive supports array, map, struc and these types can be nested. For the details see this wiki page: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types