Hive Operators and User-Defined Functions (UDFs)
In Beeline or the CLI, use the commands below to show the latest documentation:
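These are the standard Hive commands (substitute any function name for `<function_name>`):

```sql
SHOW FUNCTIONS;                              -- list all built-in and registered functions
DESCRIBE FUNCTION <function_name>;           -- short description of the function
DESCRIBE FUNCTION EXTENDED <function_name>;  -- description plus usage examples
```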
Built-in Operators
Operator Precedence
Relational Operators
The following operators compare the passed operands and generate a TRUE or FALSE value depending on whether the comparison between the operands holds.
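A quick sketch of the comparison semantics, including the NULL-safe variant (assuming Hive 0.13+, which allows SELECT without a FROM clause):

```sql
SELECT 1 = 1,          -- true
       1 <> 2,         -- true
       NULL = 1,       -- NULL: comparing anything to NULL with = yields NULL
       NULL <=> NULL;  -- true: <=> is the NULL-safe equality operator
```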
Arithmetic Operators
The following operators support various common arithmetic operations on the operands. All return number types; if any of the operands are NULL, then the result is also NULL.
Logical Operators
The following operators provide support for creating logical expressions. All of them return boolean TRUE, FALSE, or NULL depending upon the boolean values of the operands. NULL behaves as an "unknown" flag, so if the result depends on the state of an unknown, the result itself is unknown.
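For example, the three-valued logic plays out like this (a sketch assuming SELECT without FROM, Hive 0.13+):

```sql
SELECT TRUE AND NULL,   -- NULL: the result depends on the unknown operand
       FALSE AND NULL,  -- false: FALSE regardless of the unknown operand
       TRUE OR NULL,    -- true: TRUE regardless of the unknown operand
       FALSE OR NULL;   -- NULL: the result depends on the unknown operand
```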
String Operators
Complex Type Constructors
The following functions construct instances of complex types.
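A minimal sketch of the constructors:

```sql
SELECT array(1, 2, 3),                          -- an ARRAY<int>
       map('a', 1, 'b', 2),                     -- a MAP<string,int>
       struct(1, 'x'),                          -- a STRUCT with fields named col1, col2
       named_struct('name', 'amy', 'age', 30);  -- a STRUCT with explicit field names
```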
Operators on Complex Types
The following operators provide mechanisms to access elements in Complex Types.
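For illustration: A[n] indexes into an array, M[key] looks up a map, and S.x accesses a struct field on a column. A hedged sketch of the first two:

```sql
SELECT array(1, 2, 3)[0],         -- 1: arrays are indexed from 0
       map('a', 1, 'b', 2)['b'];  -- 2: maps are indexed by key
```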
Built-in Functions
Mathematical Functions
The following built-in mathematical functions are supported in Hive; most return NULL when the argument(s) are NULL:
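For instance (a hedged sketch assuming SELECT without FROM, with NULL propagation included):

```sql
SELECT round(3.567, 2),  -- 3.57
       floor(3.567),     -- 3
       pow(2, 10),       -- 1024.0 (pow returns double)
       abs(-5),          -- 5
       sqrt(NULL);       -- NULL: a NULL argument yields NULL
```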
Mathematical Functions and Operators for Decimal Datatypes
All regular arithmetic operators (such as +, -, *, /) and relevant mathematical UDFs (Floor, Ceil, Round, and many more) have been updated to handle decimal types. For a list of supported UDFs, see Mathematical UDFs in Hive Data Types.
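A small sketch of decimal-aware arithmetic (assuming the DECIMAL(precision, scale) syntax of Hive 0.13+):

```sql
SELECT CAST('12.345' AS DECIMAL(5,3)) + 1,        -- exact decimal arithmetic: 13.345
       round(CAST('12.345' AS DECIMAL(5,3)), 2),  -- 12.35
       floor(CAST('12.345' AS DECIMAL(5,3)));     -- 12
```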
Collection Functions
The following built-in collection functions are supported in Hive:
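For example:

```sql
SELECT size(array(1, 2, 3)),               -- 3
       array_contains(array(1, 2, 3), 2),  -- true
       sort_array(array(3, 1, 2)),         -- [1,2,3]
       map_keys(map('a', 1, 'b', 2));      -- the map's keys (order not guaranteed)
```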
Type Conversion Functions
The following type conversion functions are supported in Hive:
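The workhorse is cast; for example:

```sql
SELECT cast('1' AS INT),             -- 1
       cast(1 AS STRING),            -- '1'
       cast('not a number' AS INT);  -- NULL: a failed conversion returns NULL
```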
Date Functions
The following built-in date functions are supported in Hive:
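For example:

```sql
SELECT to_date('2009-07-30 04:17:52'),        -- '2009-07-30'
       year('2009-07-30'),                    -- 2009
       datediff('2009-07-31', '2009-07-29'),  -- 2
       date_add('2009-07-30', 1);             -- '2009-07-31'
```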
Conditional Functions
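A sketch of the common conditional functions (if, coalesce, nvl, and CASE), assuming SELECT without FROM:

```sql
SELECT if(1 < 2, 'yes', 'no'),                 -- 'yes'
       coalesce(NULL, NULL, 'first'),          -- 'first': the first non-NULL argument
       nvl(NULL, 'fallback'),                  -- 'fallback'
       CASE WHEN 1 = 2 THEN 'a' ELSE 'b' END;  -- 'b'
```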
String Functions
The following built-in String functions are supported in Hive:
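For example:

```sql
SELECT concat('foo', 'bar'),   -- 'foobar'
       length('foobar'),       -- 6
       substr('foobar', 4),    -- 'bar': substr is 1-based
       upper('foobar'),        -- 'FOOBAR'
       trim('  foobar  ');     -- 'foobar'
```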
Data Masking Functions
The following built-in data masking functions are supported in Hive:
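As a hedged sketch (these functions are available in Hive 2.1+; by default upper-case letters become X, lower-case letters become x, and digits become n):

```sql
SELECT mask('abcd-1234'),                  -- 'xxxx-nnnn'
       mask_first_n('abcd-1234', 4),       -- 'xxxx-1234': masks the first 4 characters
       mask_show_first_n('abcd-1234', 4);  -- 'abcd-nnnn': shows the first 4 characters
```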
Misc. Functions
xpath
The following functions are described in LanguageManual XPathUDF:
- xpath, xpath_short, xpath_int, xpath_long, xpath_float, xpath_double, xpath_number, xpath_string
get_json_object
A limited version of JSONPath is supported:
- $ : Root object
- . : Child operator
- [] : Subscript operator for array
- * : Wildcard for []
Notably unsupported syntax:
- '' : Zero-length string as key
- .. : Recursive descent
- @ : Current object/element
- () : Script expression
- ?() : Filter (script) expression
- [,] : Union operator
- [start:end:step] : Array slice operator
Example: the src_json table is a single-column (json), single-row table:

{"store":
  {"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
   "bicycle":{"price":19.95,"color":"red"}
  },
 "email":"amy@only_for_json_udf_test.net",
 "owner":"amy"
}
The fields of the json object can be extracted using these queries:
hive> SELECT get_json_object(src_json.json, '$.owner') FROM src_json;
amy

hive> SELECT get_json_object(src_json.json, '$.store.fruit[0]') FROM src_json;
{"weight":8,"type":"apple"}

hive> SELECT get_json_object(src_json.json, '$.non_exist_key') FROM src_json;
NULL
Built-in Aggregate Functions (UDAF)
The following built-in aggregate functions are supported in Hive:
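For instance, assuming a hypothetical emp table with a sal column:

```sql
SELECT count(*),  -- number of rows
       sum(sal),
       avg(sal),
       min(sal),
       max(sal)
FROM emp;
```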
Built-in Table-Generating Functions (UDTF)
Normal user-defined functions, such as concat(), take in a single input row and output a single output row. In contrast, table-generating functions transform a single input row to multiple output rows.
Usage Examples
explode (array)
select explode(array('A','B','C'));
select explode(array('A','B','C')) as col;
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf;
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf as col;
explode (map)
select explode(map('A',10,'B',20,'C',30));
select explode(map('A',10,'B',20,'C',30)) as (key,value);
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf;
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf as key,value;
posexplode (array)
select posexplode(array('A','B','C'));
select posexplode(array('A','B','C')) as (pos,val);
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf;
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf as pos,val;
inline (array of structs)
select inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02')));
select inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) as (col1,col2,col3);
select tf.* from (select 0) t lateral view inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) tf;
select tf.* from (select 0) t lateral view inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) tf as col1,col2,col3;
stack (values)
select stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01');
select stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') as (col0,col1,col2);
select tf.* from (select 0) t lateral view stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') tf;
select tf.* from (select 0) t lateral view stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') tf as col0,col1,col2;
Using the syntax "SELECT udtf(col) AS colAlias..." has a few limitations:
- No other expressions are allowed in SELECT
- SELECT pageid, explode(adid_list) AS myCol... is not supported
- UDTFs can't be nested
- SELECT explode(explode(adid_list)) AS myCol... is not supported
- GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported
- SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported
Please see LanguageManual LateralView for an alternative syntax that does not have these limitations.
Also see Writing UDTFs if you want to create a custom UDTF.
explode
explode() takes in an array (or a map) as an input and outputs the elements of the array (map) as separate rows. UDTFs can be used in the SELECT expression list and as a part of LATERAL VIEW.
As an example of using explode() in the SELECT expression list, consider a table named myTable that has a single column (myCol) and two rows:
[100,200,300]
[400,500,600]
Then running the query:
SELECT explode(myCol) AS myNewCol FROM myTable;

will produce one row per array element:
100
200
300
400
500
600
The usage with Maps is similar:
SELECT explode(myMap) AS (myMapKey, myMapValue) FROM myMapTable;
posexplode
posexplode() is similar to explode(), but instead of just returning the elements of the array, it returns each element along with its position in the original array.
As an example of using posexplode() in the SELECT expression list, consider a table named myTable that has a single column (myCol) and two rows:
[100,200,300]
[400,500,600]
Then running the query:
SELECT posexplode(myCol) AS (pos, myNewCol) FROM myTable;

will produce:
0	100
1	200
2	300
0	400
1	500
2	600
json_tuple
A new json_tuple() UDTF is introduced in Hive 0.7. It takes a set of names (keys) and a JSON string, and returns a tuple of values using one function. This is much more efficient than calling GET_JSON_OBJECT to retrieve more than one key from a single JSON string. In any case where a single JSON string would be parsed more than once, your query will be more efficient if you parse it once, which is what JSON_TUPLE is for. As JSON_TUPLE is a UDTF, you will need to use the LATERAL VIEW syntax in order to achieve the same goal.
For example,
select a.timestamp, get_json_object(a.appevents, '$.eventid'),
       get_json_object(a.appevents, '$.eventname') from log a;
should be changed to:
select a.timestamp, b.*
from log a lateral view json_tuple(a.appevents, 'eventid', 'eventname') b as f1, f2;
parse_url_tuple
The parse_url_tuple() UDTF is similar to parse_url(), but can extract multiple parts of a given URL, returning the data in a tuple. Values for a particular key in QUERY can be extracted by appending a colon and the key to the partToExtract argument, for example, parse_url_tuple('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY:k1', 'QUERY:k2') returns a tuple with values of 'v1','v2'. This is more efficient than calling parse_url() multiple times. All the input parameters and output column types are string.
SELECT b.*
FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b
  as host, path, query, query_id LIMIT 1;
GROUPing and SORTing on f(column)
A typical OLAP pattern is that you have a timestamp column and you want to group by daily or other less granular date windows than by second. So you might want to select concat(year(dt),month(dt)) and then group on that concat(). But if you attempt to GROUP BY or SORT BY a column on which you've applied a function and alias, like this:
select f(col) as fc, count(*) from table_name group by fc;
you will get an error:
FAILED: Error in semantic analysis: line 1:69 Invalid Table Alias or Column Reference fc
because you are not able to GROUP BY or SORT BY a column alias on which a function has been applied. There are two workarounds. First, you can reformulate this query with subqueries, which is somewhat complicated:
select sq.fc,col1,col2,...,colN,count(*) from
 (select f(col) as fc,col1,col2,...,colN from table_name) sq
group by sq.fc,col1,col2,...,colN;
Or you can make sure not to use a column alias, which is simpler:
select f(col) as fc, count(*) from table_name group by f(col);
Contact Tim Ellis (tellis) at RiotGames dot com if you would like to discuss this in further detail.
UDF internals
The context of a UDF's evaluate method is one row at a time. A simple invocation of a UDF like
SELECT length(string_col) FROM table_name;
would evaluate the length of each of string_col's values in the map portion of the job. The side effect of the UDF being evaluated on the map side is that you can't control the order of rows that get sent to the mapper; it is the same order in which the file split sent to the mapper gets deserialized. Any reduce-side operation (such as SORT BY, ORDER BY, or a regular JOIN) would apply to the UDF's output as if it were just another column of the table. This is fine, since the context of the UDF's evaluate method is meant to be one row at a time.
If you would like to control which rows get sent to the same UDF (and possibly in what order), you will have the urge to make the UDF evaluate during the reduce phase. This is achievable by making use of DISTRIBUTE BY, DISTRIBUTE BY + SORT BY, or CLUSTER BY. An example query would be:
SELECT reducer_udf(my_col, distribute_col, sort_col) FROM
 (SELECT my_col, distribute_col, sort_col FROM table_name DISTRIBUTE BY distribute_col SORT BY distribute_col, sort_col) t;
However, one could argue that the very premise of your requirement to control the set of rows sent to the same UDF is to do aggregation in that UDF. In such a case, using a User-Defined Aggregate Function (UDAF) is a better choice. You can read more about writing a UDAF here. Alternatively, you can use a custom reduce script to accomplish the same thing using Hive's Transform functionality. Both of these options would do aggregations on the reduce side.
Creating Custom UDFs
For information about how to create a custom UDF, see Hive Plugins and Create Function.