Create tables:
drop table t;
create table if not exists t (t string) partitioned by (log_date string) row format delimited fields terminated by '\t' lines terminated by '\n';
create table t1 like t;
create table t2 as select * from t; -- adding extra columns or keywords such as GROUP BY here will cause an error
Load data:
load data local inpath '/opt/aimcpro/libc/first.txt' into table test_libc_20140711_1;
insert overwrite table tt partition (ds='${begin}')
Dynamic partition insert:
insert overwrite table t
partition (country)
select ..., se.cty   -- the last select column fills the dynamic partition key, by position
from emp se;
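Dynamic partitioning is off (or in strict mode) by default, so a session usually needs these settings first (a sketch, assuming default Hive configuration):

```sql
-- enable dynamic partitioning for the current session
set hive.exec.dynamic.partition=true;
-- nonstrict: every partition column may be determined dynamically
set hive.exec.dynamic.partition.mode=nonstrict;
```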
Insert into multiple partitions with a single table scan (multi-insert):
from t
insert overwrite table em partition (p=a)
select * where p=a
insert overwrite table em partition (p=b)
select * where p=b;
Export data:
insert overwrite local directory '/tmp'
select * from t;
Or: hadoop fs -cp src dest
Table-generating functions: combined with regular expressions they can be used for word counts over text.
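A common sketch of such a word count, assuming a table docs with a single string column line (both names are hypothetical):

```sql
-- split each line on whitespace, explode the array into one row per word, then count
select word, count(1) as cnt
from (
  select explode(split(line, '\\s+')) as word
  from docs
) w
group by word;
```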
Implicit floating-point conversion pitfalls
Explicit type conversion:
cast(salary as float)
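The classic pitfall: comparing a FLOAT column against a decimal literal, which Hive treats as DOUBLE, can match unexpected rows because the FLOAT value widens to a slightly different DOUBLE. Casting the literal down avoids this (a sketch; the employees table is hypothetical, with salary assumed FLOAT as in the cast above):

```sql
-- "salary > 0.2" may misbehave: 0.2 is a DOUBLE literal, salary is FLOAT
select * from employees where salary > cast(0.2 as float);
```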
Export data from a remote MySQL server:
mysql -h 10.199.82.31 -P 3304 -umns -pmns mns -e "select UID,NOTICETYPE from SUBSCRIBEINFO where NOTICETYPE=403 or NOTICETYPE=4 or NOTICETYPE=304;" > user_notify.lst
Regular expression extraction:
regexp_extract(col2, '(\\d*)(:)0', 0)
matches values like: 12345678901:0
Running dfs commands inside Hive: just drop the hadoop keyword, e.g. dfs -ls /; instead of hadoop fs -ls /
Random sampling:
Tag every row with a constant flag, then sort randomly within that flag:
select user_device from
  (select user_device from
    (select user_device, "1" flag from test_libc_t) t
   distribute by t.flag sort by t.flag, rand()) tt
limit 100;
Raise the log level to get detailed error information when a SQL statement fails:
hive --hiveconf hive.root.logger=DEBUG,console    (only affects the current session)
Buckets subdivide a table (or partition) at a finer granularity than partitions. Create a bucketed table:
create table bucketed_users (id int, name string) clustered by (id) sorted by (name) into 4 buckets row format delimited fields terminated by '\t' stored as textfile;
Here the user ID determines which bucket each row goes into.
Sample data from the buckets:
SELECT * FROM bucketed_users TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
i.e. take the first bucket's data (4/4 = 1 means one bucket's worth of data is taken; the leading 1 means start from the first bucket).