Importing the nginx access log (access.log) into Hive

First, create the log table:

hive> CREATE TABLE nginxlog (
        ipaddress STRING,
        time STRING,
        method STRING,
        path STRING,
        version STRING,
        status STRING,
        size STRING,
        agent STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      WITH SERDEPROPERTIES (
        "input.regex" = '([^ ]+) [^ ]+ [^ ]+ \\[([^\\]]+)\\] "(\\w+) ([^ ]+) ([^ "]+)" (\\d+) (\\d+) [^ ]+ "([^"]+)"',
        "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s")
      STORED AS TEXTFILE;
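Note that this SerDe comes from the hive-contrib package. If Hive complains that the class cannot be found, the contrib jar probably needs to be added to the session before running the CREATE TABLE; the path and version below are only placeholders and depend on your installation:

hive> ADD JAR /usr/local/hive/lib/hive-contrib-1.2.1.jar;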


Note: the regular expression in this step must match your actual log format. If CREATE TABLE fails with a regular-expression error, some part of the pattern is not supported as written; try rewriting that part differently.

The log format looks like this:

66.249.67.29 - - [29/Jun/2015:16:58:53 +0800] "GET / HTTP/1.1" 200 2132 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
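Before creating the table, the pattern can be sanity-checked inside Hive itself with the built-in regexp_extract function. The line below is a shortened version of the sample above (the user agent is replaced with a dummy value, since semicolons inside a literal can trip up the old Hive CLI); group 6 is the status code and should return 200:

hive> SELECT regexp_extract(
          '66.249.67.29 - - [29/Jun/2015:16:58:53 +0800] "GET / HTTP/1.1" 200 2132 "-" "test-agent"',
          '([^ ]+) [^ ]+ [^ ]+ \\[([^\\]]+)\\] "(\\w+) ([^ ]+) ([^ "]+)" (\\d+) (\\d+) [^ ]+ "([^"]+)"',
          6);

Changing the last argument from 1 to 8 walks through all eight capture groups, in the same order as the table columns.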


Then load the log file:

hive> load data local inpath '/usr/local/nginx/logs/access.log' overwrite into table nginxlog;

Wait a few seconds and the data is imported. On my test machine (single core, 1 GB RAM) it took about 5 seconds for a little over 50,000 records; a better-specced machine should be faster.
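To confirm the fields were actually parsed (and not just loaded as raw text), peek at a few rows; lines that don't match the regex generally come back as NULL in every column:

hive> SELECT * FROM nginxlog LIMIT 3;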


Query the imported data:

hive> select count(*) from nginxlog;

Query ID = root_20150630135403_37ebac3f-bda4-4e5d-9d96-d71432bf2a53

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Job running in-process (local Hadoop)

2015-06-30 13:54:05,304 Stage-1 map = 0%,  reduce = 0%

2015-06-30 13:54:06,317 Stage-1 map = 100%,  reduce = 100%

Ended Job = job_local104223731_0008

MapReduce Jobs Launched:

Stage-Stage-1:  HDFS Read: 436240548 HDFS Write: 31159454 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

58198

Time taken: 3.059 seconds, Fetched: 1 row(s)
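With the data in place, regular HiveQL works as usual. For example (not part of the original run), the ten most active client IPs:

hive> SELECT ipaddress, count(*) AS cnt FROM nginxlog GROUP BY ipaddress ORDER BY cnt DESC LIMIT 10;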


All of the steps above were done in the Hive command-line interface; run $HIVE_HOME/bin/hive to enter it.

The key part of the whole procedure is writing the regular expression in the first step, which takes some time to get right.

Here is a very powerful site for testing regular expressions online:

https://regex101.com/

