用Linux命令行实现SQL的groupby

eyesmore

浏览: 364765 次
性别:
来自: 北京

最近访客更多访客>>

Java技术干货

fhtwins

zzy_001

我的西雅图之梦

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

SQL Linux QQ 正则表达式 Perl

用Linux命令行实现SQL的groupby

需求

给定一个sql脚本文件，里面全是insert操作，样例如下：
insert into `passport_daily` (uid,loginname,ip,operatetime,result) values ('9454268',' 876334249@qq.com','222.168.129.186','1281801604225','register');
insert into `passport_daily` (uid,loginname,ip,operatetime,result) values ('9454269',' 375682141@qq.com','60.181.1.247','1281801607618','register');
insert into `passport_daily` (uid,loginname,ip,operatetime,result) values ('9036732','670981425@qq.com','118.112.144.140','1281801609235','login');
insert into `passport_daily` (uid,loginname,ip,operatetime,result) values ('9454270',' 35845122@qq.com','111.112.18.214','1281801612152','register');
insert into `passport_daily` (uid,loginname,ip,operatetime,result) values ('8363103','jianglong418@126.com','113.106.101.3','1281801615463','login');

关系表(uid,loginname,ip,operatetime,result)的语义是：某某用户（账号ID是：uid，登录名是：loginname）在operatetime时间按从源IP执行了一个result操作。

现在需要统计：来自不同的IP分别有多少？限用Linux命令统计。

实现结果

gawk -F "','" '{ print $3 }' sql.txt | sort -T ./ | uniq -c | sort -k 1 -nrs -T ./ > sql-stat.txt

23 60.181.1.247
9 111.112.18.214
5 113.106.101.3
1 118.112.144.140
1 222.168.129.186

命令解释

gawk

gawk -F "','" '{ print $3 }' sql.txt 表示从sql.txt文件中读取第三列。其中列分割符是’,’ （包含三个字符），由于列分割符号超过三个字符可使用双引号或单引号包围。由于此处分割符’,’中已经有单引号，所以使用双引号包围。
命令选项：
-F fs --field-separator=fs 用来指定列分割符号

insert into `passport_daily` (uid,loginname,ip,operatetime,result) values ('9454268',' 876334249@qq.com','222.168.129.186','1281801604225','register');

当然通过grep+正则表达式也可以提取IP地址：
grep -P '([0-9]{1,3}\.){3}[0-9]{1,3}' sql.txt –o

uniq

-c, --count           prefix lines by the number of occurrences （统计功能）
-d, --repeated        only print duplicate lines （只找出重复行）
-u, --unique          only print unique lines   （只找出单行）

-f, --skip-fields=N   avoid comparing the first N fields （对于有文件头的文件可以忽略前N行）

但是uniq命令有个前提：重复，单行的判断依据是只跟前后几行对比，并不是全文对比。所以执行uniq前一般先要sort。
Discard all but one of successive identical lines from INPUT (or
standard input), writing to OUTPUT (or standard output).

实例1：假设t1.txt文件如下
1
2
1
# uniq -u t1.txt        （虽然有两个1，但是不是连续的，依然会作为单行）
1
2
1
实例2：假设t2.txt文件如下
1
1
2
# uniq -u t2.txt    （连续的两个1，被认为是冗余行，所以取单行时被删除了）
2

Sort

-n,-r选项
-n, --numeric-sort compare according to string numerical value 作为数字，而不是作为字母
-r, --reverse reverse the result of comparisons 倒序
-s, --stable              stabilize sort by disabling last-resort comparison 如果希望稳定排序，则使用-s。所谓“稳定排序”举个例子，两个二元组原始顺序：(1,zhangsan), (1,lisi); 如果稳定排序，那么排序的结果始终会是(1,zhangsan)在(1,lisi)前面；非稳定的，那么(1,lisi)可能会排(1,zhangsan)的前面。
-k , -t 选项
-k, --key=POS1[,POS2]     start a key at POS1, end it at POS2 (origin 1)
-t, --field-separator=SEP use SEP instead of non-blank to blank transition

这两个选项可以让我们指定排序关键字。
假如有三行(uid,name,ip,state)：
8345891, zhangsan, 201.22.135.64, 0
3845891, wangwu, 69.22.136.64, 1
3845891, wangwu2, 198.22.16.164, 1

那么，如果按uid升序排序：
# sort -k 1 -t ',' -n s.txt   （-t 是字段分割符，-k用来指定排序关键字（可以是多个））
3845891, wangwu2, 198.22.16.164, 1 （非稳定排序，尽管wangwu和wangwu2的uid都是：3845891，但是wangwu2排wangwu前面去了。）
3845891, wangwu, 69.22.136.64, 1
8345891, zhangsan, 201.22.135.64, 0

# sort -k 1 -t ',' -n s.txt –s （稳定排序，wangwu依然在wangwu2的前面）
3845891, wangwu, 69.22.136.64, 1
3845891, wangwu2, 198.22.16.164, 1
8345891, zhangsan, 201.22.135.64, 0

# sort -k 2 -t ',' s.txt -s （以第二个减排序）
3845891, wangwu2, 198.22.16.164, 1
3845891, wangwu, 69.22.136.64, 1
8345891, zhangsan, 201.22.135.64, 0

-T 选项
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;
                              multiple options specify multiple directories

-T是用来指定临时目录的。因为sort排序过程中间可能需要用到一些临时文件作为交换空间，默认情况下这些临时文件会保存在$TMPDIR或/tmp中，但是我们也可以通过-T参数来指定。这个参数一般用在如果待排序文件特别大，默认临时目录可能没有足够空间。有个帖子说明了这一点：
http://stackoverflow.com/questions/3451388/perl-sort-temporary-directory

I ran into space issues on my machine and therefore the sort command in unix failed because of lack of space in /tmp. In order to circumvent this, I decided to run sort with the -T option allowing it to use some other directory for creating temporary files. here is the perl script I have
   my $TMPDIR              = "/home/xyz/workspace/";
    my $sortCommand         = "awk 'NR == 1; NR > 1 { print \$0 | \"sort -T \$TMPDIR -k1,1\" }' test > test.sort";
    system_call($sortCommand, "Sort");
    sub system_call {
      .......
}
this works perfectly on my desktop.

总结

gawk -F "','" '{ print $3 }' sql.txt | sort -T ./ | uniq -c | sort -k 1 -nrs -T ./ > sql-stat.txt

（1）    gawk -F "','" '{ print $3 }' sql.txt 用来从sql.txt文件中取出IP列；
（2）    sort -T ./ | uniq –c 用来统计各个IP的数量。由于uniq –c统计的前提是相同的元素得相邻，所以之前得对输入进行排序。另外考虑到待排序输入流很大，默认临时空间可能不够，因此指定临时空间为当前目录。
（3）    sort -k 1 -nrs -T ./ 按第一列，以数字，倒序排列，而且是稳定排序。（原本只有一个IP列，但是uniq –c后，在前面插入了统计量，因此变成两列了。）
（4）    > sql-stat.txt 将结果重定向输出到sql-stat.txt文件。