PostgreSQL/Greenplum查询计划

查询模板

示例查询表

--学生成绩信息
create table sc( --成绩信息
	sno varchar(10), --学号
	cno varchar(10), --课程号
	score int --成绩
);

--学生课程信息
create table course( --课程信息
	cno varchar(10), --课程号
	cname varchar(10), --课程名称
	credit int --学分
	priorcourse varchar(10) --前置课程
);

--学生班级信息
create table class( --班级信息
	classno varchar(10), --班级编号
	classname varchar(10), --班级名称
	gno varchar(10)
);

--学生信息
create table student( --学生信息
	sno varchar(10), --学号
	sname varchar(10), --学生姓名
	gender varchar(2), --性别
	age int, --年龄
	nation varchar(10), --国籍
	classno varchar(10) --班级编号
);

示例查询

select classno,classname,avg(score) as avg_score
from sc,(select * from class where class.gno='grade one') as sub
where
sc.sno in (select sno from student where student.classno = sub.classno)
and
sc.sno in (select course.cno from course where course.cname = 'computer')
group by classno, classname
having avg(score) > 60
order by avg_score;

上一级节点的启动成本和总成本包含下一级节点的启动成本和总成本,因此,上一级节点的成本总是大于下一级节点的成本。

查询计划关键词

cost

measured in units of disk page fetches; that is, 1.0 equals one sequential disk page read. The first estimate is the start-up cost (cost of getting to the first row) and the second is the total cost (cost of getting all rows). Note that the total cost assumes that all rows will be retrieved, which may not always be the case (if using LIMIT for example).

数据库自定义的消耗单位,通过统计信息来估计SQL消耗。

rows

the total number of rows output by this plan node. This is usually less than the actual number of rows processed or scanned by the plan node, reflecting the estimated selectivity of any WHERE clause conditions. Ideally the top-level nodes estimate will approximate the number of rows actually returned, updated, or deleted by the query.

width

total bytes of all the rows output by this plan node.

返回结果集每一行的长度,这个长度值是根据pg_statistic表中的统计信息来计算的。

slice

Greenplum在实现分布式执行计划的时候,需要将SQL拆分成多个切片,每个slice是单裤执行的一部分SQL,每一个广播或者重分布会产生一个切片,每一个切片在每一个数据结点上都会对应的发起一个进程来处理该slice负责的数据,上一层负责该slice的进程会读取下级slice广播或重分布的数据,之后进行相应的计算。

The query optimizer produces a query execution plan, consisting of a series of tasks called operators (labeled D in the diagram). Operators perform tasks such as table scans or joins, and typically produce intermediate query results by processing one or more sets of input rows. Operators receive a share of the memory the resource queue allocates to a query. If an operator cannot perform all of its work in the memory allocated to it, it caches data on disk in spill files.

EXPLAIN 

查询计划可以用于如下:

若一个查询性能很差,查看查询计划可以帮助我们找到问题;
问题有以下几个方面:
* 计划中是否有一个操作花费时间过长;
* 优化器的评估成本是否接近实际情况;
* 选择性强的条件是否较早出现;
* 优化器是否选择了最佳的关联顺序;
* 优化器是否选择性的扫描分区;
* 优化器是否合适的选择了hash聚合与hash关联操作;

Greenplum SQL plan checker  
Greenplum+Hadoop学习笔记-工作负载及资源管理 

(未完)

查询计划可视化:
AlexTatiyants/pev