SQL Aggregation, Grouping, and Having
In SQL, the standard aggregation operators are SUM, AVG, MIN, MAX, and COUNT, and we can apply them to the attributes of relations. For example, suppose we have a schema employee(eid, name, dob, hired_date, salary), which holds a record for each of a company’s employees. We can use COUNT to see how many employees the company has with the following query:
SELECT COUNT(*) FROM employee;
The * argument tells COUNT to count all tuples in the relation. With the DISTINCT operator we eliminate duplicates from the values being counted, so if we wanted to see whether any employees share a name, we could compare the previous query with the following:
SELECT COUNT(DISTINCT name) FROM employee;
If the first count is larger than the second, then, assuming every employee has a non-NULL name, the company has at least two employees with the same name.
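The comparison can be tried on a small data set. The following is a minimal sketch, assuming a SQLite- or PostgreSQL-style dialect; the column types and the three sample rows are invented for illustration and are not part of the original schema description.

```sql
-- Hypothetical sample data: column types and rows are assumptions.
DROP TABLE IF EXISTS employee;
CREATE TABLE employee (
    eid        INTEGER PRIMARY KEY,
    name       TEXT,
    dob        TEXT,
    hired_date TEXT,
    salary     INTEGER
);

INSERT INTO employee (eid, name, dob, hired_date, salary) VALUES
    (1, 'Ana', '1980-02-14', '2010-06-01', 52000),
    (2, 'Ben', '1975-09-30', '2008-03-15', 61000),
    (3, 'Ana', '1991-11-05', '2019-01-07', 48000);

-- Compute both counts in a single result row.
SELECT COUNT(*)             AS total_employees,  -- 3
       COUNT(DISTINCT name) AS distinct_names    -- 2
FROM employee;
```

Because total_employees exceeds distinct_names here, at least one name (in this sample, 'Ana') is shared by more than one employee.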
When COUNT is applied to an attribute rather than to *, it does not count tuples in which that attribute is NULL, and AVG and SUM don’t factor NULL values in when calculating either. We can use AVG to find the average salary of our employees and SUM to calculate the total amount paid in employee salaries with these queries:
SELECT AVG(salary) FROM employee;
SELECT SUM(salary) FROM employee;
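To make the NULL behavior concrete, here is a small sketch, again assuming a SQLite- or PostgreSQL-style dialect. The column types and rows are hypothetical; the third employee's salary is deliberately left as NULL.

```sql
-- Hypothetical sample data: one salary is unknown (NULL).
DROP TABLE IF EXISTS employee;
CREATE TABLE employee (
    eid        INTEGER PRIMARY KEY,
    name       TEXT,
    dob        TEXT,
    hired_date TEXT,
    salary     INTEGER
);

INSERT INTO employee (eid, name, salary) VALUES
    (1, 'Ana', 40000),
    (2, 'Ben', 60000),
    (3, 'Cal', NULL);

SELECT COUNT(*)      AS all_rows,        -- 3: counts every tuple
       COUNT(salary) AS known_salaries,  -- 2: the NULL salary is skipped
       SUM(salary)   AS total,           -- 100000
       AVG(salary)   AS average          -- 50000: 100000 / 2, not 100000 / 3
FROM employee;
```

Note that AVG divides by the number of non-NULL salaries, so an unknown salary is not treated as zero.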
A grouping query is executed on a relation with the GROUP BY clause. GROUP BY is followed by an attribute list, and tuples that agree on every attribute in the list are placed in the same group. To see how grouping works, let’s say that the company from our earlier examples is a shipping company and owns trucks. The company keeps track of truck purchases and their usage in relations that include the following:
truck(tid, make, model, year, purch_year, purch_quarter, cost). In a simple example we could query the database and ask in which years the company purchased trucks:
SELECT purch_year FROM truck GROUP BY purch_year;
If the query had a WHERE clause, GROUP BY would come after it, but in this case we didn’t need one so it was omitted. The result of this query is a single-attribute relation (a single-column table) whose values are the years in which the company purchased at least one truck. If the company purchased more than one truck in a year, all of those purchases fall into the same group, so the year still appears only once.
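As a sketch of how WHERE and GROUP BY combine, the following filters the tuples first and then groups what remains. It assumes a SQLite- or PostgreSQL-style dialect; the column types, makes, models, and costs are invented for illustration.

```sql
-- Hypothetical sample data for the truck relation.
DROP TABLE IF EXISTS truck;
CREATE TABLE truck (
    tid           INTEGER PRIMARY KEY,
    make          TEXT,
    model         TEXT,
    year          INTEGER,
    purch_year    INTEGER,
    purch_quarter INTEGER,
    cost          INTEGER
);

INSERT INTO truck (tid, make, model, year, purch_year, purch_quarter, cost) VALUES
    (1, 'Volvo', 'VNL',    2015, 2015, 1, 120000),
    (2, 'Volvo', 'VNL',    2015, 2015, 3, 125000),
    (3, 'Mack',  'Anthem', 2018, 2018, 2, 140000),
    (4, 'Volvo', 'VNR',    2019, 2019, 4, 110000);

-- WHERE filters individual tuples before the groups are formed.
SELECT purch_year, COUNT(*) AS trucks_bought
FROM truck
WHERE make = 'Volvo'
GROUP BY purch_year
ORDER BY purch_year;
-- Result: (2015, 2) and (2019, 1); the 2018 Mack purchase is filtered out.
```

The two Volvo purchases in 2015 collapse into a single result tuple, which is the grouping behavior described above.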
Only the attributes that are listed in the GROUP BY clause can appear unaggregated in the SELECT clause. The next two queries tell the company in which years and quarters it purchased trucks, and the total amount spent in each year.
SELECT purch_year, purch_quarter FROM truck GROUP BY purch_year, purch_quarter ORDER BY purch_quarter, purch_year;
SELECT purch_year, SUM(cost) AS annualcost FROM truck GROUP BY purch_year;
The first query returns a relation of 2-tuples with the first element being purch_year and the second purch_quarter (the result is sorted first by quarter, then by year). The second query also returns 2-tuples, but the elements are purch_year and annualcost, which is the aggregated (by simple addition in this case) cost of trucks for each year. Because cost is not in the GROUP BY clause, it must be aggregated.
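The rule about unaggregated attributes can be illustrated with a query that groups on two attributes and aggregates the rest. This is a minimal sketch assuming a SQLite- or PostgreSQL-style dialect; the sample rows and the output names trucks and quartercost are hypothetical.

```sql
-- Hypothetical sample data for the truck relation.
DROP TABLE IF EXISTS truck;
CREATE TABLE truck (
    tid           INTEGER PRIMARY KEY,
    make          TEXT,
    model         TEXT,
    year          INTEGER,
    purch_year    INTEGER,
    purch_quarter INTEGER,
    cost          INTEGER
);

INSERT INTO truck (tid, make, model, year, purch_year, purch_quarter, cost) VALUES
    (1, 'Volvo', 'VNL',     2015, 2015, 1, 20000),
    (2, 'Mack',  'Anthem',  2015, 2015, 1, 30000),
    (3, 'Volvo', 'VNR',     2015, 2015, 3, 25000),
    (4, 'Mack',  'Granite', 2018, 2018, 2, 40000);

-- purch_year and purch_quarter are grouping attributes, so they may appear
-- as-is; cost is not, so it appears only inside aggregates.
SELECT purch_year,
       purch_quarter,
       COUNT(*)  AS trucks,
       SUM(cost) AS quartercost
FROM truck
GROUP BY purch_year, purch_quarter
ORDER BY purch_year, purch_quarter;
-- Result: (2015, 1, 2, 50000), (2015, 3, 1, 25000), (2018, 2, 1, 40000)

-- By contrast, this query violates the rule because make is neither grouped
-- nor aggregated. PostgreSQL rejects it; some systems, such as SQLite,
-- accept it and return the make of an arbitrary tuple from each group.
-- SELECT purch_year, make FROM truck GROUP BY purch_year;
```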
The HAVING Clause
In our last set of queries, we asked for the costs of trucks purchased on an annual basis. Suppose for some reason the company wanted to find the average cost of the trucks purchased in a year, but only for years in which every truck purchased cost over $15,000. To ask this question we can use the HAVING clause:
SELECT purch_year, AVG(cost) as avgcost FROM truck GROUP BY purch_year HAVING MIN(cost) > 15000;
The HAVING clause applies to each group. In this case, MIN is calculated on the attribute cost for each purchase year (purch_year) and then compared to 15,000. If the minimum of all costs in a purchase year is greater than 15,000, then the
purch_year shows up in the query result (a 2-tuple