SQL Aggregation, Grouping, and Having

Aggregating

In SQL, the standard aggregation operators are SUM, AVG, MIN, MAX, and COUNT, and we can apply them to the attributes of relations. For example, suppose we have a schema employee(eid, name, dob, hired_date, salary), which holds a record for each of a company’s employees. We can use COUNT to see how many employees the company has with the following query:

SELECT COUNT(*)
FROM employee;

The * is an argument to the COUNT operator and means to count all tuples in the relation. With the DISTINCT keyword we eliminate duplicate values before they are counted, so if we wanted to see whether any employees share a name, we could compare the previous query with the following:

SELECT COUNT(DISTINCT name)
FROM employee;

If the first count is larger than the second, then the company has at least two employees with the same name (assuming no employee has a NULL name, since the second query does not count NULLs). More generally, COUNT applied to an attribute skips tuples where that attribute is NULL, whereas COUNT(*) counts every tuple. AVG and SUM also leave NULL values out of their calculations. We can use AVG to find the average salary of our employees and SUM to calculate the total amount paid in employee salaries with these queries:

SELECT AVG(salary)
FROM employee;

SELECT SUM(salary)
FROM employee;
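
To make the NULL and DISTINCT behavior concrete, here is a minimal sketch that runs the queries above against an in-memory SQLite database through Python's standard sqlite3 module. The three sample rows are hypothetical, and the column types are an assumption, since the text gives only the attribute names.

```python
import sqlite3

# In-memory database; the column types are assumed, not given in the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (eid INTEGER PRIMARY KEY, name TEXT, "
    "dob TEXT, hired_date TEXT, salary REAL)"
)

# Hypothetical sample rows: two employees share a name, one salary is NULL.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ana", "1980-01-01", "2010-05-01", 50000.0),
        (2, "Ben", "1985-02-02", "2012-06-01", 70000.0),
        (3, "Ana", "1990-03-03", "2015-07-01", None),
    ],
)

total_rows, distinct_names, salaries, avg_salary, sum_salary = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT name), COUNT(salary), "
    "AVG(salary), SUM(salary) FROM employee"
).fetchone()

print(total_rows, distinct_names)        # 3 tuples, but only 2 distinct names
print(salaries, avg_salary, sum_salary)  # the NULL salary is left out of all three
```

With this data, COUNT(*) exceeds COUNT(DISTINCT name), which signals the duplicated name, and AVG divides the salary total by two rather than three because the NULL salary is skipped.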

Grouping

A grouping query is executed on a relation with the GROUP BY clause. GROUP BY is followed by an attribute list, and tuples that agree on every attribute in the list are placed in the same group. To see how grouping works, let’s say that the company from our earlier examples is a shipping company and owns trucks. The company keeps track of truck purchases with the following relation: truck(tid, make, model, year, purch_year, purch_quarter, cost). In a simple example we could query the database and ask in which years the company purchased trucks:

SELECT purch_year
FROM truck
GROUP BY purch_year;

GROUP BY would come after a WHERE clause, but in this case we didn’t need one so it was omitted. The result of this query would be a single-attribute relation (a single-column table) with values reflecting the years in which the company purchased at least one truck. If the company purchased more than one truck in a year, all of that year’s purchases fall into the same group, so the year still appears only once in the result.
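
As a small illustration of that collapsing behavior, the sketch below runs the query against an in-memory SQLite database using Python's standard sqlite3 module. The rows, the placeholder make and model names, and the column types are all hypothetical, and an ORDER BY is added only so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE truck (tid INTEGER PRIMARY KEY, make TEXT, model TEXT, "
    "year INTEGER, purch_year INTEGER, purch_quarter INTEGER, cost REAL)"
)

# Hypothetical purchases: two trucks bought in 2019, one in 2021.
conn.executemany(
    "INSERT INTO truck VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "MakeA", "ModelX", 2018, 2019, 1, 20000.0),
        (2, "MakeA", "ModelX", 2019, 2019, 3, 18000.0),
        (3, "MakeB", "ModelY", 2020, 2021, 2, 12000.0),
    ],
)

# ORDER BY is not part of the query in the text; it only fixes the row order.
purchase_years = [
    row[0]
    for row in conn.execute(
        "SELECT purch_year FROM truck GROUP BY purch_year ORDER BY purch_year"
    )
]
print(purchase_years)  # [2019, 2021]
```

Three tuples go in, but only two come out: both 2019 purchases belong to one group.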

Only the attributes that are listed in the GROUP BY clause can appear unaggregated in the SELECT clause. The next two queries would tell the company, respectively, in which years and quarters it purchased trucks and the total amount it spent on trucks in each year.

SELECT purch_year, purch_quarter
FROM truck
GROUP BY purch_year, purch_quarter
ORDER BY purch_quarter, purch_year;

SELECT purch_year, SUM(cost) as annualcost
FROM truck
GROUP BY purch_year;

The first query would return a relation of 2-tuples whose first element is purch_year and whose second is purch_quarter (the result is sorted first by quarter, then by year). The second query would also return a relation of 2-tuples, but the elements would be purch_year and annualcost, which is the sum of the costs of all trucks purchased in that year. Because cost is not in the GROUP BY clause, it must be aggregated.
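
The following sketch runs both grouping queries on a handful of hypothetical rows in an in-memory SQLite database, via Python's standard sqlite3 module. It assumes the quarter attribute is named purch_quarter, as in the truck schema, and it adds an ORDER BY to the second query only to make the row order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE truck (tid INTEGER PRIMARY KEY, make TEXT, model TEXT, "
    "year INTEGER, purch_year INTEGER, purch_quarter INTEGER, cost REAL)"
)

# Hypothetical purchases: three trucks in 2019 (Q1, Q1, Q3), two in 2021 (Q2).
conn.executemany(
    "INSERT INTO truck VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "MakeA", "ModelX", 2018, 2019, 1, 20000.0),
        (2, "MakeA", "ModelX", 2019, 2019, 1, 18000.0),
        (3, "MakeB", "ModelY", 2019, 2019, 3, 16000.0),
        (4, "MakeB", "ModelY", 2020, 2021, 2, 12000.0),
        (5, "MakeC", "ModelZ", 2021, 2021, 2, 30000.0),
    ],
)

# One row per distinct (purch_year, purch_quarter) pair, sorted by quarter first.
year_quarters = conn.execute(
    "SELECT purch_year, purch_quarter FROM truck "
    "GROUP BY purch_year, purch_quarter "
    "ORDER BY purch_quarter, purch_year"
).fetchall()

# One row per year, with cost summed over every truck bought that year.
annual_costs = conn.execute(
    "SELECT purch_year, SUM(cost) AS annualcost FROM truck "
    "GROUP BY purch_year ORDER BY purch_year"
).fetchall()

print(year_quarters)  # [(2019, 1), (2021, 2), (2019, 3)]
print(annual_costs)   # [(2019, 54000.0), (2021, 42000.0)]
```

Note that the two Q1 2019 purchases collapse into a single (2019, 1) row in the first result, while the second result has exactly one row per purchase year.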

The HAVING Clause

In our last set of queries, we asked for the cost of trucks purchased on an annual basis. Suppose the company wanted to find the average cost of the trucks purchased in a year, but only for years in which every truck purchased cost over $15,000. To ask this question we can use the HAVING clause.

SELECT purch_year, AVG(cost) as avgcost
FROM truck
GROUP BY purch_year
HAVING MIN(cost) > 15000;

The HAVING clause applies to each group. In this case, MIN is calculated on the attribute cost for each purchase year (purch_year) and then compared to 15,000. If the minimum cost in a purchase year is greater than 15,000, then every truck purchased that year cost more than 15,000, and the year shows up in the query result as a 2-tuple (purch_year, avgcost).
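
To see the per-group filtering in action, here is a minimal sketch that runs the HAVING query on hypothetical data in an in-memory SQLite database, using Python's standard sqlite3 module. The rows, placeholder make and model names, and column types are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE truck (tid INTEGER PRIMARY KEY, make TEXT, model TEXT, "
    "year INTEGER, purch_year INTEGER, purch_quarter INTEGER, cost REAL)"
)

# Hypothetical purchases: every 2019 truck cost over 15000; one 2021 truck did not.
conn.executemany(
    "INSERT INTO truck VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "MakeA", "ModelX", 2018, 2019, 1, 20000.0),
        (2, "MakeA", "ModelX", 2019, 2019, 1, 18000.0),
        (3, "MakeB", "ModelY", 2019, 2019, 3, 16000.0),
        (4, "MakeB", "ModelY", 2020, 2021, 2, 12000.0),
        (5, "MakeC", "ModelZ", 2021, 2021, 2, 30000.0),
    ],
)

# HAVING is evaluated once per group, after the groups have been formed.
expensive_years = conn.execute(
    "SELECT purch_year, AVG(cost) AS avgcost FROM truck "
    "GROUP BY purch_year "
    "HAVING MIN(cost) > 15000"
).fetchall()

print(expensive_years)  # [(2019, 18000.0)]
```

The 2021 group is dropped as a whole because its cheapest truck cost 12,000, even though its other truck cost well over the threshold.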