Aggregating
In SQL, the standard aggregation operators are SUM
, MIN
, MAX
, and COUNT
and we can apply them to the attributes of relations. For example we have a schema employee(eid, name, dob, hired_date, salary)
, which holds records on each of a company’s employees. We can COUNT
and see how many employees the company has with the following query:
SELECT COUNT(*) FROM employee;
The *
is an argument to the count operator and means to count all tuples in the relation. With the DISTINCT
operator we eliminate duplicates for the query result, so if we wanted to see if any of the employees had the same name we could compare the previous query with the following:
SELECT COUNT(DISTINCT name) FROM employee;
If the first count is larger than the second, then the company has at least two employees with the same name. COUNT
also does not count tuple attributes with NULL
value, and AVG
and SUM
don’t factor them in when calculating either. We can use AVG
to find the average salary of our employees and SUM
to calculate the total amount paid for employee salaries with these queries:
SELECT AVG(salary) FROM (employee); SELECT SUM(salary) FROM (employee);
Grouping
A grouping query is executed on a relation with the GROUP BY
clause. GROUP BY
is followed by an attribute list and the resulting tuples are grouped successively according to the list. To see how grouping works, let’s say that the company from our earlier examples is a shipping company and owns trucks. The company keeps track of truck purchases and their usage with the following relations: truck(tid, make, model, year, purch_year, purch_quarter, cost)
. In a simple example we could query the database and ask which years the company purchased trucks:
SELECT purch_year FROM truck GROUP BY purch_year;
GROUP BY
would come after a WHERE
clause, but in this case we didn’t need one so it was omitted. The result of this query would be a single attribute relation (a single column table) with values reflecting the years in which the company purchased at least one truck. If the company purchased more that one truck in that year that purchase would be aggregated with all others in that year.
Only the attributes that are listed in the GROUP BY
clause can appear unaggregated in the SELECT
clause. The next two queries would tell the company which years and in which quarters they purchased trucks and the total amount spent in each quarter.
SELECT purch_year, quarter FROM truck GROUP BY year, quarter ORDER BY quarter, year; SELECT purch_year, SUM(cost) as annualcost FROM truck GROUP BY purch_year;
The first query would return a relation of 2-tuples with the first element being purch_year
and the second quarter
(the result is also first sorted by quarter, then by year). The second query would also return a 2-tuple but the elements would be purch_year
and annualcost
which is the aggregated (by simple addition in this case) cost of trucks for each year. Because cost
is not in the GROUP BY
clause, it must be aggregated.
The HAVING Clause
In our last set of queries, we asked for the costs of trucks purchased on an annual basis. Suppose for some reason the company wanted to find the average cost spent in a year only in years where every truck purchased in that year cost over $15,000. To ask this question we can use the HAVING
clause.
SELECT purch_year, AVG(cost) as avgcost FROM truck GROUP BY purch_year HAVING MIN(cost) > 15000;
The HAVING
clause applies to each group. In this case, MIN
is calculated on the attribute cost
for each purchase year (purch_year
) then compared to 15,000. If the minimum value for all costs in a purchase year are greater that 15,000 then the purch_year
shows up in the query result (a 2-tuple (purch_year, avgcost)
).