In location based web applications we come across a problem that requires us to fetch some information from our database that is at some certain distance from a particular location. Generally this problem can be defined as following.

Give me all the things near location

Queries like “Fetch all the nearest friends”, “Fetch all the nearest restaurants” or “Fetch all the closest software developers” are essentially same as above.

If we analyze above problem, we see that we have three keywords in it viz. things, near and location. things is the data-set in which we are looking for information. It is usually a table in our database. It could be a users or restaurants table. The keyword near is the amount of distance within which we need to look into our data-set for information. It is given by a user using our application or already defined by us. Normally it is a few kilometers or miles. The third keyword, location is the place from which we need to measure the nearness of records in our data-set. This is also given by users using our application. It is identified by latitude and longitude on earth’s surface.

If we analyze above problem from engineering point of view, we can define it as following.

Give me all the things
order by distance from location
where distance < kilometers

We are actually looking for records sorted by their distance from location. And we are considering only those records that have distance less than specified by user in kilometers or miles. Out of four keywords in the problem, we know three already viz. things — a table in our database, location and kilometers — provided by user. The only thing that we do not know is the distance between location and records in our data-set. Lets find a way to calculate the distance.

We have latitude and longitude of all the records in our data-set. For now lets assume that we have a hotels table in our database that has following structure.

id name latitude longitude
1 Le Meridien 6.4420085 3.415344
2

We also have location of the user in the form of latitude and longitude.

The distance between two points on the surface of a sphere is given by Haversine Formula. Distance is given by following formula.

In above formula,

d = Distance between two points.
R = Radius of the sphere on which the points are. In our case it is earth with 3959 Miles or 6371 Kilometers.
lat1 and long1 are co-ordinates of the first point and lat2 and long2 are co-ordinates of second point in radians.

The corresponding MySQL statement to calculate distance between locations orig and dest:

  6371 * 2 * ASIN(SQRT(POWER(SIN(RADIANS(orig.lat - ABS(dest.lat))), 2) + COS(RADIANS(orig.lat)) * COS(RADIANS(ABS(dest.lat))) * POWER(SIN(RADIANS(orig.long - dest.long)), 2)))

And our complete MySQL query to get all the hotels that are within 10 kilometers from location identified by orig.

SELECT id, name,
6371 * 2 * ASIN(SQRT(POWER(SIN(RADIANS(orig.lat - ABS(hotels.latitude))), 2) + COS(RADIANS(orig.lat)) * COS(RADIANS(ABS(hotels.latitude))) * POWER(SIN(RADIANS(orig.long - hotels.longitude)), 2))) AS distance
FROM hotels
HAVING distance < 10
ORDER BY distance LIMIT 10;

Benchmark

If we run EXPLAIN on the previous query we do not get very pleasant results.

mysql> EXPLAIN query \G
           id: 1
  select_type: SIMPLE
        table: hotels
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10574
        Extra: Using filesort
1 row in set (0.00 sec)

This query scanned the whole table! It went through all the 10574 records in the table one by one and found the distance of that record from the location and then sorted all the records by its distance from it. This is not scalable for large data-sets.

Optimization

If we analyze our original problem once again.

Give me all the things
order by distance from location
where distance < kilometers

Earlier we scanned whole table to find out the distance between each record and location specified by the user. But we are looking for records that are within the kilometers as specified by the user. If somehow we can find a relationship that describes how co-ordinates change with change in distance, we can get the co-ordinates of hotels that may be farthest from the location specified by the user. These farthest co-ordinates will form a boundary within which all other records will reside. Then we need to look only for the records that are within this boundary. It turns out to be that we do have relationship between co-ordinates and distance.

1Ëš of latitude ~= 111.04 Kms
1Ëš of longitude ~= cos(latitude) * 111.04 Kms

We calculate the co-ordinates of the bounding box using above relationship between distance and co-ordinate change. We do this for location from orig and allowed distance given by dist

long1 = orig.long – dist / ABS(COS(RADIANS(orig.lat)) * 111.04)
long2 = orig.long + dist / ABS(COS(RADIANS(orig.lat)) * 111.04)

lat1 = orig.lat – dist / (111.04)
lat2 = orig.lat + dist / (111.04)

The bounding box will encompass the circle whose center and radius is given by user. Following figure gives us an idea how it will look like.

The black dot is center of the circle it is also the location of the user. Radius of the circle is allowed kilometers withing which we need to look. Since the box is larger than the circle, it will encompass some records that do not fall within the circle. This may give use some false hits. The red dots denote false hits. The green dots denote correct hits.

For faster lookups, we define indexes on latitude and longitude columns in our table.

CREATE INDEX `index_hotels_on_latitude_and_longitude` ON `hotels` (`latitude`, `longitude`)

We also change our original query to find the records within above box whose boundaries are defined by lat1, long1 and lat2, long2.

SELECT id, name,
6371 * 2 * ASIN(SQRT(POWER(SIN(RADIANS(orig.lat - ABS(hotels.latitude))), 2) + COS(RADIANS(orig.lat)) * COS(RADIANS(ABS(hotels.latitude))) * POWER(SIN(RADIANS(orig.long - hotels.longitude)), 2))) AS distance
FROM hotels
WHERE hotels.latitude BETWEEN lat1 AND lat2 AND hotels.longitude BETWEEN long1 AND long2
HAVING distance < 10
ORDER BY distance LIMIT 10;

If we run EXPLAIN on above query.

mysql> EXPLAIN query \G
           id: 1
  select_type: SIMPLE
        table: listings
         type: range
possible_keys: index_hotels_on_latitude_and_longitude
          key: index_hotels_on_latitude_and_longitude
      key_len: 10
          ref: NULL
         rows: 825
        Extra: Using where, Using filesort
1 row in set (0.00 sec)

We see that number of rows now scanned are far more less than earlier. This time it scanned only 825 rows. In fact these are the number of records that are within the bounding box. So the number of rows that are to be scanned depend upon the number of records falling within the box. If our application has records that are concentrated around a place, number of scanned rows will be higher for query executed around that place.

Using MySQL Spatial Data types

MySQL natively supports Spatial Data Types. Lets use these data types to solve our problem.

ALTER TABLE hotels ADD location POINT NOT NULL;

UPDATE hotels SET location = POINT(hotels.latitude, hotels.longitude);

ALTER TABLE hotels ADD SPATIAL KEY (location);

First we add a location column in our hotels table of POINT type. We also update the location column to have existing co-ordinates of our records. Finally we add a spatial key on the location column.

As of now MySQL does not have any standard GIS function that calculates distance between two points on the surface of a sphere. Lets define our own.

DELIMITER $$
 CREATE FUNCTION distance (a POINT, b POINT) RETURNS double DETERMINISTIC
   BEGIN
     RETURN 6371 * 2 * ASIN(SQRT(POWER(SIN(RADIANS(ABS(X(a)) - ABS(X(b)))), 2) + COS(RADIANS(ABS(X(a)))) * COS(RADIANS(ABS(X(b)))) * POWER(SIN(RADIANS(Y(a) - Y(b))), 2)));
   END  $$
DELIMITER ;

We use the same haversine formula to calculate distance between point a and point b.

Since we are now using Point data type to store locations of hotels in our database, we will use MySQL’s GIS functions to specify bounding box.

SET @bbox = 'POLYGON((lat1 long1, lat1 long2, lat2 long2, lat2 long1, lat1 long1))'

Above defines a polygon that has four vertices defined by (lat1 long1), (lat1 long2), (lat2 long2) and (lat2 long1). The last coordinate explicitly specifies that it is a closed polygon. The closing point is same as the first vertice of the polygon. Variables lat1, lat2, long1 and long2 are the same that we calculated earlier.

Now we update our SQL query to use POINT data types.

SELECT id, name, distance(hotels.location, POINT(orig.lat, orig.long)) AS cdist
FROM hotels
WHERE INTERSECTS(hotels.location, PolygonFromText(@bbox))
HAVING cdist < 10
ORDER BY cdist LIMIT 10;

Above, we are using INTERSECTS function provided by MySQL to find out whether location of a hotel falls within the bounnding box or not. PolygonFromText simply returs a polygon defined in a string. Lets run EXPLAIN on above query.

mysql> EXPLAIN query \G
           id: 1
  select_type: SIMPLE
        table: hotels
         type: range
possible_keys: location
          key: location
      key_len: 34
          ref: NULL
         rows: 70
        Extra: Using where; Using temporary; Using filesort

The number of rows scanned is significantly decreased. We can see that storing coordinates in POINT data type has improved the performance.

Following are some pointers that might be helpful exploring MySQL and Proximity searching.

Share this:

Privacy Preference Center