In location based web applications we come across a problem that requires us to fetch some information from our database that is at some certain distance from a particular location. Generally this problem can be defined as following.

Give me all the

things near location

Queries like “Fetch all the nearest friends”, “Fetch all the nearest restaurants” or “Fetch all the closest software developers” are essentially same as above.

If we analyze above problem, we see that we have three keywords in it viz. **things**, **near** and **location**. **things** is the data-set in which we are looking for information. It is usually a table in our database. It could be a users or restaurants table. The keyword **near** is the amount of distance within which we need to look into our data-set for information. It is given by a user using our application or already defined by us. Normally it is a few kilometers or miles. The third keyword, **location** is the place from which we need to measure the nearness of records in our data-set. This is also given by users using our application. It is identified by latitude and longitude on earth’s surface.

If we analyze above problem from engineering point of view, we can define it as following.

Give me all the

things

order bydistancefromlocation

wheredistance<kilometers

We are actually looking for records sorted by their **distance** from **location**. And we are considering only those records that have **distance** less than specified by user in kilometers or miles. Out of four keywords in the problem, we know three already viz. **things** — a table in our database, **location** and **kilometers** — provided by user. The only thing that we do not know is the **distance** between **location** and records in our data-set. Lets find a way to calculate the **distance**.

We have latitude and longitude of all the records in our data-set. For now lets assume that we have a **hotels** table in our database that has following structure.

id | name | latitude | longitude |
---|---|---|---|

1 | Le Meridien | 6.4420085 | 3.415344 |

2 | … | … | … |

We also have **location** of the user in the form of latitude and longitude.

The distance between two points on the surface of a sphere is given by Haversine Formula. Distance is given by following formula.

In above formula,

d = Distance between two points.

R = Radius of the sphere on which the points are. In our case it is earth with 3959 Miles or 6371 Kilometers.

lat1 and long1 are co-ordinates of the first point and lat2 and long2 are co-ordinates of second point in radians.

The corresponding MySQL statement to calculate distance between locations **orig** and **dest**:

6371 * 2 * ASIN(SQRT(POWER(SIN(RADIANS(orig.lat - ABS(dest.lat))), 2) + COS(RADIANS(orig.lat)) * COS(RADIANS(ABS(dest.lat))) * POWER(SIN(RADIANS(orig.long - dest.long)), 2)))

And our complete MySQL query to get all the hotels that are within 10 kilometers from location identified by **orig**.

SELECT id, name, 6371 * 2 * ASIN(SQRT(POWER(SIN(RADIANS(orig.lat - ABS(hotels.latitude))), 2) + COS(RADIANS(orig.lat)) * COS(RADIANS(ABS(hotels.latitude))) * POWER(SIN(RADIANS(orig.long - hotels.longitude)), 2))) AS distance FROM hotels HAVING distance < 10 ORDER BY distance LIMIT 10;

#### Benchmark

If we run **EXPLAIN** on the previous query we do not get very pleasant results.

mysql> EXPLAIN query \G id: 1 select_type: SIMPLE table: hotels type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 10574 Extra: Using filesort 1 row in set (0.00 sec)

This query scanned the whole table! It went through all the **10574** records in the table one by one and found the distance of that record from the location and then sorted all the records by its distance from it. This is not scalable for large data-sets.

#### Optimization

If we analyze our original problem once again.

Give me all the things

order by distance from location

wheredistance<kilometers

Earlier we scanned whole table to find out the distance between each record and location specified by the user. But we are looking for records that are within the kilometers as specified by the user. If somehow we can find a relationship that describes how co-ordinates change with change in distance, we can get the co-ordinates of hotels that may be farthest from the location specified by the user. These farthest co-ordinates will form a boundary within which all other records will reside. Then we need to look only for the records that are within this boundary. It turns out to be that we do have relationship between co-ordinates and distance.

1Ëš of latitude ~= 111.04 Kms

1Ëš of longitude ~= cos(latitude) * 111.04 Kms

We calculate the co-ordinates of the bounding box using above relationship between distance and co-ordinate change. We do this for location from **orig** and allowed **distance** given by **dist**

long1 = orig.long â€“ dist / ABS(COS(RADIANS(orig.lat)) * 111.04) long2 = orig.long + dist / ABS(COS(RADIANS(orig.lat)) * 111.04) lat1 = orig.lat â€“ dist / (111.04) lat2 = orig.lat + dist / (111.04)

The bounding box will encompass the circle whose center and radius is given by user. Following figure gives us an idea how it will look like.

The black dot is center of the circle it is also the location of the user. Radius of the circle is allowed kilometers withing which we need to look. Since the box is larger than the circle, it will encompass some records that do not fall within the circle. This may give use some false hits. The red dots denote false hits. The green dots denote correct hits.

For faster lookups, we define indexes on latitude and longitude columns in our table.

CREATE INDEX `index_hotels_on_latitude_and_longitude` ON `hotels` (`latitude`, `longitude`)

We also change our original query to find the records within above box whose boundaries are defined by lat1, long1 and lat2, long2.

SELECT id, name, 6371 * 2 * ASIN(SQRT(POWER(SIN(RADIANS(orig.lat - ABS(hotels.latitude))), 2) + COS(RADIANS(orig.lat)) * COS(RADIANS(ABS(hotels.latitude))) * POWER(SIN(RADIANS(orig.long - hotels.longitude)), 2))) AS distance FROM hotels WHERE hotels.latitude BETWEEN lat1 AND lat2 AND hotels.longitude BETWEEN long1 AND long2 HAVING distance < 10 ORDER BY distance LIMIT 10;

If we run **EXPLAIN** on above query.

mysql> EXPLAIN query \G id: 1 select_type: SIMPLE table: listings type: range possible_keys: index_hotels_on_latitude_and_longitude key: index_hotels_on_latitude_and_longitude key_len: 10 ref: NULL rows: 825 Extra: Using where, Using filesort 1 row in set (0.00 sec)

We see that number of rows now scanned are far more less than earlier. This time it scanned only **825** rows. In fact these are the number of records that are within the bounding box. So the number of rows that are to be scanned depend upon the number of records falling within the box. If our application has records that are concentrated around a place, number of scanned rows will be higher for query executed around that place.

#### Using MySQL Spatial Data types

MySQL natively supports Spatial Data Types. Lets use these data types to solve our problem.

ALTER TABLE hotels ADD location POINT NOT NULL; UPDATE hotels SET location = POINT(hotels.latitude, hotels.longitude); ALTER TABLE hotels ADD SPATIAL KEY (location);

First we add a **location** column in our hotels table of POINT type. We also update the location column to have existing co-ordinates of our records. Finally we add a spatial key on the location column.

As of now MySQL does not have any standard GIS function that calculates distance between two points on the surface of a sphere. Lets define our own.

DELIMITER $$ CREATE FUNCTION distance (a POINT, b POINT) RETURNS double DETERMINISTIC BEGIN RETURN 6371 * 2 * ASIN(SQRT(POWER(SIN(RADIANS(ABS(X(a)) - ABS(X(b)))), 2) + COS(RADIANS(ABS(X(a)))) * COS(RADIANS(ABS(X(b)))) * POWER(SIN(RADIANS(Y(a) - Y(b))), 2))); END $$ DELIMITER ;

We use the same haversine formula to calculate distance between point a and point b.

Since we are now using Point data type to store locations of hotels in our database, we will use MySQL’s GIS functions to specify bounding box.

SET @bbox = 'POLYGON((lat1 long1, lat1 long2, lat2 long2, lat2 long1, lat1 long1))'

Above defines a polygon that has four vertices defined by (lat1 long1), (lat1 long2), (lat2 long2) and (lat2 long1). The last coordinate explicitly specifies that it is a closed polygon. The closing point is same as the first vertice of the polygon. Variables lat1, lat2, long1 and long2 are the same that we calculated earlier.

Now we update our SQL query to use POINT data types.

SELECT id, name, distance(hotels.location, POINT(orig.lat, orig.long)) AS cdist FROM hotels WHERE INTERSECTS(hotels.location, PolygonFromText(@bbox)) HAVING cdist < 10 ORDER BY cdist LIMIT 10;

Above, we are using INTERSECTS function provided by MySQL to find out whether location of a hotel falls within the bounnding box or not. PolygonFromText simply returs a polygon defined in a string. Lets run EXPLAIN on above query.

mysql> EXPLAIN query \G id: 1 select_type: SIMPLE table: hotels type: range possible_keys: location key: location key_len: 34 ref: NULL rows: 70 Extra: Using where; Using temporary; Using filesort

The number of rows scanned is significantly decreased. We can see that storing coordinates in POINT data type has improved the performance.

Following are some pointers that might be helpful exploring MySQL and Proximity searching.

Good work

The complexity has already been handled in Geocoder gem

https://github.com/alexreisner/geocoder

Great article!

Hi,

very good article!

I guess Eddy is right. The division through two in und SIN seems to be missing so that the distance formula looks like this:

DELIMITER $$Â CREATE FUNCTION distance (a POINT, b POINT) RETURNS double DETERMINISTICÂ Â BEGINÂ Â Â RETURN 6375 * 2 * ASIN( SQRT( POWER(SIN( RADIANS( X(a) – ABS(X(b)) ) / 2 ), 2)Â Â Â Â Â Â Â Â Â + COS(RADIANS(ABS(X(a))))Â Â Â Â Â Â Â Â Â * COS(RADIANS(ABS(X(b))))Â Â Â Â Â Â Â Â Â * POWER(SIN(RADIANS(Y(a) – Y(b))), 2)));Â Â END Â $$DELIMITER ;

Great article, but there appears to be an error in your statement of the distance formula (which carries through into the code). Shouldn’t the differences of the co-ordinates in the sin squared terms be divided by two? Shoot me if I’m wrong 🙂

Awsome Article. Thanks for providing such a detailed article. This is life saver for me. This helps me a lot to optimize my near by location query.Â

Very Nice. Thx! Just a little mistake IMHO, POINT mysql function takes longitude (x-axis) as first argument and latitude (y-axis) as Second argument:Â

http://dev.mysql.com/doc/refman/5.6/en/gis-class-point.htmlÂ Â

Anybody ever have a problem with southern hemisphere coordinates?