
How to process three weeks of GPS data


Handling three weeks of GPS data collected on an RV trip through the American West is quite a challenge. Roughly 930,000 GPS positions are far too many to process directly. There are lots of wrong fixes that need to be removed, lots of identical or nearly identical positions, and a huge number of fixes located on a straight line. Feeding the data to Google Earth unfiltered makes GE draw nonsense, lock up, or crash. The gpsbabel filters don't help either. Thus, I wrote my own filter. It performs several steps:

Simple filtering

First, invalid records and those without enough accuracy need to be removed. This is the easy part: the record has to be marked as "valid" and at least three satellites have to be in view. Also, a record which is almost identical to the one before it can be dropped right away.
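A minimal sketch of this first pass, in Python for readability (the actual script is Perl). The record fields, the 2 m "almost identical" threshold, and the haversine helper are my own assumptions for illustration, not the script's exact logic:

```python
import math

MIN_SATELLITES = 3
MIN_DISTANCE_M = 2.0  # assumed threshold for "almost identical"

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points."""
    r = 6371000.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def simple_filter(records):
    """Yield records that are marked valid, have enough satellites
    in view, and are not almost identical to the last kept record."""
    prev = None
    for rec in records:
        if not rec["valid"] or rec["satellites"] < MIN_SATELLITES:
            continue
        if prev is not None and haversine_m(prev["lat"], prev["lon"],
                                            rec["lat"], rec["lon"]) < MIN_DISTANCE_M:
            continue
        prev = rec
        yield rec
```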

Validating fixes

It is probably impossible to validate a set of fixes without taking into account how they were taken. A distance of 20 m covered within one second is not particularly spectacular when driving a car; I seriously doubt it would be possible on foot, though. Discarding records which would require an impossibly high speed eliminates most erroneous fixes quite reliably. But sometimes this is not good enough, especially at lower speeds: for example, it cannot catch that I certainly was not driving 75 mph on the Las Vegas Boulevard. Actually, you cannot tell without a map of speed limits. But luckily, a vehicle has a maximum acceleration, and checking that records do not exceed this value removes the remaining spectacularly incorrect measurements. A test for a maximum negative acceleration (braking) probably does not make that much sense, though.
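Sketched in Python, the two plausibility checks could look like this. The defaults mirror the -v 140 / -a 6 options in the run further down, assuming those mean km/h and m/s² (my assumption); the field names and the equirectangular distance helper are illustrative:

```python
import math

def dist_m(lat1, lon1, lat2, lon2):
    """Equirectangular approximation; fine for adjacent fixes."""
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return 6371000.0 * math.hypot(x, y)

def plausible(prev, cur, max_speed_ms=140 / 3.6, max_accel_ms2=6.0):
    """Reject cur if reaching it from prev would require an impossibly
    high speed, or an acceleration no vehicle can manage."""
    dt = cur["time"] - prev["time"]
    if dt <= 0:
        return False
    speed = dist_m(prev["lat"], prev["lon"], cur["lat"], cur["lon"]) / dt
    if speed > max_speed_ms:
        return False
    if (speed - prev.get("speed", 0.0)) / dt > max_accel_ms2:
        return False
    cur["speed"] = speed  # remember for the next acceleration check
    return True
```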

Surprisingly, with the Navilock USB GPS receiver the number of entirely wrong locations is extremely low: one record with an impossibly high speed, and 28 exceeding the maximum acceleration. The logs from the Garmin Geko had 1552 records above maximum speed and 1766 above maximum acceleration out of 379,445 records, with a lot of errors still undiscovered due to the strongly varying fix rate. But at least this eliminates the worst errors.

What I wasn't able to recognize automatically were the wrong fixes at very low speed, as illustrated by my last article on the subject. Does anyone have a good idea?

Smoothing curves

One lesson I learned: you need to know what to google for, otherwise you'll reinvent the wheel. Even after the steps above, far too many valid fixes remain. How about finding straight lines as long as possible and removing any records located on or sufficiently near the line? Unfortunately, c't 19/2008 arrived too late, otherwise I wouldn't have implemented (effectively) a non-recursive version of the Ramer-Douglas-Peucker algorithm. Well, recursion in Perl sucks anyway, and my version always yields the longest possible line, as I don't have to choose an end point. It is faster for short segments, but a lot slower for long ones. The results should be quite similar, though.
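For reference, the textbook recursive Ramer-Douglas-Peucker algorithm looks roughly like this (a Python sketch with planar coordinates for simplicity; this is the classic variant, not my longest-line version):

```python
import math

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b (planar)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Cross-product magnitude divided by the segment length.
    return abs(dy * (px - ax) - dx * (py - ay)) / math.hypot(dx, dy)

def rdp(points, epsilon):
    """Recursively keep only points more than epsilon away from the
    line through the current segment's endpoints."""
    if len(points) < 3:
        return list(points)
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax <= epsilon:
        return [points[0], points[-1]]
    left = rdp(points[: idx + 1], epsilon)
    right = rdp(points[idx:], epsilon)
    return left[:-1] + right  # drop the duplicated split point
```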

What gave me quite a headache was that I couldn't remember how to calculate the distance of a point to a line. Embarrassing, isn't it? But I was never good at geometry. With a sheet of squared paper I finally figured it out, though. We don't even need trigonometric functions. Well, strictly speaking we do, as we are operating on polar coordinates; luckily, sin(x) ≃ x for small values of x, and we are talking about small differences between angles. I don't have to tell you that I'm not very good at trigonometry, either. Geo::Distance (used to calculate the distance between two points) uses the Haversine formula by default. The Polar Coordinate Flat-Earth formula would probably be sufficient, but I only read the perldoc after I had already processed all the data. The array operations are much more expensive than the actual calculations anyway.
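Sketched in Python, the flat-earth trick works like this: project the fixes into local metres around one endpoint (this is where sin(x) ≃ x comes in), then apply the cross-product formula for the point-to-line distance. The helper names are mine:

```python
import math

EARTH_R = 6371000.0  # mean Earth radius in metres

def to_local_xy(lat, lon, lat0, lon0):
    """Project (lat, lon) to planar metres around (lat0, lon0),
    using the small-angle approximation sin(x) ≈ x."""
    x = math.radians(lon - lon0) * math.cos(math.radians(lat0)) * EARTH_R
    y = math.radians(lat - lat0) * EARTH_R
    return x, y

def point_to_line_m(p, a, b):
    """Perpendicular distance in metres from fix p to the line
    through a and b, all given as (lat, lon) pairs."""
    bx, by = to_local_xy(b[0], b[1], a[0], a[1])
    px, py = to_local_xy(p[0], p[1], a[0], a[1])
    seg = math.hypot(bx, by)
    if seg == 0:
        return math.hypot(px, py)
    return abs(by * px - bx * py) / seg
```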

Putting it together

The script is written in Perl, though I usually prefer Python. However, CPAN has the Geo::* packages, and I haven't found any similar package for Python. Using a compiled program would greatly improve the execution time, but for rapid prototyping, scripting languages are best. Extending the script to read not only NMEA records but also GPX and gpsdrive log files was very easy; likewise, making everything configurable on the command line. The script writes a simple CSV file with the unix time stamp, a type flag (0=track point, 1=first point of a track segment), latitude, longitude and altitude. gpsbabel cannot process that file without losing the time stamp, and it insists on creating track points for every segment; a separate step converts the CSV into a GPX file which can be read by gpsbabel and other programs.
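A sketch of that output format in Python; the writer function and the segment data structure are made up for illustration, only the column order follows the description above:

```python
import csv
import io

def write_track_csv(fh, segments):
    """Write track segments in the layout described above:
    unix time stamp, type flag (0=track point, 1=first point of a
    track segment), latitude, longitude, altitude."""
    w = csv.writer(fh)
    for seg in segments:
        for i, (ts, lat, lon, alt) in enumerate(seg):
            w.writerow([ts, 1 if i == 0 else 0, lat, lon, alt])
```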


$ time bzcat usa-2008.nmea.bz2 | ../bin/ -s -i nmea -v 140 -a 6 -F /dev/null
/dev/stdin (nmea) => /dev/null (csv)
930067 records read
99975 records written
95 invalid records
646152 identical records
1 records with max speed exceeded
28 records with max acceleration exceeded
183675 records on straight line eliminated (approx.)

real 2m12.753s
user 2m10.044s
sys 0m2.440s
$ time bzcat usa-2008.nmea.bz2 >/dev/null
real 0m4.635s
user 0m4.576s
sys 0m0.032s

The resulting set of positions is smaller than the one mentioned in an earlier article, as I changed the algorithm that finds the distance of a point to a line (see above). It does not make any visible difference when rendering the track.

Show me the source!

Sure, if you can stand my terrible Perl coding style:

For documentation, read the source or run the script with --help. If you want to do console I/O on a certain non-Unix OS, you probably need to specify CON: as the input or output file, or modify the script accordingly.

