Wednesday, July 2, 2014

Code challenge: IP to country with redis

The challenge: 
Look up the country code for an IP address using redis and Python.

The solution:
A quick Google search found several free sources of data. I went with MaxMind's GeoLite: http://dev.maxmind.com/geoip/legacy/geolite/

I only care about the country, not the city, lat/lon, or timezone, so the "GeoLite Country" data will work. It is 95,000 lines of CSV data, updated once a month.

Each line has six fields: "start IP", "end IP", "start decimal", "end decimal", "country code", "country name". I'm using the csv module to parse it.
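As a quick illustration of that layout, here is a minimal sketch that parses two sample rows with the csv module. The rows are illustrative and may not match the current file, and parse_rows is a hypothetical helper name, not part of the module further down.

```python
import csv

# Illustrative rows in the six-column layout described above
# (sample values only; the real file may differ).
SAMPLE = (
    '"1.0.0.0","1.0.0.255","16777216","16777471","AU","Australia"\n'
    '"1.0.1.0","1.0.3.255","16777472","16778239","CN","China"\n'
)

def parse_rows(text):
    """Return a list of (start_dec, end_dec, country_code) tuples."""
    rows = []
    # csv.reader accepts any iterable of lines and strips the quotes
    for row in csv.reader(text.splitlines()):
        if len(row) != 6:
            continue  # skip malformed lines
        start_ip, end_ip, start_dec, end_dec, cc, country = row
        rows.append((int(start_dec), int(end_dec), cc))
    return rows

rows = parse_rows(SAMPLE)
```

Because csv.reader already removes the surrounding quotes, the decimal fields can go straight into int().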

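A redis sorted set stores a score with each unique member. I use the range's start decimal as the score. The country code repeats across many ranges, so I made each member unique by appending the start decimal to the country code, giving members of the form "CC|startDec".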
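To show why a start-of-range score is enough for a lookup, here is a small sketch that stands in for the sorted set with a plain Python list and bisect. It does not talk to redis, and make_member and lookup are hypothetical names. The lookup picks the entry with the highest score at or below the IP's decimal value, which is the same question the module asks redis with zrevrangebyscore.

```python
import bisect

def make_member(cc, start_dec):
    """Build a unique member such as 'AU|16777216'."""
    return "{0}|{1}".format(cc, start_dec)

def lookup(entries, dec):
    """entries: list of (score, member) tuples sorted by score.
    Return the member with the highest score <= dec, or None."""
    scores = [score for score, _ in entries]
    i = bisect.bisect_right(scores, dec)
    if i == 0:
        return None  # dec is below the first range
    return entries[i - 1][1]

# Illustrative entries only
entries = sorted([
    (16777216, make_member("AU", 16777216)),
    (16777472, make_member("CN", 16777472)),
])
```

Note that this approach never checks the end of a range, which is why gaps in the data need their own entries.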

Converting an IP address to its decimal value would be easy with the netaddr module. I don't have that on my Mac, so I'm using struct.unpack("!L", socket.inet_aton(ip))[0] instead.
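For reference, here is a minimal sketch of that conversion next to the equivalent arithmetic, so the result can be checked by hand. ip_to_dec_manual is a hypothetical name added only for comparison.

```python
import socket
import struct

def ip_to_dec(ip):
    """Dotted quad -> integer via the standard library."""
    # inet_aton packs the address into 4 bytes in network order;
    # "!L" unpacks those bytes as one big-endian unsigned 32-bit value
    return struct.unpack("!L", socket.inet_aton(ip))[0]

def ip_to_dec_manual(ip):
    """Same result by shifting each octet into place."""
    a, b, c, d = [int(part) for part in ip.split(".")]
    return (a << 24) | (b << 16) | (c << 8) | d
```

Both functions give 16777216 for "1.0.0.0", which matches the "start decimal" column in the CSV for that address.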

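I also detect ranges that are missing from the data and insert an "empty" entry for each one. Without these, a lookup for an IP inside a gap would fall through to the country of the preceding range.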
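The gap detection can be sketched without redis as below. fill_gaps is a hypothetical helper, and the ranges and country codes in the usage note are made up for illustration. It mirrors the module's approach: whenever a range starts more than one past the previous range's end, an "empty" entry is scored at the first missing address.

```python
def fill_gaps(ranges):
    """ranges: iterable of (start_dec, end_dec, cc), sorted by start_dec.
    Return a list of (score, member) tuples, with 'empty' markers
    inserted wherever the input skips over addresses."""
    out = []
    last_end = 0
    for start, end, cc in ranges:
        if start - 1 > last_end:
            # gap between the previous end and this start
            out.append((last_end + 1, "empty|{0}".format(last_end)))
        out.append((start, "{0}|{1}".format(cc, start)))
        last_end = end
    return out
```

For example, fill_gaps([(10, 20, "XX"), (21, 30, "YY"), (40, 50, "ZZ")]) adds one empty entry scored at 1 (before the first range) and one scored at 31 (between the second and third ranges).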


The Lookup code:

import redis
import GeoIP
red = redis.Redis()
# "geoip" is a placeholder; use the same key you passed to data_to_redis()
CC = GeoIP.ip_to_country(red, "geoip", "xx.xx.xx.xx")
print CC

GeoIP.ip_to_country(red, key, ip) returns the country code from the data, or "empty" for blocks that are missing from the data, such as 127.0.0.1. It returns "unknown" for addresses outside the range or for invalidly formatted IP addresses.

The data is about 20MB.


The module:

# The GeoLite databases are distributed under
# the Creative Commons Attribution-ShareAlike 3.0 Unported License.
# The attribution requirement may be met by including the following
# in all advertising and documentation mentioning features of or use of this database:
#
# This product includes GeoLite data created by MaxMind, available from
#  http://www.maxmind.com.


import redis

def data_to_redis( myRedis, key, filename ):
    import csv

    # the data is csv. "start","end","d_start","d_end","country code","country"
    # the redis data is a sorted set (zset). note the country code repeats

    fp = open( filename, "r" )
    count = 0
    empty = 0
    lastEnd = 0
    csv_reader = csv.reader( fp )
    for line in csv_reader:
        #print line
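        try:
            startIP,endIP,startDec,endDec,CC,country = line
        except ValueError:
            # malformed line: report it and skip it rather than reuse the previous row's values
            print line
            continue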

        #print "{0} {1} {2} {3} {4}".format( startIP, endIP, startDec, endDec, CC )
        # use the startDec as the score
        score = int(startDec.strip('" '))
        endDec = int(endDec.strip('" '))
        if score-1 > lastEnd:
            # assume a missing block.
            #print "missing block: {0} to {1}".format( lastEnd+1, score-1 )
            myRedis.zadd( key, "empty|{0}".format(lastEnd), lastEnd+1 )
            empty += 1

        lastEnd = endDec
        # use CC|startDec as the unique member
        member = CC.strip('" ')+"|"+str(score)
        myRedis.zadd( key, member, score )
        #if count > 10:
        #    print "early exit for debug"
        #    return
        count += 1
    fp.close()
    print "added {0} records to {1}. empty blocks {2}".format( count, key, empty )


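def ip_to_country( myRedis, key, ip ):
    import socket
    try:
        dec = ip_to_dec( ip )
    except (socket.error, TypeError):
        # badly formatted address
        return "unknown"
    # fetch the single member with the highest score at or below dec,
    # i.e. the range that starts at or before this IP
    data = myRedis.zrevrangebyscore( key, dec,0, num=1, start=0 )
    if len(data) > 0:
        CC,start = data[0].split("|")
    else:
        CC = "unknown"
    return CC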

def ip_to_dec( ip ):
    "convert decimal dotted quad string to long integer"
    import struct
    import socket
    # '!' selects network (big-endian) byte order, so the result does not depend on the host
    return struct.unpack('!L',socket.inet_aton(ip))[0]



Todo:
add some more error checking and perhaps create a CC to country name hash in redis.
