GeoIP decoding using BigQuery - ipv4 and ipv6

Almost every web analytical tool collects the user ip address and this information has much more value if you know the geographic location of that ip, even if the location is approximate. The are many tutorials and example of how you can do geoip decoding of and ipv4 address but I was not able to find one for the ipv6 address using BigQuery.

The solution

Sign up for an account at https://maxmind.com.

Download and extract the following databases:

GeoLite2-City-CSV

These 3 files must exist into you current folder

GeoLite2-City-Blocks-IPv4.csv
GeoLite2-City-Blocks-IPv6.csv
GeoLite2-City-Locations-en.csv 

Setup the BigQuery dataset.

DATASET=demo_bq_dataset
bq mk $DATASET

Load tables

bq load --skip_leading_rows=1 $DATASET.geoip_city_block_v4 \
    GeoLite2-City-Blocks-IPv4.csv \
    "network:STRING,geoname_id:INTEGER,registered_country_geoname_id:INTEGER,represented_country_geoname_id:INTEGER,is_anonymous_proxy:BOOL,is_satellite_provider:BOOL,postal_code:STRING,latitude:FLOAT64,longitude:FLOAT64,accuracy_radius:FLOAT64"

bq load --skip_leading_rows=1 $DATASET.geoip_city_block_v6 \
    GeoLite2-City-Blocks-IPv6.csv \
    "network:STRING,geoname_id:INTEGER,registered_country_geoname_id:INTEGER,represented_country_geoname_id:INTEGER,is_anonymous_proxy:BOOL,is_satellite_provider:BOOL,postal_code:STRING,latitude:FLOAT64,longitude:FLOAT64,accuracy_radius:FLOAT64"

bq load --skip_leading_rows=1 $DATASET.geoip_city_locations \
    GeoLite2-City-Locations-en.csv \
    "geoname_id:INTEGER,locale_code:STRING,continent_code:STRING,continent_name:STRING,country_iso_code:STRING,country_name:STRING,subdivision_1_iso_code:STRING,subdivision_1_name:STRING,subdivision_2_iso_code:STRING,subdivision_2_name:STRING,city_name:STRING,metro_code:STRING,time_zone:STRING,is_in_european_union:BOOL"

Next a new set of tables must be prepared in order to be used by the final queries.

Using the SQL Workspace in the BiqQuery section of the Google Cloud Platform Console run the following queries:

CREATE TABLE `demo_bq_dataset.geoip_city_v4`
AS
SELECT 
  NET.IP_FROM_STRING(REGEXP_EXTRACT(network, r'(.*)/' )) network_bin,
  CAST(REGEXP_EXTRACT(network, r'/(.*)' ) AS INT64) mask,
  l.city_name, l.country_iso_code, l.country_name, b.latitude, b.longitude
FROM `demo_bq_dataset.geoip_city_block_v4` b
JOIN `demo_bq_dataset.geoip_city_locations` l 
USING(geoname_id);
CREATE TABLE `demo_bq_dataset.geoip_city_v6`
AS
SELECT 
  NET.IP_FROM_STRING(REGEXP_EXTRACT(network, r'(.*)/' )) network_bin,
  CAST(REGEXP_EXTRACT(network, r'/(.*)' ) AS INT64) mask,
  l.city_name, l.country_iso_code, l.country_name, b.latitude, b.longitude
FROM `demo_bq_dataset.geoip_city_block_v6` b
JOIN `demo_bq_dataset.geoip_city_locations` l
USING(geoname_id);

Now the database is ready to be queried:

WITH test_data AS (
    SELECT '2a02:2f0c:570c:fe00:1db7:21c4:21fa:f89' AS ip UNION ALL 
    SELECT '79.114.150.111' AS ip
)
-- replace the input_data with your data
, ipv4 AS (
    SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
    FROM test_data 
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
), ipv4d AS (
    SELECT ip, city_name, country_name, latitude, longitude
    FROM (
        SELECT ip, ip_bytes & NET.IP_NET_MASK(4, mask) network_bin, mask
        FROM ipv4, UNNEST(GENERATE_ARRAY(8,32)) mask
    )
    JOIN `demo_bq_dataset.geoip_city_v4`
    USING (network_bin, mask)
), ipv6 AS (
    SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
    FROM test_data 
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 16
), ipv6d AS (
    SELECT ip, city_name, country_name, latitude, longitude
    FROM (
        SELECT  ip, ip_bytes & NET.IP_NET_MASK(16, mask) network_bin, mask
        FROM ipv6, UNNEST(GENERATE_ARRAY(19,64)) mask
    )
    JOIN `demo_bq_dataset.geoip_city_v6`  
    USING (network_bin, mask)
)
SELECT * FROM ipv4d
UNION ALL 
SELECT * FROM ipv6d

Conclusion

Long live BigQuery.