GeoIP decoding using BigQuery - ipv4 and ipv6
Almost every web analytical tool collects the user ip address and this information has much more value if you know the geographic location of that ip, even if the location is approximate. The are many tutorials and example of how you can do geoip decoding of and ipv4 address but I was not able to find one for the ipv6 address using BigQuery.
The solution
Sign up for an account at https://maxmind.com.
Download and extract the following databases:
GeoLite2-City-CSV
These 3 files must exist into you current folder
GeoLite2-City-Blocks-IPv4.csv
GeoLite2-City-Blocks-IPv6.csv
GeoLite2-City-Locations-en.csv
Setup the BigQuery dataset.
DATASET=demo_bq_dataset
bq mk $DATASET
Load tables
bq load --skip_leading_rows=1 $DATASET.geoip_city_block_v4 \
GeoLite2-City-Blocks-IPv4.csv \
"network:STRING,geoname_id:INTEGER,registered_country_geoname_id:INTEGER,represented_country_geoname_id:INTEGER,is_anonymous_proxy:BOOL,is_satellite_provider:BOOL,postal_code:STRING,latitude:FLOAT64,longitude:FLOAT64,accuracy_radius:FLOAT64"
bq load --skip_leading_rows=1 $DATASET.geoip_city_block_v6 \
GeoLite2-City-Blocks-IPv6.csv \
"network:STRING,geoname_id:INTEGER,registered_country_geoname_id:INTEGER,represented_country_geoname_id:INTEGER,is_anonymous_proxy:BOOL,is_satellite_provider:BOOL,postal_code:STRING,latitude:FLOAT64,longitude:FLOAT64,accuracy_radius:FLOAT64"
bq load --skip_leading_rows=1 $DATASET.geoip_city_locations \
GeoLite2-City-Locations-en.csv \
"geoname_id:INTEGER,locale_code:STRING,continent_code:STRING,continent_name:STRING,country_iso_code:STRING,country_name:STRING,subdivision_1_iso_code:STRING,subdivision_1_name:STRING,subdivision_2_iso_code:STRING,subdivision_2_name:STRING,city_name:STRING,metro_code:STRING,time_zone:STRING,is_in_european_union:BOOL"
Next a new set of tables must be prepared in order to be used by the final queries.
Using the SQL Workspace in the BiqQuery section of the Google Cloud Platform Console run the following queries:
CREATE TABLE `demo_bq_dataset.geoip_city_v4`
AS
SELECT
NET.IP_FROM_STRING(REGEXP_EXTRACT(network, r'(.*)/' )) network_bin,
CAST(REGEXP_EXTRACT(network, r'/(.*)' ) AS INT64) mask,
l.city_name, l.country_iso_code, l.country_name, b.latitude, b.longitude
FROM `demo_bq_dataset.geoip_city_block_v4` b
JOIN `demo_bq_dataset.geoip_city_locations` l
USING(geoname_id);
CREATE TABLE `demo_bq_dataset.geoip_city_v6`
AS
SELECT
NET.IP_FROM_STRING(REGEXP_EXTRACT(network, r'(.*)/' )) network_bin,
CAST(REGEXP_EXTRACT(network, r'/(.*)' ) AS INT64) mask,
l.city_name, l.country_iso_code, l.country_name, b.latitude, b.longitude
FROM `demo_bq_dataset.geoip_city_block_v6` b
JOIN `demo_bq_dataset.geoip_city_locations` l
USING(geoname_id);
Now the database is ready to be queried:
WITH test_data AS (
SELECT '2a02:2f0c:570c:fe00:1db7:21c4:21fa:f89' AS ip UNION ALL
SELECT '79.114.150.111' AS ip
)
-- replace the input_data with your data
, ipv4 AS (
SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
FROM test_data
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
), ipv4d AS (
SELECT ip, city_name, country_name, latitude, longitude
FROM (
SELECT ip, ip_bytes & NET.IP_NET_MASK(4, mask) network_bin, mask
FROM ipv4, UNNEST(GENERATE_ARRAY(8,32)) mask
)
JOIN `demo_bq_dataset.geoip_city_v4`
USING (network_bin, mask)
), ipv6 AS (
SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
FROM test_data
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 16
), ipv6d AS (
SELECT ip, city_name, country_name, latitude, longitude
FROM (
SELECT ip, ip_bytes & NET.IP_NET_MASK(16, mask) network_bin, mask
FROM ipv6, UNNEST(GENERATE_ARRAY(19,64)) mask
)
JOIN `demo_bq_dataset.geoip_city_v6`
USING (network_bin, mask)
)
SELECT * FROM ipv4d
UNION ALL
SELECT * FROM ipv6d
Conclusion
Long live BigQuery.