| Title: | Build Geographic Crosswalk Tables for New York City Boundaries |
|---|---|
| Description: | Downloads official New York City geographic boundary data from NYC Open Data, the Department of City Planning, and other sources, then generates comprehensive crosswalk tables showing how administrative and spatial boundaries overlap. Produces wide-format and long-format CSV crosswalks with intersection area and percentage calculations. Based on the Python tool by Nathan Storey at MODA-NYC (<https://github.com/MODA-NYC/nyc-geography-crosswalks>), which builds on earlier work by BetaNYC (<https://github.com/BetaNYC/nyc-boundaries>). |
| Authors: | Kieran Healy [aut, cre] (ORCID: <https://orcid.org/0000-0001-9114-981X>), Nathan Storey [aut] (Author of the original Python implementation) |
| Maintainer: | Kieran Healy <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.0.9000 |
| Built: | 2026-05-19 09:59:55 UTC |
| Source: | https://github.com/kjhealy/nycbuildxwalks |
Reads a combined boundaries GeoJSON file, reprojects to EPSG:2263 (NY State Plane, feet) for accurate area calculations, and builds both long-form and wide-format crosswalk CSVs for each primary geography.
build_crosswalks(boundaries_path, run_dir, buffer_feet = -50, min_area_final = 100, epsilon = 1e-06, exclude_ids = "cc_upcoming", primary_only = NULL, targets = NULL, max_primaries = NULL)

| Argument | Description |
|---|---|
| boundaries_path | Path to |
| run_dir | Output directory for crosswalk CSVs and metadata. |
| buffer_feet | Negative buffer in feet for intersection de-noising. Default -50. |
| min_area_final | Minimum intersection area in square feet. Default 100. |
| epsilon | Tiny area to suppress numerical noise. Default 1e-06. |
| exclude_ids | Character vector of geography IDs to exclude. Default "cc_upcoming". |
| primary_only | Optional character vector to restrict which primary geography IDs are processed. |
| targets | Optional character vector to restrict target geography IDs. |
| max_primaries | Optional integer to limit features per primary (for testing). |

The path to run_dir (invisibly).
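
The reprojection step matters because areas computed on longitude/latitude coordinates are not in feet. The following is a minimal sketch of that idea using the sf package directly, not the package's internal code; the rectangle near Lower Manhattan is a made-up example geometry.

```r
library(sf)

# A small, hypothetical rectangle near Lower Manhattan in lon/lat (EPSG:4326)
coords <- rbind(
  c(-74.010, 40.705),
  c(-74.000, 40.705),
  c(-74.000, 40.712),
  c(-74.010, 40.712),
  c(-74.010, 40.705)
)
rect_wgs84 <- st_sfc(st_polygon(list(coords)), crs = 4326)

# Reproject to EPSG:2263 so that coordinates are in feet
rect_ft <- st_transform(rect_wgs84, 2263)

# Area in square feet, with the units class dropped for plain arithmetic
area_sqft <- as.numeric(st_area(rect_ft))
```

Thresholds such as min_area_final = 100 are only meaningful once areas are expressed in square feet, as they are here.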
For each feature of the primary geography, computes the intersection area with every feature of each target geography. Returns a tibble with one row per significant pairwise intersection, including area and percentage overlap.
build_longform_for_primary(all_gdf, primary_id, other_ids, buffer_feet = -50, min_area = 100, epsilon = 1e-06, max_primaries = NULL)

| Argument | Description |
|---|---|
| all_gdf | An sf object containing all boundaries, projected to EPSG:2263 (NY State Plane, feet). |
| primary_id | The geography ID to use as the primary (e.g., |
| other_ids | Character vector of target geography IDs. |
| buffer_feet | Negative buffer (in feet) applied during intersection to suppress edge artifacts. Default -50. |
| min_area | Minimum intersection area in square feet. Default 100. |
| epsilon | Tiny area threshold to suppress numerical noise. Default 1e-06. |
| max_primaries | Optional integer to limit the number of primary features processed (useful for testing). |

A tibble with columns primary_geo_id, primary_geo_name,
other_geo_id, other_geo_name, primary_area_sqft,
intersection_area_sqft, and pct_overlap.
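
To make the roles of buffer_feet and min_area concrete, here is a rough sketch of one plausible reading of the long-form calculation, written with sf on toy rectangles. It is an assumption about the approach, not the package's actual implementation: the primary feature is shrunk by the negative buffer to pick candidate targets, and areas are then measured against the unshrunk primary. The rect() helper and all feature names are hypothetical.

```r
library(sf)

# Hypothetical helper: axis-aligned rectangle, coordinates in feet
rect <- function(xmin, ymin, xmax, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}

primary <- st_sf(
  primary_geo_id = "cd",
  primary_geo_name = "P1",
  geometry = st_sfc(rect(0, 0, 1000, 1000), crs = 2263)
)
targets <- st_sf(
  other_geo_id = rep("nta", 2),
  other_geo_name = c("T_half", "T_sliver"),
  geometry = st_sfc(
    rect(500, 0, 1500, 1000),  # covers the right half of P1
    rect(-1000, 0, 1, 1000),   # overlaps P1 only by a 1-foot-wide strip
    crs = 2263
  )
)

buffer_feet <- -50
min_area <- 100

# Shrink the primary; only targets that still intersect it are candidates
shrunk <- st_buffer(st_geometry(primary), buffer_feet)
hits <- st_intersects(shrunk, st_geometry(targets))[[1]]
candidates <- targets[hits, ]

# Measure overlap against the original (unshrunk) primary geometry
primary_area <- as.numeric(st_area(primary))
inter_area <- vapply(seq_len(nrow(candidates)), function(i) {
  piece <- st_intersection(st_geometry(primary), st_geometry(candidates)[i])
  sum(as.numeric(st_area(piece)))
}, numeric(1))

long <- data.frame(
  primary_geo_id = primary$primary_geo_id,
  primary_geo_name = primary$primary_geo_name,
  other_geo_id = candidates$other_geo_id,
  other_geo_name = candidates$other_geo_name,
  primary_area_sqft = primary_area,
  intersection_area_sqft = inter_area,
  pct_overlap = 100 * inter_area / primary_area
)

# Drop intersections below the minimum area threshold
long <- long[long$intersection_area_sqft >= min_area, ]
```

In this toy case the sliver overlap is 1,000 square feet, which is above min_area, so it is the negative buffer rather than the area threshold that removes it.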
For each feature of the primary geography, finds overlapping features from all target geographies and returns a tibble with one row per primary feature and one column per target geography (values are semicolon-separated names of overlapping features).
build_wide_for_primary(all_gdf, primary_id, other_ids, buffer_feet = -50, min_area = 100, epsilon = 1e-06, max_primaries = NULL)

| Argument | Description |
|---|---|
| all_gdf | An sf object containing all boundaries, projected to EPSG:2263 (NY State Plane, feet). |
| primary_id | The geography ID to use as the primary (e.g., |
| other_ids | Character vector of target geography IDs. |
| buffer_feet | Negative buffer (in feet) applied during intersection to suppress edge artifacts. Default -50. |
| min_area | Minimum intersection area in square feet. Default 100. |
| epsilon | Tiny area threshold to suppress numerical noise. Default 1e-06. |
| max_primaries | Optional integer to limit the number of primary features processed (useful for testing). |

A tibble with one row per primary feature. The first column is
named after primary_id and contains the primary feature names.
Remaining columns are named after each target geography ID and contain
semicolon-separated overlapping feature names.
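
The wide layout can be pictured as a pivot of pairwise overlaps. The sketch below builds it in base R from a small, invented long-form table; the feature names, the ";" separator without a trailing space, and the alphabetical ordering within a cell are all assumptions made for illustration.

```r
# Invented pairwise overlaps: one row per (primary feature, target feature)
long <- data.frame(
  primary_geo_name = c("CD 1", "CD 1", "CD 1", "CD 2"),
  other_geo_id     = c("nta", "nta", "pp", "nta"),
  other_geo_name   = c("Beta", "Alpha", "Precinct 1", "Gamma"),
  stringsAsFactors = FALSE
)

# Collapse names within each (primary feature, target geography) pair
collapsed <- aggregate(
  other_geo_name ~ primary_geo_name + other_geo_id,
  data = long,
  FUN = function(x) paste(sort(unique(x)), collapse = ";")
)

# Spread target geographies into one column each
wide <- reshape(
  collapsed,
  idvar = "primary_geo_name",
  timevar = "other_geo_id",
  direction = "wide"
)

# Name columns after the geography IDs; the first column takes the primary ID
names(wide) <- sub("^other_geo_name\\.", "", names(wide))
names(wide)[1] <- "cd"
```

Primary features with no overlap in a given target geography end up with NA in that column in this sketch; how the package represents that case is not stated here.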
Groups features by name_col and unions their geometries, producing one
feature per unique name. This prevents duplicate rows for multipart
features (e.g., some MODZCTAs).
dissolve_by_name(gdf)

| Argument | Description |
|---|---|
| gdf | An sf object with columns |

An sf object with one row per unique name_col value, with
columns id, name_col, and dissolved geometry.
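
As an illustration of dissolving multipart features, the following sketch unions geometries that share a name using plain sf calls. It is a hedged approximation of the idea rather than the function's source; the rect() helper, the "modzcta" ID, and the names "A" and "B" are placeholders.

```r
library(sf)

# Hypothetical helper: axis-aligned rectangle, coordinates in feet
rect <- function(xmin, ymin, xmax, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}

# Feature "A" arrives as two separate rows (a multipart feature)
gdf <- st_sf(
  id = rep("modzcta", 3),
  name_col = c("A", "A", "B"),
  geometry = st_sfc(
    rect(0, 0, 1, 1), rect(2, 0, 3, 1), rect(5, 5, 6, 6),
    crs = 2263
  )
)

# Union the geometries within each name, yielding one geometry per name
nms <- unique(gdf$name_col)
geoms <- lapply(nms, function(n) {
  st_union(st_geometry(gdf)[gdf$name_col == n])[[1]]
})

dissolved <- st_sf(
  id = rep(gdf$id[1], length(nms)),
  name_col = nms,
  geometry = st_sfc(geoms, crs = st_crs(gdf))
)
```

After dissolving, "A" is a single row whose geometry covers both original parts, so downstream joins produce one row for it rather than two.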
Iterates over the datasets defined in nyc_datasets, downloads each one
via process_dataset(), combines them into a single sf object, fixes
invalid geometries, and writes the results to a timestamped output
directory.
download_all_boundaries(output_dir = "outputs", auto_detect_latest = TRUE, preferred_cycle = NULL, external_data_dir = "data/external", datasets = NULL)

| Argument | Description |
|---|---|
| output_dir | Base directory for timestamped run outputs. Default "outputs". |
| auto_detect_latest | Logical. Attempt to auto-detect the latest DCP cycle letter. Default TRUE. |
| preferred_cycle | Optional single letter to pin a DCP cycle. |
| external_data_dir | Path to local fallback files. Default "data/external". |
| datasets | A tibble of dataset definitions. Defaults to nyc_datasets. |

The path to the timestamped run directory (invisibly).
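
The combine-and-repair step can be sketched with sf as follows. This is an illustrative approximation, not the function's code: two toy layers with the standardized columns are row-bound, and an invalid "bowtie" polygon is repaired with st_make_valid(). A planar CRS is used here so that validity is checked by GEOS; the IDs and names are invented.

```r
library(sf)

# A self-intersecting "bowtie" ring (invalid) and an ordinary square (valid)
bowtie <- st_polygon(list(rbind(c(0, 0), c(1, 1), c(1, 0), c(0, 1), c(0, 0))))
square <- st_polygon(list(rbind(c(2, 0), c(3, 0), c(3, 1), c(2, 1), c(2, 0))))

layer_a <- st_sf(id = "geo_a", name_col = "Bowtie",
                 geometry = st_sfc(bowtie, crs = 2263))
layer_b <- st_sf(id = "geo_b", name_col = "Square",
                 geometry = st_sfc(square, crs = 2263))

# Combine layers that share the same columns into a single sf object
combined <- rbind(layer_a, layer_b)

# Record validity, then repair any invalid geometries
valid_before <- st_is_valid(combined)
combined <- st_make_valid(combined)
valid_after <- st_is_valid(combined)
```

Repairing before any intersection work avoids topology errors later, since overlay operations generally expect valid input geometries.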
Orchestrates the complete workflow: downloads all NYC boundary datasets
via download_all_boundaries(), then builds crosswalk tables via
build_crosswalks(). Optionally creates ZIP archives of the outputs.
make_run(output_dir = "outputs", auto_detect_latest = TRUE, preferred_cycle = NULL, external_data_dir = "data/external", buffer_feet = -50, min_area_final = 100, epsilon = 1e-06, exclude_ids = "cc_upcoming", primary_only = NULL, targets = NULL, max_primaries = NULL, zip_artifacts = FALSE)

| Argument | Description |
|---|---|
| output_dir | Base directory for outputs. Default "outputs". |
| auto_detect_latest | Logical. Auto-detect latest DCP cycle. Default TRUE. |
| preferred_cycle | Optional DCP cycle letter to pin. |
| external_data_dir | Path to local fallback files. Default "data/external". |
| buffer_feet | Negative buffer for intersection de-noising. Default -50. |
| min_area_final | Minimum intersection area (sq ft). Default 100. |
| epsilon | Numerical noise threshold. Default 1e-06. |
| exclude_ids | Geography IDs to exclude. Default "cc_upcoming". |
| primary_only | Optional character vector of primary IDs to build. |
| targets | Optional character vector of target IDs. |
| max_primaries | Optional limit on features per primary (for testing). |
| zip_artifacts | Logical. If TRUE, creates ZIP archives of the outputs. Default FALSE. |

The path to the timestamped run directory (invisibly).
A tibble containing metadata for the 15 NYC geographic boundary datasets used to build crosswalk tables. Each row defines a dataset's source URL, the column used for feature names, and the source type.
nyc_datasets
A tibble with 15 rows and 6 columns:
Short identifier for the geography (e.g., "cd", "pp", "nta")
Human-readable name of the geography
Source URL for downloading the boundary data
Name of the column in the source data that contains feature names
Optional alternate name column, or NA if none
One of "dcp_zip" (DCP shapefile zip with cycle detection), "opendata_shapefile" (NYC Open Data shapefile export), "opendata_geojson" (NYC Open Data GeoJSON), or "edc_zip" (EDC shapefile zip)
NYC Department of City Planning (DCP), NYC Open Data, and the NYC Economic Development Corporation (EDC). Dataset definitions adapted from the Python tool by Nathan Storey at MODA-NYC (https://github.com/MODA-NYC/nyc-geography-crosswalks).
Downloads a geographic boundary dataset, reads it into an sf object,
reprojects to EPSG:4326, and standardizes the columns to a common schema
(id, name_col, optionally name_alt, and geometry).
process_dataset(dataset_info, auto_detect_latest = TRUE, preferred_cycle = NULL, external_data_dir = "data/external")

| Argument | Description |
|---|---|
| dataset_info | A single-row tibble or list with elements |
| auto_detect_latest | Logical. Passed to |
| preferred_cycle | Optional cycle letter passed to |
| external_data_dir | Path to a directory containing local fallback files (e.g., |

A list with components:
gdf: An sf object with standardized columns, or NULL on failure.
meta: A list with id, original_url, resolved_url, cycle,
auto_detected, status, and error.
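
The schema standardization described above might look roughly like the sketch below. The function standardize_columns() and the source field name BoroCD are hypothetical stand-ins chosen for illustration, and the download step is replaced by a toy layer built in memory.

```r
library(sf)

# Hypothetical helper: reduce any boundary layer to id, name_col, geometry
standardize_columns <- function(gdf, id, name_field) {
  gdf <- st_transform(gdf, 4326)  # common lon/lat CRS
  st_sf(
    id = rep(id, nrow(gdf)),
    name_col = as.character(gdf[[name_field]]),
    geometry = st_geometry(gdf)
  )
}

# Toy source layer in EPSG:2263 with an arbitrary name field
square <- function(x0, y0, size = 1000) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + size, y0), c(x0 + size, y0 + size),
    c(x0, y0 + size), c(x0, y0)
  )))
}
raw <- st_sf(
  BoroCD = c(101, 102),
  geometry = st_sfc(square(980000, 190000), square(982000, 190000), crs = 2263)
)

out <- standardize_columns(raw, id = "cd", name_field = "BoroCD")
```

Giving every layer the same columns is what allows layers from different sources to be stacked into a single sf object later.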
DCP boundary files are versioned with a cycle suffix (e.g., _25a.zip).
This function probes for newer cycles by sending HTTP HEAD requests with
ascending letters (b, c, d, ...) and returns the URL for the highest
available cycle.
resolve_dcp_cycle(url, auto_detect = TRUE, preferred_cycle = NULL)

| Argument | Description |
|---|---|
| url | A DCP boundary URL containing a cycle suffix like |
| auto_detect | Logical. If |
| preferred_cycle | Optional single lowercase letter to pin a specific cycle (e.g., |

A list with components:
resolved_url: The URL with the best available cycle.
meta: A list with cycle_source, cycle_resolved, auto_detected,
and probes.
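
The probing logic can be approximated in base R as shown below. To keep the sketch runnable offline, the HTTP HEAD request is replaced by a caller-supplied function exists_fn, and the example.com URLs are placeholders. Stopping at the first missing letter is an assumption about the behavior, and probe_cycles() is a hypothetical name, not part of the package.

```r
# Hypothetical sketch: find the highest available cycle letter for a URL
probe_cycles <- function(url, exists_fn) {
  pattern <- "_([0-9]{2})([a-z])\\.zip$"
  m <- regmatches(url, regexec(pattern, url))[[1]]
  if (length(m) == 0) return(url)  # no cycle suffix: leave URL unchanged

  year <- m[2]
  current <- m[3]
  best <- current

  # Try letters after the current one, in ascending order
  later <- letters[seq_along(letters) > match(current, letters)]
  for (l in later) {
    candidate <- sub(pattern, paste0("_", year, l, ".zip"), url)
    if (!exists_fn(candidate)) break  # assumed: stop at the first gap
    best <- l
  }
  sub(pattern, paste0("_", year, best, ".zip"), url)
}

# Stand-in for an HTTP HEAD check against a made-up set of published files
available <- c(
  "https://example.com/nycd_25a.zip",
  "https://example.com/nycd_25b.zip",
  "https://example.com/nycd_25c.zip"
)
exists_fn <- function(u) u %in% available

resolved <- probe_cycles("https://example.com/nycd_25a.zip", exists_fn)
```

Passing the existence check in as a function keeps the letter-stepping logic testable without network access.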
Groups features by name_col and unions each group's geometry into a
single geometry. Returns a tibble of name-geometry pairs rather than an
sf object, for use in intersection calculations.
union_by_name(gdf)

| Argument | Description |
|---|---|
| gdf | An sf object with a |

A tibble with columns name (character) and geometry
(sfc_GEOMETRY).