The DEFID2 dataset is organized in a Relational Database Management System (RDBMS); the version maintained at JRC is a Postgres (PostGIS) database, which is converted to a SQLite (SpatiaLite) file database to ease data distribution. Organizing the data in such a strictly structured form facilitates data management and consistency checks while preserving complex relations between attributes and ensuring data integrity. A simplified overview of the database structure is presented in Figure 1.1. The two main tables of that schema are Event and Geom, which hold the disturbance events and the geometries associated with these events.
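To make the relation between the two main tables concrete, here is a minimal base R sketch that joins a toy Event table to a toy Geom table. The column names (`event_id`, `geom_id`, `survey_date`, `wkt`) and the direct foreign key are assumptions made for illustration only; the actual DEFID2 schema may link the tables differently.

```r
# Toy Event table: several events may point at the same geometry (hypothetical columns)
event <- data.frame(
  event_id = c(1, 2, 3),
  geom_id = c(10, 10, 11),
  survey_date = as.Date(c("2005-06-01", "2006-06-01", "2014-11-15")),
  stringsAsFactors = FALSE
)

# Toy Geom table: one row per geometry, stored here as WKT text for simplicity
geom <- data.frame(
  geom_id = c(10, 11),
  wkt = c("POINT (11.57 46.62)", "POINT (10.55 46.60)"),
  stringsAsFactors = FALSE
)

# Reconstruct a flat table: one row per event, with its geometry attached
flat <- merge(event, geom, by = "geom_id")
flat <- flat[order(flat$event_id), ]
print(flat)
```

Keeping geometries in their own table means a geometry shared by several events is stored once, which is the kind of consistency a relational layout is meant to provide.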
The `defid2R` package eases data access and reconstruction into tabular form. Its core function is `read_defid()`, which reads data as an sf dataframe with multiple filter and attribute selection options. In the example below, a data subset covering only Italy and containing host and agent information is returned.
library(defid2R)
## Checking local database version
## ......
## OK
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- read_defid(select=c('agents', 'hosts'), country_filter='Italy') %>%
  sample_n(10) %>%
  select(-event_id, -geom_id)
knitr::kable(df, "html")
survey_date | is_affected | agents | hosts | geometry |
---|---|---|---|---|
2005-06-01 | TRUE | Coleophora laricella | Larix decidua | POLYGON ((11.57057 46.62383… |
2006-06-01 | TRUE | Coleophora laricella | Larix decidua | POLYGON ((11.28648 46.79223… |
2014-11-15 | TRUE | Thaumetopoea pityocampa | Pinus sylvestris | POLYGON ((10.55242 46.60414… |
2003-06-01 | TRUE | Lymantria dispar | Quercus sp.,other deciduous sp. | POLYGON ((11.30087 46.46252… |
2017-04-15 | TRUE | Cenangium ferruginosum | Pinus sylvestris | POLYGON ((10.80943 46.63712… |
1992-08-15 | TRUE | Orgyia antiqua | Larix decidua,Picea abies | POLYGON ((11.39113 46.82159… |
1989-07-15 | TRUE | Oligonychus ununguis | Picea abies | POLYGON ((11.21385 46.75997… |
1997-07-15 | TRUE | Glis glis,Sciurus vulgaris | Larix decidua | POLYGON ((10.66192 46.64887… |
2015-11-15 | TRUE | Thaumetopoea pityocampa | Pinus sylvestris | POLYGON ((10.56 46.63266, 1… |
2017-01-18 | TRUE | Rhabdocline laricis | Larix decidua | POLYGON ((10.83719 46.73887… |
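Some cells in the table above hold several agents or hosts separated by commas. As a sketch of how such cells could be expanded to one row per agent, the base R example below works on a small hand-made data frame (not on the output of the package) and assumes the comma is the only separator, as it appears in the printed output.

```r
# Toy data frame mimicking two rows of the table above
df_toy <- data.frame(
  survey_date = as.Date(c("1997-07-15", "2003-06-01")),
  agents = c("Glis glis,Sciurus vulgaris", "Lymantria dispar"),
  stringsAsFactors = FALSE
)

# Split each cell on commas; the result is a list with one character vector per row
agent_list <- strsplit(df_toy$agents, ",", fixed = TRUE)

# Repeat each survey date once per agent and flatten the list into a single column
long <- data.frame(
  survey_date = rep(df_toy$survey_date, lengths(agent_list)),
  agent = trimws(unlist(agent_list)),
  stringsAsFactors = FALSE
)
print(long)
```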
It is also possible to use a spatial filter to subset the dataset; in the example below, the boundaries of a NUTS level 3 region are used to filter the data. One attribute of the resulting sf dataframe is then visualized using ggplot2.
library(giscoR)
library(sf)
library(ggplot2)
cz_072 <- gisco_get_nuts(resolution = '01', nuts_id = 'CZ072') %>%
  st_as_sfc()
df <- read_defid(select=c('hosts'), st_intersects_filter=cz_072) %>%
  mutate(survey_date=as.Date(survey_date))
ggplot(df) +
  geom_sf(data=cz_072) +
  geom_sf(aes(fill=survey_date, color=survey_date), size=0.2) +
  theme_bw()
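To illustrate what an intersection-based spatial filter does, the sketch below applies `sf::st_intersects()` to toy geometries. The square region and the three points are made up, and whether `st_intersects_filter` is implemented exactly this way inside the package is an assumption; the example only shows the predicate itself.

```r
library(sf)

# Hypothetical square region standing in for a NUTS 3 boundary (planar coordinates, no CRS)
region <- st_sfc(st_polygon(list(rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0)))))

# Three toy records; the second one lies outside the region
records <- st_sf(
  id = 1:3,
  geometry = st_sfc(st_point(c(1, 1)), st_point(c(3, 3)), st_point(c(0.5, 1.5)))
)

# st_intersects() returns, for each record, the indices of the regions it intersects;
# a record is kept when that list is not empty
hits <- st_intersects(records, region)
inside <- records[lengths(hits) > 0, ]
print(inside$id)
```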
Users experienced with SQL can also explore the data directly with SQL queries. This approach is facilitated by the SQL engine integrated with R Markdown (used to write this document). It is first necessary to create a database connection object. That object is then passed to the `connection=` parameter of the sql chunk.
db_path <- defid_get_db_path()
con <- defid_get_connection(db_path)
For instance, to quickly find out how many countries are represented in the database, the following query can be used.
ncountries |
---|
6 |
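As a minimal sketch of the same kind of query issued from plain R code rather than from a sql chunk, the example below uses DBI with an in-memory SQLite database. The `dataset` table and its `country` column are hypothetical stand-ins, not the actual DEFID2 schema; with the real database, the connection created above would presumably take the place of the toy connection, assuming it is DBI-compatible.

```r
library(DBI)

# Toy in-memory SQLite database (hypothetical table and column names)
toy_con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(toy_con, "dataset", data.frame(
  dataset_code = c("CZ-001", "CZ-003", "IT-001"),
  country = c("Czechia", "Czechia", "Italy"),
  stringsAsFactors = FALSE
))

# Count the distinct countries present in the toy table
res <- dbGetQuery(toy_con, "SELECT COUNT(DISTINCT country) AS ncountries FROM dataset")
print(res)

dbDisconnect(toy_con)
```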
The package contains helper functions to easily reproduce the results presented in the scientific article. For instance, one can quickly get an overview of all datasets by running the `defid_datasets()` function.
df <- defid_datasets()
knitr::kable(df, "html")
dataset_code | country | n_records | geom_event_relation_type | method | contributing_organizations |
---|---|---|---|---|---|
CH-002 | Switzerland | 144 | substitute point | Field surveys | Waldschutz Schweiz (WSS), Eidgenössische Forschungsanstalt für Wald, Schnee und Landschaft (WSL) |
CZ-001 | Czechia | 98 | exact | Remote sensing classification | Remote Sensing and Geospatial Analytics Division, GMV Innovation Solutions … Department of Forest Management and Applied Geoinformatics, Faculty of Forestry and Wood Technology, Mendel University in Brno … University Forest Enterprise Masaryk Forest Křtiny, Mendel University in Brno |
CZ-003 | Czechia | 6883 | exact | Remote sensing classification | Czechglobe - Global Change Research Institute, CAS … Ústav pro hospodářskou úpravu lesů - Forest Management Institute (FMI) |
CZ-004 | Czechia | 1986 | exact | Remote sensing classification | Czechglobe - Global Change Research Institute, CAS … Ústav pro hospodářskou úpravu lesů - Forest Management Institute (FMI) |
ES-001 | NA | 1 | NA | NA | UXAFOREST Research Group, University of Santiago de Compostela … GEOINCA Research Group, University of León |
ES-002 | NA | 1 | NA | NA | UXAFOREST Research Group, University of Santiago de Compostela … GEOINCA Research Group, University of León |
ES-003 | Spain | 434 | substitute polygon | Field surveys | Laboratori de Sanitat Forestal, Servei d’Ordenació i Gestió Forestal, Conselleria d’Agricultura, Desenvolupament Rural, Emergència Climàtica i Transició Ecològica, Generalitat Valenciana |
ES-004 | Spain | 3 | exact point | Field surveys | Department of Forest Engineering, University of Córdoba … Instituto de Agricultura Sostenible (IAS), Consejo Superior de Investigaciones Científicas (CSIC), Córdoba, Spain |
FI-001 | NA | 1 | NA | NA | Department of Forest Sciences, University of Helsinki |
IT-001 | Italy | 19 | exact point | Field surveys | CREA - Consiglio per la ricerca in agricoltura e l’analisi dell’economia agraria … CREA Research Centre for Plant Protection and Certification, Florence, Italy |
IT-003 | NA | 1 | NA | NA | DAFNAE-Entomology, University of Padova |
IT-004 | Italy | 2 | exact point | Field surveys | DAFNAE-Entomology, University of Padova |
IT-005 | Italy | 1 | exact | Field surveys | DAFNAE-Entomology, University of Padova |
IT-006 | Italy | 46 | exact | Field surveys | Ufficio Pianificazione forestale della Provincia Autonoma di Bolzano, Universita degli Studi di Padova |
RO-001 | Romania | 27 | exact | Satellite photointerpretation | Department of Geomorphology-Pedology-Geomatics, Faculty of Geography, University of Bucharest … National Institute for Research and Development in Forestry ‘Marin Drăcea’ (INCDS) |
RO-002 | NA | 1 | NA | NA | Forestry Faculty, University of Suceava |
RO-003 | NA | 1 | NA | NA | National Research & Development Institute in Forestry ‘Marin Drăcea’, Craiova Station |
RO-004 | Romania | 2 | exact | Field surveys | National Research & Development Institute in Forestry ‘Marin Drăcea’, Brașov Station |
SE-001 | Sweden | 6 | exact | Aerial photointerpretation | University of Lund |
SE-002 | Sweden | 1 | exact | Field surveys | University of Lund |
SK-001 | NA | 1 | NA | NA | National Forest Centre - Forest Protection Service … Tatra National Park Research Station, 059 60 Tatranská Lomnica, Slovakia |
The functions `defid_agents()` and `defid_hosts()` return frequency statistics on agents and hosts respectively. Note the use of logarithmic scales to better visualize the highly unbalanced frequencies caused by the overrepresentation of Picea abies and Ips typographus.
df <- defid_agents() %>%
  top_n(10, n_records) %>%
  arrange(n_records)
gg <- ggplot(df, aes(x=n_records, y=name)) +
  geom_bar(stat='identity') +
  scale_x_continuous(
    trans = "log10",
    breaks = c(10, 100, 500, 1000, 5000, 10000, 100000, 1000000),
    labels = scales::comma
  ) +
  scale_y_discrete(limits=rev) +
  ylab('Agent name') +
  xlab('Number of records') +
  theme_bw() +
  theme(axis.text.x=element_text(angle=45,hjust=1))
gg
df <- defid_hosts() %>%
  top_n(10, n_records) %>%
  arrange(n_records)
gg <- ggplot(df, aes(x=n_records, y=name)) +
  geom_bar(stat='identity') +
  scale_x_continuous(
    trans = "log10",
    breaks = c(10, 100, 500, 1000, 5000, 10000, 100000, 1000000),
    labels = scales::comma
  ) +
  scale_y_discrete(limits=rev) +
  ylab('Host name') +
  xlab('Number of records') +
  theme_bw() +
  theme(axis.text.x=element_text(angle=45,hjust=1))
gg
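The effect of the logarithmic axis used in both plots can be checked with a few lines of base R. The counts below are invented to mimic a highly unbalanced distribution; they are not taken from the database.

```r
# Hypothetical record counts, deliberately unbalanced
n_records <- c(dominant = 100000, common = 5000, occasional = 120, rare = 10)

# Bar length relative to the longest bar on a linear axis
linear_share <- n_records / max(n_records)

# Position on a log10 axis: one unit corresponds to one order of magnitude
log_position <- log10(n_records)

print(round(linear_share, 4))
print(round(log_position, 2))
```

On a linear axis the rarest category would be drawn at one ten-thousandth of the longest bar and would be invisible, whereas on the log10 axis it sits at 1 against 5 for the dominant category.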
Finally, we can explore the spatio-temporal distribution of the dataset. This is facilitated by the `defid_st_distrib()` function, which returns a simplified sf dataframe containing the centroid of each reported geometry (as the geom column) and the survey_date.
library(tidyr) # For pivoting function
# Define spatial extent
bbox_europe <- st_bbox(c(xmin=-12,xmax=35,ymin=33,ymax=70), crs = st_crs(4326)) %>%
  st_as_sfc() %>%
  st_transform(st_crs(3035))
# Get gisco data
countries <- gisco_get_nuts(year='2021', epsg='3035') %>%
  filter(LEVL_CODE == 0) %>%
  st_crop(bbox_europe)
# Create hexagonal grid; cellsize is the distance between opposite edges, in metres,
# so each hexagon covers sqrt(3)/2 * 152200^2 m^2, i.e. about 20,000 km^2
hex_grid <- st_make_grid(countries, cellsize = 152200, square=FALSE) %>%
  st_sf() %>%
  st_filter(countries)
# Get defid data
defid_points <- defid_st_distrib() %>%
  st_transform(st_crs(3035))
# Build st_dataframe of count within spatial bin
hex_grid$`before 2000` <- sapply(st_intersects(hex_grid, filter(defid_points, survey_date < as.Date('2000-01-01'))), length)
hex_grid$`2000-2009` <- sapply(st_intersects(hex_grid, filter(defid_points, survey_date >= as.Date('2000-01-01'), survey_date < as.Date('2010-01-01'))), length)
hex_grid$`2010-2019` <- sapply(st_intersects(hex_grid, filter(defid_points, survey_date >= as.Date('2010-01-01'), survey_date < as.Date('2020-01-01'))), length)
hex_grid$`2020 and after` <- sapply(st_intersects(hex_grid, filter(defid_points, survey_date >= as.Date('2020-01-01'))), length)
# Reshape sf dataframe for facet grid
ggdf <- pivot_longer(data = hex_grid, cols = -geometry, names_to = 'Period', values_to = 'count')
ggdf$Period <- factor(ggdf$Period, levels = c('before 2000', '2000-2009', '2010-2019', '2020 and after'))
gg <- ggplot(ggdf) +
  geom_sf(aes(fill = count), size=0.1, color='grey') +
  geom_sf(data=countries, fill=NA, colour='black', size=0.2) +
  facet_wrap(vars(Period)) +
  theme_bw() +
  #scale_fill_viridis_c(trans='log')
  scale_fill_gradient(name = "Number of records", trans = "log", low='white', high = 'magenta', na.value = "white", breaks=c(0,20, 400, 8000)) +
  theme(legend.position="bottom")
gg
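The four period columns above are built with four separate date filters. As an alternative sketch, the same left-closed periods can be assigned in one step with base R's `findInterval()`. The dates below are made up for illustration, and the variable names are not part of the package.

```r
# Toy survey dates, including the boundary dates of each period
survey_date <- as.Date(c("1992-08-15", "2000-01-01", "2009-12-31", "2015-11-15", "2020-01-01"))

# Period boundaries and labels matching the four columns built above
breaks <- as.Date(c("2000-01-01", "2010-01-01", "2020-01-01"))
labels <- c("before 2000", "2000-2009", "2010-2019", "2020 and after")

# findInterval() counts how many breaks are <= each date (0 to 3);
# adding 1 turns that count into an index into the labels
idx <- findInterval(as.numeric(survey_date), as.numeric(breaks)) + 1
period <- factor(labels[idx], levels = labels)
print(table(period))
```

A date falling exactly on a boundary, such as 2000-01-01, is assigned to the later period, which matches the `>=` and `<` comparisons used in the filters above.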