1 Database structure

The DEFID2 dataset is organized in a Relational Database Management System (RDBMS). The version maintained at JRC is a Postgres (PostGIS) database, which is converted to an SQLite (SpatiaLite) file database to ease data distribution. Organizing the data in such a rigidly structured form facilitates data management and consistency checks, preserves complex relations between attributes, and ensures data integrity. A simplified overview of the database structure is presented in Figure 1.1. The two main tables of the schema are Event and Geom, which store disturbance events and the geometries associated with these events, respectively.

Figure 1.1: Simplified Entity Relation Diagram (ERD) of the DEFID2 database
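
The idea of events pointing to geometries can be illustrated with a toy example. The sketch below builds a throwaway in-memory SQLite database with the DBI and RSQLite packages and joins two small tables back into a flat table. The column names (event_id, geom_id, survey_date, wkt) and the direction of the key are assumptions made for illustration only; they are not taken from the actual DEFID2 schema.

```r
library(DBI)

# Throwaway in-memory database; nothing is written to disk
con_toy <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical, heavily simplified tables (not the real DEFID2 schema)
dbWriteTable(con_toy, "geom", data.frame(
    geom_id = c(1, 2),
    wkt = c("POINT (11.5 46.6)", "POINT (10.5 46.6)")
))
dbWriteTable(con_toy, "event", data.frame(
    event_id = c(10, 11, 12),
    geom_id = c(1, 1, 2),
    survey_date = c("2005-06-01", "2006-06-01", "2014-11-15")
))

# Rebuild a flat table by joining each event to its geometry
flat <- dbGetQuery(con_toy, "
    SELECT e.event_id, e.survey_date, g.wkt
    FROM event AS e
    JOIN geom AS g ON g.geom_id = e.geom_id
    ORDER BY e.event_id
")
dbDisconnect(con_toy)
flat
```

In this toy layout two events share one geometry, which is the kind of relation that a single flat file cannot store without duplicating the geometry.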

2 Main package functionalities

2.1 Data loading

The defid2R package eases data access and reconstruction into tabular form. Its core function is read_defid(), which reads data as an sf dataframe and offers multiple filter and attribute selection options. In the example below, a subset covering only Italy and containing host and agent information is returned.

library(defid2R)
## Checking local database version
## ......
## OK
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df <- read_defid(select=c('agents', 'hosts'), country_filter='Italy') %>%
    sample_n(10) %>%
    select(-event_id, -geom_id)
knitr::kable(df, "html")
| survey_date | is_affected | agents | hosts | geometry |
|---|---|---|---|---|
| 2005-06-01 | TRUE | Coleophora laricella | Larix decidua | POLYGON ((11.57057 46.62383… |
| 2006-06-01 | TRUE | Coleophora laricella | Larix decidua | POLYGON ((11.28648 46.79223… |
| 2014-11-15 | TRUE | Thaumetopoea pityocampa | Pinus sylvestris | POLYGON ((10.55242 46.60414… |
| 2003-06-01 | TRUE | Lymantria dispar | Quercus sp.,other deciduous sp. | POLYGON ((11.30087 46.46252… |
| 2017-04-15 | TRUE | Cenangium ferruginosum | Pinus sylvestris | POLYGON ((10.80943 46.63712… |
| 1992-08-15 | TRUE | Orgyia antiqua | Larix decidua,Picea abies | POLYGON ((11.39113 46.82159… |
| 1989-07-15 | TRUE | Oligonychus ununguis | Picea abies | POLYGON ((11.21385 46.75997… |
| 1997-07-15 | TRUE | Glis glis,Sciurus vulgaris | Larix decidua | POLYGON ((10.66192 46.64887… |
| 2015-11-15 | TRUE | Thaumetopoea pityocampa | Pinus sylvestris | POLYGON ((10.56 46.63266, 1… |
| 2017-01-18 | TRUE | Rhabdocline laricis | Larix decidua | POLYGON ((10.83719 46.73887… |
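
Judging from the output above, several agents or hosts recorded for one event appear as a single comma-separated string. For per-species analyses it can be convenient to expand such a column to one row per value. The sketch below does this on a small hand-made table with tidyr::separate_rows(); the toy data and the assumption that a plain comma is the only separator are illustrative and not a documented guarantee of the package.

```r
library(dplyr)
library(tidyr)

# Toy rows mimicking the comma-separated columns shown in the output above
toy <- tibble(
    survey_date = c("1997-07-15", "1992-08-15"),
    agents = c("Glis glis,Sciurus vulgaris", "Orgyia antiqua"),
    hosts = c("Larix decidua", "Larix decidua,Picea abies")
)

# One row per agent; the hosts column is left untouched
toy_long <- toy %>%
    separate_rows(agents, sep = ",")
toy_long
```

The same call with hosts instead of agents would expand the host column. Expanding both at once multiplies rows, so it is usually clearer to expand one column per analysis.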

It is also possible to use a spatial filter to subset the dataset; in the example below, the boundaries of a NUTS level 3 region are used to filter the data. One attribute of the resulting sf dataframe is then visualized using ggplot2.

library(giscoR)
library(sf)
library(ggplot2)

cz_072 <- gisco_get_nuts(resolution = '01', nuts_id = 'CZ072') %>%
    st_as_sfc()

df <- read_defid(select=c('hosts'), st_intersects_filter=cz_072) %>%
    mutate(survey_date=as.Date(survey_date))

ggplot(df) +
    geom_sf(data=cz_072) +
    geom_sf(aes(fill=survey_date, color=survey_date), size=0.2) +
    theme_bw()
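
To make the meaning of an intersection-based spatial filter concrete, the sketch below applies sf::st_intersects() to toy data: a square region and three points, one of which lies outside the region. It only illustrates the predicate itself; how read_defid() applies the filter internally is not shown here, and all object names are made up for the example.

```r
library(sf)

# Toy region: a 1 km square in a projected CRS (EPSG:3035)
region <- st_as_sfc("POLYGON ((0 0, 1000 0, 1000 1000, 0 1000, 0 0))", crs = 3035)

# Toy records: two points inside the region, one outside
records <- st_as_sf(
    data.frame(id = 1:3, x = c(100, 500, 5000), y = c(100, 900, 5000)),
    coords = c("x", "y"), crs = 3035
)

# st_intersects() returns, for each record, the indices of intersecting regions;
# records with at least one match are kept
hits <- lengths(st_intersects(records, region)) > 0
inside <- records[hits, ]
inside$id
```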

2.2 SQL

Users experienced with SQL can also explore the data directly with queries. This approach is facilitated by the SQL engine integrated with R Markdown (used to write this document). It is first necessary to create a database connection object, which is then passed to the connection= parameter of the sql chunk.

db_path <- defid_get_db_path()
con <- defid_get_connection(db_path)

For instance, to quickly find out how many countries are represented in the database, the following query can be used.

SELECT
  COUNT(DISTINCT(country_id)) AS ncountries
FROM
  event;
Table 2.1: 1 records
ncountries
6
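
Queries of this kind can also be sent from regular R code rather than from an sql chunk. The sketch below uses DBI::dbGetQuery() against a throwaway in-memory SQLite database holding a hypothetical two-column stand-in for the event table. Whether the object returned by defid_get_connection() can be passed to dbGetQuery() in the same way is an assumption that is not verified here.

```r
library(DBI)

con_toy <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical stand-in for the event table, reduced to two columns
dbWriteTable(con_toy, "event", data.frame(
    event_id = 1:5,
    country_id = c(1, 1, 2, 3, 3)
))

# Same kind of query as above: number of distinct countries
ncountries <- dbGetQuery(con_toy, "
    SELECT COUNT(DISTINCT country_id) AS ncountries
    FROM event
")

# A natural follow-up: number of events per country
per_country <- dbGetQuery(con_toy, "
    SELECT country_id, COUNT(*) AS n_events
    FROM event
    GROUP BY country_id
    ORDER BY country_id
")
dbDisconnect(con_toy)
```

The results come back as ordinary data frames, so they can be passed on to dplyr or ggplot2 like any other table.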

3 Data exploration

The package contains helper functions to easily reproduce the results presented in the scientific article. For instance, one can quickly get an overview of all datasets by running the defid_datasets() function.

df <- defid_datasets()
knitr::kable(df, "html")
| dataset_code | country | n_records | geom_event_relation_type | method | contributing_organizations |
|---|---|---|---|---|---|
| CH-002 | Switzerland | 144 | substitute point | Field surveys | Waldschutz Schweiz (WSS), Eidgenössische Forschungsanstalt für Wald, Schnee und Landschaft (WSL) |
| CZ-001 | Czechia | 98 | exact | Remote sensing classification | Remote Sensing and Geospatial Analytics Division, GMV Innovation Solutions … Department of Forest Management and Applied Geoinformatics, Faculty of Forestry and Wood Technology, Mendel University in Brno … University Forest Enterprise Masaryk Forest Křtiny, Mendel University in Brno |
| CZ-003 | Czechia | 6883 | exact | Remote sensing classification | Czechglobe - Global Change Research Institute, CAS … Ústav pro hospodářskou úpravu lesů - Forest Management Institute (FMI) |
| CZ-004 | Czechia | 1986 | exact | Remote sensing classification | Czechglobe - Global Change Research Institute, CAS … Ústav pro hospodářskou úpravu lesů - Forest Management Institute (FMI) |
| ES-001 | NA | 1 | NA | NA | UXAFOREST Research Group, University of Santiago de Compostela … GEOINCA Research Group, University of León |
| ES-002 | NA | 1 | NA | NA | UXAFOREST Research Group, University of Santiago de Compostela … GEOINCA Research Group, University of León |
| ES-003 | Spain | 434 | substitute polygon | Field surveys | Laboratori de Sanitat Forestal, Servei d’Ordenació i Gestió Forestal, Conselleria d’Agricultura, Desenvolupament Rural, Emergència Climàtica i Transició Ecològica, Generalitat Valenciana |
| ES-004 | Spain | 3 | exact point | Field surveys | Department of Forest Engineering, University of Córdoba … Instituto de Agricultura Sostenible (IAS), Consejo Superior de Investigaciones Científicas (CSIC), Córdoba, Spain |
| FI-001 | NA | 1 | NA | NA | Department of Forest Sciences, University of Helsinki |
| IT-001 | Italy | 19 | exact point | Field surveys | CREA - Consiglio per la ricerca in agricoltura e l’analisi dell’economia agraria … CREA Research Centre for Plant Protection and Certification, Florence, Italy |
| IT-003 | NA | 1 | NA | NA | DAFNAE-Entomology, University of Padova |
| IT-004 | Italy | 2 | exact point | Field surveys | DAFNAE-Entomology, University of Padova |
| IT-005 | Italy | 1 | exact | Field surveys | DAFNAE-Entomology, University of Padova |
| IT-006 | Italy | 46 | exact | Field surveys | Ufficio Pianificazione forestale della Provincia Autonoma di Bolzano, Universita degli Studi di Padova |
| RO-001 | Romania | 27 | exact | Satellite photointerpretation | Department of Geomorphology-Pedology-Geomatics, Faculty of Geography, University of Bucharest … National Institute for Research and Development in Forestry ‘Marin Drăcea’ (INCDS) |
| RO-002 | NA | 1 | NA | NA | Forestry Faculty, University of Suceava |
| RO-003 | NA | 1 | NA | NA | National Research & Development Institute in Forestry ‘Marin Drăcea’, Craiova Station |
| RO-004 | Romania | 2 | exact | Field surveys | National Research & Development Institute in Forestry ‘Marin Drăcea’, Brașov Station |
| SE-001 | Sweden | 6 | exact | Aerial photointerpretation | University of Lund |
| SE-002 | Sweden | 1 | exact | Field surveys | University of Lund |
| SK-001 | NA | 1 | NA | NA | National Forest Centre - Forest Protection Service … Tatra National Park Research Station, 059 60 Tatranská Lomnica, Slovakia |

The functions defid_agents() and defid_hosts() return frequency statistics on agents and hosts, respectively. Note the use of logarithmic scales to better visualize the highly unbalanced frequencies caused by the overrepresentation of Picea abies and Ips typographus.

df <- defid_agents() %>%
    top_n(10, n_records) %>%
    arrange(n_records)
gg <- ggplot(df, aes(x=n_records, y=name)) +
    geom_bar(stat='identity') +
    scale_x_continuous(
        trans = "log10",
        breaks = c(10, 100, 500, 1000, 5000, 10000, 100000, 1000000),
        labels = scales::comma
    ) +
    scale_y_discrete(limits=rev) +
    ylab('Agent name') +
    xlab('Number of records') +
    theme_bw() +
    theme(axis.text.x=element_text(angle=45,hjust=1))
gg

df <- defid_hosts() %>%
    top_n(10, n_records) %>%
    arrange(n_records)
gg <- ggplot(df, aes(x=n_records, y=name)) +
    geom_bar(stat='identity') +
    scale_x_continuous(
        trans = "log10",
        breaks = c(10, 100, 500, 1000, 5000, 10000, 100000, 1000000),
        labels = scales::comma
    ) +
    scale_y_discrete(limits=rev) +
    ylab('Host name') +
    xlab('Number of records') +
    theme_bw() +
    theme(axis.text.x=element_text(angle=45,hjust=1))
gg
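
The two plots above rely on top_n(), which recent dplyr versions mark as superseded by slice_max(). The sketch below shows the slice_max() form on a hand-made frequency table that reuses the name and n_records column names from the code above; the values are invented. It also prints the log10 of the counts to show how strongly a logarithmic axis compresses an unbalanced range.

```r
library(dplyr)

# Toy frequency table; values are invented for illustration
freq <- tibble(
    name = c("agent A", "agent B", "agent C", "agent D"),
    n_records = c(5, 120000, 40, 900)
)

# Keep the two most frequent entries, most frequent first
top2 <- freq %>%
    slice_max(n_records, n = 2) %>%
    mutate(log10_records = log10(n_records))
top2
```

Unlike top_n(), slice_max() returns rows already sorted in decreasing order of the ranking variable.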

Finally, we can explore the spatio-temporal distribution of the dataset. This is facilitated by the defid_st_distrib() function, which returns a simplified sf dataframe containing the survey_date and, as geom column, the centroid of each reported geometry.

library(tidyr) # For pivoting function

# Define spatial extent
bbox_europe <- st_bbox(c(xmin=-12,xmax=35,ymin=33,ymax=70), crs = st_crs(4326)) %>%
    st_as_sfc() %>%
    st_transform(st_crs(3035))
# Get gisco data
countries <- gisco_get_nuts(year='2021', epsg='3035') %>%
    filter(LEVL_CODE == 0) %>%
    st_crop(bbox_europe)
# create grid
hex_grid <- st_make_grid(countries, cellsize = 152200, square=FALSE) %>% # hex of about 20,000 km^2
    st_sf() %>%
    st_filter(countries)
# Get defid data
defid_points <- defid_st_distrib() %>%
    st_transform(st_crs(3035))

# Build st_dataframe of count within spatial bin
hex_grid$`before 2000` <- sapply(st_intersects(hex_grid, filter(defid_points, survey_date < as.Date('2000-01-01'))), length)
hex_grid$`2000-2009` <- sapply(st_intersects(hex_grid, filter(defid_points, survey_date >= as.Date('2000-01-01'), survey_date < as.Date('2010-01-01'))), length)
hex_grid$`2010-2019` <- sapply(st_intersects(hex_grid, filter(defid_points, survey_date >= as.Date('2010-01-01'), survey_date < as.Date('2020-01-01'))), length)
hex_grid$`2020 and after` <- sapply(st_intersects(hex_grid, filter(defid_points, survey_date >= as.Date('2020-01-01'))), length)

# Reshape sf dataframe for facet grid
ggdf <- pivot_longer(data = hex_grid, cols = -geometry, names_to = 'Period', values_to = 'count')
ggdf$Period <- factor(ggdf$Period, levels = c('before 2000', '2000-2009', '2010-2019', '2020 and after'))
    

gg <- ggplot(ggdf) +
    geom_sf(aes(fill = count), size=0.1, color='grey') +
    geom_sf(data=countries, fill=NA, colour='black', size=0.2) +
    facet_wrap(vars(Period)) +
    theme_bw() +
    #scale_fill_viridis_c(trans='log')
    scale_fill_gradient(name = "Number of records", trans = "log", low='white', high = 'magenta', na.value = "white", breaks=c(0,20, 400, 8000)) +
    theme(legend.position="bottom")
gg
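
The four period columns above are built with one filter per period. As a possible alternative, each record can first be assigned a period label, after which counting per hexagon and period becomes a grouping operation. The sketch below shows only the labelling step with base R's findInterval() on toy dates; the variable names are made up, and the period boundaries are the same as in the code above.

```r
# Period boundaries; each period includes its start date and excludes its end date
breaks <- as.Date(c("2000-01-01", "2010-01-01", "2020-01-01"))
labels <- c("before 2000", "2000-2009", "2010-2019", "2020 and after")

# Toy survey dates, including dates that fall exactly on a boundary
dates <- as.Date(c("1995-06-01", "2000-01-01", "2009-12-31", "2010-01-01", "2021-03-05"))

# findInterval() returns 0 for dates before the first break, hence the + 1
idx <- findInterval(as.numeric(dates), as.numeric(breaks)) + 1
period <- factor(labels[idx], levels = labels)
table(period)
```

Storing the result as a factor with explicit levels keeps the facets in chronological order, as done for ggdf$Period above.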