---
title: "Comparing Search Strings"

author: ""

date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Comparing Search Strings}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  eval = any(dir.exists(c("working_example_data", "benchmark_data", "new_benchmark_data", "topic_data", "valid_data", "new_stage_data"))),
  comment = "#>",
  warning = FALSE,
  fig.width = 8,
  fig.height = 6
)
```

## About this vignette

CiteSource provides three custom metadata fields for labeling citation records: `cite_source`, `cite_label`, and `cite_string`. Most workflows use `cite_source` to identify the database and `cite_label` to track the review stage (search, screened, final). The `cite_string` field provides a third dimension for cases where you need to distinguish between variations of a search strategy within the same source.

The most common use case is **within-source string comparison**: you are testing multiple query formulations in a single database before finalizing your search strategy, and you want to compare how each performs without conflating the query variation with the source identity. Encoding the variations as separate `cite_source` values would work, but you would lose the ability to aggregate results at the database level. Using `cite_string` keeps the database identity intact and adds a separate axis of analysis.
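The two axes can be illustrated with a minimal base R sketch. The `hits` data frame below is hypothetical toy data, not CiteSource output; it only shows why keeping the database in `cite_source` and the query variation in `cite_string` allows counting at either level.

```{r}
# Hypothetical toy data: one row per (record, string) hit; not CiteSource output.
hits <- data.frame(
  record_id   = c("r1", "r1", "r2", "r3", "r3", "r4"),
  cite_source = "WoS",
  cite_string = c("string 1", "string 2", "string 1",
                  "string 2", "string 3", "string 3")
)

# Database-level view: distinct records per source, ignoring query variation.
per_source <- tapply(hits$record_id, hits$cite_source,
                     function(x) length(unique(x)))

# String-level view: distinct records per query variation.
per_string <- tapply(hits$record_id, hits$cite_string,
                     function(x) length(unique(x)))

per_source
per_string
```

Had the variations been stored as separate `cite_source` values instead, the database-level total would have to be rebuilt by parsing the source names.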

In this example, five search strings were run in Web of Science. We use `cite_source` to record the database and `cite_string` to label each query variation, then compare their performance against a set of benchmark studies.

## Installation and setup

```{r, results = FALSE, message=FALSE, warning=FALSE}
#install.packages("CiteSource")
library(CiteSource)
```

## Import citation files

```{r}
file_path <- "../vignettes/new_benchmark_data/"
# Anchor the pattern with `$` so only files ending in .ris are matched
citation_files <- list.files(path = file_path, pattern = "\\.ris$", full.names = TRUE)
citation_files
```

## Assign metadata using all three fields

The key difference from a standard import is that `cite_source` is the same database ("WoS") for all search strings, while `cite_string` differentiates the query variations. The benchmark file gets `cite_source = NA`, `cite_label = "benchmark"`, and `cite_string = NA`.

```{r}
imported_tbl <- tibble::tribble(
  ~files,              ~cite_sources,  ~cite_labels,  ~cite_strings,
  "benchmark_15.ris",  NA,             "benchmark",   NA,
  "search1_166.ris",   "WoS",          "search",      "string 1",
  "search2_278.ris",   "WoS",          "search",      "string 2",
  "search3_302.ris",   "WoS",          "search",      "string 3",
  "search4_460.ris",   "WoS",          "search",      "string 4",
  "search5_495.ris",   "WoS",          "search",      "string 5"
) |>
  dplyr::mutate(files = paste0(file_path, files))

raw_citations <- read_citations(metadata = imported_tbl, verbose = FALSE)
```

## Deduplicate and create comparison data

```{r}
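# Merge duplicate records across the imported files
unique_citations <- dedup_citations(raw_citations)
# Count unique and shared records for the contribution plot and tables
n_unique         <- count_unique(unique_citations)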

# Compare by string rather than source
string_comparison <- compare_sources(unique_citations, comp_type = "strings")
```

## Review initial record counts

```{r}
initial_records <- calculate_initial_records(unique_citations)
create_initial_record_table(initial_records)
```

## Visualize overlap between strings

### Upset plot by string

The upset plot shows how records are distributed across string combinations. This tells you which strings are finding records the others miss and how much overlap exists between query variations.

```{r, fig.alt="Upset plot showing overlap between five search string variations run in Web of Science."}
plot_source_overlap_upset(string_comparison, groups = "string", decreasing = c(TRUE, TRUE))
```

### Heatmap by string

The heatmap provides a pairwise view of overlap between strings, either as raw counts or as percentages.

```{r}
plot_source_overlap_heatmap(string_comparison, cells = "string")
plot_source_overlap_heatmap(string_comparison, cells = "string", plot_type = "percentages")
```

## Compare string contributions

`plot_contributions()` shows unique and shared record counts for each string. Strings with a high proportion of unique records are contributing coverage that the other strings miss; strings with mostly shared records may be redundant.

```{r}
plot_contributions(n_unique, facets = cite_string, center = TRUE)
```
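The unique versus shared distinction can be sketched on toy data. The indicator matrix below is hypothetical, and treating a record as unique when exactly one string retrieved it is the working definition assumed for this illustration.

```{r}
# Hypothetical record-by-string indicator matrix (1 = retrieved).
found <- matrix(
  c(1, 0, 0,
    1, 1, 0,
    1, 1, 0,
    1, 1, 1,
    0, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("string 1", "string 2", "string 3"))
)

# Assumed definition: a record is unique when exactly one string retrieved it.
is_unique <- rowSums(found) == 1

contribution <- data.frame(
  cite_string = colnames(found),
  unique      = colSums(found[is_unique, , drop = FALSE]),
  shared      = colSums(found[!is_unique, , drop = FALSE])
)
contribution$pct_unique <- round(
  100 * contribution$unique / (contribution$unique + contribution$shared), 1
)
contribution
```

In this toy case "string 2" contributes no unique records, which is the pattern that would flag a string as a candidate for removal.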

## Benchmark coverage by string

Filtering to the benchmark records and using the record-level table shows exactly which benchmark studies each string found and which were missed entirely.

```{r}
unique_citations |>
  dplyr::filter(stringr::str_detect(cite_label, "benchmark")) |>
  record_level_table(return = "DT")
```
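The same information can be summarized as a recall figure per string. The sketch below uses a hypothetical `toy` data frame rather than `unique_citations`, and it assumes that merged `cite_label` and `cite_string` values are comma-separated after deduplication; check the actual column format before adapting it.

```{r}
# Hypothetical deduplicated records. Comma-separated merged labels are an
# assumption about the format, not confirmed CiteSource output.
toy <- data.frame(
  record_id   = paste0("b", 1:4),
  cite_label  = c("benchmark, search", "benchmark, search",
                  "benchmark", "benchmark, search"),
  cite_string = c("string 1, string 2", "string 2",
                  NA, "string 1, string 2, string 3")
)

benchmark <- toy[grepl("benchmark", toy$cite_label), ]

# Split each record's merged string labels into individual labels.
found_by <- strsplit(benchmark$cite_string, ",\\s*")
strings  <- c("string 1", "string 2", "string 3")

# Recall per string: proportion of benchmark records the string retrieved.
recall <- vapply(
  strings,
  function(s) mean(vapply(found_by, function(x) s %in% x, logical(1))),
  numeric(1)
)

# Benchmark records that no string retrieved.
missed <- benchmark$record_id[is.na(benchmark$cite_string)]

recall
missed
```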

## Detailed contribution table by string

```{r}
detailed_records <- calculate_detailed_records(unique_citations, n_unique)
create_detailed_record_table(detailed_records)
```

## When to use cite_string vs cite_source

| Scenario | Recommended field |
|---|---|
| Different databases (PubMed, Scopus, WoS) | `cite_source` |
| Same database, different query variations | `cite_string` |
| Hand searching or citation chasing alongside database searches | `cite_string` (method) + `cite_source` (target) |
| Tracking records through review stages | `cite_label` |

For most reviews, `cite_source` and `cite_label` are sufficient. `cite_string` becomes valuable when you are doing pre-search validation with multiple query variants, or when you want to distinguish supplementary search methods from the primary database searches while keeping both associated with the same source.
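As a sketch of the supplementary-methods scenario, the metadata could be laid out as follows. The file names, the "Journal X" source, and the method labels are all hypothetical; only the column names follow the import table used earlier in this vignette.

```{r}
# Hypothetical file names and labels, shown only to illustrate the layout.
supplementary_tbl <- data.frame(
  files        = c("wos_database.ris", "wos_citation_chasing.ris",
                   "hand_search.ris"),
  cite_sources = c("WoS", "WoS", "Journal X"),
  cite_labels  = "search",
  cite_strings = c("database search", "citation chasing", "hand search")
)

# Source-level counts keep both WoS files together.
table(supplementary_tbl$cite_sources)

# String-level counts separate the search methods.
table(supplementary_tbl$cite_strings)
```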