Title: | Crawler and Data Scraper for Open Journal System ('OJS') |
---|---|
Description: | Crawler for 'OJS' pages and scraper for article metadata. You can crawl 'OJS' archives, issues, articles, galleys, and search results. You can scrape article metadata from the head tag of the article's HTML, or from Open Archives Initiative ('OAI') records. Most of these functions rely on 'OJS' routing conventions (<https://docs.pkp.sfu.ca/dev/documentation/en/architecture-routes>). |
Authors: | Gaston Becerra [aut, cre] |
Maintainer: | Gaston Becerra <[email protected]> |
License: | GPL-3 |
Version: | 0.1.5 |
Built: | 2025-03-13 05:17:56 UTC |
Source: | https://github.com/gastonbecerra/ojsr |
Takes a vector of OJS (issue) URLs and scrapes the links to articles from the issue's table of contents.
get_articles_from_issue(input_url, verbose = FALSE)
input_url | Character vector. |
verbose | Logical. |
A long-format dataframe with the URL you provided (input_url) and the article URLs scraped (output_url).
issue <- 'https://revistas.ucn.cl/index.php/saludysociedad/issue/view/65'
articles <- ojsr::get_articles_from_issue(input_url = issue)
Takes a vector of OJS URLs and a search-criteria string, composes the search result URLs (including pagination), and scrapes them to retrieve the articles' URLs.
get_articles_from_search(input_url, search_criteria, verbose = FALSE)
input_url | Character vector. |
search_criteria | Character string. |
verbose | Logical. |
A dataframe with the URLs of the articles linked from the OJS search result pages.
journals <- c(
  'https://revistapsicologia.uchile.cl/index.php/RDP/',
  'https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/'
)
criteria <- "actitudes"
search_result_pages <- ojsr::get_articles_from_search(input_url = journals, search_criteria = criteria, verbose = TRUE)
Takes a vector of OJS URLs and scrapes all the galley URLs from the article view.
get_galleys_from_article(input_url, verbose = FALSE)
input_url | Character vector. |
verbose | Logical. |
A long-format dataframe with the URL you provided (input_url), the galley URLs scraped (output_url), the format of the galley (format), and the URL that forces download of the galley (download_url).
article <- 'https://revistapsicologia.uchile.cl/index.php/RDP/article/view/55657'
galleys <- ojsr::get_galleys_from_article(input_url = article)
Takes a vector of OJS URLs and scrapes all metadata written in HTML from the article view.
get_html_meta_from_article(input_url, verbose = FALSE)
input_url | Character vector. |
verbose | Logical. |
A long-format dataframe with the url you provided (input_url), the name of the metadata (meta_data_name), the content of the metadata (meta_data_content), the standard in which the content is annotated (meta_data_scheme), and the language in which the metadata was entered (meta_data_xmllang)
article <- 'https://dspace.palermo.edu/ojs/index.php/psicodebate/article/view/516/311'
metadata <- ojsr::get_html_meta_from_article(article)
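Because the result is in long format (one row per metadata field), it is often convenient to reshape it to one row per article. The following is a minimal base-R sketch on a toy data frame that only mimics the documented columns input_url, meta_data_name, and meta_data_content. The URLs, field names, and values are invented for illustration and are not scraped output.

```r
# Toy data frame shaped like the documented return value
# (all values are made up for illustration).
metadata <- data.frame(
  input_url = c("https://example.org/article/1", "https://example.org/article/1",
                "https://example.org/article/1", "https://example.org/article/2"),
  meta_data_name = c("citation_title", "citation_author",
                     "citation_author", "citation_title"),
  meta_data_content = c("First title", "Author A", "Author B", "Second title"),
  stringsAsFactors = FALSE
)

# Collapse repeated fields (for example several authors) into a single
# string per article and field.
collapsed <- aggregate(
  metadata["meta_data_content"],
  by = metadata[c("input_url", "meta_data_name")],
  FUN = paste, collapse = "; "
)

# Reshape to wide format: one row per article, one column per field.
wide <- reshape(
  collapsed,
  idvar = "input_url",
  timevar = "meta_data_name",
  direction = "wide"
)
```

Fields that an article does not have come out as NA in the wide table, so the reshape works even when articles expose different sets of metadata.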
Takes a vector of OJS URLs and scrapes the issues URLs from the issue archive.
get_issues_from_archive(input_url, verbose = FALSE)
input_url | Character vector. |
verbose | Logical. |
A long-format dataframe with the url you provided (input_url) and the url of issues found (output_url)
journal <- 'https://dspace.palermo.edu/ojs/index.php/psicodebate/issue/archive'
issues <- ojsr::get_issues_from_archive(input_url = journal)
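Since each crawling function returns its results in an output_url column, the functions can be chained from an archive page down to article metadata. The sketch below shows one possible way to do that. The wrapper name crawl_journal_metadata is hypothetical and not part of ojsr, and the sketch assumes every step returns at least one URL.

```r
# Hypothetical wrapper (not part of ojsr): archive -> issues -> articles -> metadata.
crawl_journal_metadata <- function(archive_url, verbose = FALSE) {
  # Step 1: issue URLs listed in the journal archive
  issues <- ojsr::get_issues_from_archive(input_url = archive_url, verbose = verbose)

  # Step 2: article URLs listed in each issue's table of contents
  articles <- ojsr::get_articles_from_issue(
    input_url = unique(issues$output_url),
    verbose = verbose
  )

  # Step 3: metadata from the HTML head of each article view
  ojsr::get_html_meta_from_article(
    input_url = unique(articles$output_url),
    verbose = verbose
  )
}

# Usage (performs live HTTP requests, so it is left commented out):
# metadata <- crawl_journal_metadata(
#   'https://dspace.palermo.edu/ojs/index.php/psicodebate/issue/archive'
# )
```

Deduplicating with unique() between steps avoids requesting the same page twice when an issue or article is linked more than once.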
This function accesses OAI records (within OJS) for any article for which you provide a URL.
get_oai_meta_from_article(input_url, verbose = FALSE)
input_url | Character vector. |
verbose | Logical. |
Several limitations are in place. Please refer to the vignette.
A long-format dataframe with the url you provided (input_url), the name of the metadata (meta_data_name), and the content of the metadata (meta_data_content).
article <- 'https://dspace.palermo.edu/ojs/index.php/psicodebate/article/view/516/311'
metadata_oai <- ojsr::get_oai_meta_from_article(input_url = article)
Takes a vector of URLs, parses them according to OJS routing conventions, and returns the OJS base URL.
parse_base_url(input_url)
input_url | Character vector. |
A vector of the same length as your input.
mix_links <- c(
  'https://dspace.palermo.edu/ojs/index.php/psicodebate/issue/archive',
  'https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/article/view/2903'
)
base_url <- ojsr::parse_base_url(input_url = mix_links)
Takes a vector of URLs, parses them according to OJS routing conventions, and returns the OAI entry URL.
parse_oai_url(input_url)
input_url | Character vector. |
A vector of the same length as your input.
mix_links <- c(
  'https://dspace.palermo.edu/ojs/index.php/psicodebate/issue/archive',
  'https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/article/view/2903'
)
oai_url <- ojsr::parse_oai_url(input_url = mix_links)
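To illustrate what parsing by OJS routing conventions can look like, here is a rough sketch of the two URL parsers. It is not the package's implementation. The helper names sketch_base_url and sketch_oai_url are hypothetical, and the sketch assumes URLs of the form .../index.php/<journal>/<page>/<operation>/... with the OAI endpoint at <base>/oai. Installations that hide index.php are not handled and yield NA.

```r
# Hypothetical helper: keep everything up to the journal path after 'index.php/'.
sketch_base_url <- function(urls) {
  pattern <- "^(.*?/index\\.php/[^/?#]+).*$"
  matched <- grepl(pattern, urls, perl = TRUE)
  out <- rep(NA_character_, length(urls))  # NA for URLs that do not match
  out[matched] <- sub(pattern, "\\1", urls[matched], perl = TRUE)
  out
}

# Hypothetical helper: append the OAI route to the base URL.
sketch_oai_url <- function(urls) {
  base <- sketch_base_url(urls)
  ifelse(is.na(base), NA_character_, paste0(base, "/oai"))
}

mix_links <- c(
  'https://dspace.palermo.edu/ojs/index.php/psicodebate/issue/archive',
  'https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/article/view/2903'
)
sketch_base <- sketch_base_url(mix_links)
sketch_oai <- sketch_oai_url(mix_links)
```

Like the documented functions, both helpers return a vector of the same length as the input, so the results can be stored alongside the original URLs.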