Evaluation of the extraction of methodological study characteristics …

Scientific Reports volume 13, Article number: 139 (2023)
This paper introduces and evaluates the study.character module from the JATSdecoder package, which extracts several key methodological study characteristics from NISO-JATS-coded scientific articles. study.character splits the text into sections and applies its heuristic-driven extraction procedures to the text of the method and result section/s. When used individually, study.character's functions can also be applied to any textual input. An externally coded data set of 288 PDF articles serves as an indicator of study.character's capabilities in extracting the number of sub-studies reported per article, the statistical methods applied and the software solutions used. Its precision in extracting the reported \(\alpha\)-level, power, correction procedures for multiple testing, use of interactions, definitions of outliers, and mentions of statistical assumptions is evaluated by a comparison to a manually curated data set of the same collection of articles. Sensitivity, specificity, and accuracy measures are reported for each of the evaluated functions. study.character reliably extracts the methodological study characteristics targeted here from psychological research articles. Most extractions have very low false positive rates and high accuracy (\(\ge 0.9\)). Most non-detections are due to PDF-specific conversion errors and complex text structures that are not yet manageable. study.character can be applied to large text resources in order to examine methodological trends over time, by journal and/or by topic. It also enables a new way of identifying study sets for meta-analyses and systematic reviews.
In scientific research practice, many individual decisions are made that affect the scientific quality of a study, and the standards set by journal editors and the research community keep changing. This applies not only to study design but also to the choice of statistical methods and their settings. With new methods and standards, the way research is planned, conducted and presented changes over time and represents an interesting field of research in itself. One aspect to consider is the ever-increasing number of scientific publications coming out each year. Numerous studies have investigated the use and development of statistical techniques in scientific research practice1,2,3,4,5,6,7. Most of these studies used manually coded data from a limited number of articles, journals, topics or time intervals. The selectivity of these samples therefore severely limits the generalizability of the findings to a wider scope. For example, Blanca et al.7 analyzed the use of statistical methods and analysis software solutions in 288 articles (36 articles each from 8 journals), all from a publication period of about one year.
A technology that is suitable for analyzing large amounts of text, and that helps to overcome the problem of small samples in the analysis of scientific research practice, is text mining. Text mining is the process of discovering and capturing knowledge or useful patterns from a large amount of unstructured textual data8. It is an interdisciplinary field that draws on data mining, machine learning, natural language processing, statistics, and more8. It facilitates extraction and unification tasks that cannot be done by hand when the analyzed text corpus becomes large. In addition to rudimentary pattern-matching commands on textual input (regular expressions), many software programs and toolkits provide model-based methods of natural language processing (NLP).
Well-known NLP libraries such as NLTK9 or spaCy10 provide users with a variety of programs for the linguistic evaluation of natural language. This often involves the use of statistical models and machine learning. In contrast, the JATSdecoder package11 focuses on metadata and study feature extraction (in the context of the NISO-JATS format). This extraction is implemented using expert-driven heuristics. Thus, unlike in the aforementioned large multipurpose NLP libraries, no further programming effort is required to perform these specific extractions.
Research on scientific practice can benefit greatly from NLP techniques. Compared to manual coding, an automated identification of study characteristics is very time- and cost-efficient. It enables large-scale and trend analyses, mirroring of scientific research practices, and identification of studies that meet certain methodological requirements for meta-analyses and systematic reviews. In addition, automated plausibility checks and global summaries can support quality management.
In general, most methodological study characteristics (e.g., statistical results, \(\alpha\)-level, power, etc.) are reported in a fairly standardized way. Here, the module study.character from the R package JATSdecoder11 is presented and evaluated as a tool for extracting key methodological features from scientific reports. The evaluation of the built-in extraction functions is performed on a medium-sized collection of articles (N = 287) but highlights the possibilities for mirroring and identifying methodological trends in rather big article collections. Although the use of model-based NLP methods might be appropriate for the study features targeted here, all functions run fine-tuned expert-driven extraction heuristics to achieve robust extraction and traceability of errors. While many NLP libraries can be thought of as a toolbox for a variety of problems, JATSdecoder represents a precision tool for a specific problem.
Scientific research is mostly published in two ways. In addition to a printable version distributed as a PDF file, machine-readable versions are accessible in various formats (HTML, XML, JSON). The PubMed Central database12 currently stores almost five million open-access documents from the biology and health sciences, distributed as XML files and structured using the Journal Article Tag Suite NISO-JATS13. NISO-JATS is an XML tag standard for storing scientific article content without any graphical parameters (website layout, text arrangement, etc.); graphical content is hyper-referenced.
JATSdecoder11 is a software package for the statistical programming language R15. Its function JATSdecoder() converts NISO-JATS-encoded XML documents into a list with metadata, user-adjustable sectioned text and the reference list16. The structured list is very useful for custom search and extraction procedures, as it facilitates these tasks on selectively defined text parts (e.g., section headings, method or results section, reference list).
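To illustrate this entry point, a minimal sketch follows (the file name is a hypothetical placeholder; the exact element names of the returned list are documented in the package manual):

    # install.packages("JATSdecoder")  # if not yet installed
    library(JATSdecoder)
    # convert a NISO-JATS coded XML file into a structured list
    article <- JATSdecoder("PMC0000000.xml")  # hypothetical file name
    str(article, max.level = 1)               # metadata, sectioned text, references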
The algorithms of JATSdecoder were iteratively developed based on the PubMed Central article collection (at that time approximately 3 million native NISO-JATS XML files) and more than 10,000 PDF files from different journals that were converted to XML files with the Content ExtRactor and MINEr (CERMINE)14.
CERMINE is a sophisticated PDF conversion tool which extracts metadata, full text and parsed references from scientific literature in PDF format. The output can be returned as plain text or NISO-JATS encoded content. Compared to a pure text extraction, the transfer into the NISO-JATS format with CERMINE is a great advantage for post-processing. Article metadata can be accessed directly and the text of multi-column blocks is extracted correctly, which is often not the case with the output of other conversion software. Supervised and unsupervised machine learning algorithms enable CERMINE to adapt to the different document layouts and styles of the scientific literature. Large file collections can be converted using batch processing. Thus, with the help of CERMINE, the publications of another large group of publishers can be processed with JATSdecoder.
In addition to the extraction of metadata and study features, JATSdecoder provides some convenient, purely heuristic-driven functions that can be useful for any text-analytic approach. An overview of these functions and their functionality is given in Table 1. All functions are based on the basic R environment and make intense use of regular expressions. letter.convert() unifies hexadecimal and many HTML letters into a Unicode representation and corrects most PDF- and CERMINE-specific conversion errors. For example, more than 20 different hexadecimal characters that encode a space are converted to a standard space, and invisible spaces (e.g., '\u200b') are removed. When extracting text from PDF documents, special characters often cannot be read correctly, as they can be stored in a wide variety of formats. Badly compiled Greek letters (e.g., '5' instead of '\(\chi^2\)') and operators (e.g., '5' instead of '=') are corrected, and a '<=>' is inserted for missing operators (e.g., 't<=>1.2, p<=>0.05' for 't 1.2, p 0.05'). These unifications are important for further processing and facilitate text search tasks and extractions. text2sentences() converts floating text into a vector of sentences. Many not purely digit-based representations of numbers (words, fractions, percentages, very small/large numbers denoted by \(10^x\) or e+x) can be converted to decimals with text2num() (e.g., 'five percent' \(\rightarrow\) '0.05', '0.05/5' \(\rightarrow\) '0.01'). ngram() extracts a definable number of words occurring before and/or after a word within a list of sentences (\(\pm\)n-gram bag of words). The presence of multiple search patterns can be checked with which.term(). The output is either a binary hit vector for each search pattern or a vector of detected search patterns. The functions grep2() and strsplit2() are useful extensions of the basic R functions grep() and strsplit(). grep2() enables the identification and extraction of text using multiple search patterns linked with a logical AND. Compared to strsplit(), which deletes the search pattern when splitting text into pieces, strsplit2() allows the search pattern to be preserved in the output by supporting splits before or after the recognized pattern.
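A minimal sketch of some of these helpers follows (exact argument names may differ slightly from this sketch; the package manual is authoritative):

    library(JATSdecoder)
    x <- "The effect was significant, t 2.1, p 0.04."
    letter.convert(x)          # unify special characters; insert '<=>' for missing operators
    text2sentences("Power was set to .80. Alpha was .05.")  # -> vector of sentences
    text2num("five percent")   # verbal numbers to decimals -> "0.05"
    which.term(x, c("significant", "power"))  # binary hit vector per search pattern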
The study.character module bundles multiple text selection and manipulation tasks for specific contents of the list created by JATSdecoder. It extracts important study features such as the number of studies reported, the statistical methods applied, the reported \(\alpha\)-level and power, correction procedures for multiple testing, assumptions mentioned, the statistical results reported, the analytical software solutions used, and whether the results include an analysis of interacting covariates, mediation and/or moderation effects. All functions use sophisticated, expert-guided heuristics for text extraction and manipulation, developed with great effort and domain expertise. One advantage of the time-intensive development of efficient rules is the robust recognition of a wide range of linguistic and technical representations of the targeted features, as well as a clear assignment of the causes of incorrect extractions. A functional limitation of most study.character functions is that they can only handle English content.
In general, study.character attempts to split a document into four sections (Introduction, Methods, Results, Discussion). The text of the introduction, which explains the theory and describes other work and results, and the discussion section, which contains implications, limitations, and suggestions for future procedures, can easily lead to false-positive extractions of study features that were not actually realized. This also applies to the information in the bibliography. Therefore, in most cases only the method and result sections and the figure and table captions are processed to extract the study characteristics from an article.
It has been demonstrated that study.character's function get.stats() outperforms the program statcheck17 in extracting and recalculating p-values of statistical results reported within an article in both PDF and XML format18. Here, study.character's functions to extract the statistical methods applied, the statistical software used, the number of studies per article, the reported \(\alpha\)-level and power, test direction, correction methods for multiple testing, and mentioned assumptions are evaluated using manually coded data of the study characteristics.
A brief description of the targeted study feature and the implemented extraction heuristic of each function is given in the following section. Minor uniformization tasks are not listed, but can be traced using the source code of each function. The text processing and feature extraction are implemented with basic R functions (e.g., grep(), gsub(), strsplit()) and JATSdecoder’s text processing solutions, which are also based on these basic functions. A main feature of these functions is that they can be used with regular expressions, which makes them very powerful if used wisely. The grep() function performs search queries, gsub() finds and replaces text. Using strsplit(), text input can be split into a vector at text locations that match a search pattern. The search pattern itself is removed.
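As a brief illustration of these base R building blocks (a generic sketch, not code from the package):

    txt <- c("alpha level of .05", "level of significance", "no criterion")
    grep("alpha|significance", txt)                    # indices of matching elements
    gsub("level of significance", "alpha level", txt)  # find and replace text
    strsplit("t(45)=2.1, p=.04", ", ")                 # split into pieces; the pattern itself is removed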
To draw contextual conclusions, researchers use various statistical methods and procedures to process and summarize their data. Although any descriptive as well as inferential method can be considered a statistical method, the focus here is on inferential methods. Inferential methods are based on either simple or more complex models, which also allow differing depths of data analysis and inference. Some of these methods are widespread in the literature (e.g., t-test, correlation, ANOVA, multiple regression), while other techniques are rarely used.
The function get.method() extracts the statistical methods mentioned in the input text. It detects sentences containing a statistical method with a list of search terms that most commonly used procedures share as an identifier (e.g., test, correlation, regression, ANOVA, method, theorem, interval, algorithm, etc.). After lowerization, up to seven preceding words with the identified search term at the end are extracted with ngram() and further cleaned up with an iteratively generated list of redundant words (e.g., prepositions, verbs). Users can expand the possible result space by passing additional search words to the 'add' argument of get.method(). The current heuristic enables the extraction of new, still unknown procedures (e.g., 'JATSdecoder algorithm'), if their name ends with one of the prespecified or user-adjusted search terms. Simple descriptive measures (e.g., mean, standard deviation, proportion) are not extracted, because they are overly common and therefore do not differentiate well. Methods with a specifying term after the search term (e.g., 'test for homogeneity of variances') cannot be identified by get.method() yet.
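A minimal usage sketch (the input sentence is invented for illustration):

    library(JATSdecoder)
    s <- "Group differences were analyzed with a repeated measures ANOVA and a Mann-Whitney U test."
    get.method(s)
    # expand the result space with a user-defined identifier:
    get.method(s, add = "simulation")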
Theoretically, any frequentist decision process requires an a-priori set significance criterion, the \(\alpha\)-level or type-1 error probability. The type-1 or \(\alpha\)-error is the probability of rejecting a correct null hypothesis. Because it has become a widespread standard to work with an \(\alpha\)-level of 0.05, it is often not explicitly stated in practice. Among many synonyms (e.g., 'alpha level', 'level of significance', 'significance threshold', 'significance criterion') and made-up terms (e.g., 'level of confidence', 'level of probability'), it may be reported as a critical p-value (e.g., 'p-values \(<0.05\) are considered significant') and/or with a verbal operator (e.g., 'the \(\alpha\)-error was set to 0.05'), making it difficult to detect and extract reliably. In addition, the \(\alpha\)-level may be reported with a value that has been corrected for multiple testing, which does not lower the nominal \(\alpha\)-level. Another indirect but clearly identifiable report of an \(\alpha\)-error probability is the use of \(1-\alpha\) confidence intervals.
The text of the method and result section/s, as well as the figure and table captions, is passed to get.alpha.error(). Prior to the numerical extraction of the reported \(\alpha\)-level/s, several unification tasks are performed on synonymously used terms for \(\alpha\)-errors and reporting styles. Levels of different p-values that are coded with asterisks are not considered \(\alpha\)-levels. When a corrected \(\alpha\) is reported by a fraction that also contains the nominal value (e.g., '\(\alpha=0.05/4\)'), both values are returned (0.05 and 0.0125). The argument 'p2alpha' is activated by default to increase the detection rate. This option allows the extraction of p-values expressing \(\alpha\)-levels (e.g., 'Results with p-values \(<0.05\) are considered significant'). The output distinguishes between detected nominal and corrected \(\alpha\)-level/s and extractions from \(1-\alpha\) confidence intervals. Since some articles report multiple \(\alpha\)-levels, all detected values are max- and minimized to facilitate further processing.
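A hedged sketch of the call (the example sentence is invented; 'p2alpha' as described above):

    library(JATSdecoder)
    s <- "Results with p-values < 0.01 were considered significant."
    get.alpha.error(s)                   # default: p-values may be read as alpha-levels
    get.alpha.error(s, p2alpha = FALSE)  # disable the p- to alpha-value conversion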
The nominal \(\alpha\)-level refers to a single test situation. When multiple tests are performed with the same \(\alpha\)-level, the probability of obtaining at least one significant result increases with each test and always exceeds \(\alpha\). There are several correction procedures to control the inflation of the \(\alpha\)-error or the false discovery rate when running multiple tests on the same data.
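As a standard illustration of this inflation (not from the article): for \(k\) independent tests, each at level \(\alpha\),

    \(P(\text{at least one significant result}) = 1-(1-\alpha)^k,\)

which for \(k=10\) and \(\alpha=0.05\) already yields \(1-0.95^{10}\approx 0.40\).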
A two-step search task is performed on the text of the methods and results section/s, as well as figure and table captions, by get.multiple.comparison(). Sentences containing any of the search terms 'adjust', 'correct', 'post-hoc' or 'multiple' are further inspected for twelve author names (e.g., 'Benjamini', 'Bonferroni') that refer to correction procedures, as well as four specific procedures (e.g., 'family-wise error rate', 'false discovery rate') that correct for multiple testing (see Online Appendix A for the full list of specified search terms). The output is a vector with all identified authors of correction methods. Common spelling errors (e.g., 'Bonfferoni' instead of 'Bonferroni') are also detected, but returned with the correct name.
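A short usage sketch (the sentence is invented):

    get.multiple.comparison("P-values were adjusted with the Bonferroni procedure.")
    # returns the identified author(s) of correction methods, e.g. "Bonferroni"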
The concept of power describes the probability of correctly rejecting a false null hypothesis given a theoretical (a-priori) or empirical (post-hoc) effect. It can be used to estimate an optimal sample size (a-priori) or as a descriptive post-hoc measure.
get.power() extracts the aimed-for and achieved power value/s that are reported in the full text of the document. Since the term power is used in different contexts, sentences containing certain terms are omitted (e.g., volts, amps, hz). To reduce the likelihood of false positives, detected values that fall outside the valid power range ([0; 1]) are omitted. get.power() unifies some synonyms of power (e.g., \(1-\beta\)) and extracts the corresponding value/s if they fall within the range of 0-1. When \(\beta\)-errors are reported instead of power values, they are converted to power values as \(1-\beta\).
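For illustration (the sentence is invented):

    get.power("An a priori power analysis indicated that a power of 0.80 required 128 participants.")
    # beta-error reports are converted, e.g., "beta was set to 0.2" -> power of 0.8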
Analyses with more than one independent variable can be conducted with or without an interaction effect of the covariates. The term interaction effect refers to any type of interplay of two or more covariates that has a dynamic effect on an outcome. In most research settings, the analysis of interactions is of great interest, as it may represent the central research hypothesis or lead to restrictions and/or reinforcement of the hypothesis/theory being tested. In addition to statistical models that explicitly include an interaction effect, mediation and moderation analyses focus on dynamic effects of covariates on an outcome.
has.interaction() searches the lowerized text of the methods and results section/s for specific search patterns that relate to an interaction effect. To avoid false positive hits when analyzing articles dealing with interactions of organisms instead of variables, sentences containing specific search terms (e.g., social, child, mother, baby, cell) are removed. The output distinguishes between an identified interaction, mediator and/or moderator effect.
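A brief sketch (the sentence is invented):

    has.interaction("We found a significant interaction between condition and time.")
    # the output distinguishes interaction, mediator and moderator effects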
Most research is based on theories that allow a prediction about the direction of the effect under study. Besides several procedures that do not allow a direct conclusion about the direction of an observed effect (e.g., \(\chi^2\)-test, ANOVA), others can be applied to test directed hypotheses (e.g., t-test). Adjusting an undirected test to a directed test increases its power, if the sample and effect size are held constant and the effect is present in the predicted direction.
Sentences containing a statistical result or one of several search terms (e.g., 'test', 'hypothesis') are searched by get.test.direction() for synonyms of one- and two-sided testing and hypotheses (e.g., directed test, undirected hypothesis). To avoid false positives for one-sidedness, sentences containing certain reference words (e.g., paper, page, pathway) are excluded and detected values less than one are omitted.
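A short sketch (the sentence is invented):

    get.test.direction("One-tailed t-tests were used for the directed hypotheses.")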
Since many popular statistical measures are sensitive to extreme values (e.g., mean, variance, regression coefficients), their empirical values may not be appropriate to describe a sample. In practice, there are two popular techniques to deal with extreme values and still compute the desired statistic. Simple exclusion of outliers reduces the sample size and test power, while adjustments towards the mean preserve the original sample size. Both procedures can, of course, distort the conclusions drawn from the data because the uncertainty (variance) is artificially reduced. It is difficult to justify why valid extreme values are manipulated or removed to calculate a particular measure rather than choosing an appropriate measure (e.g., median, interquartile range). On the other hand, outliers may indicate measurement errors that warrant special treatment. A popular measure for detecting outliers is the distance from the empirical mean, expressed in standard deviations.
get.outlier.def() identifies sentences containing a reference word of a removal process or an outlier value (e.g., outlier, extreme, remove, delete), and a number (numeric or word) followed by the term ‘standard deviation’ or ‘sd’. Verbal representations of numbers are converted to numeric values. Since very large deviations from the mean are more likely to indicate a measurement error than an outlier definition, and to minimize erroneous extractions of overly small values, the default result space of the output is limited to values between 1 and 10.
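For example (the sentence is invented):

    get.outlier.def("Responses more than three standard deviations from the mean were removed.")
    # verbal numbers are converted -> 3; the default result space is limited to [1; 10]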
Any statistical procedure/model is based on mathematical assumptions about the sampling mechanism, scaling, the uni- and/or multidimensional distribution of covariates and the residual noise (errors). The underlying assumptions justify the statistical properties of an estimator and a test statistic (e.g., best linear unbiased estimator, distributional properties, \(\alpha\)-error/p-value). There may be serious consequences for the validity of the conclusions drawn from these statistics if the underlying assumptions are violated.
To extract the mentioned assumptions within an article, get.assumption() performs a dictionary search in the text of the methods and results sections. A total of 20 common assumptions related to the model adequacy, covariate structure, missing and sampling mechanisms can be identified (see Online Appendix C for the full list of specified search terms).
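A short sketch (the sentence is invented; the detectable assumptions are listed in Online Appendix C):

    get.assumption("Normality and homogeneity of variances were assessed before the analysis.")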
Statistical software solutions are a key element in modern data analysis. Some programs are specifically designed to perform certain procedures, while others focus on universality, performance, or usability.
To identify the analytic software solutions mentioned in the methods and results sections, get.software() performs a manually curated, fine-grained dictionary search of software names and their empirical representations in text. Tools for data acquisition or other data management purposes are not part of the list. However, they can be tracked down with a vector of user-defined search terms, passed to the 'add' argument. A total of 55 different software solutions can be detected in standard mode (see Online Appendix B for the complete list of specified search terms).
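For example (the sentences are invented; 'add' extends the dictionary as described above):

    get.software("All analyses were conducted with SPSS 25 and the R package lme4.")
    get.software("Data were collected with LimeSurvey.", add = "LimeSurvey")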
Research reports may contain single or multiple study reports. To determine the total number of studies reported in an article, the section titles and abstract text are passed to get.n.studies(). Enumerated studies or experiments are identified, and the highest value is returned. The function returns ‘1’ if no numbering of the studies is identified.
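A brief sketch (the section titles are invented):

    get.n.studies(c("Study 1: Method", "Study 2: Method", "General Discussion"))
    # -> 2; returns 1 if no enumeration is found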
To evaluate the extraction capabilities of study.character, a manually coded dataset serves as reference data. The statistical methods used, the number of studies reported, and the software solutions used were coded by Blanca et al.7 and provided to the author. All articles were manually rescanned for those study characteristics that are extracted by study.character but were not part of the original dataset.
The collection of articles by Blanca et al.7 consists of 288 empirical studies published in eight psychological journals (British Journal of Clinical Psychology, British Journal of Educational Psychology, Developmental Psychology, European Journal of Social Psychology, Health Psychology, Journal of Experimental Psychology: Applied, Psicothema, Psychological Research) between 2016 and 2017.
The absolute frequencies of the identified statistical procedures used in the main analysis by Blanca et al.7 are contrasted with those of study.character. The manually created categories of the statistical methods from Blanca et al.7 are compared to the uncategorized statistical methods extracted using study.character. The search tasks for counting the frequency of articles using a specific category of procedures are implemented with regular expressions. An exploratory view of the entire result space of get.method() is displayed in a word cloud.
To explore the correct/false positive/negative detections by study.character all other extracted features are compared to the manually recoded data. A correct positive (CP) detection refers to an exact match to a manually coded feature within an article. A false positive (FP) refers to an extraction that is not part of the manually coded data. Articles that do not contain a feature and for which no feature has been detected are referred to as a correct negative (CN). Finally, a false negative (FN) refers to a feature that was not detected but was manually identified.
If a target feature is identified multiple times in an article, study.character outputs this feature once. Therefore, the evaluation of the detection rates is carried out at the article level. Since most of the features focused on here can potentially have multiple values per article, the extractions may be fully or partially correct. This can be illustrated by the example of the extraction of the \(\alpha\)-level. If the manual coding revealed the use of a 5% and a 10% \(\alpha\)-level and study.character identifies the 5% and an unreported 1%, this is counted as 1 correct positive, 1 false negative and 1 false positive for this article. It follows that the number of correct (CP+CN) and total decisions (CP+FN+CN+FP) may be larger than the total number of articles analyzed.
Global descriptive quality measures (sensitivity, specificity, accuracy) are reported for every extracted feature.
Sensitivity refers to the proportion of correctly detected features within all features present (CP+FN).
Specificity refers to the proportion of correct non-detections within all articles that do not contain the searched pattern (CN+FP).
Finally, accuracy is the proportion of correct detections (CP+CN) within all existing features and non-existing features (CP+FN+CN+FP).
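Expressed as formulas, these definitions correspond to the standard confusion-matrix measures:

    \(\text{sensitivity} = \frac{CP}{CP+FN}\)
    \(\text{specificity} = \frac{CN}{CN+FP}\)
    \(\text{accuracy} = \frac{CP+CN}{CP+FN+CN+FP}\)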
Absolute frequency tables of manual and automatic detections are presented for each characteristic, and the causes of the observed deviations are discussed.
The 288 articles in the raw data provided by Blanca et al.7 were manually downloaded as PDF files. The PDF files were converted to NISO-JATS encoded XML using the open-source software CERMINE14 before being processed with study.character. Since the compilation with CERMINE can lead to various errors (text sectioning/structuring, non-conversion of special characters), this can be considered a rough test condition for the evaluated functions. All processes are performed on a Dell 4-core machine running Linux Ubuntu 20.04.1 LTS and the open-source software R 4.015. To enable multicore processing, the R package future.apply19 is used. The word cloud of the identified methods is drawn using the wordcloud20 package.
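A hedged sketch of this pipeline (directory names and the jar file name are placeholders; CERMINE's ContentExtractor is invoked on a directory of PDFs and writes XML next to them):

    # 1) PDF -> NISO-JATS XML with CERMINE (shell command; jar name is a placeholder):
    #    java -cp cermine-impl.jar pl.edu.icm.cermine.ContentExtractor -path ./pdf
    # 2) extract study characteristics from the converted files in parallel:
    library(JATSdecoder)
    library(future)
    library(future.apply)
    plan(multisession)
    files <- list.files("pdf", pattern = "\\.cermxml$|\\.xml$", full.names = TRUE)
    characters <- future_lapply(files, study.character)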
The extraction properties and causes of deviations from the manually coded study features are given in the following section for each function. A total of 287 articles are included in the analyses, as the Blanca et al.7 data contain one article twice.
It should be noted that the extractions of statistical methods and software solutions from Blanca et al.7 are not directly comparable to the output of study.character, as Blanca et al.7 coded only the statistical methods used in the main analyses (rather than every mention) that are explicitly reported to have been used for these main analyses.
An insight into the overall result space of the statistical methods extracted by study.character is given in Fig. 1, where the frequency table of the extractions is shown as a word cloud. Bigger words indicate higher frequencies. It is obvious that correlation analysis and ANOVA are the most frequently mentioned methods in this article selection.
Figure 1. Word cloud of the statistical methods extracted by study.character.
In order to compare the extractions of get.method() with the main-analysis procedures coded by Blanca et al.7, the absolute frequencies of the detected studies using a specific class of methods are listed in Table 2. The regular expressions listed are used as search terms to count the hits of get.method() per categorized method.
Because Blanca et al.7 coded the statistical method used in the main analysis (all methods reported in preliminary analyses or manipulation checks, footnotes, or in the participants or measures sections were not coded), most methods are more commonly identified by get.method(). Two rare categories cannot be identified at all with the search terms used ('correlation comparison test', 'multilevel logistic regression').
The large differences in most of the identified methods (e.g., descriptive statistics, correlation, \(\chi^2\)-statistics) are due to the different inclusion criteria (each mentioned method vs. method of main analysis). In addition, using 'regression' as a search term in the output of get.method() also results in hits when more complex regression models were found (e.g., multilevel or multivariate regression), whereas Blanca et al.7 consider simple regression models and more specific regression models to be disjoint.
Table 3 shows the sensitivity, specificity, and accuracy measures for study.character’s extractions based on the manually coded data. Most of the extractions work very accurately and can replace a manual coding.
Except for the \(\alpha\)-level detection with 'p2alpha' activated, all extractions have low false positive rates. In default mode, the empirical sensitivity of all extractions is above 0.8 and the specificity above 0.9. Since there are usually very few false positive extractions, five specificity measures reach 1.
Accuracy is lowest for the \(\alpha\)-level detection (0.86 with 'p2alpha' deactivated, 0.9 in default mode) and the extraction of statistical assumptions (0.9). The accuracy of all other extractions is above 0.9. The binarized outputs for the extracted interactions and the stated assumptions have higher accuracy than the raw extractions.
Although most of the studies examined make use of inferential statistics, only 78 (27%) explicitly report an \(\alpha\)-level. In all cases where no \(\alpha\)-level is reported, the standard of \(\alpha=5\%\) is applied, but this is not considered an extractable feature. Since some studies report the use of multiple \(\alpha\)-levels, the total number of detected and undetected \(\alpha\)-levels exceeds the number of articles. Eight articles report the use of both a 90% and a 95% confidence interval.
The absolute frequencies of \(\alpha\)-levels extracted from \(1-\alpha\) confidence intervals by study.character and by the manual analysis are shown in Table 4. study.character correctly extracts the \(\alpha\)-value in 105 out of 126 (83%) confidence interval reports in 97 out of 118 (82%) articles. No false positive extraction is observed. Seven non-detections by study.character are due to CERMINE-specific conversion errors in figure captions, 11 to the non-processing of column names and table content. Two reports of confidence intervals cannot be recognized due to unusual wording ('95% confidence area', 'confidence intervals set to 0.95'), one due to a report in the unprocessed discussion section.
Corrected \(\alpha\)-levels cannot be well distinguished from uncorrected \(\alpha\)-values. Only one out of eight corrected \(\alpha\)-levels is correctly labeled and extracted by study.character; one is a false positive detection of a nominal \(\alpha\). The extracted nominal \(\alpha\)-levels contain three of the manually extracted corrected \(\alpha\)-values.
For simplicity, the extracted nominal and corrected \(\alpha\)-levels are merged with the extractions from the confidence intervals and reduced to their maximum value, which corresponds to the nominal \(\alpha\)-level. Table 5 shows the frequency distribution of the extracted maximum \(\alpha\)-level with the conversion of p- to \(\alpha\)-values deactivated and with the default setting.
The conversion procedure for p-values increases the accuracy of the \(\alpha\)-level extraction, but brings one additional false positive extraction, which is caused by a statistical test result reported with a standard threshold of p-values (\(p<0.05\)). Thus, enabling the conversion of p- to \(\alpha\)-values slightly increases the false positive rate of explicitly reported \(\alpha\)-levels, especially for rather rarely applied levels (0.1, 0.01 and 0.001).
Since test power can be reported as both an a-priori and an a-posteriori result, some articles contain multiple power values. The absolute distribution of categorized power values found by study.character and by manual coding is shown in Table 6. The evaluation of the categorized power values differs from the results in Table 3 because here, four unrecognized values in articles with several power values of the same category are evaluated as fully correct. There are two false-positive extractions, caused by a poorly compiled table and by a citation of Cohen's recommendation to plan studies with at least 80% power21. Both errors occur in documents that contain other correctly extracted power values. Overall, 61 of 73 …


