BERT is as Gentle as a Sledgehammer: Too Powerful or Too Blunt? It Depends on the Benchmark

Capshaw, Riley (author): Linköping University, Sweden

Blomqvist, Eva (author): Linköping University, Sweden

Santini, Marina, 1960- (author): RISE, Prototypande samhälle

Alirezaie, Marjan (author): Örebro University, Sweden

2021
English.

Related links: https://spraakbanken...; https://ri.diva-port... (primary) (Raw object); https://urn.kb.se/re...

Conference paper (other academic/artistic)

Abstract

In this position statement, we wish to contribute to the discussion about how to assess quality and coverage of a model. We believe that BERT's prominence as a single-step pipeline for contextualization and classification highlights the need for benchmarks to evolve concurrently with models. Much recent work has touted BERT's raw power for solving natural language tasks, so we used a 12-layer uncased BERT pipeline with a linear classifier as a quick-and-dirty model to score well on the SemEval 2010 Task 8 dataset for relation classification between nominals. We initially expected there to be significant enough bias from BERT's training to influence downstream tasks, since it is well known that biased training corpora can lead to biased language models (LMs). Gender bias is the most common example, where gender roles are codified within language models. To handle such training data bias, we took inspiration from work in the field of computer vision. Tang et al. (2020) mitigate human reporting bias over the labels of a scene graph generation task using a form of causal reasoning based on counterfactual analysis. They extract the total direct effect of the context image on the prediction task by "blanking out" detected objects, intuitively asking "What if these objects were not here?" If the system still predicts the same label, then the original prediction is likely caused by bias in some form. Our goal was to remove any effects from biases learned during BERT's pre-training, so we analyzed total effect (TE) instead. However, across several experimental configurations we found no noticeable effects from using TE analysis. One disappointing possibility was that BERT might be resistant to causal analysis due to its complexity. Another was that BERT is so powerful (or blunt?) that it can find unanticipated trends in its input, rendering any human-generated causal analysis of its predictions useless.
We nearly concluded that what we expected to be delicate experimentation was more akin to trying to carve a masterpiece sculpture with a self-driven sledgehammer. We then found related work where BERT fooled humans by exploiting unexpected characteristics of a benchmark. When we used BERT to predict a relation for random words in the benchmark sentences, it guessed the same label as it would have for the corresponding marked entities roughly half of the time. Since the task had nineteen roughly balanced labels, we expected much less consistency. This finding repeated across all pipeline configurations; BERT was treating the benchmark as a sequence classification task! Our final conclusion was that the benchmark is inadequate: all sentences appeared exactly once with exactly one pair of entities, so the task was equivalent to simply labeling each sentence. We passionately claim from our experience that the current trend of using larger and more complex LMs must include concurrent evolution of benchmarks. We as researchers need to be diligent in keeping our tools for measuring as sophisticated as the models being measured, as any scientific domain does.
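
The consistency check described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names `agreement_rate` and `chance_agreement`, the integer label ids, and the toy predictions are all assumptions made for illustration. The sketch compares how often two sets of predictions agree with how often they would agree if they were independent draws from their own label distributions; with nineteen evenly used labels that baseline is 1/19, far below the "roughly half" reported above.

```python
from collections import Counter


def agreement_rate(entity_preds, random_preds):
    """Fraction of sentences where both predictions carry the same label."""
    if len(entity_preds) != len(random_preds) or not entity_preds:
        raise ValueError("need two non-empty, equally long prediction lists")
    matches = sum(a == b for a, b in zip(entity_preds, random_preds))
    return matches / len(entity_preds)


def chance_agreement(entity_preds, random_preds):
    """Expected agreement if the two predictions were independent,
    each drawn from its own empirical label distribution."""
    freq_a = Counter(entity_preds)
    freq_b = Counter(random_preds)
    n_a, n_b = len(entity_preds), len(random_preds)
    # A Counter returns 0 for labels it has not seen, so labels that
    # appear in only one list contribute nothing to the sum.
    return sum((freq_a[label] / n_a) * (freq_b[label] / n_b) for label in freq_a)


# Hypothetical label ids 0..18 stand in for the nineteen relation labels.
uniform_preds = list(range(19))
print(chance_agreement(uniform_preds, uniform_preds))  # equals 1/19 for evenly used labels

# Toy predictions (assumed data): marked entities vs. random words.
entity_preds = [0, 1, 2, 3]
random_preds = [0, 1, 3, 2]
print(agreement_rate(entity_preds, random_preds))
print(chance_agreement(entity_preds, random_preds))
```

An observed agreement far above the chance baseline is the kind of signal the abstract interprets as the model labeling the sentence rather than the entity pair.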


Subjects: Natural Sciences (Naturvetenskap); Data och informa ...; Language Technology (Språkteknologi)
