This section touches on several issues of AI safety. One of the most salient for NLP systems is bias, which comes in two general forms:
- Explicit bias, where the output is overtly toxic (cursing, slander).
- Implicit bias, where the system's output policy shifts with context (e.g., opposition to programs that help certain groups, or skewed probabilities about social roles).

One of the problems with implicit bias is that it is hard to disentangle from expertise. We want a system to make assumptions about the world given partial information, which is all any context will provide. If I say "It's snowing," we want the system to use its understanding that snow usually means shoveling, so that its plan for getting out of the house includes a shovel. In most cases, the system that plans for a shovel will outperform a naive system that doesn't plan to shovel in a snowstorm. But if a location has a heated driveway or walkway, then the naive system will outperform the expert system in that particular case. So bias isn't necessarily bad in and of itself; what matters is knowing what the system's biases are so it can react as new information comes in. (Richard Heuer's Psychology of Intelligence Analysis, published by the CIA, treats this issue extensively.)
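The trade-off above can be made concrete with a toy expected-cost calculation. The function and all of the numbers here are illustrative assumptions (not from the text): a small cost for carrying a shovel, a large cost for getting stuck without one, and a probability that shoveling is actually needed.

```python
# Toy expected-cost comparison: why a biased prior ("snow implies
# shoveling") usually beats a naive policy, yet loses in atypical
# contexts such as a heated driveway. All numbers are illustrative.

def expected_cost(p_shovel_needed, brings_shovel,
                  carry_cost=1.0, stuck_cost=10.0):
    """Expected cost of a policy given the chance shoveling is needed."""
    if brings_shovel:
        return carry_cost  # always pays the small cost of carrying
    return p_shovel_needed * stuck_cost  # pays dearly when wrong

# Typical snowy context: the biased system (brings shovel) wins.
typical = (expected_cost(0.9, True), expected_cost(0.9, False))
# Heated-driveway context: the naive system wins in this case.
heated = (expected_cost(0.0, True), expected_cost(0.0, False))

print(typical)  # (1.0, 9.0)
print(heated)   # (1.0, 0.0)
```

Averaged over typical contexts the biased policy has lower expected cost, which is exactly why knowing the bias matters when an atypical context arrives.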

Beyond general biases about the world, there are many social biases that are actively harmful and demand more rigorous study. These include assumptions about gender, race, sexuality, and other attributes of people. Many of these biases have built up over decades in different social contexts, made their way into texts, and are then learned by language systems such as GPT. They are not trivial to measure even in their explicit form, although some checks for toxicity may help, and they are harder still to measure in their implicit form. However, work done in social science may transfer to AI: since GPT can answer questions, the surveys people use to detect explicit and implicit biases in humans may transfer to studying those biases in GPT. Note that because GPT acts by auto-completing input, studying the bias of the system at rest might not do any good. Instead, the goal of this section is to provide ways to evaluate the bias of the system after it has been primed by prompts.
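One way such a survey-style, prompt-primed evaluation might be structured is sketched below. The templates, roles, and completions are made-up examples, and `model_logprob` is a hypothetical stand-in for a real model query returning log P(completion | prompt); here it is a dummy scorer so the sketch runs end to end.

```python
# Minimal sketch of a template-based implicit-bias probe for a
# prompted language model. `model_logprob` is a placeholder for a
# real API call that returns log P(completion | prompt); the dummy
# below scores by string length only so the example is runnable.

from itertools import product

def model_logprob(prompt: str, completion: str) -> float:
    # Placeholder scorer; a real probe would query the model here.
    return -len(prompt + completion) / 100.0

TEMPLATES = ["The {role} said that", "People agreed with the {role} because"]
ROLES = ["doctor", "nurse"]
COMPLETIONS = ["he was right.", "she was right."]

def probe(templates, roles, completions):
    """Score every (template, role, completion) triple. Large score
    gaps across roles for the same completion suggest an implicit
    association the model has learned."""
    scores = {}
    for template, role, completion in product(templates, roles, completions):
        prompt = template.format(role=role)
        scores[(prompt, completion)] = model_logprob(prompt, completion)
    return scores

results = probe(TEMPLATES, ROLES, COMPLETIONS)
for (prompt, completion), score in sorted(results.items()):
    print(f"{score:+.3f}  {prompt} {completion}")
```

With a real model behind `model_logprob`, comparing scores for "he" versus "she" completions across roles is one simple analogue of the forced-choice items used in human bias surveys.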

Lit Review


Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License