The best books on Educational Testing

recommended by André Rupp

Nearly everyone has had to sit a standardised test at some point in their lives and felt the grip it might hold over their future—and not always in a good way. André Rupp, research director at ETS, the nonprofit company that runs some of the most well-known tests, talks us through what's going on at the forefront of research and the new kinds of tests that are being developed.

  • 1

    Handbook of Item Response Theory (3-volume series)
    by Wim van der Linden (editor)

  • 2

    Principles and Practice of Structural Equation Modeling
    by Rex Kline

  • 3

    Handbook of Test Development
    by Suzanne Lane, Mark Raymond and Thomas Haladyna (editors)

  • 4

    The Skilled Facilitator: A Comprehensive Resource for Consultants, Facilitators, Coaches, and Trainers
    by Roger Schwarz

  • 5

    Hamilton: The Revolution
    by Jeremy McCarter & Lin-Manuel Miranda

André Rupp

André Rupp is Research Director at the Educational Testing Service (ETS) in Princeton, New Jersey.

Do you want to start by saying what educational testing is all about? What is this field that you’re specialised in?

The area of educational assessment is relatively broad. However, most people assume that educational assessment is synonymous with large-scale standardised educational testing. Standardised tests prototypically have a lot of multiple-choice or other selected-response questions and they often have relatively high stakes attached to them—for individual children in schools or for adults seeking licensure or certification for example.

While it is true that a large part of educational assessment is concerned with standardised tests, there are also areas nowadays with a lot of innovative research into the way children learn and the way they engage in more complex tasks, similar to the tasks that people in professional practice would perform. There is substantial innovation around the way these tasks are designed, the way they are scored, and the way feedback is provided.

In principle, educational assessment is about making inferences or interpretations about what the person who engages with a task is thinking about while solving the task and what kind of competencies or skills they are relying on when doing so. It is all about designing the right kinds of tasks or activities or environments for them to be able to demonstrate these kinds of competencies best. Then it is about setting this system up with the right kind of constraints to get data that you can then summarise in reports or for feedback, so that the interpretations that you are making are defensible to the people who have to act upon them.

“It is all about measuring the unobservable characteristics of individuals that you cannot see directly.”

For a large-scale test of English proficiency, say, that could take the form of people writing essays or answering questions about reading passages to learn something about their proficiency in English at a certain grade level. But it could also mean that you have interactive simulation or game-based environments in which learners either work individually or in small groups to set up scientific experiments and manipulate devices on a virtual screen and then write up protocols and make inferences about the scientific phenomena that we are interested in.

In both of these very contrasting cases, you have essentially the same principles at play. You are trying to make sense of what learners do in certain situations and assign scores to parts of what they are doing. You might be scoring the selections that they make on a multiple-choice question, or, in the case of essay writing, you might have human graders or machines looking at the essay responses and identifying patterns of errors or the ways the essay is structured and its content is delivered.

In simulation environments, you have what is sometimes called ‘process data.’ Those are, essentially, the log files that these systems collect in which you can see what sequences of actions the learners perform, how long they spent in certain parts of environments, how what they said in a chat window might go together with what they did in the environment, what numbers they entered, and so on. When they collaborate, you can even have questions about how the interactional patterns among them are shaping the way knowledge is being generated and discussed.

“In general, the understanding of what scores mean, what interpretations can and cannot be made, has become more sophisticated and more nuanced.”

All of that, scientifically speaking, involves a lot of different design decisions that need to be made—about the way the task and the activities are set up, the way scores are developed, the way these activities are delivered in the environments, the way you put that information together to score reports, the way you convince the ‘stakeholders’—the people who have to act upon that information—that what you are saying is trustworthy, generalizable, and usable.

That, in a nutshell, is what the key ideas of educational assessment are about. Different people in different testing organisations, in academic institutions, or in other private settings typically work on different aspects of this problem. I currently work for a large not-for-profit testing company in an area where people in my team, and people we work with, think a lot about how to do the scoring part of these kinds of assessments.

Would you say that educational testing is becoming fairer? There is always the issue that people who practise doing tests do better than those who don’t, so that favours learners whose parents are making sure they practise. Also, I would make a general observation about, say, the GRE—which quite a few of my friends took when they were applying for graduate school—that it seemed to favour people who are good at maths. Is the field becoming more sophisticated in terms of these differences?

That is a good question. I would say that, in general, the understanding of what scores mean, what interpretations can and cannot be made, has become more sophisticated and more nuanced. We have built up a relatively large research base for a variety of assessments including, for example, hundreds of validity studies for the GRE alone.

When you work for a testing company that administers assessments with relatively high stakes for individuals, typically there is a lot of effort dedicated in-house to design various studies that help you understand what you can and cannot say about test scores. You develop lots of resources for people who use test scores such as interpretational guides, test preparation materials, and sets of disclaimers.

In my company’s case, we even undergo internal audits to make sure that key standards that people in professions have articulated are met. That is, in the US there is a very important set of standards, jointly developed by the American Psychological Association, the National Council on Measurement in Education, and the American Educational Research Association. When our company develops an assessment, we do our very best to provide evidence that we meet as many of these standards as we can, with specific emphasis on fairness and accessibility.

The question around fairness also becomes important when you change the assessment activities. When you have learners engaging in collaborative problem solving for interactive scientific assessment tasks, for example, you have to think about fairness in a more complex and differentiated manner than if you are thinking about fairness for scoring essay responses or multiple-choice questions.

What you see is that the innovative edge of assessment is really moving more and more toward digitally delivered performance tasks; measuring complex competencies in adaptive ways; thinking about how characteristics like engagement, motivation, grit, systems thinking and other kinds of complex competencies go together with basic knowledge.

The most innovation in educational assessment typically happens in contexts that are more formative, that help learners learn better, and provide diagnostic feedback. Before you move that kind of innovation into a large-scale assessment context, however, you have to invest a lot of resources to make sure that fairness considerations are taken into account and that you meet all key professional standards.

Shall we go through the books that you’ve chosen and what they bring to the table in terms of the overall picture?

If I may, I will say a few words about how I have chosen the books. I am someone who has a training in formal educational measurement, which is a fancy way of saying the statistical analysis of data that you might get from different kinds of assessment. But I am really an interdisciplinary person and I recognise that most of our work is a blend of scientific rigour and artful practice. I have selected Five Books that are not discipline-specific, but cover a range of the responsibilities and the ways of thinking I need to bring to bear in my job. When I do talk about the books, you will see that they are relatively diverse, and often illustrative of a certain kind of book beyond being useful for a particular application.

So your first choice is the Handbook of Item Response Theory (2016-7). You’d better start by explaining what ‘item response theory’ is.

Item response theory, or IRT, is a framework that statisticians (we call them ‘psychometricians’ in educational and psychological assessment) use a lot. It is, in fact, the predominant framework for taking data from assessments, summarising that data, and reporting scores out to learners, and it is a very powerful framework. It is also a very large framework, and subsumes a lot of different models under its hood. What Wim van der Linden, the editor of this three-volume series, has done that I find so remarkable is that he has updated a single book that he had published several years ago and really brought together a large number of these models under a single umbrella in a coherent and principled fashion.

“Wim van der Linden is one of the smartest people alive working in the psychometric field.”

If you are someone who needs to learn about the range of models that exist out there, what they offer in terms of how they summarise data, how you can make inferences with them, what we currently know about how they should be estimated statistically, how their fit to data should be evaluated, and so on, then you can really get a wonderful sense of the entire space by looking across these three volumes. As a reference framework to have on the shelf, it is really indispensable for anybody who studies these kinds of models. And if you work in educational assessment, and you are somebody who works with the quantitative data, you need to learn about these models. To me, it is a must-have volume.

In addition, Wim van der Linden is one of the smartest people alive working in the psychometric field. As I said, he has been very principled and rigorous and detailed in editing these books, so that sets of chapters have similar kinds of structures and give a similar balance to the different kinds of topics. I admire that kind of editorial and contributory work as someone who has, myself, written and edited three books. I know how much hard work that is—to pull together many people with different styles and different personalities and different ways of expressing their ideas. So I admire this book not only for its content, but also in terms of what it represents as an editorial effort.
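To make the idea of an item response model a little more concrete, here is a minimal sketch of one widely used member of the family, the two-parameter logistic (2PL) model. It is an illustration only and is not taken from the handbook: the item parameters, the response patterns, and the function names are all hypothetical, and the ability estimate uses a crude grid search rather than the estimation methods a psychometrician would use in practice.

```python
import math

def p_correct(theta, a, b):
    """2PL model: probability that a learner with ability `theta` answers
    an item with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern at a given ability value."""
    total = 0.0
    for x, (a, b) in zip(responses, items):
        p = p_correct(theta, a, b)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def estimate_theta(responses, items):
    """Crude maximum-likelihood ability estimate on a grid from -4.0 to 4.0."""
    grid = [i / 10 for i in range(-40, 41)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Hypothetical items as (discrimination, difficulty) pairs: easy, medium, hard.
ITEMS = [(1.0, -1.0), (1.5, 0.0), (0.8, 1.0)]

theta_two_right = estimate_theta([1, 1, 0], ITEMS)  # two easier items correct
theta_one_right = estimate_theta([1, 0, 0], ITEMS)  # only the easiest correct

print("ability estimate, two correct:", theta_two_right)
print("ability estimate, one correct:", theta_one_right)
```

The point of the sketch is the chain of reasoning: the observed data are only right and wrong answers, and the model infers an unobserved ability value from them, taking into account how hard and how informative each item is assumed to be.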

How big is the field? Are psychometricians a huge group, or do you tend to know each other?

It is difficult to put a number on it. There are currently over a thousand members of the National Council on Measurement in Education, for example, which is one of the larger associations that has historically existed.

Nowadays, one of the challenges is that the boundaries of the field are becoming fuzzy. When you think about educational assessment in the way I talked about it earlier, you also have to think about people who are in learning analytics, data science, and educational data mining fields for instance. These are often people who have an interdisciplinary training, many with a strong emphasis in computer science. The numbers are just mushrooming from year to year as these kinds of applications get larger and now we have areas like ‘computational psychometrics’ and very computationally oriented psychometrics programs like the one at the business school in Cambridge.

“Assessment activities or tasks are like scientific instruments. Once you change the instrument, you can ask new questions about the subject that you are studying.”

You also have a large number of different companies and start-ups concerned with educational assessment nowadays. You have companies like ETS, which are historically relatively well established and therefore ‘robust’ in some important ways. For example, we have a relatively large research division compared to many smaller educational assessment companies, with many specialists dedicated to statistics, psychometrics, learning sciences, cognitive science, and so on. But if you go to conferences, you do, of course, repeatedly run into certain key people within your field from across various institutions.

Moreover, when you work in a scientific field, from the outside it often seems holistic and relatively undifferentiated but it typically breaks down relatively quickly into lines of work that people are concerned with. For example, I work in an area called diagnostic measurement, which is an area on which I co-wrote a book. In that community I have 25 or so colleagues who do consistent recognizable work but quite a few more colleagues who occasionally dabble in it.

I love the word ‘psychometric.’ Is it literally about measuring the brain?

Measuring psychological traits, yes. It is all about measuring the unobservable characteristics of individuals that you cannot see directly. The logic is that you design situations—which we often call ‘items’ or ‘tasks’ or ‘activities’ or ‘environments’—in such a way that people, when they interact with them, draw on those skills and give you data—behavioural traces essentially—around the things that they do. They select options. They move around in an environment in a particular way. They write an essay. They give a spoken response. Nowadays you could even measure gestures or facial expressions. You then analyse those data, and infer back from the things that you directly observe to what they might have been relying on when they were doing these kinds of things. It is that chain of reasoning that makes assessment so challenging.

Is testing nowadays more able to bring out the individual qualities which vary from person to person? Traditionally it’s a way of saying, “That person is clever. That person is less clever.”

When you say someone is a little bit more ‘clever’ than another person, then that is essentially a very intuitive way of thinking about what we do whenever we make comparative judgments, but it is not all. In addition, we may say ‘clever’ meaning a certain person is very competent in English writing. They are at the top end of the scale. They are able to write essays that are informationally relevant, are well structured, contain few errors, are on topic, and so on and so forth. People who are not so skilled might make a lot of mistakes. So that intuition is correct. A lot of testing is about comparing people – either rank ordering them or sorting them into different groups. But it is also about evaluating their performance in absolute terms against a particular criterion or standard. Such evaluations can be done along either one conceptual dimension – like global proficiency in reading, mathematics, or science for example – or multiple subcompetencies in these domains.

What we find nowadays is that, as the assessment environments that people engage in become more complex and interactive – and to some degree more open ended – and we open up all this space about how individuals and teams could work on these problems, we essentially have to change the kind of questions that we ask about people. Assessment activities or tasks are like scientific instruments. Once you change the instrument, you can ask new questions about the subject that you are studying. It might be the learners in a particular grade or adults in a particular professional situation. As the questions get more complex, the data analytics get more complex, which means that any of these studies that you have to design to convince yourself that what you are seeing is trustworthy also get more complex.

But I think, nowadays, we are able to capture—in a more authentic and comprehensive way—the abilities learners across a lifespan have, and what sort of non-cognitive factors they bring to bear when they engage in these activities. That is why research in this area is still very much ongoing and the field is continuing to grow. There are things we already know that are very well established, hard facts and truths that you don’t really have to re-question. But there are also a lot of new questions that get asked that have all of these new research efforts attached to them that are worth pursuing.

Let’s go on to your second book, Principles and Practice of Structural Equation Modeling. What’s that about?

Structural equation modelling is another set of statistical models that are very popular in the social sciences. They are often used by people who want to investigate how different kinds of abilities, often called ‘constructs’, relate to one another. People design studies where they give survey instruments, for example, or educational assessments. They then create scores from these. They relate all of these to one another. Essentially, it is a very nice way of taking graphical representations of these relationships, and taking data and quantifying how strong these relationships are. Which variable predicts which other variable? Are there moderating or mediating effects between variables that might influence that relationship? How strong is it? Which direction does it go in? And so on and so forth.

The reason why I chose this book in particular is because it is reflective of a series that I have really come to like. In this series, the publisher is really trying to break down relatively complex information around assessment methods for people who are educated but not yet experts. If you compare that with the item response theory handbook, it is much more accessible and much more at an introductory level.

I think this is such an important kind of work to do in our field. It is the kind of work I identify myself with very much. It is what I call ‘handholding for smart people’. It is the same style in which I co-wrote a book with two colleagues a few years back, and with which Jackie and I have edited our latest handbook. You try to describe the key ideas, the key principles, the key practices in an area at a level where you use technical terms sometimes as well as mathematical equations and graphics but you still talk that all through, step-by-step, so that you do not lose all of the nuance and abstract it so much that you trivialise the ideas.

I think sometimes that colleagues who are scientists think that that is maybe not as valuable, and that it is much more valuable to produce very technical publications in peer-reviewed journals, but I personally think this kind of book is a very important contribution. It turns out that this topic, structural equation modelling, represents a very popular, very important family of models. This particular book is already in its fourth edition, so it has clearly found a lot of people who appreciate it practically.

Can you give a real life example of something the book is talking about?

Imagine you have an application where you are looking at the relationship between different competencies in English language. Let’s say you have three variables: writing competency, speaking competency, and interpersonal communicative competence. You are interested in how these relate to background variables that people bring to bear in assessment. Maybe the kind of educational background that they have, the kinds of households that they come from, or the educational context in which they are learning English. You might also be interested in how certain kinds of non-cognitive factors like motivation, grit, or persistence mediate how they use these competencies to solve tasks. With structural equation modelling you can set up a model where you have these different constructs represented and you can try to see, say, whether one is predictive of the other.
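The kind of question described above can be sketched in a few lines of code. What follows is a deliberately simplified path analysis on simulated data, not a full structural equation model and not an example from the book: it uses observed scores rather than latent constructs, the variable names and effect sizes are invented for illustration, and the paths are estimated with ordinary least squares instead of dedicated SEM software.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Sample covariance of two equally long lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def slope(x, y):
    """Least-squares slope of y regressed on a single predictor x."""
    return cov(x, y) / cov(x, x)

def two_predictor_slopes(x1, x2, y):
    """Least-squares slopes of y regressed on x1 and x2 together,
    obtained by solving the 2x2 normal equations."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

# Simulated learners; the "true" effects (0.5, 0.4, 0.2) are made-up assumptions.
random.seed(0)
n = 2000
background = [random.gauss(0, 1) for _ in range(n)]
motivation = [0.5 * bg + random.gauss(0, 1) for bg in background]
writing = [0.4 * m + 0.2 * bg + random.gauss(0, 1)
           for m, bg in zip(motivation, background)]

a_path = slope(background, motivation)            # background -> motivation
b_path, direct = two_predictor_slopes(motivation, background, writing)
indirect = a_path * b_path                        # effect carried via motivation
total = slope(background, writing)                # overall background -> writing

print("direct effect of background on writing:", round(direct, 3))
print("indirect effect via motivation:", round(indirect, 3))
print("total effect:", round(total, 3))
```

In this linear setting the total effect splits into a direct part and a part that runs through the mediator, which is one way of putting numbers on the arrows in the graphical representations mentioned earlier.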

Let’s go on to book number three, which is The Handbook of Test Development. What’s this about?

The first two books that we have talked about were really about different ways of making a certain technical body of knowledge accessible to different audiences. This next book is about the entirety of the test development process.

When you get a degree in graduate school in the area of educational measurement, it is often very much focused on the statistical models, like the ones in the first two books. One of the advantages is that people that come out of these graduate programmes have really solid and detailed training about how to think about the models, how to estimate them, the relative advantages and disadvantages, and so on. What they often lack is a systemic understanding of what happens when you actually try to do these data analyses in a real life context, where you have to design a test from A to Z.

It is so much more than just scoring. It has to do with making complex decisions about the kind of competencies that you want to measure, the kind of tasks that you want to design, the kinds of reports you want to create. And the kinds of studies that you need to do in order to justify the defensibility of those reports. It is about the kind of computational architecture that you need to set up, the kind of data you see, the Excel spreadsheets, the Word documents, all of that. The kind of skill sets that you need in order to manage that entire process. It is about all the really complex, systemic thinking that needs to go into this.

“If someone…wants to get a sense of the real world of test development through reading a particular book, this is a really great book to have in your hands.”

Equally important are all the resource constraints that this happens under. When you work for an assessment company and you actually have to design, evaluate, deploy, and monitor an assessment, whether that is a traditional large-scale assessment or a more innovative one, you have only so much time, only so much money to spend on certain studies, and typically only so much experience to manage all of these processes, which creates constraints that you have to work under.

Often what people who are trained in educational measurement find is that when they have learned about all these fancy and wonderful models and they come to a testing company, the models that are being used are relatively simple. They are much simpler than they would expect although that is not necessarily a good thing. But it has to do with the fact that the simpler models may create graphics or summaries for you that are easily interpretable, or they do the job well enough for operational reporting purposes so that fine-tuning is not necessary. They may be easier to communicate to the clients who have to use the data.

Or it might be about sample sizes—you do not have enough people for your assessment. You do not have enough items or tasks that you need for a particular kind of competency that you are interested in measuring, so you cannot really do anything reliably with a fancy model yet although you can start to think about how to do it eventually. I think that is often a real wake up call.

In this handbook, the editors have done a really nice job of getting together authors who have written on different aspects of this process. I think if someone is in graduate school and learns about statistical models and wants to get a sense of the real world of test development through reading a particular book, this is a really great book to have in your hands. It does convey a sense of the entire enterprise, with warts and all.

You’re from Germany and working in the United States. When you look around the world, do you get a sense that educational assessment is handled differently in different countries?

I think the United States is certainly a country that is well known for large-scale assessment, which has its advantages and disadvantages. It is where a lot of really strong educational measurement programmes are. A lot of cutting edge research comes out of the United States.

The irony is that it is often done by people that have grown up in very different countries. It is people like me or people from Australia, from Italy, from the Netherlands, from Britain, from Turkey, who do take up jobs in the United States because there is more of a job market. They bring their cultural background and their scientific backgrounds to the table. In that sense, the US is a very attractive place for this kind of work.

One of the big trends that many people are, at some level, familiar with is these international comparison surveys of student achievement that are being done. For example, the PISA survey, which stands for the Programme for International Student Assessment, is one of those international surveys that 30-plus countries participate in every three years. Reading, math, and science are the focal areas and it is essentially a fancy way of summarising the performance of 15-year-olds in these areas and then doing global comparisons of where countries stand.

In the mid-2000s, in my country, there was this belief that the German educational system was very advanced and had produced all these wonderful strong thinkers and doers. We were almost implicitly expected to perform well on this assessment. But when we participated in PISA for the first time that was not at all the case. We were somewhere in the middle or upper middle of the scale on all these competencies. That was known as the ‘PISA shock’.

As a result of that, in Germany, large-scale educational testing got kick-started in the mid-2000s. At the time, I was working at the first national institute for this kind of assessment, the Institute for Educational Progress in Berlin, which still exists today. We had done other studies like PISA before but this was the first time standards-based large-scale assessment was done rigorously on a national scale.

“There is…sometimes, a misperception that assessments are only of a certain, simplistic kind, and therefore any assessment is necessarily a bad thing. I think societally and politically, you have to wrestle with that.”

For better or for worse, it kick-started an entire culture of that kind of assessment in my country. It meant that people had to wrestle with this idea of students being tested at certain intervals, that deficiencies and strengths were being made more public, that money was funnelled into those enterprises now out of state or federal funds, or that certain lines of research were suddenly being advanced to a stronger degree than they had been before. That was a really big change.

Nowadays what we have is a world where, in these large-scale surveys, you also use interactive technologies much more. You have more tablet-based or PC-based delivery. With more and more countries participating, there is also more of an awareness of what those kinds of assessments can do and what learners in those countries are able to do.

I think one of the challenges with all the innovations in assessment is when you get into more impoverished areas of the world, whether that is within the developing world or the developed world. You still have to struggle with access to the technology for assessment although some large-scale surveys have recently gone fully computer-based. I think some countries are very fortunate in that a lot of investments are made and computer labs or tablets are becoming very commonplace. But even in those countries, you have pockets where that is not the case. To your earlier question about fairness, it is always a challenge how to make sure that you get a fair representation of what learners can do, given those simple delivery constraints.

Do you feel it’s a good thing that Germany has moved in that direction? You said it was ‘for better or for worse.’ Do you feel it’s for better?

First of all I should say that I have not followed the politics and the societal implications of this closely since I left my country. However, I feel that creating that kind of an awareness about strengths and weaknesses—and challenging the public and the scientific community to wrestle with issues such as how to measure relevant, novel, and complex competencies while we are competing in a global market—is really beneficial. I informally hear from colleagues that there is a weariness setting in about the amount of assessment that is being done. There is also, sometimes, a misperception that assessments are only of a certain, simplistic kind, and therefore any assessment is necessarily a bad thing. I think societally and politically, you have to wrestle with that.

I am someone who, as a person, strives for authenticity and integrity when I interact with different stakeholders. I think the worst thing that we can do as scientists or as specialists is to oversell or undersell certain kinds of products such as assessments. Our job is to educate our audience as best as we can about what the relative advantages and disadvantages are of doing certain kinds of things around assessments. The hard aspect about this is that the deep answers to these questions are often very complex and nuanced, and we live in a world where people want fast answers. They want simple answers, and they want to make quick judgements. Even with the best intentions, you sometimes have audiences that are just not receptive.

Most of the time, when questions get asked about the value of educational testing, typically people think of ‘institutions’ that are doing certain things. The reality is that it is always ‘individuals’ who have to communicate ideas, even if at the end of the day, an institution releases a pamphlet or an FAQ. When you talk to individuals, most scientists, most teachers, most parents, most students, have very good intentions, and many are really very thoughtful, want to do a good job, and want to understand certain complexities.

As a result, I always think that it is good to think about the human component of all of this and really engage with the human being, ask the appropriate questions, and be willing and open to learn from the other person, though it clearly goes both ways. If people were to do that frequently, I think the understanding about what assessment can and cannot do in certain situations would be much more evolved and much more nuanced and maybe much more representative of what the current state of the art actually is.

Tell me about the next book, your fourth choice, which is The Skilled Facilitator.

The Skilled Facilitator is a book that I have picked because my current job requires management. I am currently a research director at ETS, and that means I have a team of colleagues who work with me on different projects. Even if I were not in that position, a lot of work at this company is interdisciplinary, and so you work with people with all different kinds of backgrounds and training. You have to bring them together and share information, get reasonable buy-in for ideas, for processes, for practices, and that is a hard thing to do.

The current director of the division that I am in at one point referenced the ‘mutual learning’ framework by Roger Schwarz. This particular book is one of the first books in which Roger talks about that framework in detail. It is essentially a very accessible and elegant way of communicating that, in order to work together, you have to have a psychological mindset that is founded on ideas like transparency, curiosity, and compassion, and that you should ground communication and negotiation in those kinds of values and their surrounding culture.

Put simply, rather than being top-down, punitive, secretive, and unnecessarily directive, this kind of mindset really helps teams work together better. It helps you connect with individuals better. It helps you make smarter, more efficient, more effective managerial choices, which, if you think about that whole test development process that we talked about earlier in our third book, is really what you need. I have found this framework to be really, really helpful. I actually have a poster on my door from a workshop that Roger Schwarz and colleagues did to remind people that when we have conversations about ideas, those are the principles, the assumptions, and also the behaviours that we should be guided by. I have seen it work really well.

Incidentally, I always believed in those kinds of values anyway, so for me it was just fine-tuning that, reminding myself to continue to improve myself as a manager, as a director, as a colleague. I really believe that even if you just have a small unit, you need to live by a set of values that are really constructive, that also reflect you, and that you want others to live by when you are not with them. And let’s face it, as a director or manager or coach that is 99% of the time. I personally believe that this is a critical part of how we do our job and are successful, and how we help others be successful.

Now, we’re finally at the last book you’ve chosen, which is Hamilton: The Revolution.

Finally some light reading, right? I chose this book because the work that we do specifically on the innovative edges of educational assessment is, as I said earlier, a mixture of scientific rigour and artful practice. Essentially, it is all about designing under constraints. The design decisions have to permeate everything, from the way you design the activities to the way you design your scoring to the way you design the reporting to the way you design how teams work together to make all of this happen. I am, personally, a big supporter and fan of performing arts: musicals, plays, concerts, comedy. I have seen over 900 shows in my life.

Wow.

In many different countries. When I moved to the East Coast, I eventually got closer to New York City, and that is like paradise for anybody who loves the performing arts of any kind. Every day you could go to several different shows that are phenomenal. Hamilton is a musical, by Lin-Manuel Miranda. I think that most people have heard of it in some way by now.

What he did is, to me, just such an inspiration, and I love what it represents on so many levels. It is a musical that he created based on an inspiration that he had when he read a biography of Alexander Hamilton on a vacation. He thought it would be a hip-hop story. Then, over many years (as you find out if you watch his film or read this book), he created, in many different steps, with many different iterations, many different colleagues, and many different decisions, this engrossing musical that is so different from any other musical that currently exists in the world.

It fuses a variety of musical styles like pop, rock and hip-hop in a beautifully flowing narrative. It teaches you about a part of history. It teaches you about the personal challenges of people who were involved in the history. That makes it really accessible. It is beautifully staged. The music is phenomenal. The lighting is fantastic. I admire the complexity of all of these decisions that had to be made and all of the people who had to come together to make a project like this successful.

If you know anything about Broadway or other kinds of professional theatre, producers will say that most shows lose money and are not profitable. I forget what the exact number is, but I think only 10-15% of musicals ever recoup their initial investment on Broadway. Everybody wants the holy grail like Hamilton. To really change the culture of what it means to be a musical in this way, I find that so inspiring.

For me, going with my wife and seeing this show or any other good show for that matter really lifts me up. It lifts my soul up, and I bring that to my work. I try to take that same spirit into conversations that we have around assessment design or when we write articles, and to really always be an artist while also being a scientist. I think having this kind of creation out there as a landmark is just unbelievable. It is such a wonderful and admirable piece of work, as many others have said. I highly recommend seeing Hamilton and supporting the performing arts.

One of the sadder things about standardised assessment over the years is that the predominant focus has often been on math, science, and reading. STEM (science, technology, engineering, and mathematics) is important, but the inclusion of ‘A’ for the arts (STEAM) is really important, because I feel the arts are such a powerful contributor to how human beings are shaped, what their values are, what their beliefs are. It is how their passions are ignited. It can bring out the best in people. It is important to support that through educational assessments, which is why I really admire innovative assessments where learners have to design certain kinds of artefacts or tools or environments, and we try to measure that, and model it, and give them feedback on it in such a way that it is still assessment and not just a cool exercise.

That’s happening, is it?

Yes, absolutely. It is even happening, to some degree, on these international surveys that I mentioned earlier. It is certainly happening in research projects that have been used in school districts. For example, one of my colleagues has designed a system called InqITS for science. The research team has developed apps for teachers where they can monitor how learners in their classroom are doing and get indicators of engagement and feedback on the learners, while the learners can do interactive tasks that are smartly designed to help them do scientific experimentation.

Then I have colleagues who are doing research on video games; some people call those ‘serious’ educational games. One is called Newton’s Playground, where learners have to design, graphically, innovative solutions that help a ball reach a balloon in an environment with obstacles. It uses understanding of physics to help learners do that, but it has these creative design components. All of this is happening, but it is typically at the edges, so there are quite a few research projects that are funded by the National Science Foundation, MacArthur, the Bill and Melinda Gates Foundation, the Institute of Education Sciences, or start-ups. I think those assessments are a critical part of our future.

Unfortunately, it is understandably not the first thing that people think about when they think of educational assessment, which is more like the standardised, sit-in-a-classroom, paper-and-pencil test, with relatively abstract questions that seem, for many people, disconnected from what adults do in their professions. That is of course partly true, but that association is also partly a shame because that is not the entirety of where the field is or the core of where it is going.

Interview by Sophie Roell
