December 16, 2014

DANCE STATS

On November 7th and 8th, 2014, I explored my very first “dance experiment” in front of a live audience.  Here’s how it all went down.  (Note: if you're like "snooze, I was there I know what's going on" you can skip to the results.  I made that header nice and big so you will see it as you scroll.)

The Premise
The audience views one minute of dance work.  They rate it on a scale of 1 to 10, where 1 is “I would not like to see this” and 10 is “I would pay to see an evening of this.”  There are sixteen one-minute works.  Based on audience responses, a regression is fitted to the data to predict what that audience is responding to.  The performers create an optimized piece of work to present as the “finale.”

Hypothesis
the show flyer was on engineering paper. it's the little things.
My hypothesis, going into this project, was that there are so many nuanced factors that affect an individual’s perception of a piece of art, and there are so many different opinions, that we would find no statistically significant factors that influence an audience.

First, a mini-primer on Experimental Design
At Northwestern, one of my favorite classes in the Industrial Engineering department was a class called Statistical Design of Experiments, and in it we learned a statistical method for testing the impact of a large number of variables using a relatively small number of resources. (Doesn't that sound like an amazing class?? Nerd alert.)

Glossary
run = a one-minute dance piece
factors = controllable variables = the things we can change in a piece of choreography
output = results = what the audience wrote down
main effect = how a factor affects the audience
2nd-order interaction = how two factors work together to change the audience's opinion
(example: if factor A is unison movement and factor B is number of dancers, a positive 2nd-order interaction between the two means the audience likes unison movement when there are many dancers on stage; when there are only a few dancers on stage, they prefer to see non-unison movement.) 
The sixteen runs (pieces of dance) were created to fit a 2^k experimental design with eight different factors.  A 2^k factorial design basically means that you have k factors, and they are each at 2 levels.  We’ll call these “high” and “low” levels, noted by a + or a -.  (All of the factors themselves are explained in full, later, so don’t fret.) I chose a 2^(8-4) resolution IV fractional factorial design, because I wanted to keep the number of runs down while keeping the number of factors tested relatively high; additionally, in this design main effects overlap only with 3rd-order (and higher) interactions, never with 2nd-order interactions.  I assumed that main effects and 2nd-order interactions might be significant, but our results probably wouldn’t be more complex than that.  You don't have to worry too much about what that means, but if you must know, you can read a lot more about experimental design and specifically 2^k experiments here and here and here for fuller explanations of the math behind the design.  I’m guessing many of you don’t really care about fractional factorial experiments as deeply as I do, so we’ll leave that reading extracurricular.

i'll keep adding photos when this gets dense.
The 2^(8-4) resolution IV design was also chosen for experimental logistics.  Artistically, I chose one-minute dance pieces because it felt like enough time to get “into” a piece, but not enough time to get bored. Logistically, I chose the 2^(8-4) design because it only requires 16 runs; I figured 16 one-minute segments was an acceptable amount of time for an audience to sit, while 8 would have been super short (and statistically limiting) and 32 might have felt long or confusing.

Typically, with a high number of factors (eight is considered high), it is helpful to have multiple replicates of an experiment.  This means for each level of settings (or for our purposes, each run, each dance piece), you need to see the output multiple times.  Fortunately for us, this is built in!  Each audience member records his or her own response, so we have as many replicates as there are audience members. 

For each of the sixteen runs, I assigned a level for each factor A, B, C, and D, and used the aliases E = BCD, F = ACD, G = ABC and H = ABD to assign the remaining factors for the 2^(8-4) resolution IV design.  Here’s what it looked like.
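For the data nerds: a design table along these lines can be regenerated in a few lines of Python. This is a minimal sketch, not the actual run sheet from the show. It assumes +1/-1 coding for high/low, and the row order it prints is just whatever `itertools.product` produces, which may not match the order the pieces were performed in. The names `build_design`, `BASE`, and `GENERATORS` are made up for this illustration.

```python
from itertools import product

BASE = "ABCD"  # factors whose levels are assigned directly
# Remaining factors are products of base factors (the aliases from the text).
GENERATORS = {"E": "BCD", "F": "ACD", "G": "ABC", "H": "ABD"}


def build_design():
    """Return 16 runs, each a dict mapping factor letter to +1 or -1."""
    runs = []
    for levels in product((-1, 1), repeat=len(BASE)):
        run = dict(zip(BASE, levels))
        for name, word in GENERATORS.items():
            sign = 1
            for letter in word:
                sign *= run[letter]  # e.g. E = B * C * D
            run[name] = sign
        runs.append(run)
    return runs


design = build_design()
for number, run in enumerate(design, start=1):
    row = " ".join("+" if run[f] == 1 else "-" for f in "ABCDEFGH")
    print(f"{number:2d}  {row}")
```

Each printed row is one run, with a + or - for factors A through H in that order.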




For this experiment, I kept it really simple and ran the experiment in order, from run 1 to run 16.  More on that later.


Assumptions & Limitations
We are operating under several assumptions, and there are also limitations in the experiment itself.  Here they are.

1. It’s all my choreography.  You can only get so much differentiation in one person’s work, and while I like to consider my tastes pretty diverse, it is necessarily an extremely limited sample of choreography as a whole.

2. We stayed in the realm of “contemporary modern dance.”  I chose to stay within one (albeit very general) “style” of dance for a couple of reasons.  First, it’s what I do.  (Other than a significant amount of tap dance choreography in college – hey TONIK! – and dorky hip-hop mostly for eleven-year-olds, that is.)  Secondly, though, it also brings the performers’ talents to the same level.  I had an amazing cast of dancers; the differences in their strengths/preferences may have been more apparent to the audience (and therefore may have affected results) if we introduced different styles of dance. 

3. The order wasn’t randomized.  We ran the show in the same order every time, largely because it was much easier on the performers.  As audiences were filling out their forms, they were understandably using the first few runs to “calibrate” their results.  The first few runs will skew to the middle for most audience members simply because they haven’t seen enough to be really comparing them yet.  A few of the audience members called this out on their response sheets:

“I would have rated differently, in retrospect.” – regular dance watcher, age 29

I did create a “normalized” version of each audience member’s responses to combat this, which I’ll explain later.

4.  We’re only measuring what we’re measuring.  This is a limitation inherent in any of these types of experiments.  Experimenters do their best to select and test variables that can be controlled and that likely have an effect on the outcome.  I did my best here, too.  As we began rehearsing, though, there were already factors I wanted to change and add.  And I know there could be factors out there that I never even considered that were affecting the outcome, but that I wasn’t measuring. 

But anyway, onward with our experiment!

Artistic Goals
I get asked all the time, “where did that idea come from?”  The answer is a meandering one. 

First off, I was poking fun at what I see as a division in the dance world.  On one side, there’s academic modern/postmodern/experimental dance.  These folks have devoted their careers and lives to learning about the language of movement, how we interpret things, the cultural and gender-specific implications and ways of viewing, etc.  I have the utmost respect for dance scholars and the work being done at the academic level, and I think it's crucial for artists to also be scholars. I also think (blasphemous, I know) that parts of that world and that mode of thinking can be a little ...silly.

On the other side, there’s the dance competition/reality TV/acrobatics/cheesier musical theatre side of the dance world, where how high you can kick and how you can throw your partner around are marks of success and unison jazz hands are a given.  There’s nothing wrong with that – it’s often physically impressive, and I’m happy to get dance in front of larger audiences even as weekday-night TV entertainment.  As an exposure to the art form, though, it’s extremely narrow, and I do think it sometimes undervalues audiences and what they want to see.  It can be an effective “gateway drug” to the dance world (“ooh, I saw that choreographer on So You Think You Can Dance, let’s get tickets to her show in NYC!”), but I believe producers and audiences themselves might be surprised by how much higher the average dance-watcher’s threshold is for the art of movement. 

I consistently find myself at what I see as the crossroads of these two extremes (as I suspect many choreographers do), and this project was a way to play with that, and potentially strip down some of the style elements (lights! music! costumes! stars!) that affect a viewing of dance.  We’re forcing folks to look at the movement, and it is my wager that audiences dig it.

Also, when I go see a dance work, I can easily identify choreographic choices that I respond to emotionally (musicality/rhythmic interest is a big one); however, speaking intelligently about the work is difficult with friends and colleagues of mine who don’t share my educational background.  There are so many ways of talking about dance, so many frames to put on it, and so many layers of experience.  This isn’t unique to dance, of course—this happens with any art form, or arguably, any subject at all.  My goal in creating this project was to illuminate for audiences what they’re responding to, and to provide different ways of viewing/talking about dance, particularly for non-dancers.  This, again, is a pitfall I see in art all over the place – a high percentage of dance audiences are often made up of dancers.  And isn't that boring?  How do we bring more diverse points of view into the dance process?  If I can give even one person a more specific vocabulary for speaking about dance, I think this project is a win.


The Factors
Right. Back to the nitty-gritty.  

As I started coming up with a list of potential factors for the experiment, I created a big ol’ brainstorm and wrote out three notebook pages full of things I considered influential in dance viewing.  Then I combed back through the list and began categorizing the factors.  I also rolled similar factors together to create more universal/applicable versions (for example, I chose to combine “pedestrian movements” and “gestural movements”).  Three categories emerged: movement factors, which were about the actual vocabulary each performer was using, speed, dynamics of the movement, etc; compositional factors, which dealt with how movement was put together for the stage; and musical factors, which I included because I expected the choice of music to produce opinionated responses.  I chose a few key factors in each category, and ended up with this:




Choreographic Rehearsal Process
I worked with the performers twice a week for five weeks.  We used some improvisation and experimentation, but largely I walked in with specific work to set on them.  I shared the factors we were testing with each piece and used them as a basis for coaching their performance.  I chose roles for each performer within the work that I felt were natural for them to perform; I also structured it so the audience saw each of the four dancers for a roughly equal amount of stage time. 

I also think it’s worth noting up front how much time and attention we took to remove personal bias from each piece.  I took care not to place judgment on the pieces I created, even though I loved some of them and really disliked some of them; I gave each work a specific intention in terms of focus in the face and body, facial expression, attack of the movement, etc.  We wanted to present, as much as possible, an equally committed set of sixteen pieces to each audience. 

I believe a huge part of the success of our first showing was due to the dancers and how willingly they jumped in and engaged with the process.  A number of audience members noted how much the dancers seemed to enjoy working together, which is basically a dream come true.  They are individually wonderful artists and it was a treat to have them all together!

The Results
you know i'm serious because i'm wearing a blazer.
Now that you have the history, it’s time for the exciting part… results!  We’ll start with what the final regression was for each showing as is, and then I’ll present some other slices of analysis.

Friday Showing 1
The first showing on Friday had eight audience members (starting nice and easy), with a median age of 29.  This crowd found none of the main effects to be statistically significant; however, there were four second order interactions that were significant: 

AG: classical shapes and syncopated movement or unfamiliar dance movements and arrhythmic musicality
BC: many dancers dancing to pop music or a solo/duet to non-pop music
DE: full-body big movements in unison or isolated movements in the dancers’ own time
FH: fast music and dancers interacting with one another or slow music and dancers in their own world

Now, this allowed us to create a final piece of movement to present on that evening, but it doesn’t really give us that much information.  Because of our alias setup (remember, factor level E = BCD, level F = ACD, G = ABC, H = ABD), main effects of factors don’t overlap with 2nd-order interactions, but 2nd-order interactions do overlap with each other.  And while it’s exciting to think that there are four interactions that are significant, the math deflates us:

AG = A(ABC) = BC
DE = D(BCD) = BC
FH = ACD(ABD) = BC

… so they all overlap.  We can’t really tell what’s significant here.

It’s also worth noting the other alias-generated overlaps:

CD = GH = BE = AF
FG = CE = BD = AH
EH = BG = DF = AC
EG = CF = BH = AD
EF = DH = CG = AB
DG = CH = BF = AE
Whenever one of these shows up, all of them show up.  (This happened every time, dang it.)
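If you'd rather not do the letter algebra by hand, the overlaps can be checked mechanically. The sketch below assumes the usual rule for two-level designs: any letter multiplied by itself cancels, so A(ABC) = BC. It rewrites every 2nd-order interaction in terms of A through D and groups the ones that land on the same word. `to_base` and `alias_groups` are illustrative names, not part of any library.

```python
from itertools import combinations

# The aliases used to build the design.
GENERATORS = {"E": "BCD", "F": "ACD", "G": "ABC", "H": "ABD"}


def to_base(word):
    """Rewrite an interaction in terms of A-D; repeated letters cancel."""
    letters = set()
    for letter in word:
        # Symmetric difference drops any letter that appears twice.
        letters ^= set(GENERATORS.get(letter, letter))
    return "".join(sorted(letters))


alias_groups = {}
for pair in combinations("ABCDEFGH", 2):
    name = "".join(pair)
    alias_groups.setdefault(to_base(name), []).append(name)

for base_word, members in sorted(alias_groups.items()):
    print(base_word, "<-", " = ".join(members))
```

The 28 possible 2nd-order interactions collapse into seven groups of four, matching the AG = BC = DE = FH group and the six groups listed above.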

Friday Showing 2
The second showing on Friday had thirteen audience members, median age 31.  This crowd, too, had our same four significant 2nd-order interactions… However, they also had a significant main effect – they enjoyed movement in unison and/or canon (factor E).

Saturday Showing
On Saturday, our sample size was a bit larger at twenty-two responses, and slightly younger, too, with a median age of 27.  They were also a little more consistent with what they liked, and had four significant main effects:  they liked unfamiliar/gestural dance movements, they liked pieces to pop music, they enjoyed unison dancing, and they liked syncopated movements.  Of those effects, pop music and unison dancing had the greatest positive effect on the piece’s score.   

Since each showing was pretty close to the same (the same dancers, wearing the same clothing, doing the same movements in the same timing in the same order), we're gonna lump the data together to one superset.

Data Totals
Pooling all of the demographic data together yields a result that’s almost the opposite of my hypothesis; not only were there statistically significant factors, almost all of the factors were significant.  We start with null factor β(0) = 6.15, which corresponds to the overall mean of all the responses.  Then, the significant factors have a certain effect on that score, either raising or lowering it based on what the audience likes. Mathematically, it looked like this:

OUTPUT = 6.15 – 0.328(A) + 0.402(C) + 0.534(E) – 0.216(H) + 0.280(AB) + 0.297(AC) + 0.310(AD) – 0.221(AE) + 0.229(AF) + 0.573(AG) - 0.254(AH) + 0.573(BC) - 0.254(BD) + 0.229(BE) – 0.221(BF) +0.297(BG) + 0.310(BH) + 0.229(CD) – 0.254(CE) + 0.310(CF) + 0.280(CG) – 0.221(CH) + 0.573(DE) +0.297(DF) – 0.221(DG) + 0.280(DH) + 0.280(EF) + 0.310(EG) + 0.297(EH) – 0.254(FG) + 0.573(FH) +0.229(GH)

…Which is a lot to look at.  The standard error in this experiment meant that anything with a main effect larger than 0.4 (or smaller than -0.4) was significant.  If we bump that up and (kinda arbitrarily) say that we’re only looking at factors and interactions that are significant assuming a higher error of 0.8, we get a little more manageable equation:

OUTPUT = 6.15 +0.402(C) + 0.534(E) + 0.573(AG) + 0.573(BC) + 0.573(DE) + 0.573(FH)
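Here is a small Python sketch of that trimming step, using the coefficients from the long equation. One assumption to flag loudly: I'm treating an "effect" as twice its regression coefficient, which is the usual convention when levels are coded as +1 and -1. The post doesn't state this outright, but it is consistent with the numbers (every coefficient in the long equation clears a cutoff of 0.4 once doubled). The function name `significant_terms` is made up for this example.

```python
# Coefficients copied from the pooled regression equation.
COEFFICIENTS = {
    "A": -0.328, "C": 0.402, "E": 0.534, "H": -0.216,
    "AB": 0.280, "AC": 0.297, "AD": 0.310, "AE": -0.221,
    "AF": 0.229, "AG": 0.573, "AH": -0.254,
    "BC": 0.573, "BD": -0.254, "BE": 0.229, "BF": -0.221,
    "BG": 0.297, "BH": 0.310,
    "CD": 0.229, "CE": -0.254, "CF": 0.310, "CG": 0.280, "CH": -0.221,
    "DE": 0.573, "DF": 0.297, "DG": -0.221, "DH": 0.280,
    "EF": 0.280, "EG": 0.310, "EH": 0.297,
    "FG": -0.254, "FH": 0.573, "GH": 0.229,
}


def significant_terms(coefficients, effect_cutoff):
    """Keep terms whose effect (assumed to be 2 x coefficient) beats the cutoff."""
    return {
        term: coef
        for term, coef in coefficients.items()
        if abs(2 * coef) > effect_cutoff
    }


reduced = significant_terms(COEFFICIENTS, effect_cutoff=0.8)
print(sorted(reduced))
```

With the cutoff at 0.8, only C, E, and the AG/BC/DE/FH group survive, which is the shorter equation above.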

Much better.  And we know that those 2nd-order interactions overlap; but for fun, let’s assume that because C and E are significant, the interactions involving these factors are the ones driving the 2nd-order significance, too.  Qualitatively, this translates to:

Audience starts at an average score of 6.15.  If the music is pop-y, add 0.402, and if it’s not, subtract it.  If the dancers are in unison, add 0.534, and if they’re not in unison, subtract it.  If there are many dancers on stage and it’s to pop music, or there are only a few dancers on stage and it’s not pop music, add 0.573.  If the dancers are using full-body movement and moving in unison, or they’re using gestural movement that is not in unison, add 0.573. 
i promise we're still talking about art

Statistical Means by piece
Now that we have all this data, we get to nerd out with secondary analysis!  (Don’t all choreographers feel this way?! ….no?)  One of the most common questions I got from my dancers and from friends was “which one was the favorite?”  Here is each piece and its mean audience response:



(I grant this means a lot more to you if you actually saw the pieces.  If you didn’t, here are the means lined up with their factor levels.)




Normalized results
Another very common question, and one I grappled with quite a bit in the rehearsal process, is this: How does performance order affect audience responses?  And also, related, a scale of 1-10 is so subjective!  I asked audiences up front to be harsh with us—a response sheet full of 6’s and 8’s doesn’t help us.  I believe our audiences did attempt to show a range of responses, but many ranges were limited—some audience members never rated anything below a 6; some never rated anything above a 7.  One responder kept hers between 10 and 6; another was between 7 and 4. 

To combat this a little bit, I created normalized responses for each audience member, using this conversion for each data point:

y = 1 + (x-A)*(10-1)/(B-A)

where y is the normalized response, x is the audience member’s original response, A is the audience member’s personal minimum, and B is the audience member’s personal maximum.  The result is that each audience member’s highest-rated piece becomes a 10, and their lowest-rated piece becomes a 1, and their middle-rated piece ends up at a 5.5 (halfway between 1 and 10).  For example,  our middling audience member number 57 gave these responses:


6, 4, 7, 6, 4, 6, 5, 5, 7, 6, 6, 5, 7, 5, 7, 6

When we normalize them, they become:

7, 1, 10, 7, 1, 7, 4, 4, 10, 7, 7, 4, 10, 4, 10, 7

 
The highest-rated pieces (7s) became 10s, the lowest-rated pieces (4s) became 1s, and the rest of the numbers effectively “stretch out” to fit the new range.
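The conversion is easy to sketch in Python. This follows the formula above directly; the one thing I've added is a guess at what to do when someone gives every piece the same score (the formula would divide by zero), where this sketch just returns the midpoint 5.5 for everything. That fallback is my assumption, not something from the original analysis, and the function name `normalize` is illustrative.

```python
def normalize(responses, low=1, high=10):
    """Stretch one audience member's scores so min -> low and max -> high."""
    personal_min = min(responses)
    personal_max = max(responses)
    if personal_max == personal_min:
        # Assumption: a flat response sheet maps to the middle of the range.
        midpoint = (low + high) / 2
        return [midpoint for _ in responses]
    scale = (high - low) / (personal_max - personal_min)
    return [low + (x - personal_min) * scale for x in responses]


# Audience member 57's original responses, in the order given above.
member_57 = [6, 4, 7, 6, 4, 6, 5, 5, 7, 6, 6, 5, 7, 5, 7, 6]
print(normalize(member_57))
```

Running it on audience member 57's sheet gives back the normalized scores shown above (as decimals, so 7.0 rather than 7).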

How does this change our regression?  Not too much.  It does emphasize the effects of our statistically significant main effect factors, and, interestingly, it adds one more factor.

Our β(0) factor reduces to be a little closer to the middle of the range at 5.94.  (The audience liked us!)  The effects of factors A, C, and E now create a difference of almost half a point each, when before they were more subtle.  We can additionally eliminate the effect of factor H (the amount of interaction between the dancers), while factor G (level of syncopation) emerges significant.

Artistically, that makes a lot of sense to me – I could see dancer interaction as a 2nd-order interaction, maybe tied to non-unison movement, but I was a little surprised that it was significant by itself.  (Full disclosure, I also think dancer interaction was one of the least clear elements in my choreography, which surely didn’t help my results, while syncopation/musicality is something that matters a lot to me and is something I consider myself good at.  It's hard to hide that.)

Cross-sections by demographics
I also wanted to take a look at results for different demographic groups.  (I used the normalized data for these results.) I threw out two responses because they chose not to provide demographic data.

Gender
Do men have different preferences than women?

The results for male responses contain the same significant variables as the group as a whole; however, the greatest predictor of a higher score was the amount of unison movement there was in the piece, with a regression coefficient of 0.894 (if there’s a lot of unison, the men in the audience ranked the piece almost a whole point higher, on average).  The highest mean score was awarded to piece #9, a quirky, gestural solo.

The women keep significant factors A (prefer unfamiliar dance shapes), C (pop music) and E (unison).  We drop G (level of syncopation).  Each factor has a smoother effect – nothing singularly jumps out the way it did for the men.  (This, however, could also be due to the fact that the majority of our audience was women; 29 female responses vs. 12 male responses-- more data tends to be smoother.)  Interestingly, the highest mean score was a tie between #15 and #16, two group pieces with lots of unison to pop music—as predicted by the regression—while the male favorite #9 was merely average. 

Age
The vast majority of my audience was in their twenties.  (Like the performers and the choreographer, so … not shocking.)  For fun, we’ll split it at age 35. (GenX + Boomers vs the Millennials?)

For the under-35 crowd, the preferences are nearly identical to those of the group of women, and therefore also pretty similar to the group as a whole.  Mean favorite, a hearty 0.7 points ahead of the pack, was #16, easily the cheesiest number of the bunch.  Sigh, youth. (Kidding!)

For 35+, something kind of fascinating happens.  Almost all of the statistically significant factors fall away; we are left with a highly significant preference for unison/canon movement, and two sets of our overlapping 2nd-order interactions.  One is a preference for classical dance shapes to pop music and unfamiliar dance shapes to non-pop; the other is for many dancers moving with high syncopation and few dancers moving arrhythmically.  The older crowd had a mean favorite in #8, a group number with classical shapes to Billie Holiday.  Also interestingly, the mean favorite here was really set apart from the rest, with a mean score 1.2 points higher than the three second-favorites. 

Dance-Watching
This is the demographic comparison I was most interested in before this experiment began; my expectation was to find that audience members who are familiar with the art and see it regularly might have a more unusual preference, veering away from the “flash” of familiar pop music, unison group dancing with classical dance shapes. 

I was half right.  The group that self-identified as watching dance “very often” (there were 11 of them) fell beautifully into my original hypothesis – none of the factors and none of the 2nd-order interactions were significant.  This is the only group that broke down this way!  These experienced viewers had completely varied opinions that didn’t create any trends.

For the moderate dance-watchers who said they see dance “regularly” and “occasionally,” the data matched the group as a whole with one key difference… they dropped their preference for pop music.  I’m really not sure what this one means, but I think it’s very interesting!

The newbies, who identified themselves as seeing dance “rarely” or “this is my first time!”, had an extremely strong preference for unison movement (high levels of unison bumped their average response up a whole point).  This makes sense to me artistically – unison movement is organized and easier to take in and follow quickly, so it was likely pleasing to those who don’t watch much dance.  This group also had a very strong preference for pop music… again, I’m not really surprised by that.  They ranked highest the numbers that had both of those qualities, with sassy #16 in solid first place.

So there you have it – all in all, the short answer: my hypothesis is still inconclusive with this smaller data set.  

So, what’s next?
Analysis
Even with this limited data set, there are so many ways to analyze and glean results.  My next step (stay tuned!) is to remove the variables we know not to be significant and use the data set as additional replicates for a “smaller” design.  I would also design it so that 2nd-order interactions don't overlap with each other.  
PS. I made a version of my spreadsheet available to the public, so if any data nerds out there wanna give it a go with other analysis, I welcome you!

Further experimentation
As I mentioned above, even early on in the rehearsal process, there were some variables that I wanted to adjust—specifically, the lack of choreographic clarity between “isolations” and “syncopated movements,” and a clearer representation of interaction between dancers (more partnering, for example).  I'd add a factor to measure level of repetition/motif.  

I also want to explore more about the effect of music on an audience's perception, which could be an entire separate presentation!

There is additional power in simply gaining more data—I want to go through another rehearsal process to work out some kinks and then get this in front of larger audiences and more opinions!

I see opportunity to go bigger with this in three main settings:
Dance/Theatre presentation. With a larger audience, the typing-the-responses-into-a-spreadsheet method of calculation isn’t scalable; I would gather responses electronically through a voting system, or I would develop a free app for audience members to put on their smartphones and collect responses that way.  

Schools. I am passionate about connections between the arts and science, statistics, and data.  Blended projects like this are a wonderful opportunity for students at the high school or university level to work together with students with completely different strengths and interests.  What if this exposes more engineering/math/statistics students to artistic applications?  What if this illuminates the power of data analysis for artists?  I want to travel down that path! (If you are interested in collaboration, get at me.)

Online.  This question came up in almost every post-show talkback – what if the one-minute videos were online, and there was a massive collection of data from all over the country? The world?  I've hesitated on this one because I feel like the anonymity of the internet allows biases like “that girl is hot, I’ll vote for her” and “Sia fans unite!” to skew the voting; I also think there’s something imperative that is lost when dance isn’t seen live, and when there isn’t a human face in front of the audience asking them to be impartial and to look carefully at the movement.  I do see, however, the power of so much more data, and I’m interested in exploring it.



Unrelated Observations
  • There was one audience member who admitted that this show was his very first live dance experience.  I wonder what this will do to the rest of his dance-viewing life!
  • It’s fascinating how many people went for it with the feedback – not only did they score each piece, some audience members wrote me full out choreography notes for each piece, including things they thought were interesting, what they would have done instead, and how the dancers could step up their performance.  None of that affected my analysis, but it was informative for my process!

From audience response sheets:
“Probably not an awesome critic, but the music impacts me for sure.” – first-time dance watcher, 29
“[I look for] a depiction of emotion through movement.” – occasional dance watcher, 23
“Something that moves me – gives me a gut reaction.” – occasional dance watcher, 27
“[I love] ugh, commercial trick-heavy dance! I’m a heathen!” – dancer, 26
“Teach me something/REVEAL something” – regular dance watcher, choreographer, 63
“I look for work to wake me up – something that transports me spiritually. Also, an appreciation for ‘time’ as an element in the art work.” – dancer, 46
“Character development, athleticism, musicality.” – very often dance watcher, choreographer, dancer, 39
“Correlation between emotion & music” – occasional dance watcher, 23
“To be exposed to new ideas/perspectives.” – dancer, 24
“I look for something or someone to inspire my own dancing.” – dancer/choreographer, 30
“I look for a clear movement vocabulary that helps me compare moments.” – non-dancer, regular dance watcher, 29
“A depiction of the way the music makes me feel.” – occasional dance watcher


Thanks for reading along! I'd love to hear your thoughts/reactions-- comment here or email me at info@jaemajoydance.com.

Much gratitude for all the photos on this post, by Travis Magee Dance Photography