Figure 1. First video call of the AIMOSxWISDOM research team.

AIMOS 2023 was both the first conference we presented OHM to and our first opportunity to generalize WISDOM, by peer-reviewing contributions to the AIMOS conference itself. In the months that followed, AIMOS members formed a research team and met online to discuss various issues, prepare the prototype, and send it out for open review (see the project board for details). The report below is a summary of this collaborative effort. 

Methods & Results

Materials

Prototype upgrades

We made several upgrades to the prototype, based on lessons from our Tiny OHM prototype. Most notably, we:

  • condensed individual surveys into a single spreadsheet so that all reviews and results can be seen together
  • developed a pre-review process to select a representative sample of diverse contributions
  • selected dimensions of interest based on popular response from the community
  • added a dynamically updating token count to reward reviewers as they completed reviews

Participants

After the conference, attendees were sent a survey form inviting them to (a) nominate contributions to the conference, (b) consent to having their contributions reviewed, (c) nominate values to be reflected in the study, and (d) self-nominate as a reviewer or research team member. Twenty-two people responded to our Contribution Survey form, including 10 who self-nominated as reviewers and 5 who elected to join the research team (Anna Finnane, Matt Ruby, Aidan Tan, Ginny Barbour, Cooper Smout).

Procedure

These methods will be presented in chronological order using the new WISDOM stages of Value, Record, Review, Recognise, Reward, and Respect.

0. Choose Values

Survey respondents nominated a range of values to be reflected in this research project. We visualised these responses as a semantic word cloud and the research team agreed to use the 5 most popular words as dimensions in this study: Generosity, Inclusivity, Transparency, Integrity, and Creativity. We also included Gratitude as the baseline dimension, since this was the only stable baseline dimension in our Tiny OHM Prototype and including it would enable comparison between datasets.

Figure 2. Crowd-generated word cloud of rating dimensions

1. Record

Survey respondents nominated a diverse range of conference contributions, including both visible contributions (e.g., talks, workshops) and less visible contributions that would normally go unnoticed (e.g., administration tasks). One person was nominated who “Made a really big effort during discussion breakouts to attend the group which had the lowest turnout, regardless of their own interests”, demonstrating the inclusive nature of WISDOM.

The research team then completed a pre-review process to select a subset of 24 contributions for review. CS predicted the value of each contribution in Gratitude Units. CS & AF added tags to each contribution and created histograms to show the diversity of the selected contributions. CS then selected a range of contributions, aiming to create a normal distribution across Gratitude scores and a representative sample across contributors and tags (see ‘preprocessing’ tab).

2. Review

Ten survey respondents elected to become reviewers, completing a total of 936 pairwise comparison review sets (X regular reviews, X meta-reviews). Reviewers completed between 10 and 215 reviews each. Regular reviewers voted between pairs of conference contributions on the following 6 questions:

  • 1-5: “Select the contribution that demonstrates more… Generosity / Inclusivity / Transparency / Integrity / Creativity”
  • 6: “Overall, which are you more grateful for?”

Figure 3. Sample review, comparing two conference contributions on all six dimensions.

Meta-reviewers answered just one question when comparing event contributions to review contributions: “Overall, which are you more grateful for?”

Figure 4. Sample meta-review, comparing one conference contribution against one review contribution on the Gratitude dimension.

3. Recognise

Atomic Reviews were awarded one Gratitude Unit each (blue dots) and meta-review votes were used to fit a function to convert votes into Gratitude Units. Gratitude units were then calculated for all other contributions by interpolating this function (red dots).

Figure 5. Normalisation procedure.

Scores in the Gratitude dimension were then used to normalise the other five dimensions (e.g., Gratitude Units * Generosity votes / Gratitude votes), generating a multi-dimensional representation of each contribution on the six dimensions of interest.
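
The normalisation above can be sketched in a few lines of Python. This is a minimal illustration of the stated formula (Gratitude Units * dimension votes / Gratitude votes), not the project's actual spreadsheet logic. The function name, the example numbers, and the choice to return 0.0 when a contribution has no Gratitude votes are all assumptions made for illustration.

```python
def normalise_dimension(gratitude_units: float,
                        dimension_votes: int,
                        gratitude_votes: int) -> float:
    """Scale a dimension's vote count into Gratitude Units.

    Implements: Gratitude Units * dimension votes / Gratitude votes.
    Returns 0.0 when there are no Gratitude votes (an assumption; the
    report does not say how this case was handled).
    """
    if gratitude_votes == 0:
        return 0.0
    return gratitude_units * dimension_votes / gratitude_votes


# Hypothetical contribution worth 12 Gratitude Units that won 8 Gratitude votes.
votes = {"Generosity": 6, "Inclusivity": 4, "Transparency": 8,
         "Integrity": 2, "Creativity": 10}
profile = {dim: normalise_dimension(12.0, n, 8) for dim, n in votes.items()}
```

Under these made-up numbers, a dimension that won as many votes as Gratitude (Transparency) ends up with the same score as the contribution's Gratitude Units, and the other dimensions scale up or down from there.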

The Polar Plot below shows each contribution as a coloured line, with scores on each dimension represented as a point on the relevant axis. You can see that “Created website” scored the highest on all dimensions, followed closely by “Create AIMOS-WISDOM prototype” (yes, we recognized our own work in running this study!). Conversely, “Attempting to balance attendance at breakout groups”, mentioned above, was one of the lowest scoring contributions, but that person might still appreciate being acknowledged and recognized for their efforts (see the data in more detail here).

Figure 6. Multidimensional representation of each contribution’s qualities on all six dimensions of interest. 

4. Comprehensive value accounting

Next, we extrapolated scores for all contributions. We identified 163 contributions, including all participant nominations, registration fees, administration tasks, financial payments, scheduled presentations, and contributions to the WISDOM prototype itself. Gratitude for reviewed contributions was averaged within each category to give an average rate of gratitude per minute or dollar, depending on the type of contribution.

Figure 7. Gratitude per minute for each contribution category. As expected, lightning talks have the highest rate, reflecting gratitude for the work ‘behind the scenes’ of the 5-minute talk itself.

Figure 8. Gratitude per dollar for financial contributions. The non-zero y-intercept for Registration Fees suggests that this metric might be capturing some variance not related to the financial contribution itself, for example gratitude for someone attending irrespective of how much they paid.
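
As a rough sketch of the rate calculation described above, the Python below averages gratitude per minute within each category and uses that rate to estimate a score for an unreviewed contribution. The data, the category names, and the function names are hypothetical, and the sketch assumes the category rate is the mean of per-contribution rates; the report does not specify whether that or a pooled total was used.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical reviewed contributions: (category, gratitude units, minutes).
reviewed = [
    ("lightning talk", 10.0, 5),
    ("lightning talk", 20.0, 5),
    ("workshop", 30.0, 60),
    ("workshop", 60.0, 60),
]


def rates_by_category(items):
    """Average gratitude per minute within each category.

    Assumption: the category rate is the mean of per-contribution rates.
    """
    per_item = defaultdict(list)
    for category, gratitude, minutes in items:
        per_item[category].append(gratitude / minutes)
    return {category: mean(rates) for category, rates in per_item.items()}


def extrapolate(category, minutes, rates):
    """Estimate gratitude for an unreviewed contribution from its category rate."""
    return rates[category] * minutes


rates = rates_by_category(reviewed)
estimate = extrapolate("workshop", 90, rates)  # an unreviewed 90-minute workshop
```

The same shape would work for financial contributions by swapping minutes for dollars.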

These averages were then applied to the broader set of contributions to produce a gratitude score for every single contribution. Contribution scores were then aggregated to produce the gross amount of gratitude per contributor (light blue bars below). We calculated the average gratitude per contributor and subtracted that from each person’s total score, producing the net gratitude for each person (dark blue bars below). Because these net scores sum to zero across all participants, a positive score indicates someone who contributed more than their fair share of effort and a negative score someone who contributed less. The AIMOS president, Adrian Barnett, contributed an order of magnitude more than anyone else, approaching the entire budgetary contribution by AIMOS.

Figure 9. Gratitude per contributor, showing only those participants who consented and provided a username via our form.
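
The gross-to-net step can be illustrated with a short Python sketch. The contributor names and scores are invented, and the function name is an assumption; only the arithmetic (subtract the average gross score from each person's gross score) comes from the description above.

```python
from statistics import mean

# Hypothetical gross gratitude per contributor (sum of their contribution scores).
gross = {"alice": 120.0, "bob": 45.0, "carol": 15.0}


def net_gratitude(gross_scores):
    """Subtract the average gross score from each contributor's gross score."""
    average = mean(list(gross_scores.values()))
    return {name: score - average for name, score in gross_scores.items()}


net = net_gratitude(gross)
# Net scores sum to zero (up to floating-point rounding) by construction,
# because the same average is subtracted from every contributor.
```

With these example numbers the average is 60, so the net scores are 60, -15, and -45.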

5. Net gratitude

These data make it possible to compare the net gratitude for various contributions, after subtracting the gratitude for their inputs (e.g., financial cost). As you can see below, there was a net surplus of gratitude for the administration staff and a net deficit for the ECR breakfast and plenary talks. One possible interpretation is that these were not worthwhile purchases; another is that we failed to capture the full range of value added (e.g., simply having plenary speakers attend is likely a big drawcard for many attendees).

6. Contributor profiles

Contribution scores can also be filtered to produce rich representations of each person’s contribution profile.

7. Automated rewards

Contributors could also be rewarded directly for their contributions, based on the value provided. Funders could target particular categories and distribute funds just to those people who have contributed in that category. For example, if someone wanted to support the WISDOM project itself, they could make a donation and request that it be divided between all contributions tagged with ‘WISDOM’ (developing the prototype, nominating contributions, performing reviews, etc.). In turn, this could directly incentivise people to contribute to the WISDOM project, creating a virtuous cycle where contributors get rewarded for their valuable contributions, funders get recognition for their donations, and everyone collectively benefits through the development of a fairer and more efficient recognition and reward system.
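
One way such a targeted distribution might work is a pro-rata split, sketched below in Python. This is purely illustrative: the proportional rule, the function name, and the record keys ('contributor', 'gratitude', 'tags') are assumptions, and the report does not describe an implemented payout mechanism.

```python
def split_donation(amount, contributions, tag):
    """Divide a donation among contributions carrying `tag`,
    in proportion to each contribution's gratitude score.

    `contributions` is a list of dicts with hypothetical keys
    'contributor', 'gratitude', and 'tags'.
    """
    eligible = [c for c in contributions
                if tag in c["tags"] and c["gratitude"] > 0]
    total = sum(c["gratitude"] for c in eligible)
    payouts = {}
    if total == 0:
        return payouts  # no eligible contributions to distribute against
    for c in eligible:
        share = amount * c["gratitude"] / total
        # A contributor with several tagged contributions accumulates shares.
        payouts[c["contributor"]] = payouts.get(c["contributor"], 0.0) + share
    return payouts


contributions = [
    {"contributor": "alice", "gratitude": 30.0, "tags": {"WISDOM", "review"}},
    {"contributor": "bob", "gratitude": 10.0, "tags": {"WISDOM"}},
    {"contributor": "carol", "gratitude": 50.0, "tags": {"talk"}},
]
payouts = split_donation(100.0, contributions, "WISDOM")
```

In this example a 100-unit donation tagged ‘WISDOM’ is split 75 to 25 between the two tagged contributions, and the untagged talk receives nothing.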

8. Reviewer reliability

Finally, we can check the integrity of reviewers by looking at test-retest reliability within each reviewer. Each pair of contributions was delivered in two orders: A vs B and B vs A. Pairings were delivered randomly, but some reviewers completed the same comparison twice with the order reversed, giving us a measure of test-retest reliability. In the future, new analyses and greater control over pairing delivery could produce a richer picture of reviewer reliability. Reviewer rewards might also be moderated based on reliability and other metrics, incentivising reviewers to do a good job and not game the system.

Figure. Test-retest reliability for the four most prolific reviewers.
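
A simple version of this reliability check is sketched below in Python, assuming one vote per comparison on a single dimension (e.g., Gratitude). The record layout, reviewer and contribution labels, and function name are hypothetical, and "reliability" here is just the share of reversed-order repeats that produced the same winner, which may differ from the measure used in the figure.

```python
from collections import defaultdict

# Hypothetical votes: (reviewer, first shown, second shown, winner).
votes = [
    ("r1", "A", "B", "A"),
    ("r1", "B", "A", "A"),  # same pair, reversed order, same winner
    ("r1", "C", "D", "C"),
    ("r1", "D", "C", "D"),  # same pair, reversed order, different winner
    ("r2", "A", "B", "B"),  # pair seen once only, so not a repeat
]


def retest_agreement(votes):
    """Per reviewer, the share of repeated pairs that got the same winner.

    Only pairs seen in both presentation orders count as repeats;
    reviewers with no repeats are left out of the result.
    """
    # reviewer -> unordered pair -> presentation order -> winner
    winners = defaultdict(lambda: defaultdict(dict))
    for reviewer, first, second, winner in votes:
        winners[reviewer][frozenset((first, second))][(first, second)] = winner

    result = {}
    for reviewer, pairs in winners.items():
        repeats = [orders for orders in pairs.values() if len(orders) == 2]
        if repeats:
            consistent = sum(len(set(orders.values())) == 1 for orders in repeats)
            result[reviewer] = consistent / len(repeats)
    return result


reliability = retest_agreement(votes)
```

Here reviewer r1 was consistent on one of two repeated pairs, and r2 has no repeats and therefore no score.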

Discussion

Autonomous Valuation

The key point to note here is that no one person was in charge of generating the ratings. Everyone was invited to participate in an inclusive review protocol that required minimal expertise or time to complete, and all reviewers were directly and immediately rewarded for their valuable review data. The review data and recognition scores become a public good that users and the community could draw on in a myriad of ways.

Incentivizing contributions

Already, we have the beginnings of an incentive mechanism: future contributors could use this dataset to predict how their contributions will be received by the community. Second, we have the beginnings of a replicable template, whereby future conference organizers could use these data to plan and assign tasks to the organizing team, or let the crowd self-select tasks based on their own abilities. Third, we have the beginnings of a virtuous cycle, where contributors are directly rewarded for the value they provide, incentivising further high-value contributions.

Limitations

As with the Tiny OHM prototype, we restricted the review to a subset of all contributions, since reviewing all contributions would result in a combinatorial explosion of pairings under the fully balanced pairwise design.
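
To give a sense of scale, the short Python sketch below counts the pairings for the 24 reviewed contributions and for all 163 identified contributions. It assumes that "fully balanced" means every pair is shown in both orders (A vs B and B vs A); the function name and the `both_orders` parameter are illustrative only.

```python
from math import comb


def pairings(n_contributions: int, both_orders: bool = True) -> int:
    """Number of pairwise comparisons among n contributions.

    Unordered pairs: n choose 2. If each pair is shown in both orders
    (A vs B and B vs A), the count doubles.
    """
    unordered = comb(n_contributions, 2)
    return 2 * unordered if both_orders else unordered


subset = pairings(24)        # the 24 contributions selected for review
everything = pairings(163)   # all 163 identified contributions
```

Under this assumption, the reviewed subset yields 552 ordered pairings, while the full set would yield 26,406, each of which would carry up to six questions for regular reviews.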

Conclusion

The AIMOS prototype marked the first generalization of WISDOM outside of OHM and a proof of concept for its use in academia. We validated interest in the framework and demonstrated that diverse contributions to an academic conference can be rated. Future conference organisers might find this dataset useful for planning purposes or for recognizing contributors. Future efforts could expand on this prototype to explore other use cases, such as rating contributions to a paper, contributions to a research group, or contributions to an entire research field (e.g., papers, data, code). In all cases, it will be valuable to explore more advanced algorithms that can handle a larger set of contributions without an excessive burden of time on reviewers.

Work in progress

The AIMOS prototype is a dynamically evolving experiment, much like the WISDOM framework in general. Ongoing work will explore relationships between the dimensions and other research questions, along with Reward and Respect metrics for contributors and reviewers. If there are any research questions you think we should explore, we’d love to hear from you!