ALEXANDRIA AI

Our Viewpoint on the State of Academia

Alexandria Team

Academic research is, by its nature, stuck in an inefficient Nash equilibrium.

From a game-theory perspective, every academic researcher has an incentive to publish first and avoid being 'scooped,' but no incentive to help others publish faster (unless they are altruistic and believe in open science, which a small proportion do).

We can break it down further. The first people to publish gain most of the credit (Nobel Prizes, Nature publications, and so on). This incentive structure is so extreme that people sometimes fabricate or misrepresent data to get there faster (see the STAP stem cell controversy).

What if you generate results along the way that aren't really meaningful? This is especially prominent in experimental and biological research, where replication cycles are long and expensive to run (reagents and cell lines do not come cheap); it is generally less of an issue in fields like CS, where open science and reproducibility are stronger norms. It is entirely possible to spend months on a project, have it not pan out, and be forced to scrap the whole thing.

But how many others scrapped that same project before you, without you ever knowing? How many will work on it after you because they never learned it failed?

What do you do then? You could spend the time to open-source your work or try to publish it, but publishing null results is (a) unlikely to succeed and (b) like "getting the crap beat out of you for 6 months" (a quote from an unnamed Harvard professor).

From a game-theory standpoint, putting your failed work out into the world would benefit the research community as a whole by preventing others from repeating the same mistake, but you gain nothing while sacrificing your time and taking on reputational risk. You would likewise benefit from reading other people's failed research if they released it, but you have no incentive to release your own and help a competitor move faster.

This is a classic prisoner's dilemma: both parties would be better off if everyone published their negative results, but neither has an incentive to publish their own, and so we arrive at an inefficient Nash equilibrium.
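
To make the trap concrete, here is a minimal sketch in Python with hypothetical payoff values (the numbers are illustrative, not measured): whatever the other lab does, withholding negative results is the better response, even though mutual publication leaves both labs better off.

```python
# Prisoner's dilemma over publishing negative results.
# Payoff values are hypothetical, chosen only to satisfy the standard
# dilemma ordering: free-ride > mutual publication > status quo > sucker.
payoffs = {
    ("publish",  "publish"):  3,  # both labs avoid each other's dead ends
    ("publish",  "withhold"): 0,  # I pay the cost; my rival free-rides
    ("withhold", "publish"):  5,  # I free-ride on their negative results
    ("withhold", "withhold"): 1,  # status quo: everyone repeats failures
}

strategies = ("publish", "withhold")
for theirs in strategies:
    best = max(strategies, key=lambda mine: payoffs[(mine, theirs)])
    print(f"If the other lab plays {theirs!r}, my best response is {best!r}")

# Both lines print 'withhold': withholding dominates, so
# (withhold, withhold) is the unique Nash equilibrium, even though
# (publish, publish) pays both players more (3 > 1).
```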

What if information flowed fast enough that this didn't matter? The other key issue is that information transfer is deeply asymmetric (a classic Crawford-Sobel cheap-talk setting), and the credibility of signals is difficult to assess. All of the unpublished information is stored locally in notebooks, physically within labs, and in researchers' heads. Some of it leaks out when you speak with colleagues, go to conferences, or help someone else (if they know to ask you in the first place), but for the most part this unpublished information is decentralized and unrecorded, often lost to history.
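
For the formally inclined, the Crawford-Sobel result makes the credibility problem precise. In the textbook version of the model (our assumptions here: the state is uniform on [0, 1], both sides have quadratic losses, and the sender's preferences carry a bias b > 0), every equilibrium is partitional, and the number of credible messages shrinks as the bias grows. A minimal sketch:

```python
import math

def max_messages(bias: float) -> int:
    """Largest number N of distinct credible messages in the
    Crawford-Sobel model: the biggest N with 2*N*(N-1)*bias < 1."""
    return math.ceil(-0.5 + math.sqrt(0.25 + 1.0 / (2.0 * bias)))

def cutoffs(bias: float, n: int) -> list[float]:
    """Equilibrium partition boundaries a_0..a_n on [0, 1]; each cell is
    4*bias wider than the last (closed form: a_i = i/n + 2*bias*i*(i - n))."""
    return [i / n + 2 * bias * i * (i - n) for i in range(n + 1)]

for b in (0.005, 0.05, 0.25):
    n = max_messages(b)
    print(f"bias={b}: at most {n} credible message(s), cutoffs "
          f"{[round(a, 3) for a in cutoffs(b, n)]}")

# Prints 10 messages at bias 0.005, 3 at 0.05, and 1 at 0.25: once
# incentives diverge enough, only uninformative babbling survives.
```

The same logic applies informally to hallway science: the more a speaker's incentives diverge from the listener's, the less information a rational listener can extract from what they say.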

Information flow isn't just slow; the volume of information is also overwhelming. Many scientists wish they could read more, but with so much literature being published, they lack the bandwidth for anything not directly relevant to their work. Getting the right information to the right people has become a challenge in itself.

These questions have never been Google-able. Often you have a very specific question in a very niche area: for example, how does metabolic engineering change X affect protein Y? You can run a literature review over published or open-sourced data, or ask around in your lab and network, but beyond that your options are limited. Most science happens in someone's head; a small fraction of that makes it out into the world, and an even smaller sliver makes it into a publication.

This problem extends to local lab-level knowledge. Even protocols, the encoded knowledge of how people do things, are decentralized and spread from person to person. The most time-consuming part of training a new lab member is often onboarding: showing them how things are done "around here." When a graduate student leaves (which they almost always do, by design of the academic system), only a fraction of their encoded knowledge stays behind in scattered notebooks, drives, and published papers; the rest leaves with them. Repeat this process a hundred times in an organization, and things become increasingly convoluted and decentralized.

The bottleneck of a single laboratory is often the professor who runs it, serving as the principal investigator. Over decades, they are the one consistent fixture who mentally indexes and reasons through the entirety of the lab's history, spending an egregious amount of time trying to pass this on to students to make them productive researchers. Academia is mentally draining enough already, yet PIs are tasked with the menial job of repeating and transferring information, when one body of knowledge already exists in the published literature and an even larger one is stored in the heads of dozens of lab members. This is a classic hair-on-fire use case.

Scientific knowledge and data are being lost: siloed, decentralized, and forgotten. Drawing upon them becomes a question of knowing where to look.

The internet greatly advanced science by making access to information faster. Publications moved online, got indexed, and became searchable and retrievable in a few clicks (compared to the previous generation of printed physical journals). We access information faster, and that has enabled more progress.

But what will enable the next level of speed? We have done a solid job with published research, and it forms the backbone of our foundational scientific knowledge. Publications, however, tell a story: a carefully manicured, beautiful analysis that loses much of the knowledge and experience gained along the way.

Serendipity is constrained. Reproducibility is now a crisis in biology, producing enormous waste and misleading conclusions.

Unpublished data is an unconquered wild west.

How do you get the right, context-aware answers? How can we index all of the scientific data generated in the world? And how do we do it within the game-theoretic incentives described above, so that societal benefit is maximized?

We believe the future lies in building a collective shared intelligence: not to replace researchers, but to equip them with the generations of learning that go undocumented.

Many AI-scientist companies do great work within the constraints of publicly available data; Futurehouse and Potato AI are both attempting this model and making interesting progress. None of them, however, has access to the granularity of data necessary to train true reasoning models in fields where failure rates run 80, 90, even 95 percent.

Reasoning does not transfer well between labs. Encoded knowledge, at the organizational and institutional scale, has never been adequately captured for researchers.

Pharma companies document aggressively so they can either commercialize their work (securing IP, developing drugs) or sell off assets (in bankruptcy, or when another party wants the data).

Academic reasoning, by contrast, is often doomed to live in a filing cabinet or to be scattered across decentralized records.

Our inability to index this data has cost us replicability, reasoning, and progress.

By indexing it, we want to build a shared intelligence system that applies both locally and globally. We hope to be predictive, to be meaningful, and to extract value from knowledge that already exists.