Excel Kept Messing Up the Names of Genes, So Scientists Renamed Them
I love making spreadsheets. I like lining up little columns of numbers and writing formulas to do things to them. It’s halfway between coding and note-taking. I have sheets for accounts (obviously) but also for projects, holidays, and hobbies. There’s one for the contents of my loft. My New Year’s resolutions? They’re in a spreadsheet. Often, when I start thinking about something, I automatically open a sheet and structure my thoughts into rows and columns. If all you have is a spreadsheet, everything looks like a cell (to misquote Abraham Maslow).
Use Excel for any length of time and you become familiar with its foibles. Type in a phone number and, if you’re unlucky, it’ll turn it into something like 8.E+09. Best-case scenario, you’ll lose the leading 0. Sometimes numbers get turned into dates. Sometimes dates get turned into numbers. I’ve got used to seeing #N/A.
These things are annoying, but you get used to them. However, if you’re a geneticist, problems like these plague your industry. Typing most genes into Excel isn’t a problem. “Myosin regulatory light chain interacting protein” (shortened to MYLIP) is fine, but type in “Membrane-associated ring-CH-type finger 1” (shortened to MARCH1) and Excel recognizes it as a date and “helpfully” converts it to March 1, 2020.
This tickles me. It’s the sort of weird edge case I find amusing. When the first Excel software engineer wrote the feature to scan text and convert certain values to dates, who would have thought that one day that would mess up scientific research documents? I also have a sense of relief that I’m not the only one who has to battle Excel. But this gene formatting, more than an amusing quirk, is actually a surprisingly big issue. “A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions,” scientists wrote in a study four years ago. Indeed, they have been writing about the issues Excel causes them since 2004. This delightfully quirky oddity has been messing up genomics journals for two decades.
That was until a few weeks ago, when the HUGO Gene Nomenclature Committee (HGNC) decided to rename the problematic genes so that they didn’t get converted into dates in Excel. MARCH1 becomes MARCHF1, SEPT1 becomes SEPTIN1, and so on. Put another way: Geneticists got so annoyed that Excel messed up their data that they changed the official scientific names to make them more Excel-friendly.
There’s something Kafkaesque about this. The sublime comes crashingly into contact with the banal: Important scientific work, meet Excel formatting. It’s strange to see our individual experiences mirrored on a global scale. You wouldn’t think genetics, as an industry, would have the same problems that I have, as an individual.
Online, after the initial lols and hahas, I’ve spotted three distinct responses to this. Firstly, the “learn to use Excel properly” response. That is: There’s nothing wrong with Excel, the scientists just aren’t using the tool correctly. If they want their data to be kept as it is without formatting, they should add an apostrophe before the value, or they should set the column type to text. It’s their own fault their data is getting messed up, and the whole fiasco is an indictment of the scientific world’s computer proficiency.
Secondly, there’s the “scientists shouldn’t be using Excel anyway” excuse: Excel is too basic a tool for scientists to be using. They should be using Matlab, or R, or some other advanced scripting language or application to handle their data, and then they wouldn’t be having this problem.
And finally, there are the Microsoft haters: This is all Microsoft’s fault for corrupting data. Not only should Excel stop doing this to these specific 27 genes that match dates, but it should also stop doing anything to any data at all, all of the time. Excel is a scourge on humanity and we should join forces with the scientists to mount an attack on Microsoft and get them to change their ways. In this response, the whole fiasco is an indictment of the poor state of software usability.
I have sympathy for all of these views, but the truth surely lies somewhere in the middle. When the Human Genome Organization (HUGO) made this change, it was because everyone was trapped between a rock and a hard place. Between scientists and lab assistants with wildly different computer skills. Between geneticists and software backward compatibility.
Many scientists will, of course, know about data formats and how to stop their data from being converted to dates. But still, accidents creep in. Tables get saved in CSV format, loaded into Excel again, and corrupted. Junior researchers forget. Something will happen that will cause a problem to creep back in. “It’s really, really annoying,” one geneticist told The Verge. It’s the data formatting that broke the researcher’s back.
For Microsoft, this is a strange edge case. These 27 genes just coincidentally match strings that could be read as dates. And to be fair to Microsoft: The names of the months came first. (Indeed, when Excel was first written these genes hadn’t been named.) Perhaps there is a world where this issue gained publicity and Microsoft released a new version of Excel with the date-parsing code changed to explicitly avoid converting these to dates. But that’s fiddly and complicated and even if Microsoft made that update, it would take years to have any impact as universities around the world gradually renewed their Microsoft software enterprise agreements and updated to the latest version of Excel. More likely, if Microsoft had even been alerted to this issue, they would have just sent a link to the relevant KB article.
And so geneticists had to choose which side of Bernard Shaw’s famous witticism they fell on: whether they were the reasonable ones who adapted to the world or the unreasonable ones who persisted in trying to adapt the world to themselves. They adapted themselves.
There are interesting political points here about the relative powers of these two entities. Maybe a point about a sort of widespread, low-level incompetence on the part of humans when it comes to computers. Or about Excel itself. Nearly a decade ago, Joel Spolsky, a former Excel program manager at Microsoft, pointed out that “most Excel users never enter a formula. They use Excel when they need a table. The gridlines are the most important feature of Excel, not recalc.”
The criticism is focused on Microsoft because Excel has become the genericized trademark for spreadsheets. But the same problem occurs in Google Sheets, and so even if Microsoft did change Excel, the problem wouldn’t go away. For completeness’s sake, I also tried importing genes into Numbers, Apple’s spreadsheet software, and found it doesn’t reformat MARCH1 into a date. While this would be great for geneticists, I can’t help wondering if this lack of automatic format detection is one of the reasons Numbers isn’t more popular.
I’ve become fascinated by this whole debacle, as it seems to stand for something much larger: our own powerlessness, and even the powerlessness of whole industries, in the face of technology.
I find myself thinking about how software — limited and difficult to use, often unsuitable to the task, fragile — has spread around the world, pervading and invading every facet of every industry. You can’t get away from software. A computer on every desk, and in every house, yes, but also in every pocket, in every shop and office. A computer behind every action and thought. And we can no more change software to work for our industry than we can alter the shifting of the tides. Off-the-shelf apps are a force of nature. Research scientists work around software limitations in the same way sailors work around tidal charts.
I’ve downloaded spreadsheets of gene data, meaningless to me, just to play with and spot the errors. It’s a game of Where’s Waldo for incorrectly formatted genes. As I find myself philosophizing about this, I remember that the industry has to keep going. The whole thing is ludicrous of course, but the HGNC has made a sensible, pragmatic decision, which pleased geneticists, to combat what was, essentially, an unfortunate, if amusing, naming clash.
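For anyone who would rather not play the game by eye, the spotting can be roughed out in a few lines of Python. This is only a sketch under assumptions of my own: it supposes that a converted symbol ends up displayed in a day-month form such as "1-Mar" or "2-Sep" (the exact form depends on locale and cell formatting), and the names `DATE_LIKE` and `find_mangled` are made up for illustration. It is not the method used in the published scans.

```python
import re

# Three-letter month abbreviations as they tend to appear once a gene
# symbol has been turned into a date (an assumption; locale-dependent).
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Matches whole values such as "1-Mar" or "Mar-1", case-insensitively.
DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}-(?:{MONTHS})|(?:{MONTHS})-\d{{1,2}})$",
    re.IGNORECASE,
)


def find_mangled(symbols):
    """Return (row index, value) pairs for values that look like dates."""
    return [(i, s) for i, s in enumerate(symbols) if DATE_LIKE.match(s.strip())]


# A hypothetical gene-symbol column in which two entries were converted.
column = ["MYLIP", "1-Mar", "SEPTIN1", "2-Sep", "MARCHF1"]
print(find_mangled(column))
```

A check like this can only flag the damage. It cannot say which gene "1-Mar" used to be, which is part of why the problem is so irritating in practice.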
There’s a funny sort of epilogue to all of this. Scrolling through lists of genes, I’ve caught sight of some other names. And, sometimes, they are just weird. One gene is called “Sonic Hedgehog,” named partly after the video game character and the band Sonic Youth. Another is called “Bag of Marbles.” And there’s also Cheap Date, Buttonhead, and Dunce. There are lots of names like these. This all seems like a bit of a laugh until you’re the doctor trying to sensitively tell a parent their child has a serious health worry and you have to explain with solemnity they have a mutation in their One-Eyed Pinhead.
Before reading about these, my ignorance of genes led me to believe the names were carefully crafted by scientists, and it was something of an outrage that they had to be renamed for as trivial a reason as Excel. Now, it’s hard to believe that scientists are the adults in the room. Online, I’ve seen outrage about the Excel change, but I realize that has all been on behalf of geneticists, rather than from geneticists themselves, who, on the whole, seem relieved. Perhaps there’s another more human point here about our assumptions. I think one of the reasons this Excel story catches imaginations is because of the sanctity of science and our assumptions of scientists as logical, research-driven individuals. Rather than people who, like all of us, have a joke-around and, given half a chance, give genes silly names. And, like all of us, are just trying to do the best they can with the software they have.