Price’s Protein Puzzle: 2023 update

One of the joys (?) of having been online for…quite some time now…is watching topics reappear every few years or so.

“What is the longest coherent word or phrase present in the amino acid sequence of a real protein?” – Dr. Caroline Bartman (@Caroline_Bartma), July 21, 2023 (https://twitter.com/Caroline_Bartma/status/1682453205492420630?ref_src=twsrc%5Etfw)

Yes, it’s Price’s Protein Puzzle, which I last wrote about back in 2019 (https://nsaunders.wordpress.com/2019/01/30/prices-protein-puzzle-2019-update/). The good news is that my code still runs (https://github.com/neilfws/utils4bioinformatics/tree/master/uniprot_words), so I’ve updated the results (https://github.com/neilfws/utils4bioinformatics/tree/master/uniprot_words/data) of an English word search versus the UniProt Reviewed (Swiss-Prot) protein database. Just for fun I threw in a few other languages too.
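
For readers who want the general idea without opening the repository, here is a minimal Python sketch of a word search against protein sequences. It is not the code linked above: the function name `find_words`, the `min_len` parameter and the toy sequences are all made up for illustration. It groups the word list by length and slides a window of each length along every sequence.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_words(
    records: Iterable[Tuple[str, str]],
    words: Iterable[str],
    min_len: int = 8,
) -> List[Tuple[str, str, int]]:
    """Return (word, record_id, 0-based offset) for every word found in a sequence."""
    # Group the vocabulary by word length so each window is a single set lookup.
    by_len: Dict[int, Set[str]] = {}
    for word in words:
        word = word.strip().upper()
        if len(word) >= min_len:
            by_len.setdefault(len(word), set()).add(word)

    hits = []
    for record_id, seq in records:
        seq = seq.upper()
        for k, vocab in by_len.items():
            # Slide a window of length k along the sequence.
            for i in range(len(seq) - k + 1):
                window = seq[i:i + k]
                if window in vocab:
                    hits.append((window, record_id, i))
    return hits


# Made-up sequences for illustration; these are not real UniProt entries.
toy_records = [("toy1", "MKTAGGRAVATESLL"), ("toy2", "MKKLLVSTALLAREQ")]
print(find_words(toy_records, ["aggravates", "stallare", "cat"]))
```

Checking windows against a set keeps the work roughly proportional to total sequence length times the number of distinct word lengths, rather than sequence count times word count.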

So what’s new?

In terms of English word matches: not much. Some new proteins but no new 9-letter words. The Twitter thread, above, contains an interesting reply about an approach using generative AI:

“I found PARALEGALS in addition to AGGRAVATES for 10 with the algorithm: (1) Ask chat-gpt for a list of x-letter words likely to be found in a coding sequence (2) blastp and check it’s not an obvious artifact (3) check chat-gpt didn’t hallucinate a word (TRAVELGAGS?)” – Zach Hensel (@alchemytoday), July 22, 2023 (https://twitter.com/alchemytoday/status/1682856665404657664?ref_src=twsrc%5Etfw)

Note that the match is to the NR protein database. I’d like to work with this locally but I believe it’s in the order of 150 GB now, so it would take some work to optimise.
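
One way to cope with a database of that size is to stream it one record at a time instead of loading it into memory. The sketch below assumes a FASTA file, optionally gzip-compressed; `stream_fasta` and `scan` are hypothetical helper names, and nothing here has been run against the real NR database.

```python
import gzip
from typing import Iterable, Iterator, Tuple


def stream_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) one record at a time from a FASTA file.

    Paths ending in .gz are read through gzip. Only the current record
    is held in memory.
    """
    opener = gzip.open if path.endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def scan(path: str, words: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (word, header) for each candidate word present in a record."""
    wanted = {w.strip().upper() for w in words}
    for header, seq in stream_fasta(path):
        seq = seq.upper()
        for word in wanted:
            if word in seq:
                yield word, header
```

Sequence lines are joined before searching, so a word that straddles a line break in the file is still found. This suits a short list of candidate words; a full dictionary would need something smarter than one substring test per word.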

Other languages are somewhat constrained by (1) the quality of the word lists that I could find online (on GitHub) and (2) for some languages, the presence of characters not found in the English alphabet, which reduces the viable word list even further. That said, there are a few fun matches. I am not a linguist so I’m relying on Google Translate and other online translators here.
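
To make the second constraint concrete, here is a small Python sketch of the kind of filter involved. It assumes the usable alphabet is the 20 standard amino acid one-letter codes (so no B, J, O, U, X or Z, and no accented letters); the function name `usable_words` and the example words are illustrative only.

```python
from typing import Iterable, List

# One-letter codes of the 20 standard amino acids (an assumption of this sketch).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def usable_words(words: Iterable[str], min_len: int = 8) -> List[str]:
    """Keep words of at least min_len letters spelled only with amino acid codes."""
    kept = set()
    for word in words:
        word = word.strip().upper()
        if len(word) >= min_len and set(word) <= AMINO_ACIDS:
            kept.add(word)
    return sorted(kept)


candidates = ["gangarilla", "stallare", "smörgåsbord", "paralegals", "jukeboxes", "cat"]
print(usable_words(candidates))
```

A word containing ö or å is dropped outright, which is why a language with many such letters ends up with a much shorter list.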

In addition to the previously-noted 10-letter Italian word ANNIDAVATE we have:

• sp|B2II34|KATG_BEII9 – GANGARILLA (Spanish) – a company of strolling players

• sp|P40069|IMB4_YEAST – FERRAILLAI (French) – je ferraillai (I scrapped)

All of the languages have 9-letter matches except Swedish (maximum 8 letters, for example STALLARE – stabler). Spanish was a rich source of hits (452 distinct words > 7 letters), although that’s probably due in large part to the large size of the Spanish word list used. Swedish was the lowest (26 distinct words > 7 letters), perhaps due to the large number of unusable words with non-amino acid alphabet characters.

There are 9 hits to the start of a protein. Some of these are:

• sp|O49997|1433E_TOBAC – MAESTREEN (Spanish) – you direct

• sp|Q3V0Q6|SPAG8_MOUSE – METTESTE (Italian) – you put

• sp|Q49135|FCHA_METEA – MAGNETIET (Dutch) – magnetite
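
Hits at the start of a protein are a simple special case: the word has to be a prefix of the sequence, and the three examples above all begin with M. A minimal Python sketch, with a made-up function name `start_hits` and toy sequences rather than real entries:

```python
from typing import Iterable, List, Tuple


def start_hits(
    records: Iterable[Tuple[str, str]],
    words: Iterable[str],
    min_len: int = 8,
) -> List[Tuple[str, str]]:
    """Return (word, record_id) for each word that a sequence starts with."""
    vocab = sorted(
        {w.strip().upper() for w in words if len(w.strip()) >= min_len}
    )
    hits = []
    for record_id, seq in records:
        seq = seq.upper()
        for word in vocab:
            # Only a match at offset 0 counts as a hit to the start.
            if seq.startswith(word):
                hits.append((word, record_id))
    return hits


# Made-up sequences: the word starts toy1 but sits in the middle of toy2.
toy = [("toy1", "MAGNETIETKLV"), ("toy2", "KLMAGNETIET")]
print(start_hits(toy, ["magnetiet", "metteste"]))
```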

And so ends the update for another year.
