nf-core/proteinfamilies
Generation and updating of protein families
metagenomics, protein-families, proteomics
Version history
Added
- #124
  - Added new subworkflow `MERGE_FAMILIES` that can optionally merge similar (but not redundant) generated protein families. (by @vagkaratzas)
  - Added new functionality to the local module `IDENTIFY_REDUNDANT_FAMS`, which now also detects and outputs the identifiers of similar families that can optionally be merged downstream. These identifiers are written to “/remove_redundancy/<samplename>/similar_fam_ids.txt”, and the corresponding family pairwise similarity scores to “/remove_redundancy/<samplename>/similarities.csv”. (by @vagkaratzas)
  - Added new local module `POOL_SIMILAR_COMPONENTS` that generates family clusters from a family-similarity edgelist. (by @vagkaratzas)
  - Added new local module `MERGE_SEEDS` that merges seed alignments of similar families before restarting the family generation subworkflow. (by @vagkaratzas)
- #118
  - Added preprint citation to the repo. (by @vagkaratzas)
  - Added separate metro map files for dark and light browser modes. (by @vagkaratzas)
  - Added new local module `EXTRACT_FAMILY_MEMBERS`, which outputs a two-column TSV file containing the final family identifiers and their corresponding member sequence identifiers. The file is saved at “/family_reps/<samplename>/<samplename>.tsv”. (by @vagkaratzas)
- #117
  - Added `SEQKIT_SEQ` for optional sequence preprocessing in the quality check subworkflow. (by @vagkaratzas)
  - Added `SEQKIT_REPLACE` for optional sequence name parsing in the quality check subworkflow. (by @vagkaratzas)
  - Added `SEQKIT_RMDUP` for optional removal of duplicate names and sequences in the quality check subworkflow. (by @vagkaratzas)
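As an illustration of what "family clusters from a family-similarity edgelist" can mean, the sketch below groups family identifiers into the connected components of the similarity graph using a small union-find. It is not the code of `POOL_SIMILAR_COMPONENTS`; the function name, the in-memory edgelist format, and the example identifiers are assumptions made for the example.

```python
from collections import defaultdict


def pool_components(edges):
    """Group family ids into clusters (connected components of the similarity graph)."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps trees shallow
            node = parent[node]
        return node

    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # union the two components

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical edgelist: each pair holds two family ids flagged as similar.
example_edges = [("fam_1", "fam_2"), ("fam_2", "fam_3"), ("fam_7", "fam_9")]
print(pool_components(example_edges))
# → [['fam_1', 'fam_2', 'fam_3'], ['fam_7', 'fam_9']]
```

Families linked only indirectly (here `fam_1` and `fam_3`, through `fam_2`) end up in the same cluster, which is the behaviour one would expect from pooling by connected components.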
Changed
- #128 - nf-core tools template update to 3.4.1.
- #124
  - Conditional workflow flags switched to their `skip` opposites: `--trim_msa` to `--skip_msa_trimming`, `--recruit_sequences_with_models` to `--skip_additional_sequence_recruiting`, `--remove_family_redundancy` to `--skip_family_redundancy_removal`, `--remove_sequence_redundancy` to `--skip_sequence_redundancy_removal`. (by @vagkaratzas)
- #118
  - Swapped the local `CHECK_QUALITY` subworkflow with the new nf-core one, `FAA_SEQFU_SEQKIT`. (by @vagkaratzas)
  - Based on protein family reproducibility benchmarks (i.e., computationally reproducing manually curated protein family resources), the `cluster_seq_identity` and `cluster_coverage` parameter default values have been updated to `0.3` and `0.5` (down from `0.5` and `0.9`) respectively. (by @vagkaratzas)
- #117 - Swapped the local `SEQKIT_STATS` and the local `SEQKIT_STATS_TO_MQC` modules with the `SEQFU_STATS` one, which runs a bit faster and produces a MultiQC-ready output without the need for manual parsing. (by @vagkaratzas)
Dependencies
| Tool | Previous version | New version |
|---|---|---|
| seqfu | - | 1.20.3 |
| multiqc | 1.30 | 1.31 |
Deprecated
- #124 - Deprecated `--trim_msa`, `--recruit_sequences_with_models`, `--remove_family_redundancy` and `--remove_sequence_redundancy`. (by @vagkaratzas)
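For users carrying old parameter sets forward, the renaming can be sketched as a small migration helper. This is only an illustration: it assumes the deprecated flags were booleans where `True` meant "run this step", so each value is inverted when moved to its `skip` counterpart. The helper name and the example parameter dictionary are hypothetical and not part of the pipeline.

```python
# Old flag -> new flag, as listed in the changelog entry for #124.
RENAMED_FLAGS = {
    "trim_msa": "skip_msa_trimming",
    "recruit_sequences_with_models": "skip_additional_sequence_recruiting",
    "remove_family_redundancy": "skip_family_redundancy_removal",
    "remove_sequence_redundancy": "skip_sequence_redundancy_removal",
}


def migrate_params(old_params):
    """Return a copy of old_params with deprecated flags replaced by inverted skip flags."""
    new_params = {}
    for name, value in old_params.items():
        if name in RENAMED_FLAGS:
            # Assumption: the old flag was a boolean and True meant "run this step".
            new_params[RENAMED_FLAGS[name]] = not value
        else:
            new_params[name] = value
    return new_params


# Hypothetical parameter set written for an older release.
old = {"trim_msa": True, "remove_family_redundancy": False, "outdir": "results"}
print(migrate_params(old))
# → {'skip_msa_trimming': False, 'skip_family_redundancy_removal': True, 'outdir': 'results'}
```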
Special Thanks
To @jfy133, @erikrikarddaniel and @chrisAta for this version’s PR code reviews.
Fixed
- #112 - Fixed a bug in `EXTRACT_FAMILY_REPS`, where all sequences were pasted into the family representative one, and updated the relevant local nf-test. (by @vagkaratzas)
Changed
- #106 - Swapped the local `EXECUTE_CLUSTERING` subworkflow with the new nf-core `MMSEQS_FASTA_CLUSTER` one. (by @vagkaratzas)
Dependencies
| Tool | Previous version | New version |
|---|---|---|
| multiqc | 1.29 | 1.30 |
Changed
- #104 - Pulling `params` from local subworkflows into main workflow.
- #103 - Parallelized execution for the `EXTRACT_FAMILY_REPS` local module and changed its input from `full_msa` to `fasta`.
- #100 - `CAT_CAT` module replaced with `FIND_CONCATENATE` to avoid large scale `Argument list too long` errors.
- #98 - nf-core tools template update to 3.3.2.
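The `Argument list too long` error in #100 typically appears when a single shell command receives more file paths than the operating system allows on one command line. The sketch below shows the general idea of avoiding it by discovering the files inside the process and streaming them one at a time, rather than passing them all as arguments. It is an illustration in Python, not the implementation of `FIND_CONCATENATE`; the function name and file names are assumptions.

```python
import shutil
import tempfile
from pathlib import Path


def concatenate_files(input_dir, pattern, output_path):
    """Concatenate every file in input_dir matching pattern into output_path; return the count."""
    output_path = Path(output_path)
    count = 0
    with output_path.open("wb") as out_handle:
        # Files are discovered here, so no long argument list is ever built.
        for path in sorted(Path(input_dir).glob(pattern)):
            if path.resolve() == output_path.resolve():
                continue  # never append the output file to itself
            with path.open("rb") as in_handle:
                shutil.copyfileobj(in_handle, out_handle)
            count += 1
    return count


# Small demonstration with hypothetical chunk files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for index in range(3):
        Path(tmp, f"chunk_{index}.faa").write_text(f">seq{index}\nMKV\n")
    merged = Path(tmp, "merged.out")
    print(concatenate_files(tmp, "*.faa", merged))  # → 3
```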
Added
- #105 - `CHECK_QUALITY` subworkflow added at the start of the pipeline. It utilizes the `seqkit/stats` nf-core module to generate a `MultiQC`-ready report with statistics for the input amino acid sequences. The metro-map has been updated to reflect this change.
Added
- #93
  - Added nf-test and `meta.yml` file for local subworkflow `GENERATE_FAMILIES`.
  - Added nf-test and `meta.yml` file for local subworkflow `REMOVE_REDUNDANCY`.
  - Added nf-test and `meta.yml` file for local subworkflow `UPDATE_FAMILIES`.
- #88
  - Added nf-test and `meta.yml` file for local module `BRANCH_HITS_FASTA`.
  - Added nf-test and `meta.yml` file for local module `FILTER_NON_REDUNDANT_FAMS`.
  - Added nf-test and `meta.yml` file for local module `IDENTIFY_REDUNDANT_FAMS`.
  - Added nf-test and `meta.yml` file for local module `EXTRACT_FAMILY_REPS`.
  - Added the default pipeline end-to-end nf-test.
Changed
- #81 - nf-core tools template update to 3.3.1.
Fixed
- #80 - Fixed a bug where, due to a missing check for equal family sizes, non-redundant families were erroneously marked as redundant through transitive relationships and were removed
Changed
- #77 - Default branch changed from `master` to `main`.
- #73 - Changed the fasta parsing library of the `CHUNK_CLUSTERS` local module, from `pyfastx` back to the latest version of `biopython`, and parallelized its writing mechanism, achieving decreased execution time.
Dependencies
| Tool | Previous version | New version |
|---|---|---|
| biopython | 1.84 | 1.85 |
| pyfastx | 2.2.0 | - |
Removed
- #73 - Deprecated `pyfastx` module version of `CHUNK_CLUSTERS`, since it was struggling performance-wise with larger datasets.
Added
- #69 - Added the `hhsuite/reformat` nf-core module to reformat `.sto` alignments to `.fas` when in-family sequence redundancy is not removed. Also added the option to save intermediate and final family fasta files throughout the workflow with various `save` parameters.
- #58 - Added nf-test and `meta.yml` file for local module `REMOVE_REDUNDANCY_SEQS` (Hackathon 2025)
- #56 - Added nf-test and `meta.yml` file for local module `FILTER_RECRUITED` (Hackathon 2025)
- #55 - Added nf-test and `meta.yml` file for local module `CHUNK_CLUSTERS` (Hackathon 2025)
- #54 - Added nf-test for local subworkflow `ALIGN_SEQUENCES` (Hackathon 2025)
- #53 - Added nf-test for local subworkflow `EXECUTE_CLUSTERING` (Hackathon 2025)
- #51 - Added nf-test and `meta.yml` file for local module `CALCULATE_CLUSTER_DISTRIBUTION` (Hackathon 2025)
- #34 - Added the `EXTRACT_UNIQUE_CLUSTER_REPS` module, that calculates initial `MMseqs` clustering metadata, for each sample, to print with `MultiQC` (`Id`, `Cluster Size`, `Number of Clusters`)
Fixed
- #69 - Fixed a bug where redundant family alignments were not published properly, if intra-family redundancy removal mechanism was switched off #68
- #65 - Fixed a bug in `CHUNK_CLUSTERS`, where the pipeline would crash if the module filtered out all clusters, due to a high membership threshold #64
- #35 - Fixed a bug in `remove_redundant_fams.py`, where family sizes were compared as strings instead of integers when deciding which family is larger and should be kept
- #33 - Fixed an always-true condition in the `filter_non_redundant_hmms.py` script, by adding missing parentheses
- #29 - Fixed `hmmalign` empty input crash error, by preventing the `FILTER_RECRUITED` module from creating an empty output .fasta.gz file, when there are no remaining sequences after filtering the `hmmsearch` results #28
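The kind of bug fixed in #35 is easy to reproduce: values read from a text file are strings, and strings compare character by character rather than numerically. The sketch below is a minimal illustration of that failure mode and its fix. The function names and family identifiers are hypothetical and are not taken from `remove_redundant_fams.py`.

```python
def keep_larger_buggy(fam_a, size_a, fam_b, size_b):
    """Pick the family with the larger size, comparing the raw values as given."""
    # With string inputs, ">" is lexicographic, so "9" is considered larger than "10".
    return fam_a if size_a > size_b else fam_b


def keep_larger_fixed(fam_a, size_a, fam_b, size_b):
    """Pick the family with the larger size after converting both sizes to integers."""
    return fam_a if int(size_a) > int(size_b) else fam_b


# Sizes as they would arrive from a parsed text file: strings, not numbers.
print(keep_larger_buggy("fam_1", "9", "fam_2", "10"))  # → fam_1 (wrong: 9 < 10)
print(keep_larger_fixed("fam_1", "9", "fam_2", "10"))  # → fam_2
```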
Changed
- #69 - Changed the publish directory architecture for HMMs, seed MSAs, full MSAs and family FASTA files, to make it more intuitive. `REMOVE_REDUNDANT_FAMS` local module converted to `IDENTIFY_REDUNDANT_FAMS` to extract redundant family ids which will then be used downstream. `FILTER_NON_REDUNDANT_HMMS` local module converted to `FILTER_NON_REDUNDANT_FAMS` and reused four times (HMM, seed MSA, full MSA, FASTA). Changed the output format of the `EXTRACT_FAMILY_REPS` and `REMOVE_REDUNDANT_SEQS` local modules from `.fa` to `.faa`. Metro map updated with new `hhsuite/reformat` module.
- #57 - Slight improvements of `nextflow_schema.json` (Hackathon 2025)
- #57 - Slight improvements of `assets/schema_input.json` (Hackathon 2025)
- #34 - Swapped the `SeqIO` python library with `pyfastx` for the `CHUNK_CLUSTERS` module, quartering its duration
- #32 - Updated `ClipKIT` 2.4.0 -> 2.4.1, that now also allows ends-only trimming, to completely replace the custom `CLIP_ENDS` module. Users can now also define its output format by setting the `--clipkit_out_format` parameter (default: `clipkit`)
Dependencies
| Tool | Previous version | New version |
|---|---|---|
| ClipKIT | 2.4.0 | 2.4.1 |
| pyfastx | - | 2.2.0 |
| hhsuite | - | 3.3.0 |
| multiqc | 1.27 | 1.28 |
Deprecated
- #32 - Deprecated `CLIP_ENDS` module and `--clipping_tool` parameter. The only option now is `ClipKIT`, covering both previous modes, via setting `--trim_ends_only`
Initial release of nf-core/proteinfamilies, created with the nf-core template.
Added
- Amino acid sequence clustering (mmseqs)
- Multiple sequence alignment (famsa, mafft, clipkit)
- Hidden Markov Model generation (hmmer)
- Between families redundancy removal (hmmer)
- In-family sequence redundancy removal (mmseqs)
- Family updating (hmmer, seqkit, mmseqs, famsa, mafft, clipkit)
- Family statistics presentation (multiqc)
By @vagkaratzas and @mberacochea.