Collects a number of biological data analytics tasks. This library provides tools for the bio_db served data, empowering downstream analyses of user's experimental data.
Installation and loading:
?- pack_install(bio_analytics). % also installs pack(lib). other dependencies are installed at load time via lib ?- use_module(library(lib)). ?- lib(bio_analytics).
The library comes with one experimental dataset: data/silac/bt.csv which is used in the example files in directory examples/ .
?- bio_analytics_version(V,D). V = 0:6:0, D = date(2023,6,6).
Exp is in the form of a mtx/1 matrix. Selected (DEs) and rejected (NonDEs) are returned in common format,
of either (default) Gene-Ev, where Ev is the expression value for Gene in Exp, or sub-matrices of Exp.
The format is controlled by option as_pairs
.
We use ev as short for expression value, to avoid confusion with exp which short for experiment.
Opts
mtx(Exp,Mtx)
call)?- lib(mtx), absolute_file_name(pack('bio_analytics/data/silac/bt.csv'), CsvF), mtx(CsvF, Mtx, convert(true)), assert(mtx_data(Mtx)). CsvF = '/usr/local/users/nicos/local/git/lib/swipl/pack/bio_analytics/data/silac/bt.csv', Mtx = [row('Protein IDs', 'Symbols', log2FC, adj.pvalue), row('Q9P126', ...), row(..., ..., ..., ...)|...]. ?- lib(debug_call), debug(testo), mtx_data(Mtx), debug_call(testo, dims, mtx/Mtx), bio_diffex(Mtx, DEPrs, NonDEPrs, []), debug_call(testo, length, de/DEPrs), debug_call(testo, length, nde/NonDEPrs). % Dimensions for matrix, (mtx) nR: 1245, nC: 4. % Length for list, de: 426. % Length for list, nde: 818. DEPrs = ['CLEC1B'- -4.80614893171469, 'LGALS1'- -2.065877096524, ... - ...|...], NonDEPrs = ['CNN2'- -0.69348026828097, 'CXCR4'-0.73039667221395, ... - ...|...]. ?- mtx_data(Mtx), bio_diffex(Mtx, DEPrs, NonDEPrs, exp_pv_cut(0.01)), debug_call(testo, length, de/DEPrs), debug_call(testo, length, nde/NonDEPrs). % Length for list, de: 286. % Length for list, nde: 958. ?- mtx_data(Mtx), Opts = [exp_ev_cut_let(inf),exp_ev_cut_get(-inf)], bio_diffex(Mtx, DEPrs, NonDEPrs, Opts), debug_call(testo, length, de/DEPrs), debug_call(testo, length, nde/NonDEPrs). % Length for list, de: 581. % Length for list, nde: 663. ?- mtx_data(Mtx), Opts = [exp_ev_cut_let(inf),exp_ev_cut_get(-inf),as_pairs(false)], bio_diffex(Mtx, DEs, NonDEs, Opts), debug_call(testo, length, de/DEPrs), debug_call(testo, length, nde/NonDEPrs). % Length for list, de: 582. % Length for list, nde: 664. Mtx = [row('Protein IDs', 'Symbols', log2FC, adj.pvalue), row(..., ..., ..., ...)|...], Opts = [exp_ev_cut_let(inf), exp_ev_cut_get(-inf), as_pairs(false)], DEs = [row('Protein IDs', 'Symbols', log2FC, adj.pvalue), row('Q9P126', 'CLEC1B', -4.80614893171469, 2.057e-37), row(..., ..., ..., ...)|...], NonDEs = [row('Protein IDs', 'Symbols', log2FC, adj.pvalue), row('B4DUT8;Q6FHC3;Q6FHE4;Q99439;B4DDF4;Q53GK7;B4DN57;B4DHU5;K7ES69;H3BVI6;H3BQH0', 'CNN2', -0.69348026828097, 0.0814675), row(..., ..., ..., ...)|...].
Vect can be a list or pairlist (K-V) of which the K element is taken to be the identifier. Vect can also be an atomic representing the Nth column on column for Mtx.
Opts
?- hgnc_homs_hgnc_ncbi( 19295, Ncbi ), bio_symbols( [Ncbi], Symbs, org_exp_id(Ncbi) ). Ncbi = 114783, Symbs = ['LMTK3'].
Uses ggplot2 and bio_analytics:bio_diffex/4. The predicate creates an R data.frame and creates a ggplot2 object from which typically output files are produced.
Opts
false
if no output is wantedggplot()
themeOpts are passed to bio_diffex/4. This predicate also pinches the defaults from there.
Examples, fixme: use iris dataset.
?- bio_volcano_plot([]).
Note that de-reguation trumps identification, that is, currently there is no distiction between genes that seen to be both significantly de-regulated and identified versus just those that are simply significantly de-regulated.
In the exp_gene_family_string_graph/6 version DEPrs and NonDEPrs can be provided, which optimises use in loops over families (see exp_go_over_string_graphs/4).
Opts
These Options are passed to a number of other pack predicates.
See [pack('bio_analytics/examples/bt.pl')].
?- lib(real), lib(mtx), absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), CsvF ), Ppts = [ vjust= -1, node_size(3), mode="fruchtermanreingold", format(svg), stem(bt)], Opts = [ exp_ev_cut_let(inf), exp_ev_cut_get(-inf), include_non_present(false), include_non_significant(false), minw(200), wgraph_plot_opts(Ppts) ], exp_gene_family_string_graph( CsvF, autophagy, G, Opts ).
Produces file: bt.svg
[[../doc/images/bt.svg]]
For experimental data in CsvF select de-regulated genes and on
those perform over representation analysis in gene ontology.
Results in GoOver are either as a values list or a csv file, the name of which is
assumed to be the ground value of GoOVer.
Opts
[BP,MF,CC]
[genome,go,go_exp,experiment]
Options are also passed to bio_diffex/4.
?- absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), CsvF ), exp_go_over( CsvF, GoOver, [] ). CsvF = '.../swipl/pack/bio_analytics/data/silac/bt.csv', GoOver = [row('GOBPID', 'Pvalue', adj.pvalue, ...), row('GO:0061387', 0.000187, ...)|...]. ?- absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), CsvF ), exp_go_over( CsvF, GoOver, [stem(here),to_file(true)] ). CsvF = '.../swipl/pack/bio_analytics/data/silac/bt.csv', GoOver = here_gontBP_p0.05_univExp.csv. ?- absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), CsvF ), exp_go_over( CsvF, 'a_file.csv', [stem(here),to_file(true)] ). CsvF = '.../swipl/pack/bio_analytics/data/silac/bt.csv', ?- lib(by_unix). ?- @wc(-l,'a_file.csv'). 183 a_file.csv ?- @wc(-l,'here_gontBP_p0.05_univExp.csv'). 183 here_gontBP_p0.05_univExp.csv true. ?- absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), CsvF ), exp_go_over( CsvF, OverF, [to_file(true)] ). CsvF = '.../swipl/pack/bio_analytics/data/silac/bt.csv', OverF = go_over_gontBP_p0.05_univExp.csv. ?- absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), CsvF ), exp_go_over( CsvF, OverF, [to_file(true),stem(false)] ). CsvF = '.../swipl/pack/bio_analytics/data/silac/bt.csv', OverF = '.../swipl/pack/bio_analytics/data/silac/bt_gontBP_p0.05_univExp.csv'.
When GoOver is a variable expr_go_over/3 is called to generate it.
Opts
go_name, go_id
.
Here the length of GO terms (Len) is added to form go_pair_ord(I,Len)
for forwardingdiffex_only = VizOpts
restricts genes to significants only.[vjust = -1, node_size(3), format(svg)]
.Options are passed to exp_gene_family_string_graph/4.
?- debug(real). ?- debug(exp_go_over_string_graphs). ?- absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), Exp ), exp_go_over_string_graphs( Exp, Gov, Dir, [] ). % Sending to R: pltv <- ggnet2(lp_adj,vjexp_go_over_mtxust = -1,size = 3,label = pl_v_1,color = pl_v_2,edge.size = pl_v_3,edge.color = "#BEAED4")
Can ran across a number of values for parameters and number of datasets. A directory is created for each combination of input parameters. Subdirectories are created for each input dataset.
Can be used with single parameter values as well, in order to add that parameter to the output directory's name.
Opts
viz_de_opts()
of exp_go_over_string_graphs/4de_max()
and exp_pv_cut()
are the same, use viz_de_opts(diffex_only)
multi()
options are processed in the order they are given, with later ones varied first in the runs.
?- Opts = [], exp_go_over_string_graphs_multi( Opts ).
Currently 2 organisms are supported: human and mouse.
Family can be a gene ontology term (atom of the form GO:XXXXXX).
If family is a list of numbers or atoms that map to numbers, then they are taken to be Entrez ids which
are converted to gene symbols in Symbols.
If family is a list of symbols, it is passed on to Symbols.
Listens to debug(gene_family)
.
Known families:
Located family as the bio_analytics gene collection: autophagy ?- gene_family( autophagy, human, Auto ), length( Auto, Len ). Auto = ['AMBRA1', 'APOL1', 'ARNT', 'ARSA', 'ARSB', 'ATF4', 'ATF6', 'ATG10', 'ATG12'|...], Len = 232. ?- debug( gene_family ). ?- gene_family( 375, hs, Symbs ), length( Symbs, Len ). % Located GO term as the family identifier. Symbs = ['BCAS2', 'DBR1', 'DDX23', 'GEMIN2', 'KHSRP', 'LSM1', 'MPHOSPH10', 'PRPF3', 'PRPF4'|...], Len = 26. ?- gene_family( 'GO:0000375', hs, Symbs ), length( Symbs, Len ). % Located GO term as the family identifier. Symbs = ['BCAS2', 'DBR1', 'DDX23', 'GEMIN2', 'KHSRP', 'LSM1', 'MPHOSPH10', 'PRPF3', 'PRPF4'|...], Len = 26. ?- gene_family( [55626, 8542, 405], hs, Auto ). Converted input from Entrezes to Symbols. Auto = ['AMBRA1', 'APOL1', 'ARNT']. ?- gene_family( unknown, hs, Auto ). ERROR: Unhandled exception: gene_family(cannot_find_input_family_in_the_known_ones(unknown,[autophagy]))
Family datasets are in pack(bio_analytics/data/families)
.
Go can be either a GO: atom or an integer.
[debug] ?- go_org_symbol( mouse, 2, Symbs ), write(Symbs), nl, fail. Akt3 Mef2a Mgme1 Mpv17 Mrpl15 Mrpl17 Mrpl39 Msto1 Opa1 Slc25a33 Slc25a36 Tymp false. [debug] ?- go_org_symbol( hs, 'GO:0000002', Symbs ), write(Symbs), nl, fail. AKT3 LONP1 MEF2A MGME1 MPV17 MSTO1 OPA1 PIF1 SLC25A33 SLC25A36 SLC25A4 TYMP false.
?- go_org_symbols( 'GO:0000375', human, Symbs ), length( Symbs, Len ). Symbs = ['BCAS2', 'DBR1', 'DDX23', 'GEMIN2', 'KHSRP'|...], Len = 26. ?- go_org_symbols( 'GO:0000375', mouse, Symbs ), length( Symbs, Len ). Symbs = ['Dbr1', 'Rbm17', 'Rnu4atac', 'Rnu6atac', 'Scaf11', 'Sf3a2', 'Slu7', 'Srrm1', 'Srsf10'|...], Len = 10. ?- go_org_symbols( 375, human, Symbs ), length( Symbs, Len ). Symbs = ['BCAS2', 'DBR1', 'DDX23', 'GEMIN2', 'KHSRP', 'LSM1', 'MPHOSPH10', 'PRPF3', 'PRPF4'|...], Len = 26.
The Graph can be saved via options to wgraph_plot/2.
Opts are passed to symbols_string_graph/3, go_term_symbols/3 and wgraph_plot/2.
Opts
go_id
, go_pair
, go_pair_ord(I,Len)
.?- go_string_graph( 'GO:0016601', G, true ). ?- go_string_graph( 'GO:0016601', G, plot(true) ). ?- go_string_graph( 'GO:0016601', G, [plot(true),save(true)] ). G = ['ABI2', 'AIF1', 'CDH13', 'HACD3', 'NISCH', 'ALS2'-'RAC1':892, 'BRK1'-'CYFIP1':999, ... - ... : 904, ... : ...|...]. ?- ls. % Rac_protein_signal_transduction.csv Rac_protein_signal_transduction_graph.csv Rac_protein_signal_transduction_layout.csv true. ?- go_string_graph( 'GO:0016601', G, [plot(true),save(true),stem_type(go_pair)] ).
Opts
Listens to debug(go_symbols_reach)
.
?- go_symbols_reach( 'GO:0000375', Symbs, [] ), length( Symbs, Len ). Symbs = ['AAR2', 'ALYREF', 'AQR', 'ARC', 'BCAS2', 'BUD13', 'BUD31', 'CACTIN', 'CASC3'|...], Len = 293. ?- go_symbols_reach( 'GO:0000375', Symbs, org(mouse) ), length( Symbs, Len ). Symbs = ['4930595M18Rik', 'Aar2', 'Aqr', 'Bud13', 'Bud31', 'Casc3', 'Cdc40', 'Cdc5l', 'Cdk13'|...]. Len = 190. ?- go_symbols_reach( 375, Symbs, true ), length( Symbs, Len ). Symbs = ['AAR2', 'ALYREF', 'AQR', 'ARC', 'BCAS2', 'BUD13', 'BUD31', 'CACTIN', 'CASC3'|...], Len = 293.
Opts
[false,max,min,umax,umin]
the max
and min
versions are more efficient but do sort the edges,
whereas umax
and umin
take much longer but leave edges in found order?- Got = 'GO:0043552', gene_family( Got, hs, Symbs ), length( Symbs, SymbsLen ), symbols_string_graph(Symbs, Graph, true ), length( Graph, GraphLen ). Got = 'GO:0043552', Symbs = ['AMBRA1', 'ATG14', 'CCL19', 'CCL21', 'CCR7'|...], SymbsLen = 31, Graph = ['TNFAIP8L3', 'AMBRA1'-'ATG14':981, 'AMBRA1'-'PIK3R4':974|...], GraphLen = 235. % the following is flawed as it uses Got from mouse and tries to build the graph in human STRING... ?- Got = 'GO:0043552', gene_family( Got, mouse, Symbs ), length( Symbs, SymbsLen ), symbols_string_graph(Symbs, Graph, true ), length( Graph, GraphLen ). Got = 'GO:0043552', Symbs = Graph, Graph = ['Ambra1', 'Atg14', 'Ccl19', 'Cd19', 'Cdc42', 'Epha8', 'Fgf2', 'Fgfr3', 'Fgr'|...], SymbsLen = GraphLen, GraphLen = 28. % this the correct way for running the first query for mouse: ?- Got = 'GO:0043552', gene_family( Got, mouse, Symbs ), length( Symbs, SymbsLen ), symbols_string_graph(Symbs, Graph, org(mouse) ), length( Graph, GraphLen ). Got = 'GO:0043552', Symbs = ['Ambra1', 'Atg14', 'Ccl19', 'Cd19', 'Cdc42', 'Epha8'|...], SymbsLen = 28, Graph = ['Ambra1', 'Cdc42', 'Lyn', 'Nod2', 'Pdgfra', 'Sh3glb1', 'Tnfaip8l3', 'Vav3', ... : ...|...], GraphLen = 117.