Detailed documentation for epiNB

class epinb.NBMulti(n_pan=10, n_spec=10, n_jobs=None, **kwargs)

Bases: object

fit(X, y)

predict(X, return_series=False)

predict_log_odds(X, return_df=False)

predict_log_proba(X, return_df=False)

class epinb.NBScore(n_pan: int = 10, n_spec: int = 10, pan_feature_candidates: Optional[List] = None, *, smoothing_strength_1: float = 0.0, smoothing_strength_2: float = 0.0)

Bases: object

Create an epiNB model.

Parameters

n_pan – Number of pan-allelic 2nd-order motifs used in prediction. Default/recommended: 10.
n_spec – Number of allele-specific 2nd-order motifs used in prediction. Default/recommended: 10.
pan_feature_candidates – Customize pan-allelic 2nd-order motifs. Default/recommended: [(1, -1), (0, 1), (4, -4), (1, 2), (0, -1), (-2, -1), (2, 4), (-3, -1), (2, -1), (1, 3), (0, 2)]
smoothing_strength_1 – Smoothing used for 1st-order motifs using BLOSUM62. Default/recommended: 0.
smoothing_strength_2 – Smoothing used for 2nd-order motifs using BLOSUM62. Default/recommended: 0.

counter_1(X: ndarray) → ndarray

Counter for 1st order motifs.

Parameters: X – The peptide matrix
Returns: A matrix representing the frequency of each AA at each position
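As an illustration of what a 1st order count amounts to, the sketch below tallies amino-acid frequencies at a single position in plain Python. It is not the library's implementation: the helper name `position_frequencies`, the 20-letter alphabet, and the use of negative indices to count from the C-terminus are assumptions made for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: the 20 standard residues


def position_frequencies(peptides, position):
    """Return {AA: frequency} at one position (negative = from the C-terminus)."""
    counts = Counter(p[position] for p in peptides)
    total = sum(counts.values())
    return {aa: counts.get(aa, 0) / total for aa in AMINO_ACIDS}


peptides = ["SIINFEKL", "SLYNTVATL", "GILGFVFTL"]
freq_p2 = position_frequencies(peptides, 1)     # second residue of each peptide
freq_last = position_frequencies(peptides, -1)  # C-terminal residue
```

Repeating this for every position gives one frequency column per position, which is the kind of table a logo plot takes as input.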

counter_2(X: ndarray, features: List[Tuple[int, int]]) → ndarray

Counter for 2nd order motifs.

Parameters

X – The peptide matrix
features – The requested 2nd order motifs

Returns

A matrix representing the frequency of each AA combination (400 in total) for each requested 2nd order motif.
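A 2nd order motif is a pair of positions such as (1, -1), and counting it means tallying which pair of residues occurs at those two positions. The sketch below shows the idea in plain Python. The helper name `pair_frequencies` is hypothetical, and reading negative indices as offsets from the C-terminus is an assumption based on the default pan_feature_candidates.

```python
from collections import Counter


def pair_frequencies(peptides, feature):
    """Return {"ab": frequency} for the residue pair at positions (i, j)."""
    i, j = feature
    counts = Counter(p[i] + p[j] for p in peptides)
    total = sum(counts.values())
    # Only observed pairs are returned; the other combinations have frequency 0.
    return {pair: n / total for pair, n in counts.items()}


peptides = ["SIINFEKL", "SLYNTVATL", "GILGFVFTL"]
freq = pair_frequencies(peptides, (1, -1))  # 2nd residue paired with the last residue
```

With 20 amino acids there are 20 x 20 = 400 possible pairs per motif, which matches the 400 combinations mentioned above.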

fit(X: Iterable[str], min_len: int = 8, max_len: int = 11)

Fit the model.

Parameters

X – Peptides
min_len – Minimum peptide length to be counted. Shorter peptides are discarded.
max_len – Maximum peptide length to be counted. Longer peptides are discarded.

Returns

Fitted model (self)

fit_details_1(what='freq')

Show frequency of AAs (aka motifs, as input for a logo plot)

Parameters: what – “freq” for frequency, or “log_odds” for log odds
Returns: the requested details

fit_details_2(what='freq', topk=None)

Show frequency of AA combinations (2nd order motifs)

Parameters

what – “freq” for frequency, “log_odds” for log odds, or “surplus” for P(ab) - P(a)P(b).
topk – If unspecified, return the matrix directly. If specified, sort the values, and return the AA combinations and the values in two data frames. Only the topk values will be returned. Thus, specify 400 if all values are wanted.

Returns

the requested details
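The “surplus” quantity P(ab) - P(a)P(b) measures how much more (or less) often a residue pair occurs than it would if the two positions were independent. A minimal sketch of that arithmetic follows. The helper name `pair_surplus` and the toy input are assumptions for illustration, and the library may estimate the probabilities differently (for example with smoothing).

```python
from collections import Counter


def pair_surplus(pairs):
    """Return {ab: P(ab) - P(a)P(b)} for every observed residue pair."""
    n = len(pairs)
    joint = Counter(pairs)                   # counts of the pair "ab"
    first = Counter(p[0] for p in pairs)     # counts of residue a at the first position
    second = Counter(p[1] for p in pairs)    # counts of residue b at the second position
    return {
        ab: joint[ab] / n - (first[ab[0]] / n) * (second[ab[1]] / n)
        for ab in joint
    }


# Toy data: residue pairs already extracted at one 2nd order motif.
pairs = ["AL", "AL", "GV", "GL"]
surplus = pair_surplus(pairs)
```

A positive value means the pair co-occurs more often than independence predicts, and a negative value means it co-occurs less often.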

predict_details(X: Iterable[str], *, log_prior=6.906754778648553)

Show prediction details for a list of peptides.

Parameters

X – input peptides.
log_prior – log(Neg:Pos) as the prior for converting odds to probability. The default is the natural log of 999.

Returns

a data frame containing prediction details

predict_log_odds(X: Iterable[str])

Predict log odds for a list of peptides. This is the recommended measurement to rank peptides because it minimizes numerical issues.

Parameters: X – Input peptides
Returns: log odds

predict_log_proba(X: Iterable[str], *, log_prior: float = 6.906754778648553)

Predict log probability for a list of peptides. This is not the recommended measurement because numerical issues may make it hard to rank peptides. Peptides that are ranked high may have indistinguishable log probabilities. Please use log odds for ranking.

Parameters

X – input peptides.
log_prior – log(Neg:Pos) as the prior for converting odds to probability. The default is the natural log of 999, matching the default prior of predict_proba.

Returns

log probabilities

predict_proba(X: Iterable[str], *, prior: float = 999.0) → DataFrame

Predict (linear scale) probability for a list of peptides. Use of this measurement in ranking peptides is discouraged because many will have identical probabilities. Please use log odds for ranking.

Parameters

X – input peptides.
prior – Neg:Pos ratio as the prior for converting odds to probability.

Returns

probabilities
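The relationship between log odds, the prior, and the probability can be sketched as follows. This assumes the usual Bayesian conversion, in which the posterior log odds equal the predicted log odds minus log(Neg:Pos). The function name `log_proba_from_log_odds` is hypothetical and the library's own computation may differ in detail.

```python
import math


def log_proba_from_log_odds(log_odds, log_prior=math.log(999.0)):
    """Convert log odds to log probability under a Neg:Pos prior (assumed formula)."""
    z = log_odds - log_prior  # posterior log odds
    # log(sigmoid(z)), written so that it stays stable for large |z|
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


# Two strong binders with clearly different log odds ...
lp_a = log_proba_from_log_odds(50.0)
lp_b = log_proba_from_log_odds(60.0)

# ... collapse to the same probability on the linear scale.
p_a, p_b = math.exp(lp_a), math.exp(lp_b)
```

In double precision both probabilities round to exactly 1.0, while the log odds (50 versus 60) still separate the two peptides. This illustrates why log odds are the safer ranking score.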

seq2matrix(peptides: Iterable[str], no_warning: bool = False, return_ind: bool = False, min_len: int = 0, max_len: int = 100) → Tuple[Optional[List[int]], ndarray]

Helper function to convert sequences to a matrix

Parameters

peptides – Peptides to be processed.
no_warning – If true, do not warn when ignoring a peptide for unknown AAs (e.g. X)
return_ind – If true, return the indices of the kept peptides in the input. This helps to align the results with the input, even when some inputs are filtered out.
min_len – Minimum peptide length to be counted. Shorter peptides are discarded.
max_len – Maximum peptide length to be counted. Longer peptides are discarded.

Returns

The matrix (a NumPy array), or the indices and the matrix when return_ind is true.
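The role of the returned indices can be shown with a plain-Python sketch. It is not the library's encoder: the helper `encode_peptides`, the integer encoding, the inclusive length bounds, and the ragged list-of-lists output are all assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: the 20 standard residues
AA_INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def encode_peptides(peptides, min_len=0, max_len=100):
    """Return (kept_indices, rows), skipping peptides that cannot be encoded."""
    kept, rows = [], []
    for idx, pep in enumerate(peptides):
        if not (min_len <= len(pep) <= max_len):  # bounds assumed inclusive
            continue
        if any(aa not in AA_INDEX for aa in pep):  # unknown residue such as X
            continue
        kept.append(idx)
        rows.append([AA_INDEX[aa] for aa in pep])
    return kept, rows


peptides = ["SIINFEKL", "SIINXEKL", "ACD", "GILGFVFTL"]
kept, rows = encode_peptides(peptides, min_len=8, max_len=11)

# Stand-in scores (peptide length) to show how results are realigned with the input.
scores = [float(len(row)) for row in rows]
aligned = [None] * len(peptides)
for i, s in zip(kept, scores):
    aligned[i] = s
```

Here `kept` records which inputs survived filtering, so per-peptide results can be written back to their original positions and the discarded peptides stay marked as missing.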

spec_feature_selection(X: ndarray, n_spec: int) → list[tuple[int, int]]

Select allele-specific 2nd order motifs

Parameters

X – The peptide matrix.
n_spec – Number of motifs.

Returns

A list of motifs