The installation currently works on Linux and macOS only. The Conda and Mamba options below use the files `environment.yml` and `build.sh`, which are located in this repository.
Option 1: Using Conda

```bash
# construct the environment
conda env create -f environment.yml
conda activate qsin
# install the Julia dependencies for qsin
./build.sh
```

Option 2: Using Docker

```bash
# pull the image
docker pull ulisesrosas/qsin_docker:latest
# run the container
docker run -it ulisesrosas/qsin_docker
```

Option 3: Using Mamba
```bash
# construct the environment
mamba env create -f environment.yml
conda activate qsin
# install the Julia dependencies for qsin
./build.sh
```

Option 4: Manual Installation
This option requires that Julia, R, and Python are already installed on your system. You can install the Julia dependencies by running the following commands in the Julia console:
```julia
using Pkg
Pkg.add("CSV")
Pkg.add("DataFrames")
Pkg.add(Pkg.PackageSpec(; name="PhyloNetworks", version="0.16.4"))
Pkg.add("Suppressor")
```

You can install the R dependencies by running the following command in the R console:

```r
install.packages("SiPhyNetwork")
```

You can install the Python dependencies by running the following command in the terminal:
```bash
pip install qsin
```

```bash
# create the directory where the simulated networks will be stored
mkdir -p ./demo/test_sims
# simulate 1000 random networks
sim_nets.R 6 --max_iter 1000 \
    --prefix test \
    --out_path ./demo/test_sims \
    --ncores 4
```

```bash
infer_qlls.jl ./test_data/1_seqgen.CFs.csv \
    ./demo/test_sims/test*.txt \
    --outfile ./demo/test_qll.csv \
    --ncores 4
```

```bash
path_subsampling.py ./test_data/test_qll.csv \
    ./test_data/1_seqgen.CFs.csv \
    --wpath \
    --factor 0.5 \
    --e 1e-2 \
    --cv \
    --folds 5 \
    --alpha 0.5 0.95 0.99 \
    --ncores 4 \
    --verbose \
    --prefix ./demo/linear_batches
```

- `--wpath` specifies whether the elastic net path information is output.
- `--factor` is the subsampling factor: 0.5 means that selected rows of the input file are written until around half of the rows are selected.
- `--e` is the constant that multiplies lambda_max to obtain lambda_min. For larger datasets this value can be set smaller (e.g., 1e-3 or 1e-4) to obtain a longer elastic net path.
- `--cv` specifies that cross-validation is used.
- `--folds` is the number of folds for cross-validation.
- `--alpha` is the set of alpha values to test.
- `--ncores` is the number of cores used in cross-validation.
- `--verbose` makes the output verbose.
- `--prefix` is the prefix for the output files.
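To make the role of `--e` concrete, here is a minimal sketch (not qsin's actual code) of how an elastic net regularization grid is typically built: lambda values are log-spaced from lambda_max, the smallest penalty that zeroes out every coefficient, down to lambda_min = e * lambda_max.

```python
import numpy as np

def lambda_grid(X, y, e=1e-2, n_lambdas=100):
    """Log-spaced regularization grid from lambda_max down to e * lambda_max.

    lambda_max is the usual Lasso bound: the smallest penalty for which
    every coefficient is zero (for a centered response).
    """
    n = X.shape[0]
    lam_max = np.max(np.abs(X.T @ (y - y.mean()))) / n
    return np.logspace(np.log10(lam_max), np.log10(e * lam_max), n_lambdas)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + rng.normal(size=50)

grid = lambda_grid(X, y, e=1e-2)
# a smaller e (e.g., 1e-4) stretches the path toward weaker regularization
assert np.isclose(grid[-1], 1e-2 * grid[0])
```

A smaller `--e` therefore extends the path to less-regularized (denser) solutions, which is why larger datasets benefit from values like 1e-3 or 1e-4.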
At some points of the elastic net path, in some folds of cross-validation, we might see convergence warning messages when alpha = 0.99 or 0.95 is tested. This means that, at these points, Lasso-like solutions are likely not the best fit. These messages disappear when the tolerance value is increased (e.g., `--tol 0.01`).
Subsampling using ISLE, the non-parametric sparse model
We essentially add the `--isle` flag to the command above, together with the corresponding ensemble parameters:
```bash
path_subsampling.py ./test_data/test_qll.csv \
    ./test_data/1_seqgen.CFs.csv \
    --wpath \
    --factor 0.5 \
    --e 1e-2 \
    --cv \
    --folds 5 \
    --alpha 0.5 0.95 0.99 \
    --ncores 4 \
    --verbose \
    --prefix ./demo/non_linear_batches \
    --isle --eta 0.0316 0.1487 0.2658 0.3829 0.5 \
    --nu 0.01 0.0325 0.055 0.0775 0.1 \
    --max_leaf_nodes 2 \
    --cv_sample 300
```

- `--isle` specifies that the ISLE method is used.
- `--eta` is the set of eta values to test; this adds randomness to the ensemble.
- `--nu` is the set of nu values to test; this adds memory to the ensemble.
- `--max_leaf_nodes` is the maximum number of leaf nodes in the base learners. It is fixed to 2 in this example, but it can also be a list of values.
- `--cv_sample` is the maximum number of hyperparameter combinations tested in cross-validation. If the total number of combinations is larger than this value, a random sample of this size is taken from all combinations.
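For intuition about what eta and nu control, here is a toy numpy sketch of an ISLE-style ensemble (an illustration under simplified assumptions, not qsin's implementation): each base learner is a two-leaf stump (matching `--max_leaf_nodes 2`) fit on a random eta-fraction of the rows, while nu shrinks each learner's contribution to the running "memory" prediction.

```python
import numpy as np

def fit_stump(X, r):
    """Best two-leaf regression stump on residuals r (split at each feature's median)."""
    best = None
    for j in range(X.shape[1]):
        thr = np.median(X[:, j])
        left = X[:, j] <= thr
        pred = np.where(left, r[left].mean(), r[~left].mean())
        sse = np.sum((r - pred) ** 2)
        if best is None or sse < best[0]:
            best = (sse, j, thr, r[left].mean(), r[~left].mean())
    _, j, thr, lv, rv = best
    return lambda Z, j=j, thr=thr, lv=lv, rv=rv: np.where(Z[:, j] <= thr, lv, rv)

def isle_ensemble(X, y, n_learners=30, eta=0.5, nu=0.05, seed=0):
    """ISLE-style perturbed sampling:
    eta = fraction of rows each learner sees (randomness),
    nu  = shrinkage of the running prediction (memory)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    F = np.zeros(n)            # running "memory" prediction
    learners = []
    for _ in range(n_learners):
        idx = rng.choice(n, size=int(eta * n), replace=False)
        stump = fit_stump(X[idx], (y - F)[idx])  # fit the current residuals
        learners.append(stump)
        F += nu * stump(X)     # nu = 0 would mean no memory at all
    return learners

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
y = X[:, 0] + 0.1 * rng.normal(size=80)
learners = isle_ensemble(X, y, eta=0.5, nu=0.05)
H = np.column_stack([h(X) for h in learners])  # columns = base-learner predictions
```

The matrix `H` then plays the role of the design matrix for the sparse post-processing step, where the elastic net path selects a small subset of the base learners.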
You can inspect the elastic net path used to create the subsample by running the following Python code:
```python
import numpy as np
import matplotlib.pyplot as plt

errors_path = np.loadtxt("./demo/linear_batches_testErrors.csv", delimiter=',')
elnet_path = np.loadtxt("./demo/linear_batches_elnetPath.csv", delimiter=',')

fs = 14
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 5))
# coefficient values along the path (first column holds lambda)
ax1.plot(elnet_path[:, 0], elnet_path[:, 1:], marker='o', alpha=0.8, markersize=5)
ax1.set_xscale('log')
ax1.set_ylabel('Coefficient', fontsize=fs)
ax1.axhline(0, color='black', lw=2)
ax1.set_title('Elastic Net Path', fontsize=fs)
# test error along the same path
ax2.plot(errors_path[:, 0], errors_path[:, 1], marker='o', alpha=0.8, markersize=5)
ax2.set_xscale('log')
ax2.set_yscale('log')
ax2.set_xlabel('$\\lambda$', fontsize=fs)
ax2.set_ylabel('RMSE', fontsize=fs)
ax2.set_title('Test Error', fontsize=fs)
plt.tight_layout()
plt.show()
```

```bash
# let's take the last row from the subsamples obtained above
tail -n1 ./demo/linear_batches_overlappedBatches.txt > ./demo/linear_batches_last.txt
```
```bash
infer_nets_batches.jl ./test_data/1_seqgen.QMC.tre \
    ./test_data/1_seqgen.CFs.csv \
    ./demo/linear_batches_last.txt \
    --h_max 1 \
    --prefix ./demo/linear_overlapped \
    --ncores 4 \
    --nruns 10
```

We can compare the obtained networks with the original network, which is available in the `test_data` folder. You might need to install the PhyloPlots and RCall packages to run this code in Julia. The first plot is the full-data network, and the second plot is the half-data network.
```julia
using PhyloNetworks;
using PhyloPlots; # you may need to install the PhyloPlots package
using RCall;      # you may need to install the RCall package

qsin_net = readInputTrees("./demo/linear_overlapped_nets.txt");
ori_net = readInputTrees("./test_data/full_data_net6.txt");

# draw both networks side by side
R"layout(matrix(1:2, 1, 2))";
R"par"(mar=[0,0,1,0], bg = "white");
plot(ori_net[1], showgamma=true);
R"mtext"("Full data network (15 rows)")
plot(qsin_net[1], showgamma=true);
R"mtext"("Half data network (8 rows)")
```
