Cross-Project Online Just-In-Time Software Defect Prediction

The repository contains:

  • Java implementation of online cross-project approaches proposed in "An Investigation of Cross-Project Learning in Online Just-In-Time Software Defect Prediction" (ICSE'20) and "Cross-Project Online Just-In-Time Software Defect Prediction" (TSE'22).
  • Open-source datasets used for the experiments and hyper-parameter tuning.

Abstract

Cross-Project (CP) Just-In-Time Software Defect Prediction (JIT-SDP) makes use of CP data to overcome the lack of data necessary to train well-performing JIT-SDP classifiers at the beginning of software projects. However, such approaches have never been investigated in realistic online learning scenarios, where Within-Project (WP) software changes naturally arrive over time and can be used to automatically update the classifiers. We provide the first investigation of when and to what extent CP data are useful for JIT-SDP in such realistic scenarios. For that, we propose three different online CP JIT-SDP approaches that can be updated with incoming CP and WP training examples over time. We also collect data on 9 proprietary software projects and use 10 open source software projects to analyse these approaches. We find that training classifiers with incoming CP+WP data can lead to absolute improvements in G-mean of up to 53.89% and up to 35.02% at the initial stage of the projects compared to classifiers using WP-only and CP-only data, respectively. Using CP+WP data was also shown to be beneficial after a large number of WP data were received. Using CP data to supplement WP data helped the classifiers to reduce or prevent large drops in predictive performance that may occur over time, leading to absolute G-mean improvements of up to 37.35% and 48.16% compared to WP-only and CP-only data during such periods, respectively. During periods of stable predictive performance, absolute improvements were up to 29.03% and up to 41.25% compared to WP-only and CP-only classifiers, respectively. Our results highlight the importance of using both CP and WP data together in realistic online JIT-SDP scenarios.

Authors of the paper:

  • Sadia Tabassum (sxt901 at student dot bham dot ac dot uk)
  • Leandro Minku (L dot L dot Minku at bham dot ac dot uk)
  • Danyi Feng (danyi at ouchteam dot com)

Author of the Online CPJITSDP code:

  • Sadia Tabassum

Environment details:

  • MOA 2018.6.0
  • JDK and JRE 1.8

To run experiments for Online CPJITSDP

  • Go to the directory src/cpjitsdpexperiment
  • There are four experiment files: ExpAIO, ExpFilter, ExpOPAIO and ExpOPFilter, corresponding to the online CPJITSDP approaches AIO, Filter, OPAIO and OPFilter, respectively.
  • Run the appropriate experiment file (e.g., cpjitsdpexperiment.ExpAIO.java); a sketch of a compile-and-run command is shown below.
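
For instance, assuming the MOA 2018.6.0 jar is on the classpath (the jar file name and classpath separator below are illustrative, for a Unix-like system), the AIO experiment could be compiled and launched from the src directory along the following lines:

    javac -cp moa-2018.6.0.jar cpjitsdpexperiment/ExpAIO.java
    java -cp .:moa-2018.6.0.jar cpjitsdpexperiment.ExpAIO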

Example command (can be found in the experiment files):

    CpjitsdpAIO -l (spdisc.meta.WFL_OO_ORB_Oza -i 15 -s "+ens+" -t "+theta+" -w "+waitingTime+" -p "+paramsORB+")  -s  (ArffFileStream -f (/"+datasetsArray[dsIdx]+") -c 15) -e (FadingFactorEachClassPerformanceEvaluator -a 0.99) -f 1 -d results/results.csv"

  • CpjitsdpAIO: the online CPJITSDP approach to run.
  • -i 15: the position of the unix timestamp of the commit in the ARFF file.
  • -s: the ensemble size.
  • -t: the fading factor used for computing the class sizes.
  • -w: the waiting time (in days) for assuming the commit label is available.
  • -p: the parameters for ORB.
  • Values for -s, -t, -w and -p can be passed as arguments (the variables ens, theta, waitingTime and paramsORB above).
  • Default values are -s 20, -t 0.99, -w 90 and -p 100;0.4;10;12;1.5;3.
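
With the default values substituted for the variables above, the command expands to something like the following (the dataset path is illustrative):

    CpjitsdpAIO -l (spdisc.meta.WFL_OO_ORB_Oza -i 15 -s 20 -t 0.99 -w 90 -p 100;0.4;10;12;1.5;3) -s (ArffFileStream -f (/path/to/tomcat.arff) -c 15) -e (FadingFactorEachClassPerformanceEvaluator -a 0.99) -f 1 -d results/results.csv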

MOA parameters:

  • -l: the machine learning algorithm to be used.
  • -s (ArffFileStream -f -c ): the dataset stream, where -f is the path to the dataset in ARFF format and -c indicates the index of the class label in the dataset file.
  • -e (FadingFactorEachClassPerformanceEvaluator -a ): the performance evaluator to be used, where -a indicates the fading factor to be adopted.
  • -d: the path to the output file where the results of the experiments will be saved.

Datasets

Datasets used in the experiments are in ARFF format. The ARFF file must contain a header with the following attributes, in exactly this order.

    @attribute fix {False,True}
    @attribute ns numeric
    @attribute nd numeric
    @attribute nf numeric
    @attribute entrophy numeric
    @attribute la numeric
    @attribute ld numeric
    @attribute lt numeric
    @attribute ndev numeric
    @attribute age numeric
    @attribute nuc numeric
    @attribute exp numeric
    @attribute rexp numeric
    @attribute sexp numeric
    @attribute contains_bug {False,True}
    @attribute author_date_unix_timestamp numeric
    @attribute project_no numeric
    @attribute commit_type numeric
    @data
    
  • Attributes[1-14]: Software change metrics.
  • Attribute[15]: True label of the commit (whether the commit is really defect-inducing or clean).
  • Attribute[16]: Timestamp when the commit was submitted to the repository.
  • Attribute[17]: Index number associated with a project in datasetsArray. This index identifies a given project. Note that the index of the target project must be passed as argument dsIdx in the command line of the algorithm. For example, if our target project is Tomcat, then dsIdx should be 0. If the target project was JGroups, dsIdx should be 1. datasetsArray contains the names of the datasets and needs to be defined in the experiment file (e.g., ExpAIO.java). The following datasetsArray is used in this paper:
    
      String[] datasetsArray = {"tomcat", "JGroups", "spring-integration",
                                "camel", "brackets", "nova", "fabric8",
                                "neutron", "npm", "BroadleafCommerce"};
    
  • Attribute[18]: commit_type is a number assigned based on the following data processing scenario:

For each commit x:
	If x is clean:
		Add an instance with: 
			Software change metrics=Attributes[1-14], contains_bug=False, timestamp=[author_date_unix_timestamp], 
			project_no=relevant project index, commit_type=0
			(The online CPJITSDP approach will use this instance as follows:
			If x is from target project:
				Test x as clean at timestamp=[author_date_unix_timestamp]
			For both target and cross-projects, train x as clean at timestamp=[author_date_unix_timestamp]+[W days (converted into unix_timestamp)])
	If x is buggy:
		If days_to_first_fix > W:
			Add an instance (which will be used for training) with:
					Software change metrics=Attributes[1-14], contains_bug=True, 
					timestamp=[author_date_unix_timestamp]+[days_to_first_fix (converted into unix_timestamp)], 
					project_no=relevant project index, commit_type=3
					
			If x is from target project:	
				Add an instance (which will be used for training) with:
					Software change metrics=Attributes[1-14], contains_bug=False, 
					timestamp=[author_date_unix_timestamp]+[W days (converted into unix_timestamp)], 
					project_no=relevant project index, commit_type=0
				Add an instance (which will be used for testing) with:
					Software change metrics=Attributes[1-14], contains_bug=True, timestamp=[author_date_unix_timestamp], 
					project_no=relevant project index, commit_type=1
							
			If x is not from target project:
				Add an instance (which will be used for training) with:
					Software change metrics=Attributes[1-14], contains_bug=False, 
					timestamp=[author_date_unix_timestamp]+[W days (converted into unix_timestamp)], 
					project_no=relevant project index, commit_type=4
						
		If days_to_first_fix <= W:
			Add an instance (which will be used for training) with:
					Software change metrics=Attributes[1-14], contains_bug=True, 
					timestamp=[author_date_unix_timestamp]+[days_to_first_fix (converted into unix_timestamp)], 
					project_no=relevant project index, commit_type=3
			If x is from target project:
				Add an instance (which will be used for testing) with:
					Software change metrics=Attributes[1-14], contains_bug=True, timestamp=[author_date_unix_timestamp], 
					project_no=relevant project index, commit_type=2


After this processing, the data needs to be sorted in ascending order of the timestamp to maintain the chronology.
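
As an illustration, the following is a minimal Java sketch of the processing scenario above, including the final chronological sort. The Commit and Instance helper types are hypothetical; this is not the repository's actual preprocessing code:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Hypothetical raw commit record, before processing.
    class Commit {
        double[] metrics;              // software change metrics (Attributes[1-14])
        boolean buggy;                 // whether the commit is defect-inducing
        long authorDateUnixTimestamp;  // Attribute[16], in seconds
        int projectNo;                 // Attribute[17]
        long daysToFirstFix;           // only meaningful when buggy
    }

    // Hypothetical processed instance, matching the ARFF attributes above.
    class Instance {
        double[] metrics;
        boolean containsBug;
        long timestamp;
        int projectNo;
        int commitType;

        Instance(double[] metrics, boolean containsBug, long timestamp,
                 int projectNo, int commitType) {
            this.metrics = metrics;
            this.containsBug = containsBug;
            this.timestamp = timestamp;
            this.projectNo = projectNo;
            this.commitType = commitType;
        }
    }

    public class PreprocessSketch {
        static final long DAY = 24L * 60 * 60; // seconds per day

        // w is the waiting time in days (the -w parameter).
        static List<Instance> process(List<Commit> commits, int targetProjectNo, long w) {
            List<Instance> out = new ArrayList<Instance>();
            for (Commit x : commits) {
                long ts = x.authorDateUnixTimestamp;
                if (!x.buggy) {
                    // Clean commit: one instance, tested (if from the target
                    // project) at ts and trained on at ts + W days.
                    out.add(new Instance(x.metrics, false, ts, x.projectNo, 0));
                } else {
                    // Buggy commit: defect-inducing training example, available
                    // only once the first fix is known.
                    out.add(new Instance(x.metrics, true,
                            ts + x.daysToFirstFix * DAY, x.projectNo, 3));
                    if (x.daysToFirstFix > w) {
                        if (x.projectNo == targetProjectNo) {
                            // Label not yet known after waiting time W:
                            // temporarily trained on as clean.
                            out.add(new Instance(x.metrics, false, ts + w * DAY, x.projectNo, 0));
                            out.add(new Instance(x.metrics, true, ts, x.projectNo, 1)); // testing
                        } else {
                            out.add(new Instance(x.metrics, false, ts + w * DAY, x.projectNo, 4));
                        }
                    } else if (x.projectNo == targetProjectNo) {
                        out.add(new Instance(x.metrics, true, ts, x.projectNo, 2)); // testing
                    }
                }
            }
            // Sort chronologically so the resulting stream respects arrival order.
            out.sort(Comparator.comparingLong((Instance i) -> i.timestamp));
            return out;
        }
    }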

Note: MOA is provided within this repo under the GPL 3 license. Online CPJITSDP makes use of the open-source code for ORB available at http://doi.org/10.5281/zenodo.2555695.
