Wednesday 10 July 2024

How to Read in 1 or All Excel Sheets into a Pandas DF Using Xlwings?

 

                                                    Photo by Jeff Sheldon on Unsplash

One of the advantages of using Xlwings to read Excel data into a pandas DataFrame is that numeric columns that should be stored as strings (i.e. with leading zeros retained) come through as object Dtype. pandas.read_excel(), by contrast, will convert such columns to int64 Dtype (i.e. the leading zeros are removed).
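To see the difference, here is a minimal sketch (assuming a hypothetical file codes.xlsx whose first column holds codes such as 00123 stored as text in Excel):

import pandas as pd
import xlwings as xw

# pandas.read_excel() infers numeric dtypes, so a code like '00123' can come out as 123 (int64)
df_pd = pd.read_excel('codes.xlsx')

# xlwings hands back the cell values as Python objects, so text cells stay strings (object dtype)
app = xw.App(visible=False)
book = app.books.open('codes.xlsx')
data = book.sheets[0].used_range.value
df_xw = pd.DataFrame(data[1:], columns=data[0])
book.close()
app.quit()

print(df_pd.dtypes)  # int64: leading zeros lost
print(df_xw.dtypes)  # object: leading zeros retained

(With pandas alone, passing dtype=str to pandas.read_excel() is another way to keep such columns as strings.)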

I have created a function that can read a single sheet, or all sheets, of an Excel file into a pandas df.

 



import pandas as pd
import os
import xlwings as xw

 

# =============================================================================
# INSTRUCTIONS FOR read_pw_excel

# The function can also read in CSV files. But as the underlying software is Excel,
# if the CSV file exceeds the maximum number of rows allowed in Excel, only the rows
# up to the limit will be read in.

# 1) Leave sheet blank to read in all sheets in the Excel file.
# The output will be a vertically stacked df of all the sheets. Otherwise, specify the exact sheet to read in.
# If reading in all sheets:
#    The code will read in the 1st row of the 1st sheet as the header.
#    Whether or not subsequent sheets have a header row (i.e. all rows are data),
#    the code will use the 1st row of the 1st sheet as the header.
# The code will drop columns with None as the header (including an index column without a header).
# The code will convert all headers to upper case.

# 2) Leave excel_range blank to read in the used range of the sheet(s). Otherwise, specify the exact range, e.g. 'A:B'.
# 3) Leave pw blank if the Excel file is not password protected.
# =============================================================================

 

def read_pw_excel(link, sheet='', excel_range='', pw=''):

    if not os.path.exists(link):
        print("link does not exist. Please try again.")
        raise Exception("LinkNotFoundError")

    df_merged = pd.DataFrame()
    col_names_first = ''
    col_names_next = ''
    lt_type = ''
    index_col = ''
    app = xw.App()
    app.display_alerts = False

    if pw == '':
        filebook = app.books.open(link, update_links=False)
    else:
        filebook = app.books.open(link, update_links=False, password=pw)

    sht_ct = filebook.sheets.count

    for ct in range(0, sht_ct):

        if sheet == '':
            if excel_range == '':
                range_to_use = filebook.sheets[ct].used_range
            else:
                range_to_use = excel_range

            # Note this will be a list, e.g. [['Name', 'Age'], ['Tom', 4], ['Harry', 3]].
            # If there is only 1 column or 1 row, it will be ['Name', 'Tom', 'Harry'] or ['Tom', 'Tan', 6]
            data = filebook.sheets[ct].range(range_to_use).value

        else:
            if excel_range == '':
                range_to_use = filebook.sheets[sheet].used_range
            else:
                range_to_use = excel_range

            data = filebook.sheets[sheet].range(range_to_use).value

        if ct == 0:

            # If there is only 1 column, we convert the header into a list: [['Name'], 'Tom', 'Harry'].
            # Otherwise there is an issue later when converting to upper case with
            # [x.upper() if x is not None else x for x in data[0]]: each letter of 'Name' would be a separate element
            if not isinstance(data[0], list):
                data[0] = [data[0]]
                lt_type = 'single_col'

            # data[0] is the header row. When using used_range, Python may read in some None columns;
            # we find the positions of these None columns
            pos_none = [pos for pos, x in enumerate(data[0]) if x is None]
            if 0 in pos_none:  # if the 1st column has a None header, it is probably an index column; flag it so it can be removed later
                index_col = 'Y'

            col_names_first_ls = data[0]
            col_names_first_ls = [x for x in col_names_first_ls if x is not None]  # drop the None columns read in by used_range
            col_names_first = ''.join(str(x).replace(' ', '').upper() for x in col_names_first_ls)

        else:

            # again, if there is only 1 column, we convert the header into a list (same reason as above)
            if not isinstance(data[0], list) and lt_type == 'single_col':
                pos_none = []

                try:
                    data[0] = [data[0]]
                except TypeError:  # 'str' object does not support item assignment: occurs when data is a single str/number (i.e. only 1 column with 1 row)
                    data = [[data]]

                col_names_next_ls = data[0]

            elif not isinstance(data[0], list) and lt_type != 'single_col':  # implies there is only 1 data row without a header
                col_names_next_ls = data
                col_names_next_ls = [x for x in col_names_next_ls if x is not None]

            else:  # implies there is >1 row and >1 column
                pos_none = [pos for pos, x in enumerate(data[0]) if x is None]

                col_names_next_ls = data[0]
                col_names_next_ls = [x for x in col_names_next_ls if x is not None]

            col_names_next = ''.join(str(x).replace(' ', '').upper() for x in col_names_next_ls)

        # for the 1st sheet, and for subsequent sheets whose header matches the 1st sheet's
        if (ct == 0) or ((ct != 0) & (col_names_first == col_names_next)):

            data[0] = [x for x in data[0] if x is not None]  # remove any None header

            # if a None header column was read in, remove all the data (should all be None) for that column
            if pos_none:
                for pos, x in enumerate(data[1:]):
                    i = 0
                    for n in pos_none:
                        n = n + i
                        x.pop(n)
                        i -= 1

            cols = [str(x).upper() if x is not None else x for x in data[0]]

            df = pd.DataFrame(data[1:], columns=cols)
            df = df[df.columns.dropna()]  # drop columns with None as the header. Some Excel files have >1 None columns, and duplicate columns cause issues with concat
            df_merged = pd.concat([df_merged, df], axis=0, ignore_index=True)

        # in case subsequent sheets don't have a header (i.e. the 1st row is data), add the header of the 1st sheet in order to concat
        elif (ct != 0) & (col_names_first != col_names_next):
            if lt_type == 'single_col':
                data[0] = data[0][0]  # if only 1 column, the 1st data row shouldn't be a list: [['Tom'], 'Dick']

            if len(col_names_next_ls) > len(col_names_first_ls):  # possibly a None header column was read in; it could be an index column
                # if an index column was read in, remove all its data and don't include it
                if index_col == 'Y':
                    for x in data[0:]:
                        try:
                            x.pop(0)
                        except AttributeError:  # 'float'/'str' object has no attribute 'pop': occurs for a single row, where data is a flat list, e.g. [10.0, 'tom_left', 20.0, 200.0, 31.0]
                            data.pop(0)
                            break

                new_col_ls = col_names_first_ls

            elif len(col_names_next_ls) < len(col_names_first_ls):
                print("Number of columns in Sheet " + str(ct) + " is fewer than in Sheet 0.")
                raise Exception("Number of Columns in Sheets Differ Error")

            elif len(col_names_next_ls) == len(col_names_first_ls):
                new_col_ls = col_names_first_ls

            cols = [str(x).upper() if x is not None else x for x in new_col_ls]

            try:
                df = pd.DataFrame(data[0:], columns=cols)
            except ValueError:  # when data[0:] is a single list with >1 col: Shape of passed values is (x, 1), indices imply (x, x). Nest it in a list
                df = pd.DataFrame([data[0:]], columns=cols)

            df = df[df.columns.dropna()]
            df_merged = pd.concat([df_merged, df], axis=0, ignore_index=True)

        if sheet != '':
            break

    filebook.close()
    app.quit()

    return df_merged
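A usage sketch (the file path, sheet name and password below are hypothetical):

# read all sheets of a password-protected workbook, vertically stacked
df_all = read_pw_excel('C:/Users/sample/data.xlsx', pw='my_password')

# read only the sheet named 'Q1', restricted to columns A to D
df_q1 = read_pw_excel('C:/Users/sample/data.xlsx', sheet='Q1', excel_range='A:D')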

Wednesday 3 July 2024

Python Equivalent of SAS Proc Transpose

 

                                                Photo by Microsoft Edge on Unsplash

As Python does not have a readily available function that is exactly identical to SAS PROC TRANSPOSE, this article provides a function, SAS_proc_transpose, that reproduces most of the behaviour of the SAS PROC TRANSPOSE procedure.
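For orientation, the simplest case (no BY and no ID variables) can be done with a bare pandas transpose; this sketch mirrors the corresponding branch of the function below:

import pandas as pd

df = pd.DataFrame({'age': [20, 21], 'ht': [180, 175]})

tranpos = df.T  # variables become rows
tranpos.columns = ['COL' + str(col + 1) for col in tranpos.columns]  # observations become COL1, COL2, ...
tranpos = tranpos.rename_axis('_NAME_').reset_index()
print(tranpos)
#   _NAME_  COL1  COL2
# 0    age    20    21
# 1     ht   180   175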

 

 



import pandas as pd
import numpy as np

# =============================================================================
# INSTRUCTIONS FOR SAS_proc_transpose

# The 1st arg is the df.
# The 2nd arg (by_1) is the BY variable(s). Enter them into the list. Leave the list empty if no BY variables are required.
# The 3rd arg (var) is the VAR variable(s). Enter them into the list. If the list is left empty, then, as in SAS, all the numeric variables will be used as the VAR variables.
# The 4th arg (id_1) is the ID variable(s). Enter them into the list. Leave the list empty if no ID variables are required.

## https://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000063668.htm
# SAS_proc_transpose works like SAS PROC TRANSPOSE with the LET option. In other words, when there are duplicate ID values, it transposes the observation that contains the LAST occurrence of the duplicate ID value.
# SAS_proc_transpose, like SAS PROC TRANSPOSE, does not transpose observations that have a missing value for one or more ID variables.
# =============================================================================

 

def SAS_proc_transpose(df_copy, by_1=[], var=[], id_1=[]):
    df = df_copy.copy()  # prevent the original df being modified in the function. Note that a df passed in would otherwise be modified even if not returned.
    by_1 = list(by_1)  # copy the list too, as it is appended to below; this stops the caller's list (or the mutable default) from being modified

    if not var:  # if var is empty, we use all the numeric columns as the VAR variables
        var_col = df.select_dtypes(include=['number']).columns.to_list()
    else:
        var_col = var

    if not id_1:
        if by_1:
            tranpos = pd.melt(df, id_vars=by_1, value_vars=var_col)

            by_1.append('variable')
            tranpos['g'] = tranpos.groupby(by_1, dropna=False).cumcount() + 1

            by_1.append('g')
            tranpos = tranpos.set_index(by_1)['value'].unstack().add_prefix('COL').reset_index()
            by_1.remove('g')

            tranpos.sort_values(by=by_1, na_position='first', ignore_index=True, kind='stable', inplace=True)
            tranpos.rename(columns={'variable': '_NAME_'}, inplace=True)

        elif not by_1:
            tranpos = df[var_col]
            tranpos = tranpos.T
            tranpos.columns = ['COL' + str(col + 1) for col in tranpos.columns]
            tranpos.index = tranpos.index.set_names('_NAME_')
            tranpos.reset_index(inplace=True)

    elif id_1:
        if by_1:
            new_col0 = by_1 + id_1
            new_col = []
            [new_col.append(x) for x in new_col0 if x not in new_col]  # remove duplicates, in case the same column is in both by_1 & id_1

            tranpos = pd.melt(df, id_vars=new_col, value_vars=var_col)
            for col in new_col:
                tranpos[col] = tranpos[col].astype(str)
                tranpos[col] = tranpos[col].replace(['None', 'nan'], 'nan_python')

            tranpos['key1'] = tranpos[new_col + ['variable']].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
            tranpos.drop_duplicates(subset=['key1'], keep='last', inplace=True)  # the LET behaviour: keep the last occurrence of a duplicate ID value

            tranpos['key2'] = tranpos[id_1].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
            new_col = by_1 + ['variable', 'key2']
            tranpos = tranpos.set_index(new_col)['value'].unstack().reset_index()

            tranpos.drop(tranpos.filter(regex='nan_python').columns, axis=1, inplace=True, errors='ignore')  # drop columns with nan_python in the header. These are id columns with nan cells that were transposed.
            tranpos.rename_axis(None, axis=1, inplace=True)
            tranpos.rename(columns={"variable": "_NAME_"}, inplace=True)
            tranpos = tranpos.replace(['nan_python'], np.nan)

            tranpos.sort_values(by=by_1 + ['_NAME_'], axis=0, ascending=True, inplace=True, kind='stable', na_position='first', ignore_index=True)

        elif not by_1:
            tranpos = df.copy()
            for col in id_1:
                tranpos.dropna(subset=[col], inplace=True)  # do not transpose observations with a missing ID value
                tranpos[col] = tranpos[col].astype(str)

            tranpos['key'] = tranpos[id_1].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
            tranpos.drop_duplicates(subset=['key'], keep='last', inplace=True)

            new_col = var_col + ['key']
            tranpos = tranpos[new_col]

            tranpos.set_index('key', inplace=True)
            tranpos = tranpos.T

            tranpos.reset_index(inplace=True)
            tranpos.rename_axis(None, axis=1, inplace=True)
            tranpos.rename(columns={'index': '_NAME_'}, inplace=True)

    return tranpos
    

# Toy dataset to test out SAS_proc_transpose
data = {
    "name": ['dick_both', 'sam_both', 'sam_both', 'sam_both', 'cat_both', np.nan, 'susan_both', 'john_both', 'john_both', 'tom_left', 'tom_left', 'harry', 'test'],
    "cat": ['A', 'B', 'B', 'C', 'C', 'D', 'E', 'H', np.nan, 'H', 'I', np.nan, 'E'],
    "loc": ['E', 'W', 'W', 'S', 'S', 'E', 'W', 'N', 'N', 'S', 'E', 'E', 'W'],
    "age": [20, 20, 19, 22, 20, 20, 20, 99, 88, 21, 22, 20, np.nan],
    "ht": [np.nan, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, np.nan],
    "sales_l": [21, 22, 23, 24, 25, 26, 27, 28, 28, 30, 31, 32, np.nan]
}

left = pd.DataFrame(data)

df = SAS_proc_transpose(left, by_1=['name'], var=['ht', 'age'], id_1=['cat'])
print('PROC TRANSPOSE')
print(df)



Sunday 23 June 2024

Decoding SAS: A Guide to Python Equivalents

 

                                                Photo by Surface on Unsplash

 

As more people migrate from SAS to free open-source languages like Python, there may be a need to re-code SAS programs in the open-source language. This article provides a cheat sheet of SAS-to-Python equivalents for some of the more complex procedures, like PROC SQL and PROC TRANSPOSE. Refer to this PDF file for the cheat sheet:

 Python-to-SAS Equivalents

Python Equivalent of SAS Merge

 

                                                    Photo by Sora Sagano on Unsplash

While Python offers several merge options (pandas.merge, pandas.concat, and join), none of them replicates the exact output of SAS Merge.

Some differences between SAS and Python Merge…

  1. When merging datasets with overlapping columns, SAS takes the column from the right-most dataset, while in Python the overlapping columns are suffixed to distinguish them.
  2. If df_left has more rows than df_right for a merge key, SAS fills the non-overlapping columns from df_right for the extra rows with the data of the last row in df_right. Python, however, does a Cartesian join (cross join).

The diagrams below illustrate the differences.

For SAS merge, the overlapping columns age and ht take the values from df_right for Obs 1–2. For the non-overlapping column from df_right (sales_right), Obs 3 takes the same value as Obs 2.


For Python merge, the overlapping columns age and ht from df_left are suffixed with “_left”. Also, a Cartesian join is performed, resulting in 3 x 2 = 6 observations.
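A minimal sketch of the Python behaviour just described, with three left rows and two right rows sharing one key:

import pandas as pd

df_left = pd.DataFrame({'name': ['sam'] * 3, 'age': [20, 21, 22]})
df_right = pd.DataFrame({'name': ['sam'] * 2, 'age': [50, 51]})

# the overlapping column 'age' is suffixed, and the Cartesian join yields 3 x 2 = 6 rows
print(pd.merge(df_left, df_right, on='name', suffixes=('_left', '')))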

 


Thus I have come up with a function in Python to replicate SAS Merge.



import pandas as pd
import numpy as np

# =============================================================================
# SOME DIFFERENCES BETWEEN SAS & PYTHON MERGE

# For SAS, when merging datasets with overlapping columns, it will take the column in the right-most dataset.
# For Python, overlapping columns will be suffixed to distinguish them.
#
# For SAS, if df_left has more rows than df_right, the non-overlapping columns from df_right for the extra rows
# will be filled with the data of the last row in df_right.
# But Python will do a Cartesian Join (Cross Join).
# =============================================================================

# =============================================================================
# INSTRUCTIONS FOR SAS_merge

# The 1st 2 args are the left and right dfs.
# The 3rd arg is the HOW of the merge. As in SAS merge, the options are:
#     'if df_left'
#     'if df_right'
#     'if df_left and df_right'
#     ''  # this is the SAS merge without the IF statement
#     'if df_left NOT =df_right'
#     'if df_right NOT =df_left'
# The 4th arg is the ind. Put 'Y' if you want the variable '_merge' (this tells you which df each record is from), else put 'N'.
# The 5th arg onwards is the merge key. It is of variable length; if merging on >1 variable, separate the args with commas.
# =============================================================================
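For example, using the toy datasets defined at the end of this post, a SAS data step maps onto the arguments as in the sketch below (the SAS snippet in the comment is illustrative):

# SAS equivalent:  data out; merge left(in=a) right(in=b); by name; if a; run;
out = SAS_merge(left, right, 'if df_left', 'N', 'name')

# merge on two keys and keep the _merge indicator
out2 = SAS_merge(left, right, '', 'Y', 'name', 'age')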

 

def SAS_merge(df_left_copy, df_right_copy, how, ind, *merge_key):
    df_left = df_left_copy.copy()  # prevent the original dfs being modified in the function. Note that a df passed in would otherwise be modified even if not returned.
    df_right = df_right_copy.copy()

    df_left = df_left[df_left.columns.dropna()]  # drop columns with NaN headers, or else there will be errors
    df_right = df_right[df_right.columns.dropna()]

    df_left_subset_merge = pd.DataFrame()
    df_right_subset_merge = pd.DataFrame()

    df_left['KEY'] = ''
    df_right['KEY'] = ''

    for col in merge_key:
        df_left[col] = df_left[col].fillna('')
        df_right[col] = df_right[col].fillna('')

    for col in merge_key:
        df_left['KEY'] = df_left[['KEY', col]].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
        df_right['KEY'] = df_right[['KEY', col]].apply(lambda row: ''.join(row.values.astype(str)), axis=1)

    if how == 'if df_left' or how == 'if df_right' or how == 'if df_left and df_right' or how == '':
        df_left.sort_values(by=list(merge_key), kind='stable', ignore_index=True, inplace=True)
        df_right.sort_values(by=list(merge_key), kind='stable', ignore_index=True, inplace=True)

        col_common = list(np.intersect1d(df_left.columns, df_right.columns))
        col_common.remove('KEY')
        for col in merge_key:
            col_common.remove(col)

        col_missing_right = df_right.columns.difference(df_left.columns).tolist()  # columns in df_right that are missing from df_left
        col_missing_left = df_left.columns.difference(df_right.columns).tolist()  # columns in df_left that are missing from df_right

        # find the keys common to both dfs and their number of occurrences via an inner merge
        check_df_left = pd.DataFrame(df_left.value_counts(subset=['KEY'], dropna=True))  # KEY becomes the index in the df
        check_df_left.columns = ['0_x']  # name the count column explicitly, as its default name differs across pandas versions
        check_df_right = pd.DataFrame(df_right.value_counts(subset=['KEY'], dropna=True))
        check_df_right.columns = ['0_y']

        merge_check = pd.merge(check_df_left, check_df_right, how='inner', left_index=True, right_index=True).reset_index(drop=False)
        merge_left = merge_check[(merge_check['0_x'] >= merge_check['0_y']) & (merge_check['0_y'] > 1)]  # '0_x' is the number of occurrences in the left df. Issues occur if '0_y' > 1, as Python does a Cartesian merge
        merge_right = merge_check[(merge_check['0_x'] <= merge_check['0_y']) & (merge_check['0_x'] > 1)]

        for c in merge_left['KEY'].to_list():
            df_left_key = df_left[df_left['KEY'] == c].shape[0]
            df_right_key = df_right[df_right['KEY'] == c].shape[0]

            df_left_subset = df_left[df_left['KEY'] == c].reset_index(drop=True)
            df_left = df_left[df_left['KEY'] != c]

            df_right_subset = df_right[df_right['KEY'] == c].reset_index(drop=True)
            df_right = df_right[df_right['KEY'] != c]

            # for overlapping columns, fill df_left with the values from df_right
            for col in col_common:
                for i in range(0, df_right_key):
                    df_left_subset.iloc[i, df_left_subset.columns.get_loc(col)] = df_right_subset.iloc[i, df_right_subset.columns.get_loc(col)]

            for col in col_missing_right:
                df_left_subset[col] = df_right_subset[col]  # for this to work, both dfs must have the same index
                # as in SAS: if df_left has more rows than df_right, the columns from df_right for the extra rows are filled with the data of the last row in df_right
                for i in range(df_right_key, df_left_key):
                    df_left_subset.iloc[i, df_left_subset.columns.get_loc(col)] = df_left_subset.iloc[i - 1, df_left_subset.columns.get_loc(col)]

            df_left_subset_merge = pd.concat([df_left_subset_merge, df_left_subset], axis=0, ignore_index=True)

        for c in merge_right['KEY'].to_list():
            df_left_key = df_left[df_left['KEY'] == c].shape[0]
            df_right_key = df_right[df_right['KEY'] == c].shape[0]

            df_left_subset = df_left[df_left['KEY'] == c].reset_index(drop=True)
            df_left = df_left[df_left['KEY'] != c]

            df_right_subset = df_right[df_right['KEY'] == c].reset_index(drop=True)
            df_right = df_right[df_right['KEY'] != c]

            for col in col_missing_left:
                df_right_subset[col] = df_left_subset[col]  # for this to work, both dfs must have the same index
                # as in SAS: if df_right has more rows than df_left, the columns from df_left for the extra rows are filled with the data of the last row in df_left
                for i in range(df_left_key, df_right_key):
                    df_right_subset.iloc[i, df_right_subset.columns.get_loc(col)] = df_right_subset.iloc[i - 1, df_right_subset.columns.get_loc(col)]

            df_right_subset_merge = pd.concat([df_right_subset_merge, df_right_subset], axis=0, ignore_index=True)

        # on= expects a label or list, so convert the tuple of merge keys to a list
        if how == 'if df_left':
            merge = pd.merge(df_left, df_right, how='left', on=list(merge_key), suffixes=('_left', ''), sort=True, indicator=True)
            col_ls = [col[:-5] for col in merge.columns if col.endswith('_left')]
            for col in col_ls:
                merge.loc[merge['_merge'] == 'left_only', col] = merge.loc[merge['_merge'] == 'left_only', col + '_left']

        elif how == 'if df_right':
            merge = pd.merge(df_left, df_right, how='right', on=list(merge_key), suffixes=('_left', ''), sort=True, indicator=True)

        elif how == 'if df_left and df_right':
            merge = pd.merge(df_left, df_right, how='inner', on=list(merge_key), suffixes=('_left', ''), sort=True, indicator=True)

        elif how == '':
            merge = pd.merge(df_left, df_right, how='outer', on=list(merge_key), suffixes=('_left', ''), sort=True, indicator=True)
            col_ls = [col[:-5] for col in merge.columns if col.endswith('_left')]
            for col in col_ls:
                merge.loc[merge['_merge'] == 'left_only', col] = merge.loc[merge['_merge'] == 'left_only', col + '_left']

        merge = pd.concat([merge, df_left_subset_merge, df_right_subset_merge], axis=0)
        merge = merge.loc[:, ~merge.columns.str.startswith('KEY')]
        merge['_merge'] = merge['_merge'].fillna('both')  # these are the rows from df_left_subset_merge and df_right_subset_merge
        merge.drop(merge.filter(regex='_left$').columns, axis=1, inplace=True, errors='ignore')
        if ind == 'N':
            merge.drop(['_merge'], axis=1, inplace=True, errors='ignore')
        merge.sort_values(by=list(merge_key), kind='stable', ignore_index=True, inplace=True)

    elif how == 'if df_left NOT =df_right' or how == 'if df_right NOT =df_left':
        df_left_subset_merge = df_left[~df_left['KEY'].isin(df_right['KEY'])]
        df_right_subset_merge = df_right[~df_right['KEY'].isin(df_left['KEY'])]
        merge = pd.concat([df_left_subset_merge, df_right_subset_merge], axis=0)
        merge = merge.loc[:, ~merge.columns.str.startswith('KEY')]
        if ind == 'N':
            merge.drop(['_merge'], axis=1, inplace=True, errors='ignore')
        merge.sort_values(by=list(merge_key), kind='stable', ignore_index=True, inplace=True)

    return merge

 
# Toy dataset to test Python merge, join, concat vs SAS merge
data = {
    "name": ['dick_both', 'sam_both', 'sam_both', 'sam_both', 'cat_both', 'susan_both', 'susan_both', 'john_both', 'john_both', 'tom_left', 'tom_left'],
    "age": [20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20],
    "ht": [200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200],
    "sales_l": [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
}

left = pd.DataFrame(data)

data = {
    "name": ['dick_both', 'sam_both', 'sam_both', 'cat_both', 'cat_both', 'susan_both', 'john_both', 'john_both', 'john_both', 'harry_right', 'harry_right'],
    "age": [50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50],
    "ht": [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    "sales_r": [51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]
}

right = pd.DataFrame(data)

merge = pd.merge(left, right, how='right', on=['name'], suffixes=('_left', ''), sort=True)
print('PYTHON MERGE')
print(merge)

concat = pd.concat([left, right], axis=1, join='inner')
print('PYTHON CONCAT')
print(concat)

join = left.join(right, how='left', lsuffix='_left', rsuffix='')
print('PYTHON JOIN')
print(join)

merge = SAS_merge(left, right, 'if df_left', 'N', 'name')
print('SAS MERGE')
print(merge)

Thursday 20 June 2024

Python Equivalent of SAS Proc Summary

                                              Photo by Clément Hélardot on Unsplash

While Python is a powerful language, it may not have ready-to-use functions that replicate SAS procedures exactly. This article provides a function, SAS_pro_summary, that replicates most of the SAS PROC SUMMARY procedure.

import pandas as pd
import numpy as np
import math
from itertools import combinations
from statsmodels.stats.weightstats import DescrStatsW



# =============================================================================
# INSTRUCTIONS

# For stat, the options are all or any combination of the below:
# 'sum', 'sum_wt', 'count', 'min', 'max', 'mean', 'mean_wt', 'var', 'var_wt', np.std, 'std_wt'

# class_1 are the CLASS variables in SAS Proc Summary.
# The variables in class_1 should be in the same order as in the SAS Proc Summary code. Otherwise, the '_TYPE_' will differ.

# var are the VAR variables in SAS Proc Summary.

# id_1 are the ID variables in SAS Proc Summary. Leave the [] empty if no ID variable is needed.

# weight is the WEIGHT variable in SAS Proc Summary. Leave the [] empty if no WEIGHT variable is needed.

# nway='Y' means the same as including NWAY in SAS Proc Summary. nway='N' means NOT to include it.

# missing='Y' means the same as including MISSING in SAS Proc Summary. missing='N' means NOT to include it.

# If you only want the FREQ of the class variables, leave var empty and put 'count' in stat.
# =============================================================================
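As an illustration of the argument mapping (the SAS line in the comment is an approximation):

# Roughly: proc summary data=df missing; class name; var age; id sex; weight wt; output out=out sum=; run;
out = SAS_pro_summary(df, stat=['sum'], class_1=['name'], var=['age'], id_1=['sex'], weight=['wt'], nway='N', missing='Y')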


def SAS_pro_summary (df_copy, stat=[], class_1=[], var=[], id_1=[], weight=[], nway='N', missing='Y'):
    df=df_copy.copy() # this is to prevent the original df being modified in the function. Note that all df will be modified in function even if not returned.
    
    df_final_gpby=pd.DataFrame()
    class_combinations = []
    code_list = []
    
    df['_FREQ_']=''
    df['_TYPE_']=''
    df['KEY']=''
    
    # This is a check in case users include the same variables in both class_1 & id_1. It isn't right to append the common class variables with the id
    # Find the common items
    common_items = list(set(id_1) & set(class_1))

    # Remove the common items from id_1
    id_1 = [item for item in id_1 if item not in common_items]
    
    if missing=='Y':
        pass
    elif missing=='N':
        for col in class_1:
            df[col].replace('', np.nan, inplace=True)
        df = df.dropna(subset=class_1, how='any', axis=0) # SAS will drop the whole row as long as any class variables has nan
   
    #for weighted sum
    if weight and 'sum_wt' in stat:
        df_weighted=df.copy()
        for n in var:
        #   df_weighted.dropna(subset=[n, weight[0]], axis=0, inplace=True)
            df_weighted[n]=df_weighted[n]*df_weighted[weight[0]]
          
             
    # this is for '_TYPE_'
    for n in range (1, len(class_1)+1):
        code_list.append('CODE_'+ str(n))
        
        
    for n in range(len(class_1) + 1):
      class_combinations += list(combinations(class_1, n)) # Create all the different permutation of the class variables
    class_combinations.pop(0) # remove  the empty permutation
   
    id_1.append('KEY') 
    for s in stat:
        df_TOTAL=pd.DataFrame()
        df_total_gpby=pd.DataFrame()
        df_unique=pd.DataFrame()
        var_dict={'_FREQ_': 'count'}
        
        df_gpby_variance_merged=pd.DataFrame()
        df_gpby_variance_merged['KEY']=''
        
        
        for n in var:
            if s=='sum_wt' or s=='sum':
                var_dict[n]=lambda x: np.nan if x.isna().all() else x.sum() # sum will return 0 even if all summed items are np.nan. So we use this function to return nan
                #var_dict[n]='sum'
            elif s=='mean_wt':
                #mean_wt_fn = lambda x: np.ma.average(x.dropna(), weights=df.loc[x.dropna().index, weight[0]]) # function to compute weighted mean and dropna
                mean_wt_fn = lambda x: np.ma.average(x, weights=df.loc[x.index, weight[0]]) 
                var_dict[n]=mean_wt_fn
            elif s=='var_wt' or s=='std_wt':
               pass
            else:
                var_dict[n]=s
                   
        # this is to create the stat based on the entire df
        if s=='sum_wt':
            df_TOTAL=pd.DataFrame(df_weighted.agg(var_dict))
            

        elif s=='var_wt' or s=='std_wt' or s=='mean_wt':   
            df_TOTAL=pd.DataFrame(df.agg(var_dict))
            for n in var:
                df_copy=df.copy()
                df_copy.dropna(subset=[n, weight[0]], how='any', axis=0, inplace=True) # Need to drop rows with nan as the function will return nan even if there is some valid values
    
                if s=='var_wt':
                    df_TOTAL.loc[n]=(DescrStatsW(df_copy[n], weights=df_copy[weight[0]], ddof=10).var)*10
                elif s=='std_wt':
                    df_TOTAL.loc[n]=math.sqrt((DescrStatsW(df_copy[n], weights=df_copy[weight[0]], ddof=10).var)*10)
                elif s=='mean_wt':
                    mean_wt_fn_subset = lambda x: np.ma.average(x, weights=df_copy.loc[x.index, weight[0]]) 
                    var_val=df_copy[n].agg({n: mean_wt_fn_subset })
                    df_TOTAL.loc[n]=var_val[0]
                             
        else:
            df_TOTAL=pd.DataFrame(df.agg(var_dict))
            
            
        df_TOTAL = df_TOTAL.T
        df_TOTAL['_TYPE_']=0
        if s!=np.std:
            df_TOTAL['_STAT_']=s
        elif s==np.std:
            df_TOTAL['_STAT_']='std'
        df.sort_values(by=id_1, kind='stable', ignore_index=True, inplace=True, na_position='first')
        for n in id_1:
            df_TOTAL[n]=df[n].iloc[-1]
      
        
        
        var_dict['KEY']='last' 
        for n in class_combinations:
            
            # this is for '_TYPE_'. For the class variables present in this iteration, assign 1 to the corresponding code variable
            code_dict = {}
            for i, k  in zip(class_1,  code_list):
                for o in n:
                    if o==i:
                        code_dict[k]='1'
                        break
                    elif  o!=i:
                        code_dict[k]='0'
                        
            code_full=''.join(code_dict.values())   # concat all the code variables into a single string
            
            
            sort_tup=n+tuple(id_1)
            df['KEY'] =''
            df.sort_values(by=list(sort_tup), kind='stable', ignore_index=True, inplace=True, na_position='first') # this is to ensure the class variables and id are sorted correctly, 
            
            if s=='sum_wt':
                df_weighted['KEY'] =''
                df_weighted.sort_values(by=list(sort_tup), kind='stable', ignore_index=True, inplace=True, na_position='first') 
                for o in n: # For the class variables present in this iteration, create a key
                    df_weighted[o] =  df_weighted[o].fillna('') # if na is not filled with blank, the key will be blank as long as any column has nan
                    df_weighted['KEY'] = df_weighted[['KEY', o]].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
                    df_weighted[o].replace(r'^\s*$', np.nan, regex=True, inplace=True) # replace the empty space (created in the 1st step) with nan again. 
                    
                    
            for o in n: # For the class variables present in this iteration, create a key
                df[o] = df[o].fillna('') # if na is not filled with blank, the key will be blank as long as any column has nan
                df['KEY'] = df[['KEY', o]].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
                df[o].replace(r'^\s*$', np.nan, regex=True, inplace=True) # replace the empty space (created in the 1st step) with nan again. 
            df_unique = df.drop_duplicates(subset=['KEY'], keep='last') # Extract only the last to merge in the id variables
            
            if missing=='Y':
                if s=='sum_wt':
                    df_gpby = df_weighted.groupby(list(n), dropna=False).agg(var_dict).reset_index()
                
                elif s=='mean_wt':
                    
                    # find groups that has >1 element
                    non_unique=df['KEY'].value_counts() 
                    non_unique=non_unique.where(non_unique>1).dropna()
                    non_unique_list=non_unique.index.tolist()
                
                    df_gpby=df.groupby(list(n), dropna=False).agg({'KEY':'last', '_FREQ_':'count'}).reset_index()
                   
                    for k in var:
                        # Need to drop those rows that belong to groups with > 1 element and has nan. Otherwise, entire group will have nan even if some valid values
                        #df_subset=pd.concat([df[(df['KEY'].isin(non_unique_list)) & (df[[weight[0],k]].notnull().all(axis=1))], df[~df['KEY'].isin(non_unique_list)]], axis=0) 
                        df_subset=df[~((df['KEY'].isin(non_unique_list)) & (df[[weight[0],k]].isnull().any(axis=1)))]
                        mean_wt_fn_subset = lambda x: np.ma.average(x, weights=df_subset.loc[x.index, weight[0]]) 
                        df_gpby_subset= df_subset.groupby(list(n), dropna=False).agg({k: mean_wt_fn_subset, 'KEY':'last'}).reset_index() # use the subset-aware weighted-mean lambda defined just above
                        #df_gpby=pd.concat([df_gpby, df_gpby_subset[df_gpby_subset.columns.difference(df_gpby.columns)]], axis=1)
                        df_gpby=pd.merge(df_gpby, df_gpby_subset[[k, 'KEY']], how='outer', on='KEY', suffixes=('_x', ''))
                        #print(df_gpby)
                        
                elif s=='var_wt' or s=='std_wt':   
                    df_gpby = df.groupby(list(n), dropna=False).agg(var_dict).reset_index()
                    
                    for k in var:
                        variance_fn =lambda x: (DescrStatsW(x[k], weights=x[weight[0]], ddof=10).var)*10  
                        df_gpby_variance= df.dropna(subset=[k, weight[0]], how='any', axis=0).groupby(list(n), dropna=False).apply(variance_fn).reset_index()
                        df_gpby_variance.rename(columns={0: k}, inplace=True)
                        
                        if s=='std_wt':
                            df_gpby_variance[k]= np.sqrt(df_gpby_variance[k])
                            
                        df_gpby_variance['KEY']=''
                        for z in n: # For the class variables present in this iteration, create a key
                            df_gpby_variance[z] = df_gpby_variance[z].fillna('') # if na is not filled with blank, the key will be blank as long as any column has nan
                            df_gpby_variance['KEY'] = df_gpby_variance[['KEY', z]].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
                            df_gpby_variance[z].replace(r'^\s*$', np.nan, regex=True, inplace=True) # replace the empty space (created in the 1st step) with nan again. 
                               
                               
                        df_gpby_variance_merged=pd.merge(df_gpby_variance_merged, df_gpby_variance[[k, 'KEY']], how='outer', on='KEY', suffixes=('_x', ''))
                       
                    df_gpby = pd.merge(df_gpby, df_gpby_variance_merged[var + ['KEY']], how='left', on='KEY', suffixes=('_x', ''))
                    
                    
                else:
                    df_gpby = df.groupby(list(n), dropna=False).agg(var_dict).reset_index()
                     
            elif missing=='N':
                if s=='sum_wt':
                    df_gpby = df_weighted.groupby(list(n), dropna=True).agg(var_dict).reset_index()
                
                elif s=='mean_wt':
                    
                    # find groups that has >1 element
                    non_unique=df['KEY'].value_counts() 
                    non_unique=non_unique.where(non_unique>1).dropna()
                    non_unique_list=non_unique.index.tolist()
                     
                    df_gpby=df.groupby(list(n), dropna=True).agg({'KEY':'last', '_FREQ_':'count'}).reset_index()
               
                    for k in var:
                        # Need to drop those rows that belong to groups with > 1 element and has nan. Otherwise, entire group will have nan even if some valid values
                        #df_subset=pd.concat([df[(df['KEY'].isin(non_unique_list)) & (df[[weight[0],k]].notnull().all(axis=1))], df[~df['KEY'].isin(non_unique_list)]], axis=0) 
                        df_subset=df[~((df['KEY'].isin(non_unique_list)) & (df[[weight[0],k]].isnull().any(axis=1)))]
                        mean_wt_fn_subset = lambda x: np.ma.average(x, weights=df_subset.loc[x.index, weight[0]]) 
                        df_gpby_subset= df_subset.groupby(list(n), dropna=True).agg({k: mean_wt_fn_subset, 'KEY':'last'}).reset_index() # use the subset-aware weighted-mean lambda defined just above
                        #df_gpby=pd.concat([df_gpby, df_gpby_subset[df_gpby_subset.columns.difference(df_gpby.columns)]], axis=1)
                        df_gpby=pd.merge(df_gpby, df_gpby_subset[[k, 'KEY']], how='outer', on='KEY', suffixes=('_x', ''))
                      
                
                elif s=='var_wt' or s=='std_wt':   
                    df_gpby = df.groupby(list(n), dropna=True).agg(var_dict).reset_index()
                    
                    for k in var:
                        variance_fn =lambda x: (DescrStatsW(x[k], weights=x[weight[0]], ddof=10).var)*10  
                        df_gpby_variance= df.dropna(subset=[k, weight[0]], how='any', axis=0).groupby(list(n), dropna=True).apply(variance_fn).reset_index()
                        df_gpby_variance.rename(columns={0: k}, inplace=True)
                        
                        if s=='std_wt':
                            df_gpby_variance[k]= np.sqrt(df_gpby_variance[k])
                            
                        df_gpby_variance['KEY']=''
                        for z in n: # For the class variables present in this iteration, create a key
                            df_gpby_variance[z] = df_gpby_variance[z].fillna('') # if na is not filled with blank, the key will be blank as long as any column has nan
                            df_gpby_variance['KEY'] = df_gpby_variance[['KEY', z]].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
                            df_gpby_variance[z].replace(r'^\s*$', np.nan, regex=True, inplace=True) # replace the empty space (created in the 1st step) with nan again. 
                               
                               
                        df_gpby_variance_merged=pd.merge(df_gpby_variance_merged, df_gpby_variance[[k, 'KEY']], how='outer', on='KEY', suffixes=('_x', ''))
                    
                    df_gpby = pd.merge(df_gpby, df_gpby_variance_merged[var + ['KEY']], how='left', on='KEY', suffixes=('_x', ''))
                    
                    
                else:
                    df_gpby = df.groupby(list(n), dropna=True).agg(var_dict).reset_index()
             
            df_gpby['_TYPE_']=code_full
            df_gpby = pd.merge(df_gpby, df_unique[id_1], how='left', on='KEY', suffixes=('_x', ''))
            
          
            df_total_gpby= pd.concat([df_total_gpby, df_gpby], axis = 0, ignore_index=True) 
        
        
        df_total_gpby.drop(df_total_gpby.filter(regex='_x$').columns, axis=1, inplace=True, errors='ignore')
        df_total_gpby.drop(columns=['KEY'], axis=1, inplace=True, errors='ignore')
    
        df_total_gpby['_TYPE_']=pd.to_numeric(df_total_gpby['_TYPE_'], errors='coerce')
        df_total_gpby.sort_values(by=['_TYPE_'], kind='stable', ignore_index=True, inplace=True, na_position='first')
        df_total_gpby['_TYPE_']=pd.factorize(df_total_gpby['_TYPE_'])[0] + 1 # factorise starts from 0, so need to +1
        
        df_total_gpby = pd.concat([df_TOTAL, df_total_gpby], axis = 0, ignore_index=True) 
        
        if  nway=='Y':
            df_total_gpby=df_total_gpby.nlargest(1, columns=['_TYPE_'], keep='all')
       
        if s!=np.std:
            df_total_gpby['_STAT_']=s
        elif s==np.std:
            df_total_gpby['_STAT_']='std'
            
        df_final_gpby = pd.concat([df_final_gpby, df_total_gpby], axis = 0, ignore_index=True)    
 
    df_final_gpby.drop(columns=['KEY'], axis=1, inplace=True, errors='ignore')
    sort_list=['_TYPE_'] + list(reversed(class_1)) + ['_STAT_']
    df_final_gpby.sort_values(by=sort_list, kind='stable', ignore_index=True, inplace=True, na_position='first')
    #print(df_final_gpby)

    return df_final_gpby


# Test SAS_pro_summary with the toy dataset below
data = {
    "name": ['dick', 'sam', 'sam', 'sam', 'susan', 'susan', 'john', 'john', 'john', 'john', np.nan],
    "last_name": ['lim', 'tan', 'tan', 'lim', 'wee', 'lim', 'low', 'tan', 'tan', 'lee', 'tan'],
    "id": ['S', 'F', 'S', 'F', 'F', 'S', 'S', 'F', 'F', 'F', 'F'],
    "loc": ['E', 'W', 'N', 'S', 'N', 'S', 'E', 'N', 'N', 'N', 'N'],
    "sex": ['M', 'M', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'F'],
    "age": [np.nan, 20, 20, 20, 20, 789, 234, 9999, 10, 10, np.nan],
    "age2": [np.nan, -123, 20, 20, 20, 789, 234, 9999, np.nan, np.nan, np.nan],
    'wt': [90, 10, 10, 10, 10, 10, 10, 10, 10, np.nan, np.nan]
}

df = pd.DataFrame(data)

df = SAS_pro_summary(df, stat=['count'], class_1=['name'], var=[], id_1=['sex', 'last_name'], weight=['wt'], nway='N', missing='N')
print(df)




Wednesday 1 June 2022

Regular Expression (Regex) for Common Situations

                                              Photo by Bruno Martins on Unsplash  

Providing a couple of regex examples for commonly encountered situations.

This article assumes the reader is already familiar with the frequently used symbols in regex, like . and *. If you are not, you may refer to this very detailed guide:

https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html 

I will be using the sample text below for all the examples in this article. However, the code in all the examples can be modified according to your specific text.  

 


Code to read in the sample text from a txt file

import re
# Use forward slashes in the file path
file='C:/Users/sample_text.txt'
with open(file, 'r') as a:
    # readlines will read each line into an element of a list
    file_lines=a.readlines()
    sample_text="".join(file_lines) # we need to join the lines into a single variable

# Alternatively, read it in directly as a variable
sample_text= 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

Example 1: Extracting text in the same line after a specific text (‘second’) 

1. Create a pattern with “second”

sample_regex=re.compile(r'(?<=second).*')

(?<=xxx): will look for xxx that precedes the characters to be extracted

 

2. Look for the pattern in sample_text

extracted=sample_regex.search(sample_text)  

3. Remove the leading/trailing spaces 

extracted2=extracted.group().strip()
print(extracted2) #return string     

Complete Code

sample_regex=re.compile(r'(?<=second).*')
extracted=sample_regex.search(sample_text)
extracted2=extracted.group().strip()
print(extracted2)

Example 2: Extracting text in the same line after a specific text with variation ('After')

1. Create a pattern with “After” followed by “months” or “quarters”

sample_regex=re.compile(r'(After )(months.+|quarters.+)(\s+)(.+)')
extracted=sample_regex.search(sample_text)
print(extracted.group()) #return After months, the profit is 300**

2. Extract group(4) where 300** is: 

extracted2=extracted.group(4).strip()
print(extracted2) #return 300**

3. Strip the “*” on the right and remove all empty spaces 

extracted3=extracted2.rstrip('*').strip()
print(extracted3) #return 300

Complete Code

sample_regex=re.compile(r'(After )(months.+|quarters.+)(\s+)(.+)')
extracted=sample_regex.search(sample_text)
print('Example 2')
print(extracted.group())
extracted2=extracted.group(4).strip()
print(extracted2)
extracted3=extracted2.rstrip('*').strip()
print('Example 2')
print(extracted3)

Example 3: Extract all text between 2 texts (non-inclusive) (“start” and “end”)

1. Create a pattern with “start” and “end”


(?<=xxx): will look for xxx that precedes the characters to be extracted 

(?=xxx): will look for xxx that follows the characters to be extracted
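The compile line itself did not survive in this post. A pattern consistent with the lookbehind/lookahead hints above and the output below would be (my reconstruction, not necessarily the original):

sample_regex=re.compile(r'(?<=start\n)[\s\S]*(?=\nend)')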

extracted=sample_regex.search(sample_text)
print(extracted.group())
#return
string
second string

Example 4: Extract all text between 2 texts (inclusive) (“start” and “end”)

1. Create a pattern with “start” and “end” 

{m, n} : m to n times (inclusive) 

{m,} : >=m times 

{m} : exactly m times
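As in Example 3, the compile line is missing here. A pattern consistent with the quantifier hints above and the inclusive output below would be (my reconstruction):

sample_regex=re.compile(r'start(\n.*){3}')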

extracted=sample_regex.search(sample_text)
print(extracted.group())

#return
start
string
second string
end

Example 5: How do we find the line with only the word “string”?

1. Extract the 2 lines after “start”

sample_regex=re.compile(r'(?<=start)(\n.*){2}')
extracted=sample_regex.search(sample_text)
print(extracted.group())
#return
string
second string

2. Split the 2 lines by line. This will return a list.

line_split=re.split('\n',str(extracted.group()))
print(line_split) #return ['', 'string', 'second string']

3. Check items 1 & 2 in the list

item1= bool(re.search(r'^string$', line_split[1]))
item2= bool(re.search(r'^string$', line_split[2]))

The bool function converts the search result to True or False

^ : start of line

$ : end of line

^string$ : the entire line should be only “string”

print(item1) #return True
print(item2) #return False

Complete Code

sample_regex=re.compile(r'(?<=start)(\n.*){2}')
extracted=sample_regex.search(sample_text)
line_split=re.split('\n',str(extracted.group()))
item1= bool(re.search(r'^string$', line_split[1]))
item2= bool(re.search(r'^string$', line_split[2]))
print('Example 5')
print(item1)
print(item2)

Tuesday 12 April 2022

Seasonal Adjustment Outlier Thresholds

A good summary of the seasonal adjustment thresholds for outliers adopted by various organisations: https://1drv.ms/b/s!Aq4mhLQ_aASsaKR4_xJUPnSs74E?e=5LzwG5

VBA Cheat Sheet / Excel Macros Quick Guide

TERM / DESCRIPTION

ARRAY

The index starts from 0 by default. To amend, use ReDim or Option Base 1.

 

However, when assigning a Range to an array, the array starts from 1.

https://bettersolutions.com/excel/cells-ranges/vba-working-with-arrays.htm

https://bettersolutions.com/vba/arrays/option-base-1.htm 

 

Static Array

Dim arr(0 To 5) As Long (other data types can be used)

 

 

Dynamic Array

Dim arr() As Long (other data types can be used)

 

By default, the array will start at 0. To change, place at top of module: "Option Base 1".

 

Dim lStudentCount As Long
lStudentCount = Sheet1.Range("A" & Rows.Count).End(xlUp).Row

' Create array of correct size
Dim arr() As Long
ReDim arr(1 To lStudentCount)

 

Preserve the values of a dynamic array when it is ReDimmed

If we use ReDim on an existing array, then the array and its contents will be deleted.

 

To retain the original values, use Preserve. But the starting index value must be the same.

 

Dim arr() as String

ReDim arr(1 To 2)

    arr(1) = "Apple"

    arr(2) = "Apple"

 

'It must start at 1.

ReDim Preserve arr(1 To 6)

Fill dynamic array with values

 

Dim arr () As Variant
arr = Array("John", "Hazel", "Fred")

 

Dim arr ()As Variant
arr = Range("A1:D2") --> this will work if there is no reference to a worksheet or workbook

 

arr = Sheets("sheet1").Range("A1:G311").Value2 --> If there is a reference to a worksheet or workbook, you will need to use "Value" or "Value2"

 
Differences between Text, Value and Value2

a. Range.Text Property

b. Range.Value Property

c. Range.Value2 Property

 

 

a) Doesn't work with array! Returns the value exactly as it is on your screen. If your column width is too narrow and the value in the cell is shown as ###, that’s exactly what Text will return.

 

b) Returns the value in whatever format it was in Excel and converts it to the VBA equivalent data type. This is usually a fine way to extract the data, unless your data is formatted as currency or a date. Transmitting as a date may be useful, but the currency can cause some issues. The VBA currency data type only carries 4 decimal places, so your cell value may get truncated when converted to VBA.

 

c) Returns the value independent of format. Eg, a formatted date will be returned as the serial number date value Excel uses.

 

Adapted from https://wellsr.com/vba/2017/excel/excel-vba-assign-range-to-array/

 

2-dimensional array

Dim arr(1 To 5, 1 To 2) As Long

 

The  above is an array with 5 rows & 2 columns.

Calling out an element of an array

Specify the row and column of an array even if it is an n x 1.

 

Debug.Print arr(1)  --> Will give an error

 

Debug.Print arr(2,1)  --> correct

 

 

Example:

 

Dim arr() As Variant

arr = ActiveSheet.Range("A1:A5").Value2

 

Dim c As Variant

For Each c In arr

    Debug.Print c

Next c

 

WORKBOOK

 

ActiveWorkbook

Refers to the Workbook which is active (the one you see)

 

ThisWorkbook

Refers to the Workbook that houses the code

 

Workbooks(n)

n is an index number, in the order in which the open Workbooks were opened

 

Workbooks("aBook.xls")

Refers to ONLY the workbook of that name

 

WORKSHEET (ONLY WORKSHEET)

 

Worksheets(n)

n is the index number of a Worksheet from left to right in a workbook

 

Worksheets("aSheet")

Refers to a Worksheet tab name

 

SHEET (INCLUDE WORKSHEET AND CHARTSHEET)

 

ActiveSheet

Refers to the Sheet which is active (the one you see)

 

Sheets(n)

n is the index number of a Sheet from left to right in a workbook

 

Sheets("aSheet")


Refers to a Sheet tab name

Directly use the code name without quotation marks

 

Refers to a Sheet code name

Note: The code name remains the same if you change the Sheet tab name or the order of your sheets. So this is the safest way to reference a Sheet.

 

The code name can be found in the VBA Editor:

 

 

Chart6, Sheet1 are examples of  code names while the tab names are in brackets.

 

 

Changing the code name:

 

Select the Sheet and its Properties will show. Click on the box for (Name) and Name to change its code and tab name, respectively.

 

CHART

 

ActiveChart

Refers to the Chart sheet / embedded chart in a worksheet which is active (the one you see). If no chart sheet/embedded chart  is active, code will return an error.

 

Charts(n)

n is the index number of a Chart sheet from left to right in a workbook

 

Charts("Chart1")


Refers to a Chart sheet tab name

CELL

 

Cells ()

 

Cells (row, column)

Column is optional.

If only Row index is used & index exceeds the number of columns in the specified range, the reference will wrap to successive rows within the range columns.

To refer to "A1":

Cells(1)

 

To refer to "B1":

Cells(2)

 

To refer to "A3":

Cells(3,1) is the same as Cells(3, "A")

 

To select all cells in a worksheet:

ActiveSheet.Cells.Select

 

Range ()

Refers to a single cell, adjacent cells, non-adjacent cells, row(s) or column(s)

 

Examples

Range("A1")

 

Range("A1, B5")

 

Range("A3:E5")

 

Range ("3:3")

 

Range ("3:5")

 

Range ("A:A")

 

Range ("A:C")

Range("A3:E5") is same as Range (Cells (3,1),Cells (5,5))

Range("PriceList") --> A named range. It is useful to have a named range when columns/rows/cells are added/deleted.

 

Range("B" & LastCopyRow2 & ":L" & LastCopyRow2), where LastCopyRow2 is a variable

 

Range("A2:" & Cells(LastUsedRow(i), LastUsedCol(i)).Address)

 

Rows()

Refers to a row(s)

 

To refer to all the rows in a worksheet:

Activesheet.Rows

 

To refer to Row 1-3 in a worksheet:

Activesheet .Rows("1:3")

 

To refer to Row 3 in a worksheet:

Activesheet .Rows(3)

 

To refer to all the rows in a specific range:

Activesheet.Range("B2:D4").Rows

 

To refer to Row 3 in a specific range:

Activesheet.Range("B2:D4").Rows(3)

 

 

Columns()

Refers to a column(s)

 

To refer to all the columns in a worksheet:

Activesheet.Columns

 

To refer to Column 3 in a worksheet:

Activesheet .Columns(3) is the same as

Activesheet .Columns("C")

 

To refer to Column A-C in a worksheet:

Activesheet .Columns("A:C")

 

To refer to all the columns in a specific range:

Activesheet.Range("B2:D4").Columns

 

To refer to Column 3 in a specific range:

Activesheet.Range("B2:D4").Columns(3)

Row

 

Returns the row number. If more than 1 row is referenced, it will return the 1st row number in the specified range.

 

Returns "3":

Activesheet.Range("F3").row

 

 

Column

 

Returns the column number. If more than 1 column is referenced, it will return the 1st column number in the specified range.

 

 

Returns "6":

Activesheet.Range("F3:H5").column

 

THE RANGE.END() METHOD

To find the last used row/column. End() is akin to pressing the Ctrl+Arrow key.

End(xlUp)

 

Example 1:

LastRow1 = Cells(Rows.Count, 1).End(xlUp).Row

Rows.Count is the Row number of the last row in the worksheet.

Cells(Rows.Count, 1) will move down to the last row in Column A.

Next, End(xlUp) will move up to the last used row in Column A.

Finally, Row returns the row number of the last used row in Column A.

 

End(xlDown)

 

Example 2:

 
LastRow2= Cells(1, 1).End(xlDown).Row
 
From A1,  move down to the last used row in Column A.

 

End(xlToLeft)

Example 3:

 
LastCol3= Cells(1, Columns.Count).End(xlToLeft).Column
 
Similar to Example 1, except it is to find last used column.

 

End(xlToRight)

Example 4:

 
LastCol4= Cells(1, 1).End(xlToRight).Column
Similar to Example 2, except it is to find last used column.

 

HOW TO FIND THE SPECIFIC ROW & COLUMN NO. OF A USED RANGE?

 

 

Example

LastCopyRow = ActiveSheet.UsedRange.Rows(ActiveSheet.UsedRange.Rows.Count).Row

 

Explanation

ActiveSheet.UsedRange.Rows.Count will return the number of used rows in the UsedRange: 6

 

ActiveSheet.UsedRange.Rows(ActiveSheet.UsedRange.Rows.Count) refers to the last row in the UsedRange: 6

 

ActiveSheet.UsedRange.Rows(ActiveSheet.UsedRange.Rows.Count).Row returns the row No. of the last row in the UsedRange: 16

 

 

LastCopyColumn = ActiveSheet.UsedRange.Columns(ActiveSheet.UsedRange.Columns.Count).Column

 

WHAT IS PASSING ARGUMENTS BY VALUE VS BY REFERENCE?

 

 

Passing Arguments By Value does not change the original variable as only a copy of the variable is passed.

 

Passing Arguments By Reference changes the original variable as the variable itself is accessed.

 

 

A copy of a variable is passed

Original variable is changed

Ref

No

Yes

Val

Yes

No

 

Example for ByRef vs ByVal

 

Dim i As Long

Function Add_ByRef(ByRef i As Long) As Long

i = i + 5

Add_ByRef = i

End Function

 

Function Add_ByVal(ByVal i As Long) As Long

i = i + 5

Add_ByVal = i

End Function

 

Sub Test_ByRef ()

i = 2

Debug.Print Add_ByRef(i) 'Returns 7

Debug.Print i 'Returns 7 (ByRef changed the original variable)

End Sub

Sub Test_ByVal ()

i = 2

Debug.Print Add_ByVal(i) 'Returns 7

Debug.Print i 'Returns 2 (ByVal left the original variable unchanged)

End Sub

 

 

 

 

COMMON ERRORS

1)      Activating a sheet is needed when selecting anything in it. E.g., sometimes selecting a cell doesn't work because the sheet has not been activated. At other times, the active sheet could differ from the sheet stated in the code. Hence it is best practice to always activate a sheet:

ThisWorkbook.Worksheets("Data").Activate

 

2)      Ensuring a workbook is opened for formulas to be updated

For certain formulas like Indirect and Offset that reference another Excel file, that file must be open for the values to be updated.

Offset can usually be substituted with Index which doesn’t require the referenced Excel file to be opened.

 

 

Download this article at https://1drv.ms/b/s!Aq4mhLQ_aASsaZDGdaKPusJ66BA?e=JV4aIP
