Python for SQL Server Specialists Part 4: Python and SQL Server

Python for SQL Server Specialists Part 4: Python and SQL Server

In the last article in this series about Python for SQL Server specialists, you are going to learn how to use SQL Server Python libraries in SQL Server. You can use two scalable libraries, the revoscalepy and microsoftml libraries, which correspond to equivalent R libraries.

SQL Server and ML Integration

With SQL Server 2016 and 2017, you get a highly scalable Machine Learning (ML) engine. Not every function and algorithm is rewritten as a scalable one. Nevertheless, you will probably find the one you need for your analysis of a big dataset. You can store a Python or R data mining or machine learning model in a SQL Server table and use it for predictions on new data. You can even store graphs in a binary column and use it in SQL Server Reporting Services (SSRS) reports. Finally, ML support is not limited to SQL Server only. You can use R code also in Power BI Desktop and Power BI Service, where we at this time (April 2018) still waiting for Python support. However, you can already use both languages, Python and R, in Azure Machine Learning (Azure ML) experiments.

Microsoft provides the highly scalable ML engine in two flavors:
• ML Services (In-Database): This is the installation that integrates ML into SQL Server. It includes a database service that runs outside the SQL Server Database Engine and provides a communication channel between the Database Engine and R runtime. You install it with SQL Server setup. The ML engine includes the open source R and Python components and in addition a set of scalable R and Python packages.
• Microsoft ML Server: This is a standalone ML server with the same open and scalable packages that run on multiple platforms.

Some of the scalable packages shipped with SQL Server R Services are:
• RevoScaleR (for R) and revoscalepy (for Python): This is a set of parallelized scalable functions for processing data, data overview and preliminary analysis, and machine learning models. The procedures in this package can work with chunks of data at a time, so they do not need to load all of the data in memory immediately.
• MicrosoftML (for R) and microsoftml (for Python): This is a package from December 2016, with many additional scalable machine-learning algorithms implemented.
The following figure shows how the communication process between SQL Server and ML engine works:

The components involved and their communications are as follows:
• In SQL Server Database Engine, you run R or Python script with the sys.sp_execute_external_script system stored procedure. SQL Server sends the request to the Launchpad service, a new service that supports the execution of external scripts.
• The Launchpad service starts the launcher appropriate for the language of your script, either the RLauncher.dll or the PyLauncher.dll, and therefore you can launch an external script from SQL Server using the R or Python language. You can see that the infrastructure is prepared to enable the execution of scripts in additional programming languages.
• The RLauncher or the PyLauncher starts RTerm.exe, the R terminal application for executing R scripts, or Python.exe, the Python terminal application.
• The terminal application in any of the two languages sends the script to BxlServer. This is a new executable used for communication between SQL Server and the ML engine. The scalable ML functions are implemented in this executable as well.
• The BxlServer uses SQL Satellite, a new extensibility API that provides a fast data transfer between SQL Server and external runtime. Again, currently the R and the Python runtimes are supported.

Using the Scalable Functions

Time to start using the scalable function. I will show you how to use some of the functions from the revoscalepy package, which is part of the imports at the beginning of the following code. The code defines the connection string to my local SQL Server, the AdventureWorksDW2016 database. Remember, the dbo.vTargetMail view comes from this database. Also note that the RUser used to connect to SQL Server needs permission to use the sys.sp_execute_external_script procedure. The code also defines the chunk size of 1,000 rows.

import numpy as np
import pandas as pd
import pyodbc
import revoscalepy as rp
# Connection string and chunk size
sqlConnStr = """Driver=SQL Server;Server=SQL2017EIM;
chunkSize = 1000;


The next code defines the query to read the data from SQL Server. Note that an ODBC connection is not needed. The RxSqlServerData() function generates a SQL Server data source object. You can think of it as a proxy object to the SQL Server rowset, which is the result of the query. The data itself is not stored in the memory.

# Define the query
TMquery = """SELECT EnglishOccupation AS Occupation,
YearlyIncome, Age, TotalChildren,
NumberCarsOwned, BikeBuyer
FROM dbo.vTargetMail"""
# Only creates the data source object and does not populate it
sqlTM = rp.RxSqlServerData(sql_query = TMquery,
connection_string = sqlConnStr,
string_as_factors = True,
rows_per_read = chunkSize);


Of course, you can also load the data in an in-memory data frame, like the following code does.

# Import the data in memory
TMSQL = rp.rx_import(input_data = sqlTM, report_progress = 3);


With the rx_get_info() function, you can get some basic information about both, the proxy object and the in-memory data frame.

# Get info about the data source and memory data


Here are the results. You can see that the first object is just a proxy, while the second has rows and columns.

>>> rp.rx_get_info(sqlTM)
Connection string:Driver=SQL Server;Server=SQL2017EIM;
Data Source:SqlServer

>>> rp.rx_get_info(TMSQL)
Number of observations:18484.0
Number of variables:6.0


You can use the rx_summary() revoscalepy scalable function to quickly get some descriptive statistics for the data. The function can use both, the proxy object and the in-memory data frame as the source.

# rx_summary() can use the data source or the memory data
sumOut = rp.rx_summary(formula = "~ NumberCarsOwned + Occupation + F(BikeBuyer)",
data = sqlTM)
sumOut = rp.rx_summary(formula = "~ NumberCarsOwned + Occupation + F(BikeBuyer)",
data = TMSQL)


For the sake of brevity, I am showing only one result here.

Summary Statistics Results for: ~ NumberCarsOwned + Occupation + F(BikeBuyer)

Number of valid observations: 18484.0

Name Mean StdDev Min Max ValidObs MissingObs
0 NumberCarsOwned 1.502705 1.138394 0.0 4.0 18484.0 0.0

Category Counts for Occupation
Number of categories: 5

Professional 5520.0
Management 3075.0
Skilled Manual 4577.0
Clerical 2928.0
Manual 2384.0

Category Counts for F(BikeBuyer)
Number of categories: 2

1 9352.0
2 9132.0


The functions that do not come from the revoscalepy package cannot use the proxy object. For example, the following code uses the pandas crosstab() function to do a crosstabulation on the data from the in-memory data frame and from the proxy object. Note that the code that tries to use the proxy object produces an error.

# Pandas crosstab can use the data, but can't use the data source object
pd.crosstab(TMSQL.NumberCarsOwned, TMSQL.TotalChildren)
# The next row produces an error
pd.crosstab(sqlTM.NumberCarsOwned, sqlTM.TotalChildren)
# AttributeError: 'RxSqlServerData' object has no attribute 'NumberCarsOwned'


The next step is initializing and training a linear regression model, using number of cars owned as the target variable, and income, age and number of children as input variables. Please note the syntax of the rx_lin_mod() function – it actually uses R syntax for the function parameters. This syntax might be simpler for you if you already use R; however, it might look a bit weird to pure Python developers.

# Create a linear regression model
linmod = rp.rx_lin_mod(
"NumberCarsOwned ~ YearlyIncome + Age + TotalChildren",
data = sqlTM)


Finally, the following code makes the predictions on the in-memory data frame and shows the first ten rows of those predictions.

# Predictions on the memory data frame
predmod = rp.rx_predict(linmod, data=TMSQL, output_data = TMSQL)


Note that the RevoScaleR package for R ( includes many functions more than the revoscalepy package ( currently supports for Python.

Executing Python Code in SQL Server

Now you need to switch to SQL Server Management Studio (SSMS). You will use Python inside T-SQL code. If you did not configure your SQL Server to allow external scripts, you have to do it now.

-- Configure SQL Server to allow external scripts
USE master;
EXEC sys.sp_configure 'show advanced options', 1;
EXEC sys.sp_configure 'external scripts enabled', 1;
-- Restart SQL Server
-- Check the configuration
EXEC sys.sp_configure;


You can immediately check whether you can run Python code with the sys.sp_execute_external_script procedure. The following code returns a 1×1 table, with value 1 in the single cell.

-- Check whether Python code can run
EXECUTE sys.sp_execute_external_script
@language =N'Python',
OutputDataSet = InputDataSet
print("Input data is: \n", InputDataSet)
@input_data_1 = N'SELECT 1 as col';


And finally, here is the big code that runs Python to create the same linear regression model as before, however this time within SQL Server. In the result, you get the actual data with the predicted number of cars.

-- Create a model and use it inside SQL Server
USE AdventureWorksDW2016;
EXECUTE sys.sp_execute_external_script
@language =N'Python',
from revoscalepy import rx_lin_mod, rx_predict
import pandas as pd
linmod = rx_lin_mod(
"NumberCarsOwned ~ YearlyIncome + Age + TotalChildren",
data = InputDataSet)
predmod = rx_predict(linmod, data = InputDataSet, output_data = InputDataSet)
OutputDataSet = predmod
@input_data_1 = N'
SELECT CustomerKey, CAST(Age AS INT) AS Age,
CAST(YearlyIncome AS INT) AS YearlyIncome,
TotalChildren, NumberCarsOwned
FROM dbo.vTargetMail;'
"CustomerKey" INT NOT NULL,
"YearlyIncome" INT NOT NULL,
"TotalChildren" INT NOT NULL,
"NumberCarsOwned" INT NOT NULL,
"NumberCarsOwned_Pred" FLOAT NULL));


Before finishing this article, let me point out casts in the input SELECT statement. In comparison to SQL Server, Python supports a limited number of data types. Some conversions between SQL Server data types can be done implicitly, other must be done manually. You can read the details about possible implicit conversions at


This article concludes the series about Python for SQL Server specialists. I hope that you enjoyed it, and especially that you gained some new knowledge that you will use soon.


Click here to download the code.

How to extract Twitter data with a windows service created with Python in Visual Studio 2017

How to extract Twitter data with a windows service created with Python in Visual Studio 2017

Hi everyone,

In this post, we will code a script in python (with Visual Studio 2017) to create a program which we can execute as a windows service in order to extract (in almost real time) the tweets related to certain words or hashtags, store them in a SQL server database, and then consume them with Power BI.

The tasks to be performed by the script are to connect to the Twitter streaming API using some tokens generated from the official Twitter page for developers. Then it will make a request to the API passing by parameter a list of hashtags or terms and will return indefinitely in “real time” the tweets that are published and that contain these terms. Once the tweets are obtained, they are stored in a buffer (with parameterizable size) “n” tweets that when reaching the maximum size of the buffer will be stored into a SQL Server table.

In order to do this, the first thing we must do is to create a new Python project in the Visual Studio 2017 Environment:

The next step will be to add the libraries that we will use in our script, and that will help the work to a great extent, in this case we will make use of the following:

  • The library to be able to handle JSON files (since the Twitter API returns data in this format).
  • We also use pandas library to store data in dataframes (in memory data structures, similar to SQL tables).
  • We need to use tweepy that gives us access to Twitter API.
  • pyodbc and sqlAlquiemy provide us with tools to connect to a SQL Server databases.

To be able to make use of these libraries, we must add them to our Python environment from the solution explorer:

Search the library in the Python environment:

Then we will add the following code to the main Python script:

import json
import pandas as pd
import pyodbc
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from import sql
from sqlalchemy import create_engine
from import json_normalize

In order to access the Twitter API, we need to create an application in the developers Twitter portal on page  we can register applications with any Twitter user. When registering an application some tokens are generated that are necessary to access the API, these tokens will be stored in variables in our script:

#Declare variables that contains the user credentials to access Twitter API
#You can get your own keys in
cKey =         "ENTER YOUR API KEY"
cSecret =      "ENTER YOUR API SECRET"


In our example we get tokens of the app created before:

The workflow of our script will be this:

The whole code is here:

#Import libraries

import json
import pandas as pd
import pyodbc
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from import sql
from sqlalchemy import create_engine
from import json_normalize

#Declare variables that contains the user credentials to access Twitter API
#You can get your own keys in
cKey =         "ENTER YOUR API KEY"
cSecret =      "ENTER YOUR API SECRET"

#Define after how many twitts we do a insert in the data base.
bufferSize = 5
#Defina an array to store the tweets readed from the stream api
twittsBuffer = []
#Define a connectiont to read-write to the Sql server Database 
engine = create_engine("mssql+pyodbc://MyDbUser:MyPassword@MySQLServer/TwitterDB?driver=SQL+Server+Native+Client+11.0")

#Define a function that receive a twitt by parameter and store it into the twittBuffer variable
#if the twittBuffer reach the buffersize defined lenght then call the function AddTwittsToDB that insert the twitts into
#the twittsBuffer array into the SQL Server database and clean the buffer
def AddTwittToBuffer(twitt):
    global twittsBuffer
    if (len(twittsBuffer) == bufferSize):
        twittsBuffer = []
    print(twitt['coordinates'] if twitt['coordinates']!= None else 'no coordinates')


#This function write the twitts stored in the variable twitBuffer to the SQL Database
def AddTwittsToDB(twitts):
    tData = {'id': [], 
             'text': [], 
             'screen_name': [], 
             'created_at': [],
             'retweet_count': [], 
             'favorite_count': [],
             'friends_count': [], 
             'followers_count': [], 
    for t in twitts:

        if t['place'] != None :
        else :
        if t['coordinates'] != None :
        else :
    tweets = pd.DataFrame(tData)
    tweets.set_index('id', inplace=True)
    return True
#Create a listener class that process received tweets
#On error print status
class StdOutListener(StreamListener):

    def on_data(self, data):
        t= json.loads(data)
        return True

    def on_error(self, status):
#Define a main function, the entry point of the program
if __name__ == '__main__':

    #This object handles Twitter authetification and the connection to Twitter Streaming API
    myListener = StdOutListener()
    authenticator = OAuthHandler(cKey, cSecret)
    authenticator.set_access_token(aToken, aTokenSecret)
    stream = Stream(authenticator, myListener)

    #This line filter Twitter Streams to capture data tweets with the included text: 'Microsoft' or 'SolidQ' or 'Visual Studio'
    stream.filter(track=['PowerBI', 'Tableau', 'Qlikview','Microstrategy','Pyramid Analytics','Business Objects', 'Ibm cognos'])


During the execution of the script, if the destination table does not exist, it will be created and the tweets will be inserted into the table.

The table we created will have the following columns:

  1. The id of the tweet
  2. The tweet origin country
  3. The date of creation in text format (then we will transform it to a date format)
  4. The number of favorites of the user who tweets (at the time he tweeted)
  5. The number of followers of the user who tweets (at the time he tweeted)
  6. The number of friends of the user who tweets (at the time he tweeted)
  7. The language detected by the Twitter API
  8. The latitude (if it is informed)
  9. The longitude (if it is informed)
  10. The number of retweets of the message.
  11. The name of the user who tweets
  12. The text of the tweet

If we query the database we will see the following:

In our example, we are getting tweets that contain the following terms ‘PowerBI’, ‘Tableau’, ‘Qlikview’, ‘Microstrategy’, ‘Pyramid Analytics’, ‘Business Objects’, ‘IBM Cognos’ to make a comparison of tweet counts about these terms.

Once we have the script created and tested, we can run it as a windows service so that it is continuously downloading the tweets, this can be done easily with tools such as Non Suck Service Manager you can download from its page

Once the nssm is downloaded, we go to the folder where the 64-bit executable is (or the 32-bit executable if it’s your case) and execute the following from the command line:

This window will be opened:

In path we introduce the path where the python executable is adding the executable name of python.exe, in the startup directory we enter the directory (only the path) where the Python executable is stored and in arguments insert the path and file name of the Python script that we created with Visual Studio 2017.

Doing this we will have created the service, we must start it and we will see that tweets begin to appear in our SQL Server table.

Now we have our table with the stored tweets we must create a couple of tables to relate the terms with the table of tweets and create a model that we can consume with Power BI, we do this in several steps:

First, we create a dimension with the terms and a code for each one in bitmask format, there are 7 terms so the table will have 7 files and the ids will be numbered from 1 to 64 in powers of 2 (1, 2, 4, 8,16, 32, and 64), in this way we can relate each tweet with its terms by means of a bit mask.

For example if a tweet contains the words PowerBI and QlickView, the id that relates it to its terms will be the sum of the two ids of those terms, in this case, the relationship ID would be 5, the sum of id 1 corresponding to Power Bi and Id 4 which is the id of the term QlickView. If a tweet contains all the 7 terms its relationship id will be 127 (the sum of all the ids 1+2+4+8+16+32+64)

The script that generates this dimension table is:

with Dim_Terms as(
SELECT 1  as Id, 'PowerBI'			AS Term union all 
SELECT 2  as Id, 'Tableau'			AS Term union all 
SELECT 4  as Id, 'Qlikview'			AS Term union all 
SELECT 8  as Id, 'Microstrategy'		AS Term union all 
SELECT 16 as Id, 'Pyramid Analytics'	        AS Term union all 
SELECT 32 as Id, 'Business Objects'		AS Term union all 
SELECT 64 as Id, 'Ibm cognos'			AS Term )
select * into DimTerm from Dim_Terms


Table queried will look like this:

The relationship established between this dimension of terms and the table of “facts” is a “many to many” relationship so we are going to need another 2 tables that consolidate the relationship to be able to use it in Power BI

The first of these relationship tables is the one that contains all the possible combinations of possible terms from a single term up to 7 at a time, this table will have 127 records (the 0 that would correspond to any term will not be contemplated at the moment)

The query that will provide us this table is:

with nums as
  select 1 as id
  union all 
  select id + 1
    from nums
    where id + 1 < 128
datos as(
  iif(cast( & 64 as bit)  = 1, 'Ibm_cognos','') + 
    iif(cast( & 32 as bit)  = 1, ' Business_Objects','') +
    iif(cast( & 16 as bit) 	= 1, ' Pyramid_Analytics','') +
    iif(cast( & 8 as bit)	= 1, ' Microstrategy','') +
    iif(cast( & 4 as bit)	= 1, ' Qlikview','') +
    iif(cast( & 2 as bit)  	= 1, ' Tableau','') +
    iif(cast( & 1 as bit)	= 1, ' Microsoft','')),' ', ' - '), '_',' ')  as Terms
from nums t
select * into Dim_AllTerms from datos option (maxrecursion 0)


And its structure will be:

Now we must relate, through a table of relationship “Many to Many”, the dimension with the table that maps each “combination of terms” (Dim_AllTerms) with its terms of the dimension (DimTerm) we do this with next query:

with nums as
 select 1 as id
 union all 
 select id + 1
 from nums
 where id + 1 < 128
datos as(
 cast( & 64 as bit) as bit7
 ,cast( & 32 as bit) as bit6
 ,cast( & 16 as bit) as bit5
 ,cast( & 8 as bit) as bit4
 ,cast( & 4 as bit) as bit3
 ,cast( & 2 as bit) as bit2
 ,cast( & 1 as bit) as bit1
from nums t
 case trend
 when 'bit1' then 1
 when 'bit2' then 2
 when 'bit3' then 4
 when 'bit4' then 8
 when 'bit5' then 16
 when 'bit6' then 32
 when 'bit7' then 64
 else 0
 end as IdTrend
 into F_MtM_Terms
 (select * from datos ) p
 valor for trend in (bit1, bit2, bit3, bit4,bit5,bit6,bit7)
 ) as unpivo
 valor = 1
option (maxrecursion 0)


The table will have the following appearance, as you can see the Id of “Terms” 5, it is repeated 2 times, one for each individual term that it’s contained, the 1 of Microsoft and the 4 of QlikView:

Finally, we create a view that will return tweets data in form of a fact table:

  ,screen_name as UserName
  , CONVERT(DATETIME, SUBSTRING(created_at,9,2)+'/'
  + CASE SUBSTRING(created_at,5,3)
    WHEN 'Jan' then '01/'
    WHEN 'Feb' then '02/'
    WHEN 'Mar' then '03/'
    WHEN 'Apr' then '04/'
    WHEN 'May' then '05/'
    WHEN 'Jun' then '06/'
    WHEN 'Jul' then '07/'
    WHEN 'Aug' then '08/'
    WHEN 'Sep' then '09/'
    WHEN 'Oct' then '10/'
    WHEN 'Nov' then '11/'
    WHEN 'Dec' then '12/'
  else '' end
  +RIGHT(created_at,4) + ' '
  + SUBSTRING(created_at,12,8), 105) AS Created_Date
  , iif([text] like '%PowerBI%',1,0) 
  + iif([text] like '%Tableau%',2,0) 
  + iif([text] like '%Qlikview%',4,0) 
  + iif([text] like '%Microstrategy%',8,0) 
  + iif([text] like '%Pyramid Analytics%',16,0) 
  + iif([text] like '%Business Objects%',32,0) 
  + iif([text] like '%Ibm cognos%',64,0) IdTrend


Model in power BI have next structure:

* It’s important to note that the many to many relationships must have the cross filter address set to “Both” to work fine.

If we see the used measures are very simple, a sum of favorites, followers and friends, and a count of the number of tweets.

The final result of a pair of dashboards as an example, showing the tweets by region and language and another showing the evolution of the last 2 days in “numDia-NumHora” format

I hope this will be useful or at least encourages you to experiment a bit with Python, it is a very useful tool with a very easy learning curve.



Python for SQL Server Specialists Part 1: Introducing Python

Python for SQL Server Specialists Part 1: Introducing Python

Python is one of the most popular programming languages. It is a general purpose high level language. It was created by Guido van Rossum, publicly released in 1991. SQL Server 2016 started to support R, and SQL Server 2017 adds support for Python. Now you can select your preferred language for the data science and even other tasks. R has even more statistical, data mining and machine learning libraries, because it is more widely used in the data science community; however, Python has broader purpose than just data science, and is more readable and might thus be simpler to learn. This is the first of the four articles that introduce Python to SQL Server developers and business intelligence (BI) specialists. This means that the articles are more focused on Python basics and data science, and less on general programming with Python.


Starting with Python

Python is an interpreted language. The philosophy of the language is about the code readability. For example, you use white spaces to delimit code blocks instead of special characters like semicolon or curly brackets. Python supports automatic memory management. It has a dynamic type system. You can use multiple program paradigms in Python, including procedural, object-oriented, and functional programming. You can find Python interpreters for all major operating systems. The reference implementation of Python, namely CPython, is open source software, managed by the non-profit Python Software Foundation. Of course, being open source, also means that there is a reach set of libraries available. Even the standard library is impressive and comprehensive.

In order to start working with Python and R, you need to do some installation. I am not covering general SQL Server and Visual Studio installation, I am just explaining what you need to do to start using Python with SQL Server.


Installing ML Services and VS 2017 for Data Science

You just start SQL Server setup, and then from the Feature Selection page select Database Engine Services, and underneath Machine Learning (ML) Services (In-Database), with Python only, or both languages, R and Python, selected. After that, all you need are client tools, and you can start writing the code. The following figure shows the SQL Server setup Feature Selection page with appropriate features selected.

The next step is installing client tools. Of course, you need SQL Server Management Studio (SSMS). In addition, you might want to install Visual Studio (VS) 2017. You can use either Professional or even free Community edition to develop python (and also R) code.

When installing Visual Studio 2017, be sure to select Python development workload, and then Data science and analytical applications, like the following figure shows. This will install Python language templates, including data science templates, and also R Tools for Visual Studio.

Selecting the Appropriate Python Engine

There you go, you are nearly ready. There is a small trick here. VS 2017 installs also its own Python interpreter. In order to use the scalable, the one installed with SQL Server, the one that enables executing code in the Database Engine context and includes Microsoft scalable libraries, you need to setup an additional Python environment, pointing to the scalable version of the interpreter. The path for this scalable interpreter is, if you installed the default instance of SQL Server, C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\python.exe. You can see how to setup this environment in the following figure.

That’s it. You are ready to start Python programming. Just start a new project and select the Python Application template from the Python folder. You can also explore the Python Machine Learning templates, which include Classifier, Clustering, and Regression projects. If you selected the Python Application template, you should have open the first empty Python script with default name the same as the project name and default extension py, waiting for you to write and interactively execute Python code.

Python Language Basics

Python uses the hash mark for a comment. You can execute Python code in VS 2017 by highlighting the code and simultaneously pressing the Ctrl and Enter keys. You can use either single or double apostrophes for delimiting strings. The first command you will learn is the print() command. Write and execute the following code:

# Hash starts a comment
print("Hello World!")
# This line ignored
print('Printing again.')
print('O"Hara')   # In-line comment

You can observe the code you wrote and the results in the Interactive window, which is by default below the script window, at the bottom left side of the screen.

Python supports all basic mathematical and comparison operators, like you would expect. The following code introduces them. Note that you can combine strings and expressions in a single print() statement.

1 + 2
print("The result of 3 + 20 / 4 is:", 3 + 20 / 4)
10 * 2 - 7
10 % 4
print("Is 7 less or equal to 5?", 7 <= 5)
print("Is 7 greater than 5?", 7 > 5)

The next step is to introduce the variables. Note that Python is case-sensitive. The following code shows how you can assign values to variables and use them for direct computations and as the arguments of a function.

# Integer
a = 2
b = 3
a ** b
# Float
c = 7.0
d = float(5)
print(c, d)

You can define strings inside double or single quotes. This enables you to use single quotes inside a double-quoted string, and vice-versa. You can use the “%?” operator for formatting strings to include variables, where the question mark stands for a single letter denoting the data type of the variable, for example “s” for strings and “d” for numbers. The str.format() method of the string data type allows you to do variable substitutions in a string. Here are some examples.

e = "String 1"
f = 10
print("Let's concatenate string %s and number %d." % (e, f))
four_cb = "String {} {} {} {}"
print(four_cb.format(1, 2, 3, 4))

The result of the previous code is:

Let’s concatenate string String 1 and number 10.
String 1 2 3 4

You can also create multi-lines strings. Just enclose the strings in a pair of three double quotes. You can also use special characters, like tab and line feed. Escape them with a single backslash character plus a letter, for example letter t for a tab and letter n for a line feed.

You can always get interactive help with the help() command. A Python module is a file with default extension .py containing Python definitions and statements. You can import a module into your current script with the import command, and then use the functions and variables defined in that module. Besides modules provided with the installation, you can, of course, develop your own modules, distribute them, and reuse the code.

Using Functions, Branches, and Loops

Like in any serious programming language, you can encapsulate your code inside a function. You define a function with the def name(): command. Functions can use arguments. Functions can also return values. The following code defines two functions, one that has no arguments, and one that has two arguments and returns a value. Note that there is no special ending mark of a function body – the correct indentation tells the Python interpreter where the body of the first function ends, and the definition of the second function starts.

def p_n():
    print("No args...")
def add(a, b):
    return a + b

When you call a function, you can pass parameters as literals, or through variables. You can also do some manipulation with the variables when you pass them as the arguments to a function. The following code shows these possibilities.

# Call with variables and math
a = 10
b = 20
add(a / 5, b / 4)

You can make branches in the flow of your code with the if..elif..else: statement. The following code shows you and example.

a = 10
b = 20
c = 30
if a > b:
    print("a > b")
elif a > c:
    print("a > c")
elif (b < c):
    print("b < c")
    if a < c:
        print("a < c")
    if b in range(10, 30):
        print("b is between a and c")
    print("a is less than b and less than c")

The results of the code are:

b < c
a < c
b is between a and c

The simplest data structure is the list. Python list is a set of comma-separated values (or items) between square brackets. You can use a for or for each loop to iterate over a list. There are many methods supported by a list. For example, you can use the list.append() method to append an element to a list. The following code shows how to create lists and loop over them with the for and foreach loops. Finally, it shows a while loop.

animals = ["cat", "dog", "pig"]
nums = []
for animal in animals:
    print("Animal: ", animal)
for i in range(2, 5):
i = 1
while i <= 10:
    i = i + 1

The last data structure presented in this introduction article is the dictionary. A dictionary is a set of the key – value pairs. You can see an example of a dictionary in the following code.

states = {
    "Oregon": "OR",
    "Florida": "FL",
    "Michigan": "MI"}
for state, abbrev in list(states.items()):
    print("{} is abbreviated {}.".format(state, abbrev))

I mentioned that in Python you can also use object-oriented paradigm. However, going deeper with object-oriented programming with Python is beyond the scope of this and the following articles.


I guess the programming in Python introduced so far was not over exciting. However, you always need to start with basics, and only after you embrace the basics, the exciting part starts. Therefore, don’t miss my next article, when I will introduce the most important data structure for advanced analytics, the data frame structure.

Microsoft, Facebook and Telxius present the most advanced transatlantic subsea cable

Microsoft, Facebook and Telxius present the most advanced transatlantic subsea cable

When talking about technologies, speed and capacity in the transfer of data plays a fundamental role. That is why two of the most important technology companies in the world, such as Microsoft and Facebook, along with Telxius, a telephone infrastructure subsidiary, have inaugurated the most advanced transatlantic cable so far. The result of this ambitious project has been named Marea (Spanish for “Tide”) and its main aim is to satisfy the growing demand for Internet services and cloud computing in recent years. Marea is the most advanced subsea cable that crosses the Atlantic Ocean. The cable connects the United States (Virginia Beach) with Spain (Bilbao).


I use columnstore indexes on SQL 2014. What would be the advantages of upgrading to SQL 2016?

I use columnstore indexes on SQL 2014. What would be the advantages of upgrading to SQL 2016?

SQL Server 2016 really represents a drastic change for the better in regards to general performance. And the proof is that, thanks to the engine changes, this is the first SQL Server version where Microsoft actively provides a widespread increase in performance levels by simply upgrading. But what impact can it have in your daily loads if you use columnstore indexes in SQL server 2014 datawarehouse? There was already an amazing increase in performance levels with SQL Server 2014 and the columnstore indexes. Is it worth upgrading to SQL 2016? (more…)

Integration Services (SSIS) vs. POWERCENTER. ETL tool comparation (I)

Integration Services (SSIS) vs. POWERCENTER. ETL tool comparation (I)

Below is a comparison table between ETL tools: SQL Server Integration Services aka SSIS vs PowerCenter. What tool should I choose? What are the differences? What are the advantages and disadvantages? Do they meet my requirements and particular needs? These are some of the questions that you may have asked yourself at the start of any new project, and we will aim to answer them in this post by providing you with some key facts regarding two very potent tools: SQL Server Integration Services a.k.a. SSIS versus Informatica PowerCenter.


Virtual SANs and the trend towards SDS

Virtual SANs and the trend towards SDS

Lately, we are noticing a trend whereby traditional SANs are increasingly losing ground. This can be due to various reasons, such as forced migrations to the cloud, limited budget environments or even customers who are getting annoyed with their storage providers’ draconian terms and conditions. A common example is to find very costly maintenance agreements, higher-than-market disk size upgrade costs, as well as difficulties to expand/ improve hardware without having to completely replace it. There is also a strong trend to continue going virtual and trading servers like they were commodities. All the above is therefore contributing to an increase in this trend towards SDS, while the VSANs continue to attract new supporters.


Creating an installer for your SSIS-assemblies

Creating an installer for your SSIS-assemblies

When you develop SSIS packages with custom functionality, either self signed assemblies or script components/script transformations that implement assemblies you got from codeplex, it sometimes become a nuisance to get them into GAC on production machines. What often happens is that when you try to run the package in earnest via the SQL Agent, it fails to find the assembly.

In my case I’ll be adding to my SSIS package dataflow in order to stream data into encrypted zip files.
If you would like to try out a cool way to deal with zip files in dotnet, I recommend it: DotNetZip – Zip and Unzip in C#, VB, any .NET language

So back to trying to run custom components and assemblies in SSIS packages. I bet you have seen something like this at some point in your career:

Error: System.IO.FileNotFoundException: Could not load file or assembly ‘Ionic.Zip, Version=, Culture=neutral, PublicKeyToken=edbe51ad942a3f5c’ or one of its dependencies. The system cannot find the file specified.
File name: ‘Ionic.Zip, Version=, Culture=neutral, PublicKeyToken=edbe51ad942a3f5c’

This tells us that our dll can’t be found when the SISS execution reaches the dataflow that uses it.

I wanted to find a solution that helps me deal with this in a good manner that is pretty future proof.
The first thing that came to mind is doing what I always recommend: use powershell. The problem is that sometimes you don’t have powershell, or an older version or aren’t allowed to run what scripts you want in production.

The next thing that was suggested to me was to copy gacutil.exe over to the production environment and do my deployment that way. That is not recommended though as it violates the license of Visual Studio SDK.

I tried out a couple of different plug ins for Visual Studio to be able to create an installer and the one that worked with the least hassle for me was WiX.

These are the steps I took to get my first working installer that added dll-files to gac:

  • Make sure that your assembly is signed. If you are unsure about whether your own assembly is signed, then it probably isn’t. Since the example is of someone else’s assembly then it probably is signed.
  • Make sure that your installation of Visual Studio is compatible with WiX (I use Visual Studio 2015 Developer)
  • Download the latest stable release of the WiX toolset that works with your version of VS.
  • Put the assembly (dll file) in any folder on your computer and have the corresponding path at hand. It doesn’t matter where the file is as long as you can access it when building the project.
  • Create a new VS project for WiX as shown in this picture: create wix project in VS
  • Put an XML fragment into Project.wxs that points to your file:the important stuff is now in the xml file
  • Set the project to release mode, build it, and go to the bin file to take a look.
  • The three files are all needed for your installer, so zip them together and you are basically done: set build to release, open the bin folder and look for three files - add them to a zip and you got your installer
  • Move your zip file with the three files to the machine you need the DLL on and run the installer. If you want a prettier installer with more options and so on, put more effort into the WiX project!

Here is the excerpt of the xml that worked for me, mileage might vary so look up the documentation and the comment sections of other articles you find.

  <ComponentGroup Id="ProductComponents" Directory="INSTALLFOLDER">
<Component Id="Ionic.Zip">
  <File Name="Ionic.Zip.dll" KeyPath="yes" Source="C:\Import\Ionic.Zip.dll" Assembly=".net" AssemblyManifest="Ionic.Zip.dll"/>


That time I used ON UPDATE CASCADE in production


While working with a invoicing system back in 2013 I was asked by my customer if it would be possible to enable them to have some more special handling of their biggest client. The way their system was set up, one client accounted for over half of the accounts receivables. The client and my customer were also very close in their operations. What they needed from the system was the ability to split the client into four legal entities while remaining all the custom logic and prices. They asked me if a quick-fix was possible.

“A quick-fix solution is as fast, as easy and as well-documented as its explanation from the customer” – me just now.

The system had a procedure which would work out all the numbers for each client and for this particular one handle subsidiaries together on all invoices. With this reorganization coming, all units would be connected to one of four new client subsidiaries and each subsidiary would instead receive its own invoices for its units.


What the options for foreign key constraints do

You can add two clauses (ON UPDATE and ON DELETE) when creating a foreign key constraint. Both can have one of four options.

NO ACTION – this is standard, an error is raised and the execution halts if the new value would violate referential integrity.
CASCADE – data integrity is conserved by updating or deleting records in the related tables in the same transaction. In other words completely hazardous when dealing with deletes. Use with caution.
SET NULL – If the relation between records is broken, the integrity of the referencing record is upheld by setting the value of the foreign key column to NULL.
SET DEFAULT – Same as above except the value becomes the default for the column instead of NULL, there has to of course be a default value for the column for this to work.

From rough draft to the final solution

On the back of an envelope I wrote down these nine steps that would re-route the client logic and make sure everything would still sum up nicely:

  1. Begin a transaction and run the stored procedures to create the invoices as normal
  2. For each unit represented for the four new subsidiaries: create a new invoice header with a temporary and made up sequence starting from -1 and going backwards to avoid collisions.
  3. For each invoice line: map invoiceId by subsidiary (-1 through -4)
  4. Drop the now empty and unused invoice headers created in step 1.
  5. Drop the foreign key constraint between headers and lines
  6. Re-create this foreign key but set it with action ON UPDATE CASCADE
  7. Re-run the stored procedure that fixes invoiceIds for headers, this procedure is a sub-procedure when creating invoices. When the headers are updated, the lines inherit the information through our foreign key constraint.
  8. Drop and re-create the foreign key once again as it existed before the operation.
  9. Commit transaction. Done! No lines removed, all accounted for and all connected nicely.