Data Profiling

Hello everyone. Here I have a script (written in T-SQL) to perform a data profile across a range of columns as specified by the user.

I hope the code comments I have provided are simple enough to follow. If not, send me an email via the contacts page and I would be happy to help.

Often, I see internet postings for how to perform searches across all tables and columns, so I hope this comes in handy for everyone. You should be able to easily adapt this script to your flavor of SQL.

This script is handy when you need to find such things as the count of NULL markers in all columns across an entire database, or how many string fields contain the literal “foo”. Hope you get the idea. If your database is huge, be careful and limit your search.

Again, read through the comments and adjust the script as needed to cater to your search needs.

Hit the copy button in the upper right of the code snippet to copy the code.

Scott Peters 10/03/2022 The following script performs a data profile on specified columns. You will need to modify the tables #MySQLStatement and #MyTableColumns according to your needs. See comments in my code. 1) #MySQLStatement contains the SQL statement to run for each column specified. 2) #MyTableColumns contains the columns in which to run the SQL statement on. 3) The cursor mycursor runs each SQL statement defined in #MyTableColumns. ----------------------------------------------------------------------*/ SET NOCOUNT ON DROP TABLE IF EXISTS #MyTableColumns; DROP TABLE IF EXISTS #MySQLStatement; GO CREATE TABLE #MySQLStatement ( LineNumber INTEGER PRIMARY KEY, SyntaxLine VARCHAR(5000) NOT NULL ); GO /*------------------------------------------------------------------ Modify the below query as needed. You may be looking for NULL markers or empty strings, or you may be looking for certain values in all varchar columns.... This example calculates record count (excluding NULL), MIN AND MAX values. vSchemaName, vTableName, vColumnName, and vRowNumber are replaced via a REPLACE function below. Do not modify these strings. --------------------------------------------------------------------*/ INSERT INTO #MySQLStatement VALUES (1,'UPDATE #MyTableColumns') ,(2,'SET') ,(3,'RecordCount = (SELECT COUNT(vColumnName) FROM vSchemaName.vTableName)') ,(4,',MinValue = (SELECT MIN(vColumnName) FROM vSchemaName.vTableName)') ,(5,',MaxValue = (SELECT MAX(vColumnName) FROM vSchemaName.vTableName)') ,(6,'WHERE RowNumber = vRowNumber') GO ----------------------------------------------------------------- ----------------------------------------------------------------- DECLARE @vSQLStatement NVARCHAR(1000) = (SELECT STRING_AGG(SyntaxLine,' ') WITHIN GROUP (ORDER BY LineNumber ASC) AS SyntaxFull FROM #MySQLStatement); --Determine which columns to perform your search on. WITH cte_RowNumber AS ( SELECT ROW_NUMBER() OVER (ORDER BY t.[Name], c.[Name]) AS RowNumber, @@SERVERNAME AS ServerName, s.[Name] AS SchemaName, t.[Name] AS TableName, c.[Name] AS ColumnName, ty.[Name] AS DataType, --Add any columns as needed below. --This example calculates record count (excluding NULL), MIN AND MAX values. --These columns will need to match what is set in #MySQLStatement. CAST(NULL AS BIGINT) AS RecordCount, CAST(NULL AS VARCHAR(5000)) AS MinValue, CAST(NULL AS VARCHAR(5000)) AS MaxValue FROM [sys].[Schemas] s LEFT OUTER JOIN [sys].[Tables] t ON s.[Schema_id] = t.[Schema_id] INNER JOIN [sys].[Columns] c ON t.[Object_id] = c.[Object_id] INNER JOIN [sys].[Types] ty ON ty.User_Type_ID = c.User_Type_ID WHERE 1=1 --AND /*-------------------------------------------------------- Modify below as needed to limit your search. If you are searching for specific string, make sure you set the ty.Name to VARCHAR, NVARCHAR, CHAR, etc... ---------------------------------------------------------*/ --s.[Name] = @vSchemaName --AND --t.[Name] IN ('','','') --Table names --AND --ty.[Name] IN ('') --Data types ---------------------------- ) SELECT *, --Replace Column, Table, Schema and RowNumber REPLACE(REPLACE(REPLACE(REPLACE(@vSQLStatement,'vColumnName',ColumnName),'vTableName',TableName),'vSchemaName',SchemaName),'vRowNumber',RowNumber) AS SQLStatement INTO #MyTableColumns FROM cte_RowNumber ORDER BY 1,2,3,4,5; GO ----------------------------------------------------------------- ----------------------------------------------------------------- --Loop through #MyTableColumns and run each SQL statement dynamically. DECLARE @vSQLStatement VARCHAR(8000); DECLARE mycursor CURSOR FOR (SELECT SQLStatement FROM #MyTableColumns); OPEN mycursor; FETCH NEXT FROM mycursor INTO @vSQLStatement; WHILE @@FETCH_STATUS = 0 BEGIN EXEC (@vSQLStatement); --Run each SQL statement via dynamic SQL. FETCH NEXT FROM mycursor INTO @vSQLStatement; END; CLOSE mycursor; DEALLOCATE mycursor; ----------------------------------------------------------------- ----------------------------------------------------------------- SELECT * FROM #MyTableColumns ORDER BY 1,2,3;

Happy coding!

Leave a Reply