Wednesday, April 29, 2015

Impact of NULL values on where-clause/group-by-clause in Hive queries

Following is the check to verified that NULL values do not impact GROUP BY but it DOES IMPACT where clause.

Query: select count(*) from table1 where (field1 is NULL) AND dth >= '2014-12-01-00' AND dth <= '2014-12-01-23';
Result: 24

- select count(*) from table1 where (field1 != 'Value1') AND dth >= '2014-12-01-00' AND dth <= '2014-12-01-23';
Result: 1853517 (The correct count is 1853541 which gets reported incorrectly here because count of NULL values is ignored.)

- select count(*) from table1 where (field1 = 'Value1') AND dth >= '2014-12-01-00' AND dth <= '2014-12-01-23'
Result: 142570

select field1, count(*) from table1 where dth >= '2014-12-01-00' AND dth <= '2014-12-01-23' GROUP by field1;
Result:

field1_c1
ValidateAuthorization
1196966
GetUserProfile
470557
Authorize
142570
SignIn
86351
26101
Register
14726
GetUserEntitlement
12056
UpdateUser
11529
GetChildApprovalStatus
9813
UserProfile
9362
LogOut
4974
LogOn
3763
GetExtendedProfile
3060
GetAvaiableTrials
1934
LinkAccounts
1011
UpdateProfile
387
ChangePassword
337
SignUpChild
206
UploadProfilePicture
173
CoppaSentinelValidate
80
GetChildrenForModerator
38
GetModeratorsForChild
37
NULL
24
UpdateUserType
16
LdapSignIn
14
ModeratorApproval
13
UpdateChildModeratorAccountStatus
12
SignUp
1