Integer64 still remains numeric upon opening with read_fst #267

Open
markdanese opened this issue Apr 30, 2022 · 9 comments

markdanese commented Apr 30, 2022

I apologize in advance for not having a reproducible example but I don't really know how to do it in this situation. Will amend this if I figure something out.
I use write_fst() to save a file that uses integer64 for the person id. When I open the file using read_fst(), the column comes in as numeric with an 'integer64' label (which I have never seen before). When I look at the data, it is all in scientific notation. See below for the person_id column of the cohort object.

image

However, when I simply run bit64::is.integer64(cohort$person_id) (which returns TRUE) its class immediately becomes integer64 and all is fine. There is no resaving or any changes to the actual files other than this single line of code. See below for the status upon refreshing the view in RStudio:

image

So it seems that when restoring an integer64 file, something isn't quite completing the process to make the variable an integer64. This is all done in a fresh session.

If this should be directed elsewhere, please let me know.

> sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.3

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] fstcore_0.9.12    fst_0.9.8         data.table_1.14.2

loaded via a namespace (and not attached):
[1] bit_4.0.4        compiler_4.1.3   parallel_4.1.3   tools_4.1.3      Rcpp_1.0.8.3     bit64_4.0.5      cellranger_1.1.0 readxl_1.4.0 
@MarcusKlik
Collaborator

Hi @markdanese, thanks for reporting your issue!

I think this is just the way that bit64 labels a long integer value. R has no native long (8-byte) integer type, but a double is 8 bytes long. So bit64 stores a long integer vector in memory as a double vector and overrides the calculations done on that vector. A round trip to and from disk does not alter these labels or the underlying data structure:

library(bit64)

# write and read again
read_write_cycle <- function(x) {
  tmp_file <- tempfile(fileext = ".fst")
  
  x |>
    fst::write_fst(tmp_file)
  
  fst::read_fst(tmp_file)
}

# sample table with very large integers
x <- data.frame(
  LongInt = bit64::as.integer64(sample(1e10:(1e10 + 100), 1000, replace = TRUE))
)

typeof(x$LongInt)
#> [1] "double"

str(x$LongInt)
#> integer64 [1:1000] 10000000047 10000000005 10000000008 10000000083 10000000077 10000000054 10000000026 10000000039 ...

y <- read_write_cycle(x)

typeof(y$LongInt)
#> [1] "double"

str(y$LongInt)
#> integer64 [1:1000] 10000000047 10000000005 10000000008 10000000083 10000000077 10000000054 10000000026 10000000039 ...

I hope that helps!

@MarcusKlik MarcusKlik self-assigned this Nov 16, 2022
@MarcusKlik MarcusKlik added this to the fst v0.9.10 milestone Nov 16, 2022
@markdanese
Author

markdanese commented Nov 16, 2022

@MarcusKlik It is more than just naming because it leads to my code failing. I have to load the bit64 package in order for my code to work. This never used to be the case (I have used this script for several years). But it is certainly my fault for not putting together a reproducible example. And it may be an issue with the bit64 package (or some strange interaction between it and fst). Or perhaps even something related to my M1 Mac since it started after I got a new computer. I will try to create an example and see if that helps.

@OfekShilon

This looks to me like an RStudio issue. If you can't reproduce the creation of the file, can you share a file that demonstrates it?

@MarcusKlik
Collaborator

Hi @markdanese, I can't see anything funny about the returned integer64 value in the example above. The attributes are also identical:

# write and read again
read_write_cycle <- function(x) {
  tmp_file <- tempfile(fileext = ".fst")
  
  x |>
    fst::write_fst(tmp_file)
  
  fst::read_fst(tmp_file)
}

# sample table with very large integers
x <- data.frame(
  LongInt = bit64::as.integer64(sample(1e10:(1e10 + 100), 1000, replace = TRUE))
)

y <- read_write_cycle(x)

attributes(x$LongInt)
#> $class
#> [1] "integer64"
attributes(y$LongInt)
#> $class
#> [1] "integer64"

Perhaps you can still add a reproducible example showing the exact problem? Thanks.

@markdanese
Author

My apologies for not getting back to this. As I said above, it may not be an fst problem. But I wanted others to know about it, and how to work around it, in case someone else ran into this problem. Hopefully this is more helpful. Please let me know if you need anything else.

# Fresh R session

library(fst)

x <- data.frame(person_id = c(1346900019, 1348000031), age = c(80, 75))

write_fst(x, "./data/raw/test_fst2.fst")

x <- read_fst("./data/raw/test_fst.fst")

x
# look at patient ids 

library(bit64)
x
# look at patient ids again

Here is the output on my computer from the script above:

> # Fresh R session
> 
> library(fst)
> 
> x <- data.frame(person_id = c(1346900019, 1348000031), age = c(80, 75))
> 
> write_fst(x, "./data/raw/test_fst2.fst")
> 
> x <- read_fst("./data/raw/test_fst.fst")
> 
> x
      person_id age
1 6.654570e-315  80
2 6.660005e-315  71
> # look at patient ids 
> 
> library(bit64)
Loading required package: bit

Attaching package: ‘bit’

The following object is masked from ‘package:base’:

    xor

Attaching package bit64
package:bit64 (c) 2011-2017 Jens Oehlschlaegel
creators: integer64 runif64 seq :
coercion: as.integer64 as.vector as.logical as.integer as.double as.character as.bitstring
logical operator: ! & | xor != == < <= >= >
arithmetic operator: + - * / %/% %% ^
math: sign abs sqrt log log2 log10
math: floor ceiling trunc round
querying: is.integer64 is.vector [is.atomic} [length] format print str
values: is.na is.nan is.finite is.infinite
aggregation: any all min max range sum prod
cumulation: diff cummin cummax cumsum cumprod
access: length<- [ [<- [[ [[<-
combine: c rep cbind rbind as.data.frame
WARNING don't use as subscripts
WARNING semantics differ from integer
for more help type ?bit64

Attaching package: ‘bit64’

The following objects are masked from ‘package:base’:

    :, %in%, is.double, match, order, rank

> x
   person_id age
1 1346900019  80
2 1348000031  71
> # look at patient ids again
> sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.9

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bit64_4.0.5    bit_4.0.5      fstcore_0.9.14 fst_0.9.8     

loaded via a namespace (and not attached):
[1] compiler_4.1.3    parallel_4.1.3    tools_4.1.3       rstudioapi_0.14   Rcpp_1.0.10       data.table_1.14.8

@markdanese
Author

I just updated R to 4.3.2, along with all of my packages, and the results are the same. Just FYI.

R version 4.3.2 (2023-10-31) -- "Eye Holes"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: aarch64-apple-darwin20 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> # Load packages ---------------------------------------------------------------------
> 
> # Fresh R session
> 
> library(fst)
fst package v0.9.8
> 
> x <- data.frame(person_id = c(1346900019, 1348000031), age = c(80, 75))
> 
> write_fst(x, "./data/raw/test_fst2.fst")
fstcore package v0.9.18
(OpenMP detected, using 10 threads)
> 
> x <- read_fst("./data/raw/test_fst.fst")
> 
> x
      person_id age
1 6.654570e-315  80
2 6.660005e-315  71
> # look at patient ids 
> 
> library(bit64)
Loading required package: bit

Attaching package: ‘bit’

The following object is masked from ‘package:base’:

    xor

Attaching package bit64
package:bit64 (c) 2011-2017 Jens Oehlschlaegel
creators: integer64 runif64 seq :
coercion: as.integer64 as.vector as.logical as.integer as.double as.character as.bitstring
logical operator: ! & | xor != == < <= >= >
arithmetic operator: + - * / %/% %% ^
math: sign abs sqrt log log2 log10
math: floor ceiling trunc round
querying: is.integer64 is.vector [is.atomic} [length] format print str
values: is.na is.nan is.finite is.infinite
aggregation: any all min max range sum prod
cumulation: diff cummin cummax cumsum cumprod
access: length<- [ [<- [[ [[<-
combine: c rep cbind rbind as.data.frame
WARNING don't use as subscripts
WARNING semantics differ from integer
for more help type ?bit64

Attaching package: ‘bit64’

The following object is masked from ‘package:utils’:

    hashtab

The following objects are masked from ‘package:base’:

    :, %in%, is.double, match, order, rank

> x
   person_id age
1 1346900019  80
2 1348000031  71
> # look at patient ids again
> 
> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.9

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bit64_4.0.5    bit_4.0.5      fstcore_0.9.18 fst_0.9.8     

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       cli_3.6.1         rlang_1.1.2       stringi_1.8.2     forcats_1.0.0     purrr_1.0.2       generics_0.1.3   
 [8] data.table_1.14.8 glue_1.6.2        colorspace_2.1-0  backports_1.4.1   readxl_1.4.3      scales_1.3.0      fansi_1.0.5      
[15] cellranger_1.1.0  munsell_0.5.0     tibble_3.2.1      lifecycle_1.0.4   stringr_1.5.1     compiler_4.3.2    dplyr_1.1.4      
[22] Rcpp_1.0.11       pkgconfig_2.0.3   tidyr_1.3.0       R6_2.5.1          tidyselect_1.2.0  utf8_1.2.4        parallel_4.3.2   
[29] pillar_1.9.0      magrittr_2.0.3    tools_4.3.2       broom_1.0.5      

@MarcusKlik
Collaborator

MarcusKlik commented Dec 6, 2023

Hi @markdanese, I think the issue is with how R interprets your original data:

library(bit64)

x <- data.frame(
  person_id = c(1346900019, 1348000031),
  age = c(80, 75)
)

# column types
typeof(x$person_id)
#> [1] "double"
typeof(x$age)
#> [1] "double"

# registered as integer64
is.integer64(x$person_id)
#> [1] FALSE

In this case, both of your columns are stored as doubles. If you want integer64 and integer columns, you could use:

library(bit64)

x <- data.frame(
  person_id = as.integer64(c(1346900019, 1348000031)),
  age = c(80L, 75L)
)

# column types
typeof(x$person_id)
#> [1] "double"
typeof(x$age)
#> [1] "integer"

# registered as integer64
is.integer64(x$person_id)
#> [1] TRUE

You can see that although the integer64 column is registered as such, it is in fact just a double in memory. That works because a double is 8 bytes long, the same as an integer64.
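
To make the "double in memory" point concrete, here is a minimal base R sketch that takes the 8 bytes of a 64-bit integer and reads them back as a double. The helper name `int64_bits_as_double` is made up for illustration, and the sketch assumes the value fits in 32 bits so that the upper 4 bytes are zero. It does not use bit64 or fst at all.

```r
# Hypothetical helper: reinterpret the 8 bytes of a 64-bit integer as a double.
# Assumption: the value fits in 32 bits, so the upper 4 bytes are all zero.
int64_bits_as_double <- function(n) {
  # lower 4 bytes hold the value, upper 4 bytes are zero (little-endian layout)
  bytes <- writeBin(c(as.integer(n), 0L), raw(), endian = "little")
  # read the same 8 bytes back as one double
  readBin(bytes, what = "double", n = 1L, size = 8L, endian = "little")
}

ids <- c(1346900019, 1348000031)

# what the ids look like when their integer bits are treated as plain doubles
vapply(ids, int64_bits_as_double, numeric(1))
```

The results are tiny denormal numbers of the order of 1e-315, which is the kind of value that appears when an integer64 column is printed without bit64's methods in effect.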

After a write/read cycle, the column types are preserved:

# write read cycle
write_fst(x, "test_fst.fst")
y <- read_fst("test_fst.fst")

# column types
typeof(y$person_id)
#> [1] "double"
typeof(y$age)
#> [1] "integer"

# registered as integer64
is.integer64(y$person_id)
#> [1] TRUE

print(y$person_id)
#> integer64
#> [1] 1346900019 1348000031

Does that answer your question?

@markdanese
Author

Please feel free to close this issue if it isn't related to fst. I really don't understand how all of this works, and I don't follow your explanation. It is still a problem for me. I can't open my fst files and use them without the bit64 library call. As you can see above, if I don't call bit64 explicitly, I get person identifier numbers that look like "6.654570e-315" which are not usable. I don't "want" to use bit64. It is the only thing that seems to make my data usable when reading in an fst file.
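
A possible workaround, sketched in base R: check after reading whether any column carries the "integer64" class, and if so load the bit64 namespace without attaching it. This assumes read_fst() keeps the class attribute on the column, as the attributes() output earlier in the thread suggests. `integer64_columns` is a hypothetical helper, not part of fst or bit64. Loading the namespace is expected to register bit64's S3 methods, which may be why the single call to bit64::is.integer64() fixed the display in the original report.

```r
# Hypothetical helper (not part of fst or bit64): names of the columns whose
# class attribute includes "integer64".
integer64_columns <- function(df) {
  names(df)[vapply(df, inherits, logical(1), what = "integer64")]
}

# Stand-in for a freshly read table: tag a double column with the class by hand.
# Assumption: this only imitates the class attribute, not real integer64 bits.
x <- data.frame(person_id = c(1346900019, 1348000031), age = c(80, 75))
class(x$person_id) <- "integer64"

cols <- integer64_columns(x)

# Load (without attaching) bit64 when such columns exist; warn if it is missing.
if (length(cols) > 0 && !requireNamespace("bit64", quietly = TRUE)) {
  warning("integer64 columns present but bit64 is not installed: ",
          paste(cols, collapse = ", "))
}
```

With a real file, `x` would be the result of read_fst() and the check would run right after the read, before any printing or filtering.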

@markdanese
Author

Just in case my example was misleading, this shows the problem in selecting a record immediately after opening the fst file.

> # Fresh R session
> 
> library(fst)
fst package v0.9.8
> 
> x <- data.frame(person_id = c(1346900019, 1348000031), age = c(80, 75))
> 
> write_fst(x, "./data/raw/test_fst2.fst")
fstcore package v0.9.18
(OpenMP detected, using 10 threads)
> 
> x <- read_fst("./data/raw/test_fst.fst")
> 
> x[x$person_id == "1346900019",]
[1] person_id age      
<0 rows> (or 0-length row.names)
> # can't find record 
> 
> library(bit64)
Loading required package: bit

Attaching package: ‘bit’

The following object is masked from ‘package:base’:

    xor

Attaching package bit64
package:bit64 (c) 2011-2017 Jens Oehlschlaegel
creators: integer64 runif64 seq :
coercion: as.integer64 as.vector as.logical as.integer as.double as.character as.bitstring
logical operator: ! & | xor != == < <= >= >
arithmetic operator: + - * / %/% %% ^
math: sign abs sqrt log log2 log10
math: floor ceiling trunc round
querying: is.integer64 is.vector [is.atomic} [length] format print str
values: is.na is.nan is.finite is.infinite
aggregation: any all min max range sum prod
cumulation: diff cummin cummax cumsum cumprod
access: length<- [ [<- [[ [[<-
combine: c rep cbind rbind as.data.frame
WARNING don't use as subscripts
WARNING semantics differ from integer
for more help type ?bit64

Attaching package: ‘bit64’

The following object is masked from ‘package:utils’:

    hashtab

The following objects are masked from ‘package:base’:

    :, %in%, is.double, match, order, rank

> x[x$person_id == "1346900019",]
   person_id age
1 1346900019  80
> 
> # record correctly retrieved after bit64 library loaded explicitly
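
One detail in the transcripts above may be worth checking: the script writes `test_fst2.fst` but reads `test_fst.fst`, and the second row prints an age of 71 rather than 75. That suggests the file being read could be an older one whose person_id column was stored as integer64, rather than the data frame created in the script. The sketch below writes and reads the same temporary file so that the result reflects only the data created in the session. `roundtrip_same_file` is a hypothetical helper; only `fst::write_fst()` and `fst::read_fst()` come from the package.

```r
# Hypothetical helper: write a data frame to a temporary fst file and read
# that same file back, so no older file on disk can leak into the result.
roundtrip_same_file <- function(df) {
  path <- tempfile(fileext = ".fst")
  on.exit(unlink(path))        # remove the temporary file afterwards
  fst::write_fst(df, path)
  fst::read_fst(path)
}

x <- data.frame(person_id = c(1346900019, 1348000031), age = c(80, 75))

if (requireNamespace("fst", quietly = TRUE)) {
  y <- roundtrip_same_file(x)
  # shows whether an "integer64" class was attached to the column on read
  print(class(y$person_id))
  print(y)
}
```

If the column comes back as plain numeric here, the scientific-notation values most likely originate from the older file, not from this round trip.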


3 participants